Overview
Welcome to Daft!
Daft is a high-performance data engine providing simple and reliable data processing for any modality and scale, from local to petabyte-scale distributed workloads. The core engine is written in Rust and exposes both SQL and Python DataFrame interfaces as first-class citizens.
Why Daft?
Unified multimodal data processing
Break down data silos with a single framework that handles structured tables, unstructured text, and rich media like images—all with the same intuitive API. Why juggle multiple tools when one can do it all?
Python-native, no JVM required
Built for modern AI/ML workflows with Python at its core and Rust under the hood. Skip the JVM complexity, version conflicts, and memory tuning to achieve 20x faster start times—get the performance without the Java tax.
Seamless scaling, from laptop to cluster
Start local, scale global—without changing a line of code. Daft's Rust-powered engine delivers blazing performance on a single machine and effortlessly extends to distributed clusters when you need more horsepower.
Key Features
- Native Multimodal Processing: Process any data type—from structured tables to unstructured text and rich media—with native support for images, embeddings, and tensors in a single, unified framework.
- Rust-Powered Performance: Experience breakthrough speed with our Rust foundation delivering vectorized execution and non-blocking I/O that processes the same queries with 5x less memory while consistently outperforming industry standards by an order of magnitude.
- Seamless ML Ecosystem Integration: Slot directly into your existing ML workflows with zero friction—whether you're using PyTorch, NumPy, Pandas, or HuggingFace models, Daft works where you work.
- Universal Data Connectivity: Access data anywhere it lives—cloud storage (S3, Azure, GCS), modern table formats (Apache Iceberg, Delta Lake, Apache Hudi), or enterprise catalogs (Unity, AWS Glue)—all with zero configuration.
- Push your code to your data: Bring your Python functions directly to your data with zero-copy UDFs powered by Apache Arrow, eliminating data movement overhead and accelerating processing speeds (see the sketch after this list).
- Out-of-the-box reliability: Deploy with confidence—intelligent memory management prevents OOM errors while sensible defaults eliminate configuration headaches, letting you focus on results, not infrastructure.
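To make the connectivity and UDF points concrete, here is a minimal sketch using Daft's Python DataFrame API. The bucket, path, and `message` column are hypothetical, and exact call signatures may vary slightly between Daft versions:

```python
import daft

# Read Parquet files straight from object storage (hypothetical bucket/path;
# cloud credentials are resolved from the environment).
df = daft.read_parquet("s3://my-bucket/events/*.parquet")

# A Python UDF applied to a column. The decorator declares the return dtype;
# the function receives column data and returns one value per row.
@daft.udf(return_dtype=daft.DataType.int64())
def text_length(texts):
    return [len(t) for t in texts.to_pylist()]

df = df.with_column("message_length", text_length(daft.col("message")))
df.show()
```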
Learning Daft
This user guide will help you master Daft for all of your data processing needs.
Looking to get started with Daft ASAP?
The Daft User Guide is a useful resource for taking deeper dives into specific Daft concepts, but if you are ready to jump straight into code, take a look at these resources:
- Quickstart: Itching to run some Daft code? Hit the ground running with our 10-minute quickstart notebook.
- API Documentation: Searchable documentation and reference material for Daft's public API.
Get Started
- Install Daft from your terminal and discover more advanced installation options.
- Install Daft, create your first DataFrame, and get started with common DataFrame operations (a minimal example follows this list).
- Understand the different components of Daft under the hood.
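As a minimal first example (a sketch assuming Daft is already installed; the column names and values are made up), you can build a DataFrame from an in-memory dictionary and chain a few common operations:

```python
import daft

# Create a DataFrame from an in-memory dictionary of columns.
df = daft.from_pydict({
    "name": ["apple", "banana", "cherry"],
    "quantity": [3, 5, 2],
    "price": [0.50, 0.25, 1.00],
})

# Add a computed column, filter rows, and sort. Daft is lazy: work happens
# when the result is collected or shown.
df = (
    df.with_column("total", daft.col("quantity") * daft.col("price"))
    .where(daft.col("quantity") > 2)
    .sort("total", desc=True)
)

df.show()
```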
Daft in Depth
- Learn how to perform core DataFrame operations in Daft, including selection, filtering, joining, and sorting (see the sketch after this list).
- Daft expressions enable computations on DataFrame columns using Python or SQL for various operations.
- How to use Daft to read data from diverse sources like files, databases, and URLs.
- How to use Daft to write DataFrames to files or other destinations.
- Daft DataTypes define the types of data in a DataFrame, from simple primitives to complex structures.
- Daft supports SQL for constructing query plans and expressions, while integrating with Python expressions.
- Daft supports aggregations and grouping across entire DataFrames and within grouped subsets of data.
- Daft's window functions allow you to perform calculations across a set of rows related to the current row.
- Daft allows you to define custom UDFs to process data at scale with flexibility in input and output.
- Daft is built to work with multimodal data types, including URLs and images.
- Daft's native support for Ray enables you to run distributed DataFrame workloads at scale.
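The sketch below ties a few of these topics together: expressions, grouped aggregations, and multimodal URL/image expressions. It assumes a recent Daft version; the data and URLs are invented, and nothing is downloaded until the DataFrame is actually collected:

```python
import daft
from daft import col

df = daft.from_pydict({
    "category": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
    "image_url": [
        "https://example.com/1.png",
        "https://example.com/2.png",
        "https://example.com/3.png",
        "https://example.com/4.png",
    ],
})

# Expressions describe per-column computations and compose with grouping:
# sum `value` within each `category`.
totals = df.groupby("category").agg(col("value").sum())
totals.show()

# Multimodal expressions: fetch the bytes behind each URL and decode them as
# images. Execution is lazy, so no downloads happen until collect()/show().
images = df.with_column("image", col("image_url").url.download().image.decode())

# To scale out, the same code can run distributed (e.g. on Ray) by switching
# the runner, for instance with daft.context.set_runner_ray(), before building
# the DataFrame; the DataFrame code itself does not change.
```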
More Resources
Contribute to Daft
If you're interested in hands-on learning about Daft internals and would like to contribute to our project, join us on GitHub 🚀
Take a look at the many issues tagged with "good first issue" in our repo. If any of them interest you, feel free to chime in on the issue itself, or join our Distributed Data Slack Community and send us a message in #daft-dev. Daft team members will be happy to assign an issue to you and provide guidance if needed!