# Roadmap
Last updated: May 2025
What is in store for Daft in 2025? This roadmap outlines the big picture of what the Daft team plans to work on in the coming year, as well as some of the features to expect along the way.
Please note that items on this roadmap are subject to change at any time. If there are features you would like to see implemented, we highly welcome and encourage open source contributions! Our team is happy to provide guidance, help scope the work, and review PRs. Feel free to open an issue or PR on GitHub, or join our Daft Slack Community.
## Multimodality
- Support generic data source and data sink interfaces that can be implemented outside of Daft
- Enhanced support for JSON with a VARIANT data type and JSON_TABLE
- More built-in and optimized expressions for multimodal and nested datatypes
## AI
- Higher level abstractions for building AI applications on top of Daft
- Better observability and metrics for AI functions
  - Tokens per second
  - Estimated API costs
- Better primitives for AI workloads (discussion #3547)
  - Async UDFs
  - Streaming UDFs
  - Native LLM inference functions with Pydantic integration (discussion #2774)
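To see why async UDFs matter for AI workloads, consider that a UDF calling a model API spends most of its time waiting on the network. The sketch below is a plain-Python illustration of that motivation, not Daft's API: `fake_llm_call` is a hypothetical stand-in for a real model call, and `asyncio.gather` runs all calls concurrently so total wall time is roughly one round trip instead of one per row.

```python
import asyncio

# Hypothetical stand-in for a network-bound model API call.
async def fake_llm_call(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return prompt.upper()

async def run_batch(prompts: list[str]) -> list[str]:
    # All calls are in flight at once, so latency is amortized
    # across the whole batch rather than paid per row.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

results = asyncio.run(run_batch(["a", "b", "c"]))
print(results)  # ['A', 'B', 'C']
```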
## Performance & Scalability
- Incorporate our local streaming execution engine (Swordfish) into the distributed Ray runner
- Handle Map-only workloads at any scale factor (100TB+)
- Handle 10TB+ Shuffle workloads
- More powerful cost-based optimizer, implementing advanced optimizations
  - Improve the join ordering algorithm to be dynamic-programming based
  - Semi-join reduction
  - Common subquery elimination
- To complement our blazing fast S3 readers, we aim to build the fastest S3 writes in the wild west
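For readers unfamiliar with semi-join reduction: it prunes the large side of a join down to only the rows whose keys appear on the (typically small, filtered) other side before the join executes, cutting shuffle and join cost. The following is a minimal plain-Python illustration of the idea, not Daft's implementation; the table names and values are made up.

```python
fact = [  # (user_id, amount) -- the large side
    (1, 10), (2, 20), (3, 30), (4, 40), (5, 50),
]
dim = [  # (user_id, country) -- the small side, after a selective filter
    (2, "US"), (4, "CA"),
]

# Semi-join reduction: build the key set from the small side ...
keys = {user_id for user_id, _ in dim}
# ... and prune the large side before the full join runs.
reduced_fact = [row for row in fact if row[0] in keys]

# The join now touches 2 fact rows instead of 5.
joined = [
    (f_id, amount, country)
    for f_id, amount in reduced_fact
    for d_id, country in dim
    if f_id == d_id
]
print(joined)  # [(2, 20, 'US'), (4, 40, 'CA')]
```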
## Out-of-the-box Experience
- Continue expanding feature set and compatibility of Daft’s PySpark connector so that running Spark workloads on Daft is a simple plug-and-play (issue #3581)
  - Ordinal column references (issue #4270)
  - Window function support (issue #2108)
- Improve catalog and table integrations
  - Support for Iceberg deletion vectors and upserts (see roadmap for Iceberg)
  - Better Unity Catalog support (issue #2482)
- Improve observability tools (logging/metrics/traces) (issue #4380)
- Improve experience working with AI tools
  - LLM context file (issue #4293)
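As a concrete reference point for what window function support means, the plain-Python sketch below computes the equivalent of SQL's `SUM(amount) OVER (PARTITION BY user ORDER BY ts)`: each row keeps its identity but gains a running aggregate computed over its partition. This is an illustration of the semantics, not Daft's or PySpark's API; the data is made up.

```python
from collections import defaultdict

rows = [  # (user, ts, amount)
    ("a", 1, 10), ("b", 1, 5), ("a", 2, 20), ("b", 2, 5),
]

# Running SUM(amount) OVER (PARTITION BY user ORDER BY ts):
# sort by (partition key, ordering key), then accumulate per partition.
running = defaultdict(int)
out = []
for user, ts, amount in sorted(rows, key=lambda r: (r[0], r[1])):
    running[user] += amount
    out.append((user, ts, amount, running[user]))

print(out)
# [('a', 1, 10, 10), ('a', 2, 20, 30), ('b', 1, 5, 5), ('b', 2, 5, 10)]
```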
## Future Work
The following features would be valuable additions to Daft, but are not currently on our immediate development roadmap. We're sharing them to highlight opportunities for open source contributions, invite discussion around implementation approaches, and provide visibility into longer-term possibilities. These features have been tagged with `help wanted` and `good first issue` on the Daft repo.
- Improved Delta Lake support (see roadmap for Delta Lake)
  - Support for reading tables with deletion vectors (issue #1954)
  - Support for reading tables with column mappings (issue #1955)
- Improved Apache Hudi support (see roadmap for Apache Hudi)
- Expression parity with PySpark: Temporal (issue #3798), Math (issue #3793), String (issue #3792)
If you are interested in working on any of these features, feel free to open an issue or start a discussion on GitHub, or join our Daft Slack Community. Our team can provide technical direction and help scope the work appropriately. Thank you in advance 💜