Distributed Computing#
By default, Daft runs using your local machine's resources, so your operations are limited by the CPUs, memory, and GPUs available on your single development machine. Daft's native support for Ray, an open-source framework for distributed computing, enables you to run distributed DataFrame workloads at scale across a cluster of machines.
Setting Up Ray with Daft#
You can run Daft on Ray in multiple ways:
Simple Local Setup#
If you want to start a single-node Ray cluster on your local machine, you can do the following:

```shell
pip install "ray[default]"
ray start --head --port=6379
```

This should output something like:

```
Local node IP: 127.0.0.1

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray cluster:
    import ray
    ray.init(address='127.0.0.1:6379')
  ...
```
You can take the IP address and port and pass it to Daft with `set_runner_ray`:

```python
import daft

# Use the address printed by `ray start` above; if no address is given,
# Daft starts a local Ray cluster automatically.
daft.context.set_runner_ray(address="127.0.0.1:6379")

df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
df.show()
```
By default, if no address is specified, Daft will spin up a Ray cluster locally on your machine. If you are running Daft on a powerful machine (such as an AWS P3 machine which is equipped with multiple GPUs) this is already very useful because Daft can parallelize its execution of computation across your CPUs and GPUs.
Connecting to Remote Ray Clusters#
If you already have your own Ray cluster running remotely, you can connect Daft to it by supplying an address to `set_runner_ray`:

```python
daft.context.set_runner_ray(address="ray://<your-cluster-address>:10001")
```
For more information about the `address` keyword argument, please see the Ray documentation on initialization.
Using Ray Client#
The Ray client is a quick way to get started with running tasks and retrieving their results on Ray using Python.
Warning
To run tasks using the Ray client, the version of Daft and the minor version (e.g. 3.9, 3.10) of Python must match between client and server.
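A version mismatch typically only surfaces as a connection error, so it can help to check the client side before connecting. A minimal sketch using only the standard library:

```python
import sys

# The Ray client requires client and server to run the same minor
# version of Python (e.g. both on 3.10) and the same version of Daft.
python_minor = f"{sys.version_info.major}.{sys.version_info.minor}"
print(python_minor)
```

Compare this value (and your installed Daft version) against what is running on the Ray cluster's head node.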
```python
import daft

# Connect over the Ray client; 10001 is the default Ray client port.
daft.context.set_runner_ray(address="ray://<head_node_host>:10001")

df = daft.from_pydict({"a": [1, 2, 3], "b": [True, False, True]})
df = df.where(df["b"])
df.show()
```

Running the script prints the filtered DataFrame, computed on the Ray cluster.
Using Ray Jobs#
Ray jobs allow for more control and observability than the Ray client. In addition, your entire code runs on Ray, which means it is not constrained by the compute, network, library versions, or availability of your local machine.
```python
# job.py
import daft

def main():
    # Inside a Ray job the script already runs on the cluster, so no
    # address needs to be supplied here.
    daft.context.set_runner_ray()

    df = daft.from_pydict({"a": [1, 2, 3]})
    df.show()

if __name__ == "__main__":
    main()
```
To submit this script as a job, use the Ray CLI, which can be installed with `pip install "ray[default]"`:

```shell
ray job submit \
    --working-dir . \
    --address "http://<head_node_host>:8265" \
    --runtime-env-json '{"pip": ["daft"]}' \
    -- python job.py
```
Note
The runtime env parameter specifies that Daft should be installed on the Ray workers. Alternative methods of including Daft in the worker dependencies can be found here.
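The runtime environment passed on the command line above is plain JSON; as a sketch, it can also be built and serialized programmatically before being handed to `ray job submit --runtime-env-json` (the single-entry package list here is illustrative — pinning a version, e.g. `"daft==0.3.0"`, keeps client and cluster in sync):

```python
import json

# Runtime environment asking Ray to pip-install Daft on every worker.
runtime_env = {"pip": ["daft"]}

# Serialize it for `ray job submit --runtime-env-json`.
print(json.dumps(runtime_env))
```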
For more information about Ray jobs, see Ray docs -> Ray Jobs Overview.
Daft CLI Overview#
Daft CLI is a convenient command-line tool that simplifies running Daft in distributed environments. It provides two modes of operation to suit different needs:
- Provisioned Mode: Automatically provisions and manages Ray clusters in AWS. This is perfect for teams who want a turnkey solution with minimal setup.
- BYOC (Bring Your Own Cluster) Mode: Connects to existing Kubernetes clusters and handles Ray/Daft setup for you. This is ideal for organizations with existing infrastructure or specific compliance requirements.
When to Choose Each Mode#
| Choose Provisioned Mode if you: | Choose BYOC Mode if you: |
|---|---|
| • Want a fully managed solution with minimal setup | • Have existing Kubernetes infrastructure |
| • Are using AWS (GCP and Azure support coming soon) | • Need multi-cloud support |
| • Need quick deployment without existing infrastructure | • Have specific security or compliance requirements |
| • Want to use local development clusters | • Want more control over your cluster configuration |
Prerequisites#
The following should be installed on your machine:
A Python package manager. We recommend using uv to manage everything (i.e., dependencies, as well as the Python version itself).
Additional mode-specific requirements:
| For Provisioned Mode: | For BYOC Mode: |
|---|---|
| • The AWS CLI tool | • Running Kubernetes cluster (local, cloud-managed, or on-premise) |
| • AWS account with appropriate IAM permissions | • kubectl configured with correct context |
| • SSH key pair for cluster access | • Appropriate namespace permissions |
Installation#
Run the following commands in your terminal to initialize your project:

```shell
# Create a new project directory and virtual environment with uv
uv init my-daft-project
cd my-daft-project
uv venv
source .venv/bin/activate

# Install the Daft CLI (the package name may differ between releases;
# check the Daft documentation if this install fails)
uv pip install daft-cli
```
In your virtual environment, you should have Daft CLI installed — you can verify this by running `daft --version`.
Mode-Specific Setup#
Provisioned Mode Setup#
- Configure AWS credentials:

    ```shell
    # Configure your SSO
    aws configure sso

    # Login to your SSO
    aws sso login
    ```
- Generate and configure SSH keys:

    ```shell
    # Generate key pair
    ssh-keygen -t rsa -b 2048 -f ~/.ssh/daft-key

    # Import to AWS
    aws ec2 import-key-pair \
        --key-name "daft-key" \
        --public-key-material fileb://~/.ssh/daft-key.pub

    # Set permissions
    chmod 600 ~/.ssh/daft-key
    ```
BYOC Mode Setup#
Ensure your Kubernetes context is properly configured:

```shell
# Check your current context
kubectl config current-context

# List available contexts
kubectl config get-contexts

# Switch context if needed
kubectl config use-context <your-cluster-context>
```
Configuration#
Initialize a configuration file based on your chosen mode:

```shell
# Flags shown here are illustrative; `daft --help` lists the exact
# options supported by your version of the CLI.
daft init --provider provisioned   # Provisioned mode
daft init --provider byoc          # BYOC mode
```
Example Configurations#
Provisioned Mode (.daft.toml):

```toml
# Illustrative configuration; field names may differ between
# Daft CLI versions.
[setup]
name = "my-daft-cluster"
provider = "provisioned"

[setup.provisioned]
region = "us-west-2"
number-of-workers = 4
instance-type = "i3.2xlarge"
ssh-user = "ubuntu"
ssh-private-key = "~/.ssh/daft-key"
```
BYOC Mode (.daft.toml):

```toml
# Illustrative configuration; field names may differ between
# Daft CLI versions.
[setup]
name = "my-daft-cluster"
provider = "byoc"

[setup.byoc]
namespace = "default"
```
Cluster Operations#
Provisioned Mode#
```shell
# Subcommand names are illustrative; run `daft provisioned --help`
# for the full list supported by your version.

# Spin up a new cluster as defined in .daft.toml
daft provisioned up

# List clusters and their status
daft provisioned list

# SSH into the head node
daft provisioned ssh

# Tear the cluster down
daft provisioned down
```
BYOC Mode#
```shell
# In BYOC mode the Daft CLI drives your existing cluster through your
# kubeconfig; verify access with standard Kubernetes tooling.
kubectl get pods -n <your-namespace>
kubectl get svc -n <your-namespace>
```
Job Management#
Jobs can be submitted and managed similarly in both modes:

```shell
# Subcommand names are illustrative; run `daft job --help`
# for the full list supported by your version.

# Submit a job defined in your .daft.toml
daft job submit example-job

# Check a job's status
daft job status example-job

# Stream a job's logs
daft job logs example-job
```
Example Daft Script#
```python
# example_script.py — a minimal sketch; the bucket path is a placeholder.
import daft

df = daft.read_parquet("s3://your-bucket/data.parquet")
df = df.where(df["value"] > 0)
df.show()
```
SQL Query Support#
Daft supports running SQL queries against your data using the Postgres dialect:

```python
import daft

df = daft.from_pydict({"value": [1, -2, 3]})

# daft.sql can reference DataFrames bound to local Python variables.
result = daft.sql("SELECT * FROM df WHERE value > 0")
result.show()
```
Ray Dashboard Access#
The Ray dashboard provides insights into your cluster's performance and job status:
```shell
# Provisioned mode: open a tunnel to the dashboard through the CLI
# (subcommand illustrative; see `daft provisioned --help`)
daft provisioned connect

# BYOC mode: port-forward the Ray head service, then open
# http://localhost:8265 in your browser
kubectl port-forward svc/<ray-head-service> 8265:8265
```
Note
For Provisioned Mode, you'll need your SSH key to access the dashboard. BYOC Mode uses your Kubernetes credentials.
Monitoring Cluster State#
For Provisioned Mode, `daft provisioned list` shows cluster status, listing each cluster's name, state (e.g. running or terminated), and node details.
For BYOC Mode, use standard Kubernetes tools:
```shell
kubectl get pods -n <your-namespace>
```