DataFrame#

Most DataFrame methods are lazy: they do not execute any computation when invoked. Instead, each operation is enqueued in the DataFrame's internal query plan, and the plan is executed only when an execution method (for example `collect` or `show`) is called.
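
The lazy model described above can be illustrated with a toy sketch. The `LazyTable` class below is hypothetical and is not Daft's implementation or API; it only shows the idea of recording operations in a plan and running them when an execution method is called.

```python
from typing import Callable

Row = dict[str, int]
Step = Callable[[list[Row]], list[Row]]


class LazyTable:
    """Toy stand-in for a lazy DataFrame (hypothetical, not Daft's API)."""

    def __init__(self, rows: list[Row], plan: tuple[Step, ...] = ()) -> None:
        self._rows = rows
        self._plan = plan  # operations recorded so far, not yet executed

    def where(self, predicate: Callable[[Row], bool]) -> "LazyTable":
        # Record the filter in the plan; no rows are touched here.
        def step(rows: list[Row]) -> list[Row]:
            return [r for r in rows if predicate(r)]

        return LazyTable(self._rows, self._plan + (step,))

    def limit(self, n: int) -> "LazyTable":
        # Record the limit in the plan.
        def step(rows: list[Row]) -> list[Row]:
            return rows[:n]

        return LazyTable(self._rows, self._plan + (step,))

    def collect(self) -> list[Row]:
        # Execution method: run every recorded step in order.
        rows = self._rows
        for step in self._plan:
            rows = step(rows)
        return rows


calls: list[int] = []  # records when the predicate actually runs


def is_even(row: Row) -> bool:
    calls.append(row["x"])
    return row["x"] % 2 == 0


table = LazyTable([{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}])
lazy = table.where(is_even).limit(1)  # only builds the plan
calls_before_collect = list(calls)    # still empty at this point
result = lazy.collect()               # the plan runs here
```

In this sketch the predicate is not called until `collect()` runs, which mirrors how lazy methods defer work to execution methods.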

DataFrame #

DataFrame(builder: LogicalPlanBuilder)

A Daft DataFrame is a table of data.

It has columns, where each column has a type and the same number of items (rows) as all other columns.

Constructs a DataFrame according to a given LogicalPlan.

Users are expected instead to call the classmethods on DataFrame to create a DataFrame.

Parameters:

Name	Type	Description	Default
`builder`	`LogicalPlanBuilder`	LogicalPlan describing the steps required to arrive at this DataFrame	required

Methods:

Name	Description
`__arrow_c_schema__`
`__arrow_c_stream__`	Export as an Arrow C stream (PyCapsule).
`__contains__`	Returns whether the column exists in the dataframe.
`__getitem__`	Gets a column from the DataFrame as an Expression (`df["mycol"]`).
`__iter__`	Alias of `self.iter_rows()` with default arguments for convenient access of data.
`__len__`	Returns the count of rows when dataframe is materialized.
`agg`	Perform aggregations on this DataFrame.
`agg_concat`	Performs a global concatenation agg on the DataFrame.
`agg_list`	Performs a global list agg on the DataFrame.
`agg_set`	Performs a global set agg on the DataFrame (ignoring nulls).
`any_value`	Returns an arbitrary value on this DataFrame.
`collect`	Executes the entire DataFrame and materializes the results.
`concat`	Concatenates two DataFrames together in a "vertical" concatenation.
`count`	Performs a global count on the DataFrame.
`count_distinct`	Performs a global count of distinct values on the DataFrame.
`count_rows`	Executes the DataFrame to count the number of rows.
`describe`	Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.
`distinct`	Computes distinct rows, dropping duplicates.
`drop_duplicates`	Computes distinct rows, dropping duplicates.
`drop_nan`	Drops rows that contain NaNs. If cols is None it will drop rows with any NaN value.
`drop_null`	Drops rows that contain NaNs or NULLs. If cols is None it will drop rows with any NULL value.
`except_all`	Returns the set difference of two DataFrames, considering duplicates.
`except_distinct`	Returns the set difference of two DataFrames.
`exclude`	Drops columns from the current DataFrame by name.
`explain`	Prints the (logical and physical) plans that will be executed to produce this DataFrame.
`explode`	Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.
`filter`	Filters rows via a predicate expression, similar to SQL `WHERE`.
`groupby`	Performs a GroupBy on the DataFrame for aggregation.
`intersect`	Returns the intersection of two DataFrames.
`intersect_all`	Returns the intersection of two DataFrames, including duplicates.
`into_batches`	Splits or coalesces DataFrame to partitions of size `batch_size`.
`into_partitions`	Splits or coalesces DataFrame to `num` partitions. Order is preserved.
`iter_partitions`	Begin executing this dataframe and return an iterator over the partitions.
`iter_rows`	Return an iterator of rows for this dataframe.
`join`	Column-wise join of the current DataFrame with an `other` DataFrame, similar to a SQL `JOIN`.
`join_asof`	Point-in-time (asof) join: each left row matches the nearest right row according to the chosen strategy.
`limit`	Limits the rows in the DataFrame to the first `N` rows, similar to a SQL `LIMIT`.
`max`	Performs a global max on the DataFrame.
`mean`	Performs a global mean on the DataFrame.
`melt`	Alias for unpivot.
`min`	Performs a global min on the DataFrame.
`num_partitions`	Returns the number of partitions that will be used to execute this DataFrame.
`offset`	Returns a new DataFrame by skipping the first `N` rows, similar to a SQL `OFFSET`.
`pipe`	Apply the function to this DataFrame.
`pivot`	Pivots a column of the DataFrame and performs an aggregation on the values.
`product`	Performs a global product on the DataFrame.
`repartition`	Repartitions DataFrame to `num` partitions.
`sample`	Samples rows from the DataFrame.
`schema`	Returns the Schema of the DataFrame, which provides information about each column, as a Python object.
`select`	Creates a new DataFrame from the provided expressions, similar to a SQL `SELECT`.
`show`	Executes enough of the DataFrame in order to display the first `n` rows.
`shuffle`	Randomly reorders rows of the DataFrame.
`skew`	Performs a global skew on the DataFrame.
`skip_existing`	Filter out rows whose key(s) already exist in existing data (i.e., already processed rows).
`sort`	Sorts DataFrame globally.
`stddev`	Performs a global standard deviation on the DataFrame.
`sum`	Performs a global sum on the DataFrame.
`summarize`	Returns column statistics for the DataFrame.
`to_arrow`	Converts the current DataFrame to a pyarrow Table.
`to_arrow_iter`	Return an iterator of pyarrow recordbatches for this dataframe.
`to_dask_dataframe`	Converts the current Daft DataFrame to a Dask DataFrame.
`to_pandas`	Converts the current DataFrame to a pandas DataFrame.
`to_pydict`	Converts the current DataFrame to a python dictionary. The dictionary contains Python lists of Python objects for each column.
`to_pylist`	Converts the current DataFrame into a python list.
`to_ray_dataset`	Converts the current DataFrame to a Ray Dataset which is useful for running distributed ML model training in Ray.
`to_torch_iter_dataset`	Convert the current DataFrame into a Torch IterableDataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) for use with PyTorch.
`to_torch_map_dataset`	Convert the current DataFrame into a map-style Torch Dataset for use with PyTorch.
`transform`	Apply a function that takes and returns a DataFrame.
`union`	Returns the distinct union of two DataFrames.
`union_all`	Returns the union of two DataFrames, including duplicates.
`union_all_by_name`	Returns the union of two DataFrames, including duplicates, with columns matched by name.
`union_by_name`	Returns the distinct union by name.
`unique`	Computes distinct rows, dropping duplicates.
`unpivot`	Unpivots a DataFrame from wide to long format.
`var`	Performs a global variance on the DataFrame.
`where`	Filters rows via a predicate expression, similar to SQL `WHERE`.
`with_column`	Adds a column to the current DataFrame with an Expression, equivalent to a `select` with all current columns and the new one.
`with_column_renamed`	Renames a column in the current DataFrame.
`with_columns`	Adds columns to the current DataFrame with Expressions, equivalent to a `select` with all current columns and the new ones.
`with_columns_renamed`	Renames multiple columns in the current DataFrame.
`write_bigtable`	Write a DataFrame into a Google Cloud Bigtable table.
`write_clickhouse`	Writes the DataFrame to a ClickHouse table.
`write_csv`	Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.
`write_deltalake`	Writes the DataFrame to a Delta Lake table, returning a new DataFrame with the operations that occurred.
`write_huggingface`	Write a DataFrame into a Hugging Face dataset.
`write_iceberg`	Writes the DataFrame to an Iceberg table, returning a new DataFrame with the operations that occurred.
`write_json`	Writes the DataFrame as JSON files, returning a new DataFrame with paths to the files that were written.
`write_lance`	Writes the DataFrame to a Lance table.
`write_paimon`	Writes the DataFrame to an Apache Paimon table, returning a summary DataFrame.
`write_parquet`	Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.
`write_sink`	Writes the DataFrame to the given DataSink.
`write_sql`	Write the DataFrame to a SQL database and return write metrics.
`write_turbopuffer`	Writes the DataFrame to a Turbopuffer namespace.

Attributes:

Name	Type	Description
`column_names`	`list[str]`	Returns column names of DataFrame as a list of strings.
`columns`	`list[Expression]`	Returns columns of DataFrame as a list of Expressions.
`metrics`	`RecordBatch \| None`

Source code in daft/dataframe/dataframe.py

def __init__(self, builder: LogicalPlanBuilder) -> None:
    """Constructs a DataFrame according to a given LogicalPlan.

    Users are expected instead to call the classmethods on DataFrame to create a DataFrame.

    Args:
        builder: LogicalPlan describing the steps required to arrive at this DataFrame
    """
    if not isinstance(builder, LogicalPlanBuilder):
        if isinstance(builder, dict):
            raise ValueError(
                "DataFrames should be constructed with a dictionary of columns using `daft.from_pydict`"
            )
        if isinstance(builder, list):
            raise ValueError(
                "DataFrames should be constructed with a list of dictionaries using `daft.from_pylist`"
            )
        raise ValueError(f"Expected DataFrame to be constructed with a LogicalPlanBuilder, received: {builder}")

    self.__builder = builder
    self._result_cache: PartitionCacheEntry | None = None
    self._preview = Preview(partition=None, total_rows=None)
    self._metadata: ExecutionMetadata | None = None
    self._num_preview_rows = get_context().daft_execution_config.num_preview_rows

column_names #

column_names: list[str]

Returns column names of DataFrame as a list of strings.

Returns:

Type	Description
`list[str]`	Column names of this DataFrame.

columns #

columns: list[Expression]

Returns columns of DataFrame as a list of Expressions.

Returns:

Type	Description
`list[Expression]`	Columns of this DataFrame.

metrics #

metrics: RecordBatch | None

__arrow_c_schema__ #

__arrow_c_schema__() -> Any

Source code in daft/dataframe/dataframe.py

def __arrow_c_schema__(self) -> Any:
    return self.schema().to_pyarrow_schema().__arrow_c_schema__()

__arrow_c_stream__ #

__arrow_c_stream__(requested_schema: Any = None) -> Any

Export as an Arrow C stream (PyCapsule).

This triggers materialization of the DataFrame. Enables pa.table(daft_df) and other Arrow PyCapsule consumers.

Source code in daft/dataframe/dataframe.py

def __arrow_c_stream__(self, requested_schema: Any = None) -> Any:
    """Export as an Arrow C stream (PyCapsule).

    This triggers materialization of the DataFrame.
    Enables ``pa.table(daft_df)`` and other Arrow PyCapsule consumers.
    """
    self.collect()
    assert self._result is not None
    mp = self._result._get_merged_micropartition(self.schema())
    return mp._micropartition.__arrow_c_stream__(requested_schema)

__contains__ #

__contains__(col_name: str) -> bool

Returns whether the column exists in the dataframe.

Parameters:

Name	Type	Description	Default
`col_name`	`str`	column name	required

Returns:

Name	Type	Description
`bool`	`bool`	whether the column exists in the dataframe.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> "x" in df

True

Source code in daft/dataframe/dataframe.py

def __contains__(self, col_name: str) -> bool:
    """Returns whether the column exists in the dataframe.

    Args:
        col_name (str): column name

    Returns:
        bool: whether the column exists in the dataframe.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> "x" in df
        True

    """
    return col_name in self.column_names

__getitem__ #

__getitem__(item: int) -> Expression

__getitem__(item: str) -> Expression

__getitem__(item: slice) -> DataFrame

__getitem__(item: Iterable) -> DataFrame

__getitem__(item: int | str | slice | Iterable[str | int]) -> Union[Expression, DataFrame]

Gets a column from the DataFrame as an Expression (df["mycol"]).

Parameters:

Name	Type	Description	Default
`item`	`Union[int, str, slice, Iterable[Union[str, int]]]`	The column to get. Can be an integer index, a string column name, a slice for multiple columns, or an iterable of column names or indices.	required

Returns:

Type	Description
`Union[Expression, DataFrame]`	If a single column is requested, returns an Expression representing that column. If multiple columns are requested (via a slice or iterable), returns a new DataFrame containing those columns.

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
>>> df["a"]  # Get a single column
>>> df["b"]  # Get another single column
>>> df[0]  # Get the first column by index
>>> df[1:3]  # Get a slice of columns
>>> df[["a", "c"]]  # Get multiple columns by name
>>> df[["a", 1]]  # Get multiple columns by name and index
>>> df[0:2]  # Get a slice of columns by index
>>> df[["a", "b", 2]]  # Get a mix of column names and indices

col(a)
col(b)
col(a)
╭───────┬───────╮
│ b     ┆ c     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
╭───────┬───────╮
│ a     ┆ c     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╰───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)
╭───────┬───────┬───────╮
│ a     ┆ b     ┆ c     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╰───────┴───────┴───────╯
(No data to display: Dataframe not materialized, use .collect() to materialize)

Source code in daft/dataframe/dataframe.py

def __getitem__(self, item: int | str | slice | Iterable[str | int]) -> Union[Expression, "DataFrame"]:
    """Gets a column from the DataFrame as an Expression (``df["mycol"]``).

    Args:
        item (Union[int, str, slice, Iterable[Union[str, int]]]): The column to get. Can be an integer index, a string column name, a slice for multiple columns, or an iterable of column names or indices.

    Returns:
        Union[Expression, DataFrame]: If a single column is requested, returns an Expression representing that column.
        If multiple columns are requested (via a slice or iterable), returns a new DataFrame containing those columns.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
        >>> df["a"]  # Get a single column
        col(a)
        >>> df["b"]  # Get another single column
        col(b)
        >>> df[0]  # Get the first column by index
        col(a)
        >>> df[1:3]  # Get a slice of columns
        ╭───────┬───────╮
        │ b     ┆ c     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[["a", "c"]]  # Get multiple columns by name
        ╭───────┬───────╮
        │ a     ┆ c     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[["a", 1]]  # Get multiple columns by name and index
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[0:2]  # Get a slice of columns by index
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╰───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)
        >>> df[["a", "b", 2]]  # Get a mix of column names and indices
        ╭───────┬───────┬───────╮
        │ a     ┆ b     ┆ c     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (No data to display: Dataframe not materialized, use .collect() to materialize)

    """
    result: Expression | None

    if isinstance(item, int):
        schema = self._builder.schema()
        if item < -len(schema) or item >= len(schema):
            raise ValueError(f"{item} out of bounds for {schema}")
        result = ExpressionsProjection.from_schema(schema)[item]
        assert result is not None
        return result
    elif isinstance(item, str):
        schema = self._builder.schema()
        if item not in schema.column_names() and item != "*":
            raise ValueError(f"{item} does not exist in schema {schema}")

        return col(item)
    elif isinstance(item, Iterable):
        schema = self._builder.schema()

        columns = []
        for it in item:
            if isinstance(it, str):
                result = col(schema[it].name)
                columns.append(result)
            elif isinstance(it, int):
                if it < -len(schema) or it >= len(schema):
                    raise ValueError(f"{it} out of bounds for {schema}")
                field = list(self._builder.schema())[it]
                columns.append(col(field.name))
            else:
                raise ValueError(f"unknown indexing type: {type(it)}")
        return self.select(*columns)
    elif isinstance(item, slice):
        schema = self._builder.schema()
        columns_exprs: ExpressionsProjection = ExpressionsProjection.from_schema(schema)
        selected_columns = columns_exprs[item]
        return self.select(*[typing.cast("ColumnInputType", c) for c in selected_columns])
    else:
        raise ValueError(f"unknown indexing type: {type(item)}")
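
The dispatch rules in the source above can be sketched in plain Python. The `resolve_columns` helper below is hypothetical: it works on a list of column names instead of a Daft schema, returns names instead of Expressions or DataFrames, and ignores the `"*"` wildcard that the source accepts. It is meant only to show which index types select one column and which select several, and why the `str` check has to come before the `Iterable` check.

```python
from __future__ import annotations

from collections.abc import Iterable


def resolve_columns(names: list[str], item) -> str | list[str]:
    """Map an index-like `item` to one column name or a list of names."""

    def by_position(i: int) -> str:
        # Same bounds rule as the source: indices from -len to len - 1 pass.
        if i < -len(names) or i >= len(names):
            raise ValueError(f"{i} out of bounds for {names}")
        return names[i]

    def by_name(name: str) -> str:
        if name not in names:
            raise ValueError(f"{name} does not exist in {names}")
        return name

    if isinstance(item, int):
        return by_position(item)
    if isinstance(item, str):
        # Must be tested before Iterable, because a str is itself iterable.
        return by_name(item)
    if isinstance(item, slice):
        return names[item]
    if isinstance(item, Iterable):
        return [by_position(it) if isinstance(it, int) else by_name(it) for it in item]
    raise ValueError(f"unknown indexing type: {type(item)}")


names = ["a", "b", "c"]
single = resolve_columns(names, 0)            # one column, by position
sliced = resolve_columns(names, slice(1, 3))  # several columns, by slice
mixed = resolve_columns(names, ["a", 1])      # several columns, names and positions
```

A single `int` or `str` yields one column; a slice or any other iterable yields a selection of columns, matching the two return types documented above.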

__iter__ #

__iter__() -> Iterator[dict[str, Any]]

Alias of self.iter_rows() with default arguments for convenient access of data.

Returns:

Type	Description
`Iterator[dict[str, Any]]`	An iterator over the rows of the DataFrame, where each row is a dictionary mapping column names to values.

Examples:

>>> import daft
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
>>> for row in df:
...     print(row)

{'foo': 1, 'bar': 'a'}
{'foo': 2, 'bar': 'b'}
{'foo': 3, 'bar': 'c'}

Tip

See also df.iter_rows(): iterator over rows with more options

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def __iter__(self) -> Iterator[dict[str, Any]]:
    """Alias of `self.iter_rows()` with default arguments for convenient access of data.

    Returns:
        Iterator[dict[str, Any]]: An iterator over the rows of the DataFrame, where each row is a dictionary
        mapping column names to values.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
        >>> for row in df:
        ...     print(row)
        {'foo': 1, 'bar': 'a'}
        {'foo': 2, 'bar': 'b'}
        {'foo': 3, 'bar': 'c'}

    Tip:
        See also [`df.iter_rows()`][daft.DataFrame.iter_rows]: iterator over rows with more options
    """
    return self.iter_rows(results_buffer_size=None)

__len__ #

__len__() -> int

Returns the count of rows when dataframe is materialized.

If dataframe is not materialized yet, raises a runtime error.

Returns:

Name	Type	Description
`int`	`int`	count of rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df = df.collect()
>>> len(df)

3

Source code in daft/dataframe/dataframe.py

def __len__(self) -> int:
    """Returns the count of rows when dataframe is materialized.

    If dataframe is not materialized yet, raises a runtime error.

    Returns:
        int: count of rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df = df.collect()
        >>> len(df)
        3

    """
    if self._result is not None:
        return len(self._result)

    message = (
        "Cannot call len() on an unmaterialized dataframe:"
        " either materialize your dataframe with df.collect() first before calling len(),"
        " or use `df.count_rows()` instead which will calculate the total number of rows."
    )
    raise RuntimeError(message)

agg #

agg(*to_agg: Expression | Iterable[Expression]) -> DataFrame

Perform aggregations on this DataFrame.

Allows mixed aggregations across multiple columns and returns a single row that aggregates the entire DataFrame.

Parameters:

Name	Type	Description	Default
`*to_agg`	`Expression`	aggregation expressions	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with aggregated results

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict(
...     {"student_id": [1, 2, 3, 4], "test1": [0.5, 0.4, 0.6, 0.7], "test2": [0.9, 0.8, 0.7, 1.0]}
... )
>>> agg_df = df.agg(
...     df["test1"].mean(),
...     df["test2"].mean(),
...     ((df["test1"] + df["test2"]) / 2).min().alias("total_min"),
...     ((df["test1"] + df["test2"]) / 2).max().alias("total_max"),
... )
>>> agg_df.show()

╭─────────┬────────────────────┬────────────────────┬───────────╮
│ test1   ┆ test2              ┆ total_min          ┆ total_max │
│ ---     ┆ ---                ┆ ---                ┆ ---       │
│ Float64 ┆ Float64            ┆ Float64            ┆ Float64   │
╞═════════╪════════════════════╪════════════════════╪═══════════╡
│ 0.55    ┆ 0.8500000000000001 ┆ 0.6000000000000001 ┆ 0.85      │
╰─────────┴────────────────────┴────────────────────┴───────────╯
(Showing first 1 of 1 rows)
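
The single row in the example output can be checked by hand. The following is a plain-Python sketch of the same arithmetic, not a Daft call; the last floating-point digits may differ from the table above, since they depend on the order in which the values are summed.

```python
test1 = [0.5, 0.4, 0.6, 0.7]
test2 = [0.9, 0.8, 0.7, 1.0]

# Global means of each column.
mean_test1 = sum(test1) / len(test1)
mean_test2 = sum(test2) / len(test2)

# Per-row average of the two tests, then the global min and max of that column.
row_avg = [(a + b) / 2 for a, b in zip(test1, test2)]
total_min = min(row_avg)
total_max = max(row_avg)

print(mean_test1, mean_test2, total_min, total_max)
```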

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def agg(self, *to_agg: Expression | Iterable[Expression]) -> "DataFrame":
    """Perform aggregations on this DataFrame.

    Allows for mixed aggregations for multiple columns and will return a single row that aggregated the entire DataFrame.

    Args:
        *to_agg (Expression): aggregation expressions

    Returns:
        DataFrame: DataFrame with aggregated results

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict(
        ...     {"student_id": [1, 2, 3, 4], "test1": [0.5, 0.4, 0.6, 0.7], "test2": [0.9, 0.8, 0.7, 1.0]}
        ... )
        >>> agg_df = df.agg(
        ...     df["test1"].mean(),
        ...     df["test2"].mean(),
        ...     ((df["test1"] + df["test2"]) / 2).min().alias("total_min"),
        ...     ((df["test1"] + df["test2"]) / 2).max().alias("total_max"),
        ... )
        >>> agg_df.show()
        ╭─────────┬────────────────────┬────────────────────┬───────────╮
        │ test1   ┆ test2              ┆ total_min          ┆ total_max │
        │ ---     ┆ ---                ┆ ---                ┆ ---       │
        │ Float64 ┆ Float64            ┆ Float64            ┆ Float64   │
        ╞═════════╪════════════════════╪════════════════════╪═══════════╡
        │ 0.55    ┆ 0.8500000000000001 ┆ 0.6000000000000001 ┆ 0.85      │
        ╰─────────┴────────────────────┴────────────────────┴───────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    to_agg_list = (
        list(to_agg[0])
        if (len(to_agg) == 1 and not isinstance(to_agg[0], Expression))
        else list(typing.cast("tuple[Expression]", to_agg))
    )

    for expr in to_agg_list:
        if not isinstance(expr, Expression):
            raise ValueError(f"DataFrame.agg() only accepts expression type, received: {type(expr)}")

    return self._agg(to_agg_list, group_by=None)

agg_concat #

agg_concat(*cols: ColumnInputType, delimiter: str | None = None) -> DataFrame

Performs a global concatenation agg on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns that are lists or strings to concatenate	`()`
`delimiter`	`str \| None`	Optional delimiter to insert between concatenated string values. Only supported for string columns.	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated list or string. Should be a single row.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"col_a": [[1, 2], [3, 4]]})
>>> df = df.agg_concat("col_a")
>>> df.show()

╭──────────────╮
│ col_a        │
│ ---          │
│ List[Int64]  │
╞══════════════╡
│ [1, 2, 3, 4] │
╰──────────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def agg_concat(self, *cols: ColumnInputType, delimiter: str | None = None) -> "DataFrame":
    """Performs a global concatenation agg on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns that are lists or strings to concatenate
        delimiter: Optional delimiter to insert between concatenated string values. Only supported for string
            columns.

    Returns:
        DataFrame: Globally aggregated list or string. Should be a single row.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"col_a": [[1, 2], [3, 4]]})
        >>> df = df.agg_concat("col_a")
        >>> df.show()
        ╭──────────────╮
        │ col_a        │
        │ ---          │
        │ List[Int64]  │
        ╞══════════════╡
        │ [1, 2, 3, 4] │
        ╰──────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(lambda expr: Expression.string_agg(expr, delimiter=delimiter), cols)
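
As a rough mental model of what a global concatenation produces, the sketch below reproduces the list example above in plain Python and shows one plausible reading of `delimiter` for string columns. The helper names are hypothetical, and two behaviours are assumptions rather than documented facts: strings are joined with nothing between them when `delimiter` is `None`, and the inputs contain no nulls.

```python
from __future__ import annotations

from itertools import chain


def concat_lists(values: list[list[int]]) -> list[int]:
    """Flatten a column of lists into one list, keeping row order."""
    return list(chain.from_iterable(values))


def concat_strings(values: list[str], delimiter: str | None = None) -> str:
    """Join a column of strings, inserting `delimiter` between values if given."""
    separator = "" if delimiter is None else delimiter  # assumption: None means no separator
    return separator.join(values)


flat = concat_lists([[1, 2], [3, 4]])
joined = concat_strings(["a", "b", "c"], delimiter=", ")
```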

agg_list #

agg_list(*cols: ColumnInputType) -> DataFrame

Performs a global list agg on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to form into a list	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated list. Should be a single row.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.agg_list("col_a")
>>> df.show()

╭─────────────╮
│ col_a       │
│ ---         │
│ List[Int64] │
╞═════════════╡
│ [1, 2, 3]   │
╰─────────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def agg_list(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global list agg on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to form into a list
    Returns:
        DataFrame: Globally aggregated list. Should be a single row.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.agg_list("col_a")
        >>> df.show()
        ╭─────────────╮
        │ col_a       │
        │ ---         │
        │ List[Int64] │
        ╞═════════════╡
        │ [1, 2, 3]   │
        ╰─────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.list_agg, cols)

agg_set #

agg_set(*cols: ColumnInputType) -> DataFrame

Performs a global set agg on the DataFrame (ignoring nulls).

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to form into a set	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated set. Should be a single row.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"col_a": [1, 2, 2, 3]})
>>> df = df.agg_set("col_a")
>>> df.show()

╭─────────────╮
│ col_a       │
│ ---         │
│ List[Int64] │
╞═════════════╡
│ [1, 2, 3]   │
╰─────────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def agg_set(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global set agg on the DataFrame (ignoring nulls).

    Args:
        *cols (Union[str, Expression]): columns to form into a set

    Returns:
        DataFrame: Globally aggregated set. Should be a single row.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"col_a": [1, 2, 2, 3]})
        >>> df = df.agg_set("col_a")
        >>> df.show()
        ╭─────────────╮
        │ col_a       │
        │ ---         │
        │ List[Int64] │
        ╞═════════════╡
        │ [1, 2, 3]   │
        ╰─────────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.list_agg_distinct, cols)
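
The phrase "set agg (ignoring nulls)" can be read as: collect the distinct non-null values of the column into one list. The `set_agg` helper below is a hypothetical plain-Python sketch of that reading, not Daft's implementation. Keeping values in order of first appearance is an assumption of the sketch; the documentation above does not state an ordering guarantee.

```python
def set_agg(values: list) -> list:
    """Collect distinct non-null values, in order of first appearance."""
    seen = set()
    distinct = []
    for value in values:
        if value is None or value in seen:
            continue  # skip nulls and repeats
        seen.add(value)
        distinct.append(value)
    return distinct


deduplicated = set_agg([1, 2, 2, 3])
without_nulls = set_agg([1, None, 1, None, 2])
```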

any_value #

any_value(*cols: ColumnInputType) -> DataFrame

Returns an arbitrary value on this DataFrame.

Values for each column are not guaranteed to be from the same row.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to get an arbitrary value from	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with any values.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.any_value("col_a")
>>> df.show()

╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
╰───────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def any_value(self, *cols: ColumnInputType) -> "DataFrame":
    """Returns an arbitrary value on this DataFrame.

    Values for each column are not guaranteed to be from the same row.

    Args:
        *cols (Union[str, Expression]): columns to get an arbitrary value from
    Returns:
        DataFrame: DataFrame with any values.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.any_value("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.any_value, cols)

collect #

collect(num_preview_rows: int | None = 8) -> DataFrame

Executes the entire DataFrame and materializes the results.

Parameters:

Name	Type	Description	Default
`num_preview_rows`	`int \| None`	Number of rows to preview. Defaults to 8.	`8`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with materialized results.

Note

This call is blocking and will execute the DataFrame when called.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df = df.collect()
>>> df.show()

╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def collect(self, num_preview_rows: int | None = 8) -> "DataFrame":
    """Executes the entire DataFrame and materializes the results.

    Args:
        num_preview_rows: Number of rows to preview. Defaults to 8.

    Returns:
        DataFrame: DataFrame with materialized results.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df = df.collect()
        >>> df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    self._materialize_results()
    assert self._result is not None
    dataframe_len = len(self._result)
    if num_preview_rows is not None:
        self._num_preview_rows = num_preview_rows
    else:
        self._num_preview_rows = dataframe_len
    return self
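
The handling of `num_preview_rows` in the source above reduces to a small rule, restated here as a standalone function. `resolve_preview_rows` is a hypothetical name used only for illustration; it mirrors the branch shown above rather than any public Daft API.

```python
from __future__ import annotations


def resolve_preview_rows(num_preview_rows: int | None, total_rows: int) -> int:
    """Return the preview size stored after materialization."""
    if num_preview_rows is not None:
        return num_preview_rows  # stored as given, not clamped to total_rows
    return total_rows  # None means "preview every materialized row"


default_preview = resolve_preview_rows(8, total_rows=3)
full_preview = resolve_preview_rows(None, total_rows=3)
```

Passing `None` is therefore the way to ask for a preview of all rows, while an explicit integer is kept as is even when it exceeds the row count.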

concat #

concat(other: DataFrame) -> DataFrame

Concatenates two DataFrames together in a "vertical" concatenation.

The resulting DataFrame has a number of rows equal to the sum of the row counts of the input DataFrames.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	other DataFrame to concatenate	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with rows from `self` on top and rows from `other` at the bottom.

Note

DataFrames being concatenated must have exactly the same schema. You may wish to use the df.select() and expr.cast() methods to ensure schema compatibility before concatenation.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2], "b": [3, 4]})
>>> df2 = daft.from_pydict({"a": [5, 6], "b": [7, 8]})
>>> concatenated_df = df1.concat(df2)
>>> concatenated_df.show()

╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6     ┆ 8     │
╰───────┴───────╯
(Showing first 4 of 4 rows)
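
The schema requirement in the note above can be pictured with a plain-Python model. This is only a sketch of the idea, not Daft's implementation: tables are modeled as dicts of column lists, and `schema_of` and `concat_tables` are hypothetical helper names introduced for illustration.

```python
# Plain-Python model of a "vertical" concat with a strict schema check.
# Tables are dicts mapping column name -> non-empty list of values
# (an assumption for illustration; Daft uses its own schema and plan types).

def schema_of(table):
    """Return an ordered list of (column name, Python type name) pairs."""
    return [(name, type(values[0]).__name__) for name, values in table.items()]


def concat_tables(top, bottom):
    """Stack `bottom` under `top`, refusing mismatched schemas."""
    if schema_of(top) != schema_of(bottom):
        raise ValueError("tables must have exactly the same schema")
    return {name: top[name] + bottom[name] for name in top}


df1 = {"a": [1, 2], "b": [3, 4]}
df2 = {"a": [5, 6], "b": [7, 8]}
result = concat_tables(df1, df2)
print(result)  # {'a': [1, 2, 5, 6], 'b': [3, 4, 7, 8]}

# A float column where an int column is expected is rejected, which is why
# casting to a common type before concatenating can be necessary.
df3 = {"a": [5.0, 6.0], "b": [7, 8]}
try:
    concat_tables(df1, df3)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```

In the model, column order and type both take part in the comparison, so two tables with the same columns in a different order would also be rejected.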

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def concat(self, other: "DataFrame") -> "DataFrame":
    """Concatenates two DataFrames together in a "vertical" concatenation.

    The resulting DataFrame has number of rows equal to the sum of the number of rows of the input DataFrames.

    Args:
        other (DataFrame): other DataFrame to concatenate

    Returns:
        DataFrame: DataFrame with rows from `self` on top and rows from `other` at the bottom.

    Note:
        DataFrames being concatenated **must have exactly the same schema**. You may wish to use the
        [df.select()][daft.DataFrame.select] and [expr.cast()][daft.expressions.Expression.cast] methods
        to ensure schema compatibility before concatenation.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2], "b": [3, 4]})
        >>> df2 = daft.from_pydict({"a": [5, 6], "b": [7, 8]})
        >>> concatenated_df = df1.concat(df2)
        >>> concatenated_df.show()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 5     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 6     ┆ 8     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
    """
    if self.schema() != other.schema():
        raise ValueError(
            f"DataFrames must have exactly the same schema for concatenation!\nExpected:\n{self.schema()}\n\nReceived:\n{other.schema()}"
        )
    builder = self._builder.concat(other._builder)
    return DataFrame(builder)

count #

count(*cols: ColumnInputType | int) -> DataFrame

Performs a global count on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression, int]`	columns to count	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated count. Should be a single row.

Examples:

If no columns are specified (i.e. you call df.count()), or only the literal string "*" is passed, this behaves much like a COUNT(*) operation in SQL and returns a new DataFrame with a single column named "count".

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"foo": [1, None, None], "bar": [None, 2, 2], "baz": [3, 4, 5]})
>>> df.count().show()  # equivalent to df.count("*").show()

╭────────╮
│ count  │
│ ---    │
│ UInt64 │
╞════════╡
│ 3      │
╰────────╯
(Showing first 1 of 1 rows)

However, specifying column names instead counts the non-null values in each named column, similar to the SQL query SELECT COUNT(foo), COUNT(bar) FROM df. Also, using df.count(col("*")) will expand out into count() for each column.

>>> df.count("foo", "bar").show()

╭────────┬────────╮
│ foo    ┆ bar    │
│ ---    ┆ ---    │
│ UInt64 ┆ UInt64 │
╞════════╪════════╡
│ 1      ┆ 2      │
╰────────┴────────╯
(Showing first 1 of 1 rows)

>>> df.count(df["*"]).show()

╭────────┬────────┬────────╮
│ foo    ┆ bar    ┆ baz    │
│ ---    ┆ ---    ┆ ---    │
│ UInt64 ┆ UInt64 ┆ UInt64 │
╞════════╪════════╪════════╡
│ 1      ┆ 2      ┆ 3      │
╰────────┴────────┴────────╯
(Showing first 1 of 1 rows)
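
The difference between the two counting modes can be reproduced with a small plain-Python model. This is a sketch under stated assumptions, not Daft code: `None` stands in for null, the table is a dict of lists, and `count_rows_model` and `count_non_null_model` are hypothetical names.

```python
# Plain-Python model of the two counting modes; None stands in for null.
data = {"foo": [1, None, None], "bar": [None, 2, 2], "baz": [3, 4, 5]}


def count_rows_model(table):
    """COUNT(*)-style: number of rows, nulls included."""
    return len(next(iter(table.values())))


def count_non_null_model(table, *names):
    """COUNT(col)-style: non-null values per named column."""
    return {name: sum(v is not None for v in table[name]) for name in names}


row_count = count_rows_model(data)
per_column = count_non_null_model(data, "foo", "bar")
every_column = count_non_null_model(data, *data)  # all column names

print(row_count)     # 3
print(per_column)    # {'foo': 1, 'bar': 2}
print(every_column)  # {'foo': 1, 'bar': 2, 'baz': 3}
```

The three printed values line up with the three example tables above: one row count, then per-column non-null counts.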

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def count(self, *cols: ColumnInputType | int) -> "DataFrame":
    """Performs a global count on the DataFrame.

    Args:
        *cols (Union[str, Expression, int]): columns to count
    Returns:
        DataFrame: Globally aggregated count. Should be a single row.


    Examples:
        If no columns are specified (i.e. in the case you call `df.count()`), or only the literal string "*",
        this functions very similarly to a COUNT(*) operation in SQL and will return a new dataframe with a
        single column with the name "count".

        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict({"foo": [1, None, None], "bar": [None, 2, 2], "baz": [3, 4, 5]})
        >>> df.count().show()  # equivalent to df.count("*").show()
        ╭────────╮
        │ count  │
        │ ---    │
        │ UInt64 │
        ╞════════╡
        │ 3      │
        ╰────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

        However, specifying some column names would instead change the behavior to count all non-null values,
        similar to a SQL command for `SELECT COUNT(foo), COUNT(bar) FROM df`. Also, using `df.count(col("*"))`
        will expand out into count() for each column.

        >>> df.count("foo", "bar").show()
        ╭────────┬────────╮
        │ foo    ┆ bar    │
        │ ---    ┆ ---    │
        │ UInt64 ┆ UInt64 │
        ╞════════╪════════╡
        │ 1      ┆ 2      │
        ╰────────┴────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

        >>> df.count(df["*"]).show()
        ╭────────┬────────┬────────╮
        │ foo    ┆ bar    ┆ baz    │
        │ ---    ┆ ---    ┆ ---    │
        │ UInt64 ┆ UInt64 ┆ UInt64 │
        ╞════════╪════════╪════════╡
        │ 1      ┆ 2      ┆ 3      │
        ╰────────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    # Special case: treat this as a COUNT(*) operation which is likely what most people would expect
    # If user passes in "*", also do this behavior (by default it would count each column individually)
    if (
        len(cols) == 0
        or (len(cols) == 1 and isinstance(cols[0], str) and cols[0] == "*")
        or (len(cols) == 1 and isinstance(cols[0], int))
    ):
        builder = self._builder.count()
        return DataFrame(builder)

    if any(isinstance(c, str) and c == "*" for c in cols):
        # we do not support hybrid count-all and count-nonnull
        raise ValueError("Cannot call count() with both * and column names")

    # Otherwise, perform a column-wise count on the specified columns
    return self._apply_agg_fn(Expression.count, typing.cast("tuple[ColumnInputType, ...]", cols))

count_distinct #

count_distinct(*cols: ColumnInputType) -> DataFrame

Performs a global count of distinct values on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to count distinct values	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated count of distinct values. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 2, 3, 3, 3]})
>>> df = df.count_distinct("col_a")
>>> df.show()

╭────────╮
│ col_a  │
│ ---    │
│ UInt64 │
╞════════╡
│ 3      │
╰────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def count_distinct(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global count of distinct values on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to count distinct values
    Returns:
        DataFrame: Globally aggregated count of distinct values. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 2, 3, 3, 3]})
        >>> df = df.count_distinct("col_a")
        >>> df.show()
        ╭────────╮
        │ col_a  │
        │ ---    │
        │ UInt64 │
        ╞════════╡
        │ 3      │
        ╰────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.count_distinct, cols)

count_rows #

count_rows() -> int

Executes the DataFrame to count the number of rows.

Returns:

Name	Type	Description
`int`	`int`	count of the number of rows in this DataFrame.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df.count_rows()

3

Note

This will execute the DataFrame and return the number of rows in it.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def count_rows(self) -> int:
    """Executes the Dataframe to count the number of rows.

    Returns:
        int: count of the number of rows in this DataFrame.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df.count_rows()
        3

    Note:
        This will execute the DataFrame and return the number of rows in it.

    """
    if self._result is not None:
        return len(self._result)
    builder = self._builder.count()
    count_df = DataFrame(builder)
    # Expects builder to produce a single-partition, single-row DataFrame containing
    # a "count" column, where the lone value represents the row count for the DataFrame.
    return count_df.to_pydict()["count"][0]

describe #

describe() -> DataFrame

Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A dataframe where each row is a column name and its corresponding type.

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> df.describe().show()

╭─────────────┬────────╮
│ column_name ┆ type   │
│ ---         ┆ ---    │
│ String      ┆ String │
╞═════════════╪════════╡
│ a           ┆ Int64  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b           ┆ String │
╰─────────────┴────────╯
(Showing first 2 of 2 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def describe(self) -> "DataFrame":
    """Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.

    Returns:
        DataFrame: A dataframe where each row is a column name and its corresponding type.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
        >>> df.describe().show()
        ╭─────────────┬────────╮
        │ column_name ┆ type   │
        │ ---         ┆ ---    │
        │ String      ┆ String │
        ╞═════════════╪════════╡
        │ a           ┆ Int64  │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ b           ┆ String │
        ╰─────────────┴────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    builder = self.__builder.describe()
    return DataFrame(builder)

distinct #

distinct(*on: ColumnInputType) -> DataFrame

Computes distinct rows, dropping duplicates.

Optionally, specify a subset of columns to perform distinct on.

Parameters:

Name	Type	Description	Default
`*on`	`Union[str, Expression]`	columns to perform distinct on. Defaults to all columns.	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame that has only distinct rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
>>> distinct_df = df.distinct()
>>> distinct_df = distinct_df.sort("x")
>>> distinct_df.show()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)

>>> # Pass a subset of columns to perform distinct on
>>> # Note that output for z is non-deterministic. Both 8 and 9 are possible.
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 9]})
>>> df.distinct("x", daft.col("y")).sort("x").show()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)
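
Why the `z` value is non-deterministic when only a subset of columns is used can be seen in a plain-Python model. This is a sketch, not Daft's algorithm: it assumes a "first row seen wins" rule per key, and `distinct_on` is a hypothetical helper.

```python
# Plain-Python model of distinct-on-a-subset: keep one row per key.
# Which row survives depends on the order rows are encountered, which a
# parallel engine does not promise (assumption: "first seen wins" here).

def distinct_on(rows, key_columns):
    kept = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        kept.setdefault(key, row)  # first row seen for this key wins
    return sorted(kept.values(), key=lambda r: r["x"])


rows = [
    {"x": 1, "y": 4, "z": 7},
    {"x": 2, "y": 5, "z": 8},
    {"x": 2, "y": 5, "z": 9},
]

forward = distinct_on(rows, ["x", "y"])
backward = distinct_on(list(reversed(rows)), ["x", "y"])

print([r["z"] for r in forward])   # [7, 8]
print([r["z"] for r in backward])  # [7, 9]
```

Both results are valid answers for distinct on `x` and `y`; only the processing order differs.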

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def distinct(self, *on: ColumnInputType) -> "DataFrame":
    """Computes distinct rows, dropping duplicates.

    Optionally, specify a subset of columns to perform distinct on.

    Args:
        *on (Union[str, Expression]): columns to perform distinct on. Defaults to all columns.

    Returns:
        DataFrame: DataFrame that has only distinct rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
        >>> distinct_df = df.distinct()
        >>> distinct_df = distinct_df.sort("x")
        >>> distinct_df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
        >>> # Pass a subset of columns to perform distinct on
        >>> # Note that output for z is non-deterministic. Both 8 and 9 are possible.
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 9]})
        >>> df.distinct("x", daft.col("y")).sort("x").show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    builder = self._builder.distinct(column_inputs_to_expressions(on))
    return DataFrame(builder)

drop_duplicates #

drop_duplicates(*subset: ColumnInputType) -> DataFrame

Computes distinct rows, dropping duplicates.

Alias for DataFrame.distinct.

Parameters:

Name	Type	Description	Default
`*subset`	`Union[str, Expression]`	columns to perform distinct on. Defaults to all columns.	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame that has only distinct rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
>>> distinct_df = df.drop_duplicates()
>>> distinct_df = distinct_df.sort("x")
>>> distinct_df.show()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def drop_duplicates(self, *subset: ColumnInputType) -> "DataFrame":
    """Computes distinct rows, dropping duplicates.

    Alias for [DataFrame.distinct][daft.DataFrame.distinct].

    Args:
        *subset (Union[str, Expression]): columns to perform distinct on. Defaults to all columns.

    Returns:
        DataFrame: DataFrame that has only distinct rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
        >>> distinct_df = df.drop_duplicates()
        >>> distinct_df = distinct_df.sort("x")
        >>> distinct_df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    return self.distinct(*subset)

drop_nan #

drop_nan(*cols: ColumnInputType) -> DataFrame

Drops rows that contain NaNs. If no columns are given, it will drop rows with a NaN value in any column.

If column names are supplied, it will drop only those rows that contain NaNs in one of these columns.

Parameters:

Name	Type	Description	Default
`*cols`	`str`	column names by which rows containing NaNs should be filtered	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame without NaNs in specified/all columns

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1.0, 2.2, 3.5, float("nan")]})
>>> df.drop_nan().collect()  # drops rows where any column contains NaN values

╭─────────╮
│ a       │
│ ---     │
│ Float64 │
╞═════════╡
│ 1       │
├╌╌╌╌╌╌╌╌╌┤
│ 2.2     │
├╌╌╌╌╌╌╌╌╌┤
│ 3.5     │
╰─────────╯
(Showing first 3 of 3 rows)

>>> import daft
>>> df = daft.from_pydict({"a": [1.6, 2.5, 3.3, float("nan")]})
>>> df.drop_nan("a").collect()  # drops rows where column `a` contains NaN values

╭─────────╮
│ a       │
│ ---     │
│ Float64 │
╞═════════╡
│ 1.6     │
├╌╌╌╌╌╌╌╌╌┤
│ 2.5     │
├╌╌╌╌╌╌╌╌╌┤
│ 3.3     │
╰─────────╯
(Showing first 3 of 3 rows)
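
The source listing for this method treats a null as "not NaN", so rows holding nulls are kept. A plain-Python model of that behavior is sketched here, with `None` standing in for null; `is_nan_value` and `drop_nan_model` are hypothetical names and the model is not Daft's implementation.

```python
import math

# Plain-Python model of drop_nan; None stands in for null.
# Only float values are tested, and a null counts as "not NaN",
# so rows that hold nulls are kept.

def is_nan_value(value):
    return isinstance(value, float) and math.isnan(value)


def drop_nan_model(rows, *cols):
    names = cols or tuple(rows[0].keys())  # no cols given: check every column
    return [row for row in rows if not any(is_nan_value(row[c]) for c in names)]


rows = [
    {"a": 1.6, "b": 1.0},
    {"a": float("nan"), "b": 2.0},
    {"a": None, "b": 3.0},
    {"a": 4.0, "b": float("nan")},
]

any_column = drop_nan_model(rows)
only_a = drop_nan_model(rows, "a")

print([r["b"] for r in any_column])  # [1.0, 3.0]
print([r["a"] for r in only_a])      # [1.6, None, 4.0]
```

Restricting the check to column `a` keeps the last row, because its NaN sits in column `b`.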

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def drop_nan(self, *cols: ColumnInputType) -> "DataFrame":
    """Drops rows that contains NaNs. If cols is None it will drop rows with any NaN value.

    If column names are supplied, it will drop only those rows that contains NaNs in one of these columns.

    Args:
        *cols (str): column names by which rows containing nans/NULLs should be filtered

    Returns:
        DataFrame: DataFrame without NaNs in specified/all columns

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1.0, 2.2, 3.5, float("nan")]})
        >>> df.drop_nan().collect()  # drops rows where any column contains NaN values
        ╭─────────╮
        │ a       │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1       │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 2.2     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 3.5     │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        >>> import daft
        >>> df = daft.from_pydict({"a": [1.6, 2.5, 3.3, float("nan")]})
        >>> df.drop_nan("a").collect()  # drops rows where column `a` contains NaN values
        ╭─────────╮
        │ a       │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1.6     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 2.5     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 3.3     │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    """
    if len(cols) == 0:
        columns = column_inputs_to_expressions(self.column_names)
    else:
        columns = column_inputs_to_expressions(cols)
    float_columns = [
        column
        for column in columns
        if (
            column._to_field(self.schema()).dtype == DataType.float32()
            or column._to_field(self.schema()).dtype == DataType.float64()
        )
    ]

    # avoid superfluous .where with empty iterable when nothing to filter.
    if not float_columns:
        return self

    from daft.functions import is_nan, when

    return self.where(
        ~reduce(
            lambda x, y: when(x.is_null(), lit(False)).otherwise(x) | when(y.is_null(), lit(False)).otherwise(y),
            (is_nan(x) for x in float_columns),
        )
    )

drop_null #

drop_null(*cols: ColumnInputType) -> DataFrame

Drops rows that contain NULLs. If no columns are given, it will drop rows with a NULL value in any column. NaN values are not NULLs and are kept, as the example below shows.

If column names are supplied, it will drop only those rows that contain NULLs in one of these columns.

Parameters:

Name	Type	Description	Default
`*cols`	`str`	column names by which rows containing NULLs should be filtered	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame without missing values in specified/all columns

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1.6, 2.5, None, float("NaN")]})
>>> df.drop_null("a").collect()

╭─────────╮
│ a       │
│ ---     │
│ Float64 │
╞═════════╡
│ 1.6     │
├╌╌╌╌╌╌╌╌╌┤
│ 2.5     │
├╌╌╌╌╌╌╌╌╌┤
│ NaN     │
╰─────────╯
(Showing first 3 of 3 rows)
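
The example shows the NaN row surviving while the null row is removed. A plain-Python model of that rule is sketched here, with `None` standing in for null; `drop_null_model` is a hypothetical helper, not part of Daft.

```python
# Plain-Python model of drop_null; None stands in for null.
# NaN is an ordinary float value here, so it is not removed.

def drop_null_model(rows, *cols):
    names = cols or tuple(rows[0].keys())  # no cols given: check every column
    return [row for row in rows if all(row[c] is not None for c in names)]


rows = [
    {"a": 1.6, "b": "x"},
    {"a": None, "b": "y"},
    {"a": float("nan"), "b": None},
]

only_a = drop_null_model(rows, "a")
any_column = drop_null_model(rows)

print([r["b"] for r in only_a])      # ['x', None]
print([r["b"] for r in any_column])  # ['x']
```

To remove both kinds of missing value, the two filters would need to be applied one after the other.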

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def drop_null(self, *cols: ColumnInputType) -> "DataFrame":
    """Drops rows that contains NaNs or NULLs. If cols is None it will drop rows with any NULL value.

    If column names are supplied, it will drop only those rows that contains NULLs in one of these columns.

    Args:
        *cols (str): column names by which rows containing nans should be filtered

    Returns:
        DataFrame: DataFrame without missing values in specified/all columns

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1.6, 2.5, None, float("NaN")]})
        >>> df.drop_null("a").collect()
        ╭─────────╮
        │ a       │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1.6     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ 2.5     │
        ├╌╌╌╌╌╌╌╌╌┤
        │ NaN     │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)


    """
    if len(cols) == 0:
        columns = column_inputs_to_expressions(self.column_names)
    else:
        columns = column_inputs_to_expressions(cols)
    return self.where(~reduce(lambda x, y: x | y, (x.is_null() for x in columns)))

except_all #

except_all(other: DataFrame) -> DataFrame

Returns the set difference of two DataFrames, considering duplicates.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	DataFrame to except with	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with the set difference of the two DataFrames, considering duplicates

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
>>> df2 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
>>> df1.except_all(df2).collect()

╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
╰───────┴───────╯
(Showing first 1 of 1 rows)
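
"Considering duplicates" means the difference is taken over multisets of rows. The sketch below models that with `collections.Counter`, assuming the usual SQL `EXCEPT ALL` semantics (which the example output is consistent with); `except_all_model` is a hypothetical name and rows are modeled as tuples.

```python
from collections import Counter

# Plain-Python model of a multiset ("bag") difference over rows.
# Rows are tuples; each row in `right` cancels one matching row in `left`.

def except_all_model(left, right):
    remaining = Counter(left) - Counter(right)  # non-positive counts are dropped
    return sorted(remaining.elements())


df1_rows = [(1, 4), (1, 4), (2, 6), (2, 6)]
df2_rows = [(1, 4), (2, 6), (2, 6)]

result = except_all_model(df1_rows, df2_rows)
print(result)  # [(1, 4)]
```

Row `(1, 4)` appears twice on the left and once on the right, so exactly one copy remains; both copies of `(2, 6)` are cancelled.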

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def except_all(self, other: "DataFrame") -> "DataFrame":
    """Returns the set difference of two DataFrames, considering duplicates.

    Args:
        other (DataFrame): DataFrame to except with

    Returns:
        DataFrame: DataFrame with the set difference of the two DataFrames, considering duplicates

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
        >>> df1.except_all(df2).collect()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    builder = self._builder.except_all(other._builder)
    return DataFrame(builder)

except_distinct #

except_distinct(other: DataFrame) -> DataFrame

Returns the set difference of two DataFrames.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	DataFrame to except with	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with the set difference of the two DataFrames

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
>>> df1.except_distinct(df2).collect()

╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 5     │
╰───────┴───────╯
(Showing first 1 of 1 rows)
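
For comparison with a bag difference, a plain set difference over rows can be modeled as below. This is a sketch assuming the usual SQL `EXCEPT DISTINCT` semantics; `except_distinct_model` is a hypothetical name and rows are modeled as tuples.

```python
# Plain-Python model of a set difference over rows (duplicates collapse).

def except_distinct_model(left, right):
    return sorted(set(left) - set(right))


df1_rows = [(1, 4), (2, 5), (3, 6)]
df2_rows = [(1, 4), (2, 8), (3, 6)]

result = except_distinct_model(df1_rows, df2_rows)
print(result)  # [(2, 5)]

# Unlike a bag difference, repeated rows on the left appear at most once.
with_duplicates = except_distinct_model([(2, 5), (2, 5), (1, 4)], [(1, 4)])
print(with_duplicates)  # [(2, 5)]
```

A row is compared as a whole, so `(2, 5)` is not removed by `(2, 8)` even though the first column matches.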

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def except_distinct(self, other: "DataFrame") -> "DataFrame":
    """Returns the set difference of two DataFrames.

    Args:
        other (DataFrame): DataFrame to except with

    Returns:
        DataFrame: DataFrame with the set difference of the two DataFrames

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
        >>> df1.except_distinct(df2).collect()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 2     ┆ 5     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    builder = self._builder.except_distinct(other._builder)
    return DataFrame(builder)

exclude #

exclude(*names: str) -> DataFrame

Drops columns from the current DataFrame by name.

This is equivalent to performing a select with all the columns but the ones excluded.

Parameters:

Name	Type	Description	Default
`*names`	`str`	names to exclude	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with some columns excluded.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df_without_x = df.exclude("x")
>>> df_without_x.show()

╭───────┬───────╮
│ y     ┆ z     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6     ┆ 9     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
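
The equivalence with a select can be stated in a few lines of plain Python. This is a sketch of the column arithmetic only; `exclude_model` is a hypothetical helper and does not touch any data.

```python
# Plain-Python model: excluding columns is selecting everything else.

def exclude_model(column_names, *names):
    dropped = set(names)
    return [name for name in column_names if name not in dropped]


columns = ["x", "y", "z"]
remaining = exclude_model(columns, "x")
print(remaining)  # ['y', 'z']
```

For the example above, the remaining names are the ones a select of `y` and `z` would produce, in their original order.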

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def exclude(self, *names: str) -> "DataFrame":
    """Drops columns from the current DataFrame by name.

    This is equivalent of performing a select with all the columns but the ones excluded.

    Args:
        *names (str): names to exclude

    Returns:
        DataFrame: DataFrame with some columns excluded.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df_without_x = df.exclude("x")
        >>> df_without_x.show()
        ╭───────┬───────╮
        │ y     ┆ z     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 5     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 6     ┆ 9     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.exclude(list(names))
    return DataFrame(builder)

explain #

explain(show_all: bool = False, format: str = 'ascii', simple: bool = False, file: IOBase | None = None) -> Any

Prints the (logical and physical) plans that will be executed to produce this DataFrame.

Defaults to showing the unoptimized logical plan. Use show_all=True to show the unoptimized logical plan, the optimized logical plan, and the physical plan.

Parameters:

Name	Type	Description	Default
`show_all`	`bool`	Whether to show the optimized logical plan and the physical plan in addition to the unoptimized logical plan.	`False`
`format`	`str`	The format to print the plan in. one of 'ascii' or 'mermaid'	`'ascii'`
`simple`	`bool`	Whether to only show the type of op for each node in the plan, rather than showing details of how each op is configured.	`False`
`file`	`Optional[IOBase]`	File-like object to print the output to. Defaults to None, in which case output goes to the default destination of print (in Python, that should be sys.stdout)	`None`

Returns:

Type	Description
`Any`	Union[None, str, MermaidFormatter]: - If `format="mermaid"` and running in a notebook, returns a `MermaidFormatter` instance for rich rendering. - If `format="mermaid"` and not in a notebook, returns a string representation of the plan. - Otherwise, prints the plan(s) to the specified file or stdout and returns `None`.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": [1, 2, 3]})
>>>
>>> def double(df, column: str):
...     return df.select((df[column] * df[column]).alias(column))
>>>
>>> df = df.pipe(double, "x")
>>>
>>> df.explain()

== Unoptimized Logical Plan ==
* Project: col(x) * col(x) as x
|
* Source:
|   Number of partitions = 1
|   Output schema = x#Int64
Set `show_all=True` to also see the Optimized and Physical plans. This will run the query optimizer.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def explain(
    self, show_all: bool = False, format: str = "ascii", simple: bool = False, file: io.IOBase | None = None
) -> Any:
    r"""Prints the (logical and physical) plans that will be executed to produce this DataFrame.

    Defaults to showing the unoptimized logical plan. Use `show_all=True` to show the unoptimized logical plan,
    the optimized logical plan, and the physical plan.

    Args:
        show_all (bool): Whether to show the optimized logical plan and the physical plan in addition to the
            unoptimized logical plan.
        format (str): The format to print the plan in. one of 'ascii' or 'mermaid'
        simple (bool): Whether to only show the type of op for each node in the plan, rather than showing details
            of how each op is configured.

        file (Optional[io.IOBase]): Location to print the output to, or defaults to None which defaults to the default location for
            print (in Python, that should be sys.stdout)

    Returns:
        Union[None, str, MermaidFormatter]:
            - If `format="mermaid"` and running in a notebook, returns a `MermaidFormatter` instance for rich rendering.
            - If `format="mermaid"` and not in a notebook, returns a string representation of the plan.
            - Otherwise, prints the plan(s) to the specified file or stdout and returns `None`.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"x": [1, 2, 3]})
        >>>
        >>> def double(df, column: str):
        ...     return df.select((df[column] * df[column]).alias(column))
        >>>
        >>> df = df.pipe(double, "x")
        >>>
        >>> df.explain()
        == Unoptimized Logical Plan ==
        <BLANKLINE>
        * Project: col(x) * col(x) as x
        |
        * Source:
        |   Number of partitions = 1
        |   Output schema = x#Int64
        <BLANKLINE>
        <BLANKLINE>
        <BLANKLINE>
        Set `show_all=True` to also see the Optimized and Physical plans. This will run the query optimizer.

    """
    is_cached = self._result_cache is not None
    if format == "mermaid":
        from daft.dataframe.display import MermaidFormatter
        from daft.utils import in_notebook

        instance = MermaidFormatter(self.__builder, show_all, simple, is_cached)
        if file is not None:
            # if we are printing to a file, we print the markdown representation of the plan
            text = instance._repr_markdown_()
            print(text, file=file)
        if in_notebook():
            # if in a notebook, we return the class instance and let jupyter display it
            return instance
        else:
            # if we are not in a notebook, we return the raw markdown instead of the class instance
            return repr(instance)

    print_to_file = partial(print, file=file)

    if self._result_cache is not None:
        print_to_file("Result is cached and will skip computation\n")
        print_to_file(self._builder.pretty_print(simple, format=format))

        print_to_file("However here is the logical plan used to produce this result:\n", file=file)

    builder = self.__builder
    print_to_file("== Unoptimized Logical Plan ==\n")
    print_to_file(builder.pretty_print(simple, format=format))
    if show_all:
        print_to_file("\n== Optimized Logical Plan ==\n")
        execution_config = get_context().daft_execution_config
        builder = builder.optimize(execution_config)
        print_to_file(builder.pretty_print(simple))
        print_to_file("\n== Physical Plan ==\n")
        if get_or_create_runner().name != "native":
            from daft.daft import DistributedPhysicalPlan

            distributed_plan = DistributedPhysicalPlan.from_logical_plan_builder(
                builder._builder, "<tmp>", execution_config
            )
            if format == "ascii":
                print_to_file(distributed_plan.repr_ascii(simple))
            elif format == "mermaid":
                print_to_file(distributed_plan.repr_mermaid(MermaidOptions(simple)))
        else:
            native_executor = NativeExecutor()
            print_to_file(
                native_executor.pretty_print(builder, get_context().daft_execution_config, simple, format=format)
            )
    else:
        print_to_file(
            "\n \nSet `show_all=True` to also see the Optimized and Physical plans. This will run the query optimizer.",
        )
    return None

explode #

explode(*columns: ColumnInputType, index_column: ColumnInputType | None = None, ignore_empty_and_null: bool = False) -> DataFrame

Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.

If multiple columns are specified, each row must contain the same number of items in each specified column.

By default, exploding Null values or empty lists will create a single Null entry (see example below). Set ignore_empty_and_null=True to drop these rows instead.

Parameters:

Name	Type	Description	Default
`*columns`	`ColumnInputType`	columns to explode	`()`
`index_column`	`ColumnInputType \| None`	optional name for an index column that tracks the position of each element within its original list	`None`
`ignore_empty_and_null`	`bool`	If True, drops rows where the list is empty or null. If False (default), empty lists and null values each produce a single row with a null value.	`False`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with exploded column

Examples:

>>> import daft
>>> df = daft.from_pydict(
...     {
...         "x": [[1], [2, 3]],
...         "y": [["a"], ["b", "c"]],
...         "z": [
...             [1.0],
...             [2.0, 2.0],
...         ],
...     }
... )
>>> df.collect()

╭─────────────┬──────────────┬───────────────╮
│ x           ┆ y            ┆ z             │
│ ---         ┆ ---          ┆ ---           │
│ List[Int64] ┆ List[String] ┆ List[Float64] │
╞═════════════╪══════════════╪═══════════════╡
│ [1]         ┆ [a]          ┆ [1]           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [2, 3]      ┆ [b, c]       ┆ [2, 2]        │
╰─────────────┴──────────────┴───────────────╯
(Showing first 2 of 2 rows)

>>> df.explode(df["x"], df["y"]).collect()

╭───────┬────────┬───────────────╮
│ x     ┆ y      ┆ z             │
│ ---   ┆ ---    ┆ ---           │
│ Int64 ┆ String ┆ List[Float64] │
╞═══════╪════════╪═══════════════╡
│ 1     ┆ a      ┆ [1]           │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ b      ┆ [2, 2]        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ c      ┆ [2, 2]        │
╰───────┴────────┴───────────────╯
(Showing first 3 of 3 rows)

Example with Null values and empty lists:

>>> df2 = daft.from_pydict(
...     {"id": [1, 2, 3, 4], "values": [[1, 2], [], None, [3]], "labels": [["a", "b"], [], None, ["c"]]}
... )
>>> df2.collect()

╭───────┬─────────────┬──────────────╮
│ id    ┆ values      ┆ labels       │
│ ---   ┆ ---         ┆ ---          │
│ Int64 ┆ List[Int64] ┆ List[String] │
╞═══════╪═════════════╪══════════════╡
│ 1     ┆ [1, 2]      ┆ [a, b]       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ []          ┆ []           │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ None        ┆ None         │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4     ┆ [3]         ┆ [c]          │
╰───────┴─────────────┴──────────────╯
(Showing first 4 of 4 rows)

>>> df2.explode(df2["values"], df2["labels"]).collect()

╭───────┬────────┬────────╮
│ id    ┆ values ┆ labels │
│ ---   ┆ ---    ┆ ---    │
│ Int64 ┆ Int64  ┆ String │
╞═══════╪════════╪════════╡
│ 1     ┆ 1      ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ 2      ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ None   ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ None   ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4     ┆ 3      ┆ c      │
╰───────┴────────┴────────╯
(Showing first 5 of 5 rows)

Example with ignore_empty_and_null=True:

>>> df2.explode(df2["values"], df2["labels"], ignore_empty_and_null=True).collect()

╭───────┬────────┬────────╮
│ id    ┆ values ┆ labels │
│ ---   ┆ ---    ┆ ---    │
│ Int64 ┆ Int64  ┆ String │
╞═══════╪════════╪════════╡
│ 1     ┆ 1      ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ 2      ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4     ┆ 3      ┆ c      │
╰───────┴────────┴────────╯
(Showing first 3 of 3 rows)

Example with index_column to track element positions:

>>> df3 = daft.from_pydict({"a": [[1, 2], [3, 4, 3]]})
>>> df3.explode("a", index_column="idx").collect()

╭───────┬────────╮
│ a     ┆ idx    │
│ ---   ┆ ---    │
│ Int64 ┆ UInt64 │
╞═══════╪════════╡
│ 1     ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4     ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ 2      │
╰───────┴────────╯
(Showing first 5 of 5 rows)
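
The null, empty-list, and index handling shown in the examples above can be modeled in plain Python. This is a minimal sketch of the documented semantics, not Daft's implementation: `explode_rows` is a hypothetical helper that works on a list of row dictionaries, and emitting `None` in the index column for empty or null lists is an assumption that the documentation above does not confirm.

```python
from typing import Any, Optional


def explode_rows(
    rows: list,
    columns: list,
    index_column: Optional[str] = None,
    ignore_empty_and_null: bool = False,
) -> list:
    """Plain-Python model of explode over row dictionaries (hypothetical helper)."""
    out = []
    for row in rows:
        # A null list is treated like an empty one when comparing lengths.
        lengths = {0 if row[c] is None else len(row[c]) for c in columns}
        if len(lengths) != 1:
            raise ValueError("exploded columns must hold the same number of items in each row")
        n = lengths.pop()

        if n == 0:
            if ignore_empty_and_null:
                continue  # drop the row entirely
            new_row: dict = dict(row)
            for c in columns:
                new_row[c] = None  # a single row with null values
            if index_column is not None:
                new_row[index_column] = None  # assumption, not confirmed by the docs
            out.append(new_row)
            continue

        for i in range(n):
            new_row = dict(row)  # non-exploded columns are duplicated
            for c in columns:
                new_row[c] = row[c][i]
            if index_column is not None:
                new_row[index_column] = i  # position within the original list
            out.append(new_row)
    return out
```

Calling `explode_rows(rows, ["values", "labels"])` on rows equivalent to `df2` yields five rows, and passing `ignore_empty_and_null=True` yields three, which mirrors the row counts in the examples above.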

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def explode(
    self,
    *columns: ColumnInputType,
    index_column: ColumnInputType | None = None,
    ignore_empty_and_null: bool = False,
) -> "DataFrame":
    """Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.

    If multiple columns are specified, each row must contain the same number of items in each specified column.

    By default, exploding Null values or empty lists will create a single Null entry (see example below).
    Set ``ignore_empty_and_null=True`` to drop these rows instead.

    Args:
        *columns (ColumnInputType): columns to explode
        index_column (ColumnInputType | None): optional name for an index column that tracks the position of each element within its original list
        ignore_empty_and_null (bool): If True, drops rows where the list is empty or null.
            If False (default), empty lists and null values each produce a single row with a null value.

    Returns:
        DataFrame: DataFrame with exploded column

    Examples:
        >>> import daft
        >>> df = daft.from_pydict(
        ...     {
        ...         "x": [[1], [2, 3]],
        ...         "y": [["a"], ["b", "c"]],
        ...         "z": [
        ...             [1.0],
        ...             [2.0, 2.0],
        ...         ],
        ...     }
        ... )
        >>> df.collect()
        ╭─────────────┬──────────────┬───────────────╮
        │ x           ┆ y            ┆ z             │
        │ ---         ┆ ---          ┆ ---           │
        │ List[Int64] ┆ List[String] ┆ List[Float64] │
        ╞═════════════╪══════════════╪═══════════════╡
        │ [1]         ┆ [a]          ┆ [1]           │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ [2, 3]      ┆ [b, c]       ┆ [2, 2]        │
        ╰─────────────┴──────────────┴───────────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
        >>> df.explode(df["x"], df["y"]).collect()
        ╭───────┬────────┬───────────────╮
        │ x     ┆ y      ┆ z             │
        │ ---   ┆ ---    ┆ ---           │
        │ Int64 ┆ String ┆ List[Float64] │
        ╞═══════╪════════╪═══════════════╡
        │ 1     ┆ a      ┆ [1]           │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2     ┆ b      ┆ [2, 2]        │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 3     ┆ c      ┆ [2, 2]        │
        ╰───────┴────────┴───────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        Example with Null values and empty lists:

        >>> df2 = daft.from_pydict(
        ...     {"id": [1, 2, 3, 4], "values": [[1, 2], [], None, [3]], "labels": [["a", "b"], [], None, ["c"]]}
        ... )
        >>> df2.collect()
        ╭───────┬─────────────┬──────────────╮
        │ id    ┆ values      ┆ labels       │
        │ ---   ┆ ---         ┆ ---          │
        │ Int64 ┆ List[Int64] ┆ List[String] │
        ╞═══════╪═════════════╪══════════════╡
        │ 1     ┆ [1, 2]      ┆ [a, b]       │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 2     ┆ []          ┆ []           │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 3     ┆ None        ┆ None         │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ 4     ┆ [3]         ┆ [c]          │
        ╰───────┴─────────────┴──────────────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
        >>> df2.explode(df2["values"], df2["labels"]).collect()
        ╭───────┬────────┬────────╮
        │ id    ┆ values ┆ labels │
        │ ---   ┆ ---    ┆ ---    │
        │ Int64 ┆ Int64  ┆ String │
        ╞═══════╪════════╪════════╡
        │ 1     ┆ 1      ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 1     ┆ 2      ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ None   ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ None   ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 4     ┆ 3      ┆ c      │
        ╰───────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

        Example with ignore_empty_and_null=True:

        >>> df2.explode(df2["values"], df2["labels"], ignore_empty_and_null=True).collect()
        ╭───────┬────────┬────────╮
        │ id    ┆ values ┆ labels │
        │ ---   ┆ ---    ┆ ---    │
        │ Int64 ┆ Int64  ┆ String │
        ╞═══════╪════════╪════════╡
        │ 1     ┆ 1      ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 1     ┆ 2      ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 4     ┆ 3      ┆ c      │
        ╰───────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        Example with index_column to track element positions:

        >>> df3 = daft.from_pydict({"a": [[1, 2], [3, 4, 3]]})
        >>> df3.explode("a", index_column="idx").collect()
        ╭───────┬────────╮
        │ a     ┆ idx    │
        │ ---   ┆ ---    │
        │ Int64 ┆ UInt64 │
        ╞═══════╪════════╡
        │ 1     ┆ 0      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ 1      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ 0      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 4     ┆ 1      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ 2      │
        ╰───────┴────────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

    """
    parsed_exprs = column_inputs_to_expressions(columns)
    index_col_name = column_input_to_expression(index_column).name() if index_column is not None else None
    builder = self._builder.explode(
        parsed_exprs, ignore_empty_and_null=ignore_empty_and_null, index_column=index_col_name
    )
    return DataFrame(builder)

filter #

filter(predicate: Expression | str) -> DataFrame

Filters rows via a predicate expression, similar to SQL WHERE.

Alias for daft.DataFrame.where.

Parameters:

Name	Type	Description	Default
`predicate`	`Expression \| str`	expression that keeps a row if it evaluates to True.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Filtered DataFrame.

groupby #

groupby(*group_by: ManyColumnsInputType) -> GroupedDataFrame

Performs a GroupBy on the DataFrame for aggregation.

Parameters:

Name	Type	Description	Default
`*group_by`	`Union[str, Expression]`	columns to group by	`()`

Returns:

Name	Type	Description
`GroupedDataFrame`	`GroupedDataFrame`	DataFrame to Aggregate

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict(
...     {
...         "pet": ["cat", "dog", "dog", "cat"],
...         "age": [1, 2, 3, 4],
...         "name": ["Alex", "Jordan", "Sam", "Riley"],
...     }
... )
>>> grouped_df = df.groupby("pet").agg(
...     df["age"].min().alias("min_age"),
...     df["age"].max().alias("max_age"),
...     df["pet"].count().alias("count"),
...     df["name"].any_value(),
... )
>>> grouped_df = grouped_df.sort("pet")
>>> grouped_df.show()

╭────────┬─────────┬─────────┬────────┬────────╮
│ pet    ┆ min_age ┆ max_age ┆ count  ┆ name   │
│ ---    ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
│ String ┆ Int64   ┆ Int64   ┆ UInt64 ┆ String │
╞════════╪═════════╪═════════╪════════╪════════╡
│ cat    ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ dog    ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
╰────────┴─────────┴─────────┴────────┴────────╯
(Showing first 2 of 2 rows)
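
For readers who want to see what the grouped aggregation above computes, here is a minimal plain-Python sketch of the same min/max/count logic. It is not how Daft executes the query; `groupby_agg` is a hypothetical helper, and it assumes the value column contains no nulls.

```python
from collections import defaultdict


def groupby_agg(rows: list, key: str, value: str) -> dict:
    """Group row dictionaries by `key` and compute min, max, and count of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # Sort by group key so the output order is deterministic, like the sort("pet") above.
    return {
        k: {"min": min(vals), "max": max(vals), "count": len(vals)}
        for k, vals in sorted(groups.items())
    }


pets = [
    {"pet": "cat", "age": 1},
    {"pet": "dog", "age": 2},
    {"pet": "dog", "age": 3},
    {"pet": "cat", "age": 4},
]
summary = groupby_agg(pets, "pet", "age")
```

Here `summary["cat"]` holds a minimum of 1, a maximum of 4, and a count of 2, matching the `cat` row in the table above.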

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def groupby(self, *group_by: ManyColumnsInputType) -> "GroupedDataFrame":
    """Performs a GroupBy on the DataFrame for aggregation.

    Args:
        *group_by (Union[str, Expression]): columns to group by

    Returns:
        GroupedDataFrame: DataFrame to Aggregate

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict(
        ...     {
        ...         "pet": ["cat", "dog", "dog", "cat"],
        ...         "age": [1, 2, 3, 4],
        ...         "name": ["Alex", "Jordan", "Sam", "Riley"],
        ...     }
        ... )
        >>> grouped_df = df.groupby("pet").agg(
        ...     df["age"].min().alias("min_age"),
        ...     df["age"].max().alias("max_age"),
        ...     df["pet"].count().alias("count"),
        ...     df["name"].any_value(),
        ... )
        >>> grouped_df = grouped_df.sort("pet")
        >>> grouped_df.show()
        ╭────────┬─────────┬─────────┬────────┬────────╮
        │ pet    ┆ min_age ┆ max_age ┆ count  ┆ name   │
        │ ---    ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
        │ String ┆ Int64   ┆ Int64   ┆ UInt64 ┆ String │
        ╞════════╪═════════╪═════════╪════════╪════════╡
        │ cat    ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ dog    ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
        ╰────────┴─────────┴─────────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    return GroupedDataFrame(self, ExpressionsProjection(self._wildcard_inputs_to_expressions(group_by)))

intersect #

intersect(other: DataFrame) -> DataFrame

Returns the intersection of two DataFrames.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	DataFrame to intersect with	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with the intersection of the two DataFrames

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
>>> df = df1.intersect(df2)
>>> df = df.sort("a")
>>> df.show()

╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 2 of 2 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def intersect(self, other: "DataFrame") -> "DataFrame":
    """Returns the intersection of two DataFrames.

    Args:
        other (DataFrame): DataFrame to intersect with

    Returns:
        DataFrame: DataFrame with the intersection of the two DataFrames

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 2, 3], "b": [4, 8, 6]})
        >>> df = df1.intersect(df2)
        >>> df = df.sort("a")
        >>> df.show()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    builder = self._builder.intersect(other._builder)
    return DataFrame(builder)

intersect_all #

intersect_all(other: DataFrame) -> DataFrame

Returns the intersection of two DataFrames, including duplicates.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	DataFrame to intersect with	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with the intersection of the two DataFrames, including duplicates

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
>>> df2 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
>>> df1.intersect_all(df2).sort("a").collect()

╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
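
The difference between `intersect` and `intersect_all` can be sketched in plain Python. The sketch assumes the usual SQL `INTERSECT` / `INTERSECT ALL` semantics (distinct rows versus per-row minimum of the two counts), which is consistent with the examples above but is not spelled out by them. The function names mirror the DataFrame methods for readability only; they are hypothetical helpers over lists of tuples, and null handling is not modeled.

```python
from collections import Counter


def intersect(left: list, right: list) -> list:
    """Distinct rows of `left` that also appear in `right`."""
    right_set = set(right)
    seen = set()
    out = []
    for row in left:
        if row in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out


def intersect_all(left: list, right: list) -> list:
    """Rows of `left` matched one-for-one against `right`, keeping duplicates."""
    remaining = Counter(right)  # how many copies of each row are still unmatched
    out = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1
            out.append(row)
    return out
```

With the `intersect_all` example data, `(2, 6)` appears twice on both sides and is therefore kept twice, while `(1, 4)` appears once on the left and is kept once.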

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def intersect_all(self, other: "DataFrame") -> "DataFrame":
    """Returns the intersection of two DataFrames, including duplicates.

    Args:
        other (DataFrame): DataFrame to intersect with

    Returns:
        DataFrame: DataFrame with the intersection of the two DataFrames, including duplicates

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"a": [1, 2, 2], "b": [4, 6, 6]})
        >>> df2 = daft.from_pydict({"a": [1, 1, 2, 2], "b": [4, 4, 6, 6]})
        >>> df1.intersect_all(df2).sort("a").collect()
        ╭───────┬───────╮
        │ a     ┆ b     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

    """
    builder = self._builder.intersect_all(other._builder)
    return DataFrame(builder)

into_batches #

into_batches(batch_size: int) -> DataFrame

Splits or coalesces DataFrame to partitions of size batch_size.

Note

Batch sizing is performed on a best-effort basis. The heuristic is to emit a batch when we have enough rows to fill batch_size * 0.8 rows. This approach prioritizes processing efficiency over uniform batch sizes, especially when using the Ray Runner, as batches can be distributed over the cluster. The exception to this is that the last batch will be the remainder of the total number of rows in the DataFrame.

Parameters:

Name	Type	Description	Default
`batch_size`	`int`	number of target rows per partition.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Dataframe with `batch_size` rows per partition.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
>>> df = df.into_batches(2)
>>> for i, block in enumerate(df.to_arrow_iter()):
...     assert len(block) == 2, f"Expected batch size 2, got {len(block)}"
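
To make the best-effort note above more concrete, the following plain-Python sketch models one possible reading of the heuristic. It is an assumption-laden illustration, not Daft's implementation: `rebatch` is a hypothetical helper, and both the rounding of the `batch_size * 0.8` threshold and the cap of `batch_size` rows per emitted batch are assumptions.

```python
import math
from typing import Iterable, Iterator


def rebatch(chunks: Iterable[list], batch_size: int) -> Iterator[list]:
    """Re-chunk incoming lists of rows into batches of roughly `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")
    threshold = math.ceil(batch_size * 0.8)  # assumed rounding of the 0.8 heuristic
    buffer: list = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit as soon as enough rows have accumulated, never more than batch_size at once.
        while len(buffer) >= threshold:
            take = min(len(buffer), batch_size)
            yield buffer[:take]
            buffer = buffer[take:]
    if buffer:
        yield buffer  # the last batch is whatever remains


sizes_small = [len(b) for b in rebatch([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]], 2)]
sizes_uneven = [len(b) for b in rebatch([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]], 5)]
```

Under these assumptions `sizes_small` is `[2, 2, 2, 2, 2]`, while `sizes_uneven` is `[4, 4, 2]`: batches are emitted once 80% of the target is available, so they are not guaranteed to be uniform.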

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def into_batches(self, batch_size: int) -> "DataFrame":
    """Splits or coalesces DataFrame to partitions of size ``batch_size``.

    Note:
        Batch sizing is performed on a best-effort basis.
        The heuristic is to emit a batch when we have enough rows to fill `batch_size * 0.8` rows.
        This approach prioritizes processing efficiency over uniform batch sizes, especially when using the Ray Runner, as batches can be distributed over the cluster.
        The exception to this is that the last batch will be the remainder of the total number of rows in the DataFrame.

    Args:
        batch_size (int): number of target rows per partition.

    Returns:
        DataFrame: Dataframe with `batch_size` rows per partition.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
        >>> df = df.into_batches(2)
        >>> for i, block in enumerate(df.to_arrow_iter()):
        ...     assert len(block) == 2, f"Expected batch size 2, got {len(block)}"
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")

    builder = self._builder.into_batches(batch_size)
    return DataFrame(builder)

into_partitions #

into_partitions(num: int) -> DataFrame

Splits or coalesces DataFrame to num partitions. Order is preserved.

This naively and greedily splits partitions in a round-robin fashion to reach the target number of partitions. The number of rows or the size of a given partition is not taken into account during the splitting.

Parameters:

Name	Type	Description	Default
`num`	`int`	number of target partitions.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Dataframe with `num` partitions.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def into_partitions(self, num: int) -> "DataFrame":
    """Splits or coalesces DataFrame to ``num`` partitions. Order is preserved.

    This naively and greedily splits partitions in a round-robin fashion to reach the target number of partitions.
    The number of rows or the size of a given partition is not taken into account during the splitting.

    Args:
        num (int): number of target partitions.

    Returns:
        DataFrame: Dataframe with `num` partitions.
    """
    if get_or_create_runner().name == "native":
        warnings.warn(
            "DataFrame.into_partitions not supported on the NativeRunner. This will be a no-op. Please use the RayRunner via `daft.set_runner_ray()` instead if you need to repartition."
        )

    builder = self._builder.into_partitions(num)
    return DataFrame(builder)

iter_partitions #

iter_partitions(results_buffer_size: int | None | Literal['num_cpus'] = 'num_cpus') -> Iterator[Union[MicroPartition, ObjectRef]]

Begin executing this dataframe and return an iterator over the partitions.

Each partition will be returned as a MicroPartition object (if using Python runner backend) or a ray ObjectRef (if using Ray runner backend).

Parameters:

Name	Type	Description	Default
`results_buffer_size`	`int \| None \| Literal['num_cpus']`	how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).	`'num_cpus'`

A quick note on configuring asynchronous/parallel execution using results_buffer_size.

The results_buffer_size kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

* Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput.
* Decreasing this value means the iterator will consume less memory and fewer CPU resources but have lower throughput.
* Setting this value to None means the iterator will consume as many resources as it deems appropriate per iteration.

The default value is the total number of CPUs available on the current machine.
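
The back-pressure behaviour described in this note can be illustrated with a bounded queue from the standard library. This is a conceptual sketch only, not how Daft's runners are implemented: `buffered_iter` and `source` are hypothetical names, and mapping `None` to an unbounded queue is an assumption made for the illustration.

```python
import queue
import threading
from typing import Callable, Iterator, Optional

_DONE = object()  # sentinel marking the end of the stream


def buffered_iter(produce: Callable[[], Iterator], buffer_size: Optional[int]) -> Iterator:
    """Run `produce()` in a background thread, buffering at most `buffer_size` results."""
    # queue.Queue treats maxsize=0 as unbounded, used here when buffer_size is None.
    q: queue.Queue = queue.Queue(maxsize=buffer_size or 0)

    def worker() -> None:
        for item in produce():
            q.put(item)  # blocks while the buffer is full, pausing further work
        q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()  # consuming an item frees a slot in the buffer
        if item is _DONE:
            return
        yield item


produced = []


def source() -> Iterator[int]:
    for i in range(100):
        produced.append(i)  # record how far ahead the producer has run
        yield i


it = buffered_iter(source, buffer_size=2)
first = next(it)
```

After the first item is consumed, the producer can have run at most a few items ahead (one consumed, two buffered, one waiting to be enqueued), no matter how fast it is. A larger buffer lets it run further ahead at the cost of memory.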

Returns:

Type	Description
`Iterator[Union[MicroPartition, ObjectRef]]`	An iterator over the partitions of the DataFrame. Each partition is a MicroPartition object (if using Python runner backend) or a ray ObjectRef (if using Ray runner backend).

Examples:

>>> import daft
>>>
>>> daft.set_runner_ray()
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(2)
>>> for part in df.iter_partitions():
...     print(part)

MicroPartition with 3 rows:
TableState: Loaded. 1 tables
╭───────┬────────╮
│ foo   ┆ bar    │
│ ---   ┆ ---    │
│ Int64 ┆ String │
╞═══════╪════════╡
│ 1     ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ c      │
╰───────┴────────╯
Statistics: missing

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def iter_partitions(
    self, results_buffer_size: int | None | Literal["num_cpus"] = "num_cpus"
) -> Iterator[Union[MicroPartition, "ray.ObjectRef"]]:
    """Begin executing this dataframe and return an iterator over the partitions.

    Each partition will be returned as a MicroPartition object (if using Python runner backend)
    or a ray ObjectRef (if using Ray runner backend).

    Args:
        results_buffer_size: how many partitions to allow in the results buffer (defaults to the total number of CPUs
            available on the machine).

    Note: A quick note on configuring asynchronous/parallel execution using `results_buffer_size`.
        The `results_buffer_size` kwarg controls how many results Daft will allow to be in the buffer while iterating.
        Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

        * Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput.
        * Decreasing this value means the iterator will consume less memory and fewer CPU resources but have lower throughput.
        * Setting this value to `None` means the iterator will consume as many resources as it deems appropriate per iteration.

        The default value is the total number of CPUs available on the current machine.

    Returns:
        Iterator[Union[MicroPartition, ray.ObjectRef]]: An iterator over the partitions of the DataFrame.
        Each partition is a MicroPartition object (if using Python runner backend) or a ray ObjectRef
        (if using Ray runner backend).

    Examples:
        >>> import daft
        >>>
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>>
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(2)
        >>> for part in df.iter_partitions():
        ...     print(part)  # doctest: +SKIP
        MicroPartition with 3 rows:
        TableState: Loaded. 1 tables
        ╭───────┬────────╮
        │ foo   ┆ bar    │
        │ ---   ┆ ---    │
        │ Int64 ┆ String │
        ╞═══════╪════════╡
        │ 1     ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 3     ┆ c      │
        ╰───────┴────────╯
        <BLANKLINE>
        <BLANKLINE>
        Statistics: missing
    """
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()
    elif results_buffer_size is not None and not results_buffer_size > 0:
        raise ValueError(f"Provided `results_buffer_size` value must be > 0, received: {results_buffer_size}")

    results = self._result
    if results is not None:
        # If the dataframe has already finished executing,
        # use the precomputed results.
        for mat_result in results.values():
            yield mat_result.partition()

    else:
        # Execute the dataframe in a streaming fashion.
        results_iter: Iterator[MaterializedResult[Any]] = get_or_create_runner().run_iter(
            self._builder, results_buffer_size=results_buffer_size
        )
        for result in results_iter:
            yield result.partition()

iter_rows #

iter_rows(results_buffer_size: int | None | Literal['num_cpus'] = 'num_cpus', column_format: Literal['python', 'arrow'] = 'python') -> Iterator[dict[str, Any]]

Return an iterator of rows for this dataframe.

Each row will be a Python dictionary of the form { "key" : value, ...}. If you are instead looking to iterate over entire partitions of data, see df.iter_partitions().

By default, Daft will convert the columns to Python lists for easy consumption. Datatypes with Python equivalents will be converted accordingly, e.g. timestamps to datetime, tensors to numpy arrays. For nested data such as List or Struct arrays, however, this can be expensive. You may wish to set column_format to "arrow" such that the nested data is returned as Arrow scalars.

Parameters:

Name	Type	Description	Default
`results_buffer_size`	`int \| None \| Literal['num_cpus']`	how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).	`'num_cpus'`
`column_format`	`Literal['python', 'arrow']`	the format of the columns to iterate over. One of "python" or "arrow". Defaults to "python".	`'python'`

A quick note on configuring asynchronous/parallel execution using results_buffer_size.

The results_buffer_size kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

* Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput.
* Decreasing this value means the iterator will consume less memory and fewer CPU resources but have lower throughput.
* Setting this value to None means the iterator will consume as many resources as it deems appropriate per iteration.

The default value is the total number of CPUs available on the current machine.

Returns:

Type	Description
`Iterator[dict[str, Any]]`	An iterator over the rows of the DataFrame, where each row is a dictionary mapping column names to values.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
>>> for row in df.iter_rows():
...     print(row)

{'foo': 1, 'bar': 'a'}
{'foo': 2, 'bar': 'b'}
{'foo': 3, 'bar': 'c'}

Tip

See also df.iter_partitions(): iterator over entire partitions instead of single rows

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def iter_rows(
    self,
    results_buffer_size: int | None | Literal["num_cpus"] = "num_cpus",
    column_format: Literal["python", "arrow"] = "python",
) -> Iterator[dict[str, Any]]:
    """Return an iterator of rows for this dataframe.

    Each row will be a Python dictionary of the form `{ "key" : value, ...}`. If you are instead looking to iterate over
    entire partitions of data, see [`df.iter_partitions()`][daft.DataFrame.iter_partitions].

    By default, Daft will convert the columns to Python lists for easy consumption. Datatypes with Python equivalents will be converted accordingly, e.g. timestamps to datetime, tensors to numpy arrays.
    For nested data such as List or Struct arrays, however, this can be expensive. You may wish to set `column_format` to "arrow" such that the nested data is returned as Arrow scalars.

    Args:
        results_buffer_size: how many partitions to allow in the results buffer (defaults to the total number of CPUs
            available on the machine).
        column_format: the format of the columns to iterate over. One of "python" or "arrow". Defaults to "python".

    Note: A quick note on configuring asynchronous/parallel execution using `results_buffer_size`.
        The `results_buffer_size` kwarg controls how many results Daft will allow to be in the buffer while iterating.
        Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

        * Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput.
        * Decreasing this value means the iterator will consume less memory and fewer CPU resources but have lower throughput.
        * Setting this value to `None` means the iterator will consume as many resources as it deems appropriate per iteration.

        The default value is the total number of CPUs available on the current machine.

    Returns:
        Iterator[dict[str, Any]]: An iterator over the rows of the DataFrame, where each row is a dictionary
        mapping column names to values.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
        >>> for row in df.iter_rows():
        ...     print(row)
        {'foo': 1, 'bar': 'a'}
        {'foo': 2, 'bar': 'b'}
        {'foo': 3, 'bar': 'c'}

    Tip:
        See also [`df.iter_partitions()`][daft.DataFrame.iter_partitions]: iterator over entire partitions instead of single rows
    """
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()

    def arrow_iter_rows(table: "pyarrow.Table") -> Iterator[dict[str, Any]]:
        columns = table.columns
        for i in range(len(table)):
            row = {col._name: col[i] for col in columns}
            yield row

    def python_iter_rows(pydict: dict[str, list[Any]], num_rows: int) -> Iterator[dict[str, Any]]:
        for i in range(num_rows):
            row = {key: value[i] for (key, value) in pydict.items()}
            yield row

    if self._result is not None:
        # If the dataframe has already finished executing,
        # use the precomputed results.
        if column_format == "python":
            yield from python_iter_rows(self.to_pydict(), len(self))
        elif column_format == "arrow":
            yield from arrow_iter_rows(self.to_arrow())
        else:
            raise ValueError(
                f"Unsupported column_format: {column_format}, supported formats are 'python' and 'arrow'"
            )
    else:
        # Execute the dataframe in a streaming fashion.
        partitions_iter = get_or_create_runner().run_iter_tables(
            self._builder, results_buffer_size=results_buffer_size
        )

        # Iterate through partitions.
        for partition in partitions_iter:
            if column_format == "python":
                yield from python_iter_rows(partition.to_pydict(), len(partition))
            elif column_format == "arrow":
                yield from arrow_iter_rows(partition.to_arrow())
            else:
                raise ValueError(
                    f"Unsupported column_format: {column_format}, supported formats are 'python' and 'arrow'"
                )
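
The `results_buffer_size` behaviour described in the docstring above can be pictured as a bounded queue sitting between a producer and the row iterator. The following is a minimal standard-library sketch, not Daft's runner: `buffered_rows` and its plain-dict partitions are hypothetical stand-ins, and mapping `None` to an unbounded queue is an assumption made purely for illustration.

```python
import queue
import threading
from typing import Any, Iterator, Optional


def buffered_rows(
    partitions: list[dict[str, list[Any]]],
    buffer_size: Optional[int],
) -> Iterator[dict[str, Any]]:
    """Yield row dicts from columnar partitions through a bounded buffer."""
    # queue.Queue treats maxsize <= 0 as unbounded; None is mapped to that here.
    buf: queue.Queue = queue.Queue(maxsize=buffer_size or 0)
    done = object()  # sentinel marking the end of the stream

    def produce() -> None:
        for part in partitions:
            buf.put(part)  # blocks while the buffer is full (backpressure)
        buf.put(done)

    threading.Thread(target=produce, daemon=True).start()

    while True:
        part = buf.get()  # consuming frees a slot, letting the producer continue
        if part is done:
            return
        num_rows = len(next(iter(part.values()), []))
        for i in range(num_rows):
            yield {name: values[i] for name, values in part.items()}


parts = [{"foo": [1, 2], "bar": ["a", "b"]}, {"foo": [3], "bar": ["c"]}]
rows = list(buffered_rows(parts, buffer_size=1))
```

With `buffer_size=1` the producer can stay at most one partition ahead of the consumer, which is the memory-versus-throughput trade-off the note describes.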

join #

join(other: DataFrame, on: list[ColumnInputType] | ColumnInputType | None = None, left_on: list[ColumnInputType] | ColumnInputType | None = None, right_on: list[ColumnInputType] | ColumnInputType | None = None, how: Literal['inner', 'left', 'right', 'outer', 'anti', 'semi', 'cross'] = 'inner', strategy: Literal['hash', 'sort_merge', 'broadcast'] | None = None, prefix: str | None = None, suffix: str | None = None) -> DataFrame

Column-wise join of the current DataFrame with another DataFrame (`other`), similar to a SQL JOIN.

If the two DataFrames have duplicate non-join key column names, "right." will be prepended to the conflicting right columns. You can change the behavior by passing either (or both) prefix or suffix to the function. If prefix is passed, it will be prepended to the conflicting right columns. If suffix is passed, it will be appended to the conflicting right columns.
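
The renaming rule can be sketched in plain Python. This is an illustration only, not Daft's implementation: `joined_column_names` is a hypothetical helper, and its defaults simply follow the documented values (`"right."` for the prefix and `""` for the suffix). How Daft combines a user-supplied suffix with the default prefix is not spelled out here, so the sketch takes both values explicitly.

```python
def joined_column_names(
    left_cols: list[str],
    right_cols: list[str],
    join_keys: list[str],
    prefix: str = "right.",
    suffix: str = "",
) -> list[str]:
    """Compute output column names for a join on identically named keys."""
    names = list(left_cols)
    taken = set(left_cols)
    for name in right_cols:
        if name in join_keys:
            continue  # a shared join key appears only once in the output
        if name in taken:
            # Conflicting non-key column: apply prefix and suffix.
            name = f"{prefix}{name}{suffix}"
        names.append(name)
        taken.add(name)
    return names


default_names = joined_column_names(["a", "b"], ["a", "b"], ["a"])
suffixed_names = joined_column_names(["a", "b"], ["a", "b"], ["a"], prefix="", suffix="_r")
```

Under these assumptions `default_names` is `["a", "b", "right.b"]`, matching the column header in the example for this method.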

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	the right DataFrame to join on.	required
`on`	`Optional[Union[List[ColumnInputType], ColumnInputType]]`	key or keys to join on (use when the key names match on the left and right sides). Defaults to None.	`None`
`left_on`	`Optional[Union[List[ColumnInputType], ColumnInputType]]`	key or keys to join on left DataFrame. Defaults to None.	`None`
`right_on`	`Optional[Union[List[ColumnInputType], ColumnInputType]]`	key or keys to join on right DataFrame. Defaults to None.	`None`
`how`	`str`	what type of join to perform; currently "inner", "left", "right", "outer", "anti", "semi", and "cross" are supported. Defaults to "inner".	`'inner'`
`strategy`	`Optional[str]`	The join strategy (algorithm) to use; currently "hash", "sort_merge", "broadcast", and None are supported, where None chooses the join strategy automatically during query optimization. The default is None.	`None`
`suffix`	`Optional[str]`	Suffix to add to the column names in case of a name collision. Defaults to "".	`None`
`prefix`	`Optional[str]`	Prefix to add to the column names in case of a name collision. Defaults to "right.".	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Joined DataFrame.

Raises:

Type	Description
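`ValueError`	if `on` is passed in and `left_on` or `right_on` is not None.
`ValueError`	if `on` is None and either `left_on` or `right_on` is None.
`ValueError`	if `how` is "cross" and any of `on`, `left_on`, `right_on`, or `strategy` is set.
`ValueError`	if `strategy` is "sort_merge" and `how` is not "inner", or `strategy` is "broadcast" and `how` is "outer".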

Note

Although self joins are supported, we currently duplicate the logical plan for the right side and recompute the entire tree. Caching for this is on the roadmap.

Examples:

>>> import daft
>>> from daft import col
>>> df1 = daft.from_pydict({"a": ["w", "x", "y"], "b": [1, 2, 3]})
>>> df2 = daft.from_pydict({"a": ["x", "y", "z"], "b": [20, 30, 40]})
>>> joined_df = df1.join(df2, left_on=df1["a"], right_on=df2["a"])
>>> joined_df.show()

╭────────┬───────┬─────────╮
│ a      ┆ b     ┆ right.b │
│ ---    ┆ ---   ┆ ---     │
│ String ┆ Int64 ┆ Int64   │
╞════════╪═══════╪═════════╡
│ x      ┆ 2     ┆ 20      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ y      ┆ 3     ┆ 30      │
╰────────┴───────┴─────────╯
(Showing first 2 of 2 rows)
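
To make the `how` options concrete, here is a small pure-Python sketch of a single-key hash join over lists of row dicts. It is not Daft's implementation: `hash_join` is a hypothetical helper, it covers only "inner", "left", "semi", and "anti", it hard-codes the default "right." prefix, and its output order is an artifact of the sketch rather than a guarantee Daft makes.

```python
from collections import defaultdict
from typing import Any

Row = dict[str, Any]


def hash_join(left: list[Row], right: list[Row], key: str, how: str = "inner") -> list[Row]:
    """Join two lists of row dicts on one identically named key column."""
    # Build phase: index the right side by key value.
    index: dict[Any, list[Row]] = defaultdict(list)
    for rrow in right:
        index[rrow[key]].append(rrow)

    left_cols = set(left[0]) if left else set()
    right_cols = [c for c in (right[0] if right else {}) if c != key]

    def out_name(col: str) -> str:
        # Conflicting non-key columns get the default "right." prefix.
        return f"right.{col}" if col in left_cols else col

    out: list[Row] = []
    # Probe phase: look up each left row in the index.
    for lrow in left:
        matches = index.get(lrow[key], [])
        if how == "semi":
            if matches:
                out.append(dict(lrow))  # left columns only, at most once
        elif how == "anti":
            if not matches:
                out.append(dict(lrow))  # left rows with no match
        elif how in ("inner", "left"):
            for m in matches:
                out.append({**lrow, **{out_name(c): m[c] for c in right_cols}})
            if not matches and how == "left":
                out.append({**lrow, **{out_name(c): None for c in right_cols}})
        else:
            raise ValueError(f"unsupported join type in this sketch: {how}")
    return out


df1_rows = [{"a": "w", "b": 1}, {"a": "x", "b": 2}, {"a": "y", "b": 3}]
df2_rows = [{"a": "x", "b": 20}, {"a": "y", "b": 30}, {"a": "z", "b": 40}]
inner = hash_join(df1_rows, df2_rows, "a")
```

Using the same data as the example above, `inner` contains the `x` and `y` rows with a `right.b` column, while `how="anti"` would keep only the `w` row.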

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def join(
    self,
    other: "DataFrame",
    on: list[ColumnInputType] | ColumnInputType | None = None,
    left_on: list[ColumnInputType] | ColumnInputType | None = None,
    right_on: list[ColumnInputType] | ColumnInputType | None = None,
    how: Literal["inner", "left", "right", "outer", "anti", "semi", "cross"] = "inner",
    strategy: Literal["hash", "sort_merge", "broadcast"] | None = None,
    prefix: str | None = None,
    suffix: str | None = None,
) -> "DataFrame":
    """Column-wise join of the current DataFrame with an ``other`` DataFrame, similar to a SQL ``JOIN``.

    If the two DataFrames have duplicate non-join key column names, "right." will be prepended to the conflicting right columns. You can change the behavior by passing either (or both) `prefix` or `suffix` to the function.
    If `prefix` is passed, it will be prepended to the conflicting right columns. If `suffix` is passed, it will be appended to the conflicting right columns.

    Args:
        other (DataFrame): the right DataFrame to join on.
        on (Optional[Union[List[ColumnInputType], ColumnInputType]]): key or keys to join on [use if the keys on the left and right side match.]. Defaults to None.
        left_on (Optional[Union[List[ColumnInputType], ColumnInputType]], optional): key or keys to join on left DataFrame. Defaults to None.
        right_on (Optional[Union[List[ColumnInputType], ColumnInputType]], optional): key or keys to join on right DataFrame. Defaults to None.
        how (str, optional): what type of join to perform; currently "inner", "left", "right", "outer", "anti", "semi", and "cross" are supported. Defaults to "inner".
        strategy (Optional[str]): The join strategy (algorithm) to use; currently "hash", "sort_merge", "broadcast", and None are supported, where None
            chooses the join strategy automatically during query optimization. The default is None.
        suffix (Optional[str], optional): Suffix to add to the column names in case of a name collision. Defaults to "".
        prefix (Optional[str], optional): Prefix to add to the column names in case of a name collision. Defaults to "right.".

    Returns:
        DataFrame: Joined DataFrame.

    Raises:
        ValueError: if `on` is passed in and `left_on` or `right_on` is not None.
        ValueError: if `on` is None but both `left_on` and `right_on` are not defined.

    Note:
        Although self joins are supported, we currently duplicate the logical plan for the right side
        and recompute the entire tree. Caching for this is on the roadmap.

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df1 = daft.from_pydict({"a": ["w", "x", "y"], "b": [1, 2, 3]})
        >>> df2 = daft.from_pydict({"a": ["x", "y", "z"], "b": [20, 30, 40]})
        >>> joined_df = df1.join(df2, left_on=df1["a"], right_on=df2["a"])
        >>> joined_df.show()
        ╭────────┬───────┬─────────╮
        │ a      ┆ b     ┆ right.b │
        │ ---    ┆ ---   ┆ ---     │
        │ String ┆ Int64 ┆ Int64   │
        ╞════════╪═══════╪═════════╡
        │ x      ┆ 2     ┆ 20      │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
        │ y      ┆ 3     ┆ 30      │
        ╰────────┴───────┴─────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    if how == "cross":
        if any(side_on is not None for side_on in [on, left_on, right_on]):
            raise ValueError("In a cross join, `on`, `left_on`, and `right_on` cannot be set")
        if strategy is not None:
            raise ValueError("In a cross join, `strategy` cannot be set")
        left_on = []
        right_on = []
    elif on is None:
        if left_on is None or right_on is None:
            raise ValueError("If `on` is None then both `left_on` and `right_on` must not be None")
    else:
        if left_on is not None or right_on is not None:
            raise ValueError("If `on` is not None then both `left_on` and `right_on` must be None")
        left_on = on
        right_on = on

    join_type = JoinType.from_join_type_str(how)
    join_strategy = JoinStrategy.from_join_strategy_str(strategy) if strategy is not None else None

    if join_strategy == JoinStrategy.SortMerge and join_type != JoinType.Inner:
        raise ValueError("Sort merge join only supports inner joins")
    elif join_strategy == JoinStrategy.Broadcast and join_type == JoinType.Outer:
        raise ValueError("Broadcast join does not support outer joins")

    left_exprs = column_inputs_to_expressions(tuple(left_on) if isinstance(left_on, list) else (left_on,))
    right_exprs = column_inputs_to_expressions(tuple(right_on) if isinstance(right_on, list) else (right_on,))
    builder = self._builder.join(
        other._builder,
        left_on=left_exprs,
        right_on=right_exprs,
        how=join_type,
        strategy=join_strategy,
        prefix=prefix,
        suffix=suffix,
    )
    return DataFrame(builder)

join_asof #

join_asof(other: DataFrame, *, on: ColumnInputType | None = None, left_on: ColumnInputType | None = None, right_on: ColumnInputType | None = None, by: list[ColumnInputType] | ColumnInputType | None = None, left_by: list[ColumnInputType] | ColumnInputType | None = None, right_by: list[ColumnInputType] | ColumnInputType | None = None, strategy: Literal['backward', 'forward', 'nearest'] = 'backward', prefix: str | None = None, suffix: str | None = None, _assume_sorted_and_aligned: bool = False) -> DataFrame

Point-in-time (asof) join: each left row matches the nearest right row according to the chosen strategy.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	Right-hand DataFrame (e.g. feature table).	required
`on`	`ColumnInputType \| None`	Asof key column when it has the same name on both sides. Exactly one column.	`None`
`left_on`	`ColumnInputType \| None`	Asof key on the left when names differ. Exactly one column; use with `right_on`.	`None`
`right_on`	`ColumnInputType \| None`	Asof key on the right when names differ. Exactly one column; use with `left_on`.	`None`
`by`	`list[ColumnInputType] \| ColumnInputType \| None`	Equality key column(s) with the same name on both sides (entity / group columns).	`None`
`left_by`	`list[ColumnInputType] \| ColumnInputType \| None`	Equality keys on the left when names differ; use with `right_by`.	`None`
`right_by`	`list[ColumnInputType] \| ColumnInputType \| None`	Equality keys on the right when names differ; use with `left_by`.	`None`
`strategy`	`Literal['backward', 'forward', 'nearest']`	Match strategy. `"backward"` finds the latest right row at or before the left timestamp. `"forward"` finds the earliest right row at or after the left timestamp. `"nearest"` finds the right row with the minimum absolute difference in the on-key; ties are broken in favor of the larger (forward) value.	`'backward'`
`_assume_sorted_and_aligned`	`bool`	Asserts that both tables have the same number of partitions with identical boundaries, and that rows within each partition are sorted ascending by the on-key. Also requires `enable_scan_task_split_and_merge=False`. When these conditions hold, Daft skips the distributed range-repartition shuffle and zips partitions by index. Passing `True` when the conditions are not met produces incorrect results.	`False`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Left-join-shaped result (every left row kept; unmatched right columns are null).

Raises:

Type	Description
`ValueError`	if `on` is set and `left_on` or `right_on` is not None.
`ValueError`	if `on` is None but `left_on` or `right_on` is missing.
`ValueError`	if both `by` and `left_by` / `right_by` are set.
`ValueError`	if only one of `left_by` and `right_by` is set.
`ValueError`	if `left_by` and `right_by` have different lengths.

Examples:

>>> import daft
>>> left = daft.from_pydict({"entity": ["A", "A", "B"], "timestamp": [10, 11, 10]})
>>> right = daft.from_pydict(
...     {
...         "entity": ["A", "A", "A", "B", "B"],
...         "timestamp": [9, 10, 11, 9, 11],
...         "value": [1.0, 2.0, 3.0, 5.0, 6.0],
...     }
... )
>>> left.join_asof(right, on="timestamp", by="entity").sort(["entity", "timestamp"]).show()

╭────────┬───────────┬─────────╮
│ entity ┆ timestamp ┆ value   │
│ ---    ┆ ---       ┆ ---     │
│ String ┆ Int64     ┆ Float64 │
╞════════╪═══════════╪═════════╡
│ A      ┆ 10        ┆ 2       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ A      ┆ 11        ┆ 3       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ B      ┆ 10        ┆ 5       │
╰────────┴───────────┴─────────╯
(Showing first 3 of 3 rows)
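
The three match strategies can be sketched with binary search over the sorted right-hand keys of each `by` group. This is a pure-Python illustration of the documented semantics, not Daft's distributed implementation: `asof_index` and `asof_join` are hypothetical helpers that handle a single `by` column and ignore name collisions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Any, Optional

Row = dict[str, Any]


def asof_index(keys: list, target: Any, strategy: str = "backward") -> Optional[int]:
    """Index of the matching entry in ascending `keys`, or None if there is none."""
    back = bisect_right(keys, target) - 1  # latest key <= target
    fwd = bisect_left(keys, target)  # earliest key >= target
    has_back, has_fwd = back >= 0, fwd < len(keys)
    if strategy == "backward":
        return back if has_back else None
    if strategy == "forward":
        return fwd if has_fwd else None
    if strategy == "nearest":
        if not has_back:
            return fwd if has_fwd else None
        if not has_fwd:
            return back
        # Ties are broken in favor of the forward (larger) key.
        return fwd if keys[fwd] - target <= target - keys[back] else back
    raise ValueError(f"unknown strategy: {strategy}")


def asof_join(left: list[Row], right: list[Row], on: str, by: str, strategy: str = "backward") -> list[Row]:
    """Left-join-shaped asof join: every left row is kept."""
    groups: dict[Any, list[Row]] = defaultdict(list)
    for rrow in right:
        groups[rrow[by]].append(rrow)
    for rows in groups.values():
        rows.sort(key=lambda r: r[on])  # each group sorted ascending by the on-key

    value_cols = [c for c in (right[0] if right else {}) if c not in (on, by)]
    out: list[Row] = []
    for lrow in left:
        rows = groups.get(lrow[by], [])
        i = asof_index([r[on] for r in rows], lrow[on], strategy)
        match = rows[i] if i is not None else None
        # Unmatched left rows get None for the right-hand columns.
        out.append({**lrow, **{c: (match[c] if match is not None else None) for c in value_cols}})
    return out


left_rows = [
    {"entity": "A", "timestamp": 10},
    {"entity": "A", "timestamp": 11},
    {"entity": "B", "timestamp": 10},
]
right_rows = [
    {"entity": "A", "timestamp": 9, "value": 1.0},
    {"entity": "A", "timestamp": 10, "value": 2.0},
    {"entity": "A", "timestamp": 11, "value": 3.0},
    {"entity": "B", "timestamp": 9, "value": 5.0},
    {"entity": "B", "timestamp": 11, "value": 6.0},
]
backward = asof_join(left_rows, right_rows, on="timestamp", by="entity")
```

On the example data, `backward` yields the values 2.0, 3.0, and 5.0 shown in the table above. For entity `B` at timestamp 10 the right-hand keys 9 and 11 are equally close, so `"nearest"` in this sketch picks the forward row.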

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def join_asof(
    self,
    other: "DataFrame",
    *,
    on: ColumnInputType | None = None,
    left_on: ColumnInputType | None = None,
    right_on: ColumnInputType | None = None,
    by: list[ColumnInputType] | ColumnInputType | None = None,
    left_by: list[ColumnInputType] | ColumnInputType | None = None,
    right_by: list[ColumnInputType] | ColumnInputType | None = None,
    strategy: Literal["backward", "forward", "nearest"] = "backward",
    prefix: str | None = None,
    suffix: str | None = None,
    _assume_sorted_and_aligned: bool = False,
) -> "DataFrame":
    """Point-in-time (asof) join: each left row matches the nearest right row according to the chosen strategy.

    Args:
        other: Right-hand DataFrame (e.g. feature table).
        on: Asof key column when it has the same name on both sides. Exactly one column.
        left_on: Asof key on the left when names differ. Exactly one column; use with ``right_on``.
        right_on: Asof key on the right when names differ. Exactly one column; use with ``left_on``.
        by: Equality key column(s) with the same name on both sides (entity / group columns).
        left_by: Equality keys on the left when names differ; use with ``right_by``.
        right_by: Equality keys on the right when names differ; use with ``left_by``.
        strategy: Match strategy. ``"backward"`` finds the latest right row at or before the left timestamp. ``"forward"`` finds the earliest right row at or after the left timestamp. ``"nearest"`` finds the right row with the minimum absolute difference in on_key; For tie-breaking, prefer the larger/forward value.
        _assume_sorted_and_aligned: Asserts that both tables have the same number of
            partitions with identical boundaries, and that rows within each partition are
            sorted ascending by the on-key. Also requires
            ``enable_scan_task_split_and_merge=False``. When these conditions hold, Daft
            skips the distributed range-repartition shuffle and zips partitions by index.
            Passing ``True`` when the conditions are not met produces incorrect results.

    Returns:
        DataFrame: Left-join-shaped result (every left row kept; unmatched right columns are null).

    Raises:
        ValueError: if ``on`` is set and ``left_on`` or ``right_on`` is not None.
        ValueError: if ``on`` is None but ``left_on`` or ``right_on`` is missing.
        ValueError: if both ``by`` and ``left_by`` / ``right_by`` are set.
        ValueError: if only one of ``left_by`` and ``right_by`` is set.
        ValueError: if ``left_by`` and ``right_by`` have different lengths.

    Examples:
        >>> import daft
        >>> left = daft.from_pydict({"entity": ["A", "A", "B"], "timestamp": [10, 11, 10]})
        >>> right = daft.from_pydict(
        ...     {
        ...         "entity": ["A", "A", "A", "B", "B"],
        ...         "timestamp": [9, 10, 11, 9, 11],
        ...         "value": [1.0, 2.0, 3.0, 5.0, 6.0],
        ...     }
        ... )
        >>> left.join_asof(right, on="timestamp", by="entity").sort(["entity", "timestamp"]).show()
        ╭────────┬───────────┬─────────╮
        │ entity ┆ timestamp ┆ value   │
        │ ---    ┆ ---       ┆ ---     │
        │ String ┆ Int64     ┆ Float64 │
        ╞════════╪═══════════╪═════════╡
        │ A      ┆ 10        ┆ 2       │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
        │ A      ┆ 11        ┆ 3       │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
        │ B      ┆ 10        ┆ 5       │
        ╰────────┴───────────┴─────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    if on is not None:
        if left_on is not None or right_on is not None:
            raise ValueError("If `on` is set then `left_on` and `right_on` must be None")
        left_on_expr = column_input_to_expression(on)
        right_on_expr = column_input_to_expression(on)
    else:
        if left_on is None or right_on is None:
            raise ValueError("If `on` is None then both `left_on` and `right_on` must be set")
        left_on_expr = column_input_to_expression(left_on)
        right_on_expr = column_input_to_expression(right_on)

    if by is not None:
        if left_by is not None or right_by is not None:
            raise ValueError("Cannot specify both `by` and `left_by`/`right_by`")
        by_tuple = tuple(by) if isinstance(by, list) else (by,)
        left_by_exprs = column_inputs_to_expressions(by_tuple)
        right_by_exprs = column_inputs_to_expressions(by_tuple)
    else:
        if left_by is None and right_by is None:
            left_by_exprs = []
            right_by_exprs = []
        elif left_by is None or right_by is None:
            raise ValueError("Specify both `left_by` and `right_by`, or neither")
        else:
            left_by_tuple = tuple(left_by) if isinstance(left_by, list) else (left_by,)
            right_by_tuple = tuple(right_by) if isinstance(right_by, list) else (right_by,)
            left_by_exprs = column_inputs_to_expressions(left_by_tuple)
            right_by_exprs = column_inputs_to_expressions(right_by_tuple)
            if len(left_by_exprs) != len(right_by_exprs):
                raise ValueError(
                    "left_by and right_by must have the same number of columns, got "
                    f"{len(left_by_exprs)} and {len(right_by_exprs)}"
                )

    asof_strategy = AsofJoinStrategy.from_asof_join_strategy_str(strategy)
    builder = self._builder.join_asof(
        other._builder,
        left_by=left_by_exprs,
        right_by=right_by_exprs,
        left_on=left_on_expr,
        right_on=right_on_expr,
        strategy=asof_strategy,
        prefix=prefix,
        suffix=suffix,
        assume_sorted_and_aligned=_assume_sorted_and_aligned,
    )
    return DataFrame(builder)

limit #

limit(num: int) -> DataFrame

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT.

Parameters:

Name	Type	Description	Default
`num`	`int`	maximum rows to allow.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Limited DataFrame

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
>>> df_limited = df.limit(5)  # returns 5 rows
>>> df_limited.show()

╭───────╮
│ x     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
├╌╌╌╌╌╌╌┤
│ 4     │
├╌╌╌╌╌╌╌┤
│ 5     │
╰───────╯
(Showing first 5 of 5 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def limit(self, num: int) -> "DataFrame":
    """Limits the rows in the DataFrame to the first ``N`` rows, similar to a SQL ``LIMIT``.

    Args:
        num (int): maximum rows to allow.

    Returns:
        DataFrame: Limited DataFrame

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
        >>> df_limited = df.limit(5)  # returns 5 rows
        >>> df_limited.show()
        ╭───────╮
        │ x     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ├╌╌╌╌╌╌╌┤
        │ 2     │
        ├╌╌╌╌╌╌╌┤
        │ 3     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ├╌╌╌╌╌╌╌┤
        │ 5     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

    """
    builder = self._builder.limit(num, eager=False)
    return DataFrame(builder)

max #

max(*cols: ColumnInputType) -> DataFrame

Performs a global max on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to max	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated max. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.max("col_a")
>>> df.show()

╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 3     │
╰───────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def max(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global max on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to max
    Returns:
        DataFrame: Globally aggregated max. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.max("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 3     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.max, cols)

mean #

mean(*cols: ColumnInputType) -> DataFrame

Performs a global mean on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to mean	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated mean. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.mean("col_a")
>>> df.show()

╭─────────╮
│ col_a   │
│ ---     │
│ Float64 │
╞═════════╡
│ 2       │
╰─────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def mean(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global mean on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to mean
    Returns:
        DataFrame: Globally aggregated mean. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.mean("col_a")
        >>> df.show()
        ╭─────────╮
        │ col_a   │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 2       │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.mean, cols)

melt #

melt(ids: ManyColumnsInputType, values: ManyColumnsInputType = [], variable_name: str = 'variable', value_name: str = 'value') -> DataFrame

Alias for unpivot.

Parameters:

Name	Type	Description	Default
`ids`	`ManyColumnsInputType`	Columns to keep as identifiers	required
`values`	`Optional[ManyColumnsInputType]`	Columns to unpivot. If not specified, all columns except ids will be unpivoted.	`[]`
`variable_name`	`Optional[str]`	Name of the variable column. Defaults to "variable".	`'variable'`
`value_name`	`Optional[str]`	Name of the value column. Defaults to "value".	`'value'`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Unpivoted DataFrame

Examples:

>>> import daft
>>> df = daft.from_pydict(
...     {
...         "year": [2020, 2021, 2022],
...         "Jan": [10, 30, 50],
...         "Feb": [20, 40, 60],
...     }
... )
>>> df = df.melt("year", ["Jan", "Feb"], variable_name="month", value_name="inventory")
>>> df = df.sort("year")
>>> df.show()

╭───────┬────────┬───────────╮
│ year  ┆ month  ┆ inventory │
│ ---   ┆ ---    ┆ ---       │
│ Int64 ┆ String ┆ Int64     │
╞═══════╪════════╪═══════════╡
│ 2020  ┆ Jan    ┆ 10        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2020  ┆ Feb    ┆ 20        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021  ┆ Jan    ┆ 30        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021  ┆ Feb    ┆ 40        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022  ┆ Jan    ┆ 50        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022  ┆ Feb    ┆ 60        │
╰───────┴────────┴───────────╯
(Showing first 6 of 6 rows)
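
The reshaping can be sketched over a list of row dicts. This is a plain-Python illustration of the unpivot semantics, not Daft's implementation: `unpivot_rows` is a hypothetical helper, and the output order it produces is a property of the sketch only.

```python
from typing import Any

Row = dict[str, Any]


def unpivot_rows(
    rows: list[Row],
    ids: list[str],
    values: list[str],
    variable_name: str = "variable",
    value_name: str = "value",
) -> list[Row]:
    """Turn each (row, value column) pair into one output row."""
    out: list[Row] = []
    for row in rows:
        # With no explicit value columns, unpivot everything except the ids.
        cols = values or [c for c in row if c not in ids]
        for col in cols:
            out.append({**{i: row[i] for i in ids}, variable_name: col, value_name: row[col]})
    return out


wide = [
    {"year": 2020, "Jan": 10, "Feb": 20},
    {"year": 2021, "Jan": 30, "Feb": 40},
    {"year": 2022, "Jan": 50, "Feb": 60},
]
long = unpivot_rows(wide, ["year"], ["Jan", "Feb"], variable_name="month", value_name="inventory")
```

Three input rows with two value columns produce six output rows, one per (year, month) pair, as in the table above.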

min #

min(*cols: ColumnInputType) -> DataFrame

Performs a global min on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to min	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated min. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.min("col_a")
>>> df.show()

╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
╰───────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def min(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global min on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to min
    Returns:
        DataFrame: Globally aggregated min. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.min("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.min, cols)

num_partitions #

num_partitions() -> int | None

Returns the number of partitions that will be used to execute this DataFrame.

The query optimizer may change the partitioning strategy. This method runs the optimizer and then inspects the resulting physical plan scheduler to determine how many partitions the execution will use.

Returns:

Name	Type	Description
`int`	`int \| None`	The number of partitions in the optimized physical execution plan, or `None` on the native runner, which does not support `num_partitions`.

Examples:

>>> import daft
>>>
>>> daft.set_runner_ray()
>>>
>>> # Create a DataFrame with 1000 rows
>>> df = daft.from_pydict({"x": list(range(1000))})
>>>
>>> # Partition count may depend on default config or optimizer decisions
>>> df.num_partitions()
1
>>>
>>> # You can repartition manually (if supported), and then inspect again:
>>> df2 = df.repartition(10)
>>> df2.num_partitions()
10

Source code in daft/dataframe/dataframe.py

def num_partitions(self) -> int | None:
    """Returns the number of partitions that will be used to execute this DataFrame.

    The query optimizer may change the partitioning strategy. This method runs the optimizer
    and then inspects the resulting physical plan scheduler to determine how many partitions
    the execution will use.

    Returns:
        int: The number of partitions in the optimized physical execution plan.

    Examples:
        >>> import daft
        >>>
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>>
        >>> # Create a DataFrame with 1000 rows
        >>> df = daft.from_pydict({"x": list(range(1000))})
        >>>
        >>> # Partition count may depend on default config or optimizer decisions
        >>> df.num_partitions()  # doctest: +SKIP
        1
        >>>
        >>> # You can repartition manually (if supported), and then inspect again:
        >>> df2 = df.repartition(10)  # doctest: +SKIP
        >>> df2.num_partitions()  # doctest: +SKIP
        10
    """
    runner_name = get_or_create_runner().name
    # Native runner does not support num_partitions
    if runner_name == "native":
        return None
    else:
        execution_config = get_context().daft_execution_config
        optimized = self._builder.optimize(execution_config)
        distributed_plan = DistributedPhysicalPlan.from_logical_plan_builder(
            optimized._builder, "<tmp>", execution_config
        )
        return distributed_plan.num_partitions()

offset #

offset(num: int) -> DataFrame

Returns a new DataFrame that skips the first N rows, similar to a SQL OFFSET.

Parameters:

Name	Type	Description	Default
`num`	`int`	the number of rows to skip	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame that skips the first `N` rows

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
>>> df = df.offset(1).limit(5)  # skip the first row and return 5 rows
>>> df.show()

╭───────╮
│ x     │
│ ---   │
│ Int64 │
╞═══════╡
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
├╌╌╌╌╌╌╌┤
│ 4     │
├╌╌╌╌╌╌╌┤
│ 5     │
├╌╌╌╌╌╌╌┤
│ 6     │
╰───────╯
(Showing first 5 of 5 rows)
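
On an ordered sequence of rows, skipping `N` rows and then keeping at most `M` is the same as slicing `[N : N + M]`. The sketch below models that with `itertools.islice`; `offset_then_limit` is a hypothetical helper that operates on a plain Python iterable, not on a Daft DataFrame.

```python
from itertools import islice
from typing import Iterable, TypeVar

T = TypeVar("T")


def offset_then_limit(rows: Iterable[T], offset: int, limit: int) -> list[T]:
    """Skip the first `offset` items, then keep at most `limit` items."""
    return list(islice(rows, offset, offset + limit))


page = offset_then_limit([1, 2, 3, 4, 5, 6, 7], offset=1, limit=5)
```

Here `page` holds the values 2 through 6, the same rows as the example output above.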

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def offset(self, num: int) -> "DataFrame":
    """Returns a new DataFrame by skipping the first ``N`` rows, similar to a SQL ``Offset``.

    Args:
        num (int): the number of rows to skip

    Returns:
        DataFrame: A new DataFrame by skipping the first ``N`` rows

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3, 4, 5, 6, 7]})
        >>> df = df.offset(1).limit(5)  # skip the first row and return 5 rows
        >>> df.show()
        ╭───────╮
        │ x     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 2     │
        ├╌╌╌╌╌╌╌┤
        │ 3     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ├╌╌╌╌╌╌╌┤
        │ 5     │
        ├╌╌╌╌╌╌╌┤
        │ 6     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)

    """
    builder = self._builder.offset(num)
    return DataFrame(builder)

pipe #

pipe(function: Callable[Concatenate[DataFrame, P], T], *args: args, **kwargs: kwargs) -> T

Apply the function to this DataFrame.

Parameters:

Name	Type	Description	Default
`function`	`Callable[Concatenate[DataFrame, P], T]`	Function to apply.	required
`*args`	`args`	Positional arguments to pass to the function.	`()`
`**kwargs`	`kwargs`	Keyword arguments to pass to the function.	`{}`

Returns:

Type	Description
`T`	Result of applying the function on this DataFrame.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": [1, 2, 3]})
>>>
>>> def square(df, column: str):
...     return df.select((df[column] * df[column]).alias(column))
>>>
>>> df.pipe(square, "x").show()

╭───────╮
│ x     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 4     │
├╌╌╌╌╌╌╌┤
│ 9     │
╰───────╯
(Showing first 3 of 3 rows)
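
The value of `pipe` is that user-defined transformations read left to right instead of nesting inside one another. The sketch below shows the pattern on a toy class. `Frame`, `scale`, and `shift` are hypothetical stand-ins, not Daft objects; only the `pipe` body mirrors the one-line forwarding that the method performs.

```python
from typing import Any, Callable, Iterable


class Frame:
    """Hypothetical stand-in for a DataFrame that holds a list of numbers."""

    def __init__(self, values: Iterable[int]) -> None:
        self.values = list(values)

    def pipe(self, function: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Forward self as the first argument, followed by any extra arguments.
        return function(self, *args, **kwargs)


def scale(frame: Frame, factor: int) -> Frame:
    return Frame(v * factor for v in frame.values)


def shift(frame: Frame, *, by: int) -> Frame:
    return Frame(v + by for v in frame.values)


# Nested calls read inside-out ...
nested = shift(scale(Frame([1, 2, 3]), 2), by=1)
# ... while piped calls read in execution order.
chained = Frame([1, 2, 3]).pipe(scale, 2).pipe(shift, by=1)
```

Both forms compute the same result; the piped version keeps positional and keyword arguments next to the function they belong to.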

Source code in daft/dataframe/dataframe.py

def pipe(
    self,
    function: Callable[Concatenate["DataFrame", P], T],
    *args: P.args,
    **kwargs: P.kwargs,
) -> T:
    """Apply the function to this DataFrame.

    Args:
        function (Callable[Concatenate["DataFrame", P], T]): Function to apply.
        *args (P.args): Positional arguments to pass to the function.
        **kwargs (P.kwargs): Keyword arguments to pass to the function.

    Returns:
        Result of applying the function on this DataFrame.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"x": [1, 2, 3]})
        >>>
        >>> def square(df, column: str):
        ...     return df.select((df[column] * df[column]).alias(column))
        >>>
        >>> df.pipe(square, "x").show()
        ╭───────╮
        │ x     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ├╌╌╌╌╌╌╌┤
        │ 9     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return function(self, *args, **kwargs)

pivot #

pivot(group_by: ManyColumnsInputType, pivot_col: ColumnInputType, value_col: ColumnInputType, agg_fn: str, names: list[str] | None = None) -> DataFrame

Pivots a column of the DataFrame and performs an aggregation on the values.

Parameters:

Name	Type	Description	Default
`group_by`	`ManyColumnsInputType`	columns to group by	required
`pivot_col`	`Union[str, Expression]`	column to pivot	required
`value_col`	`Union[str, Expression]`	column to aggregate	required
`agg_fn`	`str`	aggregation function to apply	required
`names`	`Optional[List[str]]`	names of the pivoted columns	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with pivoted columns

Note

You may wish to pass the distinct values to pivot on via `names`, which is more efficient because it avoids a distinct operation. If `names` is not given, Daft runs a distinct operation on the pivot column to determine the unique values to pivot on.

Examples:

>>> import daft
>>> data = {
...     "id": [1, 2, 3, 4],
...     "version": ["3.8", "3.8", "3.9", "3.9"],
...     "platform": ["macos", "macos", "macos", "windows"],
...     "downloads": [100, 200, 150, 250],
... }
>>> df = daft.from_pydict(data)
>>> df = df.pivot("version", "platform", "downloads", "sum")
>>>
>>> df = df.sort("version").select("version", "windows", "macos")
>>> df.show()

╭─────────┬─────────┬───────╮
│ version ┆ windows ┆ macos │
│ ---     ┆ ---     ┆ ---   │
│ String  ┆ Int64   ┆ Int64 │
╞═════════╪═════════╪═══════╡
│ 3.8     ┆ None    ┆ 300   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3.9     ┆ 250     ┆ 150   │
╰─────────┴─────────┴───────╯
(Showing first 2 of 2 rows)
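To make the pivot semantics and the role of `names` concrete, here is a minimal pure-Python sketch of a pivot with a sum aggregation over the same data. `pivot_sum` is a hypothetical helper written for illustration; it is not how Daft implements `pivot`.

```python
from collections import defaultdict

rows = [
    {"version": "3.8", "platform": "macos", "downloads": 100},
    {"version": "3.8", "platform": "macos", "downloads": 200},
    {"version": "3.9", "platform": "macos", "downloads": 150},
    {"version": "3.9", "platform": "windows", "downloads": 250},
]


def pivot_sum(rows, group_by, pivot_col, value_col, names=None):
    """Conceptual pivot with a sum aggregation (not Daft's implementation)."""
    if names is None:
        # Extra pass over the data to discover the distinct pivot values.
        names = sorted({str(r[pivot_col]) for r in rows})
    totals = defaultdict(dict)
    for r in rows:
        cell = totals[r[group_by]]
        name = str(r[pivot_col])
        if name in names:
            cell[name] = cell.get(name, 0) + r[value_col]
    # Combinations that never occur stay None, like the `windows` cell for 3.8.
    return [
        {group_by: key, **{n: totals[key].get(n) for n in names}}
        for key in sorted(totals)
    ]


result = pivot_sum(rows, "version", "platform", "downloads", names=["windows", "macos"])
for row in result:
    print(row)
```

When `names` is supplied, the extra pass that collects distinct pivot values is skipped, which is the saving the note above describes.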

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def pivot(
    self,
    group_by: ManyColumnsInputType,
    pivot_col: ColumnInputType,
    value_col: ColumnInputType,
    agg_fn: str,
    names: list[str] | None = None,
) -> "DataFrame":
    """Pivots a column of the DataFrame and performs an aggregation on the values.

    Args:
        group_by (ManyColumnsInputType): columns to group by
        pivot_col (Union[str, Expression]): column to pivot
        value_col (Union[str, Expression]): column to aggregate
        agg_fn (str): aggregation function to apply
        names (Optional[List[str]]): names of the pivoted columns

    Returns:
        DataFrame: DataFrame with pivoted columns

    Note:
        You may wish to provide a list of distinct values to pivot on, which is more efficient as it avoids
        a distinct operation. Without this list, Daft will perform a distinct operation on the pivot column to
        determine the unique values to pivot on.

    Examples:
        >>> import daft
        >>> data = {
        ...     "id": [1, 2, 3, 4],
        ...     "version": ["3.8", "3.8", "3.9", "3.9"],
        ...     "platform": ["macos", "macos", "macos", "windows"],
        ...     "downloads": [100, 200, 150, 250],
        ... }
        >>> df = daft.from_pydict(data)
        >>> df = df.pivot("version", "platform", "downloads", "sum")
        >>>
        >>> df = df.sort("version").select("version", "windows", "macos")
        >>> df.show()
        ╭─────────┬─────────┬───────╮
        │ version ┆ windows ┆ macos │
        │ ---     ┆ ---     ┆ ---   │
        │ String  ┆ Int64   ┆ Int64 │
        ╞═════════╪═════════╪═══════╡
        │ 3.8     ┆ None    ┆ 300   │
        ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3.9     ┆ 250     ┆ 150   │
        ╰─────────┴─────────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)


    """
    group_by_expr = column_inputs_to_expressions(group_by)
    [pivot_col_expr, value_col_expr] = column_inputs_to_expressions([pivot_col, value_col])
    agg_expr = self._map_agg_string_to_expr(value_col_expr, agg_fn)

    if names is None:
        names = (
            self.select(typing.cast("ColumnInputType", pivot_col_expr))
            .distinct()
            .to_pydict()[pivot_col_expr.name()]
        )
        names = [str(x) for x in names]
    builder = self._builder.pivot(group_by_expr, pivot_col_expr, value_col_expr, agg_expr, names)
    return DataFrame(builder)

product #

product(*cols: ColumnInputType) -> DataFrame

Performs a global product on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to compute the product of	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated products. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.product("col_a")
>>> df.show()

╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 6     │
╰───────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def product(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global product on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to product

    Returns:
        DataFrame: Globally aggregated products. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.product("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 6     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.product, cols)

repartition #

repartition(num: int | None, *partition_by: ColumnInputType) -> DataFrame

Repartitions DataFrame to num partitions.

If columns are passed in, then DataFrame will be repartitioned by those, otherwise random repartitioning will occur.

Parameters:

Name	Type	Description	Default
`num`	`Optional[int]`	Number of target partitions; if None, the number of partitions will not be changed.	required
`*partition_by`	`Union[str, Expression]`	Optional columns to partition by.	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Repartitioned DataFrame.

Note

This function will globally shuffle your data, which is potentially a very expensive operation.

If you merely wish to "split" or "coalesce" partitions to obtain a target number of partitions, you may instead wish to consider using DataFrame.into_partitions, which avoids shuffling data in favor of splitting/coalescing adjacent partitions where appropriate.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> repartitioned_df = df.repartition(3)
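The difference between a global shuffle and coalescing adjacent partitions can be sketched in plain Python. Both helpers below are hypothetical illustrations: `hash(...) % num` stands in for whatever hash function Daft actually uses, and `coalesce_adjacent` only approximates what `into_partitions` does.

```python
def hash_repartition(partitions, num, key):
    """Conceptual global shuffle: every row may move to a new partition."""
    out = [[] for _ in range(num)]
    for part in partitions:
        for row in part:
            out[hash(row[key]) % num].append(row)
    return out


def coalesce_adjacent(partitions, num):
    """Conceptual coalesce: merge neighbouring partitions, keeping row order."""
    size = -(-len(partitions) // num)  # ceiling division
    return [
        [row for part in partitions[i : i + size] for row in part]
        for i in range(0, len(partitions), size)
    ]


partitions = [
    [{"x": 1}, {"x": 2}],
    [{"x": 3}, {"x": 1}],
    [{"x": 2}, {"x": 3}],
    [{"x": 1}, {"x": 4}],
]

shuffled = hash_repartition(partitions, 2, "x")
merged = coalesce_adjacent(partitions, 2)

print(shuffled)
print(merged)
```

After the hash-based shuffle, all rows sharing a key sit in the same partition, at the cost of touching every row. The coalesce only concatenates neighbours, so no row is routed individually.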

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def repartition(self, num: int | None, *partition_by: ColumnInputType) -> "DataFrame":
    """Repartitions DataFrame to ``num`` partitions.

    If columns are passed in, then DataFrame will be repartitioned by those, otherwise
    random repartitioning will occur.

    Args:
        num (Optional[int]): Number of target partitions; if None, the number of partitions will not be changed.
        *partition_by (Union[str, Expression]): Optional columns to partition by.

    Returns:
        DataFrame: Repartitioned DataFrame.

    Note: This function will globally shuffle your data, which is potentially a very expensive operation.
        If instead you merely wish to "split" or "coalesce" partitions to obtain a target number of partitions,
        you may instead wish to consider using [DataFrame.into_partitions][daft.DataFrame.into_partitions] which
        avoids shuffling of data in favor of splitting/coalescing adjacent partitions where appropriate.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> repartitioned_df = df.repartition(3)

    """
    if get_or_create_runner().name == "native":
        warnings.warn(
            "DataFrame.repartition not supported on the NativeRunner. This will be a no-op. Please use the RayRunner via `daft.set_runner_ray()` instead if you need to repartition."
        )
    if len(partition_by) == 0:
        warnings.warn(
            "No columns specified for repartition, so doing a random shuffle. If you do not require rebalancing of "
            "partitions, you may instead prefer using `df.into_partitions(N)` which is a cheaper operation that "
            "avoids shuffling data."
        )
        builder = self._builder.random_shuffle(num)
    else:
        builder = self._builder.hash_repartition(num, column_inputs_to_expressions(partition_by))
    return DataFrame(builder)

sample #

sample(fraction: float | None = None, size: int | None = None, with_replacement: bool = False, seed: int | None = None) -> DataFrame

Samples rows from the DataFrame.

Parameters:

Name	Type	Description	Default
`fraction`	`Optional[float]`	fraction of rows to sample (between 0.0 and 1.0). Must specify either `fraction` or `size`, but not both. For backward compatibility, can also be passed as a positional argument.	`None`
`size`	`Optional[int]`	exact number of rows to sample. Must specify either `fraction` or `size`, but not both. If `size` exceeds the total number of rows, a ValueError is raised when `with_replacement=False`, while `size` rows (possibly containing duplicates) are returned when `with_replacement=True`. Note: sampling by size currently works only on the native runner.	`None`
`with_replacement`	`bool`	whether to sample with replacement. Defaults to False.	`False`
`seed`	`Optional[int]`	random seed. Defaults to None.	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with sampled rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> # Sample by fraction (backward compatible positional argument)
>>> sampled_df = df.sample(0.5)
>>> sampled_df = sampled_df.collect()
>>> # sampled_df.show()
>>> # ╭───────┬───────┬───────╮
>>> # │ x     ┆ y     ┆ z     │
>>> # │ ---   ┆ ---   ┆ ---   │
>>> # │ Int64 ┆ Int64 ┆ Int64 │
>>> # ╞═══════╪═══════╪═══════╡
>>> # │ 3     ┆ 6     ┆ 9     │
>>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
>>> # │ 1     ┆ 4     ┆ 7     │
>>> # ╰───────┴───────┴───────╯
>>> # <BLANKLINE>
>>> # (Showing first 2 of 2 rows)
>>> # Samples will vary from output to output
>>> # here is a sample output
>>> # ╭───────┬───────┬───────╮
>>> # │ x     ┆ y     ┆ z     │
>>> # │ ---   ┆ ---   ┆ ---   │
>>> # │ Int64 ┆ Int64 ┆ Int64 │
>>> # ╞═══════╪═══════╪═══════╡
>>> # │ 2     ┆ 5     ┆ 8     │
>>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
>>> # │ 3     ┆ 6     ┆ 9     │
>>> # ╰───────┴───────┴───────╯
>>> # Sample by exact number of rows
>>> sampled_df = df.sample(size=2)
>>> sampled_df = sampled_df.collect()

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def sample(
    self,
    fraction: float | None = None,
    size: int | None = None,
    with_replacement: bool = False,
    seed: int | None = None,
) -> "DataFrame":
    """Samples rows from the DataFrame.

    Args:
        fraction (Optional[float]): fraction of rows to sample (between 0.0 and 1.0).
            Must specify either `fraction` or `size`, but not both.
            For backward compatibility, can also be passed as a positional argument.
        size (Optional[int]): exact number of rows to sample.
            Must specify either `fraction` or `size`, but not both.
            If `size` exceeds the total number of rows:
            - When `with_replacement=False`: raises ValueError
            - When `with_replacement=True`: returns `size` rows (may contain duplicates)
            Note: Sample by size only works on the native runner right now.
        with_replacement (bool, optional): whether to sample with replacement. Defaults to False.
        seed (Optional[int], optional): random seed. Defaults to None.

    Returns:
        DataFrame: DataFrame with sampled rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> # Sample by fraction (backward compatible positional argument)
        >>> sampled_df = df.sample(0.5)
        >>> sampled_df = sampled_df.collect()
        >>> # sampled_df.show()
        >>> # ╭───────┬───────┬───────╮
        >>> # │ x     ┆ y     ┆ z     │
        >>> # │ ---   ┆ ---   ┆ ---   │
        >>> # │ Int64 ┆ Int64 ┆ Int64 │
        >>> # ╞═══════╪═══════╪═══════╡
        >>> # │ 3     ┆ 6     ┆ 9     │
        >>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        >>> # │ 1     ┆ 4     ┆ 7     │
        >>> # ╰───────┴───────┴───────╯
        >>> # <BLANKLINE>
        >>> # (Showing first 2 of 2 rows)
        >>> # Samples will vary from output to output
        >>> # here is a sample output
        >>> # ╭───────┬───────┬───────╮
        >>> # │ x     ┆ y     ┆ z     │
        >>> # │ ---   ┆ ---   ┆ ---   │
        >>> # │ Int64 ┆ Int64 ┆ Int64 │
        >>> # ╞═══════╪═══════╪═══════╡
        >>> # │ 2     ┆ 5     ┆ 8     │
        >>> # ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        >>> # │ 3     ┆ 6     ┆ 9     │
        >>> # ╰───────┴───────┴───────╯
        >>> # Sample by exact number of rows
        >>> sampled_df = df.sample(size=2)
        >>> sampled_df = sampled_df.collect()
    """
    if fraction is not None and size is not None:
        raise ValueError("Must specify either `fraction` or `size`, but not both")
    if fraction is None and size is None:
        raise ValueError("Must specify either `fraction` or `size`")
    if fraction is not None and (fraction < 0.0 or fraction > 1.0):
        raise ValueError(f"fraction should be between 0.0 and 1.0, but got {fraction}")
    if size is not None:
        if size < 0:
            raise ValueError(f"size should be non-negative, but got {size}")
        if get_or_create_runner().name == "ray":
            raise ValueError(
                "Sample by size only works on the native runner right now. "
                "Please use `daft.set_runner_native()` to switch to the native runner, "
                "or use `fraction` instead of `size` for sampling."
            )

    builder = self._builder.sample(fraction, size, with_replacement, seed)
    return DataFrame(builder)

schema #

schema() -> Schema

Returns the Schema of the DataFrame, which provides information about each column, as a Python object.

Returns:

Name	Type	Description
`Schema`	`Schema`	schema of the DataFrame

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.schema()

╭─────────────┬────────╮
│ Column Name ┆ DType  │
╞═════════════╪════════╡
│ x           ┆ Int64  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ y           ┆ String │
╰─────────────┴────────╯

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def schema(self) -> Schema:
    """Returns the Schema of the DataFrame, which provides information about each column, as a Python object.

    Returns:
        Schema: schema of the DataFrame

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.schema()
        ╭─────────────┬────────╮
        │ Column Name ┆ DType  │
        ╞═════════════╪════════╡
        │ x           ┆ Int64  │
        ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ y           ┆ String │
        ╰─────────────┴────────╯
        <BLANKLINE>
    """
    return self.__builder.schema()

select #

select(*columns: ColumnInputType, **projections: Expression) -> DataFrame

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT.

Parameters:

Name	Type	Description	Default
`*columns`	`Union[str, Expression]`	columns to select from the current DataFrame	`()`
`**projections`	`Expression`	additional projections in kwarg format.	`{}`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	new DataFrame that will select the passed in columns

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df = df.select("x", daft.col("y"), daft.col("z") + 1)
>>> df.show()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 9     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 10    │
╰───────┴───────┴───────╯
(Showing first 3 of 3 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def select(self, *columns: ColumnInputType, **projections: Expression) -> "DataFrame":
    """Creates a new DataFrame from the provided expressions, similar to a SQL ``SELECT``.

    Args:
        *columns (Union[str, Expression]): columns to select from the current DataFrame
        **projections (Expression): additional projections in kwarg format.

    Returns:
        DataFrame: new DataFrame that will select the passed in columns

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df = df.select("x", daft.col("y"), daft.col("z") + 1)
        >>> df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 9     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 10    │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    selection = column_inputs_to_expressions(columns)
    selection += [expr.alias(alias) for (alias, expr) in projections.items()]
    builder = self._builder.select(selection)
    return DataFrame(builder)

show #

show(n: int = 8, format: PreviewFormat | None = None, verbose: bool | None = None, max_width: int | None = None, align: PreviewAlign | None = None, columns: list[PreviewColumn] | None = None) -> None

Executes enough of the DataFrame in order to display the first n rows.

If IPython is installed, this will use IPython's display utility to pretty-print in a notebook/REPL environment. Otherwise, this will fall back onto a naive Python print.

If no format is given, then daft's truncating preview format is used.

- The output is a 'fancy' table with rounded corners.
- Headers contain the column's data type.
- Columns are truncated to 30 characters.
- The table's overall width is limited to 10 columns.

Default values can be overridden with environment variables:

- DAFT_SHOW_FORMAT
- DAFT_SHOW_VERBOSE
- DAFT_SHOW_MAX_WIDTH
- DAFT_SHOW_ALIGN

Parameters:

Name	Type	Description	Default
`n`	`int`	number of rows to show. Defaults to 8.	`8`
`format`	`PreviewFormat`	the box-drawing format e.g. "fancy" or "markdown".	`None`
`verbose`	`bool`	if True, headers include the column's data type.	`None`
`max_width`	`int \| None`	global max column width	`None`
`align`	`PreviewAlign`	global column align	`None`
`columns`	`list[PreviewColumn]`	column overrides	`None`

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df.show()
>>> df.show(format="markdown")
>>> df.show(max_width=50)
>>> df.show(align="auto")

Usage

- If columns are given, their length MUST match the schema.
- If columns are given, their settings override any global settings.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def show(
    self,
    n: int = 8,
    format: PreviewFormat | None = None,
    verbose: bool | None = None,
    max_width: int | None = None,
    align: PreviewAlign | None = None,
    columns: list[PreviewColumn] | None = None,
) -> None:
    """Executes enough of the DataFrame in order to display the first ``n`` rows.

    If IPython is installed, this will use IPython's `display` utility to pretty-print in a
    notebook/REPL environment. Otherwise, this will fall back onto a naive Python `print`.

    If no format is given, then daft's truncating preview format is used.
        - The output is a 'fancy' table with rounded corners.
        - Headers contain the column's data type.
        - Columns are truncated to 30 characters.
        - The table's overall width is limited to 10 columns.
    Default values can be overridden with environment variables:
        - ``DAFT_SHOW_FORMAT``
        - ``DAFT_SHOW_VERBOSE``
        - ``DAFT_SHOW_MAX_WIDTH``
        - ``DAFT_SHOW_ALIGN``

    Args:
        n: number of rows to show. Defaults to 8.
        format (PreviewFormat): the box-drawing format e.g. "fancy" or "markdown".
        verbose (bool): if True, headers include the column's data type.
        max_width (int | None): global max column width
        align (PreviewAlign): global column align
        columns (list[PreviewColumn]): column overrides

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df.show()  # doctest: +SKIP
        >>> df.show(format="markdown")  # doctest: +SKIP
        >>> df.show(max_width=50)  # doctest: +SKIP
        >>> df.show(align="auto")  # doctest: +SKIP

    Tip: Usage
        - If columns are given, their length MUST match the schema.
        - If columns are given, their settings override any global settings.

    """
    schema = self.schema()
    preview = self._construct_show_preview(n)
    format, verbose, max_width, align = resolve_show_defaults(format, verbose, max_width, align)
    preview_formatter = PreviewFormatter(
        preview,
        schema,
        format,
        verbose=verbose,
        max_width=max_width,
        align=align,
        columns=columns,
    )

    try:
        from IPython.display import HTML, display

        if in_notebook() and preview.partition is not None:
            try:
                interactive_html = preview_formatter._generate_interactive_html()
                display(HTML(interactive_html), clear=True)
                return
            except Exception:
                pass

        display(preview_formatter, clear=True)
    except ImportError:
        print(preview_formatter)
    return

shuffle #

shuffle(seed: int | None = None) -> DataFrame

Randomly reorders rows of the DataFrame.

This is analogous to the shuffle operation in the Hugging Face datasets library.

Note

This performs a global sort and is expensive. For randomly redistributing rows across partitions, see DataFrame.repartition with no partition_by (random partition shuffle).

Parameters:

Name	Type	Description	Default
`seed`	`int \| None`	Optional RNG seed passed to `random_int` for best-effort reproducibility on a fixed partition layout; not guaranteed across runners or plan changes.	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame with rows in random order.
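One way to picture a shuffle that "performs a global sort" is to tag every row with a random integer and sort by that tag. The sketch below follows that idea in plain Python. It is an assumption about the general technique, not a description of Daft's query plan, and `shuffle_rows` is a hypothetical helper.

```python
import random


def shuffle_rows(rows, seed=None):
    """Conceptual shuffle: tag each row with a random integer, then sort by the tag."""
    rng = random.Random(seed)
    keyed = [(rng.getrandbits(64), i, row) for i, row in enumerate(rows)]
    # The index breaks ties so rows themselves are never compared.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [row for _, _, row in keyed]


rows = [{"x": i} for i in range(10)]
first = shuffle_rows(rows, seed=42)
second = shuffle_rows(rows, seed=42)

print(first == second)
print(sorted(r["x"] for r in first) == list(range(10)))
```

In this single-process sketch a fixed seed gives a fixed order. In a distributed setting the tags depend on how rows are laid out across partitions, which is why the `seed` parameter promises only best-effort reproducibility.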

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def shuffle(self, seed: int | None = None) -> "DataFrame":
    """Randomly reorders rows of the DataFrame.

    This is analogous to ``shuffle`` operation in the Hugging Face ``datasets`` library.

    Note:
        This performs a global sort and is expensive. For randomly redistributing rows across
        partitions see :meth:`DataFrame.repartition` with no ``partition_by`` (random partition shuffle).

    Args:
        seed: Optional RNG seed passed to ``random_int`` for best-effort reproducibility
            on a fixed partition layout; not guaranteed across runners or plan changes.

    Returns:
        DataFrame: A new DataFrame with rows in random order.
    """
    builder = self._builder.shuffle(seed)
    return DataFrame(builder)

skew #

skew(*cols: ColumnInputType) -> DataFrame

Performs a global skew on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to compute skewness for	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated skewness. Should be a single row.

Note

Daft uses the biased (population) skewness formula, which is equivalent to scipy.stats.skew(bias=True). This differs from pandas' default DataFrame.skew(), which uses the adjusted Fisher-Pearson (sample) formula. For small samples, the two formulas can produce meaningfully different results.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3, 4, 5]})
>>> df = df.skew("col_a")
>>> df.show()

╭─────────╮
│ col_a   │
│ ---     │
│ Float64 │
╞═════════╡
│ 0       │
╰─────────╯
(Showing first 1 of 1 rows)
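The symmetric example above gives 0 under either formula, so it does not show the difference the note describes. The following pure-Python sketch computes both the biased (population) skewness and the adjusted Fisher-Pearson (sample) skewness for a small asymmetric sample. The function names and the sample `[1, 2, 3, 10]` are chosen here for illustration only.

```python
import math


def population_skew(values):
    """Biased (population) skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2**1.5


def sample_skew(values):
    """Adjusted Fisher-Pearson skewness: G1 = sqrt(n(n-1)) / (n-2) * g1."""
    n = len(values)
    return math.sqrt(n * (n - 1)) / (n - 2) * population_skew(values)


symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 10]

print(population_skew(symmetric))
print(round(population_skew(skewed), 4), round(sample_skew(skewed), 4))
```

For four values the correction factor `sqrt(n(n-1)) / (n-2)` is about 1.73, so the two results differ substantially. The factor approaches 1 as `n` grows.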

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def skew(self, *cols: ColumnInputType) -> "DataFrame":
    """Performs a global skew on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to compute skewness for

    Returns:
        DataFrame: Globally aggregated skewness. Should be a single row.

    Note:
        Daft uses the **biased (population) skewness** formula, which is equivalent to
        ``scipy.stats.skew(bias=True)``. This differs from pandas' default ``DataFrame.skew()``,
        which uses the adjusted Fisher-Pearson (sample) formula. For small samples, the two
        formulas can produce meaningfully different results.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3, 4, 5]})
        >>> df = df.skew("col_a")
        >>> df.show()
        ╭─────────╮
        │ col_a   │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 0       │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    return self._apply_agg_fn(Expression.skew, cols)

skip_existing #

skip_existing(existing_path: str | Path | list[str | Path], key_column: str | list[str], file_format: str | FileFormat, io_config: IOConfig | None = None, num_workers: int = 4, cpus_per_worker: float = 0.5, keys_load_batch_size: int = 100000, max_concurrency_per_worker: int = 1, filter_batch_size: int = 10000, **reader_args: Any) -> DataFrame

Filter out rows whose key(s) already exist in existing data (i.e., already processed rows).

This method reads existing data from the given path(s), builds a Ray actor-backed distributed key filter from the existing key columns, and filters the current DataFrame to only include rows whose key(s) are not present in the existing data. This is useful for incremental data processing pipelines where you want to avoid re-processing data that has already been written.

Missing paths are treated permissively: if none of the provided paths exist, the current DataFrame is returned unchanged; if only some paths exist, Daft logs a warning and continues with the existing subset.

Parameters:

Name	Type	Description	Default
`existing_path`	`str \| Path \| list[str \| Path]`	Path or list of paths to the existing data directory/file(s).	required
`key_column`	`str \| list[str]`	Column name(s) to use as the key for matching. Can be a single column name or a list of column names for composite keys.	required
`file_format`	`str \| FileFormat`	Format of the existing data files. Supported formats are Parquet, CSV, and JSON/JSONL/NDJSON.	required
`io_config`	`IOConfig \| None`	IO configuration for reading the existing data.	`None`
`num_workers`	`int`	Number of Ray actors to spawn for key filtering. Each actor holds a shard of existing keys and filters incoming partitions in parallel. Higher values increase parallelism and typically reduce per-actor memory usage.	`4`
`cpus_per_worker`	`float`	Number of CPUs to allocate per Ray actor.	`0.5`
`keys_load_batch_size`	`int`	Batch size when loading keys from existing data into actors.	`100000`
`max_concurrency_per_worker`	`int`	Maximum concurrency for per-actor operations.	`1`
`filter_batch_size`	`int`	Batch size for the key filter operation. Controls how many rows are sent to the key filter actors per RPC call. Larger values reduce RPC overhead but increase memory usage proportionally across all concurrent tasks (total memory ≈ num_tasks × filter_batch_size × avg_key_size). For lightweight keys (int, short string), 10000-50000 works well. For large keys (URLs, long strings), keep this lower to avoid excessive memory usage. Defaults to 10000.	`10000`
`**reader_args`	`Any`	Additional keyword arguments forwarded to the underlying reader for `file_format` when scanning `existing_path`.	`{}`
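The memory estimate given for `filter_batch_size` can be worked through with concrete numbers. The figures below (8 concurrent tasks, 100-byte keys) are hypothetical values picked for the arithmetic, and `estimated_filter_memory_bytes` is an illustrative helper rather than a Daft function.

```python
def estimated_filter_memory_bytes(num_tasks, filter_batch_size, avg_key_size_bytes):
    """total memory ≈ num_tasks × filter_batch_size × avg_key_size"""
    return num_tasks * filter_batch_size * avg_key_size_bytes


# Hypothetical workload: 8 concurrent tasks, keys of roughly 100 bytes each.
default_batch = estimated_filter_memory_bytes(8, 10_000, 100)
large_batch = estimated_filter_memory_bytes(8, 50_000, 100)

print(f"{default_batch / 1e6:.0f} MB vs {large_batch / 1e6:.0f} MB")
```

The estimate scales linearly in each factor, so raising the batch size fivefold raises the in-flight key memory fivefold as well.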

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame with rows filtered to exclude those whose keys exist in the existing data.

Raises:

Type	Description
`ValueError`	If key columns are invalid, paths are empty, or parameters are out of range.
`RuntimeError`	If the existing data cannot be read during execution or key filter resources cannot be allocated.

Examples:

>>> import daft
>>> import tempfile
>>> from pathlib import Path
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> df = daft.from_pydict({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
>>> # Filter out rows where 'id' already exists in local Parquet data
>>> daft.set_runner_ray()
>>> with tempfile.TemporaryDirectory() as tmpdir:
...     pq.write_table(pa.table({"id": [1, 3]}), Path(tmpdir) / "part-0.parquet")
...     filtered_df = df.skip_existing(
...         existing_path=tmpdir,
...         key_column="id",
...         file_format="parquet",
...     ).collect()
...     filtered_df.select("id").to_pydict()["id"]

[2, 4]
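Conceptually, the key filter behaves like an anti-join against the set of existing keys, with the keys sharded across workers. The sketch below shows that idea in a single process with plain Python sets. `skip_existing_rows` is a hypothetical helper, and the sharding by `hash(key) % num_workers` is an assumption made for illustration rather than Daft's Ray-based implementation.

```python
def skip_existing_rows(rows, existing_rows, key_columns, num_workers=4):
    """Conceptual key filter (not Daft's Ray-based implementation)."""
    # Shard the existing keys, mimicking one key set per worker.
    shards = [set() for _ in range(num_workers)]
    for row in existing_rows:
        key = tuple(row[c] for c in key_columns)
        shards[hash(key) % num_workers].add(key)

    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        # Only the shard that owns this key needs to be consulted.
        if key not in shards[hash(key) % num_workers]:
            kept.append(row)
    return kept


rows = [{"id": i, "value": v} for i, v in zip([1, 2, 3, 4], "abcd")]
existing = [{"id": 1}, {"id": 3}]

filtered = skip_existing_rows(rows, existing, key_columns=["id"])
print([row["id"] for row in filtered])
```

Building the key as a tuple means composite keys work the same way: passing several names in `key_columns` compares the combination of values rather than each column separately.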

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def skip_existing(
    self,
    existing_path: str | pathlib.Path | list[str | pathlib.Path],
    key_column: str | list[str],
    file_format: str | FileFormat,
    io_config: IOConfig | None = None,
    num_workers: int = 4,
    cpus_per_worker: float = 0.5,
    keys_load_batch_size: int = 100000,
    max_concurrency_per_worker: int = 1,
    filter_batch_size: int = 10000,
    **reader_args: Any,
) -> "DataFrame":
    """Filter out rows whose key(s) already exist in existing data (i.e., already processed rows).

    This method reads existing data from the given path(s), builds a Ray actor-backed
    distributed key filter from the existing key columns, and filters the current
    DataFrame to only include rows whose key(s) are not present in the existing data.
    This is useful for incremental data processing pipelines where you want to avoid
    re-processing data that has already been written.

    Missing paths are treated permissively:
    if none of the provided paths exist, the current DataFrame is returned unchanged;
    if only some paths exist, Daft logs a warning and continues with the existing subset.

    Args:
        existing_path: Path or list of paths to the existing data directory/file(s).
        key_column: Column name(s) to use as the key for matching. Can be a single column name
            or a list of column names for composite keys.
        file_format: Format of the existing data files. Supported formats are Parquet, CSV,
            and JSON/JSONL/NDJSON.
        io_config: IO configuration for reading the existing data.
        num_workers: Number of Ray actors to spawn for key filtering. Each actor holds a
            shard of existing keys and filters incoming partitions in parallel. Higher values
            increase parallelism and typically reduce per-actor memory usage.
        cpus_per_worker: Number of CPUs to allocate per Ray actor.
        keys_load_batch_size: Batch size when loading keys from existing data into actors.
        max_concurrency_per_worker: Maximum concurrency for per-actor operations.
        filter_batch_size: Batch size for the key filter operation. Controls how many rows
            are sent to the key filter actors per RPC call. Larger values reduce RPC
            overhead but increase memory usage proportionally across all concurrent tasks
            (total memory ≈ num_tasks × filter_batch_size × avg_key_size). For lightweight
            keys (int, short string), 10000-50000 works well. For large keys (URLs, long
            strings), keep this lower to avoid excessive memory usage. Defaults to 10000.
        **reader_args: Additional keyword arguments forwarded to the underlying reader for
            `file_format` when scanning `existing_path`.

    Returns:
        DataFrame: A new DataFrame with rows filtered to exclude those whose keys exist
            in the existing data.

    Raises:
        ValueError: If key columns are invalid, paths are empty, or parameters are out of range.
        RuntimeError: If the existing data cannot be read during execution or key filter
            resources cannot be allocated.

    Examples:
        >>> import daft
        >>> import tempfile
        >>> from pathlib import Path
        >>> import pyarrow as pa
        >>> import pyarrow.parquet as pq
        >>> df = daft.from_pydict({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
        >>> # Filter out rows where 'id' already exists in local Parquet data
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>> with tempfile.TemporaryDirectory() as tmpdir:  # doctest: +SKIP
        ...     pq.write_table(pa.table({"id": [1, 3]}), Path(tmpdir) / "part-0.parquet")
        ...     filtered_df = df.skip_existing(
        ...         existing_path=tmpdir,
        ...         key_column="id",
        ...         file_format="parquet",
        ...     ).collect()
        ...     filtered_df.select("id").to_pydict()["id"]
        [2, 4]
    """
    if isinstance(file_format, str):
        fmt = file_format.strip().lower()
        if fmt == "parquet":
            file_format = FileFormat.Parquet
        elif fmt == "csv":
            file_format = FileFormat.Csv
        elif fmt in ("json", "jsonl", "ndjson"):
            file_format = FileFormat.Json
        else:
            raise ValueError(f"[skip_existing] Unsupported format: {file_format}")

    if file_format not in (FileFormat.Parquet, FileFormat.Csv, FileFormat.Json):
        raise ValueError(f"[skip_existing] Unsupported format: {file_format}")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    existing_path_strs: list[str]
    if isinstance(existing_path, list):
        existing_path_strs = [str(p) for p in existing_path]
    else:
        existing_path_strs = [str(existing_path)]
    if not existing_path_strs or any(path == "" for path in existing_path_strs):
        raise ValueError("[skip_existing] existing_path must be a non-empty list of non-empty paths")

    if not isinstance(key_column, list):
        key_column = [key_column]
    if not key_column or any(not isinstance(key, str) or key == "" for key in key_column):
        raise ValueError("[skip_existing] key_column must be a non-empty list of non-empty column names")

    from pyarrow.fs import FileType

    from daft.filesystem import _resolve_paths_and_filesystem

    resolved_paths, fs = _resolve_paths_and_filesystem(existing_path_strs, io_config=io_config)
    infos = fs.get_file_info(resolved_paths)

    original_existing_path_strs = existing_path_strs
    existing_path_strs = [
        path for path, info in zip(original_existing_path_strs, infos) if info.type != FileType.NotFound
    ]
    missing_path_strs = [
        path for path, info in zip(original_existing_path_strs, infos) if info.type == FileType.NotFound
    ]

    if missing_path_strs:
        if not existing_path_strs:
            logger.warning(
                "[skip_existing] No existing data found at %s, processing all rows.",
                original_existing_path_strs,
            )
            return self

        logger.warning(
            "[skip_existing] Some existing data paths were not found at %s; continuing with existing paths %s.",
            missing_path_strs,
            existing_path_strs,
        )

    from daft.daft import KeyFilteringConfig

    read_fn: Callable[..., DataFrame]
    if file_format == FileFormat.Parquet:
        from daft.io._parquet import read_parquet

        read_fn = read_parquet
    elif file_format == FileFormat.Csv:
        from daft.io._csv import read_csv

        read_fn = read_csv
    else:
        from daft.io._json import read_json

        read_fn = read_json

    key_exprs = column_inputs_to_expressions(tuple(key_column))
    key_filtering_config = KeyFilteringConfig(
        num_workers=num_workers,
        cpus_per_worker=cpus_per_worker,
        keys_load_batch_size=keys_load_batch_size,
        max_concurrency_per_worker=max_concurrency_per_worker,
        filter_batch_size=filter_batch_size,
    )
    read_kwargs = dict(reader_args)
    right_df = read_fn(path=existing_path_strs, io_config=io_config, **read_kwargs)

    builder = self._builder.join(
        right=right_df._builder,
        left_on=key_exprs,
        right_on=key_exprs,
        how=JoinType.Anti,
        strategy=JoinStrategy.KeyFiltering,
        key_filtering_config=key_filtering_config,
    )
    return DataFrame(builder)
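
The filtering described above behaves like an anti-join on the key columns. The following is a minimal pure-Python sketch of that semantics, not Daft's Ray actor-backed implementation. The function name `skip_existing_rows` and the list-of-dicts row representation are hypothetical and chosen only for illustration.

```python
def skip_existing_rows(rows, existing_rows, key_column):
    """Keep only rows whose key is absent from existing_rows (an anti-join)."""
    # Accept a single column name or a list of names for composite keys.
    keys = key_column if isinstance(key_column, list) else [key_column]
    if not keys or any(not isinstance(k, str) or k == "" for k in keys):
        raise ValueError("key_column must be a non-empty list of non-empty column names")

    # Composite keys are compared as tuples of the key column values.
    existing_keys = {tuple(row[k] for k in keys) for row in existing_rows}
    return [row for row in rows if tuple(row[k] for k in keys) not in existing_keys]


rows = [{"id": i, "value": v} for i, v in zip([1, 2, 3, 4], "abcd")]
existing = [{"id": 1}, {"id": 3}]
remaining = skip_existing_rows(rows, existing, "id")
print([row["id"] for row in remaining])  # [2, 4]
```

With no existing rows, every input row passes through unchanged, which parallels the documented behaviour when none of the provided paths exist.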

sort #

sort(by: ColumnInputType | list[ColumnInputType], desc: bool | list[bool] = False, nulls_first: bool | list[bool] | None = None) -> DataFrame

Sorts DataFrame globally.

Parameters:

Name	Type	Description	Default
`by`	`Union[ColumnInputType, List[ColumnInputType]]`	column to sort by. Can be `str` or expression as well as a list of either.	required
`desc`	`Union[bool, List[bool]]`	Sort by descending order. Defaults to False.	`False`
`nulls_first`	`Union[bool, List[bool]]`	Sort by nulls first. Defaults to nulls being treated as the greatest value.	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Sorted DataFrame.

Note

Since this is a global sort, it requires an expensive repartition, which can be quite slow.
Supports multicolumn sorts and can have unique descending and nulls_first flags per column.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [3, 2, 1], "y": [6, 4, 5]})
>>> sorted_df = df.sort(df["x"] + df["y"])
>>> sorted_df.show()

╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)

You can also sort by multiple columns, and specify the 'descending' flag for each column:

>>> df = daft.from_pydict({"x": [1, 2, 1, 2], "y": [9, 8, 7, 6]})
>>> sorted_df = df.sort(["x", "y"], [True, False])
>>> sorted_df.show()

╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 9     │
╰───────┴───────╯
(Showing first 4 of 4 rows)

You can also specify null positioning (first/last) for each column:

>>> df = daft.from_pydict({"x": [1, 2, 1, 2, None], "y": [9, 8, None, 6, None]})
>>> sorted_df = df.sort(["x", "y"], [True, False], nulls_first=[True, True])
>>> sorted_df.show()

╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ None  ┆ None  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ None  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 9     │
╰───────┴───────╯
(Showing first 5 of 5 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def sort(
    self,
    by: ColumnInputType | list[ColumnInputType],
    desc: bool | list[bool] = False,
    nulls_first: bool | list[bool] | None = None,
) -> "DataFrame":
    """Sorts DataFrame globally.

    Args:
        by (Union[ColumnInputType, List[ColumnInputType]]): column to sort by. Can be `str` or expression as well as a list of either.
        desc (Union[bool, List[bool]], optional): Sort by descending order. Defaults to False.
        nulls_first (Union[bool, List[bool]], optional): Sort by nulls first. Defaults to nulls being treated as the greatest value.

    Returns:
        DataFrame: Sorted DataFrame.

    Note:
        * Since this is a global sort, it requires an expensive repartition, which can be quite slow.
        * Supports multicolumn sorts and can have unique `descending` and `nulls_first` flags per column.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [3, 2, 1], "y": [6, 4, 5]})
        >>> sorted_df = df.sort(df["x"] + df["y"])
        >>> sorted_df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 2     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)

        You can also sort by multiple columns, and specify the 'descending' flag for each column:

        >>> df = daft.from_pydict({"x": [1, 2, 1, 2], "y": [9, 8, 7, 6]})
        >>> sorted_df = df.sort(["x", "y"], [True, False])
        >>> sorted_df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 2     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 9     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)

        You can also specify null positioning (first/last) for each column:

        >>> df = daft.from_pydict({"x": [1, 2, 1, 2, None], "y": [9, 8, None, 6, None]})
        >>> sorted_df = df.sort(["x", "y"], [True, False], nulls_first=[True, True])
        >>> sorted_df.show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ None  ┆ None  │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ None  │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 9     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)
    """
    if not isinstance(by, list):
        by = [
            by,
        ]

    if nulls_first is None:
        nulls_first = desc

    sort_by = column_inputs_to_expressions(by)

    builder = self._builder.sort(sort_by=sort_by, descending=desc, nulls_first=nulls_first)
    return DataFrame(builder)
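
The interaction between per-column `desc` and `nulls_first` flags can be illustrated with a small pure-Python sketch. It is not how Daft sorts internally. `sort_rows` is a hypothetical helper that takes list-valued flags only, and it copies the fallback shown in the source above, where `nulls_first` defaults to `desc` when it is `None`.

```python
def sort_rows(rows, by, desc, nulls_first=None):
    """Multi-column sort over a list of dicts with per-column flags."""
    # Same fallback as the source above: nulls_first defaults to desc.
    if nulls_first is None:
        nulls_first = desc
    # Applying stable sorts from the last key to the first yields a multi-column sort.
    for key, descending, nulls_go_first in reversed(list(zip(by, desc, nulls_first))):
        nulls = [r for r in rows if r[key] is None]
        non_nulls = sorted(
            (r for r in rows if r[key] is not None),
            key=lambda r: r[key],
            reverse=descending,
        )
        rows = nulls + non_nulls if nulls_go_first else non_nulls + nulls
    return rows


xs = [1, 2, 1, 2, None]
ys = [9, 8, None, 6, None]
rows = [{"x": x, "y": y} for x, y in zip(xs, ys)]

explicit = sort_rows(rows, ["x", "y"], [True, False], nulls_first=[True, True])
print([(r["x"], r["y"]) for r in explicit])
# [(None, None), (2, 6), (2, 8), (1, None), (1, 9)]

defaulted = sort_rows(rows, ["x", "y"], [True, False])
print([(r["x"], r["y"]) for r in defaulted])
# [(None, None), (2, 6), (2, 8), (1, 9), (1, None)]
```

The first call reproduces the row order of the last documented example. In the second call the ascending `y` column places its null last, because in this sketch a null is treated as the greatest value unless `nulls_first` says otherwise.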

stddev #

stddev(*cols: ColumnInputType, ddof: int = 1) -> DataFrame

Performs a global standard deviation on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to stddev	`()`
`ddof`	`int`	Delta degrees of freedom used in the denominator `N - ddof`. Defaults to 1 (sample standard deviation).	`1`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated standard deviation. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [0, 1, 2]})
>>> df = df.stddev("col_a")
>>> df.show()

╭─────────╮
│ col_a   │
│ ---     │
│ Float64 │
╞═════════╡
│ 1       │
╰─────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def stddev(self, *cols: ColumnInputType, ddof: int = 1) -> "DataFrame":
    """Performs a global standard deviation on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to stddev
        ddof (int): Delta degrees of freedom used in the denominator `N - ddof`.
            Defaults to 1 (sample standard deviation).

    Returns:
        DataFrame: Globally aggregated standard deviation. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [0, 1, 2]})
        >>> df = df.stddev("col_a")
        >>> df.show()
        ╭─────────╮
        │ col_a   │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1       │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    return self._apply_agg_fn(lambda expr: Expression.stddev(expr, ddof), cols)
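
The effect of `ddof` on the denominator `N - ddof` can be checked by hand. The sketch below is plain Python rather than Daft code; the helper name `stddev` is reused only for readability, and the standard library's `statistics` module serves as a cross-check.

```python
import math
import statistics


def stddev(values, ddof=1):
    """Standard deviation with denominator N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


data = [0, 1, 2]
sample = stddev(data)              # ddof=1: sample standard deviation
population = stddev(data, ddof=0)  # ddof=0: population standard deviation

print(sample)  # 1.0
print(math.isclose(population, statistics.pstdev(data)))  # True
```

With `ddof=1` the result for `[0, 1, 2]` is 1, matching the value in the example above.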

sum #

sum(*cols: ManyColumnsInputType) -> DataFrame

Performs a global sum on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to sum	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated sums. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3]})
>>> df = df.sum("col_a")
>>> df.show()

╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 6     │
╰───────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def sum(self, *cols: ManyColumnsInputType) -> "DataFrame":
    """Performs a global sum on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to sum

    Returns:
        DataFrame: Globally aggregated sums. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3]})
        >>> df = df.sum("col_a")
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 6     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    return self._apply_agg_fn(Expression.sum, cols)

summarize #

summarize() -> DataFrame

Returns column statistics for the DataFrame.

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	new DataFrame with the computed column statistics.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
>>> df.summarize().show()

╭────────┬────────┬────────┬────────────┬────────┬─────────────┬───────────────────────╮
│ column ┆ type   ┆ min    ┆      …     ┆ count  ┆ count_nulls ┆ approx_count_distinct │
│ ---    ┆ ---    ┆ ---    ┆            ┆ ---    ┆ ---         ┆ ---                   │
│ String ┆ String ┆ String ┆ (1 hidden) ┆ UInt64 ┆ UInt64      ┆ UInt64                │
╞════════╪════════╪════════╪════════════╪════════╪═════════════╪═══════════════════════╡
│ x      ┆ Int64  ┆ 1      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ y      ┆ Int64  ┆ 4      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ z      ┆ Int64  ┆ 7      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
╰────────┴────────┴────────┴────────────┴────────┴─────────────┴───────────────────────╯
(Showing first 3 of 3 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def summarize(self) -> "DataFrame":
    """Returns column statistics for the DataFrame.

    Returns:
        DataFrame: new DataFrame with the computed column statistics.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]})
        >>> df.summarize().show()  # doctest: +SKIP
        ╭────────┬────────┬────────┬────────────┬────────┬─────────────┬───────────────────────╮
        │ column ┆ type   ┆ min    ┆      …     ┆ count  ┆ count_nulls ┆ approx_count_distinct │
        │ ---    ┆ ---    ┆ ---    ┆            ┆ ---    ┆ ---         ┆ ---                   │
        │ String ┆ String ┆ String ┆ (1 hidden) ┆ UInt64 ┆ UInt64      ┆ UInt64                │
        ╞════════╪════════╪════════╪════════════╪════════╪═════════════╪═══════════════════════╡
        │ x      ┆ Int64  ┆ 1      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ y      ┆ Int64  ┆ 4      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
        ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ z      ┆ Int64  ┆ 7      ┆ …          ┆ 3      ┆ 0           ┆ 3                     │
        ╰────────┴────────┴────────┴────────────┴────────┴─────────────┴───────────────────────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.summarize()
    return DataFrame(builder)

to_arrow #

to_arrow() -> Table

Converts the current DataFrame to a pyarrow Table.

If results have not been computed yet, collect will be called.

Returns:

Type	Description
`Table`	pyarrow.Table: pyarrow Table converted from a Daft DataFrame

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> arrow_table = df.to_arrow()
>>> print(arrow_table)

pyarrow.Table
a: int64
b: int64
----
a: [[1,2,3]]
b: [[4,5,6]]

Tip

See also DataFrame.to_arrow_iter() for a streaming iterator over the rows of the DataFrame as Arrow RecordBatches.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_arrow(self) -> "pyarrow.Table":
    """Converts the current DataFrame to a [pyarrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html).

    If results have not been computed yet, collect will be called.

    Returns:
        pyarrow.Table: [pyarrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html) converted from a Daft DataFrame

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> arrow_table = df.to_arrow()
        >>> print(arrow_table)
        pyarrow.Table
        a: int64
        b: int64
        ----
        a: [[1,2,3]]
        b: [[4,5,6]]

    Tip:
        See also [DataFrame.to_arrow_iter()][daft.DataFrame.to_arrow_iter] for
        a streaming iterator over the rows of the DataFrame as Arrow RecordBatches.
    """
    import pyarrow as pa

    arrow_rb_iter = self.to_arrow_iter(results_buffer_size=None)
    return pa.Table.from_batches(arrow_rb_iter, schema=self.schema().to_pyarrow_schema())

to_arrow_iter #

to_arrow_iter(results_buffer_size: int | None | Literal['num_cpus'] = 'num_cpus') -> Iterator[RecordBatch]

Returns an iterator of pyarrow RecordBatches for this DataFrame.

Parameters:

Name	Type	Description	Default
`results_buffer_size`	`int \| None \| Literal['num_cpus']`	how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).	`'num_cpus'`

Note

The results_buffer_size kwarg configures asynchronous/parallel execution. It controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

* Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput.
* Decreasing this value means the iterator will consume less memory and fewer CPU resources, but have lower throughput.
* Setting this value to None means the iterator will consume as many resources as it deems appropriate per iteration.

The default value is the total number of CPUs available on the current machine.

Returns:

Type	Description
`Iterator[RecordBatch]`	Iterator[pyarrow.RecordBatch]: An iterator over the RecordBatches of the DataFrame.

Examples:

>>> import daft
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
>>> for batch in df.to_arrow_iter():
...     print(batch)

pyarrow.RecordBatch
foo: int64
bar: large_string
----
foo: [1,2,3]
bar: ["a","b","c"]

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_arrow_iter(
    self,
    results_buffer_size: int | None | Literal["num_cpus"] = "num_cpus",
) -> Iterator["pyarrow.RecordBatch"]:
    """Returns an iterator of pyarrow RecordBatches for this DataFrame.

    Args:
        results_buffer_size: how many partitions to allow in the results buffer (defaults to the total number of CPUs
            available on the machine).

    Note:
        The `results_buffer_size` kwarg configures asynchronous/parallel execution. It controls how many
        results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will
        not run any more work until some partition is consumed from the buffer.

        * Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
        * Decreasing this value means the iterator will consume less memory and fewer CPU resources, but have lower throughput
        * Setting this value to `None` means the iterator will consume as many resources as it deems appropriate per iteration

        The default value is the total number of CPUs available on the current machine.

    Returns:
        Iterator[pyarrow.RecordBatch]: An iterator over the RecordBatches of the DataFrame.

    Examples:
        >>> import daft
        >>>
        >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
        >>> for batch in df.to_arrow_iter():
        ...     print(batch)
        pyarrow.RecordBatch
        foo: int64
        bar: large_string
        ----
        foo: [1,2,3]
        bar: ["a","b","c"]
    """
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()
    if results_buffer_size is not None and not results_buffer_size > 0:
        raise ValueError(f"Provided `results_buffer_size` value must be > 0, received: {results_buffer_size}")

    results = self._result
    if results is not None:
        # If the dataframe has already finished executing,
        # use the precomputed results.

        for _, result in results.items():
            yield from (result.micropartition().to_arrow().to_batches())
    else:
        # Execute the dataframe in a streaming fashion.
        partitions_iter = get_or_create_runner().run_iter_tables(
            self._builder, results_buffer_size=results_buffer_size
        )

        # Iterate through partitions.
        for partition in partitions_iter:
            yield from partition.to_arrow().to_batches()
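
The backpressure behaviour that `results_buffer_size` describes can be pictured as a bounded queue between a producer and the consuming iterator. The sketch below uses only the standard library and is not Daft's runner code. `resolve_buffer_size` copies the validation shown in the source above, while `buffered_iter` and the choice to treat `None` as an unbounded queue are assumptions made for illustration.

```python
import multiprocessing
import queue
import threading


def resolve_buffer_size(results_buffer_size="num_cpus"):
    """Resolve and validate the buffer size the same way the source above does."""
    if results_buffer_size == "num_cpus":
        results_buffer_size = multiprocessing.cpu_count()
    if results_buffer_size is not None and not results_buffer_size > 0:
        raise ValueError(f"Provided `results_buffer_size` value must be > 0, received: {results_buffer_size}")
    return results_buffer_size


def buffered_iter(produce, results_buffer_size="num_cpus"):
    """Yield items from produce() while holding at most buffer-size results."""
    size = resolve_buffer_size(results_buffer_size)
    # Assumption: None maps to an unbounded queue (maxsize=0) in this sketch.
    buffer = queue.Queue(maxsize=size or 0)
    done = object()

    def worker():
        for item in produce():
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()  # consuming an item frees a slot for the producer
        if item is done:
            return
        yield item


partitions = list(buffered_iter(lambda: iter(range(5)), results_buffer_size=2))
print(partitions)  # [0, 1, 2, 3, 4]
```

A smaller buffer makes the producer stall sooner, which lowers memory use at the cost of throughput. A larger buffer lets more work run ahead of the consumer.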

to_dask_dataframe #

to_dask_dataframe(meta: Union[DataFrame, Series[Any], dict[str, Any], Iterable[Any], tuple[Any], None] = None) -> DataFrame

Converts the current Daft DataFrame to a Dask DataFrame.

The returned Dask DataFrame will use Dask-on-Ray to execute operations on a Ray cluster.

Parameters:

Name	Type	Description	Default
`meta`	`Union[DataFrame, Series[Any], dict[str, Any], Iterable[Any], tuple[Any], None]`	An empty pandas DataFrame or Series that matches the dtypes and column names of the stream. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of `{name: dtype}` or iterable of `(name, dtype)` can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of `(name, dtype)` can be used. By default, this will be inferred from the underlying Daft DataFrame schema, with this argument supplying an optional override.	`None`

Returns:

Type	Description
`DataFrame`	dask.DataFrame: A Dask DataFrame stored on a Ray cluster.

Note

This function can only work if Daft is running using the RayRunner.

Examples:

>>> import daft
>>> daft.set_runner_ray()
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> dask_df = df.to_dask_dataframe()

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_dask_dataframe(
    self,
    meta: Union[
        "pandas.DataFrame",
        "pandas.Series[Any]",
        dict[str, Any],
        Iterable[Any],
        tuple[Any],
        None,
    ] = None,
) -> "dask.DataFrame":
    """Converts the current Daft DataFrame to a Dask DataFrame.

    The returned Dask DataFrame will use [Dask-on-Ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)
    to execute operations on a Ray cluster.

    Args:
        meta: An empty [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) or [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) that matches the dtypes and column
            names of the stream. This metadata is necessary for many algorithms in
            dask dataframe to work. For ease of use, some alternative inputs are
            also available. Instead of a DataFrame, a dict of ``{name: dtype}`` or
            iterable of ``(name, dtype)`` can be provided (note that the order of
            the names should match the order of the columns). Instead of a series, a
            tuple of ``(name, dtype)`` can be used.
            By default, this will be inferred from the underlying Daft DataFrame schema,
            with this argument supplying an optional override.

    Returns:
        dask.DataFrame: A Dask DataFrame stored on a Ray cluster.

    Note:
        This function can only work if Daft is running using the RayRunner.

    Examples:
        >>> import daft
        >>> daft.set_runner_ray()  # doctest: +SKIP
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> dask_df = df.to_dask_dataframe()  # doctest: +SKIP

    """
    from daft.runners.ray_runner import RayPartitionSet

    self.collect()
    partition_set = self._result
    assert partition_set is not None
    # TODO(Clark): Support Dask DataFrame conversion for the local runner if
    # Dask is using a non-distributed scheduler.
    if not isinstance(partition_set, RayPartitionSet):
        raise ValueError("Cannot convert to Dask DataFrame if not running on Ray backend")
    return partition_set.to_dask_dataframe(meta)

to_pandas #

to_pandas(coerce_temporal_nanoseconds: bool = False) -> DataFrame

Converts the current DataFrame to a pandas DataFrame.

If results have not been computed yet, collect will be called.

Parameters:

Name	Type	Description	Default
`coerce_temporal_nanoseconds`	`bool`	Whether to coerce temporal columns to nanoseconds. Only applicable to pandas version >= 2.0 and pyarrow version >= 13.0.0. Defaults to False. See [pyarrow.Table.to_pandas](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas) for more information.	`False`

Returns:

Type	Description
`DataFrame`	pandas.DataFrame: pandas DataFrame converted from a Daft DataFrame

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> pd_df = df.to_pandas()
>>> print(pd_df)

   a  b
0  1  4
1  2  5
2  3  6

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_pandas(self, coerce_temporal_nanoseconds: bool = False) -> "pandas.DataFrame":
    """Converts the current DataFrame to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

    If results have not been computed yet, collect will be called.

    Args:
        coerce_temporal_nanoseconds (bool): Whether to coerce temporal columns to nanoseconds. Only applicable to pandas version >= 2.0 and pyarrow version >= 13.0.0. Defaults to False. See [pyarrow.Table.to_pandas](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas) for more information.

    Returns:
        pandas.DataFrame: [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) converted from a Daft DataFrame

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
        >>> pd_df = df.to_pandas()
        >>> print(pd_df)
           a  b
        0  1  4
        1  2  5
        2  3  6
    """
    self.collect()
    result = self._result
    assert result is not None

    pd_df = result.to_pandas(
        schema=self._builder.schema(),
        coerce_temporal_nanoseconds=coerce_temporal_nanoseconds,
    )
    return pd_df

to_pydict #

to_pydict(maps_as_pydicts: Literal['lossy', 'strict'] | None = None) -> dict[str, list[Any]]

Converts the current DataFrame to a Python dictionary. The dictionary contains Python lists of Python objects for each column.

If results have not been computed yet, collect will be called.

Parameters:

Name	Type	Description	Default
`maps_as_pydicts`	`Literal['lossy', 'strict'] \| None`	If None (default), Map values are converted to association lists (`list[tuple[key, value]]`) preserving order and duplicates. If `"lossy"` or `"strict"`, Map values are converted to Python dicts. `"lossy"` keeps the last value for duplicate keys and warns. `"strict"` raises on duplicate keys.	`None`

Returns:

Type	Description
`dict[str, list[Any]]`	dict[str, list[Any]]: Python dict converted from a Daft DataFrame

Note

This call is blocking and will execute the DataFrame when called

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
>>> print(df.to_pydict())

{'a': [1, 2, 3, 4], 'b': [2, 4, 3, 1]}

Tip

See also DataFrame.to_pylist() for a convenience method that converts the DataFrame to a list of Python dict objects.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_pydict(self, maps_as_pydicts: Literal["lossy", "strict"] | None = None) -> dict[str, list[Any]]:
    """Converts the current DataFrame to a Python dictionary. The dictionary contains Python lists of Python objects for each column.

    If results have not been computed yet, collect will be called.

    Args:
        maps_as_pydicts: If None (default), Map values are converted to association lists
            (`list[tuple[key, value]]`) preserving order and duplicates.
            If `"lossy"` or `"strict"`, Map values are converted to Python dicts.
            `"lossy"` keeps the last value for duplicate keys and warns.
            `"strict"` raises on duplicate keys.

    Returns:
        dict[str, list[Any]]: Python dict converted from a Daft DataFrame

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
        >>> print(df.to_pydict())
        {'a': [1, 2, 3, 4], 'b': [2, 4, 3, 1]}

    Tip:
        See also [DataFrame.to_pylist()][daft.DataFrame.to_pylist] for
        a convenience method that converts the DataFrame to a list of Python dict objects.
    """
    self.collect()
    result = self._result
    assert result is not None
    return result.to_pydict(schema=self.schema(), maps_as_pydicts=maps_as_pydicts)
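
The three `maps_as_pydicts` modes differ only in how a single Map value is turned into a Python object. The sketch below illustrates the described behaviour in plain Python and is not Daft's conversion code. `convert_map` is a hypothetical helper, and raising `ValueError` in strict mode is an assumption, since the documentation only says that it raises.

```python
import warnings


def convert_map(pairs, maps_as_pydicts=None):
    """Convert one Map value, given as (key, value) pairs, per the selected mode."""
    if maps_as_pydicts is None:
        # Association list: order and duplicate keys are preserved.
        return list(pairs)
    if maps_as_pydicts not in ("lossy", "strict"):
        raise ValueError(f"unsupported mode: {maps_as_pydicts!r}")

    result = {}
    for key, value in pairs:
        if key in result:
            if maps_as_pydicts == "strict":
                # Assumption: the exception type is chosen for this sketch.
                raise ValueError(f"duplicate map key: {key!r}")
            warnings.warn(f"duplicate map key {key!r}; keeping the last value")
        result[key] = value
    return result


pairs = [("a", 1), ("b", 2), ("a", 3)]
print(convert_map(pairs))  # [('a', 1), ('b', 2), ('a', 3)]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(convert_map(pairs, "lossy"))  # {'a': 3, 'b': 2}
```

The default keeps everything at the cost of a less convenient shape. `"lossy"` and `"strict"` trade that for dict lookups and differ only in how they treat duplicate keys.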

to_pylist #

to_pylist(maps_as_pydicts: Literal['lossy', 'strict'] | None = None) -> list[Any]

Converts the current DataFrame into a Python list.

Parameters:

Name	Type	Description	Default
`maps_as_pydicts`	`Literal['lossy', 'strict'] \| None`	If None (default), Map values are converted to association lists (`list[tuple[key, value]]`) preserving order and duplicates. If `"lossy"` or `"strict"`, Map values are converted to Python dicts. `"lossy"` keeps the last value for duplicate keys and warns. `"strict"` raises on duplicate keys.	`None`

Returns:

Type	Description
`list[Any]`	List[dict[str, Any]]: List of Python dict objects.

Warning

This is a convenience method over DataFrame.iter_rows(). Users should prefer using .iter_rows() directly instead for lower memory utilization if they are streaming rows out of a DataFrame and don't require full materialization of the Python list.

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]})
>>> print(df.to_pylist())

[{'a': 1, 'b': 2}, {'a': 2, 'b': 4}, {'a': 3, 'b': 3}, {'a': 4, 'b': 1}]
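
The relationship between the column-oriented result of `to_pydict()` and the row-oriented result of `to_pylist()` can be sketched in plain Python. `columns_to_rows` is a hypothetical helper for illustration and does not describe how Daft materializes rows.

```python
def columns_to_rows(columns):
    """Turn {column: [values]} into a list of {column: value} dicts, one per row."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = {"a": [1, 2, 3, 4], "b": [2, 4, 3, 1]}
rows = columns_to_rows(columns)
print(rows)
# [{'a': 1, 'b': 2}, {'a': 2, 'b': 4}, {'a': 3, 'b': 3}, {'a': 4, 'b': 1}]
```

The printed rows match the `to_pylist()` output above for the same input data.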

to_ray_dataset #

to_ray_dataset() -> DataSet

Converts the current DataFrame to a Ray Dataset which is useful for running distributed ML model training in Ray.

Returns:

Type	Description
`DataSet`	ray.data.dataset.DataSet: Ray dataset

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> ray_dataset = df.to_ray_dataset()

Note

This function requires Ray to be installed. It works with any Daft runner. With the native runner, partitions are converted to Arrow tables locally and then handed to Ray.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_ray_dataset(self) -> "ray.data.dataset.DataSet":
    """Converts the current DataFrame to a [Ray Dataset](https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset) which is useful for running distributed ML model training in Ray.

    Returns:
        ray.data.dataset.DataSet: [Ray dataset](https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset)

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> ray_dataset = df.to_ray_dataset()  # doctest: +SKIP

    Note:
        This function requires Ray to be installed. It works with any Daft runner -
        when using the native runner, partitions are converted to Arrow tables locally
        and then handed to Ray.
    """
    from daft.runners.ray_runner import RayPartitionSet

    self.collect()
    partition_set = self._result
    assert partition_set is not None
    if isinstance(partition_set, RayPartitionSet):
        return partition_set.to_ray_dataset()

    # Native runner path: convert MicroPartitions to Arrow tables locally,
    # then create a Ray Dataset from them.
    import ray.data

    from daft.runners.ray_runner import _micropartition_to_ray_dataset_block

    blocks = [_micropartition_to_ray_dataset_block(result.micropartition()) for _, result in partition_set.items()]
    # All partitions share the same schema, so either all convert to Arrow or all
    # fall back to pylist. Handle both cases.
    if blocks and isinstance(blocks[0], list):
        all_items = [item for block in blocks for item in block]
        return ray.data.from_items(all_items)
    return ray.data.from_arrow(blocks)

to_torch_iter_dataset #

to_torch_iter_dataset(shard_strategy: Literal['file'] | None = None, world_size: int | None = None, rank: int | None = None) -> IterableDataset

Convert the current DataFrame into a Torch IterableDataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) for use with PyTorch.

Begins execution of the DataFrame if it is not yet executed.

Items will be returned in pydict format: a dict of {"column name": value} for each row in the data.

Parameters:

Name	Type	Description	Default
`shard_strategy`	`Optional[Literal['file']]`	Strategy to use for sharding the dataset. Currently only "file" is supported.	`None`
`world_size`	`Optional[int]`	Total number of workers for sharding. Required if shard_strategy is specified.	`None`
`rank`	`Optional[int]`	Rank of current worker for sharding. Required if shard_strategy is specified.	`None`

Returns:

Type	Description
`IterableDataset`	torch.utils.data.IterableDataset: A PyTorch IterableDataset containing the data from the DataFrame.

Examples:

>>> import daft
>>> import torch
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> torch_iter_dataset = df.to_torch_iter_dataset()
>>> list(torch.utils.data.DataLoader(torch_iter_dataset))

[{'x': tensor([1]), 'y': tensor([4])}, {'x': tensor([2]), 'y': tensor([5])}, {'x': tensor([3]), 'y': tensor([6])}]

Note

The produced dataset is meant to be used with the single-process DataLoader, and does not support data sharding hooks for multi-process data loading.

Do keep in mind that Daft is already using multithreading or multiprocessing under the hood to compute the data stream that feeds this dataset.

Tip

This method returns results locally. For distributed training, you may want to use DataFrame.to_ray_dataset().
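
As a rough illustration of how `world_size` and `rank` partition work across workers, the sketch below assigns files round-robin. This is an assumption for illustration only: `shard_files` is a hypothetical helper, and the assignment Daft actually performs for the `"file"` strategy may differ.

```python
from __future__ import annotations


def shard_files(files: list[str], world_size: int | None, rank: int | None) -> list[str]:
    """Hypothetical round-robin file assignment; not Daft's sharding logic."""
    # Both values are required once sharding is requested.
    if world_size is None or rank is None:
        raise ValueError("world_size and rank must be specified when using sharding")
    # Assumption of this sketch: ranks are numbered 0 .. world_size - 1.
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Worker `rank` takes every world_size-th file, starting at its own index.
    return files[rank::world_size]


files = [f"part-{i}.parquet" for i in range(5)]
shards = [shard_files(files, world_size=2, rank=r) for r in range(2)]
```

Each file lands in exactly one shard, so the workers together cover the dataset without overlap.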

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_torch_iter_dataset(
    self,
    shard_strategy: Literal["file"] | None = None,
    world_size: int | None = None,
    rank: int | None = None,
) -> "torch.utils.data.IterableDataset":
    """Convert the current DataFrame into a `Torch IterableDataset <https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset>`__ for use with PyTorch.

    Begins execution of the DataFrame if it is not yet executed.

    Items will be returned in pydict format: a dict of `{"column name": value}` for each row in the data.

    Args:
        shard_strategy (Optional[Literal["file"]]): Strategy to use for sharding the dataset. Currently only "file" is supported.
        world_size (Optional[int]): Total number of workers for sharding. Required if shard_strategy is specified.
        rank (Optional[int]): Rank of current worker for sharding. Required if shard_strategy is specified.

    Returns:
        torch.utils.data.IterableDataset: A PyTorch IterableDataset containing the data from the DataFrame.

    Examples:
        >>> import daft
        >>> import torch  # doctest: +SKIP
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> torch_iter_dataset = df.to_torch_iter_dataset()  # doctest: +SKIP
        >>> list(torch.utils.data.DataLoader(torch_iter_dataset))  # doctest: +SKIP
        [{'x': tensor([1]), 'y': tensor([4])}, {'x': tensor([2]), 'y': tensor([5])}, {'x': tensor([3]), 'y': tensor([6])}]

    Note:
        The produced dataset is meant to be used with the single-process DataLoader,
        and does not support data sharding hooks for multi-process data loading.

        Do keep in mind that Daft is already using multithreading or multiprocessing under the hood
        to compute the data stream that feeds this dataset.

    Tip:
        This method returns results locally.
        For distributed training, you may want to use [DataFrame.to_ray_dataset()][daft.DataFrame.to_ray_dataset].
    """
    from daft.dataframe.to_torch import DaftTorchIterableDataset

    # TODO(desmond): We need to take in the batch size and number of epochs. So that when we shard, we can ensure that each shard produces
    # the same number of batches without coordination.

    if shard_strategy is not None:
        if world_size is None or rank is None:
            raise ValueError("world_size and rank must be specified when using sharding")
        df = self._shard(shard_strategy, world_size, rank)
    else:
        df = self

    return DaftTorchIterableDataset(df)

to_torch_map_dataset #

to_torch_map_dataset(shard_strategy: Literal['file'] | None = None, world_size: int | None = None, rank: int | None = None) -> Dataset

Convert the current DataFrame into a map-style Torch Dataset for use with PyTorch.

This method will materialize the entire DataFrame and block on completion.

Items will be returned in pydict format: a dict of {"column name": value} for each row in the data.

Note

If you do not need random access, you may get better performance out of an IterableDataset, which streams data items in as soon as they are ready and does not block on full materialization.

Tip

This method returns results locally. For distributed training, you may want to use DataFrame.to_ray_dataset().

Parameters:

Name	Type	Description	Default
`shard_strategy`	`Optional[Literal['file']]`	Strategy to use for sharding the dataset. Currently only "file" is supported.	`None`
`world_size`	`Optional[int]`	Total number of workers for sharding. Required if shard_strategy is specified.	`None`
`rank`	`Optional[int]`	Rank of current worker for sharding. Required if shard_strategy is specified.	`None`

Returns:

Type	Description
`Dataset`	torch.utils.data.Dataset: A PyTorch Dataset containing the data from the DataFrame.

Note

The produced dataset is meant to be used with the single-process DataLoader, and does not support data sharding hooks for multi-process data loading.

Examples:

>>> import daft
>>> import torch
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> torch_dataset = df.to_torch_map_dataset()
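
A map-style dataset supports `len()` and random access by index, which is why the full data must be materialized first. The sketch below shows that shape in plain Python over a column dict. `PydictMapDataset` is a hypothetical stand-in written for this example, not Daft's or PyTorch's class.

```python
from __future__ import annotations

from typing import Any


class PydictMapDataset:
    """Hypothetical map-style dataset over a {"column name": values} dict."""

    def __init__(self, data: dict[str, list[Any]]) -> None:
        self._data = data
        # All columns have the same number of rows, so any column gives the length.
        self._length = len(next(iter(data.values()))) if data else 0

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int) -> dict[str, Any]:
        if not -self._length <= index < self._length:
            raise IndexError(index)
        # One item is one row in pydict format: {"column name": value}.
        return {name: values[index] for name, values in self._data.items()}


dataset = PydictMapDataset({"x": [1, 2, 3], "y": [4, 5, 6]})
last_row = dataset[2]
```

Because any row can be requested at any time, a dataset like this cannot stream. That is the trade-off against the iterable variant described above.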

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def to_torch_map_dataset(
    self,
    shard_strategy: Literal["file"] | None = None,
    world_size: int | None = None,
    rank: int | None = None,
) -> "torch.utils.data.Dataset":
    """Convert the current DataFrame into a map-style [Torch Dataset](https://pytorch.org/docs/stable/data.html#map-style-datasets) for use with PyTorch.

    This method will materialize the entire DataFrame and block on completion.

    Items will be returned in pydict format: a dict of `{"column name": value}` for each row in the data.

    Note:
        If you do not need random access, you may get better performance out of an IterableDataset,
        which streams data items in as soon as they are ready and does not block on full materialization.

    Tip:
        This method returns results locally.
        For distributed training, you may want to use [DataFrame.to_ray_dataset()][daft.DataFrame.to_ray_dataset].

    Args:
        shard_strategy (Optional[Literal["file"]]): Strategy to use for sharding the dataset. Currently only "file" is supported.
        world_size (Optional[int]): Total number of workers for sharding. Required if shard_strategy is specified.
        rank (Optional[int]): Rank of current worker for sharding. Required if shard_strategy is specified.

    Returns:
        torch.utils.data.Dataset: A PyTorch Dataset containing the data from the DataFrame.

    Note:
        The produced dataset is meant to be used with the single-process DataLoader,
        and does not support data sharding hooks for multi-process data loading.

    Examples:
        >>> import daft
        >>> import torch  # doctest: +SKIP
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> torch_dataset = df.to_torch_map_dataset()  # doctest: +SKIP

    """
    from daft.dataframe.to_torch import DaftTorchDataset

    if shard_strategy is not None:
        if world_size is None or rank is None:
            raise ValueError("world_size and rank must be specified when using sharding")
        df = self._shard(shard_strategy, world_size, rank)
    else:
        df = self

    return DaftTorchDataset(df.to_pydict(), len(df))

transform #

transform(func: Callable[..., DataFrame], *args: Any, **kwargs: Any) -> DataFrame

Apply a function that takes and returns a DataFrame.

Allows splitting your transformation into different units of work (functions) while preserving the syntax for chaining transformations.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [1, 2, 3, 4]})
>>> def add_1(df):
...     df = df.select(daft.col("col_a") + 1)
...     return df
>>> def multiply_x(df, x):
...     df = df.select(daft.col("col_a") * x)
...     return df
>>> df = df.transform(add_1).transform(multiply_x, 4)
>>> df.show()

╭───────╮
│ col_a │
│ ---   │
│ Int64 │
╞═══════╡
│ 8     │
├╌╌╌╌╌╌╌┤
│ 12    │
├╌╌╌╌╌╌╌┤
│ 16    │
├╌╌╌╌╌╌╌┤
│ 20    │
╰───────╯
(Showing first 4 of 4 rows)

Parameters:

Name	Type	Description	Default
`func`	`Callable[..., DataFrame]`	A function that takes and returns a DataFrame.	required
`*args`	`Any`	Positional arguments to pass to func.	`()`
`**kwargs`	`Any`	Keyword arguments to pass to func.	`{}`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Transformed DataFrame.
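
The chaining pattern itself is small enough to sketch without Daft. In the example below, `Table` is a hypothetical stand-in for a DataFrame; it shows how positional and keyword arguments are forwarded to `func` and how a wrong return type is rejected. Raising `TypeError` is a choice made for this sketch.

```python
from __future__ import annotations

from typing import Any, Callable


class Table:
    """Hypothetical stand-in for a DataFrame holding one list of numbers."""

    def __init__(self, values: list[int]) -> None:
        self.values = values

    def transform(self, func: Callable[..., "Table"], *args: Any, **kwargs: Any) -> "Table":
        # Call func with this object first, then any extra arguments.
        result = func(self, *args, **kwargs)
        if not isinstance(result, Table):
            raise TypeError(f"func returned {type(result)}, expected Table")
        return result


def add_1(t: Table) -> Table:
    return Table([v + 1 for v in t.values])


def multiply_x(t: Table, x: int) -> Table:
    return Table([v * x for v in t.values])


# Extra arguments can be passed positionally or by keyword.
out = Table([1, 2, 3, 4]).transform(add_1).transform(multiply_x, x=4)
```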

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame":
    """Apply a function that takes and returns a DataFrame.

    Allows splitting your transformation into different units of work (functions) while preserving the syntax for chaining transformations.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [1, 2, 3, 4]})
        >>> def add_1(df):
        ...     df = df.select(daft.col("col_a") + 1)
        ...     return df
        >>> def multiply_x(df, x):
        ...     df = df.select(daft.col("col_a") * x)
        ...     return df
        >>> df = df.transform(add_1).transform(multiply_x, 4)
        >>> df.show()
        ╭───────╮
        │ col_a │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 8     │
        ├╌╌╌╌╌╌╌┤
        │ 12    │
        ├╌╌╌╌╌╌╌┤
        │ 16    │
        ├╌╌╌╌╌╌╌┤
        │ 20    │
        ╰───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)

    Args:
        func: A function that takes and returns a DataFrame.
        *args: Positional arguments to pass to func.
        **kwargs: Keyword arguments to pass to func.

    Returns:
        DataFrame: Transformed DataFrame.
    """
    result = func(self, *args, **kwargs)
    assert isinstance(result, DataFrame), (
        f"Func returned an instance of type [{type(result)}], should have been DataFrame."
    )
    return result

union #

union(other: DataFrame) -> DataFrame

Returns the distinct union of two DataFrames.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	The DataFrame to union with this one.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame containing the distinct rows from both DataFrames.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df2 = daft.from_pydict({"x": [3, 4, 5], "y": [6, 7, 8]})
>>> df1.union(df2).sort("x").show()

╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5     ┆ 8     │
╰───────┴───────╯
(Showing first 5 of 5 rows)
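
The difference between a distinct union and a union that keeps duplicates can be shown on plain tuples. This is a semantic sketch only: the `union` and `union_all` functions below are hypothetical list-based helpers, not Daft's implementation.

```python
Row = tuple[int, int]


def union_all(left: list[Row], right: list[Row]) -> list[Row]:
    # Keep every row from both inputs, duplicates included.
    return left + right


def union(left: list[Row], right: list[Row]) -> list[Row]:
    # dict.fromkeys drops repeated rows while keeping first-seen order.
    return list(dict.fromkeys(left + right))


df1 = [(1, 4), (2, 5), (3, 6)]
df2 = [(3, 6), (4, 7), (5, 8)]
distinct_rows = sorted(union(df1, df2))
all_rows = sorted(union_all(df1, df2))
```

The row `(3, 6)` appears in both inputs, so `distinct_rows` has five rows and `all_rows` has six.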

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def union(self, other: "DataFrame") -> "DataFrame":
    """Returns the distinct union of two DataFrames.

    Args:
        other (DataFrame): The DataFrame to union with this one.

    Returns:
        DataFrame: A new DataFrame containing the distinct rows from both DataFrames.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"x": [3, 4, 5], "y": [6, 7, 8]})
        >>> df1.union(df2).sort("x").show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 5     ┆ 8     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 5 of 5 rows)
    """
    builder = self._builder.union(other._builder)
    return DataFrame(builder)

union_all #

union_all(other: DataFrame) -> DataFrame

Returns the union of two DataFrames, including duplicates.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	The DataFrame to union with this one.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame containing all rows from both DataFrames, including duplicates.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df2 = daft.from_pydict({"x": [3, 2, 1], "y": [6, 5, 4]})
>>> df1.union_all(df2).sort("x").show()

╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 6 of 6 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def union_all(self, other: "DataFrame") -> "DataFrame":
    """Returns the union of two DataFrames, including duplicates.

    Args:
        other (DataFrame): The DataFrame to union with this one.

    Returns:
        DataFrame: A new DataFrame containing all rows from both DataFrames, including duplicates.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df2 = daft.from_pydict({"x": [3, 2, 1], "y": [6, 5, 4]})
        >>> df1.union_all(df2).sort("x").show()
        ╭───────┬───────╮
        │ x     ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 6 of 6 rows)
    """
    builder = self._builder.union(other._builder, is_all=True)
    return DataFrame(builder)

union_all_by_name #

union_all_by_name(other: DataFrame) -> DataFrame

Returns the union of two DataFrames, including duplicates, with columns matched by name.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	The DataFrame to union with this one, matching columns by name.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame containing all rows from both DataFrames, including duplicates, with columns matched by name.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
>>> df2 = daft.from_pydict({"y": [6, 6, 7, 7], "z": ["a", "a", "b", "b"]})
>>> df1.union_all_by_name(df2).sort("y").show()

╭───────┬───────┬───────┬────────╮
│ x     ┆ y     ┆ w     ┆ z      │
│ ---   ┆ ---   ┆ ---   ┆ ---    │
│ Int64 ┆ Int64 ┆ Int64 ┆ String │
╞═══════╪═══════╪═══════╪════════╡
│ 1     ┆ 4     ┆ 9     ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 10    ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 6     ┆ None  ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 6     ┆ None  ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 7     ┆ None  ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 7     ┆ None  ┆ b      │
╰───────┴───────┴───────┴────────╯
(Showing first 6 of 6 rows)
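
Matching by name means a column missing on one side is filled with nulls for that side's rows. A minimal sketch over column dicts, assuming left-side column order comes first (as in the output above); `union_all_by_name` here is a hypothetical helper, not Daft's implementation.

```python
from __future__ import annotations

from typing import Any

Columns = dict[str, list[Any]]


def union_all_by_name(left: Columns, right: Columns) -> Columns:
    """Hypothetical column-dict version of a by-name union that keeps duplicates."""
    n_left = len(next(iter(left.values()))) if left else 0
    n_right = len(next(iter(right.values()))) if right else 0
    # Left column order first, then columns that only exist on the right.
    names = list(dict.fromkeys([*left, *right]))
    return {
        # A column absent on one side contributes None for that side's rows.
        name: left.get(name, [None] * n_left) + right.get(name, [None] * n_right)
        for name in names
    }


df1 = {"x": [1, 2], "y": [4, 5], "w": [9, 10]}
df2 = {"y": [6, 6, 7, 7], "z": ["a", "a", "b", "b"]}
merged = union_all_by_name(df1, df2)
```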

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def union_all_by_name(self, other: "DataFrame") -> "DataFrame":
    """Returns the union of two DataFrames, including duplicates, with columns matched by name.

    Args:
        other (DataFrame): The DataFrame to union with this one, matching columns by name.

    Returns:
        DataFrame: A new DataFrame containing all rows from both DataFrames, including duplicates, with columns matched by name.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
        >>> df2 = daft.from_pydict({"y": [6, 6, 7, 7], "z": ["a", "a", "b", "b"]})
        >>> df1.union_all_by_name(df2).sort("y").show()
        ╭───────┬───────┬───────┬────────╮
        │ x     ┆ y     ┆ w     ┆ z      │
        │ ---   ┆ ---   ┆ ---   ┆ ---    │
        │ Int64 ┆ Int64 ┆ Int64 ┆ String │
        ╞═══════╪═══════╪═══════╪════════╡
        │ 1     ┆ 4     ┆ 9     ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 10    ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 6     ┆ None  ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 6     ┆ None  ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 7     ┆ None  ┆ b      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 7     ┆ None  ┆ b      │
        ╰───────┴───────┴───────┴────────╯
        <BLANKLINE>
        (Showing first 6 of 6 rows)
    """
    builder = self._builder.union(other._builder, is_all=True, is_by_name=True)
    return DataFrame(builder)

union_by_name #

union_by_name(other: DataFrame) -> DataFrame

Returns the distinct union of two DataFrames, with columns matched by name.

Parameters:

Name	Type	Description	Default
`other`	`DataFrame`	The DataFrame to union with this one, matching columns by name.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A new DataFrame containing the distinct rows from both DataFrames, with columns matched by name.

Examples:

>>> import daft
>>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
>>> df2 = daft.from_pydict({"y": [6, 7], "z": ["a", "b"]})
>>> df1.union_by_name(df2).sort("y").show()

╭───────┬───────┬───────┬────────╮
│ x     ┆ y     ┆ w     ┆ z      │
│ ---   ┆ ---   ┆ ---   ┆ ---    │
│ Int64 ┆ Int64 ┆ Int64 ┆ String │
╞═══════╪═══════╪═══════╪════════╡
│ 1     ┆ 4     ┆ 9     ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 10    ┆ None   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 6     ┆ None  ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ None  ┆ 7     ┆ None  ┆ b      │
╰───────┴───────┴───────┴────────╯
(Showing first 4 of 4 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def union_by_name(self, other: "DataFrame") -> "DataFrame":
    """Returns the distinct union by name.

    Args:
        other (DataFrame): The DataFrame to union with this one, matching columns by name.

    Returns:
        DataFrame: A new DataFrame containing the distinct rows from both DataFrames, with columns matched by name.

    Examples:
        >>> import daft
        >>> df1 = daft.from_pydict({"x": [1, 2], "y": [4, 5], "w": [9, 10]})
        >>> df2 = daft.from_pydict({"y": [6, 7], "z": ["a", "b"]})
        >>> df1.union_by_name(df2).sort("y").show()
        ╭───────┬───────┬───────┬────────╮
        │ x     ┆ y     ┆ w     ┆ z      │
        │ ---   ┆ ---   ┆ ---   ┆ ---    │
        │ Int64 ┆ Int64 ┆ Int64 ┆ String │
        ╞═══════╪═══════╪═══════╪════════╡
        │ 1     ┆ 4     ┆ 9     ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 10    ┆ None   │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 6     ┆ None  ┆ a      │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ None  ┆ 7     ┆ None  ┆ b      │
        ╰───────┴───────┴───────┴────────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
    """
    builder = self._builder.union(other._builder, is_all=False, is_by_name=True)
    return DataFrame(builder)

unique #

unique(*by: ColumnInputType) -> DataFrame

Computes distinct rows, dropping duplicates.

Alias for DataFrame.distinct.

Parameters:

Name	Type	Description	Default
`*by`	`Union[str, Expression]`	columns to perform distinct on. Defaults to all columns.	`()`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame that has only distinct rows.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
>>> distinct_df = df.unique()
>>> distinct_df = distinct_df.sort("x")
>>> distinct_df.show()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 8     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def unique(self, *by: ColumnInputType) -> "DataFrame":
    """Computes distinct rows, dropping duplicates.

    Alias for [DataFrame.distinct][daft.DataFrame.distinct].

    Args:
        *by (Union[str, Expression]): columns to perform distinct on. Defaults to all columns.

    Returns:
        DataFrame: DataFrame that has only distinct rows.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 2], "y": [4, 5, 5], "z": [7, 8, 8]})
        >>> distinct_df = df.unique()
        >>> distinct_df = distinct_df.sort("x")
        >>> distinct_df.show()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 7     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 8     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)
    """
    return self.distinct(*by)

unpivot #

unpivot(ids: ManyColumnsInputType, values: ManyColumnsInputType = [], variable_name: str = 'variable', value_name: str = 'value') -> DataFrame

Unpivots a DataFrame from wide to long format.

Parameters:

Name	Type	Description	Default
`ids`	`ManyColumnsInputType`	Columns to keep as identifiers	required
`values`	`Optional[ManyColumnsInputType]`	Columns to unpivot. If not specified, all columns except ids will be unpivoted.	`[]`
`variable_name`	`Optional[str]`	Name of the variable column. Defaults to "variable".	`'variable'`
`value_name`	`Optional[str]`	Name of the value column. Defaults to "value".	`'value'`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Unpivoted DataFrame
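
To make the wide-to-long reshape concrete, here is a plain-Python sketch over row dicts. It is illustrative only: `unpivot` below is a hypothetical helper, the sample data is made up, and the output row order is a property of this sketch rather than a guarantee about Daft.

```python
from __future__ import annotations

from typing import Any


def unpivot(
    rows: list[dict[str, Any]],
    ids: list[str],
    values: list[str] | None = None,
    variable_name: str = "variable",
    value_name: str = "value",
) -> list[dict[str, Any]]:
    """Hypothetical row-dict version of a wide-to-long reshape."""
    out = []
    for row in rows:
        # With no explicit values, unpivot every column that is not an id.
        cols = values or [c for c in row if c not in ids]
        for c in cols:
            long_row = {k: row[k] for k in ids}
            long_row[variable_name] = c  # former column name
            long_row[value_name] = row[c]  # former cell value
            out.append(long_row)
    return out


wide = [{"year": 2020, "Jan": 10, "Feb": 20}, {"year": 2021, "Jan": 30, "Feb": 40}]
long = unpivot(wide, ids=["year"])
```

Each input row yields one output row per unpivoted column, so two wide rows with two value columns become four long rows.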

Tip

var #

var(*cols: ColumnInputType, ddof: int = 1) -> DataFrame

Performs a global variance on the DataFrame.

Parameters:

Name	Type	Description	Default
`*cols`	`Union[str, Expression]`	columns to compute variance for	`()`
`ddof`	`int`	Delta degrees of freedom used in the denominator `N - ddof`. Defaults to 1 (sample variance).	`1`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Globally aggregated variance. Should be a single row.

Examples:

>>> import daft
>>> df = daft.from_pydict({"col_a": [0, 1, 2]})
>>> df = df.var("col_a")
>>> df.show()

╭─────────╮
│ col_a   │
│ ---     │
│ Float64 │
╞═════════╡
│ 1       │
╰─────────╯
(Showing first 1 of 1 rows)
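
The effect of `ddof` on the denominator `N - ddof` can be checked by hand. The sketch below is a plain-Python version of the formula, not Daft's implementation; raising on `N - ddof <= 0` is an assumption of this sketch.

```python
def variance(xs: list[float], ddof: int = 1) -> float:
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(xs)
    if n - ddof <= 0:
        # Assumption: how the library handles this case is not specified here.
        raise ValueError("need more than ddof values")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)


data = [0, 1, 2]
sample_var = variance(data)  # ddof=1: 2 / (3 - 1)
population_var = variance(data, ddof=0)  # ddof=0: 2 / 3
```

For `[0, 1, 2]` the squared deviations sum to 2, so the default `ddof=1` gives 1.0, matching the example above, while `ddof=0` gives 2/3.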

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def var(self, *cols: ColumnInputType, ddof: int = 1) -> "DataFrame":
    """Performs a global variance on the DataFrame.

    Args:
        *cols (Union[str, Expression]): columns to compute variance for
        ddof (int): Delta degrees of freedom used in the denominator `N - ddof`.
            Defaults to 1 (sample variance).

    Returns:
        DataFrame: Globally aggregated variance. Should be a single row.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"col_a": [0, 1, 2]})
        >>> df = df.var("col_a")
        >>> df.show()
        ╭─────────╮
        │ col_a   │
        │ ---     │
        │ Float64 │
        ╞═════════╡
        │ 1       │
        ╰─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)

    """
    return self._apply_agg_fn(lambda expr: Expression.var(expr, ddof), cols)

where #

where(predicate: Expression | str) -> DataFrame

Filters rows via a predicate expression, similar to SQL WHERE.

Parameters:

Name	Type	Description	Default
`predicate`	`Expression \| str`	Expression, or SQL string, that keeps a row if it evaluates to True.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Filtered DataFrame.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6], "z": [7, 8, 9]})
>>> df.where((df["x"] > 1) & (df["y"] > 1)).collect()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 2     ┆ 6     ┆ 8     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 9     │
╰───────┴───────┴───────╯
(Showing first 2 of 2 rows)

You can also use a string expression as a predicate.

Note: this uses sql_expr to parse the string into an expression, which may raise an error if the expression is not yet supported in the SQL engine.

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 9, 9]})
>>> df.where("z = 9 AND y > 5").collect()

╭───────┬───────┬───────╮
│ x     ┆ y     ┆ z     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 3     ┆ 6     ┆ 9     │
╰───────┴───────┴───────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def where(self, predicate: Expression | str) -> "DataFrame":
    """Filters rows via a predicate expression, similar to SQL ``WHERE``.

    Args:
        predicate (Expression | str): expression, or SQL string, that keeps a row if it evaluates to True.

    Returns:
        DataFrame: Filtered DataFrame.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 6, 6], "z": [7, 8, 9]})
        >>> df.where((df["x"] > 1) & (df["y"] > 1)).collect()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 2     ┆ 6     ┆ 8     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 9     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

        You can also use a string expression as a predicate.

        Note: this uses `sql_expr` to parse the string into an expression,
        which may raise an error if the expression is not yet supported in the SQL engine.

        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 9, 9]})
        >>> df.where("z = 9 AND y > 5").collect()
        ╭───────┬───────┬───────╮
        │ x     ┆ y     ┆ z     │
        │ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╡
        │ 3     ┆ 6     ┆ 9     │
        ╰───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    if isinstance(predicate, str):
        from daft.sql.sql import sql_expr

        predicate = sql_expr(predicate)
    builder = self._builder.filter(predicate)
    return DataFrame(builder)

with_column #

with_column(column_name: str, expr: Expression) -> DataFrame

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one.

Parameters:

Name	Type	Description	Default
`column_name`	`str`	name of new column	required
`expr`	`Expression`	expression of the new column.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with new column.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3]})
>>> new_df = df.with_column("x+1", df["x"] + 1)
>>> new_df.show()

╭───────┬───────╮
│ x     ┆ x+1   │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 4     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
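
The "select with all current columns and the new one" equivalence can be sketched on a column dict. Everything here is hypothetical scaffolding for illustration: `Expr` models an expression as a function from the table to a column, and the helpers are not Daft's implementation.

```python
from __future__ import annotations

from typing import Any, Callable

Columns = dict[str, list[Any]]
Expr = Callable[[Columns], list[Any]]  # hypothetical "expression": computes a column


def with_columns(table: Columns, new: dict[str, Expr]) -> Columns:
    # Evaluate every new column against the original table, then place the
    # results after the existing columns.
    computed = {name: expr(table) for name, expr in new.items()}
    return {**table, **computed}


def with_column(table: Columns, name: str, expr: Expr) -> Columns:
    # The single-column form delegates to the multi-column one.
    return with_columns(table, {name: expr})


df = {"x": [1, 2, 3]}
new_df = with_column(df, "x+1", lambda t: [v + 1 for v in t["x"]])
```

The input table is left untouched and a new one is returned, mirroring how the method returns a new DataFrame.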

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def with_column(
    self,
    column_name: str,
    expr: Expression,
) -> "DataFrame":
    """Adds a column to the current DataFrame with an Expression, equivalent to a ``select`` with all current columns and the new one.

    Args:
        column_name (str): name of new column
        expr (Expression): expression of the new column.

    Returns:
        DataFrame: DataFrame with new column.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3]})
        >>> new_df = df.with_column("x+1", df["x"] + 1)
        >>> new_df.show()
        ╭───────┬───────╮
        │ x     ┆ x+1   │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 2     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 4     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    return self.with_columns({column_name: expr})

with_column_renamed #

with_column_renamed(existing: str, new: str) -> DataFrame

Renames a column in the current DataFrame.

If the existing column is not in the DataFrame schema, this is a no-op.

Parameters:

Name	Type	Description	Default
`existing`	`str`	name of the existing column to rename	required
`new`	`str`	new name for the column	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with the column renamed.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df.with_column_renamed("x", "foo").show()

╭───────┬───────╮
│ foo   ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def with_column_renamed(self, existing: str, new: str) -> "DataFrame":
    """Renames a column in the current DataFrame.

    If the column in the DataFrame schema does not exist, this will be a no-op.

    Args:
        existing (str): name of the existing column to rename
        new (str): new name for the column

    Returns:
        DataFrame: DataFrame with the column renamed.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df.with_column_renamed("x", "foo").show()
        ╭───────┬───────╮
        │ foo   ┆ y     │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.with_column_renamed(existing, new)
    return DataFrame(builder)

with_columns #

with_columns(columns: dict[str, Expression]) -> DataFrame

Adds columns to the current DataFrame with Expressions, equivalent to a select with all current columns and the new ones.

Parameters:

Name	Type	Description	Default
`columns`	`Dict[str, Expression]`	Dictionary of new columns in the format { name: expression }	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with new columns.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> new_df = df.with_columns({"foo": df["x"] + 1, "bar": df["y"] - df["x"]})
>>> new_df.show()

╭───────┬───────┬───────┬───────╮
│ x     ┆ y     ┆ foo   ┆ bar   │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 2     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 3     ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 4     ┆ 3     │
╰───────┴───────┴───────┴───────╯
(Showing first 3 of 3 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def with_columns(
    self,
    columns: dict[str, Expression],
) -> "DataFrame":
    """Adds columns to the current DataFrame with Expressions, equivalent to a ``select`` with all current columns and the new ones.

    Args:
        columns (Dict[str, Expression]): Dictionary of new columns in the format { name: expression }

    Returns:
        DataFrame: DataFrame with new columns.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> new_df = df.with_columns({"foo": df["x"] + 1, "bar": df["y"] - df["x"]})
        >>> new_df.show()
        ╭───────┬───────┬───────┬───────╮
        │ x     ┆ y     ┆ foo   ┆ bar   │
        │ ---   ┆ ---   ┆ ---   ┆ ---   │
        │ Int64 ┆ Int64 ┆ Int64 ┆ Int64 │
        ╞═══════╪═══════╪═══════╪═══════╡
        │ 1     ┆ 4     ┆ 2     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     ┆ 3     ┆ 3     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     ┆ 4     ┆ 3     │
        ╰───────┴───────┴───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    new_columns = [col.alias(name) for name, col in columns.items()]

    builder = self._builder.with_columns(new_columns)
    return DataFrame(builder)

with_columns_renamed #

with_columns_renamed(cols_map: dict[str, str]) -> DataFrame

Renames multiple columns in the current DataFrame.

If the columns in the DataFrame schema do not exist, this will be a no-op.

Parameters:

Name	Type	Description	Default
`cols_map`	`Dict[str, str]`	Dictionary of columns to rename in the format { existing: new }	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	DataFrame with the columns renamed.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> df.with_columns_renamed({"x": "foo", "y": "bar"}).show()

╭───────┬───────╮
│ foo   ┆ bar   │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def with_columns_renamed(self, cols_map: dict[str, str]) -> "DataFrame":
    """Renames multiple columns in the current DataFrame.

    If the columns in the DataFrame schema do not exist, this will be a no-op.

    Args:
        cols_map (Dict[str, str]): Dictionary of columns to rename in the format { existing: new }

    Returns:
        DataFrame: DataFrame with the columns renamed.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
        >>> df.with_columns_renamed({"x": "foo", "y": "bar"}).show()
        ╭───────┬───────╮
        │ foo   ┆ bar   │
        │ ---   ┆ ---   │
        │ Int64 ┆ Int64 │
        ╞═══════╪═══════╡
        │ 1     ┆ 4     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 2     ┆ 5     │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
        │ 3     ┆ 6     │
        ╰───────┴───────╯
        <BLANKLINE>
        (Showing first 3 of 3 rows)
    """
    builder = self._builder.with_columns_renamed(cols_map)
    return DataFrame(builder)

write_bigtable #

write_bigtable(project_id: str, instance_id: str, table_id: str, row_key_column: str, column_family_mappings: dict[str, str], client_kwargs: dict[str, Any] | None = None, write_kwargs: dict[str, Any] | None = None, serialize_incompatible_types: bool = True) -> DataFrame

Write a DataFrame into a Google Cloud Bigtable table.

Bigtable only accepts datatypes that can be converted to bytes in cells (for more details, please consult the Bigtable documentation: https://cloud.google.com/bigtable/docs/overview#data-types). By default, write_bigtable automatically serializes incompatible types to JSON. This can be disabled by setting serialize_incompatible_types=False.

This data sink transforms each row of the dataframe into Bigtable rows. A row key is always required. The row_key_column parameter can be used to specify the column name to use for the row key.

Every column must also belong to a column family. The column_family_mappings parameter can be used to specify the column family to use for each column. For example, if you have a column "name" and a column "age", you can specify a "user_data" column family by passing a dictionary like {"name": "user_data", "age": "user_data"}.
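
When a table uses several column families, it can be easier to list the columns per family and invert that grouping into the { column: family } shape described above. The helper below is a minimal sketch of that idea; `mappings_from_groups`, the `stats` family, and the `score` column are hypothetical names, and nothing here is part of Daft's API.

```python
def mappings_from_groups(groups: dict) -> dict:
    """Invert {family: [columns]} into {column: family}.

    Hypothetical helper for illustration only; it is not part of Daft.
    """
    mappings = {}
    for family, columns in groups.items():
        for column in columns:
            if column in mappings:
                # A column can only be mapped to one family in this dict shape.
                raise ValueError(
                    f"column {column!r} is assigned to both "
                    f"{mappings[column]!r} and {family!r}"
                )
            mappings[column] = family
    return mappings


# "user_data", "name" and "age" come from the example above;
# "stats" and "score" are made-up additions.
column_family_mappings = mappings_from_groups(
    {"user_data": ["name", "age"], "stats": ["score"]}
)
print(column_family_mappings)
```

The resulting dictionary is what would be passed as the column_family_mappings argument.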

EXPERIMENTAL: This feature is early in development and will change.

Parameters:

Name	Type	Description	Default
`project_id`	`str`	The Google Cloud project ID.	required
`instance_id`	`str`	The Bigtable instance ID.	required
`table_id`	`str`	The table to write to.	required
`row_key_column`	`str`	Column name for the row key.	required
`column_family_mappings`	`dict[str, str]`	Mapping of column names to column families.	required
`client_kwargs`	`dict[str, Any] \| None`	Optional dictionary of arguments to pass to the Bigtable Client constructor.	`None`
`write_kwargs`	`dict[str, Any] \| None`	Optional dictionary of arguments to pass to the Bigtable MutationsBatcher.	`None`
`serialize_incompatible_types`	`bool`	Whether to automatically convert non-bytes/int values to Bigtable-compatible formats. If False, will raise an error for unsupported types. Defaults to True.	`True`

Source code in daft/dataframe/dataframe.py

def write_bigtable(
    self,
    project_id: str,
    instance_id: str,
    table_id: str,
    row_key_column: str,
    column_family_mappings: dict[str, str],
    client_kwargs: dict[str, Any] | None = None,
    write_kwargs: dict[str, Any] | None = None,
    serialize_incompatible_types: bool = True,
) -> "DataFrame":
    """Write a DataFrame into a Google Cloud Bigtable table.

    Bigtable only accepts datatypes that can be converted to bytes in cells (for more details, please consult the Bigtable documentation: https://cloud.google.com/bigtable/docs/overview#data-types).
    By default, `write_bigtable` automatically serializes incompatible types to JSON. This can be disabled by setting `serialize_incompatible_types=False`.

    This data sink transforms each row of the dataframe into Bigtable rows.
    A row key is always required. The `row_key_column` parameter can be used to specify the column name to use for the row key.

    Every column must also belong to a column family. The `column_family_mappings` parameter can be used to specify the column family to use for each column.
    For example, if you have a column "name" and a column "age", you can specify a "user_data" column family by passing a dictionary like {"name": "user_data", "age": "user_data"}.

    EXPERIMENTAL: This feature is early in development and will change.

    Args:
        project_id: The Google Cloud project ID.
        instance_id: The Bigtable instance ID.
        table_id: The table to write to.
        row_key_column: Column name for the row key.
        column_family_mappings: Mapping of column names to column families.
        client_kwargs: Optional dictionary of arguments to pass to the Bigtable Client constructor.
        write_kwargs: Optional dictionary of arguments to pass to the Bigtable MutationsBatcher.
        serialize_incompatible_types: Whether to automatically convert non-bytes/int values to Bigtable-compatible formats.
                                      If False, will raise an error for unsupported types. Defaults to True.
    """
    from daft.io.bigtable.bigtable_data_sink import BigtableDataSink

    sink = BigtableDataSink(
        project_id, instance_id, table_id, row_key_column, column_family_mappings, client_kwargs, write_kwargs
    )

    # Preprocess the DataFrame using the sink's validation and preprocessing logic
    df_to_write = sink._preprocess_dataframe(self, serialize_incompatible_types)

    return df_to_write.write_sink(sink)

write_clickhouse #

write_clickhouse(table: str, *, host: str, port: int | None = None, user: str | None = None, password: str | None = None, database: str | None = None, client_kwargs: dict[str, Any] | None = None, write_kwargs: dict[str, Any] | None = None) -> DataFrame

Writes the DataFrame to a ClickHouse table.

Parameters:

Name	Type	Description	Default
`table`	`str`	Name of the ClickHouse table to write to.	required
`host`	`str`	ClickHouse host.	required
`port`	`int \| None`	ClickHouse port.	`None`
`user`	`str \| None`	ClickHouse user.	`None`
`password`	`str \| None`	ClickHouse password.	`None`
`database`	`str \| None`	ClickHouse database.	`None`
`client_kwargs`	`dict[str, Any] \| None`	Optional dictionary of arguments to pass to the ClickHouse client constructor.	`None`
`write_kwargs`	`dict[str, Any] \| None`	Optional dictionary of arguments to pass to the ClickHouse write() method.	`None`

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3, 4]})
>>> df.write_clickhouse(table="", host="", port=8123, user="", password="")

╭────────────────────┬─────────────────────╮
│ total_written_rows ┆ total_written_bytes │
│ ---                ┆ ---                 │
│ Int64              ┆ Int64               │
╞════════════════════╪═════════════════════╡
│ 4                  ┆ 32                  │
╰────────────────────┴─────────────────────╯

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_clickhouse(
    self,
    table: str,
    *,
    host: str,
    port: int | None = None,
    user: str | None = None,
    password: str | None = None,
    database: str | None = None,
    client_kwargs: dict[str, Any] | None = None,
    write_kwargs: dict[str, Any] | None = None,
) -> "DataFrame":
    """Writes the DataFrame to a ClickHouse table.

    Args:
        table: Name of the ClickHouse table to write to.
        host: ClickHouse host.
        port: ClickHouse port.
        user: ClickHouse user.
        password: ClickHouse password.
        database: ClickHouse database.
        client_kwargs: Optional dictionary of arguments to pass to the ClickHouse client constructor.
        write_kwargs: Optional dictionary of arguments to pass to the ClickHouse write() method.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4]})  # doctest: +SKIP
        >>> df.write_clickhouse(table="", host="", port=8123, user="", password="")  # doctest: +SKIP
        ╭────────────────────┬─────────────────────╮
        │ total_written_rows ┆ total_written_bytes │
        │ ---                ┆ ---                 │
        │ Int64              ┆ Int64               │
        ╞════════════════════╪═════════════════════╡
        │ 4                  ┆ 32                  │
        ╰────────────────────┴─────────────────────╯
    """
    from daft.io.clickhouse.clickhouse_data_sink import ClickHouseDataSink

    sink = ClickHouseDataSink(
        table,
        host=host,
        port=port,
        user=user,
        password=password,
        database=database,
        client_kwargs=client_kwargs,
        write_kwargs=write_kwargs,
    )
    return self.write_sink(sink)

write_csv #

write_csv(root_dir: str | Path, write_mode: Literal['append', 'overwrite', 'overwrite-partitions'] = 'append', partition_cols: list[ColumnInputType] | None = None, io_config: IOConfig | None = None, delimiter: str | None = None, quote: str | None = None, escape: str | None = None, header: bool | None = True, date_format: str | None = None, timestamp_format: str | None = None) -> DataFrame

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

Files will be written to <root_dir>/* with randomly generated UUIDs as the file names.

Parameters:

Name	Type	Description	Default
`root_dir`	`str`	root file path to write CSV files to.	required
`write_mode`	`str`	Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".	`'append'`
`partition_cols`	`Optional[List[ColumnInputType]]`	How to subpartition each partition further. Defaults to None.	`None`
`io_config`	`Optional[IOConfig]`	configurations to use when interacting with remote storage.	`None`
`delimiter`	`Optional[str]`	Single-character field delimiter (default `,`).	`None`
`quote`	`Optional[str]`	Single-character quote used around fields containing delimiters (default `"`).	`None`
`escape`	`Optional[str]`	Single-character escape for special characters (default `\\`).	`None`
`header`	`Optional[bool]`	Whether to write a header row with column names, default True.	`True`
`date_format`	`Optional[str]`	Format string for date columns. Uses chrono strftime format (e.g., "%Y-%m-%d", "%d/%m/%Y"). Defaults to None (ISO 8601 format).	`None`
`timestamp_format`	`Optional[str]`	Format string for timestamp columns. Uses chrono strftime format (e.g., "%Y-%m-%d %H:%M:%S", "%+"). Defaults to None (ISO 8601 format).	`None`
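
The source listing for this method rejects unknown write modes and requires partition columns for `overwrite-partitions`. The sketch below reproduces those two checks as a standalone function so they can be run before building a pipeline; `check_write_args` is a hypothetical name and is not a Daft function.

```python
from typing import Optional, Sequence

VALID_WRITE_MODES = ("append", "overwrite", "overwrite-partitions")


def check_write_args(
    write_mode: str, partition_cols: Optional[Sequence[str]] = None
) -> None:
    """Raise ValueError for argument combinations that write_csv rejects.

    Hypothetical pre-flight helper; it mirrors the two checks visible in
    the write_csv source and nothing else.
    """
    if write_mode not in VALID_WRITE_MODES:
        raise ValueError(f"unsupported write_mode: {write_mode!r}")
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("overwrite-partitions requires partition_cols")


check_write_args("append")                        # accepted
check_write_args("overwrite-partitions", ["x"])   # accepted
```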

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	The filenames that were written out as strings.

Note

This call is blocking and will execute the DataFrame when called.

Timezone handling: For timezone-aware timestamp columns, the timestamps are converted to the target timezone before formatting. For example, a timestamp stored as UTC but with timezone "America/New_York" will be formatted in Eastern Time, not UTC. If the timezone string is invalid, an error will be raised.
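
The conversion described above can be illustrated with Python's standard library alone. This is a sketch of the concept, not of Daft's implementation: a fixed UTC-5 offset stands in for "America/New_York" (valid for a January date, and chosen as an assumption so the example needs no timezone database).

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset as a stand-in for America/New_York in January.
eastern = timezone(timedelta(hours=-5), "EST")

# The instant as stored, expressed in UTC.
stored_utc = datetime(2024, 1, 15, 15, 30, 45, tzinfo=timezone.utc)

# Convert to the column's timezone first, then format.
local = stored_utc.astimezone(eastern)
formatted = local.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2024-01-15 10:30:45
```

The formatted text shows the wall-clock time in the target timezone, which is five hours behind the stored UTC value here.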

Examples:

Basic usage:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_csv("output_dir", write_mode="overwrite")

Custom date format (e.g., DD/MM/YYYY):

>>> import datetime
>>> df = daft.from_pydict({"date": [datetime.date(2024, 1, 15)]})
>>> df.write_csv("output_dir", date_format="%d/%m/%Y")

# Output: 15/01/2024

Custom timestamp format:

>>> df = daft.from_pydict({"ts": [datetime.datetime(2024, 1, 15, 10, 30, 45)]})
>>> df.write_csv("output_dir", timestamp_format="%Y-%m-%d %H:%M:%S")

# Output: 2024-01-15 10:30:45

ISO 8601 / RFC 3339 timestamp format:

>>> df.write_csv("output_dir", timestamp_format="%+")

# Output: 2024-01-15T10:30:45+00:00
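
A format string can be previewed before running a write. The sketch below uses Python's own `strftime` as a stand-in, on the assumption that the common specifiers used here (`%Y`, `%m`, `%d`, `%H`, `%M`, `%S`) behave the same way in chrono. The `%+` specifier is not among Python's documented directives, so it is not previewed.

```python
import datetime

date_value = datetime.date(2024, 1, 15)
ts_value = datetime.datetime(2024, 1, 15, 10, 30, 45)

# Same format strings as in the examples above.
date_preview = date_value.strftime("%d/%m/%Y")
ts_preview = ts_value.strftime("%Y-%m-%d %H:%M:%S")

print(date_preview)  # 15/01/2024
print(ts_preview)    # 2024-01-15 10:30:45
```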

Tip

See also df.write_parquet() and df.write_json() for other formats for writing DataFrames.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_csv(
    self,
    root_dir: str | pathlib.Path,
    write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
    partition_cols: list[ColumnInputType] | None = None,
    io_config: IOConfig | None = None,
    delimiter: str | None = None,
    quote: str | None = None,
    escape: str | None = None,
    header: bool | None = True,
    date_format: str | None = None,
    timestamp_format: str | None = None,
) -> "DataFrame":
    r"""Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

    Files will be written to `<root_dir>/*` with randomly generated UUIDs as the file names.

    Args:
        root_dir (str): root file path to write CSV files to.
        write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
        partition_cols (Optional[List[ColumnInputType]], optional): How to subpartition each partition further. Defaults to None.
        io_config (Optional[IOConfig], optional): configurations to use when interacting with remote storage.
        delimiter (Optional[str], optional): Single-character field delimiter (default `,`).
        quote (Optional[str], optional): Single-character quote used around fields containing delimiters (default `"`).
        escape (Optional[str], optional): Single-character escape for special characters (default `\\`).
        header (Optional[bool], optional): Whether to write a header row with column names, default True.
        date_format (Optional[str], optional): Format string for date columns. Uses chrono strftime format (e.g., "%Y-%m-%d", "%d/%m/%Y"). Defaults to None (ISO 8601 format).
        timestamp_format (Optional[str], optional): Format string for timestamp columns. Uses chrono strftime format (e.g., "%Y-%m-%d %H:%M:%S", "%+"). Defaults to None (ISO 8601 format).

    Returns:
        DataFrame: The filenames that were written out as strings.

    Note:
        This call is **blocking** and will execute the DataFrame when called

        **Timezone handling**: For timezone-aware timestamp columns, the timestamps are converted
        to the target timezone before formatting. For example, a timestamp stored as UTC but with
        timezone "America/New_York" will be formatted in Eastern Time, not UTC. If the timezone
        string is invalid, an error will be raised.

    Examples:
        Basic usage:

        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_csv("output_dir", write_mode="overwrite")  # doctest: +SKIP

        Custom date format (e.g., DD/MM/YYYY):

        >>> import datetime
        >>> df = daft.from_pydict({"date": [datetime.date(2024, 1, 15)]})
        >>> df.write_csv("output_dir", date_format="%d/%m/%Y")  # doctest: +SKIP
        # Output: 15/01/2024

        Custom timestamp format:

        >>> df = daft.from_pydict({"ts": [datetime.datetime(2024, 1, 15, 10, 30, 45)]})
        >>> df.write_csv("output_dir", timestamp_format="%Y-%m-%d %H:%M:%S")  # doctest: +SKIP
        # Output: 2024-01-15 10:30:45

        ISO 8601 / RFC 3339 timestamp format:

        >>> df.write_csv("output_dir", timestamp_format="%+")  # doctest: +SKIP
        # Output: 2024-01-15T10:30:45+00:00

    Tip:
        See also [`df.write_parquet()`][daft.DataFrame.write_parquet] and [`df.write_json()`][daft.DataFrame.write_json]
        for other formats for writing DataFrames.

    """
    if write_mode not in ["append", "overwrite", "overwrite-partitions"]:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    cols: list[Expression] | None = None
    if partition_cols is not None:
        cols = column_inputs_to_expressions(tuple(partition_cols))

    file_format_option = PyFormatSinkOption.csv(
        delimiter=delimiter,
        quote=quote,
        escape=escape,
        header=header,
        date_format=date_format,
        timestamp_format=timestamp_format,
    )
    builder = self._builder.write_tabular(
        root_dir=root_dir,
        partition_cols=cols,
        write_mode=WriteMode.from_str(write_mode),
        file_format=FileFormat.Csv,
        file_format_option=file_format_option,
        io_config=io_config,
    )

    # Block and write, then retrieve data
    write_df = DataFrame(builder)
    write_df.collect()
    assert write_df._result is not None

    # Populate and return a new disconnected DataFrame
    # Keep the original logical plan so explain() can still show upstream operators
    # (e.g. filters/projections before the write), instead of collapsing to an
    # in-memory source after collect() caches the result.
    result_df = DataFrame(write_df._get_current_builder())
    result_df._result_cache = write_df._result_cache
    result_df._preview = write_df._preview
    result_df._metadata = write_df._metadata
    return result_df

write_deltalake #

write_deltalake(table: Union[str, Path, DeltaTable, UnityCatalogTable], partition_cols: list[str] | None = None, mode: Literal['append', 'overwrite', 'error', 'ignore'] = 'append', schema_mode: Literal['merge', 'overwrite'] | None = None, name: str | None = None, description: str | None = None, configuration: Mapping[str, str | None] | None = None, custom_metadata: dict[str, str] | None = None, dynamo_table_name: str | None = None, allow_unsafe_rename: bool = False, io_config: IOConfig | None = None, checkpoint: IdempotentCommit | None = None) -> DataFrame

Writes the DataFrame to a Delta Lake table, returning a new DataFrame with the operations that occurred.

Parameters:

Name	Type	Description	Default
`table`	`Union[str, Path, DeltaTable, UnityCatalogTable]`	Destination Delta Lake Table or table URI to write dataframe to.	required
`partition_cols`	`List[str]`	How to subpartition each partition further. If table exists, expected to match table's existing partitioning scheme, otherwise creates the table with specified partition columns. Defaults to None.	`None`
`mode`	`str`	Operation mode of the write. `append` will add new data, `overwrite` will replace table with new data, `error` will raise an error if table already exists, and `ignore` will not write anything if table already exists. Defaults to `append`.	`'append'`
`schema_mode`	`str`	Schema mode of the write. If set to `overwrite`, allows replacing the schema of the table when doing `mode=overwrite`. Schema mode `merge` is currently not supported.	`None`
`name`	`str`	User-provided identifier for this table.	`None`
`description`	`str`	User-provided description for this table.	`None`
`configuration`	`Mapping[str, Optional[str]]`	A map containing configuration options for the metadata action.	`None`
`custom_metadata`	`Dict[str, str]`	Custom metadata to add to the commit info. Keys with prefix `daft.idempotence-` are reserved.	`None`
`dynamo_table_name`	`str`	Name of the DynamoDB table to be used as the locking provider if writing to S3.	`None`
`allow_unsafe_rename`	`bool`	Whether to allow unsafe rename when writing to S3 or local disk. Defaults to False.	`False`
`io_config`	`IOConfig`	configurations to use when interacting with remote storage.	`None`
`checkpoint`	`IdempotentCommit`	Bundled checkpoint store + idempotence key for an idempotent commit. When provided, the Delta commit's `custom_metadata` is tagged with `daft.idempotence-key` and retries with the same key recognize the prior attempt without producing a duplicate commit. Only `mode='append'` is supported. Requires the Ray runner.	`None`
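
Because keys starting with `daft.idempotence-` are reserved, it can help to screen a metadata dictionary before passing it as custom_metadata. The following is a minimal sketch of such a check; `reserved_keys` is a hypothetical helper, and the sample keys are made up.

```python
RESERVED_PREFIX = "daft.idempotence-"


def reserved_keys(custom_metadata):
    """Return the keys that collide with the reserved prefix.

    Hypothetical helper, not part of Daft; accepts None like the
    custom_metadata parameter does.
    """
    return [key for key in (custom_metadata or {}) if key.startswith(RESERVED_PREFIX)]


metadata = {"team": "analytics", "daft.idempotence-key": "run-1"}
offending = reserved_keys(metadata)
print(offending)  # ['daft.idempotence-key']
```

An empty list means the dictionary does not use the reserved prefix.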

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	The operations that occurred with this write.

Note

This call is blocking and will execute the DataFrame when called.

When checkpoint is provided and write_deltalake raises after the Delta commit landed (e.g. a transient failure during the post-commit mark_committed bookkeeping), the user data is already durable in Delta. The next call with the same IdempotentCommit (same idempotence key) will detect the commit via its marker, finish the bookkeeping, and exit cleanly without producing a duplicate commit.

The returned DataFrame reflects only this call's writes — empty (0 rows) on a recovery short-circuit, populated when a new commit lands. Useful for run-to-run diffing.

Idempotence-key contract — read carefully:

Same key + different inputs → silent no-op (data loss). The destination already has a commit tagged with the key, so nothing new is written.
Different key + same retry → duplicate commit. The destination won't recognize the prior attempt and will commit again. Idempotence is broken.

The orchestrator pattern (run-id supplied from upstream DAG context) avoids both naturally.

Crashed runs leave orphan data files at the table location. Delta writes parquet files before the commit, so files from crashed attempts are not referenced by any commit but the bytes remain on disk.
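
The idempotence-key contract above can be made concrete with a toy model. The class below is a plain-Python stand-in for a destination table, written only to illustrate the two failure modes; it does not use Daft or Delta Lake, and `FakeTable` and the `run-1`/`run-2` keys are hypothetical.

```python
class FakeTable:
    """Toy destination that remembers which idempotence keys were committed."""

    def __init__(self):
        self.commits = []  # list of (idempotence_key, rows)

    def append(self, key, rows):
        """Commit rows under key; return the rows written by this call."""
        if any(existing == key for existing, _ in self.commits):
            # Key already committed: short-circuit and write nothing.
            return []
        self.commits.append((key, list(rows)))
        return list(rows)


table = FakeTable()
first = table.append("run-1", [1, 2, 3])      # new commit
retry = table.append("run-1", [1, 2, 3])      # safe retry: nothing written
dropped = table.append("run-1", [4, 5, 6])    # same key, different input: silently skipped
duplicate = table.append("run-2", [1, 2, 3])  # different key, same input: committed twice

print(len(table.commits))  # 2
```

In this model the rows `[4, 5, 6]` never reach the table, and `[1, 2, 3]` is stored twice, which is why the key should identify one logical run and its inputs.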

Examples:

>>> import daft
>>> import deltalake
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_deltalake("s3://my-bucket/my-deltalake-table")

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_deltalake(
    self,
    table: Union[str, pathlib.Path, "deltalake.DeltaTable", "UnityCatalogTable"],
    partition_cols: list[str] | None = None,
    mode: Literal["append", "overwrite", "error", "ignore"] = "append",
    schema_mode: Literal["merge", "overwrite"] | None = None,
    name: str | None = None,
    description: str | None = None,
    configuration: Mapping[str, str | None] | None = None,
    custom_metadata: dict[str, str] | None = None,
    dynamo_table_name: str | None = None,
    allow_unsafe_rename: bool = False,
    io_config: IOConfig | None = None,
    checkpoint: "IdempotentCommit | None" = None,
) -> "DataFrame":
    """Writes the DataFrame to a [Delta Lake](https://docs.delta.io/latest/index.html) table, returning a new DataFrame with the operations that occurred.

    Args:
        table (Union[str, pathlib.Path, deltalake.DeltaTable, UnityCatalogTable]): Destination [Delta Lake Table](https://delta-io.github.io/delta-rs/api/delta_table/) or table URI to write dataframe to.
        partition_cols (List[str], optional): How to subpartition each partition further. If table exists, expected to match table's existing partitioning scheme, otherwise creates the table with specified partition columns. Defaults to None.
        mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace table with new data, `error` will raise an error if table already exists, and `ignore` will not write anything if table already exists. Defaults to `append`.
        schema_mode (str, optional): Schema mode of the write. If set to `overwrite`, allows replacing the schema of the table when doing `mode=overwrite`. Schema mode `merge` is currently not supported.
        name (str, optional): User-provided identifier for this table.
        description (str, optional): User-provided description for this table.
        configuration (Mapping[str, Optional[str]], optional): A map containing configuration options for the metadata action.
        custom_metadata (Dict[str, str], optional): Custom metadata to add to the commit info. Keys with prefix ``daft.idempotence-`` are reserved.
        dynamo_table_name (str, optional): Name of the DynamoDB table to be used as the locking provider if writing to S3.
        allow_unsafe_rename (bool, optional): Whether to allow unsafe rename when writing to S3 or local disk. Defaults to False.
        io_config (IOConfig, optional): configurations to use when interacting with remote storage.
        checkpoint (IdempotentCommit, optional): Bundled checkpoint store + idempotence key for an idempotent commit. When provided, the Delta commit's ``custom_metadata`` is tagged with ``daft.idempotence-key`` and retries with the same key recognize the prior attempt without producing a duplicate commit. Only ``mode='append'`` is supported. Requires the Ray runner.

    Returns:
        DataFrame: The operations that occurred with this write.

    Note:
        This call is **blocking** and will execute the DataFrame when called.

        When ``checkpoint`` is provided and ``write_deltalake`` raises
        *after* the Delta commit landed (e.g. a transient failure during
        the post-commit ``mark_committed`` bookkeeping), the user data is
        already durable in Delta. The next call with the same
        ``IdempotentCommit`` (same idempotence key) will detect the
        commit via its marker, finish the bookkeeping, and exit cleanly
        without producing a duplicate commit.

        The returned DataFrame reflects only this call's writes — empty
        (0 rows) on a recovery short-circuit, populated when a new
        commit lands. Useful for run-to-run diffing.

        Idempotence-key contract — read carefully:

        - **Same key + different inputs → silent no-op (data loss).** The
          destination already has a commit tagged with the key, so
          nothing new is written.
        - **Different key + same retry → duplicate commit.** The
          destination won't recognize the prior attempt and will commit
          again. Idempotence is broken.

        The orchestrator pattern (run-id supplied from upstream DAG context)
        avoids both naturally.

        Crashed runs leave orphan data files at the table location.
        Delta writes parquet files before the commit, so files from
        crashed attempts are not referenced by any commit but the
        bytes remain on disk.

    Examples:
        >>> import daft
        >>> import deltalake
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_deltalake("s3://my-bucket/my-deltalake-table")  # doctest: +SKIP
    """
    import json

    import deltalake
    import pyarrow as pa
    from deltalake.exceptions import TableNotFoundError
    from packaging.version import parse

    from daft import from_pydict
    from daft.dependencies import unity_catalog
    from daft.filesystem import get_protocol_from_path
    from daft.io.delta_lake._deltalake import delta_schema_to_pyarrow
    from daft.io.delta_lake.delta_lake_write import (
        AddAction,
        convert_pa_schema_to_delta,
        create_table_with_add_actions,
    )
    from daft.io.object_store_options import io_config_to_storage_options

    if schema_mode == "merge":
        raise ValueError("Schema mode 'merge' is not currently supported for write_deltalake.")

    if parse(deltalake.__version__) < parse("0.14.0"):
        raise ValueError(f"Write delta lake is only supported on deltalake>=0.14.0, found {deltalake.__version__}")

    # Reserved-prefix guard. Fires regardless of `checkpoint=` so a user
    # can't land a `daft.idempotence-*` marker via custom_metadata
    # without going through the idempotent flow — that would let a
    # future `checkpoint=` call walk into a confused recovery branch.
    if custom_metadata:
        for key in custom_metadata:
            if key.startswith("daft.idempotence-"):
                raise ValueError(f"custom_metadata keys with prefix 'daft.idempotence-' are reserved; got: {key!r}")

    if checkpoint is not None and mode != "append":
        raise NotImplementedError(
            f"write_deltalake with checkpoint=... currently supports mode='append' only; "
            f"got mode={mode!r}. overwrite/error/ignore + checkpoint are tracked separately."
        )

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    # Retrieve table_uri and storage_options from various backends
    table_uri: str
    storage_options: dict[str, str]

    if isinstance(table, deltalake.DeltaTable):
        table_uri = table.table_uri
        storage_options = table._storage_options or {}
        new_storage_options = io_config_to_storage_options(io_config, table_uri)
        storage_options.update(new_storage_options or {})
    else:
        if isinstance(table, str):
            table_uri = os.path.expanduser(table)
        elif isinstance(table, pathlib.Path):
            table_uri = str(table)
        elif unity_catalog.module_available() and isinstance(table, unity_catalog.UnityCatalogTable):
            table_uri = table.table_uri
            io_config = table.io_config
        else:
            raise ValueError(f"Expected table to be a path or a DeltaTable, received: {type(table)}")

        if io_config is None:
            raise ValueError(
                "io_config was not provided to write_deltalake and could not be retrieved from defaults."
            )

        storage_options = io_config_to_storage_options(io_config, table_uri) or {}
        try:
            table = deltalake.DeltaTable(table_uri, storage_options=storage_options)
        except TableNotFoundError:
            table = None

    # see: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/
    scheme = get_protocol_from_path(table_uri)
    if scheme == "s3" or scheme == "s3a":
        if dynamo_table_name is not None:
            storage_options["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
            storage_options["DELTA_DYNAMO_TABLE_NAME"] = dynamo_table_name
        else:
            storage_options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"

            if not allow_unsafe_rename:
                warnings.warn("No DynamoDB table specified for Delta Lake locking. Defaulting to unsafe writes.")
    elif scheme == "file" and allow_unsafe_rename:
        storage_options["MOUNT_ALLOW_UNSAFE_RENAME"] = "true"

    pyarrow_schema = pa.schema((f.name, f.dtype.to_arrow_dtype()) for f in self.schema())

    large_dtypes = True
    delta_schema = convert_pa_schema_to_delta(pyarrow_schema, large_dtypes=large_dtypes)

    if table:
        if partition_cols and partition_cols != table.metadata().partition_columns:
            raise ValueError(
                f"Expected partition columns to match that of the existing table ({table.metadata().partition_columns}), but received: {partition_cols}"
            )
        else:
            partition_cols = table.metadata().partition_columns

        table.update_incremental()

        table_schema = delta_schema_to_pyarrow(table.schema())
        if Schema.from_pyarrow_schema(delta_schema) != Schema.from_pyarrow_schema(table_schema) and not (
            mode == "overwrite" and schema_mode == "overwrite"
        ):
            raise ValueError(
                "Schema of data does not match table schema\n"
                f"Data schema:\n{delta_schema}\nTable Schema:\n{table_schema}"
            )
        if mode == "error":
            raise AssertionError("Delta table already exists, write mode set to error.")
        elif mode == "ignore":
            return from_pydict(
                {
                    "operation": pa.array([], type=pa.string()),
                    "rows": pa.array([], type=pa.int64()),
                    "file_size": pa.array([], type=pa.int64()),
                    "file_name": pa.array([], type=pa.string()),
                }
            )
        version = table.version() + 1
    else:
        version = 0

    if partition_cols is not None:
        for c in partition_cols:
            if self.schema()[c].dtype == DataType.binary():
                raise NotImplementedError("Binary partition columns are not yet supported for Delta Lake writes")

    if checkpoint is not None:
        return self._write_deltalake_with_checkpoint(
            table=table,
            table_uri=table_uri,
            storage_options=storage_options,
            delta_schema=delta_schema,
            partition_cols=partition_cols,
            mode=mode,
            version=version,
            large_dtypes=large_dtypes,
            io_config=io_config,
            name=name,
            description=description,
            configuration=configuration,
            custom_metadata=custom_metadata,
            checkpoint=checkpoint,
        )

    builder = self._builder.write_deltalake(
        table_uri,
        mode,
        version,
        large_dtypes,
        io_config=io_config,
        partition_cols=partition_cols,
    )
    write_df = DataFrame(builder)
    write_df.collect()

    write_result = write_df.to_pydict()
    assert "add_action" in write_result
    add_actions: list[AddAction] = write_result["add_action"]

    operations = []
    paths = []
    rows = []
    sizes = []

    for add_action in add_actions:
        stats = json.loads(add_action.stats)
        operations.append("ADD")
        paths.append(add_action.path)
        rows.append(stats["numRecords"])
        sizes.append(add_action.size)

    if table is None:
        create_table_with_add_actions(
            table_uri,
            delta_schema,
            add_actions,
            mode,
            partition_cols or [],
            name,
            description,
            configuration,
            storage_options,
            custom_metadata,
        )
    else:
        if mode == "overwrite":
            old_actions = pa.table(table.get_add_actions())
            old_actions_dict = old_actions.to_pydict()
            for i in range(old_actions.num_rows):
                operations.append("DELETE")
                paths.append(old_actions_dict["path"][i])
                rows.append(old_actions_dict["num_records"][i])
                sizes.append(old_actions_dict["size_bytes"][i])

        metadata_param = _create_delta_metadata_param(custom_metadata)
        if parse(deltalake.__version__) < parse("1.0.0"):
            table._table.create_write_transaction(
                add_actions, mode, partition_cols or [], delta_schema, None, metadata_param
            )
        else:
            table._table.create_write_transaction(
                add_actions,
                mode,
                partition_cols or [],
                deltalake.Schema.from_arrow(delta_schema),
                None,
                metadata_param,
            )
        table.update_incremental()

    with_operations = from_pydict(
        {
            "operation": pa.array(operations, type=pa.string()),
            "rows": pa.array(rows, type=pa.int64()),
            "file_size": pa.array(sizes, type=pa.int64()),
            "file_name": pa.array([os.path.basename(fp) for fp in paths], type=pa.string()),
        }
    )
    with_operations._metadata = write_df._metadata
    return with_operations
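
The S3 branch in the source above chooses between DynamoDB-backed locking and unsafe renames. The following is a minimal standalone sketch of that decision, not part of Daft: `resolve_rename_options` is a hypothetical helper written for illustration, and it returns the warning text instead of calling `warnings.warn` so the outcome is easy to inspect.

```python
def resolve_rename_options(scheme, dynamo_table_name=None, allow_unsafe_rename=False):
    """Mirror the storage-option branching shown in write_deltalake (illustrative only)."""
    options = {}
    warning = None
    if scheme in ("s3", "s3a"):
        if dynamo_table_name is not None:
            # A DynamoDB table was supplied, so commits go through the locking provider.
            options["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
            options["DELTA_DYNAMO_TABLE_NAME"] = dynamo_table_name
        else:
            # No lock table: unsafe rename is enabled, with a warning unless opted in.
            options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
            if not allow_unsafe_rename:
                warning = "No DynamoDB table specified for Delta Lake locking. Defaulting to unsafe writes."
    elif scheme == "file" and allow_unsafe_rename:
        options["MOUNT_ALLOW_UNSAFE_RENAME"] = "true"
    return options, warning
```

Calling `resolve_rename_options("s3")` yields the unsafe-rename option together with the warning text, while passing `dynamo_table_name` switches to the DynamoDB locking options and no warning.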

write_huggingface #

write_huggingface(repo: str, split: str = 'train', data_dir: str = 'data', revision: str = 'main', overwrite: bool = False, commit_message: str = 'Upload dataset using Daft', commit_description: str | None = None, io_config: IOConfig | None = None) -> DataFrame

Write a DataFrame into a Hugging Face dataset.

Parameters:

Name	Type	Description	Default
`repo`	`str`	The ID of the repository to push to in the following format: `<user>/<dataset_name>` or `<org>/<dataset_name>`.	required
`split`	`str`	The name of the split that will be given to that dataset.	`'train'`
`data_dir`	`str`	Directory of the uploaded data files.	`'data'`
`revision`	`str`	Branch to push the uploaded files to.	`'main'`
`overwrite`	`bool`	Whether to overwrite or append.	`False`
`commit_message`	`str`	Message to commit while pushing.	`'Upload dataset using Daft'`
`commit_description`	`str \| None`	Description of the commit that will be created.	`None`
`io_config`	`IOConfig \| None`	Configurations to use when interacting with remote storage.	`None`
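
A minimal usage sketch, assuming the `daft` package is installed and Hugging Face credentials are available in the environment. The repository ID `my-org/my-dataset` is a placeholder, and `upload_example` is a hypothetical wrapper; the upload sits inside a function so nothing touches the network until it is called.

```python
def upload_example(repo: str = "my-org/my-dataset"):
    """Push a tiny DataFrame to a Hugging Face dataset repo (placeholder repo ID)."""
    # Imported lazily so the sketch can be loaded without daft installed.
    import daft

    df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
    # Keyword names and defaults are taken from the signature documented above.
    return df.write_huggingface(
        repo,
        split="train",
        revision="main",
        overwrite=False,
        commit_message="Upload dataset using Daft",
    )
```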

Source code in daft/dataframe/dataframe.py

def write_huggingface(
    self,
    repo: str,
    split: str = "train",
    data_dir: str = "data",
    revision: str = "main",
    overwrite: bool = False,
    commit_message: str = "Upload dataset using Daft",
    commit_description: str | None = None,
    io_config: IOConfig | None = None,
) -> "DataFrame":
    """Write a DataFrame into a Hugging Face dataset.

    Args:
        repo: The ID of the repository to push to in the following format: `<user>/<dataset_name>` or `<org>/<dataset_name>`.
        split: The name of the split that will be given to that dataset.
        data_dir: Directory of the uploaded data files.
        revision: Branch to push the uploaded files to.
        overwrite: Whether to overwrite or append.
        commit_message: Message to commit while pushing.
        commit_description: Description of the commit that will be created.
        io_config: Configurations to use when interacting with remote storage.
    """
    from daft.io.huggingface.sink import HuggingFaceSink

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    sink = HuggingFaceSink(
        repo, split, data_dir, revision, overwrite, commit_message, commit_description, io_config.hf
    )
    return self.write_sink(sink)

write_iceberg #

write_iceberg(table: Table, mode: str = 'append', io_config: IOConfig | None = None, snapshot_properties: dict[str, str] | None = None, checkpoint: IdempotentCommit | None = None) -> DataFrame

Writes the DataFrame to an Iceberg table, returning a new DataFrame with the operations that occurred.

Can be run in either append or overwrite mode: append adds the rows in the DataFrame to the table, while overwrite deletes the existing rows and then appends the DataFrame rows.

Parameters:

Name	Type	Description	Default
`table`	`Table`	Destination PyIceberg Table to write dataframe to.	required
`mode`	`str`	Operation mode of the write, either `append` or `overwrite`. Defaults to `append`.	`'append'`
`io_config`	`IOConfig`	A custom IOConfig to use when accessing Iceberg object storage data. If provided, configurations set in `table` are ignored.	`None`
`snapshot_properties`	`dict[str, str]`	Optional snapshot properties to set while writing to the table. Keys with prefix `daft.idempotence-` are reserved.	`None`
`checkpoint`	`IdempotentCommit`	Bundled checkpoint store + idempotence key for an idempotent commit. When provided, the snapshot summary is tagged with `daft.idempotence-key` and retries with the same key recognize the prior attempt without producing a duplicate snapshot. Only `mode='append'` is supported. Requires the Ray runner.	`None`
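
As the table notes, `snapshot_properties` keys that begin with `daft.idempotence-` are reserved. The check can be sketched as a standalone function; `check_reserved_keys` is a hypothetical name used only for illustration.

```python
RESERVED_PREFIX = "daft.idempotence-"


def check_reserved_keys(snapshot_properties):
    """Raise ValueError if any key uses the reserved prefix; otherwise return the input unchanged."""
    for key in snapshot_properties or {}:
        if key.startswith(RESERVED_PREFIX):
            raise ValueError(
                f"snapshot_properties keys with prefix {RESERVED_PREFIX!r} are reserved; got: {key!r}"
            )
    return snapshot_properties
```

Ordinary keys such as `{"owner": "etl"}` pass through untouched, while `{"daft.idempotence-key": "run-1"}` is rejected.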

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	The operations that occurred with this write.
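
The returned DataFrame has one row per file, with `operation`, `rows`, `file_size`, and `file_name` columns (plus a `partitioning` struct column when the table is partitioned). The sketch below summarises such a result once it has been converted to a column-oriented dict, for example with `to_pydict()`. The sample values are invented for illustration and `summarize_operations` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_operations(result):
    """Total file count, rows, and bytes per operation from a column-oriented dict."""
    totals = defaultdict(lambda: {"files": 0, "rows": 0, "bytes": 0})
    for op, rows, size in zip(result["operation"], result["rows"], result["file_size"]):
        totals[op]["files"] += 1
        totals[op]["rows"] += rows
        totals[op]["bytes"] += size
    return dict(totals)


# Invented sample shaped like the result of an overwrite: two new files, one removed.
sample = {
    "operation": ["ADD", "ADD", "DELETE"],
    "rows": [2, 1, 5],
    "file_size": [900, 450, 2100],
    "file_name": ["a.parquet", "b.parquet", "old.parquet"],
}
summary = summarize_operations(sample)
```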

Note

This call is blocking and will execute the DataFrame when called.

When checkpoint is provided and write_iceberg raises after the catalog commit landed (e.g. a transient failure during the post-commit mark_committed bookkeeping), the user data is already durable in Iceberg. The next call with the same IdempotentCommit (same idempotence key) will detect the snapshot via its marker, finish the bookkeeping, and exit cleanly without producing a duplicate snapshot.

The returned DataFrame reflects only this call's writes — empty (0 rows) on a recovery short-circuit, populated when a new snapshot lands. Useful for run-to-run diffing.

Idempotence-key contract — read carefully:

- Same key + different inputs → silent no-op (data loss). The destination already has a snapshot tagged with the key, so nothing new is written.
- Different key + same retry → duplicate snapshot. The destination won't recognize the prior attempt and will commit again. Idempotence is broken.

The orchestrator pattern (run-id supplied from upstream DAG context) avoids both naturally.

Crashed runs leave orphan data files at the warehouse location. Iceberg writes stage data files before the snapshot commit, so files from crashed attempts are not referenced by any snapshot but the bytes remain on disk.
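
The key contract can be made concrete with a toy in-memory model. This is not how Daft or Iceberg implement idempotent commits; `ToyTable` and the `run-42` style keys are hypothetical and exist only to show the two failure modes next to the safe retry.

```python
class ToyTable:
    """In-memory stand-in for a table whose commits carry an idempotence key."""

    def __init__(self):
        self.snapshots = []  # list of (key, rows) tuples, one per commit

    def commit(self, key, rows):
        # A commit tagged with this key already exists: treat the call as a retry.
        if any(existing_key == key for existing_key, _ in self.snapshots):
            return []  # nothing new is written
        self.snapshots.append((key, list(rows)))
        return list(rows)


table = ToyTable()
first = table.commit("run-42", [1, 2, 3])  # new snapshot
retry = table.commit("run-42", [1, 2, 3])  # same key, same input: safe no-op
lost = table.commit("run-42", [4, 5, 6])   # same key, different input: silently dropped
dup = table.commit("run-43", [1, 2, 3])    # different key, same input: duplicate snapshot
```

Deriving the key from a stable run identifier supplied by the orchestrator keeps each logical run on exactly one key, which is why that pattern avoids both failure modes.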

Examples:

>>> import pyiceberg
>>> import daft
>>>
>>> table = pyiceberg.Table(...)
>>> df = daft.from_pydict({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
>>> df = df.write_iceberg(table, mode="overwrite")

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_iceberg(
    self,
    table: "pyiceberg.table.Table",
    mode: str = "append",
    io_config: IOConfig | None = None,
    snapshot_properties: dict[str, str] | None = None,
    checkpoint: "IdempotentCommit | None" = None,
) -> "DataFrame":
    """Writes the DataFrame to an [Iceberg](https://iceberg.apache.org/docs/nightly/) table, returning a new DataFrame with the operations that occurred.

    Can be run in either `append` or `overwrite` mode: `append` adds the rows in the DataFrame to the table, while `overwrite` deletes the existing rows and then appends the DataFrame rows.

    Args:
        table (pyiceberg.table.Table): Destination [PyIceberg Table](https://py.iceberg.apache.org/reference/pyiceberg/table/#pyiceberg.table.Table) to write dataframe to.
        mode (str, optional): Operation mode of the write, either `append` or `overwrite`. Defaults to `append`.
        io_config (IOConfig, optional): A custom IOConfig to use when accessing Iceberg object storage data. If provided, configurations set in `table` are ignored.
        snapshot_properties (dict[str, str], optional): Optional snapshot properties to set while writing to the table. Keys with prefix ``daft.idempotence-`` are reserved.
        checkpoint (IdempotentCommit, optional): Bundled checkpoint store + idempotence key for an idempotent commit. When provided, the snapshot summary is tagged with ``daft.idempotence-key`` and retries with the same key recognize the prior attempt without producing a duplicate snapshot. Only ``mode='append'`` is supported. Requires the Ray runner.

    Returns:
        DataFrame: The operations that occurred with this write.

    Note:
        This call is **blocking** and will execute the DataFrame when called.

        When ``checkpoint`` is provided and ``write_iceberg`` raises
        *after* the catalog commit landed (e.g. a transient failure during
        the post-commit ``mark_committed`` bookkeeping), the user data is
        already durable in Iceberg. The next call with the same
        ``IdempotentCommit`` (same idempotence key) will detect the
        snapshot via its marker, finish the bookkeeping, and exit cleanly
        without producing a duplicate snapshot.

        The returned DataFrame reflects only this call's writes — empty
        (0 rows) on a recovery short-circuit, populated when a new
        snapshot lands. Useful for run-to-run diffing.

        Idempotence-key contract — read carefully:

        - **Same key + different inputs → silent no-op (data loss).** The
          destination already has a snapshot tagged with the key, so
          nothing new is written.
        - **Different key + same retry → duplicate snapshot.** The
          destination won't recognize the prior attempt and will commit
          again. Idempotence is broken.

        The orchestrator pattern (run-id supplied from upstream DAG context)
        avoids both naturally.

        Crashed runs leave orphan data files at the warehouse location.
        Iceberg writes stage data files before the snapshot commit, so
        files from crashed attempts are not referenced by any snapshot
        but the bytes remain on disk.

    Examples:
        >>> import pyiceberg
        >>> import daft
        >>>
        >>> table = pyiceberg.Table(...)  # doctest: +SKIP
        >>> df = daft.from_pydict({"user_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
        >>> df = df.write_iceberg(table, mode="overwrite")  # doctest: +SKIP

    """
    import pyarrow as pa
    import pyiceberg
    from packaging.version import parse

    from daft.io.iceberg._iceberg import _convert_iceberg_file_io_properties_to_io_config

    if len(table.spec().fields) > 0 and parse(pyiceberg.__version__) < parse("0.7.0"):
        raise ValueError("pyiceberg>=0.7.0 is required to write to a partitioned table")

    if parse(pyiceberg.__version__) < parse("0.6.0"):
        raise ValueError(f"Write Iceberg is only supported on pyiceberg>=0.6.0, found {pyiceberg.__version__}")

    # Snapshot properties are only supported on pyiceberg >= 0.7.0. See https://github.com/apache/iceberg-python/issues/367
    if snapshot_properties and parse(pyiceberg.__version__) < parse("0.7.0"):
        raise ValueError("Snapshot properties are only supported on pyiceberg>=0.7.0")

    if mode not in ["append", "overwrite"]:
        raise ValueError(f"Only support `append` or `overwrite` mode. {mode} is unsupported")

    if checkpoint is not None and mode == "overwrite":
        raise NotImplementedError(
            "write_iceberg with checkpoint=... currently supports mode='append' only; "
            "overwrite + checkpoint is tracked separately."
        )

    if checkpoint is not None and parse(pyiceberg.__version__) < parse("0.7.0"):
        raise ValueError("write_iceberg with checkpoint=... requires pyiceberg>=0.7.0")

    if snapshot_properties:
        for key in snapshot_properties:
            if key.startswith("daft.idempotence-"):
                raise ValueError(
                    f"snapshot_properties keys with prefix 'daft.idempotence-' are reserved; got: {key!r}"
                )

    io_config = (
        _convert_iceberg_file_io_properties_to_io_config(table.io.properties, table.location())
        if io_config is None
        else io_config
    )
    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    if checkpoint is not None:
        return self._write_iceberg_with_checkpoint(table, io_config, snapshot_properties, checkpoint)

    operations = []
    path = []
    rows = []
    size = []

    builder = self._builder.write_iceberg(table, io_config)
    write_df = DataFrame(builder)
    write_df.collect()

    write_result = write_df.to_pydict()
    assert "data_file" in write_result
    data_files = write_result["data_file"]

    if mode == "overwrite":
        deleted_files = table.scan().plan_files()
    else:
        deleted_files = []

    schema = table.schema()
    partitioning: dict[str, list[Any]] = {
        schema.find_field(field.source_id).name: [] for field in table.spec().fields
    }

    for data_file in data_files:
        operations.append("ADD")
        path.append(data_file.file_path)
        rows.append(data_file.record_count)
        size.append(data_file.file_size_in_bytes)

        for field in partitioning:
            partitioning[field].append(getattr(data_file.partition, field, None))

    for pf in deleted_files:
        data_file = pf.file
        operations.append("DELETE")
        path.append(data_file.file_path)
        rows.append(data_file.record_count)
        size.append(data_file.file_size_in_bytes)

        for field in partitioning:
            partitioning[field].append(getattr(data_file.partition, field, None))

    if parse(pyiceberg.__version__) >= parse("0.7.0"):
        from pyiceberg.table import ALWAYS_TRUE, TableProperties

        if parse(pyiceberg.__version__) >= parse("0.8.0"):
            from pyiceberg.utils.properties import property_as_bool

            property_as_bool = property_as_bool
        else:
            from pyiceberg.table import PropertyUtil

            property_as_bool = PropertyUtil.property_as_bool

        tx = table.transaction()
        snapshot_properties = snapshot_properties or {}

        if mode == "overwrite":
            tx.delete(delete_filter=ALWAYS_TRUE, snapshot_properties=snapshot_properties)

        update_snapshot = tx.update_snapshot(snapshot_properties=snapshot_properties)

        manifest_merge_enabled = mode == "append" and property_as_bool(
            tx.table_metadata.properties,
            TableProperties.MANIFEST_MERGE_ENABLED,
            TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
        )

        append_method = update_snapshot.merge_append if manifest_merge_enabled else update_snapshot.fast_append

        with append_method() as append_files:
            for data_file in data_files:
                append_files.append_data_file(data_file)

        tx.commit_transaction()
    else:
        from pyiceberg.table import _MergingSnapshotProducer
        from pyiceberg.table.snapshots import Operation

        operations_map = {
            "append": Operation.APPEND,
            "overwrite": Operation.OVERWRITE,
        }

        merge = _MergingSnapshotProducer(operation=operations_map[mode], table=table)

        for data_file in data_files:
            merge.append_data_file(data_file)

        merge.commit()

    with_operations = {
        "operation": pa.array(operations, type=pa.string()),
        "rows": pa.array(rows, type=pa.int64()),
        "file_size": pa.array(size, type=pa.int64()),
        "file_name": pa.array([fp for fp in path], type=pa.string()),
    }

    if partitioning:
        with_operations["partitioning"] = pa.StructArray.from_arrays(
            partitioning.values(), names=partitioning.keys()
        )

    from daft import from_pydict

    # NOTE: We are losing the history of the plan here.
    # This is due to the fact that the logical plan of the write_iceberg returns datafiles but we want to return the above data
    df = from_pydict(with_operations)
    df._metadata = write_df._metadata
    return df

write_json #

write_json(root_dir: str | Path, write_mode: Literal['append', 'overwrite', 'overwrite-partitions'] = 'append', partition_cols: list[ColumnInputType] | None = None, io_config: IOConfig | None = None, ignore_null_fields: bool | None = False, date_format: str | None = None, timestamp_format: str | None = None) -> DataFrame

Writes the DataFrame as JSON files, returning a new DataFrame with paths to the files that were written.

Files will be written to <root_dir>/* with randomly generated UUIDs as the file names.

Parameters:

Name	Type	Description	Default
`root_dir`	`str`	root file path to write JSON files to.	required
`write_mode`	`str`	Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".	`'append'`
`partition_cols`	`Optional[List[ColumnInputType]]`	How to subpartition each partition further. Defaults to None.	`None`
`io_config`	`Optional[IOConfig]`	configurations to use when interacting with remote storage.	`None`
`ignore_null_fields`	`Optional[bool]`	Whether to ignore fields with null values when writing JSON. Defaults to False.	`False`
`date_format`	`Optional[str]`	Format string for date columns. Uses chrono strftime format (e.g., "%Y-%m-%d", "%d/%m/%Y"). Defaults to None (ISO 8601 format).	`None`
`timestamp_format`	`Optional[str]`	Format string for timestamp columns. Uses chrono strftime format (e.g., "%Y-%m-%d %H:%M:%S", "%+"). Defaults to None (ISO 8601 format).	`None`
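
The constraints on `write_mode` can be sketched as a standalone check that mirrors the validation at the top of the function body shown in the source listing. `validate_write_mode` is a hypothetical name used for illustration.

```python
VALID_WRITE_MODES = ("append", "overwrite", "overwrite-partitions")


def validate_write_mode(write_mode, partition_cols=None):
    """Reject unknown modes, and overwrite-partitions without partition columns."""
    if write_mode not in VALID_WRITE_MODES:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")
    return write_mode
```

In other words, `overwrite-partitions` is only meaningful together with `partition_cols`, while `append` and `overwrite` work with or without them.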

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	The filenames that were written out as strings.

Note

This call is blocking and will execute the DataFrame when called.

Timezone handling: For timezone-aware timestamp columns, the timestamps are converted to the target timezone before formatting. For example, a timestamp stored as UTC but with timezone "America/New_York" will be formatted in Eastern Time, not UTC. If the timezone string is invalid, an error will be raised.
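
The timezone rule can be illustrated with the standard library alone. This sketch makes two assumptions: a fixed UTC-05:00 offset stands in for "America/New_York" in January (to avoid depending on a timezone database), and Python's `strftime` stands in for the chrono formatter.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for America/New_York during standard time (assumption: fixed UTC-05:00).
eastern_standard = timezone(timedelta(hours=-5))

# The instant as stored, expressed in UTC.
stored_utc = datetime(2024, 1, 15, 15, 30, 45, tzinfo=timezone.utc)

# Convert to the column's timezone first, then format.
local = stored_utc.astimezone(eastern_standard)
formatted = local.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2024-01-15 10:30:45
```

The formatted string shows the local wall-clock time (10:30:45), not the UTC time (15:30:45).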

Examples:

Basic usage:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_json("output_dir", write_mode="overwrite")

Custom date format (e.g., DD/MM/YYYY):

>>> import datetime
>>> df = daft.from_pydict({"date": [datetime.date(2024, 1, 15)]})
>>> df.write_json("output_dir", date_format="%d/%m/%Y")

# Output: "15/01/2024"

Custom timestamp format:

>>> df = daft.from_pydict({"ts": [datetime.datetime(2024, 1, 15, 10, 30, 45)]})
>>> df.write_json("output_dir", timestamp_format="%Y-%m-%d %H:%M:%S")

# Output: "2024-01-15 10:30:45"

ISO 8601 / RFC 3339 timestamp format:

>>> df.write_json("output_dir", timestamp_format="%+")

# Output: "2024-01-15T10:30:45+00:00"
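
The `%d/%m/%Y` and `%Y-%m-%d %H:%M:%S` format strings above use specifiers that Python's `strftime` also understands, so their output can be previewed without writing any files. This is only an approximation of the chrono formatter; `%+` is omitted because it is not a documented Python `strftime` directive.

```python
import datetime

# Same values and format strings as the examples above, formatted with Python's strftime.
date_preview = datetime.date(2024, 1, 15).strftime("%d/%m/%Y")
ts_preview = datetime.datetime(2024, 1, 15, 10, 30, 45).strftime("%Y-%m-%d %H:%M:%S")

print(date_preview)  # 15/01/2024
print(ts_preview)    # 2024-01-15 10:30:45
```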

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_json(
    self,
    root_dir: str | pathlib.Path,
    write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
    partition_cols: list[ColumnInputType] | None = None,
    io_config: IOConfig | None = None,
    ignore_null_fields: bool | None = False,
    date_format: str | None = None,
    timestamp_format: str | None = None,
) -> "DataFrame":
    """Writes the DataFrame as JSON files, returning a new DataFrame with paths to the files that were written.

    Files will be written to `<root_dir>/*` with randomly generated UUIDs as the file names.

    Args:
        root_dir (str): root file path to write JSON files to.
        write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
        partition_cols (Optional[List[ColumnInputType]], optional): How to subpartition each partition further. Defaults to None.
        io_config (Optional[IOConfig], optional): configurations to use when interacting with remote storage.
        ignore_null_fields (Optional[bool], optional): Whether to ignore fields with null values when writing JSON. Defaults to False.
        date_format (Optional[str], optional): Format string for date columns. Uses chrono strftime format (e.g., "%Y-%m-%d", "%d/%m/%Y"). Defaults to None (ISO 8601 format).
        timestamp_format (Optional[str], optional): Format string for timestamp columns. Uses chrono strftime format (e.g., "%Y-%m-%d %H:%M:%S", "%+"). Defaults to None (ISO 8601 format).

    Returns:
        DataFrame: The filenames that were written out as strings.

    Note:
        This call is **blocking** and will execute the DataFrame when called.

    **Timezone handling**: For timezone-aware timestamp columns, the timestamps are converted
    to the target timezone before formatting. For example, a timestamp stored as UTC but with
    timezone "America/New_York" will be formatted in Eastern Time, not UTC. If the timezone
    string is invalid, an error will be raised.

    Examples:
        Basic usage:

        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_json("output_dir", write_mode="overwrite")  # doctest: +SKIP

        Custom date format (e.g., DD/MM/YYYY):

        >>> import datetime
        >>> df = daft.from_pydict({"date": [datetime.date(2024, 1, 15)]})
        >>> df.write_json("output_dir", date_format="%d/%m/%Y")  # doctest: +SKIP
        # Output: "15/01/2024"

        Custom timestamp format:

        >>> df = daft.from_pydict({"ts": [datetime.datetime(2024, 1, 15, 10, 30, 45)]})
        >>> df.write_json("output_dir", timestamp_format="%Y-%m-%d %H:%M:%S")  # doctest: +SKIP
        # Output: "2024-01-15 10:30:45"

        ISO 8601 / RFC 3339 timestamp format:

        >>> df.write_json("output_dir", timestamp_format="%+")  # doctest: +SKIP
        # Output: "2024-01-15T10:30:45+00:00"
    """
    if write_mode not in ["append", "overwrite", "overwrite-partitions"]:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    cols: list[Expression] | None = None
    if partition_cols is not None:
        cols = column_inputs_to_expressions(tuple(partition_cols))

    file_format_option = PyFormatSinkOption.json(
        ignore_null_fields=ignore_null_fields,
        date_format=date_format,
        timestamp_format=timestamp_format,
    )
    builder = self._builder.write_tabular(
        root_dir=root_dir,
        partition_cols=cols,
        write_mode=WriteMode.from_str(write_mode),
        file_format=FileFormat.Json,
        file_format_option=file_format_option,
        io_config=io_config,
    )
    # Block and write, then retrieve data
    write_df = DataFrame(builder)
    write_df.collect()
    assert write_df._result is not None

    # Populate and return a new disconnected DataFrame
    # Keep the original logical plan so explain() can still show upstream operators
    # (e.g. filters/projections before the write), instead of collapsing to an
    # in-memory source after collect() caches the result.
    result_df = DataFrame(write_df._get_current_builder())
    result_df._result_cache = write_df._result_cache
    result_df._preview = write_df._preview
    result_df._metadata = write_df._metadata
    return result_df

write_lance #

write_lance(uri: str | Path, mode: Literal['create', 'append', 'overwrite', 'merge'] = 'create', io_config: IOConfig | None = None, schema: Union[Schema, pyarrow.Schema] | None = None, left_on: str | None = None, right_on: str | None = None, **kwargs: Any) -> DataFrame

Writes the DataFrame to a Lance table.

Parameters:

Name	Type	Description	Default
`uri`	`str \| Path`	The URI of the Lance table to write to. Accepts a local path or an object-store URI like "s3://bucket/path".	required
`mode`	`Literal['create', 'append', 'overwrite', 'merge']`	The write mode. One of "create", "append", "overwrite", or "merge".	`'create'`
`io_config`	`IOConfig`	configurations to use when interacting with remote storage.	`None`
`schema`	`Schema \| pyarrow.Schema`	Desired schema to enforce during write. If omitted, Daft will use the DataFrame's current schema. If a pyarrow.Schema is provided, Daft will enforce the field order, types, and nullability by casting the data to the provided schema prior to write. Table-level (dataset) metadata present on the pyarrow schema is preserved during create/overwrite. If the target Lance dataset already exists, the data will be cast to the existing table schema to ensure compatibility unless `mode="overwrite"`.	`None`
`left_on/right_on`	`Optional[str]`	Only supported in `mode="merge"`. Specify the join key for aligning rows when merging new columns. If omitted, defaults to `"_rowaddr"`. If `right_on` is omitted, it defaults to the value of `left_on`. The DataFrame passed to `write_lance(mode="merge")` must contain `fragment_id` and the join key column specified by `right_on` (or `_rowaddr` by default).	`None`
`**kwargs`	`Any`	Additional keyword arguments to pass to the Lance writer.	`{}`
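
The defaulting rule for the merge keys can be sketched as follows. This is one reading of the description above, not Daft's code: `resolve_merge_keys` is a hypothetical helper, and the case where only `right_on` is given is an assumption about how the two defaults combine.

```python
def resolve_merge_keys(left_on=None, right_on=None):
    """Apply the documented defaults: left_on -> "_rowaddr", right_on -> left_on."""
    left = left_on if left_on is not None else "_rowaddr"
    # Assumption: right_on falls back to the resolved left key, including the default.
    right = right_on if right_on is not None else left
    return left, right
```

With no arguments both keys resolve to `"_rowaddr"`; passing only `left_on="id"` resolves both to `"id"`.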

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A DataFrame containing metadata about the written Lance table, such as number of fragments, number of deleted rows, number of small files, and version.

Raises:

Type	Description
`TypeError`	If `schema` is provided but not a Daft Schema or a pyarrow.Schema
`ValueError`	When appending and the data schema cannot be cast to the existing table schema

Examples:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3, 4]})
>>> df.write_lance("/tmp/lance/my_table.lance")
╭───────────────┬──────────────────┬─────────────────┬─────────╮
│ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
│ ---           ┆ ---              ┆ ---             ┆ ---     │
│ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
╞═══════════════╪══════════════════╪═════════════════╪═════════╡
│ 1             ┆ 0                ┆ 1               ┆ 1       │
╰───────────────┴──────────────────┴─────────────────┴─────────╯
(Showing first 1 of 1 rows)
>>> daft.read_lance("/tmp/lance/my_table.lance").collect()
╭───────╮
│ a     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
├╌╌╌╌╌╌╌┤
│ 4     │
╰───────╯
(Showing first 4 of 4 rows)
>>> # Pass additional keyword arguments to the Lance writer
>>> # All additional keyword arguments are passed to `lance.write_fragments`
>>> df.write_lance("/tmp/lance/my_table.lance", mode="overwrite", max_bytes_per_file=1024)
╭───────────────┬──────────────────┬─────────────────┬─────────╮
│ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
│ ---           ┆ ---              ┆ ---             ┆ ---     │
│ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
╞═══════════════╪══════════════════╪═════════════════╪═════════╡
│ 1             ┆ 0                ┆ 1               ┆ 2       │
╰───────────────┴──────────────────┴─────────────────┴─────────╯
(Showing first 1 of 1 rows)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_lance(
    self,
    uri: str | pathlib.Path,
    mode: Literal["create", "append", "overwrite", "merge"] = "create",
    io_config: IOConfig | None = None,
    schema: Union[Schema, "pyarrow.Schema"] | None = None,
    left_on: str | None = None,
    right_on: str | None = None,
    **kwargs: Any,
) -> "DataFrame":
    """Writes the DataFrame to a Lance table.

    Args:
      uri: The URI of the Lance table to write to. Accepts a local path or an
        object-store URI like "s3://bucket/path".
      mode: The write mode. One of "create", "append", "overwrite", or "merge".
      - "create" will create the dataset if it does not exist, otherwise raise an error.
      - "append" will append to the existing dataset if it exists, otherwise raise an error.
      - "overwrite" will overwrite the existing dataset if it exists, otherwise raise an error.
      - "merge" will add new columns to the existing dataset.
      io_config (IOConfig, optional): configurations to use when interacting with remote storage.
      schema (Schema | pyarrow.Schema, optional): Desired schema to enforce during write.
        - If omitted, Daft will use the DataFrame's current schema.
        - If a pyarrow.Schema is provided, Daft will enforce the field order, types, and nullability
          by casting the data to the provided schema prior to write. Table-level (dataset) metadata present
          on the pyarrow schema is preserved during create/overwrite.
        - If the target Lance dataset already exists, the data will be cast to the existing table schema
          to ensure compatibility unless ``mode="overwrite"``.
      left_on/right_on (Optional[str]): Only supported in ``mode="merge"``. Specify the join key for aligning rows when merging new columns.
          - If omitted, defaults to ``"_rowaddr"``.
          - If ``right_on`` is omitted, it defaults to the value of ``left_on``.
          - The DataFrame passed to ``write_lance(mode="merge")`` must contain ``fragment_id`` and the join key column specified by ``right_on`` (or ``_rowaddr`` by default).
      **kwargs: Additional keyword arguments to pass to the Lance writer.

    Returns:
        DataFrame: A DataFrame containing metadata about the written Lance table, such as number of fragments, number of deleted rows, number of small files, and version.

    Raises:
        TypeError: If ``schema`` is provided but not a Daft Schema or a pyarrow.Schema
        ValueError: When appending and the data schema cannot be cast to the existing table schema

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"a": [1, 2, 3, 4]})
        >>> df.write_lance("/tmp/lance/my_table.lance")  # doctest: +SKIP
        ╭───────────────┬──────────────────┬─────────────────┬─────────╮
        │ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
        │ ---           ┆ ---              ┆ ---             ┆ ---     │
        │ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
        ╞═══════════════╪══════════════════╪═════════════════╪═════════╡
        │ 1             ┆ 0                ┆ 1               ┆ 1       │
        ╰───────────────┴──────────────────┴─────────────────┴─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
        >>> daft.read_lance("/tmp/lance/my_table.lance").collect()  # doctest: +SKIP
        ╭───────╮
        │ a     │
        │ ---   │
        │ Int64 │
        ╞═══════╡
        │ 1     │
        ├╌╌╌╌╌╌╌┤
        │ 2     │
        ├╌╌╌╌╌╌╌┤
        │ 3     │
        ├╌╌╌╌╌╌╌┤
        │ 4     │
        ╰───────╯
        <BLANKLINE>
        (Showing first 4 of 4 rows)
        >>> # Pass additional keyword arguments to the Lance writer
        >>> # All additional keyword arguments are passed to `lance.write_fragments`
        >>> df.write_lance("/tmp/lance/my_table.lance", mode="overwrite", max_bytes_per_file=1024)  # doctest: +SKIP
        ╭───────────────┬──────────────────┬─────────────────┬─────────╮
        │ num_fragments ┆ num_deleted_rows ┆ num_small_files ┆ version │
        │ ---           ┆ ---              ┆ ---             ┆ ---     │
        │ Int64         ┆ Int64            ┆ Int64           ┆ Int64   │
        ╞═══════════════╪══════════════════╪═════════════════╪═════════╡
        │ 1             ┆ 0                ┆ 1               ┆ 2       │
        ╰───────────────┴──────────────────┴─────────────────┴─────────╯
        <BLANKLINE>
        (Showing first 1 of 1 rows)
    """
    from daft import context as _context
    from daft.io.lance.lance_data_sink import LanceDataSink
    from daft.io.object_store_options import io_config_to_storage_options

    if schema is None:
        schema = self.schema()

    uri_str = str(uri)
    if uri_str.startswith("rest://"):
        raise ValueError(
            "rest:// Lance URIs are no longer supported by DataFrame.write_lance. "
            "The previous REST-namespace integration did not match the real "
            "lance-namespace API and has been removed."
        )

    # Non-merge modes do not support schema evolution or custom join keys
    if mode != "merge":
        sanitized_kwargs = {k: v for k, v in kwargs.items() if k not in ("left_on", "right_on")}
        sink = LanceDataSink(uri, schema, mode, io_config, **sanitized_kwargs)
        return self.write_sink(sink)

    # Merge mode semantics
    try:
        import lance
    except ImportError as e:
        raise ImportError(
            "Unable to import the `lance` package, please ensure that Daft is installed with the lance extra dependency: `pip install daft[lance]`"
        ) from e

    io_config = _context.get_context().daft_planning_config.default_io_config if io_config is None else io_config
    storage_options = io_config_to_storage_options(io_config, str(uri) if isinstance(uri, pathlib.Path) else uri)

    # Attempt to load dataset; if not exists, behave like create
    lance_ds = None
    try:
        lance_ds = lance.dataset(uri, storage_options=storage_options)
    except (ValueError, FileNotFoundError, OSError) as _e:
        lance_ds = None

    if lance_ds is None:
        sanitized_kwargs = {k: v for k, v in kwargs.items() if k not in ("left_on", "right_on")}
        sink = LanceDataSink(uri, schema, "create", io_config, **sanitized_kwargs)
        return self.write_sink(sink)

    # Dataset exists: detect schema evolution by checking new columns in incoming DF
    existing_fields: set[str] = set()
    try:
        existing_fields = {getattr(f, "name", str(f)) for f in lance_ds.schema}
    except Exception:
        names = []
        try:
            names = list(getattr(lance_ds.schema, "names", []))
        except Exception:
            try:
                names = [getattr(f, "name", str(f)) for f in getattr(lance_ds.schema, "fields", [])]
            except Exception:
                names = []
        existing_fields = set(names)

    meta_exclusions = {"fragment_id", "_rowaddr", "_rowid"}
    new_cols = [c for c in self.column_names if c not in existing_fields and c not in meta_exclusions]

    if len(new_cols) == 0:
        # Pure append: no schema evolution. Ensure merge-specific params are not forwarded.
        sanitized_kwargs = {k: v for k, v in kwargs.items() if k not in ("left_on", "right_on")}

        sink = LanceDataSink(uri, schema, "append", io_config, **sanitized_kwargs)
        return self.write_sink(sink)

    # Schema evolution: route to per-fragment merge keyed by provided business key or default '_rowaddr'
    join_left = left_on or "_rowaddr"
    join_right = right_on or join_left
    if "fragment_id" not in self.column_names:
        raise ValueError(
            "DataFrame must contain 'fragment_id' column for per-fragment merge in mode='merge'. Read from Lance to include 'fragment_id'."
        )
    if join_right not in self.column_names:
        hint = (
            " Read from Lance with default_scan_options={'with_rowaddr': True} to include '_rowaddr'."
            if join_right == "_rowaddr"
            else ""
        )
        raise ValueError(
            f"DataFrame must contain join key column '{join_right}' for per-fragment merge in mode='merge'." + hint
        )

    from daft.io.lance.lance_merge_column import merge_columns_from_df

    merge_columns_from_df(
        df=self,
        lance_ds=lance_ds,
        uri=uri,
        left_on=join_left,
        right_on=join_right,
        storage_options=storage_options,
    )

    # Build and return stats DataFrame similar to sink.finalize
    dataset = lance.dataset(uri, storage_options=storage_options)
    stats = dataset.stats.dataset_stats()
    from daft.dependencies import pa as _pa
    from daft.recordbatch import MicroPartition

    return DataFrame._from_micropartitions(
        MicroPartition.from_pydict(
            {
                "num_fragments": _pa.array([stats["num_fragments"]], type=_pa.int64()),
                "num_deleted_rows": _pa.array([stats["num_deleted_rows"]], type=_pa.int64()),
                "num_small_files": _pa.array([stats["num_small_files"]], type=_pa.int64()),
                "version": _pa.array([dataset.version], type=_pa.int64()),
            }
        )
    )
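
The `mode="merge"` branch in the listing above takes one of three paths, depending on whether the dataset exists and whether the incoming DataFrame brings new columns. The following is a minimal sketch of that decision only; `route_merge_write` and its return labels are hypothetical names, and the real method performs the corresponding write rather than returning a string.

```python
from __future__ import annotations

# Lance metadata columns; the listing above never counts these as new columns.
META_COLUMNS = {"fragment_id", "_rowaddr", "_rowid"}


def route_merge_write(df_columns: list[str], existing_columns: set[str] | None) -> str:
    """Hypothetical helper: name the path taken by mode="merge".

    `existing_columns` is None when no dataset can be opened at the target URI.
    """
    if existing_columns is None:
        # No dataset yet: behave like mode="create".
        return "create"
    new_cols = [c for c in df_columns if c not in existing_columns and c not in META_COLUMNS]
    if not new_cols:
        # Same columns as the table: a plain append, no schema evolution.
        return "append"
    # New columns present: per-fragment column merge keyed by the join column.
    return "merge_columns"


print(route_merge_write(["a"], None))                                      # create
print(route_merge_write(["a", "_rowaddr"], {"a"}))                         # append
print(route_merge_write(["fragment_id", "_rowaddr", "a", "b"], {"a"}))     # merge_columns
```

Only the last path requires `fragment_id` and the join key column to be present in the DataFrame.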

write_paimon #

write_paimon(table: Table, mode: str = 'append') -> DataFrame

Writes the DataFrame to an Apache Paimon table, returning a summary DataFrame.

Parameters:

Name	Type	Description	Default
`table`	`Table`	Destination Paimon table obtained via `pypaimon.CatalogFactory.create(options).get_table(identifier)`.	required
`mode`	`str`	Write mode – `"append"` adds new data, `"overwrite"` replaces existing data. Defaults to `"append"`.	`'append'`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A summary DataFrame with columns `operation`, `rows`, `file_size`, and `file_name` describing each written file.

Note

This call is blocking and will execute the DataFrame when called.

Examples:

>>> import pypaimon, daft
>>>
>>> catalog = pypaimon.CatalogFactory.create({"warehouse": "/tmp/warehouse"})
>>> table = catalog.get_table("mydb.mytable")
>>> df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})
>>> df.write_paimon(table)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_paimon(
    self,
    table: "pypaimon.table.Table",
    mode: str = "append",
) -> "DataFrame":
    """Writes the DataFrame to an Apache Paimon table, returning a summary DataFrame.

    Args:
        table (pypaimon.table.Table): Destination Paimon table obtained via
            ``pypaimon.CatalogFactory.create(options).get_table(identifier)``.
        mode (str, optional): Write mode – ``"append"`` adds new data,
            ``"overwrite"`` replaces existing data. Defaults to ``"append"``.

    Returns:
        DataFrame: A summary DataFrame with columns ``operation``, ``rows``,
        ``file_size``, and ``file_name`` describing each written file.

    Note:
        This call is **blocking** and will execute the DataFrame when called.

    Examples:
        >>> import pypaimon, daft  # doctest: +SKIP
        >>>
        >>> catalog = pypaimon.CatalogFactory.create({"warehouse": "/tmp/warehouse"})  # doctest: +SKIP
        >>> table = catalog.get_table("mydb.mytable")  # doctest: +SKIP
        >>> df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})  # doctest: +SKIP
        >>> df.write_paimon(table)  # doctest: +SKIP
    """
    try:
        import pypaimon  # noqa: F401
    except ImportError:
        raise ImportError("pypaimon is required to use write_paimon. Install it with: `pip install pypaimon`")

    from daft.io.paimon import PaimonDataSink

    return self.write_sink(PaimonDataSink(table, mode))

write_parquet #

write_parquet(root_dir: str | Path, compression: str = 'snappy', write_mode: Literal['append', 'overwrite', 'overwrite-partitions'] = 'append', write_success_file: bool = False, partition_cols: list[ColumnInputType] | None = None, io_config: IOConfig | None = None, column_compression: dict[str, str] | None = None) -> DataFrame

Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.

Files will be written to <root_dir>/* with randomly generated UUIDs as the file names.

Parameters:

Name	Type	Description	Default
`root_dir`	`str \| Path`	root file path to write parquet files to.	required
`compression`	`str`	default compression codec applied to every column. Defaults to "snappy". Accepts "snappy", "gzip", "zstd", "lz4", "lz4_raw", "brotli", "uncompressed", or "none" (case-insensitive).	`'snappy'`
`write_mode`	`str`	Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".	`'append'`
`write_success_file`	`bool`	Whether to write a `_SUCCESS` file upon successful completion. Defaults to False.	`False`
`partition_cols`	`Optional[List[ColumnInputType]]`	How to subpartition each partition further. Defaults to None.	`None`
`io_config`	`Optional[IOConfig]`	configurations to use when interacting with remote storage.	`None`
`column_compression`	`Optional[Dict[str, str]]`	per-column compression overrides. Keys are dot-separated column paths (e.g. `"user.name"` for a nested struct field); values are codec names accepted by `compression`. Columns not listed fall back to `compression`. Defaults to None.	`None`
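
The interaction between `compression` and `column_compression` can be modelled as a lookup with a fallback. This is an illustrative sketch only: `resolve_codec` is a hypothetical helper, not part of Daft, and lowercasing the codec name is an assumption based on the "case-insensitive" note above.

```python
from __future__ import annotations

# Codec names listed in the `compression` description above.
SUPPORTED_CODECS = {"snappy", "gzip", "zstd", "lz4", "lz4_raw", "brotli", "uncompressed", "none"}


def resolve_codec(
    column_path: str,
    compression: str = "snappy",
    column_compression: dict[str, str] | None = None,
) -> str:
    """Hypothetical helper: pick the codec for one dot-separated column path."""
    # A per-column override wins; unlisted columns fall back to `compression`.
    codec = (column_compression or {}).get(column_path, compression)
    normalized = codec.lower()
    if normalized not in SUPPORTED_CODECS:
        raise ValueError(f"Unsupported codec: {codec!r}")
    return normalized


overrides = {"user.name": "ZSTD", "payload": "gzip"}
print(resolve_codec("user.name", "snappy", overrides))  # zstd
print(resolve_codec("user.id", "snappy", overrides))    # snappy
```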

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	The filenames that were written out as strings.

Note

This call is blocking and will execute the DataFrame when called.

Examples:

>>> import daft
>>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
>>> df.write_parquet("output_dir", write_mode="overwrite")

Tip

See also df.write_csv() and df.write_json() for other formats for writing DataFrames.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_parquet(
    self,
    root_dir: str | pathlib.Path,
    compression: str = "snappy",
    write_mode: Literal["append", "overwrite", "overwrite-partitions"] = "append",
    write_success_file: bool = False,
    partition_cols: list[ColumnInputType] | None = None,
    io_config: IOConfig | None = None,
    column_compression: dict[str, str] | None = None,
) -> "DataFrame":
    """Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.

    Files will be written to `<root_dir>/*` with randomly generated UUIDs as the file names.

    Args:
        root_dir (str): root file path to write parquet files to.
        compression (str, optional): default compression codec applied to every column. Defaults to "snappy". Accepts "snappy", "gzip", "zstd", "lz4", "lz4_raw", "brotli", "uncompressed", or "none" (case-insensitive).
        write_mode (str, optional): Operation mode of the write. `append` will add new data, `overwrite` will replace the contents of the root directory with new data. `overwrite-partitions` will replace only the contents in the partitions that are being written to. Defaults to "append".
        write_success_file (bool, optional): Whether to write a `_SUCCESS` file upon successful completion. Defaults to False.
        partition_cols (Optional[List[ColumnInputType]], optional): How to subpartition each partition further. Defaults to None.
        io_config (Optional[IOConfig], optional): configurations to use when interacting with remote storage.
        column_compression (Optional[Dict[str, str]], optional): per-column compression overrides. Keys are dot-separated column paths (e.g. `"user.name"` for a nested struct field); values are codec names accepted by `compression`. Columns not listed fall back to `compression`. Defaults to None.

    Returns:
        DataFrame: The filenames that were written out as strings.

    Note:
        This call is **blocking** and will execute the DataFrame when called

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        >>> df.write_parquet("output_dir", write_mode="overwrite")  # doctest: +SKIP

    Tip:
        See also [`df.write_csv()`][daft.DataFrame.write_csv] and [`df.write_json()`][daft.DataFrame.write_json]
        Other formats for writing DataFrames
    """
    if write_mode not in ["append", "overwrite", "overwrite-partitions"]:
        raise ValueError(
            f"Only support `append`, `overwrite`, or `overwrite-partitions` mode. {write_mode} is unsupported"
        )
    if write_mode == "overwrite-partitions" and partition_cols is None:
        raise ValueError("Partition columns must be specified to use `overwrite-partitions` mode.")

    io_config = get_context().daft_planning_config.default_io_config if io_config is None else io_config

    cols: list[Expression] | None = None
    if partition_cols is not None:
        cols = column_inputs_to_expressions(tuple(partition_cols))

    file_format_option: PyFormatSinkOption | None = None
    if column_compression:
        file_format_option = PyFormatSinkOption.parquet(
            column_compression=list(column_compression.items()),
        )

    builder = self._builder.write_tabular(
        root_dir=root_dir,
        partition_cols=cols,
        write_mode=WriteMode.from_str(write_mode),
        write_success_file=write_success_file,
        file_format=FileFormat.Parquet,
        file_format_option=file_format_option,
        compression=compression,
        io_config=io_config,
    )
    # Block and write, then retrieve data
    write_df = DataFrame(builder)
    write_df.collect()
    assert write_df._result is not None

    # Populate and return a new disconnected DataFrame
    # Keep the original logical plan so explain() can still show upstream operators
    # (e.g. filters/projections before the write), instead of collapsing to an
    # in-memory source after collect() caches the result.
    result_df = DataFrame(write_df._get_current_builder())
    result_df._result_cache = write_df._result_cache
    result_df._preview = write_df._preview
    result_df._metadata = write_df._metadata
    return result_df

write_sink #

write_sink(sink: DataSink[WriteResultType]) -> DataFrame

Writes the DataFrame to the given DataSink.

Parameters:

Name	Type	Description	Default
`sink`	`DataSink[WriteResultType]`	The DataSink to write to.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A dataframe from the micropartition returned by the DataSink's `.finalize()` method.
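
The lifecycle that produces this result (start, per-partition writes, then a single finalize over the collected write results) can be illustrated with a toy stand-in. Every name below (`ListSink`, `run_sink`, the `rows_written` column) is hypothetical and does not reflect Daft's actual `DataSink` interface; the sketch also omits the schema comparison that `write_sink` performs on the finalized result.

```python
from __future__ import annotations


class ListSink:
    """Toy stand-in for a data sink; names are illustrative, not Daft's API."""

    def __init__(self) -> None:
        self.started = False
        self.rows: list[dict] = []

    def start(self) -> None:
        # One-time setup before any data is written.
        self.started = True

    def write(self, partition: list[dict]) -> int:
        # Persist one partition and return a per-partition write result.
        self.rows.extend(partition)
        return len(partition)

    def finalize(self, write_results: list[int]) -> dict[str, list[int]]:
        # Combine the per-partition results into one summary table.
        return {"rows_written": [sum(write_results)]}


def run_sink(partitions: list[list[dict]], sink: ListSink) -> dict[str, list[int]]:
    """Drive the sink through start -> write (per partition) -> finalize."""
    sink.start()
    write_results = [sink.write(p) for p in partitions]
    return sink.finalize(write_results)


summary = run_sink([[{"a": 1}, {"a": 2}], [{"a": 3}]], ListSink())
print(summary)  # {'rows_written': [3]}
```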

Note

This call is blocking and will execute the DataFrame when called.

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_sink(self, sink: "DataSink[WriteResultType]") -> "DataFrame":
    """Writes the DataFrame to the given DataSink.

    Args:
        sink: The DataSink to write to.

    Returns:
        DataFrame: A dataframe from the micropartition returned by the DataSink's `.finalize()` method.

    Note:
        This call is **blocking** and will execute the DataFrame when called
    """
    sink.start()

    builder = self._builder.write_datasink(sink.name(), sink)
    write_df = DataFrame(builder)
    write_df.collect()

    results = write_df.to_pydict()
    assert "write_results" in results
    micropartition = sink.finalize(results["write_results"])
    if micropartition.schema() != sink.schema():
        raise ValueError(
            f"Schema mismatch between the data sink's schema and the result's schema:\nSink schema:\n{sink.schema()}\nResult schema:\n{micropartition.schema()}"
        )
    # TODO(desmond): Connect the old and new logical plan builders so that a .explain() shows the
    # plan from the source all the way to the sink to the sink's results. In theory we can do this
    # for all other sinks too.
    df = DataFrame._from_micropartitions(micropartition)
    df._metadata = write_df._metadata
    return df

write_sql #

write_sql(table_name: str, conn: str | Callable[[], Connection], write_mode: Literal['append', 'overwrite', 'fail'] = 'append', column_types: dict[str, Any] | None = None, non_primitive_handling: Literal['bytes', 'str', 'error'] | None = None) -> DataFrame

Write the DataFrame to a SQL database and return write metrics.

The write is executed via daft.DataFrame.write_sink using an internal daft.io._sql.SQLDataSink.

Primitive columns (ints, floats, bools, strings, binary, dates, timestamps) are written by converting to a pandas DataFrame and calling pandas.DataFrame.to_sql, letting SQLAlchemy or column_types choose concrete SQL types.

Non-primitive columns (lists, structs, maps, tensors, images, embeddings, python objects, etc.) are normalized according to non_primitive_handling (default None behaves like "str"): "str" serializes values to text (JSON for arrays/maps and other containers, str(..) otherwise), "bytes" writes UTF-8 bytes of that text, and "error" fails if such columns are present.
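
The three handling modes can be modelled on a single value as follows. This is a sketch under stated assumptions, not Daft's implementation: `normalize_non_primitive` is a hypothetical helper, it assumes containers hold JSON-serializable values, and it raises per value, whereas the documented `"error"` mode fails when such columns are present at all.

```python
from __future__ import annotations

import json
from typing import Any


def normalize_non_primitive(value: Any, handling: str | None = None) -> str | bytes:
    """Hypothetical helper modelling the documented handling modes for one value."""
    mode = handling or "str"  # None behaves like "str"
    if mode == "error":
        raise TypeError("non-primitive value encountered with handling='error'")
    if mode not in ("str", "bytes"):
        raise ValueError(f"Unknown handling mode: {handling!r}")
    # Containers become JSON text; anything else falls back to str().
    if isinstance(value, (list, tuple, dict)):
        text = json.dumps(value)
    else:
        text = str(value)
    # "bytes" stores the UTF-8 encoding of the same text.
    return text.encode("utf-8") if mode == "bytes" else text


print(normalize_non_primitive([1, 2]))             # [1, 2]
print(normalize_non_primitive({"a": 1}, "bytes"))  # b'{"a": 1}'
```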

Parameters:

Name	Type	Description	Default
`table_name`	`str`	Name of the table to write to.	required
`conn`	`str \| Callable[[], Connection]`	Connection string or factory.	required
`write_mode`	`str`	Mode to write to the table. "append", "overwrite", or "fail". Defaults to "append".	`'append'`
`column_types`	`Optional[Dict[str, Any]]`	Optional mapping from column names to SQLAlchemy types to use when creating the table or casting columns. Passed through to the underlying SQL engine when creating or writing the table.	`None`
`non_primitive_handling`	`Literal['bytes', 'str', 'error'] \| None`	Controls how non-primitive columns are normalized before reaching SQL; default `None` behaves like `"str"`. Accepted values are `"str"`, `"bytes"`, and `"error"`.	`None`

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	A single-row DataFrame containing aggregate write metrics with columns `total_written_rows` and `total_written_bytes`.

Warning

This feature is early in development and will likely experience API changes.

Note

Primitive columns still rely on pandas/SQLAlchemy (or column_types) for concrete SQL types, while non-primitive columns are pre-normalized in Python according to non_primitive_handling before reaching the SQL driver.

Examples:

Write to a SQL table using a database URL and explicit SQLAlchemy dtypes:

>>> from sqlalchemy import DateTime, Integer, String
>>> import datetime
>>> import daft
>>> df = daft.from_pydict(
...     {
...         "id": [1, 2],
...         "name": ["Alice", "Bob"],
...         "created_at": [
...             datetime.datetime(2024, 1, 1, 0, 0, 0),
...             datetime.datetime(2024, 1, 2, 0, 0, 0),
...         ],
...     }
... )
>>> column_types = {
...     "id": Integer(),
...     "name": String(length=255),
...     "created_at": DateTime(timezone=True),
... }
>>> metrics_df = df.write_sql("users", "sqlite:///my_database.db", column_types=column_types)

Write to a SQL table using a SQLAlchemy connection factory and dtypes:

>>> import sqlalchemy
>>> def create_conn():
...     return sqlalchemy.create_engine("sqlite:///my_database.db").connect()
>>> metrics_df = df.write_sql("users", create_conn, column_types=column_types)

Write to a SQL table using a database URL with column_types=None to rely on inferred types:

>>> df = daft.from_pydict({"id": [1], "name": ["Alice"]})
>>> metrics_df = df.write_sql("users", "sqlite:///my_database.db", column_types=None)

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_sql(
    self,
    table_name: str,
    conn: str | Callable[[], "Connection"],
    write_mode: Literal["append", "overwrite", "fail"] = "append",
    column_types: dict[str, Any] | None = None,
    non_primitive_handling: Literal["bytes", "str", "error"] | None = None,
) -> "DataFrame":
    """Write the DataFrame to a SQL database and return write metrics.

    The write is executed via :meth:`daft.DataFrame.write_sink` using an internal
    :class:`daft.io._sql.SQLDataSink`.

    Primitive columns (ints, floats, bools, strings, binary, dates, timestamps) are written by converting to a pandas DataFrame and calling :meth:`pandas.DataFrame.to_sql`, letting SQLAlchemy or ``column_types`` choose concrete SQL types.

    Non-primitive columns (lists, structs, maps, tensors, images, embeddings, python objects, etc.) are normalized according to ``non_primitive_handling`` (default ``None`` behaves like ``"str"``): ``"str"`` serializes values to text (JSON for arrays/maps and other containers, ``str(..)`` otherwise), ``"bytes"`` writes UTF-8 bytes of that text, and ``"error"`` fails if such columns are present.

    Args:
        table_name (str): Name of the table to write to.
        conn (str | Callable[[], "Connection"]): Connection string or factory.
        write_mode (str): Mode to write to the table. "append", "overwrite", or "fail". Defaults to "append".
        column_types (Optional[Dict[str, Any]]): Optional mapping from column names to
            SQLAlchemy types to use when creating the table or casting columns.
            Passed through to the underlying SQL engine when creating or writing
            the table.
        non_primitive_handling (Literal["bytes", "str", "error"] | None):
            Controls how non-primitive columns are normalized before reaching SQL; default ``None`` behaves like ``"str"``. Accepted values are ``"str"``, ``"bytes"``, and ``"error"``.

    Returns:
        DataFrame: A single-row DataFrame containing aggregate write metrics with
            columns ``total_written_rows`` and ``total_written_bytes``.

    Warning:
        This feature is early in development and will likely experience API changes.

    Note:
        Primitive columns still rely on pandas/SQLAlchemy (or ``column_types``) for concrete SQL types, while non-primitive columns are pre-normalized in Python according to ``non_primitive_handling`` before reaching the SQL driver.

    Examples:
        Write to a SQL table using a database URL and explicit SQLAlchemy dtypes:

        >>> from sqlalchemy import DateTime, Integer, String
        >>> import datetime
        >>> import daft
        >>> df = daft.from_pydict(
        ...     {
        ...         "id": [1, 2],
        ...         "name": ["Alice", "Bob"],
        ...         "created_at": [
        ...             datetime.datetime(2024, 1, 1, 0, 0, 0),
        ...             datetime.datetime(2024, 1, 2, 0, 0, 0),
        ...         ],
        ...     }
        ... )
        >>> column_types = {
        ...     "id": Integer(),
        ...     "name": String(length=255),
        ...     "created_at": DateTime(timezone=True),
        ... }
        >>> metrics_df = df.write_sql("users", "sqlite:///my_database.db", column_types=column_types)

        Write to a SQL table using a SQLAlchemy connection factory and dtypes:

        >>> import sqlalchemy
        >>> def create_conn():
        ...     return sqlalchemy.create_engine("sqlite:///my_database.db").connect()
        >>> metrics_df = df.write_sql("users", create_conn, column_types=column_types)

        Write to a SQL table using a database URL with column_types=None to rely on inferred types:

        >>> df = daft.from_pydict({"id": [1], "name": ["Alice"]})
        >>> metrics_df = df.write_sql("users", "sqlite:///my_database.db", column_types=None)
    """
    from daft.io._sql import SQLDataSink

    sink = SQLDataSink(
        table_name=table_name,
        conn=conn,
        write_mode=write_mode,
        column_types=column_types,
        df_schema=self.schema(),
        non_primitive_handling=non_primitive_handling,
    )

    if non_primitive_handling is None:
        # Check for non-primitive types in the schema and warn if found
        non_primitive_cols = [
            field.name
            for field in self.schema()
            if field.dtype.is_python()
            or field.dtype.is_list()
            or field.dtype.is_struct()
            or field.dtype.is_map()
            or field.dtype.is_tensor()
            or field.dtype.is_image()
            or field.dtype.is_embedding()
        ]
        if non_primitive_cols:
            warnings.warn(
                f"Detected non-primitive columns: {non_primitive_cols}. Writing as text (default). Set `non_primitive_handling` to control or suppress.",
                UserWarning,
                stacklevel=2,
            )

    return self.write_sink(sink)

write_turbopuffer #

write_turbopuffer(namespace: str | Expression, api_key: str | None = None, region: str | None = None, distance_metric: Literal['cosine_distance', 'euclidean_squared'] | None = None, schema: dict[str, Any] | None = None, id_column: str | None = None, vector_column: str | None = None, client_kwargs: dict[str, Any] | None = None, write_kwargs: dict[str, Any] | None = None) -> DataFrame

Writes the DataFrame to a Turbopuffer namespace.

This method transforms each row of the dataframe into a turbopuffer document. This means that an id column is always required. Optionally, the id_column parameter can be used to specify the column name to use for the id column. Note that the column with the name specified by id_column will be renamed to "id" when written to turbopuffer.

A vector column is required if the namespace has a vector index. Optionally, the vector_column parameter can be used to specify the column name to use for the vector index. Note that the column with the name specified by vector_column will be renamed to "vector" when written to turbopuffer.

All other columns become attributes.
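
The row-to-document mapping described above can be sketched in plain Python. `row_to_document` is a hypothetical helper used only for illustration; it does not call the turbopuffer client, and it does not handle a name clash between a renamed column and an existing `id` or `vector` column.

```python
from __future__ import annotations

from typing import Any


def row_to_document(
    row: dict[str, Any],
    id_column: str | None = None,
    vector_column: str | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: rename the id and vector columns, keep the rest as attributes."""
    renames: dict[str, str] = {}
    if id_column is not None:
        renames[id_column] = "id"
    if vector_column is not None:
        renames[vector_column] = "vector"
    # Columns without a rename entry pass through unchanged as attributes.
    doc = {renames.get(name, name): value for name, value in row.items()}
    if "id" not in doc:
        raise ValueError("every document needs an 'id' column")
    return doc


row = {"doc_id": 7, "emb": [0.1, 0.2], "title": "hello"}
print(row_to_document(row, id_column="doc_id", vector_column="emb"))
# {'id': 7, 'vector': [0.1, 0.2], 'title': 'hello'}
```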

The namespace parameter can be either a string (for a single namespace) or an expression (for multiple namespaces). When using an expression, the data will be partitioned by the computed namespace values and written to each namespace separately.

For more details on parameters, please see the turbopuffer documentation: https://turbopuffer.com/docs/write

Parameters:

Name	Type	Description	Default
`namespace`	`str \| Expression`	The namespace to write to. Can be a string for a single namespace or an expression for multiple namespaces.	required
`api_key`	`str \| None`	Turbopuffer API key.	`None`
`region`	`str \| None`	Turbopuffer region.	`None`
`distance_metric`	`Literal['cosine_distance', 'euclidean_squared'] \| None`	Distance metric for vector similarity ("cosine_distance", "euclidean_squared").	`None`
`schema`	`dict[str, Any] \| None`	Optional manual schema specification.	`None`
`id_column`	`str \| None`	Optional column name for the id column. The data sink will automatically rename the column to "id" for the id column.	`None`
`vector_column`	`str \| None`	Optional column name for the vector index column. The data sink will automatically rename the column to "vector" for the vector index.	`None`
`client_kwargs`	`dict[str, Any] \| None`	Optional dictionary of arguments to pass to the Turbopuffer client constructor. Explicit arguments (api_key, region) will be merged into client_kwargs.	`None`
`write_kwargs`	`dict[str, Any] \| None`	Optional dictionary of arguments to pass to the namespace.write() method. Explicit arguments (distance_metric, schema) will be merged into write_kwargs.	`None`
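The table says explicit arguments are merged into `client_kwargs` and `write_kwargs`, but it does not say which side wins on a conflict or how `None` values are treated. The sketch below shows one plausible reading, assuming that explicit arguments left as `None` are skipped and that non-`None` explicit arguments overwrite same-named keys. The helper `merge_explicit` and the key `some_option` are hypothetical and are not part of Daft or the Turbopuffer client.

```python
from __future__ import annotations

from typing import Any


def merge_explicit(kwargs: dict[str, Any] | None, **explicit: Any) -> dict[str, Any]:
    """Hypothetical helper: fold explicit arguments into a kwargs dictionary.

    Assumption: explicit arguments that are None are ignored, and the
    remaining ones take precedence over keys already present in kwargs.
    """
    merged = dict(kwargs or {})  # copy so the caller's dictionary is not mutated
    merged.update({key: value for key, value in explicit.items() if value is not None})
    return merged


# `some_option` is a placeholder key, not a documented client argument.
client_kwargs = merge_explicit({"some_option": 30}, api_key="example-key", region=None)
write_kwargs = merge_explicit(None, distance_metric="cosine_distance", schema=None)
```

Check the `TurbopufferDataSink` source before relying on any particular precedence.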

Source code in daft/dataframe/dataframe.py

@DataframePublicAPI
def write_turbopuffer(
    self,
    namespace: str | Expression,
    api_key: str | None = None,
    region: str | None = None,
    distance_metric: Literal["cosine_distance", "euclidean_squared"] | None = None,
    schema: dict[str, Any] | None = None,
    id_column: str | None = None,
    vector_column: str | None = None,
    client_kwargs: dict[str, Any] | None = None,
    write_kwargs: dict[str, Any] | None = None,
) -> "DataFrame":
    """Writes the DataFrame to a Turbopuffer namespace.

    This method transforms each row of the dataframe into a turbopuffer document.
    This means that an `id` column is always required. Optionally, the `id_column` parameter can be used to specify the column name to use for the id column.
    Note that the column with the name specified by `id_column` will be renamed to "id" when written to turbopuffer.

    A `vector` column is required if the namespace has a vector index. Optionally, the `vector_column` parameter can be used to specify the column name to use for the vector index.
    Note that the column with the name specified by `vector_column` will be renamed to "vector" when written to turbopuffer.

    All other columns become attributes.

    The namespace parameter can be either a string (for a single namespace) or an expression (for multiple namespaces).
    When using an expression, the data will be partitioned by the computed namespace values and written to each namespace separately.

    For more details on parameters, please see the turbopuffer documentation: https://turbopuffer.com/docs/write

    Args:
        namespace: The namespace to write to. Can be a string for a single namespace or an expression for multiple namespaces.
        api_key: Turbopuffer API key.
        region: Turbopuffer region.
        distance_metric: Distance metric for vector similarity ("cosine_distance", "euclidean_squared").
        schema: Optional manual schema specification.
        id_column: Optional column name for the id column. The data sink will automatically rename the column to "id" for the id column.
        vector_column: Optional column name for the vector index column. The data sink will automatically rename the column to "vector" for the vector index.
        client_kwargs: Optional dictionary of arguments to pass to the Turbopuffer client constructor.
            Explicit arguments (api_key, region) will be merged into client_kwargs.
        write_kwargs: Optional dictionary of arguments to pass to the namespace.write() method.
            Explicit arguments (distance_metric, schema) will be merged into write_kwargs.
    """
    from daft.io.turbopuffer.turbopuffer_data_sink import TurbopufferDataSink

    sink = TurbopufferDataSink(
        namespace, api_key, region, distance_metric, schema, id_column, vector_column, client_kwargs, write_kwargs
    )
    return self.write_sink(sink)
