Aggregations#

When performing aggregations such as sum, mean and count, Daft enables you to group data by certain keys and aggregate within those keys.

Calling df.groupby() returns a GroupedDataFrame object, which is a view of the original DataFrame with additional context on which keys to group on. You can then call an aggregation method on it to run that aggregation within each group; each method returns a new DataFrame.

Learn more about Aggregations and Grouping in the Daft User Guide.
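As a rough mental model of the group-then-aggregate flow described above, here is a minimal pure-Python sketch. It does not use Daft and is not how Daft executes a query; the `rows` data and the variable names are assumptions made up for illustration.

```python
from collections import defaultdict

# Rows of a small table; "pet" plays the role of the grouping key.
rows = [
    {"pet": "cat", "age": 1},
    {"pet": "dog", "age": 2},
    {"pet": "dog", "age": 3},
    {"pet": "cat", "age": 4},
]

# Step 1: bucket the values by key (conceptually, the grouped view).
groups = defaultdict(list)
for row in rows:
    groups[row["pet"]].append(row["age"])

# Step 2: run an aggregation inside each bucket, producing one value per key.
result = {key: sum(ages) / len(ages) for key, ages in sorted(groups.items())}
print(result)  # {'cat': 2.5, 'dog': 2.5}
```

The point of the sketch is only that the grouping step and the aggregation step are separate: the grouped object carries the keys, and the aggregation method decides what is computed per key.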

GroupedDataFrame #

GroupedDataFrame(
    df: DataFrame, group_by: ExpressionsProjection
)

Methods:

| Name | Description |
| --- | --- |
| agg | Performs aggregations on this GroupedDataFrame. Allows for mixed aggregations. |
| agg_concat | Performs grouped concat on this GroupedDataFrame. |
| agg_list | Performs grouped list on this GroupedDataFrame. |
| agg_set | Performs grouped set on this GroupedDataFrame (ignoring nulls). |
| any_value | Returns an arbitrary value on this GroupedDataFrame. |
| count | Performs grouped count on this GroupedDataFrame. |
| map_groups | Applies a user-defined function to each group. By default, the resulting column is named after the first input column. |
| max | Performs grouped max on this GroupedDataFrame. |
| mean | Performs grouped mean on this GroupedDataFrame. |
| min | Performs grouped min on this GroupedDataFrame. |
| skew | Performs grouped skew on this GroupedDataFrame. |
| stddev | Performs grouped standard deviation on this GroupedDataFrame. |
| sum | Performs grouped sum on this GroupedDataFrame. |

Attributes:

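| Name | Type | Description |
| --- | --- | --- |
| df | DataFrame | The original DataFrame being grouped. |
| group_by | ExpressionsProjection | The keys to group on. |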

df #

group_by #

group_by: ExpressionsProjection

agg #

agg(
    *to_agg: Union[Expression, Iterable[Expression]],
) -> DataFrame

Performs aggregations on this GroupedDataFrame. Allows for mixed aggregations, so several different aggregation expressions can be combined in one call.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *to_agg | Union[Expression, Iterable[Expression]] | aggregation expressions | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped aggregations |

Examples:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict(
...     {
...         "pet": ["cat", "dog", "dog", "cat"],
...         "age": [1, 2, 3, 4],
...         "name": ["Alex", "Jordan", "Sam", "Riley"],
...     }
... )
>>> grouped_df = df.groupby("pet").agg(
...     col("age").min().alias("min_age"),
...     col("age").max().alias("max_age"),
...     col("pet").count().alias("count"),
...     col("name").any_value(),
... )
>>> grouped_df = grouped_df.sort("pet")
>>> grouped_df.show()
╭──────┬─────────┬─────────┬────────┬────────╮
│ pet  ┆ min_age ┆ max_age ┆ count  ┆ name   │
│ ---  ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
│ Utf8 ┆ Int64   ┆ Int64   ┆ UInt64 ┆ Utf8   │
╞══════╪═════════╪═════════╪════════╪════════╡
│ cat  ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ dog  ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
╰──────┴─────────┴─────────┴────────┴────────╯

(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
def agg(self, *to_agg: Union[Expression, Iterable[Expression]]) -> DataFrame:
    """Perform aggregations on this GroupedDataFrame. Allows for mixed aggregations.

    Args:
        *to_agg (Union[Expression, Iterable[Expression]]): aggregation expressions

    Returns:
        DataFrame: DataFrame with grouped aggregations

    Examples:
        >>> import daft
        >>> from daft import col
        >>> df = daft.from_pydict(
        ...     {
        ...         "pet": ["cat", "dog", "dog", "cat"],
        ...         "age": [1, 2, 3, 4],
        ...         "name": ["Alex", "Jordan", "Sam", "Riley"],
        ...     }
        ... )
        >>> grouped_df = df.groupby("pet").agg(
        ...     col("age").min().alias("min_age"),
        ...     col("age").max().alias("max_age"),
        ...     col("pet").count().alias("count"),
        ...     col("name").any_value(),
        ... )
        >>> grouped_df = grouped_df.sort("pet")
        >>> grouped_df.show()
        ╭──────┬─────────┬─────────┬────────┬────────╮
        │ pet  ┆ min_age ┆ max_age ┆ count  ┆ name   │
        │ ---  ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
        │ Utf8 ┆ Int64   ┆ Int64   ┆ UInt64 ┆ Utf8   │
        ╞══════╪═════════╪═════════╪════════╪════════╡
        │ cat  ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
        ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
        │ dog  ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
        ╰──────┴─────────┴─────────┴────────┴────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    to_agg_list = (
        list(to_agg[0])
        if (len(to_agg) == 1 and not isinstance(to_agg[0], Expression))
        else list(typing.cast("tuple[Expression]", to_agg))
    )

    for expr in to_agg_list:
        if not isinstance(expr, Expression):
            raise ValueError(f"GroupedDataFrame.agg() only accepts expression type, received: {type(expr)}")

    return self.df._agg(to_agg_list, group_by=self.group_by)
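The argument handling in the listing above accepts either several expressions as separate arguments or a single iterable of expressions, and rejects anything that is not an expression. The following stand-alone sketch mirrors that logic; `Expr` and `normalize` are hypothetical names used only here, with `Expr` standing in for Daft's Expression class.

```python
from typing import Iterable, List, Union


class Expr:
    """Hypothetical stand-in for an expression type, used only in this sketch."""

    def __init__(self, name: str) -> None:
        self.name = name


def normalize(*to_agg: Union[Expr, Iterable[Expr]]) -> List[Expr]:
    # A single non-expression argument is treated as an iterable of expressions.
    if len(to_agg) == 1 and not isinstance(to_agg[0], Expr):
        items = list(to_agg[0])
    else:
        items = list(to_agg)
    # Every element must be an expression; anything else fails loudly.
    for item in items:
        if not isinstance(item, Expr):
            raise ValueError(f"only accepts expression type, received: {type(item)}")
    return items


a, b = Expr("min_age"), Expr("max_age")
varargs_names = [e.name for e in normalize(a, b)]   # separate arguments
list_names = [e.name for e in normalize([a, b])]    # one list argument
```

Both call styles end up as the same flat list, which is why `agg(expr1, expr2)` and `agg([expr1, expr2])` are interchangeable in the listing.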

agg_concat #

agg_concat(*cols: ColumnInputType) -> DataFrame

Performs grouped concat on this GroupedDataFrame.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped concatenated list per column. |

Source code in daft/dataframe/dataframe.py
def agg_concat(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped concat on this GroupedDataFrame.

    Returns:
        DataFrame: DataFrame with grouped concatenated list per column.
    """
    return self.df._apply_agg_fn(Expression.agg_concat, cols, self.group_by)
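To make "grouped concatenated list" concrete, here is a pure-Python sketch of the idea. It assumes the input column already holds a list in every row and does not model nulls or Daft's actual element ordering; the data is invented for illustration.

```python
from collections import defaultdict

keys = ["a", "a", "b"]
values = [[1, 2], [3], [4, 5]]  # each row already holds a list

concatenated = defaultdict(list)
for key, row_list in zip(keys, values):
    # Join the row-level lists into one flat list per group.
    concatenated[key].extend(row_list)

concat_result = dict(concatenated)
print(concat_result)  # {'a': [1, 2, 3], 'b': [4, 5]}
```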

agg_list #

agg_list(*cols: ColumnInputType) -> DataFrame

Performs grouped list on this GroupedDataFrame.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped list per column. |

Source code in daft/dataframe/dataframe.py
def agg_list(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped list on this GroupedDataFrame.

    Returns:
        DataFrame: DataFrame with grouped list per column.
    """
    return self.df._apply_agg_fn(Expression.agg_list, cols, self.group_by)
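A grouped list collects each group's values into one list per key. The sketch below shows that idea in plain Python under the assumption that duplicates are kept and values appear in input order; the page does not state the ordering Daft produces, and the data is invented.

```python
from collections import defaultdict

keys = ["a", "a", "b"]
values = [1, 1, 2]

lists = defaultdict(list)
for key, value in zip(keys, values):
    lists[key].append(value)  # duplicates are kept, unlike a set

list_result = dict(lists)
print(list_result)  # {'a': [1, 1], 'b': [2]}
```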

agg_set #

agg_set(*cols: ColumnInputType) -> DataFrame

Performs grouped set on this GroupedDataFrame (ignoring nulls).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to form into a set | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped set per column. |

Source code in daft/dataframe/dataframe.py
def agg_set(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped set on this GroupedDataFrame (ignoring nulls).

    Args:
        *cols (Union[str, Expression]): columns to form into a set

    Returns:
        DataFrame: DataFrame with grouped set per column.
    """
    return self.df._apply_agg_fn(Expression.agg_set, cols, self.group_by)
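The "ignoring nulls" behaviour can be pictured with a short pure-Python sketch: distinct non-null values are kept once per group. The first-seen ordering used here is an assumption of the sketch, not something this page guarantees, and the data is invented.

```python
from collections import defaultdict

keys = ["a", "a", "a", "b", "b"]
values = [1, 1, None, 2, None]

# A dict per group acts as an insertion-ordered set of distinct values.
seen = defaultdict(dict)
for key, value in zip(keys, values):
    members = seen[key]  # touch the group so it appears even if all values are null
    if value is not None:
        members[value] = None

set_result = {key: list(members) for key, members in seen.items()}
print(set_result)  # {'a': [1], 'b': [2]}
```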

any_value #

any_value(*cols: ColumnInputType) -> DataFrame

Returns an arbitrary value on this GroupedDataFrame.

Values for each column are not guaranteed to be from the same row.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to get | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with any values. |

Source code in daft/dataframe/dataframe.py
def any_value(self, *cols: ColumnInputType) -> DataFrame:
    """Returns an arbitrary value on this GroupedDataFrame.

    Values for each column are not guaranteed to be from the same row.

    Args:
        *cols (Union[str, Expression]): columns to get

    Returns:
        DataFrame: DataFrame with any values.
    """
    return self.df._apply_agg_fn(Expression.any_value, cols, self.group_by)
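The caveat that values are not guaranteed to come from the same row is easy to overlook. The sketch below illustrates it in plain Python: each column's value is chosen independently, so the combined result may not match any input row. The specific choices (first age, last name) are arbitrary assumptions for illustration, not Daft's selection rule.

```python
rows = [
    {"pet": "cat", "age": 1, "name": "Alex"},
    {"pet": "cat", "age": 4, "name": "Riley"},
]

ages = [r["age"] for r in rows]
names = [r["name"] for r in rows]

# Each column is free to pick its value independently of the others,
# for example the first age and the last name.
picked = {"pet": "cat", "age": ages[0], "name": names[-1]}

# The combination need not correspond to any single input row.
matches_a_row = picked in rows
print(matches_a_row)  # False
```

If values from the same row are needed, any_value on several columns is therefore not a safe way to obtain them.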

count #

count(*cols: ColumnInputType) -> DataFrame

Performs grouped count on this GroupedDataFrame.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped count per column. |

Source code in daft/dataframe/dataframe.py
def count(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped count on this GroupedDataFrame.

    Returns:
        DataFrame: DataFrame with grouped count per column.
    """
    return self.df._apply_agg_fn(Expression.count, cols, self.group_by)
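A count "per column" can differ from a row count when a column contains nulls. The plain-Python sketch below contrasts the two. Whether Daft's grouped count skips nulls by default is an assumption here, since this page does not say; the data is invented.

```python
from collections import Counter

keys = ["cat", "dog", "dog", "cat"]
ages = [1, None, 3, 4]

# Count only rows where the column value is present (assumed default behaviour).
non_null_counts = Counter(k for k, v in zip(keys, ages) if v is not None)

# Count every row in the group, regardless of nulls.
row_counts = Counter(keys)

print(dict(non_null_counts))  # {'cat': 2, 'dog': 1}
print(dict(row_counts))       # {'cat': 2, 'dog': 2}
```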

map_groups #

map_groups(udf: Expression) -> DataFrame

Applies a user-defined function to each group. By default, the resulting column is named after the first input column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| udf | Expression | User-defined function to apply to each group. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped aggregations |

Examples:

>>> import daft, statistics
>>>
>>> df = daft.from_pydict({"group": ["a", "a", "a", "b", "b", "b"], "data": [1, 20, 30, 4, 50, 600]})
>>>
>>> @daft.udf(return_dtype=daft.DataType.float64())
... def std_dev(data):
...     return [statistics.stdev(data)]
>>>
>>> df = df.groupby("group").map_groups(std_dev(df["data"]))
>>> df = df.sort("group")
>>> df.show()
╭───────┬────────────────────╮
│ group ┆ data               │
│ ---   ┆ ---                │
│ Utf8  ┆ Float64            │
╞═══════╪════════════════════╡
│ a     ┆ 14.730919862656235 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ 331.62026476076517 │
╰───────┴────────────────────╯

(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
def map_groups(self, udf: Expression) -> DataFrame:
    """Apply a user-defined function to each group. The name of the resultant column will default to the name of the first input column.

    Args:
        udf (Expression): User-defined function to apply to each group.

    Returns:
        DataFrame: DataFrame with grouped aggregations

    Examples:
        >>> import daft, statistics
        >>>
        >>> df = daft.from_pydict({"group": ["a", "a", "a", "b", "b", "b"], "data": [1, 20, 30, 4, 50, 600]})
        >>>
        >>> @daft.udf(return_dtype=daft.DataType.float64())
        ... def std_dev(data):
        ...     return [statistics.stdev(data)]
        >>>
        >>> df = df.groupby("group").map_groups(std_dev(df["data"]))
        >>> df = df.sort("group")
        >>> df.show()
        ╭───────┬────────────────────╮
        │ group ┆ data               │
        │ ---   ┆ ---                │
        │ Utf8  ┆ Float64            │
        ╞═══════╪════════════════════╡
        │ a     ┆ 14.730919862656235 │
        ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ b     ┆ 331.62026476076517 │
        ╰───────┴────────────────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    return self.df._map_groups(udf, group_by=self.group_by)
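The numbers in the example above can be reproduced without Daft, which helps show what the function receives: all of a group's values at once, returning a one-element list per group. This is a pure-Python sketch of that contract, not Daft's execution path; the bucketing code is an assumption for illustration.

```python
import statistics
from collections import defaultdict

group_keys = ["a", "a", "a", "b", "b", "b"]
data = [1, 20, 30, 4, 50, 600]

# Bucket the data column by group key.
buckets = defaultdict(list)
for key, value in zip(group_keys, data):
    buckets[key].append(value)


def std_dev(values):
    # Called once per group with that group's values; returns one result.
    return [statistics.stdev(values)]


per_group = {key: std_dev(vals)[0] for key, vals in sorted(buckets.items())}
print(per_group)
```

Note that statistics.stdev is the sample standard deviation (dividing by n - 1), which is what produces the values shown in the example table.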

max #

max(*cols: ColumnInputType) -> DataFrame

Performs grouped max on this GroupedDataFrame.

Parameters:

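| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to compute the maximum of | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped max. |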

Source code in daft/dataframe/dataframe.py
def max(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped max on this GroupedDataFrame.

    Args:
        *cols (Union[str, Expression]): columns to max

    Returns:
        DataFrame: DataFrame with grouped max.
    """
    return self.df._apply_agg_fn(Expression.max, cols, self.group_by)

mean #

mean(*cols: ColumnInputType) -> DataFrame

Performs grouped mean on this GroupedDataFrame.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to compute the mean of | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped mean. |

Source code in daft/dataframe/dataframe.py
def mean(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped mean on this GroupedDataFrame.

    Args:
        *cols (Union[str, Expression]): columns to mean

    Returns:
        DataFrame: DataFrame with grouped mean.
    """
    return self.df._apply_agg_fn(Expression.mean, cols, self.group_by)

min #

min(*cols: ColumnInputType) -> DataFrame

Performs grouped min on this GroupedDataFrame.

Parameters:

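| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to compute the minimum of | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped min. |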

Source code in daft/dataframe/dataframe.py
def min(self, *cols: ColumnInputType) -> DataFrame:
    """Perform grouped min on this GroupedDataFrame.

    Args:
        *cols (Union[str, Expression]): columns to min

    Returns:
        DataFrame: DataFrame with grouped min.
    """
    return self.df._apply_agg_fn(Expression.min, cols, self.group_by)

skew #

skew(*cols: ColumnInputType) -> DataFrame

Performs grouped skew on this GroupedDataFrame.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with the grouped skew per column. |

Source code in daft/dataframe/dataframe.py
def skew(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped skew on this GroupedDataFrame.

    Returns:
        DataFrame: DataFrame with the grouped skew per column.
    """
    return self.df._apply_agg_fn(Expression.skew, cols, self.group_by)

stddev #

stddev(*cols: ColumnInputType) -> DataFrame

Performs grouped standard deviation on this GroupedDataFrame.

Parameters:

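| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to compute the standard deviation of | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped standard deviation. |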

Examples:

>>> import daft
>>> df = daft.from_pydict({"keys": ["a", "a", "a", "b"], "col_a": [0, 1, 2, 100]})
>>> df = df.groupby("keys").stddev()
>>> df = df.sort("keys")
>>> df.show()
╭──────┬───────────────────╮
│ keys ┆ col_a             │
│ ---  ┆ ---               │
│ Utf8 ┆ Float64           │
╞══════╪═══════════════════╡
│ a    ┆ 0.816496580927726 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ 0                 │
╰──────┴───────────────────╯

(Showing first 2 of 2 rows)
Source code in daft/dataframe/dataframe.py
def stddev(self, *cols: ColumnInputType) -> DataFrame:
    """Performs grouped standard deviation on this GroupedDataFrame.

    Args:
        *cols (Union[str, Expression]): columns to stddev

    Returns:
        DataFrame: DataFrame with grouped standard deviation.

    Examples:
        >>> import daft
        >>> df = daft.from_pydict({"keys": ["a", "a", "a", "b"], "col_a": [0, 1, 2, 100]})
        >>> df = df.groupby("keys").stddev()
        >>> df = df.sort("keys")
        >>> df.show()
        ╭──────┬───────────────────╮
        │ keys ┆ col_a             │
        │ ---  ┆ ---               │
        │ Utf8 ┆ Float64           │
        ╞══════╪═══════════════════╡
        │ a    ┆ 0.816496580927726 │
        ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
        │ b    ┆ 0                 │
        ╰──────┴───────────────────╯
        <BLANKLINE>
        (Showing first 2 of 2 rows)

    """
    return self.df._apply_agg_fn(Expression.stddev, cols, self.group_by)
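The example output above (0.816496580927726 for the values 0, 1, 2, and 0 for a single-value group) is consistent with a population standard deviation, which divides by n rather than n - 1. The following check uses only Python's statistics module to compare the two definitions on the same data; it is a sanity check on the numbers, not a statement about Daft's internals.

```python
import statistics

group_a = [0, 1, 2]
group_b = [100]

# Population standard deviation: divide by n.
population_a = statistics.pstdev(group_a)
population_b = statistics.pstdev(group_b)

# Sample standard deviation: divide by n - 1 (undefined for one value).
sample_a = statistics.stdev(group_a)

print(population_a, population_b, sample_a)
```

Only the population value for group a matches the table in the example, and the sample definition would not be defined for the single-value group b at all.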

sum #

sum(*cols: ColumnInputType) -> DataFrame

Performs grouped sum on this GroupedDataFrame.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *cols | Union[str, Expression] | columns to sum | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DataFrame | DataFrame | DataFrame with grouped sums. |

Source code in daft/dataframe/dataframe.py
def sum(self, *cols: ColumnInputType) -> DataFrame:
    """Perform grouped sum on this GroupedDataFrame.

    Args:
        *cols (Union[str, Expression]): columns to sum

    Returns:
        DataFrame: DataFrame with grouped sums.
    """
    return self.df._apply_agg_fn(Expression.sum, cols, self.group_by)