如何按列集分组,也按列分组,并按其项目组合和值列上的聚合进行分组?

How to groupby set of columns also grouping by a column with combinations of its items and aggregate on the value column?

提问人:Rac 提问时间:9/9/2023 最后编辑:ReinderienRac 更新时间:9/11/2023 访问量:92

问:

这是数据帧Input df

df = pd.DataFrame({'county':['Laramie']*10 + ['Albany']*12,
                   'co': ['LU']*22,
                    'tech':['cable']*6+['copper']*4+['cable']*6+['copper']*4+['Fiber']*2,
                    'loc':[*'abcdefdefgmnopqrnostow']})

我想对 county、co 以及 tech 列中项目的所有组合进行分组,并在 loc 列上聚合以获得唯一和 nunique。

这是我正在寻找的结果:Output df

试过这个:

df = df.groupby(['county', 'co'], as_index=True).agg({'tech':'unique', 'loc':'unique', 'loc':'nunique'}).reset_index()

但这并没有给出技术专栏的所有可能组合。

Pandas Numpy 数学 分组组合

评论

0赞 Reinderien 9/9/2023
它应该是 log 还是 loc?这并不一致。
0赞 Reinderien 9/9/2023
为什么单个项目没有产值?例如,为什么不允许使用“仅电缆”?tech

答:

2赞 mozway 9/10/2023 #1

这看起来像一个部分电源集

from itertools import chain, combinations

# modified powerset recipe
def powerset(iterable, min=0):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(min, len(s)+1))

group = ['county', 'co']

tmp = df.groupby(['tech', *group])['loc'].agg(set)

out = []
for k, g in tmp.groupby(group):
    for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
        out.append(dict(zip(group, k))
                  |{'tech': ', '.join(x),
                    'log': (s:=set().union(*g.loc[x])),
                    'log-nunique': len(s),
                   })

out = pd.DataFrame(out)

铌。这使用 Walrus 运算符 (:=, Python ≥3.8) 和字典联合 ({}|{}, Python ≥3.9),这些都不是严格必需的,代码可以适用于较旧的 Python 版本*。

输出:

    county  co                  tech                          log  log-nunique
0   Albany  LU          Fiber, cable        {o, n, m, r, q, w, p}            7
1   Albany  LU         Fiber, copper              {o, n, s, t, w}            5
2   Albany  LU         cable, copper     {o, n, m, s, r, q, t, p}            8
3   Albany  LU  Fiber, cable, copper  {o, n, m, s, r, q, t, w, p}            9
4  Laramie  LU         cable, copper        {b, a, c, e, f, g, d}            7

* 旧版 Python 的代码:

group = ['county', 'co']

tmp = df.groupby(['tech', 'county', 'co'])['loc'].agg(set)

out = []
for k, g in tmp.groupby(['county', 'co']):
    for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
        out.append({'county': k[0],
                    'co': k[1],
                    'tech': ', '.join(x),
                    'log': set().union(*g.loc[x]),
                   })
        out[-1]['log-nunique'] = len(out[-1]['log'])

out = pd.DataFrame(out)

评论

0赞 Scott Boston 9/10/2023
很棒的解决方案,@mozway。+1
0赞 Rac 9/10/2023
此代码在大型数据集上工作是否有效?
0赞 mozway 9/10/2023
@Rac在生成数据方面,我相信它与使用 python 一样高效。但是,请记住,生成组合可能会提供比输入大得多的输出。特别是如果你在一个团队中有很多“技术”,无论算法或编程语言如何。如果组的数量和每个组的唯一“技术”的最大数量怎么办?
0赞 Reinderien 9/11/2023
我正在对 100,000 帧进行初始性能测试,这种方法比矢量化方法慢大约一个数量级。一旦我有了它,我会发布更多细节。
0赞 Rac 9/15/2023
该解决方案非常有效。谢谢@mozway。
0赞 Reinderien 9/10/2023 #2

这可以通过近乎完全矢量化的方式实现。当您想要在单元格中列出列表时,矢量化会崩溃,这在 Pandas 中通常是一个坏主意;不过:

import cProfile
import pstats
from timeit import timeit
import seaborn as sns

import numpy as np
import pandas as pd
import itertools

from matplotlib import pyplot as plt


def op_data() -> pd.DataFrame:
    """Sample data more or less the same as OP"""
    return pd.DataFrame({
        'county': ['Laramie']*10 + ['Albany']*12,
        'co': ['LU']*22,
        'tech': ['cable']*6 + ['copper']*4 + ['cable']*6 + ['copper']*4 + ['fibre']*2,
        'loc': list('abcdefdefgmnopqrnostow'),
    })


def big_data(
    rand: np.random.Generator,
    n: int = 2_000,
    group_frac: int = 0.1,
    n_locs: int = 26,
    tech_dist: float = 0.33,
) -> pd.DataFrame:
    techs = np.full(shape=n, fill_value='copper')
    tech_asn = rand.random(size=n)
    techs[tech_asn > tech_dist] = 'cable'
    techs[tech_asn > tech_dist*0.5 + 0.5] = 'fibre'

    group_size = group_frac*n

    return pd.DataFrame({
        'county': np.linspace(start=0, stop=n/10, num=n, dtype=int),
        'co': np.linspace(start=0, stop=n/group_size, num=n, dtype=int),
        'tech': techs,
        'loc': rand.integers(low=0, high=n_locs, size=n),
    })


def notna_tuple(col: pd.Series) -> tuple:
    return tuple(col.dropna())


def agg_tech(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot boolean columns for tech assignments
    tech_vals = df['tech'].unique()
    df[tech_vals] = pd.DataFrame(
        data=np.equal.outer(df['tech'].values, tech_vals),
        columns=tech_vals, index=df.index,
    )

    # Cartesian product of all tech values
    tech_combos = pd.MultiIndex.from_product((
        pd.Index(name=tech, data=(False, True))
        for tech in tech_vals
    )).to_frame()

    # each group must have at least two tech values in the combination
    tech_combos = tech_combos[tech_combos[tech_vals].sum(axis=1) > 1]

    # Make tech name tuples
    names = pd.DataFrame(index=tech_combos.index)
    names[tech_vals] = tech_vals
    names[~tech_combos[tech_vals].values] = np.nan
    tech_combos['name'] = names.agg(notna_tuple, axis=1)

    # Cartesian product of all tech values and all county-co groups,
    # where each group must have at least two tech values in the combination
    tech_combos = pd.merge(
        # For each county-co group, for each tech: is there at least one of that tech?
        left=df.groupby(['county', 'co'])[tech_vals]
        .any().reset_index(),
        right=tech_combos,
        how='cross', suffixes=['_any', '']
    )

    # Eliminate county-co/tech pairs where the tech combo uses a missing tech
    tech_combos = tech_combos.loc[
        (
            tech_combos[tech_vals] <= tech_combos[tech_vals + '_any'].values
        ).all(axis=1),
        ['county', 'co', 'name', *tech_vals],
    ]

    # loc values for every tech combination
    tech_combos = pd.merge(
        left=tech_combos, right=df,
        on=['county', 'co'], suffixes=['_group', ''],
    )

    tech_combos = (
        tech_combos.loc[
            (
                # eliminate rows for which the tech combination does not match
                tech_combos[tech_vals] <= tech_combos[tech_vals + '_group'].values
            ).all(axis=1),
            ['county', 'co', 'name', 'loc'],
        ]
        .drop_duplicates()
        .groupby(['county', 'co', 'name'])
        .agg([tuple, 'count'])
        .reset_index()
    )

    return tech_combos


def powerset(iterable, min=0):
    """modified powerset recipe"""
    s = list(iterable)
    return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(min, len(s)+1))


def mozway(df: pd.DataFrame) -> pd.DataFrame:
    group = ['county', 'co']

    tmp = df.groupby(['tech', *group])['loc'].agg(set)

    out = []
    for k, g in tmp.groupby(group):
        for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
            out.append(dict(zip(group, k))
                      |{'tech': ', '.join(x),
                        'log': (s:=set().union(*g.loc[x])),
                        'log-nunique': len(s),
                       })

    return pd.DataFrame(out)


def test() -> None:
    print(agg_tech(op_data()))
    print(mozway(op_data()))


def plot() -> None:
    rand = np.random.default_rng(seed=0)
    methods = (agg_tech, mozway)
    fig, axes = plt.subplots(nrows=2, ncols=2)

    print('By n...')
    by_n = []
    for n in (10**(np.linspace(1.5, 4, num=20))).astype(int):
        inp_orig = big_data(rand, n=n)
        for method in methods:
            inp = inp_orig.copy()
            def run():
                return method(inp)
            dur = timeit(stmt=run, number=1)
            by_n.append((n, method.__name__, dur))

    sns.lineplot(pd.DataFrame(by_n, columns=('n', 'method', 'dur')),
                 x='n', y='dur', hue='method', ax=axes[0][0])
    axes[0][0].set(xscale='log', yscale='log')

    print('By group_frac...')
    by_frac = []
    for frac in np.linspace(0, 0.5, num=20):
        inp_orig = big_data(rand, group_frac=frac)
        for method in methods:
            inp = inp_orig.copy()
            def run():
                return method(inp)
            dur = timeit(stmt=run, number=1)
            by_frac.append((frac, method.__name__, dur))

    sns.lineplot(pd.DataFrame(by_frac, columns=('group_frac', 'method', 'dur')),
                 x='group_frac', y='dur', hue='method', ax=axes[0][1])

    print('By n_locs...')
    by_locs = []
    for n_locs in np.linspace(1, 200, num=20, dtype=int):
        inp_orig = big_data(rand, n_locs=n_locs)
        for method in methods:
            inp = inp_orig.copy()
            def run():
                return method(inp)
            dur = timeit(stmt=run, number=1)
            by_locs.append((n_locs, method.__name__, dur))

    sns.lineplot(pd.DataFrame(by_locs, columns=('n_locs', 'method', 'dur')),
                 x='n_locs', y='dur', hue='method', ax=axes[1][0])

    print('By tech dist...')
    by_tech = []
    for tech_dist in np.linspace(0.05, 0.95, num=20):
        inp_orig = big_data(rand, tech_dist=tech_dist)
        for method in methods:
            inp = inp_orig.copy()
            def run():
                return method(inp)
            dur = timeit(stmt=run, number=1)
            by_tech.append((tech_dist, method.__name__, dur))

    sns.lineplot(pd.DataFrame(by_tech, columns=('tech_dist', 'method', 'dur')),
                 x='tech_dist', y='dur', hue='method', ax=axes[1][1])
    axes[1][1].set(yscale='log')

    plt.show()


def profile():
    rand = np.random.default_rng(seed=0)
    df = big_data(rand, n=100_000)

    pr = cProfile.Profile()
    pr.enable()
    agg_tech(df)
    pr.disable()
    ps = pstats.Stats(pr).sort_stats('cumulative')
    ps.print_stats(10)


if __name__ == '__main__':
    test()
    profile()
    plot()
    county  co                    name                          loc      
                                                              tuple count
0   Albany  LU         (cable, copper)     (m, n, o, p, q, r, s, t)     8
1   Albany  LU  (cable, copper, fibre)  (m, n, o, p, q, r, s, t, w)     9
2   Albany  LU          (cable, fibre)        (m, n, o, p, q, r, w)     7
3   Albany  LU         (copper, fibre)              (n, o, s, t, w)     5
4  Laramie  LU         (cable, copper)        (a, b, c, d, e, f, g)     7

对于任何大于约 100 个元素的输入,这总是优于非矢量化版本;多少主要取决于技术价值的分布:

perf comparison

请注意,到目前为止,最长的操作是 。如果您可以容忍不同的输出,例如agg(tuple)

  • 每个 一个条目的长格式表,或者loc
  • 只有唯一计数的简短表,或者
  • 列中带有 -or-NaN 的宽格式表,loc

这将快得多。例如,对于最后一个选项,您可以使用以下例程:

def agg_tech(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot boolean columns for tech assignments
    tech_vals = df['tech'].unique()
    df[tech_vals] = pd.DataFrame(
        data=np.equal.outer(df['tech'].values, tech_vals),
        columns=tech_vals, index=df.index,
    )

    # Cartesian product of all tech values
    tech_combos = pd.MultiIndex.from_product((
        pd.Index(name=tech, data=(False, True))
        for tech in tech_vals
    )).to_frame()

    # each group must have at least two tech values in the combination
    tech_combos = tech_combos[tech_combos[tech_vals].sum(axis=1) > 1]

    # Make tech name tuples
    names = pd.DataFrame(index=tech_combos.index)
    names[tech_vals] = tech_vals
    names[~tech_combos[tech_vals].values] = np.nan
    tech_combos['name'] = names.agg(notna_tuple, axis=1)

    grouped = df.groupby(['county', 'co'])

    df['group_idx'] = grouped.cumcount()

    # Cartesian product of all tech values and all county-co groups,
    # where each group must have at least two tech values in the combination
    tech_combos = pd.merge(
        # For each county-co group, for each tech: is there at least one of that tech?
        left=grouped[tech_vals]
        .any().reset_index(),
        right=tech_combos,
        how='cross', suffixes=['_any', '']
    )

    # Eliminate county-co/tech pairs where the tech combo uses a missing tech
    tech_combos = tech_combos.loc[
        (
            tech_combos[tech_vals] <= tech_combos[tech_vals + '_any'].values
        ).all(axis=1),
        ['county', 'co', 'name', *tech_vals],
    ]

    # loc values for every tech combination
    tech_combos = pd.merge(
        left=tech_combos, right=df,
        on=['county', 'co'], suffixes=['_group', ''],
    )

    tech_combos = (
        tech_combos.loc[
            (
                # eliminate rows for which the tech combination does not match
                tech_combos[tech_vals] <= tech_combos[tech_vals + '_group'].values
            ).all(axis=1),
            ['county', 'co', 'name', 'group_idx', 'loc'],
        ]
        .set_index(['county', 'co', 'name', 'group_idx'])
        .unstack('group_idx')
    )

    return tech_combos

速度提升约 10 倍。

评论

0赞 mozway 9/10/2023
使用交叉合并进行筛选通常是低效的,尤其是在您使用了广播之后。您是否在大型数据集上测试过它?
0赞 Reinderien 9/10/2023
取决于 OP 未指定的许多内容,包括数据大小、每组技术价值的比例等。如果我尝试一个大型数据集,我会盲目飞行,如果没有更多信息,那将是浪费时间。
0赞 Rac 9/10/2023
这是一个非常大的数据集,由大约 1 亿 + 行组成,不确定每组技术值的比例。
0赞 Reinderien 9/11/2023
@Rac我没有耐心测试 100M 行,所以我演示了多达 10,000 行。
0赞 Reinderien 9/11/2023
@mozway 所以我现在已经试过了;并且出于几个原因 - Python 中非矢量化操作的低效率,以及交叉连接后过滤器的相对较低的拒绝率,这并不是低效的。