Asked by: Rac · Asked: 9/9/2023 · Last edited by: Reinderien · Updated: 9/11/2023 · Views: 92
How to group by a set of columns, while also grouping by the combinations of another column's items, and aggregate on the value column?
Q:
Here is the dataframe:
import pandas as pd

df = pd.DataFrame({'county': ['Laramie']*10 + ['Albany']*12,
                   'co': ['LU']*22,
                   'tech': ['cable']*6 + ['copper']*4 + ['cable']*6 + ['copper']*4 + ['Fiber']*2,
                   'loc': [*'abcdefdefgmnopqrnostow']})
I want to group by county and co, together with all combinations of the items in the tech column, and aggregate on the loc column to get unique and nunique.
This is the result I'm looking for:
I tried this:
df = df.groupby(['county', 'co'], as_index=True).agg({'tech':'unique', 'loc':'unique', 'loc':'nunique'}).reset_index()
But this doesn't give all the possible combinations of the tech column.
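Two things work against this attempt. A Python dict literal keeps only the last value for a duplicated key, so the 'unique' aggregation for loc is silently discarded; and grouping by the fixed keys county and co can only ever emit one row per group, never one row per tech combination. A minimal illustration of the first point (an editor's sketch, not in the original post):

aggs = {'tech': 'unique', 'loc': 'unique', 'loc': 'nunique'}
print(aggs)  # {'tech': 'unique', 'loc': 'nunique'} -- the 'unique' entry is lost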
A:
2 votes
mozway
9/10/2023
#1
This looks like a partial powerset:
from itertools import chain, combinations

# modified powerset recipe
def powerset(iterable, min=0):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(min, len(s)+1))
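# For illustration (editor's sketch, not part of the original answer): with
# min=2 the recipe skips the empty set and the singletons, e.g.
#   list(powerset(['cable', 'copper', 'Fiber'], min=2))
#   -> [('cable', 'copper'), ('cable', 'Fiber'), ('copper', 'Fiber'),
#       ('cable', 'copper', 'Fiber')]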
group = ['county', 'co']
tmp = df.groupby(['tech', *group])['loc'].agg(set)

out = []
for k, g in tmp.groupby(group):
    for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
        out.append(dict(zip(group, k))
                   | {'tech': ', '.join(x),
                      'log': (s := set().union(*g.loc[x])),
                      'log-nunique': len(s),
                      })
out = pd.DataFrame(out)
NB: this uses the walrus operator (:=, Python ≥3.8) and the dictionary union operator ({}|{}, Python ≥3.9); neither is strictly required, and the code can be adapted for older Python versions*.
Output:
    county  co                  tech                          log  log-nunique
0   Albany  LU          Fiber, cable        {o, n, m, r, q, w, p}            7
1   Albany  LU         Fiber, copper              {o, n, s, t, w}            5
2   Albany  LU         cable, copper     {o, n, m, s, r, q, t, p}            8
3   Albany  LU  Fiber, cable, copper  {o, n, m, s, r, q, t, w, p}            9
4  Laramie  LU         cable, copper        {b, a, c, e, f, g, d}            7
* code for older Python versions:
group = ['county', 'co']
tmp = df.groupby(['tech', 'county', 'co'])['loc'].agg(set)

out = []
for k, g in tmp.groupby(['county', 'co']):
    for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
        out.append({'county': k[0],
                    'co': k[1],
                    'tech': ', '.join(x),
                    'log': set().union(*g.loc[x]),
                    })
        out[-1]['log-nunique'] = len(out[-1]['log'])
out = pd.DataFrame(out)
Comments
0 votes
Scott Boston
9/10/2023
Great solution, @mozway. +1
0 votes
Rac
9/10/2023
Is this code efficient to run on a large dataset?
0 votes
mozway
9/10/2023
@Rac In terms of generating the data, I believe this is as efficient as it gets with Python. However, keep in mind that generating the combinations can yield an output much larger than the input, especially if you have many 'tech' values in a group, whatever the algorithm or programming language. What are the number of groups and the maximum number of unique 'tech' values per group?
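To make that growth concrete (an editor's sketch, not from the original comment): a group with t unique tech values produces 2**t - t - 1 combinations of size ≥ 2, so the output explodes quickly as t grows:

# subsets of size >= 2 out of t items: 2**t - t - 1
t = 10
print(sum(1 for _ in powerset(range(t), min=2)))  # 1013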
0 votes
Reinderien
9/11/2023
I'm running initial performance tests on a 100,000-row frame, and this approach is roughly an order of magnitude slower than a vectorised approach. I'll post more details once I have them.
0 votes
Rac
9/15/2023
The solution works very well. Thank you @mozway.
0 votes
Reinderien
9/10/2023
#2
This can be done in a nearly fully vectorised way. The vectorisation breaks down when you want lists in cells, which is generally a bad idea in Pandas anyway; nevertheless:
import cProfile
import pstats
from timeit import timeit
import seaborn as sns
import numpy as np
import pandas as pd
import itertools
from matplotlib import pyplot as plt


def op_data() -> pd.DataFrame:
    """Sample data more or less the same as OP"""
    return pd.DataFrame({
        'county': ['Laramie']*10 + ['Albany']*12,
        'co': ['LU']*22,
        'tech': ['cable']*6 + ['copper']*4 + ['cable']*6 + ['copper']*4 + ['fibre']*2,
        'loc': list('abcdefdefgmnopqrnostow'),
    })
def big_data(
    rand: np.random.Generator,
    n: int = 2_000,
    group_frac: float = 0.1,
    n_locs: int = 26,
    tech_dist: float = 0.33,
) -> pd.DataFrame:
    """Synthetic data: n rows, county-co group sizes set by group_frac,
    loc drawn from n_locs codes, tech mix controlled by tech_dist"""
    techs = np.full(shape=n, fill_value='copper')
    tech_asn = rand.random(size=n)
    techs[tech_asn > tech_dist] = 'cable'
    techs[tech_asn > tech_dist*0.5 + 0.5] = 'fibre'
    group_size = group_frac*n
    return pd.DataFrame({
        'county': np.linspace(start=0, stop=n/10, num=n, dtype=int),
        'co': np.linspace(start=0, stop=n/group_size, num=n, dtype=int),
        'tech': techs,
        'loc': rand.integers(low=0, high=n_locs, size=n),
    })
def notna_tuple(col: pd.Series) -> tuple:
    return tuple(col.dropna())


def agg_tech(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot boolean columns for tech assignments
    tech_vals = df['tech'].unique()
    df[tech_vals] = pd.DataFrame(
        data=np.equal.outer(df['tech'].values, tech_vals),
        columns=tech_vals, index=df.index,
    )

    # Cartesian product of all tech values
    tech_combos = pd.MultiIndex.from_product((
        pd.Index(name=tech, data=(False, True))
        for tech in tech_vals
    )).to_frame()
    # each group must have at least two tech values in the combination
    tech_combos = tech_combos[tech_combos[tech_vals].sum(axis=1) > 1]

    # Make tech name tuples
    names = pd.DataFrame(index=tech_combos.index)
    names[tech_vals] = tech_vals
    names[~tech_combos[tech_vals].values] = np.nan
    tech_combos['name'] = names.agg(notna_tuple, axis=1)

    # Cartesian product of all tech values and all county-co groups,
    # where each group must have at least two tech values in the combination
    tech_combos = pd.merge(
        # For each county-co group, for each tech: is there at least one of that tech?
        left=df.groupby(['county', 'co'])[tech_vals]
            .any().reset_index(),
        right=tech_combos,
        how='cross', suffixes=['_any', ''],
    )
    # Eliminate county-co/tech pairs where the tech combo uses a missing tech
    tech_combos = tech_combos.loc[
        (
            tech_combos[tech_vals] <= tech_combos[tech_vals + '_any'].values
        ).all(axis=1),
        ['county', 'co', 'name', *tech_vals],
    ]

    # loc values for every tech combination
    tech_combos = pd.merge(
        left=tech_combos, right=df,
        on=['county', 'co'], suffixes=['_group', ''],
    )
    tech_combos = (
        tech_combos.loc[
            (
                # eliminate rows for which the tech combination does not match
                tech_combos[tech_vals] <= tech_combos[tech_vals + '_group'].values
            ).all(axis=1),
            ['county', 'co', 'name', 'loc'],
        ]
        .drop_duplicates()
        .groupby(['county', 'co', 'name'])
        .agg([tuple, 'count'])
        .reset_index()
    )
    return tech_combos
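# Editor's illustration (not part of the original answer): the one-hot step in
# agg_tech, run in isolation on the OP data:
#   df = op_data()
#   tech_vals = df['tech'].unique()               # ['cable', 'copper', 'fibre']
#   np.equal.outer(df['tech'].values, tech_vals)  # 22x3 boolean grid where
#                                                 # cell [i, j] is True where row
#                                                 # i's tech equals tech_vals[j]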
def powerset(iterable, min=0):
    """modified powerset recipe"""
    s = list(iterable)
    return itertools.chain.from_iterable(
        itertools.combinations(s, r) for r in range(min, len(s)+1))


def mozway(df: pd.DataFrame) -> pd.DataFrame:
    group = ['county', 'co']
    tmp = df.groupby(['tech', *group])['loc'].agg(set)
    out = []
    for k, g in tmp.groupby(group):
        for x in map(list, powerset(g.index.get_level_values('tech'), min=2)):
            out.append(dict(zip(group, k))
                       | {'tech': ', '.join(x),
                          'log': (s := set().union(*g.loc[x])),
                          'log-nunique': len(s),
                          })
    return pd.DataFrame(out)
def test() -> None:
    print(agg_tech(op_data()))
    print(mozway(op_data()))


def plot() -> None:
    rand = np.random.default_rng(seed=0)
    methods = (agg_tech, mozway)
    fig, axes = plt.subplots(nrows=2, ncols=2)

    print('By n...')
    by_n = []
    for n in (10**(np.linspace(1.5, 4, num=20))).astype(int):
        inp_orig = big_data(rand, n=n)
        for method in methods:
            inp = inp_orig.copy()

            def run():
                return method(inp)

            dur = timeit(stmt=run, number=1)
            by_n.append((n, method.__name__, dur))
    sns.lineplot(pd.DataFrame(by_n, columns=('n', 'method', 'dur')),
                 x='n', y='dur', hue='method', ax=axes[0][0])
    axes[0][0].set(xscale='log', yscale='log')

    print('By group_frac...')
    by_frac = []
    for frac in np.linspace(0, 0.5, num=20):
        inp_orig = big_data(rand, group_frac=frac)
        for method in methods:
            inp = inp_orig.copy()

            def run():
                return method(inp)

            dur = timeit(stmt=run, number=1)
            by_frac.append((frac, method.__name__, dur))
    sns.lineplot(pd.DataFrame(by_frac, columns=('group_frac', 'method', 'dur')),
                 x='group_frac', y='dur', hue='method', ax=axes[0][1])

    print('By n_locs...')
    by_locs = []
    for n_locs in np.linspace(1, 200, num=20, dtype=int):
        inp_orig = big_data(rand, n_locs=n_locs)
        for method in methods:
            inp = inp_orig.copy()

            def run():
                return method(inp)

            dur = timeit(stmt=run, number=1)
            by_locs.append((n_locs, method.__name__, dur))
    sns.lineplot(pd.DataFrame(by_locs, columns=('n_locs', 'method', 'dur')),
                 x='n_locs', y='dur', hue='method', ax=axes[1][0])

    print('By tech dist...')
    by_tech = []
    for tech_dist in np.linspace(0.05, 0.95, num=20):
        inp_orig = big_data(rand, tech_dist=tech_dist)
        for method in methods:
            inp = inp_orig.copy()

            def run():
                return method(inp)

            dur = timeit(stmt=run, number=1)
            by_tech.append((tech_dist, method.__name__, dur))
    sns.lineplot(pd.DataFrame(by_tech, columns=('tech_dist', 'method', 'dur')),
                 x='tech_dist', y='dur', hue='method', ax=axes[1][1])
    axes[1][1].set(yscale='log')

    plt.show()


def profile():
    rand = np.random.default_rng(seed=0)
    df = big_data(rand, n=100_000)
    pr = cProfile.Profile()
    pr.enable()
    agg_tech(df)
    pr.disable()
    ps = pstats.Stats(pr).sort_stats('cumulative')
    ps.print_stats(10)


if __name__ == '__main__':
    test()
    profile()
    plot()
Output:
    county  co                    name                          loc      
                                                              tuple count
0   Albany  LU         (cable, copper)     (m, n, o, p, q, r, s, t)     8
1   Albany  LU  (cable, copper, fibre)  (m, n, o, p, q, r, s, t, w)     9
2   Albany  LU          (cable, fibre)        (m, n, o, p, q, r, w)     7
3   Albany  LU         (copper, fibre)              (n, o, s, t, w)     5
4  Laramie  LU         (cable, copper)        (a, b, c, d, e, f, g)     7
For any input larger than about 100 elements, this always outperforms the non-vectorised version; by how much mostly depends on the distribution of tech values:
Note that by far the longest operation is the agg(tuple). If you can tolerate a different output, e.g.
- a long-format table with one entry per loc, or
- a short table with only the unique counts, or
- a wide-format table with loc-or-NaN columns,
this would be much faster. For example, for the last option, you can use the following routine:
def agg_tech(df: pd.DataFrame) -> pd.DataFrame:
    # One-hot boolean columns for tech assignments
    tech_vals = df['tech'].unique()
    df[tech_vals] = pd.DataFrame(
        data=np.equal.outer(df['tech'].values, tech_vals),
        columns=tech_vals, index=df.index,
    )

    # Cartesian product of all tech values
    tech_combos = pd.MultiIndex.from_product((
        pd.Index(name=tech, data=(False, True))
        for tech in tech_vals
    )).to_frame()
    # each group must have at least two tech values in the combination
    tech_combos = tech_combos[tech_combos[tech_vals].sum(axis=1) > 1]

    # Make tech name tuples
    names = pd.DataFrame(index=tech_combos.index)
    names[tech_vals] = tech_vals
    names[~tech_combos[tech_vals].values] = np.nan
    tech_combos['name'] = names.agg(notna_tuple, axis=1)

    grouped = df.groupby(['county', 'co'])
    df['group_idx'] = grouped.cumcount()

    # Cartesian product of all tech values and all county-co groups,
    # where each group must have at least two tech values in the combination
    tech_combos = pd.merge(
        # For each county-co group, for each tech: is there at least one of that tech?
        left=grouped[tech_vals]
            .any().reset_index(),
        right=tech_combos,
        how='cross', suffixes=['_any', ''],
    )
    # Eliminate county-co/tech pairs where the tech combo uses a missing tech
    tech_combos = tech_combos.loc[
        (
            tech_combos[tech_vals] <= tech_combos[tech_vals + '_any'].values
        ).all(axis=1),
        ['county', 'co', 'name', *tech_vals],
    ]

    # loc values for every tech combination
    tech_combos = pd.merge(
        left=tech_combos, right=df,
        on=['county', 'co'], suffixes=['_group', ''],
    )
    tech_combos = (
        tech_combos.loc[
            (
                # eliminate rows for which the tech combination does not match
                tech_combos[tech_vals] <= tech_combos[tech_vals + '_group'].values
            ).all(axis=1),
            ['county', 'co', 'name', 'group_idx', 'loc'],
        ]
        .set_index(['county', 'co', 'name', 'group_idx'])
        .unstack('group_idx')
    )
    return tech_combos
The speedup is roughly 10×.
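If you still need the distinct-loc counts from this wide format, counting distinct non-NaN values per row is cheap (an editor's suggestion, not from the original answer):

wide = agg_tech(op_data())
log_nunique = wide.nunique(axis=1)  # distinct non-NaN loc values per combination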
Comments
0 votes
mozway
9/10/2023
Filtering with a cross merge is usually inefficient, especially since you use broadcasting in addition. Have you tested it on a large dataset?
0 votes
Reinderien
9/10/2023
That depends on many things the OP hasn't specified, including data size, the proportion of tech values per group, etc. If I tried a large dataset I'd be flying blind, and without more information that would be a waste of time.
0 votes
Rac
9/10/2023
It's a very large dataset of roughly 100 million+ rows; I'm not sure of the proportion of tech values per group.
0 votes
Reinderien
9/11/2023
@Rac I don't have the patience to test 100M rows, so I've demonstrated up to 10,000 rows.
0 votes
Reinderien
9/11/2023
@mozway So I've now tried it; and for a couple of reasons (the inefficiency of non-vectorised operations in Python, and the relatively low rejection rate of the filter after the cross join) it isn't inefficient.