提问人:Nabi Shaikh 提问时间:10/13/2023 最后编辑:Nabi Shaikh 更新时间:10/13/2023 访问量:54
在 pandas 中为 groupby 创建序列中的模式
Creating a pattern in a sequence for groupby in pandas
问:
我有一个三列的数据集,“sort_seqlevelfor循环”,因为它需要更长的时间使用字典并附加到列表中。ID
and
. basically i want to identify id wise level sequence sort by sort_seq. please suggest any optimal code other then
输入数据集
import pandas as pd
import numpy as np
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
'sort_seq': [89, 24, 56, 8, 5, 64, 93, 88, 61, 31, 50, 75, 1, 81],
'level':['a', 'a', 'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)
预期输出
尝试过的代码
collect = []
for ij in df.id.unique():
idict = {}
x = df[df['id'] == ij]
x = x.sort_values(by='sort_seq',ascending=True)
x = x.reset_index()
idict[ij] = x['level'].tolist()
collect.append(idict)
collect
答:
1赞
jezrael
10/13/2023
#1
首先按DataFrame.sort_values
按两列排序,然后按连续值分组,按计数器按Series.value_counts
,按值连接计数数,因此可以按以下方式聚合:level
level
join
np.random.seed(123)
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
'sort_seq': np.random.randint(0, 100, size=14),
'level':['a', 'a', 'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)
df1 = df.sort_values(['id','sort_seq'])
df1 = (df1.groupby(['id', df1['level'].ne(df1['level'].shift()).cumsum()])['level']
.value_counts()
.droplevel(1)
.rename('col')
.reset_index()
.assign(level=lambda x: x['col'].astype(str) + x['level'])
.groupby('id')['level'].agg(','.join)
.reset_index(name='PATTERN')
)
print(df1)
id PATTERN
0 1 1c,2a,1b
1 2 2x
2 3 1b,1g,1b,1a
3 4 1b
4 5 2c
5 6 1b
或在理解中使用:itertools.groupby
Counter
import itertools
from collections import Counter
f = lambda x: ','.join(f'{b}{a}'
for _, g in itertools.groupby(x)
for a, b in Counter(g).items())
df1 = (df.sort_values(['id','sort_seq'])
.groupby('id')['level']
.agg(f)
.reset_index(name='PATTERN'))
print(df1)
id PATTERN
0 1 1c,2a,1b
1 2 2x
2 3 1b,1g,1b,1a
3 4 1b
4 5 2c
5 6 1b
评论
0赞
Nabi Shaikh
10/13/2023
在<单元格行中收到错误 ValueError Traceback(最近一次调用)<ipython-input-44-b3b1fcc2827f>: 10>() 10 df1 = (df1.groupby(['id', df1['level'].ne(df1['level'].shift()).cumsum()])['level'].value_counts() 11 .droplevel(1)\ ---> 12 .reset_index() -> 4817 raise ValueError(f“cannot insert {column}, already exists”) 4818 if not isinstance(loc, int): 4819 raise TypeError(“loc must be int”) ValueError: 无法插入级别,已存在
0赞
jezrael
10/13/2023
@NabiShaikh - 你现在可以测试了吗?
0赞
Panda Kim
10/13/2023
#2
法典
grp = df.sort_values(['id', 'sort_seq'])['level'].ne(df.sort_values(['id', 'sort_seq']).groupby('id')['level'].shift()).cumsum()
df.groupby(['id', grp])['level'].agg(['count', 'first'])\
.assign(pattern=lambda x: x['count'].astype('str') + x['first'])\
.groupby(level=0)['pattern'].agg(','.join).reset_index()
输出:
id pattern
0 1 1c,1a,1b,1a
1 2 2x
2 3 2b,1a,1g
3 4 1b
4 5 2c
5 6 1b
与所需的输出不同,示例中的输出应为 。2b
id 3
中间
GRP的 :
df
id sort_seq level grp
0 1 89 a 4
1 1 24 a 2
2 1 56 b 3
3 1 8 c 1
4 2 5 x 5
5 2 64 x 5
6 3 93 g 8
7 3 88 a 7
8 3 61 b 6
9 3 31 b 6
10 4 50 b 9
11 5 75 c 10
12 5 1 c 10
13 6 81 b 11
评论