在 pandas 中为 groupby 创建序列中的模式

Creating a pattern in a sequence for groupby in pandas

提问人:Nabi Shaikh 提问时间:10/13/2023 最后编辑:Nabi Shaikh 更新时间:10/13/2023 访问量:54

问:

我有一个三列的数据集,“sort_seqlevelfor循环”,因为它需要更长的时间使用字典并附加到列表中。IDand. basically i want to identify id wise level sequence sort by sort_seq. please suggest any optimal code other then

输入数据集

    import pandas as pd
import numpy as np
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': [89, 24, 56,  8,  5, 64, 93, 88, 61, 31, 50, 75,  1, 81],
        'level':['a', 'a',  'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)

预期输出

enter image description here

尝试过的代码

collect = []
for ij in df.id.unique():
  idict = {}
  x =  df[df['id'] == ij]
  x = x.sort_values(by='sort_seq',ascending=True)
  x = x.reset_index()
  idict[ij] =  x['level'].tolist()
  collect.append(idict)
collect
pandas DataFrame 字典 模式 序列

评论

1赞 Panda Kim 10/13/2023
我问您是否提供了一个将随机变量分配给sort_seq列的示例,即使您按sort_seq列排序得出结果。我正在检查我是否误读了问题。

答:

1赞 jezrael 10/13/2023 #1

首先按DataFrame.sort_values按两列排序,然后按连续值分组,按计数器按Series.value_counts,按值连接计数数,因此可以按以下方式聚合:levelleveljoin

np.random.seed(123)
    
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': np.random.randint(0, 100, size=14),
        'level':['a', 'a',  'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)

df1 = df.sort_values(['id','sort_seq'])

df1 = (df1.groupby(['id', df1['level'].ne(df1['level'].shift()).cumsum()])['level']
          .value_counts()
          .droplevel(1)
          .rename('col')
          .reset_index()
          .assign(level=lambda x: x['col'].astype(str) + x['level'])
          .groupby('id')['level'].agg(','.join)
          .reset_index(name='PATTERN')
           )
print(df1)

   id      PATTERN
0   1     1c,2a,1b
1   2           2x
2   3  1b,1g,1b,1a
3   4           1b
4   5           2c
5   6           1b

或在理解中使用:itertools.groupbyCounter

import itertools
from collections import Counter


f = lambda x: ','.join(f'{b}{a}' 
                        for _, g in itertools.groupby(x) 
                        for a, b in Counter(g).items())
df1 = (df.sort_values(['id','sort_seq'])
         .groupby('id')['level']
         .agg(f)
         .reset_index(name='PATTERN'))
           
print(df1)
   id      PATTERN
0   1     1c,2a,1b
1   2           2x
2   3  1b,1g,1b,1a
3   4           1b
4   5           2c
5   6           1b

评论

0赞 Nabi Shaikh 10/13/2023
在<单元格行中收到错误 ValueError Traceback(最近一次调用)<ipython-input-44-b3b1fcc2827f>: 10>() 10 df1 = (df1.groupby(['id', df1['level'].ne(df1['level'].shift()).cumsum()])['level'].value_counts() 11 .droplevel(1)\ ---> 12 .reset_index() -> 4817 raise ValueError(f“cannot insert {column}, already exists”) 4818 if not isinstance(loc, int): 4819 raise TypeError(“loc must be int”) ValueError: 无法插入级别,已存在
0赞 jezrael 10/13/2023
@NabiShaikh - 你现在可以测试了吗?
0赞 Panda Kim 10/13/2023 #2

法典

grp = df.sort_values(['id', 'sort_seq'])['level'].ne(df.sort_values(['id', 'sort_seq']).groupby('id')['level'].shift()).cumsum()
df.groupby(['id', grp])['level'].agg(['count', 'first'])\
  .assign(pattern=lambda x: x['count'].astype('str') + x['first'])\
  .groupby(level=0)['pattern'].agg(','.join).reset_index()

输出:

    id  pattern
0   1   1c,1a,1b,1a
1   2   2x
2   3   2b,1a,1g
3   4   1b
4   5   2c
5   6   1b

与所需的输出不同,示例中的输出应为 。2bid 3


中间

GRP的 :

              df
    id  sort_seq    level               grp
0   1   89          a                   4
1   1   24          a                   2
2   1   56          b                   3
3   1   8           c                   1
4   2   5           x                   5
5   2   64          x                   5
6   3   93          g                   8
7   3   88          a                   7
8   3   61          b                   6
9   3   31          b                   6
10  4   50          b                   9
11  5   75          c                   10
12  5   1           c                   10
13  6   81          b                   11