How to change the grouping of a column multi-index in a pandas pivot table?

Asked by: Alexander Zorin  Asked: 10/20/2023  Updated: 10/20/2023  Views: 38

Q:

Suppose I have a DataFrame like this:

import pandas as pd

data = {'City': ['Rochester', 'Anaheim', 'Toledo', 'Rochester', 'Anaheim', 'Anaheim', 'Toledo', 'Rochester', 'Rochester', 'Rochester', 'Toledo', 'Toledo', 'Toledo', 'Anaheim'],
        'PersonID': [4930, 7343, 4368, 6909, 4574, 4086, 5024, 3642, 9997, 4745, 1207, 6081, 7832, 6309],
        'MoneySpent': [100, 1710, 20, 910, 2040, 1100, 490, 70, 1940, 100, 1240, 80, 1420, 2090],
        'StayDuration': ['< 2 days', '2-7 days', '2-7 days', '7-30 days', '7-30 days', '< 2 days', '2-7 days', '7-30 days', '7-30 days', '2-7 days', '7-30 days', '< 2 days', '< 2 days', '7-30 days']
       }

df = pd.DataFrame(data)
    City        PersonID    MoneySpent  StayDuration
0   Rochester   4930        100         < 2 days
1   Anaheim     7343        1710        2-7 days
2   Toledo      4368        20          2-7 days
3   Rochester   6909        910         7-30 days
4   Anaheim     4574        2040        7-30 days
5   Anaheim     4086        1100        < 2 days
6   Toledo      5024        490         2-7 days
7   Rochester   3642        70          7-30 days
8   Rochester   9997        1940        7-30 days
9   Rochester   4745        100         2-7 days
10  Toledo      1207        1240        7-30 days
11  Toledo      6081        80          < 2 days
12  Toledo      7832        1420        < 2 days
13  Anaheim     6309        2090        7-30 days

Then I build a pivot table that shows, for each city and stay duration, the number of people and their total spending:

pv = pd.pivot_table(df,
                    index='City',
                    columns='StayDuration',
                    values=['PersonID', 'MoneySpent'],
                    aggfunc={'PersonID': 'count', 'MoneySpent': 'sum'}
                   )

What I get has the metrics (number of people or money spent) on the first column level, with the categories nested under them:

                                      MoneySpent                            PersonID
StayDuration    2-7 days    7-30 days   < 2 days    2-7 days    7-30 days   < 2 days
City                        
Anaheim         1710        4130        1100        1           2           1
Rochester       100         2920        100         1           3           1
Toledo          510         1240        1500        2           1           2
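
The two header rows in this output are the two levels of a column MultiIndex. As a small sketch on the same data, this is one way to inspect that structure; the print calls are only for illustration and are not part of the original question:

```python
import pandas as pd

data = {
    'City': ['Rochester', 'Anaheim', 'Toledo', 'Rochester', 'Anaheim', 'Anaheim', 'Toledo',
             'Rochester', 'Rochester', 'Rochester', 'Toledo', 'Toledo', 'Toledo', 'Anaheim'],
    'PersonID': [4930, 7343, 4368, 6909, 4574, 4086, 5024, 3642, 9997, 4745, 1207, 6081, 7832, 6309],
    'MoneySpent': [100, 1710, 20, 910, 2040, 1100, 490, 70, 1940, 100, 1240, 80, 1420, 2090],
    'StayDuration': ['< 2 days', '2-7 days', '2-7 days', '7-30 days', '7-30 days', '< 2 days',
                     '2-7 days', '7-30 days', '7-30 days', '2-7 days', '7-30 days', '< 2 days',
                     '< 2 days', '7-30 days'],
}
df = pd.DataFrame(data)

pv = pd.pivot_table(df,
                    index='City',
                    columns='StayDuration',
                    values=['PersonID', 'MoneySpent'],
                    aggfunc={'PersonID': 'count', 'MoneySpent': 'sum'})

# Number of header levels in the column index
print(pv.columns.nlevels)
# Names of the levels: the metric level is unnamed, the category level is 'StayDuration'
print(pv.columns.names)
# Every column is addressed by a (metric, StayDuration) tuple
print(pv.columns.tolist())
```

A single column can be selected with the full tuple, for example `pv[('MoneySpent', '2-7 days')]`.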

What I want is the categories first, with the metrics nested inside each of them, like this:

            2-7 days                7-30 days               < 2 days    
            PersonID   MoneySpent   PersonID   MoneySpent   PersonID   MoneySpent  
Anaheim     1          1710         2          4130         1          1100
Rochester   1          100          3          2920         1          100
Toledo      2          510          1          1240         2          1500

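By the way, this is the default layout of an Excel pivot table.

I have spent a long time trying to get Python to produce the same result. Is it possible to change the grouping order of the columns?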

python pandas pivot-table multi-index


A:

0 votes russhoppa 10/20/2023 #1

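One way to solve this is to build a new MultiIndex with the column levels reversed and reindex the swapped columns against it. Assigning the new index directly to pv.columns would only relabel the columns by position and leave the values where they were, so labels and data would no longer match.

new_multiindex = [(stay_dur, metric) for stay_dur in sorted(df.StayDuration.unique()) for metric in ['PersonID', 'MoneySpent']]
# swap the two header levels, then let reindex move each column to its label
pv = pv.swaplevel(0, 1, axis=1).reindex(columns=pd.MultiIndex.from_tuples(new_multiindex, names=('StayDuration', None)))
pv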
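
Whenever a column header is rebuilt by hand, it is worth checking that every column still carries the values its label promises. The following is a minimal sketch of such a check on the question's data; the names `target`, `reordered` and `values_match` are hypothetical and chosen only for this illustration:

```python
import pandas as pd

data = {
    'City': ['Rochester', 'Anaheim', 'Toledo', 'Rochester', 'Anaheim', 'Anaheim', 'Toledo',
             'Rochester', 'Rochester', 'Rochester', 'Toledo', 'Toledo', 'Toledo', 'Anaheim'],
    'PersonID': [4930, 7343, 4368, 6909, 4574, 4086, 5024, 3642, 9997, 4745, 1207, 6081, 7832, 6309],
    'MoneySpent': [100, 1710, 20, 910, 2040, 1100, 490, 70, 1940, 100, 1240, 80, 1420, 2090],
    'StayDuration': ['< 2 days', '2-7 days', '2-7 days', '7-30 days', '7-30 days', '< 2 days',
                     '2-7 days', '7-30 days', '7-30 days', '2-7 days', '7-30 days', '< 2 days',
                     '< 2 days', '7-30 days'],
}
df = pd.DataFrame(data)

pv = pd.pivot_table(df,
                    index='City',
                    columns='StayDuration',
                    values=['PersonID', 'MoneySpent'],
                    aggfunc={'PersonID': 'count', 'MoneySpent': 'sum'})

# Desired header: every stay duration, each holding PersonID then MoneySpent
target = pd.MultiIndex.from_product(
    [sorted(df['StayDuration'].unique()), ['PersonID', 'MoneySpent']],
    names=['StayDuration', None],
)

# Swap the header levels, then align columns to the target by label
reordered = pv.swaplevel(0, 1, axis=1).reindex(columns=target)


def values_match(original, rearranged):
    """Return True if each (stay, metric) column equals the original (metric, stay) column."""
    return all(
        (rearranged[(stay, metric)] == original[(metric, stay)]).all()
        for stay, metric in rearranged.columns
    )


print(values_match(pv, reordered))
```

Because `reindex` aligns by label, the check is expected to pass here; a header that was only renamed by position would generally fail it.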

0 votes Suraj Shourie 10/20/2023 #2

As far as I know, pandas pivot will always order the columns this way. You will need a bit of manipulation to get the desired output:

pv.swaplevel(0,1,axis=1).sort_index(axis=1).reindex(['PersonID', 'MoneySpent'], level=1, axis=1)

Output:

StayDuration 2-7 days            7-30 days            < 2 days           
             PersonID MoneySpent  PersonID MoneySpent PersonID MoneySpent
City                                                                     
Anaheim             1       1710         2       4130        1       1100
Rochester           1        100         3       2920        1        100
Toledo              2        510         1       1240        2       1500
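
To see what each call in that chain contributes, the one-liner can be split into steps. This is only a sketch on the question's data; the intermediate names `step1`, `step2` and `step3` are introduced here for illustration:

```python
import pandas as pd

data = {
    'City': ['Rochester', 'Anaheim', 'Toledo', 'Rochester', 'Anaheim', 'Anaheim', 'Toledo',
             'Rochester', 'Rochester', 'Rochester', 'Toledo', 'Toledo', 'Toledo', 'Anaheim'],
    'PersonID': [4930, 7343, 4368, 6909, 4574, 4086, 5024, 3642, 9997, 4745, 1207, 6081, 7832, 6309],
    'MoneySpent': [100, 1710, 20, 910, 2040, 1100, 490, 70, 1940, 100, 1240, 80, 1420, 2090],
    'StayDuration': ['< 2 days', '2-7 days', '2-7 days', '7-30 days', '7-30 days', '< 2 days',
                     '2-7 days', '7-30 days', '7-30 days', '2-7 days', '7-30 days', '< 2 days',
                     '< 2 days', '7-30 days'],
}
df = pd.DataFrame(data)

pv = pd.pivot_table(df,
                    index='City',
                    columns='StayDuration',
                    values=['PersonID', 'MoneySpent'],
                    aggfunc={'PersonID': 'count', 'MoneySpent': 'sum'})

# Step 1: swap the header levels; tuples become (StayDuration, metric),
# but the columns are still physically grouped by metric
step1 = pv.swaplevel(0, 1, axis=1)

# Step 2: sort the column index so columns with the same StayDuration sit together
step2 = step1.sort_index(axis=1)

# Step 3: within each StayDuration, put PersonID before MoneySpent
step3 = step2.reindex(['PersonID', 'MoneySpent'], level=1, axis=1)

for name, frame in [('swaplevel', step1), ('sort_index', step2), ('reindex', step3)]:
    print(name, frame.columns.tolist())
```

The values move together with their labels at every step, so no manual relabeling is needed.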