Is there a more efficient way to Apply by Row, then by Column?

Asked by: Ted  Asked: 11/17/2023  Updated: 11/17/2023  Views: 41

Q:

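I have a dataset with 5 measurements taken every day over a span of 700 days. I'd like to group these values by day of the week and then apply the trim_mean function from scipy.stats to each of the 5 measurements, using 1/stddev as the proportiontocut argument.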
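For readers unfamiliar with trim_mean, here is a minimal sketch of what proportiontocut does, using made-up values that are not part of the dataset in this question. The variable names are only illustrative.

```python
import numpy as np
from scipy.stats import trim_mean

# Made-up sample: eight values with one extreme value at each end.
values = np.array([0, 10, 11, 12, 13, 14, 15, 1000])

# proportiontocut=0.25 drops int(0.25 * 8) = 2 values from each end
# of the sorted data, so the mean is taken over 11, 12, 13, 14.
trimmed = trim_mean(values, proportiontocut=0.25)

# With 1/std as the proportion, a wider spread means a smaller cut.
adaptive_cut = 1 / np.std(values)
n_dropped_each_end = int(adaptive_cut * len(values))
adaptive = trim_mean(values, proportiontocut=adaptive_cut)
```

In this sample the spread is so large that 1/std rounds down to zero dropped values, so the adaptive result equals the plain mean. In the other direction, a standard deviation close to or below 2 pushes the proportion toward or past 0.5, which trim_mean can reject as too large.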

My data:

import pandas as pd
import numpy as np
from scipy.stats import trim_mean

np.random.seed(42)

data = np.random.randint(0, 100, size=(5, 700))
col_names = pd.date_range('11-16-2023', periods=700)
df = pd.DataFrame(data, columns=col_names)

# df
    2023-11-16  2023-11-17 ...  2025-10-15
0   51          92         ...  57
1   88          48         ...  32
2   89          52         ...  96
3   61          99         ...  48
4   0           7          ...  34

Right now I can do this with the following (not very elegant) procedure:

df_T = df.T
df_T['Day of Week'] = pd.to_datetime(df_T.index).isocalendar().day

## Room for improvement here ##
# Apply calculation to each type of measurement
gb = df_T.groupby('Day of Week')
m0 = gb[0].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m1 = gb[1].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m2 = gb[2].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m3 = gb[3].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m4 = gb[4].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))

results_df = pd.DataFrame([m0, m1, m2, m3, m4])
results_df.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# results_df
    Mon         Tue         Wed         Thu         Fri         Sat         Sun
0   50.936170   51.712766   44.659574   49.117021   48.702128   47.414894   51.223404
1   49.244681   49.000000   49.138298   49.191489   45.872340   49.010638   47.074468
2   49.436170   46.404255   49.021277   46.553191   55.031915   51.265957   50.638298
3   43.744681   47.787234   48.574468   45.882979   47.255319   47.914894   49.606383
4   49.265957   46.255319   50.276596   50.872340   46.723404   45.255319   49.904255

This is very inefficient and doesn't scale if I have many measurements. Is there a clever way to apply/map my trim_mean function to achieve the same result?

python pandas numpy group-by

A:

1 upvote  Timeless  11/17/2023  #1

One possible option:

from calendar import day_abbr

results_df = (
   (ser:=df.T.stack()).droplevel(0).groupby(
     [ser.index.get_level_values(0).dayofweek, pd.Grouper(level=0)])
      .apply(lambda g: trim_mean(g, proportiontocut=1/np.std(g)))
      .unstack(0).set_axis(list(day_abbr), axis=1)
)

Output:

print(results_df)

         Mon        Tue        Wed        Thu        Fri        Sat        Sun
0  50.936170  51.712766  44.659574  49.117021  48.702128  47.414894  51.223404
1  49.244681  49.000000  49.138298  49.191489  45.872340  49.010638  47.074468
2  49.436170  46.404255  49.021277  46.553191  55.031915  51.265957  50.638298
3  43.744681  47.787234  48.574468  45.882979  47.255319  47.914894  49.606383
4  49.265957  46.255319  50.276596  50.872340  46.723404  45.255319  49.904255

[5 rows x 7 columns]
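As a possible variation (a sketch, not part of the answer above), the same table can probably be produced without stacking, by grouping the transposed frame on the weekday of its index and aggregating all measurement columns in one call. The helper name adaptive_trim_mean and the variable names by_day and results_alt are made up for this illustration; the data setup repeats the one from the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Same setup as in the question.
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
df = pd.DataFrame(data, columns=pd.date_range('11-16-2023', periods=700))


def adaptive_trim_mean(x):
    # Hypothetical helper: same calculation as the lambda in the question.
    return trim_mean(x, proportiontocut=1 / np.std(x))


df_T = df.T  # rows are dates, columns are the 5 measurements

# Group the rows by weekday (0 = Monday) and aggregate every column at once;
# the helper is called once per (weekday, measurement) pair.
by_day = df_T.groupby(df_T.index.dayofweek).agg(adaptive_trim_mean)

# Transpose back so measurements are rows and weekdays are columns.
results_alt = by_day.T
results_alt.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
```

Because agg runs over every column, this should keep working unchanged if more measurements are added. It still calls a Python function for each group and column, so it is tidier rather than fundamentally faster.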

Comments

1 upvote  Ted  11/17/2023
Okay, wow, this is exactly the kind of thing I was after! That's an incredible approach, thank you.