Asked by: Ted Asked: 11/17/2023 Updated: 11/17/2023 Views: 41
Is there a more efficient way to Apply by Row, then by Column?
Q:
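My dataset consists of 5 measurements taken every day over a span of 700 days. I'd like to group the values by day of the week and then apply the trim_mean function from scipy.stats to each of the 5 measurements, using 1/stddev as the proportiontocut argument.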
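For readers unfamiliar with trim_mean, here is a minimal sketch of what proportiontocut does. The array is made up for illustration and is not part of the dataset in this question: the function drops that fraction of the sorted values from each end and averages what remains.

```python
import numpy as np
from scipy.stats import trim_mean

# Illustrative values only: eight numbers with one large outlier.
x = np.array([0, 1, 2, 3, 4, 5, 6, 1000])

# proportiontocut=0.25 removes int(0.25 * 8) = 2 values from each end
# of the sorted data, so the mean is taken over [2, 3, 4, 5].
result = trim_mean(x, proportiontocut=0.25)

# The plain mean, for comparison, is pulled up by the outlier.
plain = x.mean()
```

In the question the fraction is 1/np.std(x), so the cut shrinks as the spread of the data grows. That choice presumably only makes sense when the standard deviation is well above 2, since the fraction removed from each end has to stay below 0.5.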
My data:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
col_names = pd.date_range('11-16-2023', periods=700)
df = pd.DataFrame(data, columns=col_names)
# df
2023-11-16 2023-11-17 ... 2025-10-15
0 51 92 ... 57
1 88 48 ... 32
2 89 52 ... 96
3 61 99 ... 48
4 0 7 ... 34
Right now, I can do this with the following (not very elegant) process:
df_T = df.T
df_T['Day of Week'] = pd.to_datetime(df_T.index).isocalendar().day
## Room for improvement here ##
# Apply calculation to each type of measurement
gb = df_T.groupby('Day of Week')
m0 = gb[0].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m1 = gb[1].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m2 = gb[2].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m3 = gb[3].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m4 = gb[4].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
results_df = pd.DataFrame([m0, m1, m2, m3, m4])
results_df.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# results_df
Mon Tue Wed Thu Fri Sat Sun
0 50.936170 51.712766 44.659574 49.117021 48.702128 47.414894 51.223404
1 49.244681 49.000000 49.138298 49.191489 45.872340 49.010638 47.074468
2 49.436170 46.404255 49.021277 46.553191 55.031915 51.265957 50.638298
3 43.744681 47.787234 48.574468 45.882979 47.255319 47.914894 49.606383
4 49.265957 46.255319 50.276596 50.872340 46.723404 45.255319 49.904255
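This is very inefficient, and it wouldn't make much sense if I had many more measurements. Is there a clever way to apply/map my trim_mean function to achieve the same result?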
A:
1 upvote
Timeless
11/17/2023
#1
One possible option:
from calendar import day_abbr
results_df = (
    (ser := df.T.stack())
    .droplevel(0)
    .groupby([ser.index.get_level_values(0).dayofweek, pd.Grouper(level=0)])
    .apply(lambda g: trim_mean(g, proportiontocut=1/np.std(g)))
    .unstack(0)
    .set_axis(list(day_abbr), axis=1)
)
Output:
print(results_df)
Mon Tue Wed Thu Fri Sat Sun
0 50.936170 51.712766 44.659574 49.117021 48.702128 47.414894 51.223404
1 49.244681 49.000000 49.138298 49.191489 45.872340 49.010638 47.074468
2 49.436170 46.404255 49.021277 46.553191 55.031915 51.265957 50.638298
3 43.744681 47.787234 48.574468 45.882979 47.255319 47.914894 49.606383
4 49.265957 46.255319 50.276596 50.872340 46.723404 45.255319 49.904255
[5 rows x 7 columns]
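As a possible alternative (a sketch, not part of the original answer), the same grouping can probably be done without stacking: transpose the frame so the dates become the index, group the rows by weekday, and let agg call the function once per measurement column. The helper name trimmed and the result name alt_df are made up for this example, and the data is rebuilt from the question so the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Rebuild the question's data so this block is self-contained.
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
df = pd.DataFrame(data, columns=pd.date_range('11-16-2023', periods=700))

def trimmed(x):
    # Hypothetical helper: x holds one measurement's values for one weekday.
    return trim_mean(x, proportiontocut=1 / np.std(x))

# Transpose so dates form the index, group rows by weekday (0 = Monday),
# apply the helper to every measurement column, then transpose back.
alt_df = df.T.groupby(df.columns.dayofweek).agg(trimmed).T
alt_df.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
```

With the seeded data this should give a 5 x 7 frame laid out like the output above. Because agg handles every column, the code would not need to change if more measurements were added.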
Comments
1 upvote
Ted
11/17/2023
Okay, wow, this is exactly the kind of thing I was after! That's an incredible approach, thank you.