Get aggregates from different Dataframe to current Dataframe with conditions

Question:

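I have a harvest dataframe and a weather dataframe. For every block, I want the number of days above a temperature threshold in the x months before harvest. Note that the harvest dataframe spans multiple years, and the ids are not one-to-one between the frames: two blocks in the harvest df can share one id, which corresponds to a location in the weather frame.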

My current (working) code is below, but it is very slow, on the order of a few minutes. I would like to speed it up but am not sure how.

from pandas import DateOffset

def days_above_thresh(x, weather_df):
    return weather_df.loc[
            (weather_df["id"]==x.id) & \
            (weather_df["day"]>=x['harvest_date']-DateOffset(months=2)) & \
            (weather_df["day"]<=x['harvest_date']) & \
            (weather_df["temperature_max"]>30),
            "temperature_max"].count()

harvest_df["days_above_30"] = harvest_df.apply(days_above_thresh , args=(weather_df,), axis=1)

The dataframes look like this:

weather_df
id      day      temperature_max
1    2020-01-01    30
1    2020-01-02    32
1    2020-01-03    28
1    2020-01-04    25 
         .
         .
         .
2    2020-01-01    10
2    2020-01-02    15
2    2020-01-03    17
2    2020-01-04    12
         .
         .
         .

harvest_df
id   farm_id  harvest_date
1       87    2020-01-02 
1       86    2020-01-03
2       13    2020-01-30

python-3.x pandas dataframe

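Answer:

import pandas as pd

# Pair each harvest row with every weather row for the same id; reset_index keeps the original row label in 'index'.
tmp = harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
# Keep days above 30 within the two calendar months up to and including the harvest date (same window as the question).
msk = tmp['day'].between(tmp['harvest_date'] - pd.DateOffset(months=2), tmp['harvest_date']) & tmp['temperature_max'].gt(30)
harvest_df["days_above_30"] = tmp[msk].groupby('index').size().reindex(harvest_df.index, fill_value=0)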

It can also be written as a single chained expression:

harvest_df["days_above_30"] = (
    harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
    .assign(two_month_prior=lambda x: x['harvest_date'] - pd.DateOffset(months=2))
    .query("two_month_prior <= day <= harvest_date and temperature_max > 30")
    .groupby('index').size()
    .reindex(harvest_df.index, fill_value=0)
)
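
To check that the merge-based count agrees with the row-wise version from the question, here is a small self-contained sketch. The data is made up to mirror the sample tables in the question, the function names `days_above_thresh_merged`, `months`, and `thresh` are hypothetical, and the two-month window uses `pd.DateOffset(months=2)` as in the question. It assumes `harvest_df` has an unnamed index, so that `reset_index()` produces a column called `index`.

```python
import pandas as pd

# Toy data modeled on the question's tables (illustrative values only).
weather_df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "day": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2),
    "temperature_max": [30, 32, 28, 25, 10, 15, 17, 12],
})
harvest_df = pd.DataFrame({
    "id": [1, 1, 2],
    "farm_id": [87, 86, 13],
    "harvest_date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-30"]),
})


def days_above_thresh(row, weather, months=2, thresh=30):
    """Row-wise reference: one boolean scan of the weather frame per harvest row."""
    start = row["harvest_date"] - pd.DateOffset(months=months)
    hit = (
        (weather["id"] == row["id"])
        & weather["day"].between(start, row["harvest_date"])
        & (weather["temperature_max"] > thresh)
    )
    return int(hit.sum())


def days_above_thresh_merged(harvest, weather, months=2, thresh=30):
    """Merge-based version: join once on id, filter, then count per harvest row."""
    # Assumes an unnamed index, so reset_index() adds a column named 'index'.
    tmp = harvest.reset_index().merge(
        weather[["id", "day", "temperature_max"]], on="id", how="left"
    )
    start = tmp["harvest_date"] - pd.DateOffset(months=months)
    msk = tmp["day"].between(start, tmp["harvest_date"]) & tmp["temperature_max"].gt(thresh)
    # Harvest rows without any qualifying day are missing after the filter; fill them with 0.
    return tmp[msk].groupby("index").size().reindex(harvest.index, fill_value=0)


slow = harvest_df.apply(days_above_thresh, args=(weather_df,), axis=1)
fast = days_above_thresh_merged(harvest_df, weather_df)
print(slow.tolist(), fast.tolist())  # both are [1, 1, 0] for this toy data
```

The merge materializes one row per (harvest row, weather day) pair for each shared id, so it trades memory for speed; on a very large weather table it may be worth filtering `weather_df` to `temperature_max > 30` before merging.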
