Pandas 数据掩码,其中条件来自其他变量

Pandas data masking where the conditions come from other variables

提问人:A Doe 提问时间:3/21/2023 最后编辑:A Doe 更新时间:3/21/2023 访问量:73

问:

我有一个数据帧和两个列表,如下所示:

seller1 = [5, 4, 3]
seller2 = [4, 2, 1] 

df = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(df)

结果如下表所示:

   customer  time    location  demand price   
0   1          1        3        10     3   
1   1          2        4        12     4
2   1          3        2        15     4            
3   2          1        4        20     5 
4   2          2        3         8     2 
5   2          3        3        16     1 

和列表显示卖家在时间 1、2 和 3 的位置。我想知道需求和价格,如果其中一个卖家在确切的时间在那里,否则会掩盖需求数据。例如,在时间 1,卖家 1 位于位置 5,卖家 2 位于位置 4。同样,客户 1 位于位置 3,客户 2 位于位置 4。因此,卖家在 t=1 时遇到了第一个客户,但没有遇到第二个客户。seller1seller2

我想要的结束的桌子是

   customer  time    location  demand  price   
0   1          1        3        None    None   
1   1          2        4        12      4
2   1          3        2        None    None            
3   2          1        4        20      5 
4   2          2        3        None    None 
5   2          3        3        16      1 

到目前为止,我有

for i in range(df.shape[0]):
    if df["location"][i] != seller1[int(df["time"][i])-1] and df["location"][i] != seller2[int(df["time"][i])-1]:
        df["demand"][i] = np.nan
        df["price"][i] = np.nan

这会产生一个,而且 for 循环看起来也效率不高。SettingWithCopyWarning:

有没有办法用df.mask()做到这一点?

Python Pandas DataFrame 数据掩码

评论


答:

1赞 jjsantoso 3/21/2023 #1

在我看来,最好使用时间变量将卖家列表制作成 DataFrame 并将其合并到主 df,这样您就可以比较客户和卖家的位置。

这应该有效:

import pandas as pd

seller1 = [5, 4, 3]
seller2 = [4, 2, 1]

data = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(data)
sellers = pd.DataFrame(dict(seller1=seller1, seller2=seller2, time=range(1, len(seller1) + 1)))
df_sellers = df.merge(sellers, on='time').sort_values(['customer', 'time'])
match_sellers = (df_sellers['location'] == df_sellers['seller1']) | (df_sellers['location'] == df_sellers['seller2'])
df_sellers.loc[match_sellers, 'price_match'] = df_sellers.loc[match_sellers, 'price']

resulting dataframe

1赞 Shubham Sharma 3/21/2023 #2

带注释的代码

# Create tuples corresponding to time and location for each seller
c = ['time', 'location']
sellers = [(i, x) 
           for s in (seller1, seller2) 
           for i, x in enumerate(s, 1)]

# Identify the rows where tuple pairs match with
# the time and location from the given dataframe
mask = df.set_index(c).index.isin(sellers)

# Mask the rows where codition doesn;t hold true
c = ['demand', 'price']
df.loc[~mask, c] = np.nan

结果

   customer  time  location  demand  price
0         1     1         3     NaN    NaN
1         1     2         4    12.0    4.0
2         1     3         2     NaN    NaN
3         2     1         4    20.0    5.0
4         2     2         3     NaN    NaN
5         2     3         3    16.0    1.0