提问人:A Doe 提问时间:3/21/2023 最后编辑:A Doe 更新时间:3/21/2023 访问量:73
Pandas 数据掩码,其中条件来自其他变量
Pandas data masking where the conditions come from other variables
问:
我有一个数据帧和两个列表,如下所示:
seller1 = [5, 4, 3]
seller2 = [4, 2, 1]
df = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(df)
结果如下表所示:
customer time location demand price
0 1 1 3 10 3
1 1 2 4 12 4
2 1 3 2 15 4
3 2 1 4 20 5
4 2 2 3 8 2
5 2 3 3 16 1
和列表显示卖家在时间 1、2 和 3 的位置。我想知道需求和价格,如果其中一个卖家在确切的时间在那里,否则会掩盖需求数据。例如,在时间 1,卖家 1 位于位置 5,卖家 2 位于位置 4。同样,客户 1 位于位置 3,客户 2 位于位置 4。因此,卖家在 t=1 时遇到了第一个客户,但没有遇到第二个客户。seller1
seller2
我想要的结束的桌子是
customer time location demand price
0 1 1 3 None None
1 1 2 4 12 4
2 1 3 2 None None
3 2 1 4 20 5
4 2 2 3 None None
5 2 3 3 16 1
到目前为止,我有
for i in range(df.shape[0]):
if df["location"][i] != seller1[int(df["time"][i])-1] and df["location"][i] != seller2[int(df["time"][i])-1]:
df["demand"][i] = np.nan
df["price"][i] = np.nan
这会产生一个,而且 for 循环看起来也效率不高。SettingWithCopyWarning:
有没有办法用df.mask()做到这一点?
答:
1赞
jjsantoso
3/21/2023
#1
在我看来,最好使用时间变量将卖家列表制作成 DataFrame 并将其合并到主 df,这样您就可以比较客户和卖家的位置。
这应该有效:
import pandas as pd
seller1 = [5, 4, 3]
seller2 = [4, 2, 1]
data = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(data)
sellers = pd.DataFrame(dict(seller1=seller1, seller2=seller2, time=range(1, len(seller1) + 1)))
df_sellers = df.merge(sellers, on='time').sort_values(['customer', 'time'])
match_sellers = (df_sellers['location'] == df_sellers['seller1']) | (df_sellers['location'] == df_sellers['seller2'])
df_sellers.loc[match_sellers, 'price_match'] = df_sellers.loc[match_sellers, 'price']
1赞
Shubham Sharma
3/21/2023
#2
带注释的代码
# Create tuples corresponding to time and location for each seller
c = ['time', 'location']
sellers = [(i, x)
for s in (seller1, seller2)
for i, x in enumerate(s, 1)]
# Identify the rows where tuple pairs match with
# the time and location from the given dataframe
mask = df.set_index(c).index.isin(sellers)
# Mask the rows where codition doesn;t hold true
c = ['demand', 'price']
df.loc[~mask, c] = np.nan
结果
customer time location demand price
0 1 1 3 NaN NaN
1 1 2 4 12.0 4.0
2 1 3 2 NaN NaN
3 2 1 4 20.0 5.0
4 2 2 3 NaN NaN
5 2 3 3 16.0 1.0
评论