Asked by: trojan horse · Asked: 7/26/2022 · Last edited by: trojan horse · Updated: 7/26/2022 · Views: 219
Extract each item from a column of lists and then pick the top items
Q:
I have the following DataFrame:
| tag | list |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |
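The list column holds a list of lists; each inner list has an item and a value between 0 and 1. The lists are sorted in descending order of this value.
For each item, I want to extract the top 3 items from its list, excluding the item itself. The resulting DataFrame should be: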
| item | top_3 |
| ---- | --------------------------------|
| A | [['B',0.6],['C',0.5],['D',0.3]] |
| B | [['A',0.9],['C',0.5],['D',0.3]] |
| C | [['A',0.9],['B',0.6],['D',0.3]] |
| D | [['A',0.9],['B',0.6],['C',0.5]] |
| E | [['A',0.9],['B',0.6],['C',0.5]] |
| U | [['V',0.7],['W',0.4],['X',0.3]] |
| V | [['U',0.8],['W',0.4],['X',0.3]] |
| W | [['U',0.8],['V',0.7],['X',0.3]] |
| X | [['U',0.8],['V',0.7],['W',0.4]] |
| Y | [['U',0.8],['V',0.7],['W',0.4]] |
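I tried this and was able to extract the values, but I'm stuck on the part where I want to ignore the item itself when building top_3. Here is what I did: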
import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]
df = pd.DataFrame(data, columns=['tag', 'list'])
df

# map each item to the tag it belongs to
temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

# map each tag to its full list
top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]  # always the first 3 entries, even when one of them is the item itself
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]
My result:
| item | top_3 |
| ---- | --------------------------------|
| A | [['A',0.9],['B',0.6],['C',0.5]] |
| B | [['A',0.9],['B',0.6],['C',0.5]] |
| C | [['A',0.9],['B',0.6],['C',0.5]] |
| D | [['A',0.9],['B',0.6],['C',0.5]] |
| E | [['A',0.9],['B',0.6],['C',0.5]] |
| U | [['U',0.8],['V',0.7],['W',0.4]] |
| V | [['U',0.8],['V',0.7],['W',0.4]] |
| W | [['U',0.8],['V',0.7],['W',0.4]] |
| X | [['U',0.8],['V',0.7],['W',0.4]] |
| Y | [['U',0.8],['V',0.7],['W',0.4]] |
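As you can see, top_3 is wrong for A, B, C, U, V and W, because in every case it takes the first 3 entries and does not account for the item itself.
My result always returns the same top 3. I tried adding a filter but couldn't get it to work.
If there is a better way to extract the data than mine, please let me know how to optimize it.
A:
1 upvote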
Alvaro
7/26/2022
#1
In this part you are missing a condition: you just take the first 3 items, ignoring that those 3 must not include the item that is being used as the key.
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])
The solution is to first remove the item from top_3 and then take the sample:
for item, category in temp.items():
    top_3 = top_items.get(category)
    # drop the entry whose name matches the current item
    top_3_without_item = [x for x in top_3 if x[0] != item]
    sample = top_3_without_item[:3]
    similar.append([item, sample])
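For reference, the same filter can be folded into a single pass that skips the intermediate temp and top_items dictionaries. This is only a minimal sketch, not part of the answer above: the helper name top_n_excluding_self and its n parameter are made up for illustration, and it assumes item names are unique within each list.

```python
import pandas as pd

data = [
    ["icecream", [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]],
    ["potato", [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3], ["Y", 0.2]]],
]
df = pd.DataFrame(data, columns=["tag", "list"])


def top_n_excluding_self(pairs, n=3):
    """Hypothetical helper: for every [item, score] pair, return
    [item, first n pairs of the same list whose name differs from item]."""
    rows = []
    for name, _score in pairs:
        others = [p for p in pairs if p[0] != name]  # filter by name
        rows.append([name, others[:n]])              # list is already sorted
    return rows


# Flatten the per-row results into one list of [item, top_3] rows.
similar = [row for pairs in df["list"] for row in top_n_excluding_self(pairs)]
result = pd.DataFrame(similar, columns=["item", "top_3"])
print(result)
```

Because each list is already sorted in descending order, slicing after the filter is enough and no extra sort is needed.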
Comments
0 upvotes
trojan horse
7/26/2022
Yes, that's exactly what I wanted to do, but I messed up the indexing. Damn.
1 upvote
Corralien
7/26/2022
#2
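As a starting point, you can explode your list column and then self-merge the result. Next, drop the rows where the two list columns are equal, and finally group and keep the top 3 values:
out = df.explode('list')
# self-merge on tag, so every item is paired with every entry of the same list
out = (out.merge(out, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())
Output: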
>>> out
tag item top_3
0 icecream A [[B, 0.6], [C, 0.5], [D, 0.3]]
1 icecream B [[A, 0.9], [C, 0.5], [D, 0.3]]
2 icecream C [[A, 0.9], [B, 0.6], [D, 0.3]]
3 icecream D [[A, 0.9], [B, 0.6], [C, 0.5]]
4 icecream E [[A, 0.9], [B, 0.6], [C, 0.5]]
5 potato U [[V, 0.7], [W, 0.4], [X, 0.3]]
6 potato V [[U, 0.8], [W, 0.4], [X, 0.3]]
7 potato W [[U, 0.8], [V, 0.7], [X, 0.3]]
8 potato X [[U, 0.8], [V, 0.7], [W, 0.4]]
9 potato Y [[U, 0.8], [V, 0.7], [W, 0.4]]
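To see why the rows with equal list columns have to be dropped, it can help to look at the sizes of the intermediate frames. The following is a small sketch on the question's sample data, not part of the answer itself; the names exploded, pairs and is_self are illustrative only, and the self-pairs are detected by comparing item names rather than whole lists.

```python
import pandas as pd

data = [
    ["icecream", [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]],
    ["potato", [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3], ["Y", 0.2]]],
]
df = pd.DataFrame(data, columns=["tag", "list"])

# One row per [item, score] pair: 2 tags x 5 pairs.
exploded = df.explode("list")
print(len(exploded))  # 10

# Self-merge on tag: every pair meets every pair of the same tag (5 x 5 per tag).
pairs = exploded.merge(exploded, on="tag")
print(len(pairs))  # 50

# Rows where an item is paired with itself; these are the ones to drop.
is_self = pairs["list_x"].str[0] == pairs["list_y"].str[0]
print(int(is_self.sum()))  # 10

candidates = pairs[~is_self]
print(len(candidates))  # 40, i.e. 4 candidates for each of the 10 items
```

The merge grows quadratically with the length of each list, so for long lists a plain loop may be cheaper than this approach.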
1 upvote
Himanshu Poddar
7/26/2022
#3
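You can repeat each row once per element of its list using pandas.DataFrame.reindex, then group the rows using pandas.DataFrame.groupby, and then iterate over the groups:
# repeat every row len(list) times, so there is one row per item
df = df.reindex(df.index.repeat(df.list.apply(len)))
similar = pd.DataFrame(columns = ['item', 'top3'])
for group_name, df_group in df.groupby('tag')['list']:
    for index, rows in enumerate(df_group):
        # the item at position index, plus the first 3 of the remaining entries
        similar.loc[similar.shape[0]] = ([rows[index][0], (rows[:index] + rows[index + 1:])[:3]])
Output:
This gives you the expected output: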
item top3
0 A [[B, 0.6], [C, 0.5], [D, 0.3]]
1 B [[A, 0.9], [C, 0.5], [D, 0.3]]
2 C [[A, 0.9], [B, 0.6], [D, 0.3]]
3 D [[A, 0.9], [B, 0.6], [C, 0.5]]
4 E [[A, 0.9], [B, 0.6], [C, 0.5]]
5 U [[V, 0.7], [W, 0.4], [X, 0.3]]
6 V [[U, 0.8], [W, 0.4], [X, 0.3]]
7 W [[U, 0.8], [V, 0.7], [X, 0.3]]
8 X [[U, 0.8], [V, 0.7], [W, 0.4]]
9 Y [[U, 0.8], [V, 0.7], [W, 0.4]]
Alternatively, you can try doing it without explicitly iterating over the groups:
df = df.reindex(df.index.repeat(df.list.apply(len)))
temp = df.groupby('tag')['list'].apply(lambda x : [([rows[index][0], (rows[:index] + rows[index + 1:])[:3]]) for index, rows in enumerate(x)])
df['item'] = temp.explode().str[0].values
df['top3'] = temp.explode().str[1].values
Output:
This gives you the same output.
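Both variants above rely on the slice rows[:index] + rows[index + 1:], which removes an entry by position rather than by name. Here is a minimal pandas-free sketch of just that idiom; the function name top_n_by_position is hypothetical and not from the answer.

```python
pairs = [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]


def top_n_by_position(pairs, n=3):
    """Hypothetical helper: for the pair at each position i, keep the
    first n pairs of the list with position i cut out."""
    return [
        [pairs[i][0], (pairs[:i] + pairs[i + 1:])[:n]]
        for i in range(len(pairs))
    ]


rows = top_n_by_position(pairs)
for item, top3 in rows:
    print(item, top3)
```

Cutting by position only ever removes the current entry, so it behaves the same as a name filter when names are unique, and differs if a list contains the same name twice.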