Extract each item from a column of lists and then pick the top items

Asked by: trojan horse · Asked: 7/26/2022 · Last edited by: trojan horse · Updated: 7/26/2022 · Views: 219

Q:

I have the following DataFrame:

| tag      | list                                                |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato   | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |

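The `list` column holds a list of lists. Each inner list contains an item and a value between 0 and 1, and each outer list is sorted in descending order of that value.

For every item, I want to get the top 3 items from the same list, excluding the item itself. The resulting DataFrame should be: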

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['B',0.6],['C',0.5],['D',0.3]] |
| B    | [['A',0.9],['C',0.5],['D',0.3]] |
| C    | [['A',0.9],['B',0.6],['D',0.3]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['V',0.7],['W',0.4],['X',0.3]] |
| V    | [['U',0.8],['W',0.4],['X',0.3]] |
| W    | [['U',0.8],['V',0.7],['X',0.3]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

I tried this and can extract the values, but I am stuck on ignoring the item itself when building top_3. Here is what I did:

import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]

df = pd.DataFrame(data, columns=['tag', 'list'])
df

--

temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]

My result:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['A',0.9],['B',0.6],['C',0.5]] |
| B    | [['A',0.9],['B',0.6],['C',0.5]] |
| C    | [['A',0.9],['B',0.6],['C',0.5]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['U',0.8],['V',0.7],['W',0.4]] |
| V    | [['U',0.8],['V',0.7],['W',0.4]] |
| W    | [['U',0.8],['V',0.7],['W',0.4]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

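As you can see, the top_3 for A, B, C, U, V, and W is wrong: in every case the code takes the first 3 entries of the list and never excludes the item itself.

The result is always the same first 3 entries. I tried to add a filter but could not get it to work.

If there is a better way to extract the data than mine, please let me know how to optimize it.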

Tags: python pandas list dataframe manipulation


A:

1 upvote Alvaro 7/26/2022 #1

In this part you are missing a condition: you just take the first 3 items and overlook that the item itself must not appear among them.

for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

The solution is to remove the item from top_3 first and only then take the "sample":

for item, category in temp.items():
    top_3 = top_items.get(category)
    top_3_without_item = [x for x in top_3 if x[0] != item]
    sample = top_3_without_item[:3]
    similar.append([item, sample])
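
For reference, here is a minimal self-contained sketch of the same fix, combined with the question's setup so it can be run as is. The helper name `top_n_excluding` and its parameter `n` are hypothetical, introduced only for illustration. It assumes each list is already sorted by value in descending order, as stated in the question.

```python
import pandas as pd

data = [['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
        ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]]]
df = pd.DataFrame(data, columns=['tag', 'list'])


def top_n_excluding(pairs, item, n=3):
    """Return the first n [name, value] pairs whose name is not `item`.

    Hypothetical helper; assumes `pairs` is sorted by value, descending.
    """
    return [pair for pair in pairs if pair[0] != item][:n]


# Build one output row per item, using only the list the item came from.
rows = [[pair[0], top_n_excluding(pairs, pair[0])]
        for pairs in df['list']
        for pair in pairs]

result = pd.DataFrame(rows, columns=['item', 'top_3'])
print(result)
```

Filtering before slicing is the important part: slicing first and filtering afterwards would leave only 2 entries for the items that are themselves in the first 3.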

Comments

0 upvotes trojan horse 7/26/2022
Yes, that's exactly what I wanted to do, but I messed up the indices. Damn.

1 upvote Corralien 7/26/2022 #2

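As a starting point, you can explode your `list` column and then merge the result with itself. Next, remove the rows where the two list columns are equal, and finally group to collect the top 3 values: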

df1 = df.explode('list')

out = (df1.merge(df1, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())

Output:

>>> out
        tag item                           top_3
0  icecream    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1  icecream    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2  icecream    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3  icecream    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4  icecream    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    potato    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    potato    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    potato    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    potato    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    potato    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
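
The chained expression can be hard to follow, so here is a step-by-step sketch of the same explode, self-merge, filter, and group idea. The column names `pair`, `item`, and `score` and the `_other` suffix are hypothetical choices made for illustration, not part of the answer. Unlike the answer, the sketch detects the self-pairing by comparing item names rather than the list objects.

```python
import pandas as pd

data = [['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
        ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]]]
df = pd.DataFrame(data, columns=['tag', 'list'])

# Step 1: one row per [item, value] pair, with name and value split out.
exploded = df.explode('list').rename(columns={'list': 'pair'})
exploded['item'] = exploded['pair'].str[0]
exploded['score'] = exploded['pair'].str[1].astype(float)

# Step 2: self-merge on tag, pairing every item with every item of its tag.
merged = exploded.merge(exploded, on='tag', suffixes=('', '_other'))

# Step 3: drop the rows where an item is paired with itself.
merged = merged[merged['item'] != merged['item_other']]

# Step 4: order candidates by value, then keep the first 3 per (tag, item).
merged = merged.sort_values('score_other', ascending=False, kind='stable')
out = (merged.groupby(['tag', 'item'])['pair_other']
             .apply(lambda s: s.head(3).tolist())
             .rename('top_3')
             .reset_index())
print(out)
```

Note that the self-merge produces one row per ordered pair of items within a tag, so the intermediate frame grows quadratically with the list length. That is fine for short lists like these but may matter for long ones.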

1 upvote Himanshu Poddar 7/26/2022 #3

You can repeat each row as many times as its list has elements using pandas.DataFrame.reindex, then group the rows with pandas.DataFrame.groupby and iterate over the groups:

df = df.reindex(df.index.repeat(df.list.apply(len)))

similar = pd.DataFrame(columns = ['item', 'top3'])
for group_name, df_group in df.groupby('tag')['list']:
    for index, rows in enumerate(df_group):
        similar.loc[similar.shape[0]] = ([rows[index][0], (rows[:index] + rows[index + 1:])[:3]])

Output:

This gives you the expected output:

  item                            top3
0    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
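
The core of the loop is the slice expression `(rows[:index] + rows[index + 1:])[:3]`, which removes the current element by position rather than by name. A minimal plain-Python sketch of that idiom, outside pandas, might look like this (the function name `top_n_without_index` and the parameter `n` are hypothetical):

```python
def top_n_without_index(pairs, index, n=3):
    """Drop the element at position `index`, then keep the first n remaining pairs.

    Hypothetical helper; assumes `pairs` is sorted by value, descending.
    """
    return (pairs[:index] + pairs[index + 1:])[:n]


icecream = [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]

# Print each item next to its top 3 neighbours from the same list.
for i, (name, _value) in enumerate(icecream):
    print(name, top_n_without_index(icecream, i))
```

Removing by position avoids a name comparison per element, and it also behaves predictably if the same item name were ever to appear twice in one list, since only the current occurrence is dropped.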

Alternatively,

you can also try doing it without explicitly iterating over the groups:

df = df.reindex(df.index.repeat(df.list.apply(len)))
temp = df.groupby('tag')['list'].apply(lambda x : [([rows[index][0], (rows[:index] + rows[index + 1:])[:3]]) for index, rows in enumerate(x)])
df['item'] = temp.explode().str[0].values
df['top3'] = temp.explode().str[1].values

Output:

This gives you the same output.
