Extract each item from a column of lists and then pick the top items

Q:

I have the following DataFrame:

| tag      | list                                                |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato   | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |

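The `list` column holds a list of lists; each inner list contains an item and a value between 0 and 1. Each list is sorted in descending order by this value.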
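The two properties stated here (scores between 0 and 1, lists sorted in descending order) are what every approach on this page relies on, so it may be worth checking them first. The following is only a minimal sketch; `is_valid_ranking` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["icecream", [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]],
        ["potato", [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3], ["Y", 0.2]]],
    ],
    columns=["tag", "list"],
)


def is_valid_ranking(pairs):
    """Hypothetical helper: True if all scores lie in [0, 1] and are sorted descending."""
    scores = [score for _, score in pairs]
    in_range = all(0 <= s <= 1 for s in scores)
    descending = all(a >= b for a, b in zip(scores, scores[1:]))
    return in_range and descending


# One boolean per row of the DataFrame
valid = df["list"].apply(is_valid_ranking)
print(valid.all())  # True
```

If a list turned out not to be sorted, it would have to be sorted by score before taking the first three entries.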

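For each item in these lists, I want to get the top 3 items from the same list, excluding the item itself. The resulting DataFrame should be: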

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['B',0.6],['C',0.5],['D',0.3]] |
| B    | [['A',0.9],['C',0.5],['D',0.3]] |
| C    | [['A',0.9],['B',0.6],['D',0.3]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['V',0.7],['W',0.4],['X',0.3]] |
| V    | [['U',0.8],['W',0.4],['X',0.3]] |
| W    | [['U',0.8],['V',0.7],['X',0.3]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

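I tried this and was able to extract the values, but I am stuck on the part where the item itself should be ignored when building top_3. Here is what I did: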

import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]

df = pd.DataFrame(data, columns=['tag', 'list'])
df

--

temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]

My result:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['A',0.9],['B',0.6],['C',0.5]] |
| B    | [['A',0.9],['B',0.6],['C',0.5]] |
| C    | [['A',0.9],['B',0.6],['C',0.5]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['U',0.8],['V',0.7],['W',0.4]] |
| V    | [['U',0.8],['V',0.7],['W',0.4]] |
| W    | [['U',0.8],['V',0.7],['W',0.4]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

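As you can see, the top_3 for A, B, C, U, V, and W is wrong: in every case the code simply takes the first 3 entries of the list and never excludes the item itself.

The result I get is always the plain top 3. I tried to add a filter but could not get it to work.

If there is a better way to extract the data than mine, please let me know how to optimize it.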
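One way to add the missing filter, shown as a minimal sketch in plain Python: remove the current item from its list first, and only then slice off the first three entries. The names `others` and `result` are introduced here for illustration.

```python
import pandas as pd

data = [
    ["icecream", [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]],
    ["potato", [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3], ["Y", 0.2]]],
]
df = pd.DataFrame(data, columns=["tag", "list"])

similar = []
for _, row in df.iterrows():
    for item, _score in row["list"]:
        # Filter first, slice second: drop the current item, then keep three.
        others = [pair for pair in row["list"] if pair[0] != item]
        similar.append([item, others[:3]])

result = pd.DataFrame(similar, columns=["item", "top_3"])
print(result)
```

Because each list is already sorted in descending order, no extra sort is needed after filtering.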

python pandas list dataframe data-manipulation

# One row per [item, score] pair
df1 = df.explode('list')

# The self-merge on tag pairs every item (list_x) with every candidate (list_y) of the same tag
out = (df1.merge(df1, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())

Output:

>>> out
        tag item                           top_3
0  icecream    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1  icecream    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2  icecream    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3  icecream    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4  icecream    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    potato    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    potato    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    potato    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    potato    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    potato    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
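To see why a self-merge on `tag` helps, it can be useful to look at the intermediate row counts. The sketch below is illustrative only; the variable names `exploded`, `pairs`, and `others` are assumptions made here, and the self-pair filter compares item names rather than whole lists.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["icecream", [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]],
        ["potato", [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3], ["Y", 0.2]]],
    ],
    columns=["tag", "list"],
)

# One row per [item, score] pair: 2 tags * 5 pairs
exploded = df.explode("list")
print(len(exploded))  # 10

# Merging the frame with itself on tag combines every pair with every
# pair of the same tag: 2 tags * 5 * 5 combinations
pairs = exploded.merge(exploded, on="tag")
print(len(pairs))  # 50

# Dropping the rows where an item is paired with itself leaves 4 candidates per item
others = pairs[pairs["list_x"].str[0] != pairs["list_y"].str[0]]
print(len(others))  # 40
```

The merged frame grows quadratically with the list length, so for long lists a per-row loop or comprehension may use less memory.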

1 upvote Himanshu Poddar 7/26/2022 #3

You can repeat each row as many times as its list has elements using pandas.DataFrame.reindex, then group the rows with pandas.DataFrame.groupby, and then iterate over the groups.

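# Repeat each row once per element of its list, so every item gets a row of its own
df = df.reindex(df.index.repeat(df['list'].apply(len)))

similar = pd.DataFrame(columns = ['item', 'top3'])
for group_name, df_group in df.groupby('tag')['list']:
    # The index-th row of a group stands for the index-th item of that group's list
    for index, rows in enumerate(df_group):
        # Drop the item itself, then keep the first three of what remains
        similar.loc[similar.shape[0]] = [rows[index][0], (rows[:index] + rows[index + 1:])[:3]]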

Output:

This gives you the expected output:

  item                            top3
0    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
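The two building blocks of this approach, repeating the index and slicing around a position, can be looked at in isolation. This is a small sketch with illustrative variable names (`counts`, `repeated_index`, `without_i`) that are not taken from the answer.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["icecream", [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]]],
        ["potato", [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3], ["Y", 0.2]]],
    ],
    columns=["tag", "list"],
)

# Index.repeat repeats each index label by the matching count,
# which is what lets reindex duplicate every row once per list element.
counts = df["list"].apply(len)
repeated_index = df.index.repeat(counts)
print(repeated_index.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Slicing around position i removes the i-th element without mutating the list.
ranking = df.loc[0, "list"]
i = 1  # position of item 'B'
without_i = ranking[:i] + ranking[i + 1:]
print(without_i[:3])  # [['A', 0.9], ['C', 0.5], ['D', 0.3]]
```

Slicing by position rather than filtering by name means duplicate item names in one list would still be handled one position at a time.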

Alternatively,

you can also try doing it without explicitly iterating over the groups.

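# Start again from the original df, not the one already repeated above
df = df.reindex(df.index.repeat(df['list'].apply(len)))
# sort=False keeps the groups in the same order as the rows of df
temp = (df.groupby('tag', sort=False)['list']
          .apply(lambda x: [[rows[index][0], (rows[:index] + rows[index + 1:])[:3]]
                            for index, rows in enumerate(x)])
          .explode())
df['item'] = temp.str[0].values
df['top3'] = temp.str[1].values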

Output:

This gives you the same output.
