Doing a vlookup type chain in a python dataframe, which labels the number of iterations complete and publishes results in another dataframe

Asked by: 8ull53y3 · Asked: 9/7/2023 · Last edited by: 8ull53y3 · Updated: 9/7/2023 · Views: 32

Q:

I would like to run a vlookup-style chain within a dataframe and publish the results in a new dataframe.

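The actual dataframe has many columns, but for this task I am only interested in 2 of them. The data in the dataframe is not random: the elements are all linked together, so the values in column a are related to those in column b. The dummy data below may not show this properly, but hopefully my expected output makes sense.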

data = {'a': [111, 112, 113, 114, 115, 215, 214, 213, 212, 211],
        'b': [112, 113, 114, 115, 116, 214, 213, 212, 211, 210]}

With the data above, I first want to check which elements of column a also exist in column b; each of those is a starting point for my search.

For example, take 115 (I know it is not the first element, but I want this result repeated for every element regardless of which one comes first).

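115 exists in both columns. Search for it in column b; once found, look at the value in column a at the same index as 115 in column b, which is 114. Now search for 114 in column b; once found, look up the value in column a at the index of the found 114, which is 113, and carry on until nothing more can be found. That is the first complete iteration loop. The second loop would be the same, but for 215.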
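The lookup described above can be sketched as a small helper. This is only an illustration, assuming each value appears at most once in column b; `walk_chain` is a hypothetical name that does not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({'a': [111, 112, 113, 114, 115, 215, 214, 213, 212, 211],
                   'b': [112, 113, 114, 115, 116, 214, 213, 212, 211, 210]})

def walk_chain(df, start):
    """Follow `start` from column b to column a until it is no longer found in b."""
    chain = []
    current = start
    while True:
        # Row labels where column b holds the current value.
        hits = df.index[df['b'] == current]
        # Stop when the value is missing from b, or when we are going in circles.
        if len(hits) == 0 or current in chain:
            break
        chain.append(current)
        # Value in column a on the same row becomes the next search term.
        current = int(df.at[hits[0], 'a'])
    return chain

print(walk_chain(df, 115))
```

Starting from 115, the helper collects 115, 114, 113 and 112, then stops because 111 does not occur in column b.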

The new df would look like this:

new_df =  {'found': [112, 113, 114, 115, 214, 213, 212, 211],
        'iteration': [1, 1, 1, 1, 2, 2, 2, 2]}

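If possible, partial loops should be omitted: if 114 were a new starting point it would lead to 114, 113 and 112, but since that is a subset of the larger chain starting at 115, I would like to leave it out.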

Sample code is below:

matches = df[df['a'].isin(df['b'])]

current_value = [df.loc[matches.index, 'a']][0]


result_dict = {'a': [], 'Chain':[]}

iteration = 1

for iteration, start_value in enumerate(starting_values, start=1):
    current_value = start_value
    visited_values = set()
    #iteration += 1
    values_in_loop = []

    while True:
        result.append(current_value)
        #iteration_results.add(iteration, start_value)
        if current_value in df['b'].values and current_value not in visited_values:
            current_index = df.index[df['b'] == current_value][0]
            current_value = df['a'].iloc[current_index]
            visited_values.add(current_value)
            values_in_loop.append(current_value)

        else:
            break
        
    iteration += 1   
    
new_df = pd.DataFrame({'found': result[::], 'Chain': iteration})

new_df
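The loop above uses `starting_values` and `result` without defining them. A minimal sketch of what it appears to be aiming for follows. It assumes each value occurs at most once in column b, and it treats a "chain head" as a candidate value that no other candidate leads to, which is one reading of the rule about omitting partial loops. The names `lookup`, `candidates`, `reached` and `heads` are illustrative only.

```python
import pandas as pd

data = {'a': [111, 112, 113, 114, 115, 215, 214, 213, 212, 211],
        'b': [112, 113, 114, 115, 116, 214, 213, 212, 211, 210]}
df = pd.DataFrame(data)

# Lookup table: value in column b -> value in column a on the same row.
# Assumption: each value occurs at most once in column b.
lookup = dict(zip(df['b'], df['a']))

# Candidates: values of column a that also appear in column b.
candidates = df.loc[df['a'].isin(df['b']), 'a'].tolist()

# Chain heads: candidates that no other candidate leads to, so
# sub-chains such as 114 -> 113 -> 112 are never started on their own.
reached = {lookup[v] for v in candidates}
heads = [v for v in candidates if v not in reached]

rows = []
for iteration, start in enumerate(heads, start=1):
    current = start
    visited = set()
    # Follow the chain while the value can still be found in column b.
    while current in lookup and current not in visited:
        visited.add(current)
        rows.append({'found': current, 'iteration': iteration})
        current = lookup[current]

new_df = pd.DataFrame(rows)
print(new_df)
```

On the dummy data this yields the same found/iteration pairs as the desired new_df, but listed in walk order (115, 114, 113, 112, then 211, 212, 213, 214) rather than in row order. A chain that forms a closed cycle has no head under this definition and would be skipped.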

Another example of a similar question I found on Stack Overflow is linked here: Building a chain using two columns in python

   b      a
Type1   Type2
Type3   Type4
Type8   Type13
Type3   Type15
Type2   Type6
Type4   Type9
Type6   Type11
Type9   Type18
Type13  Type20

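The output would be as follows. The answer to that question uses a similar method but applies it differently, and its output differs from what I need. Hope this helps.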

found  iteration
Type2     1
Type6     1
Type11    1
Type4     2
Type9     2
Type18    2
Type13    3
Type20    3

python pandas dataframe iteration isin

0 votes · mozway · 9/7/2023

So by "iteration" you mean the group that 115/114/113/112/111 belong to?

A:

0 votes · mozway · 9/7/2023 · #1

This is a graph problem. Use networkx with weakly_connected_components, and optionally pandas.factorize:

# pip install networkx
import networkx as nx
import pandas as pd

# df is the dataframe built from the question's data: df = pd.DataFrame(data)

G = nx.from_pandas_edgelist(df, source='b', target='a', create_using=nx.DiGraph)

groups = {n: i for i, g in enumerate(nx.weakly_connected_components(G), start=1)
          for n in g}

m = df['a'].isin(df['b'])

out = pd.DataFrame({'found': df.loc[m, 'a'],
                    'iteration': df.loc[m, 'a'].map(groups)
                    })
# optional
# if you must ensure that the numbers are 1 -> n without missing ones
out['iteration'] = pd.factorize(out['iteration'])[0]+1

Output:

   found  iteration
1    112          1
2    113          1
3    114          1
4    115          1
6    214          2
7    213          2
8    212          2
9    211          2

Example dataset to demonstrate the effect of factorize:

data = {'a': [111, 112, 113, 114, 300, 115, 215, 214, 213, 212, 211],
        'b': [112, 113, 114, 115, 301, 116, 214, 213, 212, 211, 210]}
df = pd.DataFrame(data)

Output:

    found  iteration  iteration_factorize
1     112          1                    1
2     113          1                    1
3     114          1                    1
5     115          1                    1
7     214          3                    2
8     213          3                    2
9     212          3                    2
10    211          3                    2
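The renumbering in the last column can be reproduced in isolation. The sketch below feeds the raw group labels from the table above (1 and 3, with group 2 missing because its member 300 does not appear in column b) into pandas.factorize; the variable names are illustrative only.

```python
import pandas as pd

# Raw group labels as they appear in the 'iteration' column above.
raw = pd.Series([1, 1, 1, 1, 3, 3, 3, 3], name='iteration')

# factorize returns zero-based codes in order of first appearance,
# plus the distinct values those codes stand for.
codes, uniques = pd.factorize(raw)

# Shift by one so the labels run 1..n without gaps.
renumbered = codes + 1
print(renumbered.tolist())
```

Because the codes follow order of first appearance, the groups keep their relative order and only the gap is closed.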
