How to cluster data based on direct and indirect connections in Python

Question:

I have a list of data that I need to cluster efficiently in Python (about 8 million key-value pairs in total).

Example:

K1 = {V1, V2, V3}
K2 = {V1, V4}
K3 = {V5}
K4 = {V6}
K5 = {V1, V6}

Result:

C1 = {K1, K2, K4, K5}
C2 = {K3}

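C1 contains K1, K2, and K5 because they all contain V1 (direct connection).
C1 also contains K4 because V6 appears in K5, which is connected to the others (indirect connection).
C2 contains K3 because it has no connection to any KV pair in C1.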
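To make "direct connection" concrete, here is a small sketch that lists every pair of keys sharing at least one value in the example data. The name `direct` is only illustrative, and the pairwise comparison is quadratic in the number of keys, so it is meant for checking the toy example, not for the 8 million pairs.

```python
from itertools import combinations

data = {
    "K1": {"V1", "V2", "V3"},
    "K2": {"V1", "V4"},
    "K3": {"V5"},
    "K4": {"V6"},
    "K5": {"V1", "V6"},
}

# Map each pair of keys to the values they share; keep only non-empty overlaps.
direct = {
    (a, b): data[a] & data[b]
    for a, b in combinations(sorted(data), 2)
    if data[a] & data[b]
}

for (a, b), shared in direct.items():
    print(a, b, sorted(shared))
```

K3 appears in no pair, and K4 is paired only with K5, which is why K4 reaches the rest of C1 only indirectly.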

One way to solve this is to first create temporary clusters TC that consider only direct connections:

TC1 = {K1, K2, K5}
TC2 = {K3}
TC3 = {K4, K5}

Then merge all temporary clusters that share a K into the final clusters C:

C1 = {K1, K2, K4, K5}
C2 = {K3}
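The two steps above could be sketched roughly as follows. This is only an illustration of the idea, not tested against the real data; `by_value`, `temp_clusters`, and `clusters` are made-up names. Note that it also produces single-key groups for values such as V2 or V4, which are simply absorbed during the merge.

```python
from collections import defaultdict

data = {
    "K1": {"V1", "V2", "V3"},
    "K2": {"V1", "V4"},
    "K3": {"V5"},
    "K4": {"V6"},
    "K5": {"V1", "V6"},
}

# Step 1: one temporary cluster per value, holding the keys that contain it.
by_value = defaultdict(set)
for key, values in data.items():
    for value in values:
        by_value[value].add(key)
temp_clusters = list(by_value.values())

# Step 2: merge temporary clusters that share a key.
# `clusters` always holds mutually disjoint sets.
clusters = []
for tc in temp_clusters:
    merged = set(tc)
    remaining = []
    for c in clusters:
        if c & merged:
            merged |= c          # overlapping cluster is absorbed
        else:
            remaining.append(c)  # unrelated cluster is kept as-is
    remaining.append(merged)
    clusters = remaining

print(clusters)
```

Every new temporary cluster is compared against all clusters found so far, so the merge step can become quadratic when there are many separate clusters.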

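However, this approach has high complexity, and I think there may be a more efficient solution.

Is there a known mathematical problem that describes this, or a related problem?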

Tags: python, list, dictionary, cluster-analysis

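Answer:

import networkx as nx

# Input: each key K maps to its set of values V.
graph = {
    "K1": {"V1", "V2", "V3"},
    "K2": {"V1", "V4"},
    "K3": {"V5"},
    "K4": {"V6"},
    "K5": {"V1", "V6"},
}

inv_graph = {}  # inverted index: value -> set of keys that contain it
G = nx.Graph()  # graph whose nodes are the values

for k, v in graph.items():
    # Chain all values of one key into a path so they land in the same component.
    nx.add_path(G, v)
    for i in v:
        inv_graph.setdefault(i, set()).add(k)

out = []
# Each connected component of values corresponds to one cluster of keys.
for c in nx.connected_components(G):
    tmp = set()
    for i in c:
        tmp |= inv_graph[i]
    out.append(tmp)

print(out)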

Prints:

 [
   {'K2', 'K1', 'K5', 'K4'}, 
   {'K3'}
 ]
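The grouping above amounts to finding connected components, so a disjoint-set (union-find) structure is another common way to get the same result without building a graph object. The following is a minimal sketch under that assumption; `cluster_keys`, `find`, `union`, and `first_key_for_value` are illustrative names, not part of networkx or of the code above, and it has not been benchmarked on the 8 million pairs.

```python
from collections import defaultdict


def cluster_keys(data):
    """Group keys that are linked, directly or indirectly, by shared values."""
    parent = {}

    def find(x):
        # Return the representative of x's group, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups containing a and b.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_key_for_value = {}
    for key, values in data.items():
        find(key)  # register the key even if it shares nothing
        for value in values:
            if value in first_key_for_value:
                # Another key already has this value: link the two keys.
                union(first_key_for_value[value], key)
            else:
                first_key_for_value[value] = key

    groups = defaultdict(set)
    for key in data:
        groups[find(key)].add(key)
    return list(groups.values())


data = {
    "K1": {"V1", "V2", "V3"},
    "K2": {"V1", "V4"},
    "K3": {"V5"},
    "K4": {"V6"},
    "K5": {"V1", "V6"},
}
print(cluster_keys(data))
```

Each value is looked up once per key that contains it, so the work grows roughly with the total number of key-value entries rather than with the number of key pairs.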
