
Python code for calculation of very large adjacency matrix crashes using networkx MultiDiGraph

Asked by: Eurico Covas  Asked: 11/13/2023  Last edited by: Eurico Covas  Updated: 11/13/2023  Views: 33

Q:

I need to compute the adjacency matrix (in flat format) of a very large graph. The number of nodes is 54,327 and the number of edges is about 46 million. The input is the list of 46 million edges, so it looks like this

1 6
2 7
1 6
3 8
...

This means node 1 is connected to node 6, node 2 is connected to node 7, and so on, and edges may repeat. In this example the adjacency matrix looks like this

  1 2 3 6 7 8
1 0 0 0 2 0 0
2 0 0 0 0 1 0
3 0 0 0 0 0 1
6 2 0 0 0 0 0
7 0 1 0 0 0 0
8 0 0 1 0 0 0

In fact, I only need a flat version of the non-zero entries, so:

node_x node_y count_edges
1       6       2
2       7       1
3       8       1
6       1       2
7       2       1
8       3       1
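For concreteness, the counting I'm after can be reproduced on the toy input above by tallying each edge in both directions (a minimal sketch of the intended semantics only, not an approach for the real-size data):

from collections import Counter

toy_edges = [(1, 6), (2, 7), (1, 6), (3, 8)]

# Tally each edge in both directions so the flat result is symmetric
counts = Counter()
for x, y in toy_edges:
    counts[(x, y)] += 1
    counts[(y, x)] += 1

for (x, y), c in sorted(counts.items()):
    print(x, y, c)   # reproduces the flat table above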

Using networkx (or crosstab), the Python code crashes with an out-of-memory error (both on an 8 GB RAM personal machine and on a 64 GB RAM Linux server).

Doing it the slow way, with plain for loops, works, but it takes 24+ hours to run.

Here is the full test code. Any ideas on how to make this work with networkx (or crosstab, or any clever alternative)?

import random
import pandas as pd
import networkx as nx
import sys

number_nodes = 54327 # real one is 54,327
number_edges_for_all_nodes = 46000000 # real one is around 46 million

print("number_nodes=",number_nodes)
print("number_edges_for_all_nodes=",number_edges_for_all_nodes)

# Create the list of possible node labels
list_nodes = [f'{i}' for i in range(1, number_nodes)]

# Create random combinations of possible values.
nodeid_x = random.choices(list_nodes, k=number_edges_for_all_nodes)
nodeid_y = random.choices(list_nodes, k=number_edges_for_all_nodes)

# This would look like this in a dataframe.
df = pd.DataFrame({'nodeid_x': nodeid_x, 'nodeid_y': nodeid_y})

# remove self nodes
df = df.query('nodeid_x != nodeid_y')

# df
print("Graph in df ->")
print(df.head(10))

print("Create edges for MultiDiGraph")            
edges = df[['nodeid_x', 'nodeid_y']].to_numpy(dtype=int)  
# 
print("Create nodes for MultiDiGraph")            
nodes = sorted(set(node for edge in edges for node in edge))
# create graph that allows parallel connections, so that one node may be connected twice to another       
# Multi means parallel connections, Di means directional, since we want both directions
G = nx.MultiDiGraph()
print("Create add_nodes_from for MultiDiGraph")            
G.add_nodes_from(nodes)
print("Create add_edges_from for MultiDiGraph")            
G.add_edges_from(edges)
# memory usage
edge = sum([sys.getsizeof(e) for e in G.edges])
node = sum([sys.getsizeof(n) for n in G.nodes])
print("Total memory graph GB:", (edge + node)/1024/1024/1024)

# count connections, store to int to save space            
print("Create to_pandas_adjacency for MultiDiGraph")            
df = nx.to_pandas_adjacency(G, dtype=int)  
print("Total memory adjacency GB:", df.memory_usage(deep=True).sum()/1024/1024/1024)
print(df)

# Reset the index to flatten the DataFrame from a matrix to a long table
print("Reset the index to flatten the DataFrame from a matrix to a long table")            
df = df.stack().reset_index() 
df.index = df.index.astype('int32', copy = False) 

# rename the column/row edges
print("rename the column/row edges")            
df.rename(columns={'level_0': 'nodeid_x'}, inplace=True)
df.rename(columns={'level_1': 'nodeid_y'}, inplace=True)  
                        
# rename crosstab intersection column
print("rename crosstab intersection column")
df.rename(columns={0: 'count_intersections'}, inplace=True)

# change count_intersections to int to save memory
print("change count_intersections to int to save memory")
df['count_intersections'] = df['count_intersections'].astype('int32', copy = False)
        
# now drop rows for which the nodeid_x = nodeid_y
print("now drop rows for which the nodeid_x = nodeid_y")
df = df.query('nodeid_x != nodeid_y')

# now drop zero intersections rows
print("now drop zero intersections rows")
df = df.query('count_intersections != 0')  

# sort
df.sort_values(by=['count_intersections', 'nodeid_x', 'nodeid_y'], ascending=False, inplace = True)

print("len(df)=",len(df))

print("pandas_adjacency matrix flat->")
print(df.head(9))
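I suspect (not confirmed by profiling) that the crash comes from nx.to_pandas_adjacency materializing the full dense 54327 x 54327 matrix, which df.stack() then expands to roughly 3 billion rows before the zero entries are dropped. A rough back-of-the-envelope estimate (actual usage depends on dtypes and intermediate copies):

n = 54327
cells = n * n                          # ~2.95 billion entries in the dense matrix
dense_gb = cells * 8 / 1024**3         # 8 bytes per int64/float64 cell
stacked_gb = cells * 3 * 8 / 1024**3   # stack() adds two index levels per value
print(f"dense adjacency : ~{dense_gb:.0f} GB")                          # ~22 GB
print(f"stacked long form: ~{stacked_gb:.0f} GB before zeros dropped")  # ~66 GB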

Tried so far: networkx, crosstab, a slow for loop.

python pivot-table networkx adjacency-matrix adjacency-list

Comments


A:

1 upvote  Andrej Kesely  11/13/2023  #1

Maybe I'm missing something, but you can use .groupby + .sum here:

out = (
    df.assign(count_intersections=1)
    .groupby(["nodeid_x", "nodeid_y"], as_index=False)["count_intersections"]
    .sum()
)
out.sort_values(
    by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
)
print(out.head(10))

Prints (using random.seed(42), with 46 million edges, in under a minute):

         nodeid_x nodeid_y  count_intersections
41523794     5587    29315                    4
40851660    53761    14855                    4
40587824    53478    42683                    4
40429793    53307       75                    4
36763731     4938    42230                    4
30719061     4290    32316                    4
6800705     17279    42566                    4
45616995     9972    39801                    3
45613051     9968     5797                    3
45597210     9951    10498                    3
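The reason this fits in memory is that it never builds the dense 54327 x 54327 matrix; it only aggregates the ~46 million edge rows that actually exist. An equivalent formulation (a sketch, not part of the original answer, assuming pandas >= 1.1 for DataFrame.value_counts):

# Count duplicate (nodeid_x, nodeid_y) pairs directly from the edge list
out = (
    df.value_counts(subset=["nodeid_x", "nodeid_y"])
      .reset_index(name="count_intersections")
)
out.sort_values(
    by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
)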

Comments

1 upvote  Eurico Covas  11/14/2023
Thanks, this seems to work, very nice! Clever!