
Python code for calculation of very large adjacency matrix crashes using networkx MultiDiGraph

Asked by: Eurico Covas  Asked: 11/13/2023  Last edited by: Eurico Covas  Updated: 11/13/2023  Views: 33

Q:

I need to compute the adjacency matrix (in flat format) of a very large graph. The number of nodes is 54,327 and the number of edges is about 46 million. The input is the list of 46 million edges, so it looks like this

1 6
2 7
1 6
3 8
...

This means node 1 is connected to node 6, node 2 is connected to node 7, and so on, and edges may repeat. In this example the adjacency matrix looks like this

  1 2 3 6 7 8
1 0 0 0 2 0 0
2 0 0 0 0 1 0
3 0 0 0 0 0 1
6 2 0 0 0 0 0
7 0 1 0 0 0 0
8 0 0 1 0 0 0

In fact, I only need a flat version of the non-zero entries, so:

node_x node_y count_edges
1       6       2
2       7       1
3       8       1
6       1       2
7       2       1
8       3       1
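For concreteness, the counting I'm after can be reproduced on the toy input above by tallying each edge in both directions (a minimal sketch of the intended semantics only, not an approach for the real-size data):

from collections import Counter

toy_edges = [(1, 6), (2, 7), (1, 6), (3, 8)]

# Tally each edge in both directions so the flat result is symmetric
counts = Counter()
for x, y in toy_edges:
    counts[(x, y)] += 1
    counts[(y, x)] += 1

for (x, y), c in sorted(counts.items()):
    print(x, y, c)   # reproduces the flat table above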

Using networkx (or crosstab), the Python code crashes with an out-of-memory error (both on an 8 GB RAM personal machine and on a 64 GB RAM Linux server).

Doing it the slow way, with plain for loops, works, but it takes 24+ hours to run.

Here is the full test code. Any ideas on how to make this work with networkx (or crosstab, or any clever alternative)?

import random
import pandas as pd
import networkx as nx
import sys

number_nodes = 54327 # real one is 54,327
number_edges_for_all_nodes = 46000000 # real one is around 46 million

print("number_nodes=",number_nodes)
print("number_edges_for_all_nodes=",number_edges_for_all_nodes)

# Create the list of possible node labels
list_nodes = [f'{i}' for i in range(1, number_nodes)]

# Create random combinations of possible values.
nodeid_x = random.choices(list_nodes, k=number_edges_for_all_nodes)
nodeid_y = random.choices(list_nodes, k=number_edges_for_all_nodes)

# This would look like this in a dataframe.
df = pd.DataFrame({'nodeid_x': nodeid_x, 'nodeid_y': nodeid_y})

# remove self nodes
df = df.query('nodeid_x != nodeid_y')

# df
print("Graph in df ->")
print(df.head(10))

print("Create edges for MultiDiGraph")            
edges = df[['nodeid_x', 'nodeid_y']].to_numpy(dtype=int)  
# 
print("Create nodes for MultiDiGraph")            
nodes = sorted(set(node for edge in edges for node in edge))
# create graph that allows parallel connections, so that one node may be connected twice to another       
# Multi means parallel connections, Di means directional, since we want both directions
G = nx.MultiDiGraph()
print("Create add_nodes_from for MultiDiGraph")            
G.add_nodes_from(nodes)
print("Create add_edges_from for MultiDiGraph")            
G.add_edges_from(edges)
# memory usage
edge = sum([sys.getsizeof(e) for e in G.edges])
node = sum([sys.getsizeof(n) for n in G.nodes])
print("Total memory graph GB:", (edge + node)/1024/1024/1024)

# count connections, store to int to save space            
print("Create to_pandas_adjacency for MultiDiGraph")            
df = nx.to_pandas_adjacency(G, dtype=int)  
print("Total memory adjacency GB:", df.memory_usage(deep=True).sum()/1024/1024/1024)
print(df)

# Reset the index to flatten the DataFrame from a matrix to a long table
print("Reset the index to flatten the DataFrame from a matrix to a long table")            
df = df.stack().reset_index() 
df.index = df.index.astype('int32', copy = False) 

# rename the column/row edges
print("rename the column/row edges")            
df.rename(columns={'level_0': 'nodeid_x'}, inplace=True)
df.rename(columns={'level_1': 'nodeid_y'}, inplace=True)  
                        
# rename crosstab intersection column
print("rename crosstab intersection column")
df.rename(columns={0: 'count_intersections'}, inplace=True)

# change count_intersections to int to save memory
print("change count_intersections to int to save memory")
df['count_intersections'] = df['count_intersections'].astype('int32', copy = False)
        
# now drop rows for which the nodeid_x = nodeid_y
print("now drop rows for which the nodeid_x = nodeid_y")
df = df.query('nodeid_x != nodeid_y')

# now drop zero intersections rows
print("now drop zero intersections rows")
df = df.query('count_intersections != 0')  

# sort
df.sort_values(by=['count_intersections', 'nodeid_x', 'nodeid_y'], ascending=False, inplace = True)

print("len(df)=",len(df))

print("pandas_adjacency matrix flat->")
print(df.head(9))
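I suspect (not confirmed by profiling) that the crash comes from nx.to_pandas_adjacency materializing the full dense 54327 x 54327 matrix, which df.stack() then expands to roughly 3 billion rows before the zero entries are dropped. A rough back-of-the-envelope estimate (actual usage depends on dtypes and intermediate copies):

n = 54327
cells = n * n                          # ~2.95 billion entries in the dense matrix
dense_gb = cells * 8 / 1024**3         # 8 bytes per int64/float64 cell
stacked_gb = cells * 3 * 8 / 1024**3   # stack() adds two index levels per value
print(f"dense adjacency : ~{dense_gb:.0f} GB")                          # ~22 GB
print(f"stacked long form: ~{stacked_gb:.0f} GB before zeros dropped")  # ~66 GB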

Tried so far: networkx, crosstab, a slow for loop.

python pivot-table networkx adjacency-matrix adjacency-list

Comments


A:

1 upvote  Andrej Kesely  11/13/2023  #1

Maybe I'm missing something, but you can use .groupby + .sum here:

out = (
    df.assign(count_intersections=1)
    .groupby(["nodeid_x", "nodeid_y"], as_index=False)["count_intersections"]
    .sum()
)
out.sort_values(
    by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
)
print(out.head(10))

Prints (using random.seed(42), with 46 million edges, in under a minute):

         nodeid_x nodeid_y  count_intersections
41523794     5587    29315                    4
40851660    53761    14855                    4
40587824    53478    42683                    4
40429793    53307       75                    4
36763731     4938    42230                    4
30719061     4290    32316                    4
6800705     17279    42566                    4
45616995     9972    39801                    3
45613051     9968     5797                    3
45597210     9951    10498                    3
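The reason this fits in memory is that it never builds the dense 54327 x 54327 matrix; it only aggregates the ~46 million edge rows that actually exist. An equivalent formulation (a sketch, not part of the original answer, assuming pandas >= 1.1 for DataFrame.value_counts):

# Count duplicate (nodeid_x, nodeid_y) pairs directly from the edge list
out = (
    df.value_counts(subset=["nodeid_x", "nodeid_y"])
      .reset_index(name="count_intersections")
)
out.sort_values(
    by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
)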

Comments

1 upvote  Eurico Covas  11/14/2023
Thanks, this seems to work, very nice! Clever!