提问人:Eurico Covas 提问时间:11/13/2023 最后编辑:Eurico Covas 更新时间:11/13/2023 访问量:33
使用 networkx MultiDiGraph 计算非常大的邻接矩阵崩溃的 Python 代码
Python code for calculation of very large adjacency matrix crashes using networkx MultiDiGraph
问:
我需要计算一个非常大的图形的邻接矩阵(平面格式)。节点数为 54327,数边数为 4600 万。输入是 4600 万条边,因此输入如下所示
1 6
2 7
1 6
3 8
...
这意味着节点 1 连接到 6,节点 2 连接到 7,并可能重复。在本例中,邻接矩阵如下所示
1 2 3 6 7 8
1 0 0 0 2 0 0
2 0 0 0 0 1 0
3 0 0 0 0 0 1
6 1 0 0 0 0 0
7 0 2 0 0 0 0
8 0 0 1 0 0 0
事实上,我只需要非零条目的平面版本,所以:
node_x node_y count_edges
1 6 2
2 7 1
3 8 1
6 1 2
7 2 1
8 3 1
使用 networkx(或交叉表),python 代码会因内存不足而崩溃(在 8GB RAM 个人机器或 64GB RAM Linux 服务器上)。
以缓慢的方式进行,只需使用 for 循环即可,但需要 24+ 小时才能运行。
以下是完整的测试代码,关于如何获取 networkx(或交叉表或任何巧妙的替代方案)的任何想法?
import random
import pandas as pd
import networkx as nx
import sys
number_nodes = 54327 # real one is 54,327
number_edges_for_all_nodes = 46000000 # real one is around 46 million
print("number_nodes=",number_nodes)
print("number_edges_for_all_nodes=",number_edges_for_all_nodes)
# Create a list of possible list of nodes
list_nodes = [f'{i}' for i in range(1, number_nodes)]
# Create random combinations of possible values.
nodeid_x = random.choices(list_nodes, k=number_edges_for_all_nodes)
nodeid_y = random.choices(list_nodes, k=number_edges_for_all_nodes)
# This would look like this in a dataframe.
df = pd.DataFrame({'nodeid_x': nodeid_x, 'nodeid_y': nodeid_y})
# remove self nodes
df = df.query('nodeid_x != nodeid_y')
# df
print("Graph in df ->")
print(df.head(10))
print("Create edges for MultiDiGraph")
edges = df[['nodeid_x', 'nodeid_y']].to_numpy(dtype=int)
#
print("Create nodes for MultiDiGraph")
nodes = sorted(set(node for edge in edges for node in edge))
# create graph that allows parallel connections, so that one node may be connected twice to another
# Multi means parallel connections, Di means directional, since we want both directions
G = nx.MultiDiGraph()
print("Create add_nodes_from for MultiDiGraph")
G.add_nodes_from(nodes)
print("Create add_edges_from for MultiDiGraph")
G.add_edges_from(edges)
# memory usage
edge = sum([sys.getsizeof(e) for e in G.edges])
node = sum([sys.getsizeof(n) for n in G.nodes])
print("Total memory graph GB:", (edge + node)/1024/1024/1024)
# count connections, store to int to save space
print("Create to_pandas_adjacency for MultiDiGraph")
df = nx.to_pandas_adjacency(G, dtype=int)
print("Total memory adjacency GB:", df.memory_usage(deep=True).sum()/1024/1024/1024)
print(df)
# Reset the index to flatten the DataFrame from a matrix to a long table
print("Reset the index to flatten the DataFrame from a matrix to a long table")
df = df.stack().reset_index()
df.index = df.index.astype('int32', copy = False)
# rename the column/row edges
print("rename the column/row edges")
df.rename(columns={'level_0': 'nodeid_x'}, inplace=True)
df.rename(columns={'level_1': 'nodeid_y'}, inplace=True)
# rename crosstab intersection column
print("rename crosstab intersection column")
df.rename(columns={0: 'count_intersections'}, inplace=True)
# change count_intersections to int to save memory
print("change count_intersections to int to save memory")
df['count_intersections'] = df['count_intersections'].astype('int32', copy = False)
# now drop rows for which the nodeid_x = nodeid_y
print("now drop rows for which the nodeid_x = nodeid_y")
df = df.query('nodeid_x != nodeid_y')
# now drop zero intersections rows
print("now drop zero intersections rows")
df = df.query('count_intersections != 0')
# sort
df.sort_values(by=['count_intersections', 'nodeid_x', 'nodeid_y'], ascending=False, inplace = True)
print("len(df)=",len(df))
print("pandas_adjacency matrix flat->")
print(df.head(9))
Networkx、交叉表、慢速 for 循环
答:
1赞
Andrej Kesely
11/13/2023
#1
也许我错过了什么,买你可以用+这里:.groupby
.sum
out = (
df.assign(count_intersections=1)
.groupby(["nodeid_x", "nodeid_y"], as_index=False)["count_intersections"]
.sum()
)
out.sort_values(
by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
)
print(out.head(10))
打印(在一分钟内使用 ,有 46mio 边缘):random.seed(42)
nodeid_x nodeid_y count_intersections
41523794 5587 29315 4
40851660 53761 14855 4
40587824 53478 42683 4
40429793 53307 75 4
36763731 4938 42230 4
30719061 4290 32316 4
6800705 17279 42566 4
45616995 9972 39801 3
45613051 9968 5797 3
45597210 9951 10498 3
评论
1赞
Eurico Covas
11/14/2023
谢谢,这似乎有效,非常好!聪明!
评论