How do I do a search and replace in a very large dataset using python?

Q:

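I have a large dataset (1 million+ rows, several GB) that changes on a regular basis. It models flow characteristics by relating each entry to its upstream neighbor in a flow network. The basic logic of the tool I want is this: take the ID field, look up the relevant upstream device, and write the number stored in a different column (Num2) of the upstream device's entry into the same column of the original row. This lets me determine which "level" of the flow network I am in. Here is a sample of the data:

This is the code I am using: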

import pandas as pd
import numpy as np



# Read the CSV file into a Pandas DataFrame
df = pd.read_csv("L:\\Dev_h\\xxxxx.csv")

#put values into the Number field initially:
df['Num2'] = np.where(df['Num1'] > 0, df['Num1'], df['Num2'])
print(df)

null_num2_rows = df[df["Num2"].isnull()]

# For each row with a null Num2 field, find the row with the same ID and a non-null Num2 field
for row in null_num2_rows.iterrows():
    row_index, row_data = row

    # Get the Upstream device field from the current row
    up_field = row_data["Up"]

    # Find the row with the same ID and a non-null Num2 field
    matched_row = df.loc[(df["DevNo"] == up_field) & ~df["Num2"].isnull()]

    # Set the Num2 field of the current row with the Num2 field of the matched row
    df.at[row_index, "Num2"] = matched_row["Num2"].iloc[0]
print(df)
# Save the DataFrame to an excel
df.to_excel("L:\\ROAD ROUTING\\xxxxx.xlsx")

This approach seems to work fine for very small files; here is a sample of the output.

DevNo  Up  Down  Num1  Num2
F1         S1    1     1
F2     S1  S2    2     2
F3     S4  S6    3     3
F4     S8        4     4
S1     F1  F2          1
S2     F2  S4          2
S3     F2  S5          2
S4     S2  F3          2
S5     S3  T1          2
S6     F3  S6          3
S7     S6  S8          3
S8     S7  F4          3

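However, it scales very badly on large datasets and exhausts my memory. I am very new to python, so I don't really know how to update my logic to cope with a dataset this large. Loading the file in chunks with pandas doesn't work, because the matching value may not be in the same chunk as the row that is searching for it.

How should I update this to handle large datasets better?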

python pandas indexing typeerror bigdata

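Tim Roberts' point is: if you are writing a python program to "do this thing", then you should be able to reach for any tool in your toolbox. Python ships with a "batteries included" SQLite3 library, and SQLite can handle up to 281 TB of data in a single database, is purpose-built for fast retrieval and updates, and stores field data more efficiently. So, in other words, pandas may not be the right solution for your job. That said, pandas can let you read in chunks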
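A minimal sketch of what the SQLite route in this comment could look like, assuming the column names from the question (DevNo, Up, Num1, Num2). The table name devices, the helper names, and the tiny inline sample are made up for illustration; this is not something the commenter posted.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample standing in for the real CSV file
SAMPLE_CSV = """DevNo,Up,Down,Num1,Num2
F1,,S1,1,
F2,S1,S2,2,
S1,F1,F2,,
S2,F2,S4,,
S4,S2,F3,,
"""


def load_csv_into_sqlite(csv_source, conn, chunksize=100_000):
    """Stream the CSV into a 'devices' table so it never sits in memory whole."""
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        chunk.to_sql("devices", conn, if_exists="append", index=False)
    # An index on DevNo keeps the upstream lookups fast
    conn.execute("CREATE INDEX IF NOT EXISTS idx_devno ON devices (DevNo)")


def fill_levels(conn):
    """Seed Num2 from Num1, then copy Num2 from the upstream device until stable."""
    conn.execute("UPDATE devices SET Num2 = Num1 WHERE Num1 > 0")
    while True:
        cur = conn.execute(
            """
            UPDATE devices
            SET Num2 = (SELECT u.Num2 FROM devices AS u
                        WHERE u.DevNo = devices.Up AND u.Num2 IS NOT NULL)
            WHERE Num2 IS NULL
              AND EXISTS (SELECT 1 FROM devices AS u
                          WHERE u.DevNo = devices.Up AND u.Num2 IS NOT NULL)
            """
        )
        if cur.rowcount == 0:  # nothing left that can be resolved
            break
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for real data
load_csv_into_sqlite(io.StringIO(SAMPLE_CSV), conn, chunksize=2)
fill_levels(conn)
levels = dict(conn.execute("SELECT DevNo, Num2 FROM devices"))
```

The UPDATE is repeated until it changes no rows, so a device whose upstream neighbor appears later in the file still gets filled in on a later round. Rows with no resolvable upstream value simply stay NULL.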

1 upvote Anon Coward 8/22/2023

If you have more than a million rows, why are you saving it to an xlsx file? Excel can't handle more than 1048576 rows.

A:

1 upvote Code Different 8/22/2023 #1

Try this:

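import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")
# Seed Num2 from Num1 wherever Num1 is set
df['Num2'] = np.where(df['Num1'] > 0, df['Num1'], df['Num2'])

arr = []
# DevNo -> Num2, updated as missing values get resolved
lookup = df.set_index("DevNo")["Num2"].to_dict()

for devno, up, num2 in zip(df["DevNo"], df["Up"], df["Num2"]):
    if pd.isna(up) or not up:
        # No upstream device (read_csv turns blank cells into NaN): keep the value
        arr.append(num2)
    elif pd.isna(num2):
        # Copy the upstream device's value and remember it for later rows
        lookup[devno] = lookup[up]
        arr.append(lookup[devno])
    else:
        arr.append(num2)

# Write back to the existing column (the name is case-sensitive)
df["Num2"] = arr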

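Why it's fast:

It passes through the dataframe only once.
It uses zip instead of df.iterrows: zip creates a plain tuple for each row, which is much cheaper than the Series that iterrows builds for each row.
Dictionary lookups (lookup[devno], lookup[up]) are several orders of magnitude faster than using df.loc.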
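To get a feel for these points, here is a rough comparison on synthetic data. The frame, its size n, and the two helper functions are assumptions made for illustration; actual timings depend on the machine and the data, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): every device points at D0
n = 1_000
df = pd.DataFrame({
    "DevNo": [f"D{i}" for i in range(n)],
    "Up": [None] + ["D0"] * (n - 1),
    "Num2": [1.0] + [np.nan] * (n - 1),
})


def fill_with_loc(frame):
    """The question's approach: one full-column boolean scan per missing row."""
    out = frame.copy()
    for idx, row in out[out["Num2"].isnull()].iterrows():
        match = out.loc[(out["DevNo"] == row["Up"]) & out["Num2"].notnull()]
        out.at[idx, "Num2"] = match["Num2"].iloc[0]
    return out["Num2"].tolist()


def fill_with_dict(frame):
    """The answer's approach: one pass with zip and a plain dict."""
    lookup = frame.set_index("DevNo")["Num2"].to_dict()
    result = []
    for devno, up, num2 in zip(frame["DevNo"], frame["Up"], frame["Num2"]):
        if pd.notna(up) and pd.isna(num2):
            lookup[devno] = lookup[up]
            num2 = lookup[devno]
        result.append(num2)
    return result


start = time.perf_counter()
slow = fill_with_loc(df)
middle = time.perf_counter()
fast = fill_with_dict(df)
end = time.perf_counter()

print(f"df.loc per row: {middle - start:.3f}s, dict lookup: {end - middle:.3f}s")
```

Both functions produce the same values; only the cost per row differs. The df.loc version scans the whole DevNo column for every missing row, so its cost grows roughly with the square of the row count, while the dict version does a constant-time lookup per row.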

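I don't think your problem is solved by moving to a database. Some operations could be improved, such as reading the CSV file in chunks to avoid a spike in memory usage. But because the loop has side effects, any database solution would also incur extra write costs. With Python, you can keep those writes in memory (in the lookup dictionary).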
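One way to combine chunked reading with the in-memory dictionary is a two-pass sketch like the one below. It is a variation on the loop above, not the same algorithm: it walks the Up chain on demand, so it does not depend on row order. The function names, the inline sample, and the output file name are hypothetical.

```python
import io
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample standing in for the multi-GB CSV
SAMPLE_CSV = """DevNo,Up,Down,Num1,Num2
F1,,S1,1,
F2,S1,S2,2,
S1,F1,F2,,
S2,F2,S4,,
S4,S2,F3,,
"""


def build_lookups(csv_source, chunksize=100_000):
    """Pass 1: read only the needed columns and keep two dicts in memory."""
    level = {}     # DevNo -> known Num2
    upstream = {}  # DevNo -> Up
    cols = ["DevNo", "Up", "Num1", "Num2"]
    for chunk in pd.read_csv(csv_source, usecols=cols, chunksize=chunksize):
        seeded = np.where(chunk["Num1"] > 0, chunk["Num1"], chunk["Num2"])
        for devno, up, num2 in zip(chunk["DevNo"], chunk["Up"], seeded):
            if pd.notna(num2):
                level[devno] = num2
            if pd.notna(up):
                upstream[devno] = up
    return level, upstream


def resolve(devno, level, upstream):
    """Walk upstream until a device with a known level is found."""
    path, seen = [], set()
    while devno not in level and devno in upstream and devno not in seen:
        seen.add(devno)
        path.append(devno)
        devno = upstream[devno]
    value = level.get(devno)  # None if the chain ends or loops without a value
    if value is not None:
        for d in path:        # cache so later lookups are a single dict hit
            level[d] = value
    return value


def write_filled(csv_source, out_path, level, upstream, chunksize=100_000):
    """Pass 2: stream the file again and append each filled chunk to out_path."""
    for i, chunk in enumerate(pd.read_csv(csv_source, chunksize=chunksize)):
        chunk["Num2"] = [resolve(d, level, upstream) for d in chunk["DevNo"]]
        chunk.to_csv(out_path, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)


level, upstream = build_lookups(io.StringIO(SAMPLE_CSV), chunksize=2)
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "filled.csv")  # hypothetical output name
    write_filled(io.StringIO(SAMPLE_CSV), out_path, level, upstream, chunksize=2)
    result = pd.read_csv(out_path)
```

Only the two dictionaries stay in memory for the whole run; each chunk is dropped once it has been written. Writing to CSV rather than xlsx also sidesteps the Excel row limit raised in the comments.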
