MemoryError from melt or concat with large data

Question:

I get an error when I try to run pd.melt().
I checked this post (link) and tried modifying my code, but I still get the error.

This is my original code:

melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')

After modification:

pivot_list = list()
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value')
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
    melted = pd.concat(pivot_list).sort_values('ID')
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
    new_data = concatenate_managers(
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
    values = _concatenate_join_units(join_units, concat_axis, copy=copy)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    to_concat = [
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
    ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
    values = algos.take_nd(values, indexer, axis=ax)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
    return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
    out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
    result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object

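I think the main problem comes from the melt() and concat() parts.
Any ideas on how to deal with this would be appreciated.

python pandas numpy out-of-memory data-manipulation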

Answer:

1 vote Michael Delgado 6/11/2022 #1

Generally, a "MemoryError: Unable to allocate" error falls into the "user error" category: you requested a reshape operation whose result is too large to fit in memory.

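pd.melt is a very memory-intensive operation. It not only has to create new arrays for every element in your data, it also reshapes the data into a less efficient format that duplicates many of the existing values. The result, and the memory penalty, depend on the structure of your data and the number of value columns.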

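Read the pandas documentation on reshaping by melt carefully, and work out whether you can afford to create arrays holding every element of the id_vars columns, repeated once for each column in value_vars.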
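To make the repetition concrete, here is a minimal sketch on a tiny, made-up frame. The column names A, B, C and the data are hypothetical; only the pd.melt call mirrors the one in the question.

```python
import pandas as pd

# Tiny hypothetical frame: 3 rows, 2 id columns, 3 value columns
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Year": [2020, 2021, 2022],
    "A": [0.1, 0.2, 0.3],
    "B": [1.1, 1.2, 1.3],
    "C": [2.1, 2.2, 2.3],
})

melted = pd.melt(df, id_vars=["ID", "Year"], var_name="New_Var", value_name="Value")

# One output row per (input row, value column) pair: 3 rows * 3 value columns
print(melted.shape)           # (9, 4)

# Each id value is repeated once per value column
print(melted["ID"].tolist())  # [1, 2, 3, 1, 2, 3, 1, 2, 3]
```

With three value columns every identifier is stored three times. With hundreds of value columns the same multiplier applies to every id column.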

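For example, if your DataFrame has 1M rows and 1000 columns and every cell is float32, it takes up roughly 4 GB of memory. If you then melt it with 4 id_vars, you get 4*1M id cells, each repeated 996 times, so 4*1e6*996, or about 4 billion, identifier cells. On top of that you get a "variable" column with 1e6*996 entries and a "value" column of the same length. The exact figure depends on the length and dtype of your column names and the dtype of your cells, but this simple example produces roughly 23 GB of output (six columns of about 1e9 cells each, at 4 bytes per cell), even though all the values are relatively compact float32s.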
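The arithmetic above can be wrapped in a rough back-of-the-envelope helper. This is only a sketch: estimate_melt_bytes is a hypothetical name, and it assumes every output cell costs the same fixed number of bytes, which understates the cost of string labels stored as Python objects.

```python
def estimate_melt_bytes(n_rows, n_cols, n_id_vars, bytes_per_cell=4):
    """Rough size of a melted frame, assuming a fixed cost per cell."""
    n_value_vars = n_cols - n_id_vars
    out_rows = n_rows * n_value_vars   # one row per (row, value column) pair
    out_cols = n_id_vars + 2           # id columns + "variable" + "value"
    return out_rows * out_cols * bytes_per_cell

# The example from the answer: 1M rows, 1000 columns, 4 id_vars, 4-byte cells
est = estimate_melt_bytes(1_000_000, 1000, 4)
print(f"{est / 1e9:.1f} GB")  # 23.9 GB

# The failed allocation in the traceback: shape (2, 1819900000), dtype object.
# On a 64-bit build, an object array stores one 8-byte pointer per element.
traceback_bytes = 2 * 1_819_900_000 * 8
print(f"{traceback_bytes / 2**30:.1f} GiB")  # 27.1 GiB
```

The second figure matches the 27.1 GiB in the error message, and it only counts the pointers of two object columns, not the objects they point to.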

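Melt is a convenience utility for reshaping small DataFrames. If your DataFrame is anywhere near the size in this example, I'd mostly recommend not doing this at all. If you really do need to reshape this way, you need to take the operation seriously and chunk the data in a way that suits its size. You may want to write the data out iteratively rather than trying to concatenate everything at the end. This won't work out of the box, so expect some trial and error. You could also consider an out-of-core computation tool: dask.dataframe has a port of melt that can use multiple cores and write to disk in parallel.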
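One way to write the data out iteratively is sketched below, using pandas only. The helper name melt_to_csv, the CSV output format, and the sample data are assumptions made for illustration. The chunking loop is the one from the question, with the final pd.concat replaced by an append to disk.

```python
import os
import tempfile

import pandas as pd


def melt_to_csv(df, id_vars, out_path, chunk_size=100_000,
                var_name="New_Var", value_name="Value"):
    """Melt df in row chunks, appending each melted chunk to out_path."""
    total_rows = 0
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        melted = pd.melt(chunk, id_vars=id_vars,
                         var_name=var_name, value_name=value_name)
        first = start == 0
        # Overwrite on the first chunk, append afterwards; header written once
        melted.to_csv(out_path, mode="w" if first else "a",
                      header=first, index=False)
        total_rows += len(melted)
    return total_rows


# Small hypothetical frame: 5 rows, 2 id columns, 3 value columns
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Year": [2020] * 5,
    "A": range(5),
    "B": range(5, 10),
    "C": range(10, 15),
})

out_path = os.path.join(tempfile.mkdtemp(), "melted.csv")
n_written = melt_to_csv(df, ["ID", "Year"], out_path, chunk_size=2)
print(n_written)  # 15
```

Only one melted chunk is held in memory at a time. The output is not sorted by ID, so the sort_values('ID') step from the question would still have to be handled separately, for example by processing the file in pieces later.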
