PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance.

Asked by: 孟泽楷 Asked: 5/27/2023 Last edited by: 孟泽楷 Updated: 5/27/2023 Views: 95

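Q:

I have a DataFrame in which each row is a pixel and each column is a feature map. It has a column named "pix" that can be split into row and col. I can use that column to add new columns to the DataFrame, because several features have not been joined yet. I import each map and convert it to a matrix.

So I add the columns with code like this (pseudocode):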

for i in tqdm(df_new.index):
    row = int(ds.pix[i][1:-1].split(',')[0])
    col = int(ds.pix[i][1:-1].split(',')[1])
    df_new[i, 'Aridity'] = arr3[row, col]
    df_new[i, 'Trend in PDSI'] = arr4[row, col]
    df_new[i, 'rainfall seasonality'] = arr5[row, col]
    df_new[i, 'Precipitation variability'] = arr6[row, col]
    df_new[i, 'Trend in Precipitation'] = arr7[row, col]
    df_new[i, 'rainfall at coldest quarter'] = arr8[row, col]
    df_new[i, 'rainfall at hottest quarter'] = arr9[row, col]
    df_new[i, 'Temperature seasonality'] = arr10[row, col]
    df_new[i, 'Elevation'] = arr11[row, col]
    df_new[i, 'Depth of water table'] = arr12[row, col]
    df_new[i, 'Depth to bedrock'] = arr13[row, col]

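As the title shows, I get this warning. I have more than 40,000 rows and processing is very slow: tqdm estimates more than 30 minutes.

This is how I solved the problem: I changed my code slightly, as shown below. The warning disappeared and the loop runs about 10 times faster than before. I hope this helps others. If anyone has a more elegant solution, I would appreciate it.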

for i in tqdm(df_new.index):
    row = int(ds.pix[i][1:-1].split(',')[0])
    col = int(ds.pix[i][1:-1].split(',')[1])
    df_new['Aridity'][i] = arr3[row, col]
    df_new['Trend in PDSI'][i] = arr4[row, col]
    df_new['rainfall seasonality'][i] = arr5[row, col]
    df_new['Precipitation variability'][i] = arr6[row, col]
    df_new['Trend in Precipitation'][i] = arr7[row, col]
    df_new['rainfall at coldest quarter'][i] = arr8[row, col]
    df_new['rainfall at hottest quarter'][i] = arr9[row, col]
    df_new['Temperature seasonality'][i] = arr10[row, col]
    df_new['Elevation'][i] = arr11[row, col]
    df_new['Depth of water table'][i] = arr12[row, col]
    df_new['Depth to bedrock'][i] = arr13[row, col]
python pandas dataframe indexing warnings


A:

0 votes Saxtheowl 5/27/2023 #1

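The second approach is better because it avoids adding columns one at a time, but accessing pandas DataFrame cells one by one with indexing such as df_new['Aridity'][i] is still slow.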
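To illustrate why the first snippet fragments the frame, here is a minimal sketch on a made-up two-row frame (the data and the names `wrong` and `right` are assumptions, not the asker's data). On a frame with ordinary, non-MultiIndex columns, `df[i, 'Aridity'] = value` is treated as a column assignment keyed by the tuple `(i, 'Aridity')`, so every loop iteration inserts a brand-new column. A single-cell write by label is spelled `df.at[i, 'Aridity'] = value`.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame.
wrong = pd.DataFrame({"pix": ["(0, 1)", "(1, 0)"]})
right = wrong.copy()

# Pattern from the first snippet: the key (i, 'Aridity') is a column label,
# so each assignment creates a new column and fills every row with the value.
wrong[0, "Aridity"] = 1.5
wrong[1, "Aridity"] = 2.5
print(list(wrong.columns))

# Label-based scalar write: create the column once, then set one cell at a time.
right["Aridity"] = float("nan")
right.at[0, "Aridity"] = 1.5
right.at[1, "Aridity"] = 2.5
print(list(right.columns))
```

`.at` avoids the column explosion, but it is still one Python-level write per cell, so it is better seen as a correctness fix than as a fast path. The chained form `df_new['Aridity'][i] = value` is also discouraged by the pandas documentation, since it may end up writing to a temporary copy.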

For good performance, compute all the new columns first and then assign each of them to the DataFrame in one go.

For example:

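from tqdm import tqdm

# Collect the values for each new column in plain Python lists first.
aridity_values = []
trend_in_pdsi_values = []

for pix in tqdm(df_new['pix']):
    # Drop the first and last character of the pix string, then parse row and col.
    row, col = map(int, pix[1:-1].split(','))
    aridity_values.append(arr3[row, col])
    trend_in_pdsi_values.append(arr4[row, col])

# Assign each finished list as a whole column, one assignment per column.
df_new['Aridity'] = aridity_values
df_new['Trend in PDSI'] = trend_in_pdsi_values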
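If the Python loop itself becomes the bottleneck, the same idea can be pushed further. The following is only a sketch under assumptions: the small `arr3`/`arr4` rasters and the three-row `df_new` are made-up stand-ins, and the `pix` strings are assumed to look like `"(row, col)"`. It parses all pixel coordinates at once, uses NumPy integer-array indexing for the lookups, and attaches every new column with a single `pd.concat`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's rasters and "pix" column.
arr3 = np.arange(12, dtype=float).reshape(3, 4)
arr4 = arr3 * 10
df_new = pd.DataFrame({"pix": ["(0, 0)", "(1, 2)", "(2, 3)"]})

# Pull the two integers out of every "(row, col)" string in one pass.
rc = df_new["pix"].str.extract(r"(\d+)\s*,\s*(\d+)").astype(int)
rows = rc[0].to_numpy()
cols = rc[1].to_numpy()

# Integer-array indexing looks up all pixels with one call per raster.
new_cols = pd.DataFrame(
    {"Aridity": arr3[rows, cols], "Trend in PDSI": arr4[rows, cols]},
    index=df_new.index,
)

# One concat adds every new column in a single step.
df_new = pd.concat([df_new, new_cols], axis=1)
print(df_new)
```

The remaining rasters from the question (`arr5` to `arr13`) would simply become further entries in the dictionary passed to `pd.DataFrame`.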

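Comments

1 vote 孟泽楷 7/18/2023
Thanks for the explanation. The performance after applying your code really shocked me: the tqdm rate went from 300 it/s to 120000 it/s. What used to take 3 minutes now takes less than 1 second. QAQ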