Asked by: user29839 Asked: 3/17/2023 Updated: 3/18/2023 Views: 2188
PerformanceWarning: DataFrame is highly fragmented when creating new DataFrame columns
Q:
I am trying to set new DataFrame columns as simple calculations on existing DataFrames, but when I run the script I get warnings from pandas. This is the main code:
data_join['Ele_total'] = data_ele.sum(axis=1)
data_join['PV_total'] = data_pv.sum(axis=1)
data_join['SC'] = np.where(data_join['PV_total']>data_join['Ele_total'], data_join['Ele_total'], data_join['PV_total'])
data_join['SC%'] = np.where(data_join['PV_total']!= 0,round((data_join['SC']/data_join['PV_total'])*100,0),0)
data_join['SS%'] = np.where(data_join['Ele_total']!= 0,round((data_join['SC']/data_join['Ele_total'])*100,0),0)
data_join['LOLP'] = data_join['Ele_total']>data_join['PV_total']
data_join['E_tg'] = data_join['PV_total']-data_join['SC']
data_join['E_fg'] = data_join['Ele_total']-data_join['SC']
data_join['Ei'] = data_join['E_tg']-data_join['E_fg']
data_join['NGIP'] = data_join['Ei'].abs()<(GRID_LIM*n_build)
data_join['PAL'] = data_join['Ei'].abs()>(PEAK_LIM*n_build)
data_join['CO2'] = data_CO2['GWP']
data_join['CO2_net'] = data_CO2['GWP']*data_join['SC']
data_join['CO2_tot'] = data_CO2['GWP']*(data_join['E_tg']+data_join['SC'])
cash_flow = 0
npv = []
data_join_npv = pd.DataFrame()
for i in range (0,25):
    if i == 0:
        data_join_npv['PV_total_res_{}'.format(i)] = data_join_res['PV_total']
        data_join_npv['PV_total_ind_{}'.format(i)] = data_join_ind['PV_total']
    else:
        data_join_npv['PV_total_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i-1)]*(1-d)
        data_join_npv['PV_total_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i-1)]*(1-d)
    data_join_npv['SC_res_{}'.format(i)] = np.where(data_join_npv['PV_total_res_{}'.format(i)]>data_join_res['Ele_total'], data_join_res['Ele_total'], data_join_npv['PV_total_res_{}'.format(i)])
    data_join_npv['SC_ind_{}'.format(i)] = np.where(data_join_npv['PV_total_ind_{}'.format(i)]>data_join_ind['Ele_total'], data_join_ind['Ele_total'], data_join_npv['PV_total_ind_{}'.format(i)])
    data_join_npv['E_tg_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i)]-data_join_npv['SC_res_{}'.format(i)]
    data_join_npv['E_tg_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i)]-data_join_npv['SC_ind_{}'.format(i)]
    data_join_npv['E_fg_res_{}'.format(i)] = data_join_res['Ele_total']-data_join_npv['SC_res_{}'.format(i)]
    data_join_npv['E_fg_ind_{}'.format(i)] = data_join_ind['Ele_total']-data_join_npv['SC_ind_{}'.format(i)]
    cash = (
        float(data_join_npv['SC_res_{}'.format(i)].sum())*COST_OF_ENERGY_RES
        + float(data_join_npv['E_tg_res_{}'.format(i)].sum())*VALUE_OF_ENERGY
        - float(data_join_npv['E_fg_res_{}'.format(i)].sum())*COST_OF_ENERGY_RES
        + float(data_join_npv['SC_ind_{}'.format(i)].sum())*COST_OF_ENERGY_IND
        + float(data_join_npv['E_tg_ind_{}'.format(i)].sum())*VALUE_OF_ENERGY
        - float(data_join_npv['E_fg_ind_{}'.format(i)].sum())*COST_OF_ENERGY_IND
        - OM_COST*total_pv
    )
    cash_flow += cash/((1+DISC_RATE)**(i+1))
    npv.append(-in_inv+cash_flow)
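These are the warnings I receive:
C:\Users\Giacomo\Desktop\150\insert_data.py:342: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_tg_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i)]-data_join_npv['SC_res_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:343: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_tg_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i)]-data_join_npv['SC_ind_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:344: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_fg_res_{}'.format(i)] = data_join_res['Ele_total']-data_join_npv['SC_res_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:345: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_fg_ind_{}'.format(i)] = data_join_ind['Ele_total']-data_join_npv['SC_ind_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:337: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['PV_total_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i-1)]*(1-d)
C:\Users\Giacomo\Desktop\150\insert_data.py:338: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['PV_total_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i-1)]*(1-d)
C:\Users\Giacomo\Desktop\150\insert_data.py:340: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['SC_res_{}'.format(i)] = np.where(data_join_npv['PV_total_res_{}'.format(i)]>data_join_res['Ele_total'], data_join_res['Ele_total'], data_join_npv['PV_total_res_{}'.format(i)])
C:\Users\Giacomo\Desktop\150\insert_data.py:341: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['SC_ind_{}'.format(i)] = np.where(data_join_npv['PV_total_ind_{}'.format(i)]>data_join_ind['Ele_total'], data_join_ind['Ele_total'], data_join_npv['PV_total_ind_{}'.format(i)])
C:\Users\Giacomo\Desktop\150\insert_data.py:342: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`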
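I did not call frame.insert() as the warning suggests, so I don't understand why I am getting warnings about fragmentation. I get the correct results, but since I have to run this code many times inside an optimizer, I think the large number of warnings is stopping the optimizer at some point during the analysis, and I would like to resolve them.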
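A:
You get these repeated warnings because you insert columns into the data_join_npv dataframe one at a time inside the for loop, instead of concatenating them all at once after (and outside) the loop, which is much more memory-efficient. Assigning a new column with data_join_npv[name] = ... performs the same kind of insertion as frame.insert, which is why the warning appears even though you never call frame.insert() yourself.
For example, run the following toy code: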
import pandas as pd
df = pd.DataFrame({f"col{i}": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] for i in range(1_000)})
new_df = pd.DataFrame()
for i in range(1_000):  # insert one thousand columns
    new_df[f"new_df_col{i}"] = df[f"col{i}"] + i
print(new_df)
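You will get the following output:
PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  new_df[f"new_df_col{i}"] = df[f"col{i}"]+i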
   new_df_col0  new_df_col1  new_df_col2  ...  new_df_col997  new_df_col998  new_df_col999
0            0            1            2  ...            997            998            999
1            1            2            3  ...            998            999           1000
2            2            3            4  ...            999           1000           1001
3            3            4            5  ...           1000           1001           1002
4            4            5            6  ...           1001           1002           1003
5            5            6            7  ...           1002           1003           1004
6            6            7            8  ...           1003           1004           1005
7            7            8            9  ...           1004           1005           1006
8            8            9           10  ...           1005           1006           1007
9            9           10           11  ...           1006           1007           1008

[10 rows x 1000 columns]
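The warning can also be checked programmatically instead of by reading the console. The following is a minimal sketch, assuming a pandas version that emits this PerformanceWarning (the exact count may differ between versions). The helper name count_fragmentation_warnings and the 200-column size are illustrative choices, not part of the toy code above.

```python
import warnings

import pandas as pd

df = pd.DataFrame({f"col{i}": range(10) for i in range(200)})


def count_fragmentation_warnings(build):
    """Run build() and count the PerformanceWarnings it emits (hypothetical helper)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, not only the first
        build()
    return sum(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)


def build_by_insertion():
    out = pd.DataFrame()
    for i in range(200):
        out[f"new_col{i}"] = df[f"col{i}"] + i  # one column insertion per iteration
    return out


def build_by_concat():
    pieces = {f"new_col{i}": df[f"col{i}"] + i for i in range(200)}
    return pd.concat(pieces, axis=1)  # all columns joined in a single call


n_insertion = count_fragmentation_warnings(build_by_insertion)
n_concat = count_fragmentation_warnings(build_by_concat)
print(n_insertion, n_concat)
```

Recording the warnings this way makes it easy to confirm that a refactor actually removed them before running the code inside a long optimization.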
Instead, initialize an empty dictionary rather than a dataframe, and use pandas concat:
import pandas as pd
df = pd.DataFrame({f"col{i}": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] for i in range(1_000)})
data = {}
for i in range(1_000):
    data[f"new_col{i}"] = df[f"col{i}"] + i
new_df = pd.concat(data.values(), axis=1, ignore_index=True)
new_df.columns = data.keys()  # since Python 3.7, order of insertion is preserved
print(new_df)
You will get the same dataframe without any warning:
   new_col0  new_col1  new_col2  new_col3  ...  new_col996  new_col997  new_col998  new_col999
0         0         1         2         3  ...         996         997         998         999
1         1         2         3         4  ...         997         998         999        1000
2         2         3         4         5  ...         998         999        1000        1001
3         3         4         5         6  ...         999        1000        1001        1002
4         4         5         6         7  ...        1000        1001        1002        1003
5         5         6         7         8  ...        1001        1002        1003        1004
6         6         7         8         9  ...        1002        1003        1004        1005
7         7         8         9        10  ...        1003        1004        1005        1006
8         8         9        10        11  ...        1004        1005        1006        1007
9         9        10        11        12  ...        1005        1006        1007        1008

[10 rows x 1000 columns]
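As a possible shortcut, pd.concat also accepts the dictionary itself, in which case the keys are used as the column labels and the separate columns assignment is not needed. This is a sketch of that variant on the same toy data, not the code from the answer above.

```python
import pandas as pd

df = pd.DataFrame({f"col{i}": range(10) for i in range(1_000)})

# Build every new column first; nothing is inserted into a DataFrame yet.
data = {f"new_col{i}": df[f"col{i}"] + i for i in range(1_000)}

# Passing the mapping directly: each key labels the column built from its Series.
new_df = pd.concat(data, axis=1)
print(new_df.shape)
```

Either form joins all columns in one call, which is what the warning recommends.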
So, try refactoring your code like this:
cash_flow = 0
npv = []
data_join_npv = {}  # instead of pd.DataFrame()
for i in range (0,25):  # code unchanged
    ...
df = pd.concat(data_join_npv.values(), axis=1, ignore_index=True)
df.columns = data_join_npv.keys()
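Applied to the loop from the question, the same idea might look like the sketch below. It covers only the residential (_res) columns, and the input values and the degradation rate d are made-up stand-ins, since the real data is not shown. The dictionary name columns is also an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the asker's inputs (values are invented).
data_join_res = pd.DataFrame(
    {"PV_total": [5.0, 0.0, 3.0], "Ele_total": [2.0, 1.0, 4.0]}
)
d = 0.01  # assumed yearly degradation rate

columns = {}  # collect the columns here instead of inserting into a DataFrame
pv = data_join_res["PV_total"]
ele = data_join_res["Ele_total"]
for i in range(25):
    if i > 0:
        pv = pv * (1 - d)  # degrade the previous year's production
    # np.where returns a bare ndarray, so wrap it to keep the original index.
    sc = pd.Series(np.where(pv > ele, ele, pv), index=data_join_res.index)
    columns[f"PV_total_res_{i}"] = pv
    columns[f"SC_res_{i}"] = sc
    columns[f"E_tg_res_{i}"] = pv - sc
    columns[f"E_fg_res_{i}"] = ele - sc

# One concat after the loop replaces the 100 separate column insertions.
data_join_npv = pd.concat(columns, axis=1)
print(data_join_npv.shape)
```

The yearly sums needed for the cash flow can be taken from the Series inside the loop, so the final DataFrame is only built if it is needed afterwards.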
Comments
df = pd.concat({k: pd.Series(v) for k, v in data_join_npv.items()}, axis=1, ignore_index=False)
df.columns = data_join_npv.keys()
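The pd.Series wrapping in this comment is presumably needed because np.where returns a plain NumPy array, which pd.concat does not accept. One thing to watch, shown in the sketch below with an invented date index and invented values: pd.Series(v) on a bare array gets a fresh 0..n-1 index, so passing the original index explicitly keeps the rows aligned with the other columns.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs with a non-default index (dates and values are invented).
idx = pd.date_range("2023-01-01", periods=3, freq="D")
pv = pd.Series([5.0, 0.0, 3.0], index=idx)
ele = pd.Series([2.0, 1.0, 4.0], index=idx)

sc_values = np.where(pv > ele, ele, pv)  # plain ndarray: the index is lost here
print(type(sc_values).__name__)

columns = {
    "PV_total": pv,
    "SC": pd.Series(sc_values, index=pv.index),  # re-attach the original index
}
df = pd.concat(columns, axis=1)
print(df.shape)
```

If the inputs already use the default integer index, the plain pd.Series(v) from the comment lines up without the extra argument.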