Spline interpolation of Pandas DataFrame time-series produces inconsistent results

Asked by: ps_tw | Asked: 11/8/2023 | Last edited by: ps_tw | Updated: 11/12/2023 | Views: 39

Q:

I have (a lot of) time-series data, indexed on discrete time tenors, which I am reindexing and interpolating to make them consistent.

For example:

# index of (time) series
idx_2 = [1, 2, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 220, 260, 300, 400, 488, 500, 600, 700, 800, 900, 902, 1000, 1100, 1200, 1298, 1400, 1600, 1800, 2400, 2494, 3000, 3600, 3604, 3748, 3904, 4200, 4501, 4509, 4800, 5196, 5400, 6000, 6600, 7200]

# original time series 1 values
pwr_vals_1 = [885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43, 471.3, 431.3, 409.96, 371.5, 357.06, 351.78, 338.55, 332.96, 340.08, 337.64, 337.05, 329.23, 325.28, 322.76, 314, None, 300.87, 272.07, 248.55, 240.83, 239.42, None, 235.26, 235.51, 235.69, None, 231.85, 233.87, 222.88, 212.81, None, 207.3, 186.51, None, None, None, 177.3, None, None, 164.3, None, 155.88, 150.5, 147.35, 143.58]

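The resulting DataFrames / Series show some very strange results: NaN values in the early rows are filled with values an order of magnitude larger than the surrounding values, some of them negative.

If I remove the final value in the series (or index), this behaviour disappears.

(For context, the data are all positive and, for the most part, monotonically decreasing. They represent the maximum average power output over a period of time for a power-producing exercise activity, limited to a number of tenors.)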
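The properties described above (all positive, mostly decreasing) can be checked with a short sketch like the one below. The helper name `describe_shape` is hypothetical, and `sample` is just a handful of values copied from pwr_vals_1, including one missing entry.

```python
import pandas as pd

def describe_shape(values):
    """Report positivity and how often a series fails to decrease."""
    s = pd.Series(values, dtype="float64").dropna()
    diffs = s.diff().dropna()
    return {
        "all_positive": bool((s > 0).all()),
        "n_increases": int((diffs > 0).sum()),  # steps where the value goes up
        "n_steps": int(len(diffs)),
    }

# A few values taken from pwr_vals_1 (illustrative subset, not the full series)
sample = [885, 874.5, 855, 770, 739.4, 338.55, 332.96, 340.08, 337.64, None, 300.87]
result = describe_shape(sample)
print(result)
```

A small number of increases relative to the number of steps is what "mostly monotonically decreasing" means here.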

[Figure: Plot of pwr_vals]

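Here is the code used to test the interpolation.

A spline was chosen because it most closely represents the between-index values that the time series approximates.

Smoothing with s=1 was chosen because s=0 gives results that completely ignore any monotonicity and also interpolates negative values (albeit of the same order of magnitude).

(As I understand it, s=0 makes the spline pass through every data point, which does not make much sense in the context of the data.)

Any value of s >= 1 results in the same behaviour.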
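To make the role of s concrete, here is a minimal sketch that fits `scipy.interpolate.UnivariateSpline` directly. As far as I understand, pandas passes `order` and `s` through to this class when `method="spline"`, but treat that as an assumption. The arrays `x` and `y` are only the first eight points of pwr_vals_1, so this does not reproduce the full problem.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# First eight (time, power) pairs from pwr_vals_1; names are illustrative.
x = np.array([1, 2, 3, 5, 10, 20, 30, 40], dtype=float)
y = np.array([885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43])

# s=0: interpolating spline, forced through every data point.
exact = UnivariateSpline(x, y, k=3, s=0)
# s=1: smoothing spline; knots are added until the sum of squared
# residuals at the data points is at most about s.
smooth = UnivariateSpline(x, y, k=3, s=1)

# Sum of squared residuals at the data points for each fit.
resid_exact = float(np.sum((exact(x) - y) ** 2))
resid_smooth = float(smooth.get_residual())

print(f"s=0 residual: {resid_exact:.3g}")
print(f"s=1 residual: {resid_smooth:.3g}")
# Compare the two fits between data points, e.g. at t=4.
print(float(exact(4.0)), float(smooth(4.0)))
```

Note that s is an absolute bound on the sum of squared residuals, so with values in the hundreds s=1 is still a tight tolerance.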

The results I would expect / hope for are those in all columns except 'pwr_vals_1'.

# testing sampling of integer index df
import pandas as pd


# copy of pwr_vals_1, with final value as None
pwr_vals_2 = [885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43, 471.3, 431.3, 409.96, 371.5, 357.06, 351.78, 338.55, 332.96, 340.08, 337.64, 337.05, 329.23, 325.28, 322.76, 314, None, 300.87, 272.07, 248.55, 240.83, 239.42, None, 235.26, 235.51, 235.69, None, 231.85, 233.87, 222.88, 212.81, None, 207.3, 186.51, None, None, None, 177.3, None, None, 164.3, None, 155.88, 150.5, 147.35, None]

# copy of pwr_vals_1, multiplied by 0.99
pwr_vals_3 = [876, 865.8, 846, 762, 732, 705.13, 598.7, 489.49, 466.6, 427, 405.86, 367.8, 353.49, 348.26, 335.16, 329.63, 336.68, 334.26, 333.68, 325.94, 322.03, 319.53, 311, None, 297.86, 269.35, 246.06, 238.42, 237.03, None, 232.91, 233.15, 233.33, None, 229.53, 231.53, 220.65, 210.68, None, 205.2, 184.64, None, None, None, 175.5, None, None, 162.7, None, 154.32, 149, 145.88, 142.14]

# another random time series, with NaNs in the same places
pwr_vals_4 = [692, 687.5, 665.67, 662.4, 635.7, 480.55, 427.73, 395.33, 374.68, 342.2, 309.31, 296.58, 285.33, 289.67, 291, 292.97, 273.84, 258.34, 260.43, 259.6, 240.56, 238.13, 211.79, None, 197.83, 184.06, 174.16, 177.16, 179.86, None, 177.42, 176.47, 168.61, None, 164.4, 166.1, 164.42, 152.55, None, 134.96, 124.62, None, None, None, 123.16, None, None, 117.89, None, 109.78, 105.97, 103.73, 101.96]

# zip the index and the four value lists into rows
zip_vals = zip(idx_2, pwr_vals_1, pwr_vals_2, pwr_vals_3, pwr_vals_4)

# make a dataframe with the lists
raw_df = pd.DataFrame(zip_vals, columns=['time', 'pwr_vals_1', 'pwr_vals_2', 'pwr_vals_3', 'pwr_vals_4']).set_index('time')

# set the new target index
interp_tenors = list(range(1, max(idx_2) + 1))

reind_df = raw_df.reindex(interp_tenors)
# interpolate the dataframe, using cubic spline
intrp_df = reind_df.interpolate(method="spline", order=3, s=1, limit_area='inside')

# output
print(f"{intrp_df[:10]},\n{intrp_df[-10:].to_string(header=False)}")


time  pwr_vals_1  pwr_vals_2  pwr_vals_3  pwr_vals_4
1          885.0       885.0       876.0       692.0
2          874.5       874.5       865.8       687.5
3          855.0       855.0       846.0       665.6
4        -5985.8       789.1     -5916.8       659.3
5          770.0       770.0       762.0       662.4
6         5912.0       761.1      5845.4       663.4
7         7323.4       753.5      7240.8       661.0
8         6160.7       747.5      6091.4       655.3
9         3580.5       742.8      3540.7       646.7
10         739.4       739.4       732.0       635.7
...          ...         ...         ...         ...
7198       143.5                   142.1       101.9
7199       143.5                   142.1       101.9
7200       143.5                   142.1       101.9

It seems to be related to these two sets of data, but I am completely unable to work out what or why.
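For comparison, one thing that could be tried is a shape-preserving interpolant. The sketch below uses pandas' `method="pchip"` (which needs SciPy installed) on a few points copied from pwr_vals_1. It is offered only as an illustration under the assumption that staying between neighbouring data values is what is wanted, not as an explanation of the spline behaviour above.

```python
import pandas as pd

# A few (time, power) pairs from pwr_vals_1, used only for illustration.
known = pd.Series(
    [885, 874.5, 855, 770, 739.4, 712.25],
    index=[1, 2, 3, 5, 10, 20],
    dtype="float64",
)

# Reindex onto every integer time step, leaving NaN where no value is known.
dense = known.reindex(range(1, 21))

# PCHIP is a piecewise cubic that does not overshoot the data it passes through.
filled = dense.interpolate(method="pchip", limit_area="inside")

print(filled.loc[[3, 4, 5, 6, 10]])
```

Unlike a smoothing spline, PCHIP passes through every known point, so it would not smooth out noise in the measurements.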

Tags: pandas dataframe scikit-learn interpolation spline

A: No answers yet