Asked by: ps_tw Asked: 11/8/2023 Last edited by: ps_tw Updated: 11/12/2023 Views: 39
Spline interpolation of Pandas DataFrame time-series produces inconsistent results
Q:
I have (a lot of) time-series data, indexed at discrete time tenors, which I am reindexing and interpolating to make them consistent.
For example:
# index of (time) series
idx_2 = [1, 2, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 140, 160, 180, 220, 260, 300, 400, 488, 500, 600, 700, 800, 900, 902, 1000, 1100, 1200, 1298, 1400, 1600, 1800, 2400, 2494, 3000, 3600, 3604, 3748, 3904, 4200, 4501, 4509, 4800, 5196, 5400, 6000, 6600, 7200]
# original time series 1 values
pwr_vals_1 = [885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43, 471.3, 431.3, 409.96, 371.5, 357.06, 351.78, 338.55, 332.96, 340.08, 337.64, 337.05, 329.23, 325.28, 322.76, 314, None, 300.87, 272.07, 248.55, 240.83, 239.42, None, 235.26, 235.51, 235.69, None, 231.85, 233.87, 222.88, 212.81, None, 207.3, 186.51, None, None, None, 177.3, None, None, 164.3, None, 155.88, 150.5, 147.35, 143.58]
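The resulting DataFrames / Series throw up some very strange results: NaN values in the early rows are filled with values an order of magnitude larger than the surrounding values, some of them negative.
If I remove the final value in the series (or index), this behaviour disappears.
(For context, the data are all positive and, for the most part, monotonically decreasing. They represent the maximum average power output over a period of time for a power-producing exercise activity, restricted to a number of tenors.)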
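The two properties stated above (all positive, mostly decreasing) can be checked directly before interpolating. The following is only an illustrative sketch: it uses the first 17 points of pwr_vals_1, and the names `series`, `all_positive` and `increases` are made up for this example.

```python
import pandas as pd

# First 17 index/value pairs of pwr_vals_1 from the question (excerpt only).
idx = [1, 2, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 140]
vals = [885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43, 471.3, 431.3,
        409.96, 371.5, 357.06, 351.78, 338.55, 332.96, 340.08]

# Drop missing observations so diff() compares consecutive observed points.
series = pd.Series(vals, index=idx, dtype="float64").dropna()

# True if every observed value is strictly positive.
all_positive = bool((series > 0).all())

# Index labels where the value rises relative to the previous observed point.
increases = series[series.diff() > 0].index.tolist()

print(all_positive, increases)
```

In this excerpt the only rise is at time 140 (332.96 to 340.08), which is the kind of exception "for the most part" refers to.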
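This is the code used to test the interpolation.
A spline was chosen because it most closely represents the values between index points for the time-series approximation.
Smoothing of s=1 was chosen because s=0 leads to results that completely ignore any monotonicity and also interpolates negative values (albeit of the same order of magnitude).
(As I understand it, s=0 makes the spline pass through all the data points, which doesn't make much sense in the context of the data.)
Any value of s >= 1 results in the same behaviour.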
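The effect of `s` can be looked at in isolation from pandas. As far as I know, `interpolate(method="spline", ...)` wraps `scipy.interpolate.UnivariateSpline` and forwards extra keyword arguments such as `s` to it, so the sketch below calls SciPy directly. It assumes SciPy is installed, uses only the first ten points of pwr_vals_1, and the variable names are made up for this example.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# First ten observed points of pwr_vals_1 (excerpt only).
x = np.array([1, 2, 3, 5, 10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43, 471.3, 431.3])

# s=0 requests an interpolating cubic spline: it passes through every point.
exact = UnivariateSpline(x, y, k=3, s=0)

# s=1 requests a smoothing spline; SciPy documents s as an upper bound on
# the sum of squared residuals that the knot selection tries to satisfy.
smooth = UnivariateSpline(x, y, k=3, s=1)

# Sum of squared residuals at the observed points for each fit.
resid_exact = float(np.sum((exact(x) - y) ** 2))
resid_smooth = float(np.sum((smooth(x) - y) ** 2))

print(resid_exact, resid_smooth)
```

Since the values are in the hundreds, a residual budget of 1 is very tight, so an s=1 fit may end up close to the interpolating one. Comparing the two residuals is one way to see how much smoothing is actually taking place.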
The outcome I would expect / hope for is what appears in every column except 'pwr_vals_1'.
# testing sampling of integer index df
import pandas as pd
# copy of pwr_vals_1, with final value as None
pwr_vals_2 = [885, 874.5, 855, 770, 739.4, 712.25, 604.7, 494.43, 471.3, 431.3, 409.96, 371.5, 357.06, 351.78, 338.55, 332.96, 340.08, 337.64, 337.05, 329.23, 325.28, 322.76, 314, None, 300.87, 272.07, 248.55, 240.83, 239.42, None, 235.26, 235.51, 235.69, None, 231.85, 233.87, 222.88, 212.81, None, 207.3, 186.51, None, None, None, 177.3, None, None, 164.3, None, 155.88, 150.5, 147.35, None]
# copy of pwr_vals_1, multiplied by 0.99
pwr_vals_3 = [876, 865.8, 846, 762, 732, 705.13, 598.7, 489.49, 466.6, 427, 405.86, 367.8, 353.49, 348.26, 335.16, 329.63, 336.68, 334.26, 333.68, 325.94, 322.03, 319.53, 311, None, 297.86, 269.35, 246.06, 238.42, 237.03, None, 232.91, 233.15, 233.33, None, 229.53, 231.53, 220.65, 210.68, None, 205.2, 184.64, None, None, None, 175.5, None, None, 162.7, None, 154.32, 149, 145.88, 142.14]
# another random time series, with NaNs in the same places
pwr_vals_4 = [692, 687.5, 665.67, 662.4, 635.7, 480.55, 427.73, 395.33, 374.68, 342.2, 309.31, 296.58, 285.33, 289.67, 291, 292.97, 273.84, 258.34, 260.43, 259.6, 240.56, 238.13, 211.79, None, 197.83, 184.06, 174.16, 177.16, 179.86, None, 177.42, 176.47, 168.61, None, 164.4, 166.1, 164.42, 152.55, None, 134.96, 124.62, None, None, None, 123.16, None, None, 117.89, None, 109.78, 105.97, 103.73, 101.96]
# zip the index and the four value lists into rows
zip_vals = zip(idx_2, pwr_vals_1, pwr_vals_2, pwr_vals_3, pwr_vals_4)
# make a dataframe with the lists
raw_df = pd.DataFrame(zip_vals, columns=['time', 'pwr_vals_1', 'pwr_vals_2', 'pwr_vals_3', 'pwr_vals_4']).set_index('time')
# set the new target index
interp_tenors = list(range(1, max(idx_2) + 1))
reind_df = raw_df.reindex(interp_tenors)
# interpolate the dataframe, using cubic spline
intrp_df = reind_df.interpolate(method="spline", order=3, s=1, limit_area='inside')
# output
print(f"{intrp_df[:10]},\n{intrp_df[-10:].to_string(header=False)}")
time | pwr_vals_1 | pwr_vals_2 | pwr_vals_3 | pwr_vals_4 |
---|---|---|---|---|
1 | 885.0 | 885.0 | 876.0 | 692.0 |
2 | 874.5 | 874.5 | 865.8 | 687.5 |
3 | 855.0 | 855.0 | 846.0 | 665.6 |
4 | -5985.8 | 789.1 | -5916.8 | 659.3 |
5 | 770.0 | 770.0 | 762.0 | 662.4 |
6 | 5912.0 | 761.1 | 5845.4 | 663.4 |
7 | 7323.4 | 753.5 | 7240.8 | 661.0 |
8 | 6160.7 | 747.5 | 6091.4 | 655.3 |
9 | 3580.5 | 742.8 | 3540.7 | 646.7 |
10 | 739.4 | 739.4 | 732.0 | 635.7 |
... | ... | ... | ... | ... |
7198 | 143.5 | NaN | 142.1 | 101.9 |
7199 | 143.5 | NaN | 142.1 | 101.9 |
7200 | 143.5 | NaN | 142.1 | 101.9 |
This seems to be related to these two data series, but I am completely unable to determine what or why.
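For comparison only, and not as an explanation of the spline behaviour, the same reindex-and-fill step can be run with a shape-preserving interpolant. The sketch below swaps in `method="pchip"` (SciPy's PCHIP interpolator, available through pandas), which passes through every observed point and does not overshoot on monotone data. It assumes SciPy is installed and uses only the first six points of pwr_vals_1; `raw`, `reind` and `filled` are names made up for this example.

```python
import pandas as pd

# First six observed points of pwr_vals_1 (excerpt only).
idx = [1, 2, 3, 5, 10, 20]
vals = [885, 874.5, 855, 770, 739.4, 712.25]

raw = pd.Series(vals, index=idx, dtype="float64")

# Reindex onto every integer time from 1 to the last observed time.
reind = raw.reindex(range(1, max(idx) + 1))

# PCHIP swapped in for the spline: interpolates between observed points only.
filled = reind.interpolate(method="pchip", limit_area="inside")

print(filled.head(10))
```

If the filled values in rows 4 and 6 to 9 stay between their neighbours here, that would suggest the large excursions come from the smoothing-spline fit rather than from the reindexing step.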
A: No answers yet