Asked by: Dmayall  Asked: 11/17/2023  Last edited by: Dmayall  Updated: 11/17/2023  Views: 46
KeyError on indexing a dataframe on an index that should exist
Q:
This is basically just me trying to normalize my data so that my model performs better. It isn't working yet, but I'm trying to improve performance. It's a time series going into an LSTM, so it has to be sorted; that part works fine, as does the first code block below, which splits my data into a train/test split based on year.
test = {}
tracks = incident_data["geoid10"].unique().tolist()
train_x, train_y = creatsequence(
    incident_data[incident_data["geoid10"] == tracks[0]], 4
)
tracks.pop(0)
for track in tracks:
    test_data = incident_data[incident_data["geoid10"] == track]
    if test_data[test_data["Year"] < 2016].shape[0] > 4:
        trainx, trainy = creatsequence(test_data[test_data["Year"] < 2016], 4)
        train_x = np.concatenate((train_x, trainx))
        train_y = np.concatenate((train_y, trainy))
    if test_data[test_data["Year"] >= 2015].shape[0] > 4:
        test_x, test_y = creatsequence(test_data[test_data["Year"] >= 2015], 4)
        test[track] = {"X": test_x, "y": test_y}
def creatsequence(data, length):
    x = []
    y = []
    for column in ["All Other Thefts_y", "Simple Assault_y", "Theft From Motor Vehicle_y"]:
        data[column] = normalizeSeries(data[column], 4)
    for i in range(len(data) - length):
        x.append(data.drop(columns=['geoid10',
                                    'Year', 'Quarter'])[i:i+length])
        y.append(np.array(data[["All Other Thefts_y", "Simple Assault_y", "Theft From Motor Vehicle_y"]])[i+length])
    return (np.array(x), np.array(y))

def normalizeSeries(x, priorDays):
    new_x = []
    start = len(x) - 1
    while start > 0:
        print(x)
        if start - priorDays > 4:
            subarray = x[start - 2 * priorDays : start - priorDays]
        else:
            subarray = x[0:4]
        max_value = max([max(subarray), 0.001])
        for i in range(4):
            value = x[start]
            new_x.append(value / max_value)
            start -= 1
    return new_x
Error:
KeyError: 11
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_12/1047915744.py in <cell line: 11>()
12 test_data = incident_data[incident_data["geoid10"] == track]
13 if test_data[test_data["Year"] < 2016].shape[0] > 4:
---> 14 trainx, trainy = creatsequence(test_data[test_data["Year"] < 2016], 4)
15 train_x = np.concatenate((train_x, trainx))
16 train_y = np.concatenate((train_y, trainy))
/tmp/ipykernel_12/2018859376.py in creatsequence(data, length)
8 y = []
9 for column in ["All Other Thefts_y", "Simple Assault_y", "Theft From Motor Vehicle_y"]:
---> 10 data[column] = normalizeSeries(data[column], 4)
11 for i in range(len(data) - length):
12 x.append(data.drop(columns=['geoid10',
/tmp/ipykernel_12/2642226622.py in normalizeSeries(x, priorDays)
17 max_value = max([max(subarray),0.001])
18 for i in range(4):
---> 19 value = x[start]
20 new_x.append(value / max_value)
21 start -= 1
~/.cache/pypoetry/virtualenvs/python-kernel-OtKFaj5M-py3.9/lib/python3.9/site-packages/pandas/core/series.py in __getitem__(self, key)
979
980 elif key_is_scalar:
--> 981 return self._get_value(key)
982
983 if is_hashable(key):
~/.cache/pypoetry/virtualenvs/python-kernel-OtKFaj5M-py3.9/lib/python3.9/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
1087
1088 # Similar to Index.get_value, but we do not fall back to positional
-> 1089 loc = self.index.get_loc(label)
1090 return self.index._get_values_for_loc(self, loc, label)
1091
~/.cache/pypoetry/virtualenvs/python-kernel-OtKFaj5M-py3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
I'm trying to get my data normalized, sorted, and put into a train/test split. It works on the first track, but as soon as it reaches the second track it throws this indexing error. I checked that track; it has data and really is of that length.
A:
You are indexing into the series with index 11, in this line:
---> 19 value = x[start]
The only way I can explain that is that this index goes past the length of the pd.Series.
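One pandas behavior worth knowing here (not shown in the original post): a Series produced by filtering a DataFrame keeps its original index labels, so looking it up with an integer that is really a *position* can raise exactly this kind of KeyError even though the Series is long enough. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical stand-in for incident_data: only rows with label >= 10 survive the filter
df = pd.DataFrame({"geoid10": [1] * 15, "count": range(15)})
subset = df[df.index >= 10]["count"]   # labels are 10..14, positions are 0..4

try:
    subset[3]          # label-based lookup: label 3 no longer exists
except KeyError as err:
    print("KeyError:", err)

print(subset.iloc[3])  # positional lookup works: prints 13
```

So a `KeyError: 11` can simply mean "no row is labeled 11 in this filtered Series", which is worth ruling out with `.iloc` or `reset_index(drop=True)`.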
The fact that this works on the first iteration but not on the second tells me you are overwriting/changing some data in a way that matters. Indeed, some variables live outside the loop, for example:
train_x, train_y = creatsequence(
    incident_data[incident_data["geoid10"] == tracks[0]], 4
)
and you then change those variables inside the loop:
train_x = np.concatenate((train_x, trainx))
train_y = np.concatenate((train_y, trainy))
Note that when you initialize these variables outside the loop, incident_data may not be copied; you may be getting a slice back instead. So overwriting train_x and train_y may be changing incident_data, which could then break this:
test_data = incident_data[incident_data["geoid10"] == track]
if test_data[test_data["Year"] < 2016].shape[0] > 4:
    trainx, trainy = creatsequence(test_data[test_data["Year"] < 2016], 4)
This is of course just based on the program logic as I can read it; without a runnable example it's hard to debug. But I believe your problem lies somewhere in these places.
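The copy-versus-view concern above can be made concrete. A hedged sketch with toy data (exact behavior of the warning depends on your pandas version; under copy-on-write semantics the original is never modified):

```python
import pandas as pd

# Hypothetical stand-in for incident_data
df = pd.DataFrame({"geoid10": [1, 1, 2, 2], "val": [1.0, 2.0, 3.0, 4.0]})

# df[mask] returns a new object derived from df; assigning into it
# (as creatsequence does with data[column] = ...) is the classic
# SettingWithCopy trap, and whether df is affected is ambiguous.
# An explicit copy removes the ambiguity:
safe = df[df["geoid10"] == 1].copy()
safe["val"] = safe["val"] / safe["val"].max()

print(df["val"].tolist())    # original untouched: [1.0, 2.0, 3.0, 4.0]
print(safe["val"].tolist())  # normalized copy: [0.5, 1.0]
```

Adding `.copy()` before any in-place column assignment is a cheap way to rule this whole class of bug out.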
As for your normalizeSeries function, the trouble starts when you loop and slice off more slices of bread (I mean, data points) than you have. By the time you finish one pass of the inner loop, start has already stepped back four, and then the next pass begins... bang, you may have walked right off the edge of your data.

Here's the thing: your loop needs to be very clear about where it is stepping. You want to move back exactly priorDays steps at a time, no more, no less. So let's tweak the function:
def normalizeSeries(x, priorDays):
    new_x = []
    start = len(x) - 1
    while start >= priorDays:  # Make sure we have enough steps to dance
        if start - 2 * priorDays >= 0:
            subarray = x[start - 2 * priorDays : start - priorDays]
        else:
            subarray = x[:priorDays]
        max_value = max(max(subarray), 0.001)  # Don't want to divide by zero
        for i in range(priorDays):  # Normalize the window
            if start - i >= 0:  # Watch your step!
                value = x[start - i]
                new_x.append(value / max_value)
        start -= priorDays  # Slide to the left! But just `priorDays` steps.
    return list(reversed(new_x))  # Flip it to keep the original order
Pop this function into your code and give it a spin. It should keep your indices tight and correct. Let me know how it goes! If you're still tripping up, I'm here to help.
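To sanity-check it, here is the revised function run on a toy list (made-up numbers). With eight points and priorDays=4, the most recent window is normalized against the maximum of the earliest four values:

```python
def normalizeSeries(x, priorDays):
    new_x = []
    start = len(x) - 1
    while start >= priorDays:          # stop once no prior window remains
        if start - 2 * priorDays >= 0:
            subarray = x[start - 2 * priorDays : start - priorDays]
        else:
            subarray = x[:priorDays]
        max_value = max(max(subarray), 0.001)  # guard against dividing by zero
        for i in range(priorDays):
            if start - i >= 0:
                value = x[start - i]
                new_x.append(value / max_value)
        start -= priorDays             # step back exactly one window
    return list(reversed(new_x))       # restore chronological order

series = [1, 2, 3, 4, 5, 6, 7, 8]      # made-up values
print(normalizeSeries(series, 4))      # [1.25, 1.5, 1.75, 2.0]
```

Two caveats: it returns len(x) - priorDays values, since the earliest window has nothing before it to normalize against; and this sketch indexes a plain list positionally — if you pass a pandas Series with a non-default index, `x[start - i]` is still label-based, so you would want `x.iloc[start - i]` or `x.reset_index(drop=True)` first.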