提问人:Mostafa Bouzari 提问时间:8/10/2023 最后编辑:Mostafa Bouzari 更新时间:8/11/2023 访问量:47
如何删除(删除)数据帧中的连续值
How to drop (delete) consecutive values in a Dataframe
问:
我有一个数据帧,其中有一列的值为 0。我希望找到这些 0 值并检查它们是否直到最后为 0,只删除末尾而不是中间的值。
这是数据的样子:secondary_df
DSCD date year month RI RIu RIu1 RIe
203 1316 1/29/2010 2010.0 1.0 66.39 66.30 6.21 6.39 \
275 1316 1/29/2016 2016.0 1.0 66.97 166.84 6.89 6.32
131 1316 1/30/2004 2004.0 1.0 66.01 66.15 6.36 6.60
191 1316 1/30/2009 2009.0 1.0 66.36 6.54 685.25 6.71
263 1316 1/30/2015 2015.0 1.0 66.43 6.94 114.14 6.33
.. ... ... ... ... ... ... ... ...
250 1316 12/31/2013 2013.0 12.0 99.98 5.24 59.91 5.07
262 1316 12/31/2014 2014.0 12.0 99.33 54.14 54.64 55.96
274 1316 12/31/2015 2015.0 12.0 55.32 5.89 15.19 54.34
310 1316 12/31/2018 2018.0 12.0 55.56 55.23 5.40 5.49
322 1316 12/31/2019 2019.0 12.0 55.39 55.98 5.69 5.88
RIu Pct Return RIe_Pct_Return
203 -0.05 0.0255 \
275 -0.0358 -0.059
131 0.058 0.05106
191 0.0055 0.0535
263 -0.035 0.053
.. ... ...
250 0.01092 -0.05
262 -0.001 0.02572
274 -0.003 -0.0512
310 -0.000 -0.05274
322 0.004 0.039
这就是我到目前为止所得到的。
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD']==DSCD]
t=secondary_df[(secondary_df['RIe Pct Return'].values == 0)].index.values.tolist()
t.sort()
if len(t)>=1:
print(np.diff((t)))
例如,这部分是 t:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658]
这是我从代码中获取的索引,当我使用该方法时,我得到以下值,并且我希望删除(删除)的值被加粗:np.diff()
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 80 22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
所以我有 2 个问题。
- 如何删除粗体的?
- 第一个 for 循环包含 8000 个 DSCD,无论如何它都可以更有效率吗?
另一个例子: 列表 T:
[74536, 74537, 74538, 74540, 74542, 74543, 74544, 74545, 74547, 74551, 74554, 74555, 74559, 74560, 74561, 74562, 74563, 74566, 74567, 74568, 74569, 74571, 74572, 74573, 74574, 74575, 74578, 74579, 74580, 74582, 74584, 74585, 74586, 74587, 74588, 74589, 74590, 74591, 74592, 74595, 74596, 74597, 74598, 74599, 74601, 74602, 74603, 74604, 74605, 74606, 74607, 74608, 74609, 74610, 74612, 74613, 74614, 74615, 74616, 74617, 74618, 74619, 74620, 74621, 74622, 74623, 74624, 74625, 74626, 74627, 74628, 74629, 74630, 74631, 74632, 74633, 74634, 74635, 74636, 74637, 74638, 74639, 74640, 74641, 74642, 74643, 74644, 74645, 74646, 74647, 74648, 74649, 74650, 74651, 74652, 74653, 74654, 74655, 74656, 74657, 74658, 74659, 74660, 74661, 74662, 74663, 74664, 74665, 74666, 74667, 74668, 74669, 74670, 74671, 74672, 74673, 74674, 74675, 74676, 74677, 74678, 74679, 74680, 74681, 74682, 74683, 74684, 74685, 74686, 74687, 74688, 74689, 74690, 74691, 74692, 74693, 74694, 74695, 74696, 74697, 74698, 74699, 74700, 74701, 74702, 74703, 74704, 74705, 74706, 74707, 74708, 74709, 74710, 74711, 74712, 74713, 74714, 74715, 74716, 74717, 74718, 74719, 74720, 74721, 74722, 74723, 74724, 74725, 74726, 74727, 74728, 74729, 74730, 74731, 74732, 74733, 74734, 74735, 74736, 74737, 74738, 74739, 74740, 74741, 74742, 74743, 74744, 74745, 74746, 74747, 74748, 74749, 74750, 74751, 74752, 74753, 74754, 74755, 74756, 74757, 74758, 74759, 74760, 74761, 74762, 74763, 74764, 74765, 74766, 74767, 74768, 74769, 74770, 74771, 74772, 74773, 74774, 74775, 74776, 74777, 74778, 74779, 74780, 74781, 74782, 74783, 74784, 74785, 74786, 74787, 74788, 74789, 74790, 74791, 74792, 74793, 74794, 74795, 74796, 74797, 74798, 74799, 74800, 74801, 74802, 74803, 74804, 74805, 74806, 74807, 74808, 74809, 74810, 74811, 74812, 74813, 74814, 74815, 74816, 74817, 74818, 74819, 74820, 74821, 74822, 74823, 74824, 74825, 74826, 74827, 74828, 74829, 74830, 74831, 74832, 74833, 74834, 74835, 74836, 74837, 74838, 74839, 74840, 74841, 74842]
结果np.diff():
[1 1 2 2 1 1 1 2 4 3 1 4 1 1 1 1 3 1 1 1 2 1 1 1 1 3 1 1 2 2 1 1 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
难一:
[6359, 6431, 6287, 6347, 6419, 6263, 6275, 6299, 6311, 6323, 6335, 6371, 6383, 6395, 6407, 6443, 6455, 6467, 6360, 6288, 6348, 6420, 6252, 6264, 6276, 6300, 6312, 6324, 6372, 6396, 6408, 6444, 6456, 6468, 6336, 6384, 6432, 6265, 6397, 6469, 6253, 6325, 6385, 6457, 6277, 6289, 6301, 6313, 6337, 6349, 6361, 6373, 6409, 6421, 6433, 6445, 6314, 6446, 6302, 6374, 6434, 6254, 6266, 6278, 6290, 6326, 6338, 6350, 6362, 6386, 6398, 6410, 6422, 6458, 6470, 6423, 6279, 6339, 6411, 6255, 6267, 6291, 6303, 6315, 6327, 6363, 6375, 6387, 6399, 6435, 6447, 6459, 6471, 6268, 6400, 6472, 6256, 6328, 6388, 6460, 6280, 6292, 6304, 6316, 6340, 6351, 6364, 6376, 6412, 6424, 6436, 6448, 6305, 6377, 6437, 6293, 6365, 6257, 6269, 6281, 6317, 6329, 6341, 6389, 6401, 6413, 6425, 6449, 6461, 6473, 6282, 6342, 6414, 6270, 6402, 6474, 6258, 6294, 6306, 6318, 6330, 6353, 6366, 6378, 6390, 6426, 6438, 6450, 6462, 6259, 6331, 6391, 6463, 6319, 6451, 6271, 6283, 6295, 6307, 6343, 6355, 6367, 6379, 6403, 6415, 6427, 6439, 6475, 6296, 6368, 6356, 6428, 6260, 6272, 6284, 6308, 6320, 6332, 6344, 6380, 6392, 6404, 6416, 6440, 6452, 6464, 6476, 6285, 6345, 6417, 6273, 6405, 6477, 6261, 6297, 6309, 6321, 6333, 6357, 6369, 6381, 6393, 6429, 6441, 6453, 6465, 6322, 6454, 6310, 6382, 6442, 6262, 6274, 6286, 6298, 6334, 6346, 6358, 6370, 6394, 6406, 6418, 6430, 6466, 6478]
对上一个列表进行排序后,您将获得:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
按照同样的逻辑,需要删除粗体
答:
我认为这就是你要找的(完整示例)
熊猫版本:
s = pd.Series([69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572, 69573, 69574, 69575, 69576, 69577, 69578, 69579, 69580, 69581, 69582, 69583, 69584, 69585, 69586, 69587, 69588, 69589, 69590, 69591, 69592, 69593, 69594, 69595, 69596, 69597, 69598, 69599, 69600, 69601, 69602, 69603, 69604, 69605, 69606, 69607, 69608, 69609, 69610, 69611, 69612, 69613, 69614, 69615, 69616, 69617, 69618, 69619, 69620, 69621, 69622, 69623, 69624, 69625, 69626, 69627, 69628, 69629, 69630, 69631, 69632, 69633, 69634, 69635, 69636, 69637, 69638, 69639, 69640, 69641, 69642, 69643, 69644, 69645, 69646, 69647, 69648, 69649, 69650, 69651, 69652, 69653, 69654, 69655, 69656, 69657, 69658])
s[:s.index[s.diff().ne(1)][-1]+1]
returns:
[69438, 69439, 69440, 69441, 69442, 69443, 69444, 69445, 69446, 69447, 69448, 69449, 69450, 69451, 69452, 69453, 69454, 69455, 69456, 69457, 69458, 69459, 69460, 69461, 69462, 69463, 69464, 69465, 69466, 69468, 69548, 69570, 69571, 69572]
逻辑解释。 首先,差异到达了你所在的位置。
ne(1) - 只获取 diff 不等于 1 的位置,然后我们返回这些点的索引。
[-1] 获取最后一个索引,添加 2,然后对原始数组进行切片。
Numpy 版本:
# stay in pandas series/df and numpy arrays as much as possible avoid lists
for DSCD in FirmReturnIndexValues['DSCD'].unique():
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values
t.sort()
if len(t)>=1:
# get last index where diff is not 1
ind = np.where(np.not_equal(np.diff(t), 1))[0][-1]
# add 2 to index to remove the bolded numbers per your example
result = t[:ind+2]
回答继续:
一般来说,这里有一些简单的加速,你已经展示了代码。但目前尚不清楚您究竟拥有什么作为输入 df 和愿望作为结果。如果您能提供,我们可以对其进行基准测试。
# use direct numpy method calls (eq,lt,gt,ge,le,ne) instead of ==, <, >, etc..
secondary_df=FirmReturnIndexValues[FirmReturnIndexValues['DSCD'].eq(DSCD)]
t=secondary_df[secondary_df['RIe Pct Return'].eq(0)].index.values # don't convert to list if you do not need to
t.sort() # use numpy array sort
if len(t)>=1:
print(np.diff((t)))
这些都是速赢。重构可能会带来结构性加速。但是您需要提供一个完整的工作示例和预期的输出。
如果我正确理解了您的问题,则此代码将取代编写for循环并删除不需要的数据的需要。
# generate a dummy dataframe
df = pd.DataFrame({"col1": np.random.randint(-5,5,20)})
# index of the last 0 value
idx_end = df[df.col1 == 0]['col1'].index[-1]
# get the index of the 0 value that starts the last continuous sequence of 0 values
idx_start = df.iloc[:idx_end][df.iloc[:idx_end].col1 != 0]['col1'].index[-1] + 1
# drop data
df.drop(range(idx_start, idx_end+1), inplace=True)
评论