Asked by: eatkimchi Asked: 9/10/2022 Last edited by: eatkimchi Updated: 9/10/2022 Views: 40
Speeding up a Python datetime comparison to generate values
Q:
I'd like to do this in a pythonic way, without 1) nested if statements or 2) iterrows.
I have these columns:
Date In | Date Out | 1/22 | 2/22 | ... | 12/22
1/1/19  | 5/5/22   |
5/5/22  | 7/7/22   |
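For a column like "1/22", I want to insert a calculated value, which will be one of the following:
- Not Yet Created
- Closed
- Open
For the first row, column 1/22 would read "Open" because the item was open in January 2022. This continues until column 5/22, which would be marked "Closed".
For the second row, columns 1/22 up to 5/22 would read "Not Yet Created", then "Open" from 5/22 until 7/22, which would read "Closed".
I don't necessarily need the full table, but I'd like to count how many items are Closed / Open / Not Yet Created in each month.
Here is the code I'm using. It works, but it takes longer than I'd like: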
table = {}
for i in mcLogsClose.iterrows():
    table[i[0]] = {}
    for month in pd.date_range(start='9/2021', end='9/2022', freq='M'):
        if i[1]['Notif Date'] <= month:
            if i[1]['Completion Date'] <= month:
                table[i[0]][month] = "Closed"
            else:
                table[i[0]][month] = "Open"
        else:
            table[i[0]][month] = "Not Yet Created"
Then I'd like to run table['1/22'].value_counts()
Thanks for looking!
A:
0 votes
Code Different
9/10/2022
#1
1. Using a loop
import pandas as pd

# The date range you are calculating for
min_date = pd.Period("2022-01")
max_date = pd.Period("2022-12")
span = (max_date - min_date).n + 1  # number of months, inclusive

# Strip the "Date In" and "Date Out" columns down to the month
date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

data = []
for d_in, d_out in zip(date_in, date_out):
    if d_in > max_date:
        # If date in is after max date, the whole span is under "Not Created" status
        data.append((span, 0, 0))
    elif d_out < min_date:
        # If date out is before min date, the whole span is under "Closed" status
        data.append((0, span, 0))
    else:
        # Now that we have some overlap between (d_in, d_out) and (min_date,
        # max_date), we need to calculate time spent in each status
        closed = (max_date - min(d_out, max_date)).n
        not_created = (max(d_in, min_date) - min_date).n
        open_ = span - closed - not_created
        data.append((not_created, closed, open_))

cols = ["Not Created Yet", "Closed", "Open"]
df[cols] = pd.DataFrame(data, columns=cols, index=df.index)
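The loop above leans on one detail of pandas: subtracting two monthly Period objects returns an offset, and its .n attribute is the number of months between them. A minimal sketch of that arithmetic on its own, using a made-up "Date In" month of May 2022 purely for illustration:

```python
import pandas as pd

min_date = pd.Period("2022-01")
max_date = pd.Period("2022-12")

# Period - Period returns an offset object; .n is the month difference
span = (max_date - min_date).n + 1  # +1 because both end months are included

# Illustrative value only: an item created in May 2022
d_in = pd.Period("2022-05", freq="M")

# Clamp the creation month into the window, then count the months before it
not_created = (max(d_in, min_date) - min_date).n
```

Here span is 12 and not_created is 4 (January through April), which matches the second row of the result table in this answer.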
2. Using numpy
import numpy as np
import pandas as pd

def to_n(arr: np.ndarray) -> np.ndarray:
    """Convert an array of Period differences (offsets) to an array of integers"""
    return np.array([i.n for i in arr])

# The date range you are calculating for. Since we intend to use vectorized
# code, we need to turn them into numpy arrays
min_date = np.repeat(pd.Period("2022-01"), len(df))
max_date = np.repeat(pd.Period("2022-12"), len(df))
span = to_n(max_date - min_date) + 1

date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

df["Not Created Yet"] = np.where(
    date_in > max_date,
    span,
    to_n(np.max([date_in, min_date], axis=0) - min_date),
)
df["Closed"] = np.where(
    date_out < min_date,
    span,
    to_n(max_date - np.min([date_out, max_date], axis=0)),
)
df["Open"] = span - df["Not Created Yet"] - df["Closed"]
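As a possible variation on the numpy approach (not part of the answer's code), the Period objects can be replaced by plain integer month numbers, year * 12 + month, so that the clamping becomes a single np.clip call. The sketch below assumes the dates are strings in m/d/yy form, as in the question, and builds its own sample frame. The helper name month_index is made up for this example.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the result table in this answer
df = pd.DataFrame({
    "Date In": ["1/1/19", "5/5/22", "1/1/20", "1/1/23", "6/6/21"],
    "Date Out": ["5/5/22", "7/7/22", "12/12/20", "6/6/23", "6/6/23"],
})

def month_index(s: pd.Series) -> np.ndarray:
    """Hypothetical helper: map each date to an integer month number."""
    d = pd.to_datetime(s, format="%m/%d/%y")  # assumed input format
    return (d.dt.year * 12 + d.dt.month).to_numpy()

lo = 2022 * 12 + 1    # January 2022
hi = 2022 * 12 + 12   # December 2022
span = hi - lo + 1    # 12 months, inclusive

m_in = month_index(df["Date In"])
m_out = month_index(df["Date Out"])

# Months before creation and months after closing, clamped to the window
not_created = np.clip(m_in - lo, 0, span)
closed = np.clip(hi - m_out, 0, span)
open_ = span - not_created - closed
```

On these five rows the three arrays reproduce the columns of the result table. Like the code above, this treats the month in which an item is closed as still "Open".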
Result (I added some rows for my test):
   Date In  Date Out  Not Created Yet  Closed  Open
0   1/1/19    5/5/22                0       7     5
1   5/5/22    7/7/22                4       5     3
2   1/1/20  12/12/20                0      12     0
3   1/1/23    6/6/23               12       0     0
4   6/6/21    6/6/23                0       0    12
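The table above counts months per row. The question also asks how many items are in each status per month. One way to get that, sketched here under the same assumptions (m/d/yy strings, the sample rows above, and a made-up helper name month_index), is to compare every row against every month with numpy broadcasting and sum down the rows:

```python
import numpy as np
import pandas as pd

# Sample rows taken from the result table above
df = pd.DataFrame({
    "Date In": ["1/1/19", "5/5/22", "1/1/20", "1/1/23", "6/6/21"],
    "Date Out": ["5/5/22", "7/7/22", "12/12/20", "6/6/23", "6/6/23"],
})

def month_index(s: pd.Series) -> np.ndarray:
    """Hypothetical helper: map each date to an integer month number."""
    d = pd.to_datetime(s, format="%m/%d/%y")  # assumed input format
    return (d.dt.year * 12 + d.dt.month).to_numpy()

months = pd.period_range("2022-01", "2022-12", freq="M")
m = (months.year * 12 + months.month).to_numpy()   # shape (12,)
m_in = month_index(df["Date In"])[:, None]         # shape (rows, 1)
m_out = month_index(df["Date Out"])[:, None]       # shape (rows, 1)

# Broadcasting yields boolean arrays of shape (rows, 12)
not_created = m < m_in
closed = m > m_out   # closing month still counts as "Open", as in the answer
open_ = ~not_created & ~closed

# Sum over rows to get one count per month and status
counts = pd.DataFrame(
    {
        "Not Yet Created": not_created.sum(axis=0),
        "Closed": closed.sum(axis=0),
        "Open": open_.sum(axis=0),
    },
    index=[str(p) for p in months],
)
```

counts.loc["2022-01"] then plays the role of the value_counts() call in the question. If the closing month should read "Closed" instead, as in the question's own loop, changing m > m_out to m >= m_out should do it.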