Speeding up a Python datetime comparison to generate values

Asked by: eatkimchi  Asked: 9/10/2022  Last edited by: eatkimchi  Updated: 9/10/2022  Views: 40

Q:

I would like to do this in a Pythonic way, without 1) nested if statements or 2) iterrows.

I have these columns:

Date in | Date Out | 1/22 | 2/22 | ... | 12/22
1/1/19    5/5/22
5/5/22    7/7/22

For a column like "1/22", I want to insert a computed value, which will be one of the following:

  1. Not Yet Created
  2. Open
  3. Closed

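For the first row, column 1/22 would read "Open", because the item was open in January 2022. That continues until column 5/22, which would be marked "Closed".

For the second row, the columns from 1/22 up to 5/22 would read "Not Yet Created", then "Open" from 5/22 until 7/22, where the value becomes "Closed".

I don't necessarily need the full table, but I do want to count how many items are Closed / Open / Not Yet Created in each month.

Here is the code I'm using. It works, but it takes longer than I expected: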

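import pandas as pd

# mcLogsClose is the asker's DataFrame; 'Notif Date' and 'Completion Date'
# are assumed to already be datetime columns
table = {}
for idx, row in mcLogsClose.iterrows():
    table[idx] = {}
    # freq='M' yields month-end timestamps
    for month in pd.date_range(start='9/2021', end='9/2022', freq='M'):
        if row['Notif Date'] <= month:
            if row['Completion Date'] <= month:
                table[idx][month] = "Closed"
            else:
                table[idx][month] = "Open"
        else:
            table[idx][month] = "Not Yet Created"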

Then I would like to run table['1/22'].value_counts()
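Because the loop above builds a dict keyed by row label, with the month-end timestamps as inner keys, a per-month count needs that dict turned into a frame first. A minimal sketch, assuming a small made-up `table` of the same shape (the sample values and the names `wide`, `jan_counts`, and `all_counts` are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical stand-in for the nested dict built by the loop:
# outer keys are row labels, inner keys are month-end timestamps
jan, feb = pd.Timestamp("2022-01-31"), pd.Timestamp("2022-02-28")
table = {
    0: {jan: "Open", feb: "Open"},
    1: {jan: "Not Yet Created", feb: "Open"},
}

# One row per original row, one column per month
wide = pd.DataFrame.from_dict(table, orient="index")

# Status counts for a single month
jan_counts = wide[jan].value_counts()

# Status counts for every month at once; statuses absent in a month become 0
all_counts = wide.apply(pd.Series.value_counts).fillna(0).astype(int)
print(all_counts)
```

Note that the columns here are timestamps, so a lookup such as '1/22' would only work after relabelling the columns.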

Thanks for looking!

python pandas datetime comparison analysis

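Comments

0 upvotes Code Different 9/10/2022
(1) Your data does not match your explanation of what you want. (2) If you only want to count the number of months in each status for each row, there are simpler options than creating columns.
0 upvotes eatkimchi 9/10/2022
"there are simpler options than creating columns" I'm listening.

A:

0 upvotes Code Different 9/10/2022 #1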

1. Using a loop

import pandas as pd

# The date range you are calculating for
min_date = pd.Period("2022-01")
max_date = pd.Period("2022-12")
span = (max_date - min_date).n + 1

# Strip the "Date In" and "Date Out" columns down to the month
date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

data = []
for d_in, d_out in zip(date_in, date_out):
    if d_in > max_date:
        # If date in is after max date, the whole span is under "Not Created" status
        data.append((span, 0, 0))
    elif d_out < min_date:
        # If date out is before min date, the whole span is under "Closed" status
        data.append((0, span, 0))
    else:
        # Now that we have some overlap between (d_in, d_out) and (min_date,
        # max_date), we need to calculate time spent in each status
        closed = (max_date - min(d_out, max_date)).n
        not_created = (max(d_in, min_date) - min_date).n
        open_ = span - closed - not_created
        data.append((not_created, closed, open_))

cols = ["Not Created Yet", "Closed", "Open"]
df[cols] = pd.DataFrame(data, columns=cols, index=df.index)
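The loop relies on two properties of pd.Period: subtracting two monthly periods yields an offset whose .n attribute is the number of months between them, and periods can be compared, so min() and max() clamp a date into the reporting window. A short sketch of just that arithmetic (the sample period `d_out` is made up for illustration):

```python
import pandas as pd

min_date = pd.Period("2022-01")
max_date = pd.Period("2022-12")

# Period - Period gives an offset; .n is the whole-month difference
span = (max_date - min_date).n + 1

# Hypothetical closing month that falls after the window
d_out = pd.Period("2023-06", freq="M")

# min() clamps d_out to the end of the window, so no month counts as closed
closed = (max_date - min(d_out, max_date)).n

print(span, closed)
```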

2. Using numpy

import numpy as np
import pandas as pd

def to_n(arr: np.ndarray) -> np.ndarray:
    """Convert an array of month offsets (pd.Period differences) to integers"""
    return np.array([i.n for i in arr])

# The date range you are calculating for. Since we intend to use vectorized
# code, we need to turn them into numpy arrays
min_date = np.repeat(pd.Period("2022-01"), len(df))
max_date = np.repeat(pd.Period("2022-12"), len(df))
span = to_n(max_date - min_date) + 1

date_in = pd.to_datetime(df["Date In"]).dt.to_period("M")
date_out = pd.to_datetime(df["Date Out"]).dt.to_period("M")

df["Not Created Yet"] = np.where(
    date_in > max_date,
    span,
    to_n(np.max([date_in, min_date], axis=0) - min_date),
)
df["Closed"] = np.where(
    date_out < min_date,
    span,
    to_n(max_date - np.min([date_out, max_date], axis=0)),
)
df["Open"] = span - df["Not Created Yet"] - df["Closed"]
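The same three counts can also be sketched with plain integer month numbers and np.clip, which avoids object arrays of pd.Period altogether. This is an alternative formulation, not the answer's own code; the helper `month_index` and the explicit date format are assumptions, and the five rows are the ones used in this answer's test:

```python
import numpy as np
import pandas as pd

# The five test rows used in this answer
df = pd.DataFrame({
    "Date In": ["1/1/19", "5/5/22", "1/1/20", "1/1/23", "6/6/21"],
    "Date Out": ["5/5/22", "7/7/22", "12/12/20", "6/6/23", "6/6/23"],
})

def month_index(dates: pd.Series) -> np.ndarray:
    """Map dates to a running month number (year * 12 + month)."""
    parsed = pd.to_datetime(dates, format="%m/%d/%y")
    return (parsed.dt.year * 12 + parsed.dt.month).to_numpy()

# Reporting window: 2022-01 through 2022-12
first, last = 2022 * 12 + 1, 2022 * 12 + 12
span = last - first + 1

m_in = month_index(df["Date In"])
m_out = month_index(df["Date Out"])

# Clipping to [0, span] handles dates that fall entirely outside the window
df["Not Created Yet"] = np.clip(m_in - first, 0, span)
df["Closed"] = np.clip(last - m_out, 0, span)
df["Open"] = span - df["Not Created Yet"] - df["Closed"]
print(df)
```

This assumes Date In is never later than Date Out within a row.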

Result (with a few rows added for my test):

  Date In  Date Out  Not Created Yet  Closed  Open
0  1/1/19    5/5/22                0       7     5
1  5/5/22    7/7/22                4       5     3
2  1/1/20  12/12/20                0      12     0
3  1/1/23    6/6/23               12       0     0
4  6/6/21    6/6/23                0       0    12
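The columns above count months per row. If the goal is instead the question's per-month tally across rows, one possible sketch uses NumPy broadcasting to build the full status table in one step. It assumes the convention of the question's loop, where the closing month itself counts as "Closed" (the counts above treat it as open), and the names `month_index`, `status`, and `monthly_counts` are illustrative:

```python
import numpy as np
import pandas as pd

# The two sample rows from the question
df = pd.DataFrame({
    "Date In": ["1/1/19", "5/5/22"],
    "Date Out": ["5/5/22", "7/7/22"],
})

def month_index(dates: pd.Series) -> np.ndarray:
    """Map dates to a running month number (year * 12 + month)."""
    parsed = pd.to_datetime(dates, format="%m/%d/%y")
    return (parsed.dt.year * 12 + parsed.dt.month).to_numpy()

months = pd.period_range("2022-01", "2022-12", freq="M")
m = (months.year * 12 + months.month).to_numpy()[None, :]  # shape (1, 12)
m_in = month_index(df["Date In"])[:, None]                 # shape (rows, 1)
m_out = month_index(df["Date Out"])[:, None]

# np.select takes the first condition that holds for each cell
status = np.select(
    [m < m_in, m >= m_out],
    ["Not Yet Created", "Closed"],
    default="Open",
)

# Label columns like the question does, e.g. "1/22"
labels = [f"{p.month}/{p.year % 100}" for p in months]
table = pd.DataFrame(status, index=df.index, columns=labels)

# Rows are statuses, columns are months
monthly_counts = table.apply(pd.Series.value_counts).fillna(0).astype(int)
print(monthly_counts)
```

With this layout, table['1/22'].value_counts() from the question works directly.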