Truncate time series files and extract some descriptive variables

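Question:

I have two main problems that I can't figure out how to solve in Python. Let me explain the context. On the one hand, I have a dataset containing date points, each identified by an ID (1 ID = 1 patient), like this:

ID	Date point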
0001	25/12/2022 09:00
0002	29/12/2022 16:00
0003	30/12/2022 18:00
...	....

On the other hand, I have a folder containing many text files with time series, like this:

0001.txt
0002.txt
0003.txt
...

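These files all have the same layout: the ID (the same as in the dataset) is in the file name, and inside each file the first column contains the date and the second column the value:

25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...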

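1/ I would like to truncate the text files and keep only the values that fall within the 48 hours relative to the date point in the dataset.

2/ For some statistical analysis, I would like to extract a few values, such as the mean or the maximum of this variable, and add them to a DataFrame like this:

ID	Mean	Maximum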
0001
0002
0003
...	....	...

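I know this will be a trivial problem for you, but for me (a beginner in Python) it is a challenge!

Thank you all.

Managing time series with a DataFrame containing date points, and getting some statistical values.

python time-series text-files data-manipulation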

Answer:

import pandas as pd
from pathlib import Path


# I'll create a limited version of your initial table
data = {
    "ID": ["0001", "0002", "0003"],
    "Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}

# put in a Pandas DataFrame
df = pd.DataFrame(data)

# convert the "Date point" column to datetime objects (the dates are day-first)
df["Date point"] = pd.to_datetime(df["Date point"], format="%d/%m/%Y %H:%M")

# provide the path to the folder containing the files
folder = Path("/path_to_files")

newdata = {"ID": [], "Mean": [], "Maximum": []}  # an empty dictionary that you'll fill with the required statistical info

# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct file name
    if inputfile.exists():
        # read in the file: whitespace-separated date, time and value columns
        subdata = pd.read_csv(
            inputfile,
            sep=r"\s+",  # columns are separated by whitespace
            header=None,  # there's no header information
            names=["date", "time", "value"],
        )

        # combine the date and time columns into one datetime column (day-first)
        subdata["timestamp"] = pd.to_datetime(
            subdata["date"] + " " + subdata["time"], format="%d/%m/%Y %H:%M"
        )

        # get the values in the 48 hours after the current date point
        # (the reading taken exactly at the date point is excluded)
        td = pd.Timedelta(hours=48)
        mask = (subdata["timestamp"] > date) & (subdata["timestamp"] <= date + td)

        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata.loc[mask, "value"].mean())
        newdata["Maximum"].append(subdata.loc[mask, "value"].max())

# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
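
To check the windowing logic without touching the real patient files, the same steps can be tried on a small synthetic file. This is only a sketch: the helper name `window_stats`, the temporary file and the 12-hourly readings are made up for illustration, and it assumes the files hold whitespace-separated day-first date, time and value columns as shown in the question.

```python
import tempfile
from pathlib import Path

import pandas as pd

DATE_FORMAT = "%d/%m/%Y %H:%M"  # day-first timestamps, as in the question


def window_stats(path, start, hours=48):
    """Return (mean, maximum) of the values recorded in (start, start + hours]."""
    series = pd.read_csv(
        path, sep=r"\s+", header=None, names=["date", "time", "value"]
    )
    stamps = pd.to_datetime(series["date"] + " " + series["time"], format=DATE_FORMAT)
    end = start + pd.Timedelta(hours=hours)
    selected = series.loc[(stamps > start) & (stamps <= end), "value"]
    return selected.mean(), selected.max()


# Synthetic data (hypothetical): one reading every 12 hours, values 100, 101, ...
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "0001.txt"
    start = pd.Timestamp("2022-12-25 09:00")
    times = [start + pd.Timedelta(hours=12 * k) for k in range(8)]
    lines = [f"{t.strftime(DATE_FORMAT)} {100 + k}" for k, t in enumerate(times)]
    sample.write_text("\n".join(lines) + "\n")

    # Only the readings at +12 h, +24 h, +36 h and +48 h fall inside the window
    mean_value, max_value = window_stats(sample, start)

print(mean_value, max_value)
```

If the window is meant to end at the date point rather than start there, the two bounds in `window_stats` could be changed to `start - pd.Timedelta(hours=hours)` and `start`.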
