从两个单独的文件中检索数据并写入第三个 csv 文件

Retrieving data from two separate files and writing to a third csv file

提问人:Amina Umar 提问时间:5/30/2023 最后编辑:Amina Umar 更新时间:5/31/2023 访问量:61

问:

我一整天都在思考如何完成这项任务。我有这两个文件:

  1. user.plt:包含带有时间戳的用户 GPS 轨迹。
  2. label.txt:包含有关用于覆盖用户行程的出行方式的信息。

第一个文件 () 是一个 7 字段逗号分隔的数据,如下所示:user.plt

lat,lon,constant,alt,ndays,date,time
39.921712,116.472343,0,13,39298.1462037037,2007-08-04,03:30:32
39.921705,116.472343,0,13,39298.1462152778,2007-08-04,03:30:33
39.863516,116.373796,0,115,39753.1872916667,2008-11-01,04:29:42
39.863471,116.373711,0,112,39753.1873032407,2008-11-01,04:29:43
39.991778,116.333088,0,223,39753.2128240741,2008-11-01,05:06:28
39.991776,116.333031,0,223,39753.2128472222,2008-11-01,05:06:30
39.991568,116.331501,0,95,39756.4298611111,2008-11-04,10:19:00
39.99156,116.331508,0,95,39756.4298726852,2008-11-04,10:19:01
39.975891,116.333441,0,-98,39756.4312615741,2008-11-04,10:21:01
39.915171,116.455808,0,656,39756.4601157407,2008-11-04,11:02:34
39.915369,116.455791,0,620,39756.4601273148,2008-11-04,11:02:35
39.912271,116.470686,0,95,39756.4653587963,2008-11-04,11:10:07
39.912088,116.469958,0,246,39756.4681481481,2008-11-04,11:14:08
39.912106,116.469936,0,246,39756.4681597222,2008-11-04,11:14:09
39.912189,116.465108,0,184,39756.4741666667,2008-11-04,11:22:48
39.975859,116.334063,0,279,39756.6100115741,2008-11-04,14:38:25
39.975978,116.334041,0,272,39756.6100231481,2008-11-04,14:38:26
39.991336,116.331886,0,115,39756.6112847222,2008-11-04,14:40:15
39.991581,116.33131,0,164,39756.6123148148,2008-11-04,14:41:44

第二个文件 () 是以 3 列为选项卡分隔的用户行程信息,如下所示:label.txt

Start Time           End Time    Transportation Mode
2008/11/01 03:59:27 2008/11/01 04:30:18 train
2008/11/01 04:35:38 2008/11/01 05:06:30 taxi
2008/11/04 10:18:55 2008/11/04 10:21:11 subway
2008/11/04 11:02:34 2008/11/04 11:10:08 taxi
2008/11/04 11:14:08 2008/11/04 11:22:48 walk

我正在寻找一种方法来读取带有旅行模式注释的旅行每个时期的内容,并以这种方式写入文件:user.pltCSV

  • 读取 1 行(即特定行程的出行方式信息)。创建两个初始化为 和 初始化为 的字段。label.txttrip_id1segment_id1

  • 读取日期和时间在开始时间/结束时间间隔内的每一行(即获取行程的 GPS 轨迹)。user.pltlabel.txt

  • 阅读下一行。label.txt

    • 如果上一行的结束时间与当前行的开始时间之差小于 30 分钟(即相同行程、新航段),则保留为 ,更新为 。trip_id1segment_id2
    • 如果上一行的结束时间与当前行的开始时间之差超过 30 分钟(则新行程、新航段),update 和 .trip_id = 2segment_id = 1
  • 每次,将值写入以下形式的文件中:CSV

    trip_id, segment_id, lat, lon, date, time, transportation-mode

预期结果

给定上面的 2 个输入文件,预期的 CSV 文件 () 将为:processed.csv

trip_id,segment_id,lat,lon,date,time,transportation-mode
1,1,39.863516,116.373796,2008-11-01,04:29:42,train
1,1,39.863471,116.373711,2008-11-01,04:29:43,train
1,2,39.991778,116.333088,2008-11-01,05:06:28,taxi
1,2,39.991776,116.333031,2008-11-01,05:06:30,taxi
2,1,39.991568,116.331501,2008-11-04,10:19:00,subway
2,1,39.99156,116.331508,2008-11-04,10:19:01,subway
2,1,39.975891,116.333441,2008-11-04,10:21:01,subway
3,1,39.915171,116.455808,2008-11-04,11:02:34,taxi
3,1,39.915369,116.455791,2008-11-04,11:02:35,taxi
3,1,39.912271,116.470686,2008-11-04,11:10:07,taxi
3,2,39.912088,116.469958,2008-11-04,11:14:08,walk
3,2,39.912106,116.469936,2008-11-04,11:14:09,walk
3,2,39.912189,116.465108,2008-11-04,11:22:48,walk

注意:并非所有行都有相应的行程信息。这些行将被忽略且不需要。user.pltlabel.txt

编辑

下面我按照评论的建议以字典的形式提供数据。

  1. user.plt:
{'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
  39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
  39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
 'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
  116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
  116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
 'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
 'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
  39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
  39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
  39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
 'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
 'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
  '10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
  '14:38:25','14:38:26','14:40:15','14:41:44']}
  1. label.txt:
{'Start Time': ['2008/11/01 03:59:27',
  '2008/11/01 04:35:38',
  '2008/11/04 10:18:55',
  '2008/11/04 11:02:34',
  '2008/11/04 11:14:08'],
 'End Time': ['2008/11/01 04:30:18',
  '2008/11/01 05:06:30',
  '2008/11/04 10:21:11',
  '2008/11/04 11:10:08',
  '2008/11/04 11:22:48'],
 'Transportation Mode': ['train', 'taxi', 'subway', 'taxi', 'walk']}
python-3.x csv 文件 文件-io pandas pyjanitor

评论

1赞 Barry the Platipus 5/30/2023
请将 csv(表格)数据作为字典包含在内,以便我们可以重现您的问题并给出答案。
0赞 Kache 5/30/2023
那么你的问题是什么?您的“以这种方式写入 csv 文件”部分读起来就像您已经知道该怎么做一样。
0赞 Amina Umar 5/30/2023
@Kache详细描述这个问题。
0赞 Amina Umar 5/30/2023
@BarrythePlatipus问题已编辑以在 Python 字典中显示数据。

答:

1赞 Timeless 5/30/2023 #1

如果您想使用带有 风味

import pandas as pd
import janitor

user = pd.read_csv("user.plt", parse_dates={"Trip Time": ["date", "time"]})
label = pd.read_table("label.txt", parse_dates=["Start Time", "End Time"])

df = user.conditional_join(label, 
                          ("Trip Time", "Start Time", ">="), 
                          ("Trip Time", "End Time", "<="))

df["diff (min)"] = ((df["End Time"].shift() - df["Start Time"])
                             .dt.total_seconds().div(60).abs())

df["trip_id"] = ((df["diff (min)"].shift().gt(30) & df["diff (min)"].lt(30))
                               .shift(-1, fill_value=False).cumsum().add(1))

df["segment_id"] = ((~df.duplicated(subset=["trip_id", "Transportation Mode"]))
                                              .groupby(df['trip_id']).cumsum())

use_cols = ["diff (min)", "trip_id", "segment_id",
            "lat", "lon", "Trip Time", "Transportation Mode"]

df = df[use_cols]

# df.to_csv("processed.csv", index=False) #uncomment this line to make a fresh `.csv`

​ 输出:

print(df)

    diff (min)  trip_id  segment_id   lat    lon           Trip Time Transportation Mode
0          NaN        1           1 39.86 116.37 2008-11-01 04:29:42               train
1        30.85        2           1 39.86 116.37 2008-11-01 04:29:43               train
2         5.33        2           2 39.99 116.33 2008-11-01 05:06:28                taxi
3        30.87        2           2 39.99 116.33 2008-11-01 05:06:30                taxi
4      4632.42        3           1 39.99 116.33 2008-11-04 10:19:00              subway
5         2.27        3           1 39.99 116.33 2008-11-04 10:19:01              subway
6         2.27        3           1 39.98 116.33 2008-11-04 10:21:01              subway
7        41.38        4           1 39.92 116.46 2008-11-04 11:02:34                taxi
8         7.57        4           1 39.92 116.46 2008-11-04 11:02:35                taxi
9         7.57        4           1 39.91 116.47 2008-11-04 11:10:07                taxi
10        4.00        4           2 39.91 116.47 2008-11-04 11:14:08                walk
11        8.67        4           2 39.91 116.47 2008-11-04 11:14:09                walk
12        8.67        4           2 39.91 116.47 2008-11-04 11:22:48                walk

评论

1赞 Timeless 5/30/2023
注意:我故意保留了这个专栏,以证明预期输出的差异是合理的。diff (min)
0赞 Barry the Platipus 5/30/2023 #2

这是一个纯粹的熊猫解决方案(确保 dtypes 也是正确的)。这是在 Jupyter 笔记本中测试的,因此,如果使用另一个 IDE,您也可以这样做:display(user_df)print(user_df)

import pandas as pd

user_df = pd.DataFrame({'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
  39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
  39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
 'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
  116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
  116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
 'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
 'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
  39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
  39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
  39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
 'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
  '2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
 'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
  '10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
  '14:38:25','14:38:26','14:40:15','14:41:44']})

label_df = pd.DataFrame({'StartTime': ['2008/11/01 03:59:27',
  '2008/11/01 04:35:38',
  '2008/11/04 10:18:55',
  '2008/11/04 11:02:34',
  '2008/11/04 11:14:08'],
 'EndTime': ['2008/11/01 04:30:18',
  '2008/11/01 05:06:30',
  '2008/11/04 10:21:11',
  '2008/11/04 11:10:08',
  '2008/11/04 11:22:48'],
 'TransportationMode': ['train', 'taxi', 'subway', 'taxi', 'walk']})

# display(user_df)
# display(label_df)
    
user_df['timestamp'] = user_df.apply(lambda row: pd.to_datetime(str(row.date) +' ' + str(row.time), format='ISO8601'), axis=1)
label_df['start_time'] = label_df.apply(lambda row: pd.to_datetime(row['StartTime'], format='ISO8601'), axis=1)
label_df['end_time'] = label_df.apply(lambda row: pd.to_datetime(row['EndTime'], format='ISO8601'), axis=1)

label_df['trip_id'] = label_df.apply(lambda row: 1 if row.name == 0 else (1 if label_df.iloc[row.name, :]['start_time'] - label_df.iloc[row.name-1, :]['end_time'] < pd.Timedelta(minutes=30) else 2), axis=1)
label_df['segment_id'] = label_df.apply(lambda row: 1 if row.name == 0 else (2 if label_df.iloc[row.name, :]['start_time'] - label_df.iloc[row.name-1, :]['end_time'] < pd.Timedelta(minutes=30) else 1), axis=1)
label_df

user_df['transportation_mode'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['TransportationMode'].values, axis=1)
user_df['trip_id'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['trip_id'].values, axis=1)
user_df['segment_id'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['segment_id'].values, axis=1)
display(user_df[user_df.transportation_mode.str.len() > 0])

终端结果:

    lat     lon     constant    alt     ndays   date    time    timestamp   transportation_mode     trip_id     segment_id
2   39.863516   116.373796  0   115     39753.187292    2008-11-01  04:29:42    2008-11-01 04:29:42     [train]     [1]     [1]
3   39.863471   116.373711  0   112     39753.187303    2008-11-01  04:29:43    2008-11-01 04:29:43     [train]     [1]     [1]
4   39.991778   116.333088  0   223     39753.212824    2008-11-01  05:06:28    2008-11-01 05:06:28     [taxi]  [1]     [2]
5   39.991776   116.333031  0   223     39753.212847    2008-11-01  05:06:30    2008-11-01 05:06:30     [taxi]  [1]     [2]
6   39.991568   116.331501  0   95  39756.429861    2008-11-04  10:19:00    2008-11-04 10:19:00     [subway]    [2]     [1]
7   39.991560   116.331508  0   95  39756.429873    2008-11-04  10:19:01    2008-11-04 10:19:01     [subway]    [2]     [1]
8   39.975891   116.333441  0   -98     39756.431262    2008-11-04  10:21:01    2008-11-04 10:21:01     [subway]    [2]     [1]
9   39.915171   116.455808  0   656     39756.460116    2008-11-04  11:02:34    2008-11-04 11:02:34     [taxi]  [2]     [1]
10  39.915369   116.455791  0   620     39756.460127    2008-11-04  11:02:35    2008-11-04 11:02:35     [taxi]  [2]     [1]
11  39.912271   116.470686  0   95  39756.465359    2008-11-04  11:10:07    2008-11-04 11:10:07     [taxi]  [2]     [1]
12  39.912088   116.469958  0   246     39756.468148    2008-11-04  11:14:08    2008-11-04 11:14:08     [walk]  [1]     [2]
13  39.912106   116.469936  0   246     39756.468160    2008-11-04  11:14:09    2008-11-04 11:14:09     [walk]  [1]     [2]
14  39.912189   116.465108  0   184     39756.474167    2008-11-04  11:22:48    2008-11-04 11:22:48     [walk]  [1]     [2]

话虽如此(好吧,书面),我不是 100% 确定这是否符合您的确切要求,或者即使它是有效的——一步一步地手动解决问题,所以这里最欢迎熊猫重量级人物的某种形式的验证。