提问人:Amina Umar 提问时间:5/30/2023 最后编辑:Amina Umar 更新时间:5/31/2023 访问量:61
从两个单独的文件中检索数据并写入第三个 csv 文件
Retrieving data from two separate files and writing to a third csv file
问:
我一整天都在思考如何完成这项任务。我有这两个文件:
user.plt:
包含带有时间戳的用户 GPS 轨迹。label.txt:
包含有关用于覆盖用户行程的出行方式的信息。
第一个文件 () 是一个 7 字段逗号分隔的数据,如下所示:user.plt
lat,lon,constant,alt,ndays,date,time
39.921712,116.472343,0,13,39298.1462037037,2007-08-04,03:30:32
39.921705,116.472343,0,13,39298.1462152778,2007-08-04,03:30:33
39.863516,116.373796,0,115,39753.1872916667,2008-11-01,04:29:42
39.863471,116.373711,0,112,39753.1873032407,2008-11-01,04:29:43
39.991778,116.333088,0,223,39753.2128240741,2008-11-01,05:06:28
39.991776,116.333031,0,223,39753.2128472222,2008-11-01,05:06:30
39.991568,116.331501,0,95,39756.4298611111,2008-11-04,10:19:00
39.99156,116.331508,0,95,39756.4298726852,2008-11-04,10:19:01
39.975891,116.333441,0,-98,39756.4312615741,2008-11-04,10:21:01
39.915171,116.455808,0,656,39756.4601157407,2008-11-04,11:02:34
39.915369,116.455791,0,620,39756.4601273148,2008-11-04,11:02:35
39.912271,116.470686,0,95,39756.4653587963,2008-11-04,11:10:07
39.912088,116.469958,0,246,39756.4681481481,2008-11-04,11:14:08
39.912106,116.469936,0,246,39756.4681597222,2008-11-04,11:14:09
39.912189,116.465108,0,184,39756.4741666667,2008-11-04,11:22:48
39.975859,116.334063,0,279,39756.6100115741,2008-11-04,14:38:25
39.975978,116.334041,0,272,39756.6100231481,2008-11-04,14:38:26
39.991336,116.331886,0,115,39756.6112847222,2008-11-04,14:40:15
39.991581,116.33131,0,164,39756.6123148148,2008-11-04,14:41:44
第二个文件 () 是以 3 列为选项卡分隔的用户行程信息,如下所示:label.txt
Start Time End Time Transportation Mode
2008/11/01 03:59:27 2008/11/01 04:30:18 train
2008/11/01 04:35:38 2008/11/01 05:06:30 taxi
2008/11/04 10:18:55 2008/11/04 10:21:11 subway
2008/11/04 11:02:34 2008/11/04 11:10:08 taxi
2008/11/04 11:14:08 2008/11/04 11:22:48 walk
我正在寻找一种方法来读取带有旅行模式注释的旅行每个时期的内容,并以这种方式写入文件:user.plt
CSV
读取 1 行(即特定行程的出行方式信息)。创建两个初始化为 和 初始化为 的字段。
label.txt
trip_id
1
segment_id
1
读取日期和时间在开始时间/结束时间间隔内的每一行(即获取行程的 GPS 轨迹)。
user.plt
label.txt
阅读下一行。
label.txt
- 如果上一行的结束时间与当前行的开始时间之差小于 30 分钟(即相同行程、新航段),则保留为 ,更新为 。
trip_id
1
segment_id
2
- 如果上一行的结束时间与当前行的开始时间之差超过 30 分钟(则新行程、新航段),update 和 .
trip_id = 2
segment_id = 1
- 如果上一行的结束时间与当前行的开始时间之差小于 30 分钟(即相同行程、新航段),则保留为 ,更新为 。
每次,将值写入以下形式的文件中:
CSV
trip_id, segment_id, lat, lon, date, time, transportation-mode
预期结果
给定上面的 2 个输入文件,预期的 CSV 文件 () 将为:processed.csv
trip_id,segment_id,lat,lon,date,time,transportation-mode
1,1,39.863516,116.373796,2008-11-01,04:29:42,train
1,1,39.863471,116.373711,2008-11-01,04:29:43,train
1,2,39.991778,116.333088,2008-11-01,05:06:28,taxi
1,2,39.991776,116.333031,2008-11-01,05:06:30,taxi
2,1,39.991568,116.331501,2008-11-04,10:19:00,subway
2,1,39.99156,116.331508,2008-11-04,10:19:01,subway
2,1,39.975891,116.333441,2008-11-04,10:21:01,subway
3,1,39.915171,116.455808,2008-11-04,11:02:34,taxi
3,1,39.915369,116.455791,2008-11-04,11:02:35,taxi
3,1,39.912271,116.470686,2008-11-04,11:10:07,taxi
3,2,39.912088,116.469958,2008-11-04,11:14:08,walk
3,2,39.912106,116.469936,2008-11-04,11:14:09,walk
3,2,39.912189,116.465108,2008-11-04,11:22:48,walk
注意:并非所有行都有相应的行程信息。这些行将被忽略且不需要。user.plt
label.txt
编辑
下面我按照评论的建议以字典的形式提供数据。
user.plt
:
{'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
'2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
'2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
'10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
'14:38:25','14:38:26','14:40:15','14:41:44']}
label.txt
:
{'Start Time': ['2008/11/01 03:59:27',
'2008/11/01 04:35:38',
'2008/11/04 10:18:55',
'2008/11/04 11:02:34',
'2008/11/04 11:14:08'],
'End Time': ['2008/11/01 04:30:18',
'2008/11/01 05:06:30',
'2008/11/04 10:21:11',
'2008/11/04 11:10:08',
'2008/11/04 11:22:48'],
'Transportation Mode': ['train', 'taxi', 'subway', 'taxi', 'walk']}
答:
import pandas as pd
import janitor
user = pd.read_csv("user.plt", parse_dates={"Trip Time": ["date", "time"]})
label = pd.read_table("label.txt", parse_dates=["Start Time", "End Time"])
df = user.conditional_join(label,
("Trip Time", "Start Time", ">="),
("Trip Time", "End Time", "<="))
df["diff (min)"] = ((df["End Time"].shift() - df["Start Time"])
.dt.total_seconds().div(60).abs())
df["trip_id"] = ((df["diff (min)"].shift().gt(30) & df["diff (min)"].lt(30))
.shift(-1, fill_value=False).cumsum().add(1))
df["segment_id"] = ((~df.duplicated(subset=["trip_id", "Transportation Mode"]))
.groupby(df['trip_id']).cumsum())
use_cols = ["diff (min)", "trip_id", "segment_id",
"lat", "lon", "Trip Time", "Transportation Mode"]
df = df[use_cols]
# df.to_csv("processed.csv", index=False) #uncomment this line to make a fresh `.csv`
输出:
print(df)
diff (min) trip_id segment_id lat lon Trip Time Transportation Mode
0 NaN 1 1 39.86 116.37 2008-11-01 04:29:42 train
1 30.85 2 1 39.86 116.37 2008-11-01 04:29:43 train
2 5.33 2 2 39.99 116.33 2008-11-01 05:06:28 taxi
3 30.87 2 2 39.99 116.33 2008-11-01 05:06:30 taxi
4 4632.42 3 1 39.99 116.33 2008-11-04 10:19:00 subway
5 2.27 3 1 39.99 116.33 2008-11-04 10:19:01 subway
6 2.27 3 1 39.98 116.33 2008-11-04 10:21:01 subway
7 41.38 4 1 39.92 116.46 2008-11-04 11:02:34 taxi
8 7.57 4 1 39.92 116.46 2008-11-04 11:02:35 taxi
9 7.57 4 1 39.91 116.47 2008-11-04 11:10:07 taxi
10 4.00 4 2 39.91 116.47 2008-11-04 11:14:08 walk
11 8.67 4 2 39.91 116.47 2008-11-04 11:14:09 walk
12 8.67 4 2 39.91 116.47 2008-11-04 11:22:48 walk
评论
diff (min)
这是一个纯粹的熊猫解决方案(确保 dtypes 也是正确的)。这是在 Jupyter 笔记本中测试的,因此,如果使用另一个 IDE,您也可以这样做:display(user_df)
print(user_df)
import pandas as pd
user_df = pd.DataFrame({'lat': [39.921712,39.921705,39.863516,39.863471,39.991778,39.991776,
39.991568,39.99156,39.975891,39.915171,39.915369,39.912271,39.912088,
39.912106,39.912189,39.975859,39.975978,39.991336,39.991581],
'lon': [116.472343,116.472343,116.373796,116.373711,116.333088,116.333031,
116.331501,116.331508,116.333441,116.455808,116.455791,116.470686,116.469958,
116.469936,116.465108,116.334063,116.334041,116.331886,116.33131],
'constant': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'alt': [13,13,115,112,223,223,95,95,-98,656,620,95,246,246,184,279,272,115,164],
'ndays': [39298.1462037037,39298.1462152778,39753.1872916667,39753.1873032407,
39753.2128240741,39753.2128472222,39756.4298611111,39756.4298726852,39756.4312615741,
39756.4601157407,39756.4601273148,39756.4653587963,39756.4681481481,39756.4681597222,
39756.4741666667,39756.6100115741,39756.6100231481,39756.6112847222,39756.6123148148],
'date': ['2007-08-04','2007-08-04','2008-11-01','2008-11-01','2008-11-01','2008-11-01',
'2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04',
'2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04','2008-11-04'],
'time': ['03:30:32','03:30:33','04:29:42','04:29:43','05:06:28','05:06:30','10:19:00',
'10:19:01','10:21:01','11:02:34','11:02:35','11:10:07','11:14:08','11:14:09','11:22:48',
'14:38:25','14:38:26','14:40:15','14:41:44']})
label_df = pd.DataFrame({'StartTime': ['2008/11/01 03:59:27',
'2008/11/01 04:35:38',
'2008/11/04 10:18:55',
'2008/11/04 11:02:34',
'2008/11/04 11:14:08'],
'EndTime': ['2008/11/01 04:30:18',
'2008/11/01 05:06:30',
'2008/11/04 10:21:11',
'2008/11/04 11:10:08',
'2008/11/04 11:22:48'],
'TransportationMode': ['train', 'taxi', 'subway', 'taxi', 'walk']})
# display(user_df)
# display(label_df)
user_df['timestamp'] = user_df.apply(lambda row: pd.to_datetime(str(row.date) +' ' + str(row.time), format='ISO8601'), axis=1)
label_df['start_time'] = label_df.apply(lambda row: pd.to_datetime(row['StartTime'], format='ISO8601'), axis=1)
label_df['end_time'] = label_df.apply(lambda row: pd.to_datetime(row['EndTime'], format='ISO8601'), axis=1)
label_df['trip_id'] = label_df.apply(lambda row: 1 if row.name == 0 else (1 if label_df.iloc[row.name, :]['start_time'] - label_df.iloc[row.name-1, :]['end_time'] < pd.Timedelta(minutes=30) else 2), axis=1)
label_df['segment_id'] = label_df.apply(lambda row: 1 if row.name == 0 else (2 if label_df.iloc[row.name, :]['start_time'] - label_df.iloc[row.name-1, :]['end_time'] < pd.Timedelta(minutes=30) else 1), axis=1)
label_df
user_df['transportation_mode'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['TransportationMode'].values, axis=1)
user_df['trip_id'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['trip_id'].values, axis=1)
user_df['segment_id'] = user_df.apply(lambda row: label_df[(label_df.start_time <= row.timestamp)&(label_df.end_time >= row.timestamp)]['segment_id'].values, axis=1)
display(user_df[user_df.transportation_mode.str.len() > 0])
终端结果:
lat lon constant alt ndays date time timestamp transportation_mode trip_id segment_id
2 39.863516 116.373796 0 115 39753.187292 2008-11-01 04:29:42 2008-11-01 04:29:42 [train] [1] [1]
3 39.863471 116.373711 0 112 39753.187303 2008-11-01 04:29:43 2008-11-01 04:29:43 [train] [1] [1]
4 39.991778 116.333088 0 223 39753.212824 2008-11-01 05:06:28 2008-11-01 05:06:28 [taxi] [1] [2]
5 39.991776 116.333031 0 223 39753.212847 2008-11-01 05:06:30 2008-11-01 05:06:30 [taxi] [1] [2]
6 39.991568 116.331501 0 95 39756.429861 2008-11-04 10:19:00 2008-11-04 10:19:00 [subway] [2] [1]
7 39.991560 116.331508 0 95 39756.429873 2008-11-04 10:19:01 2008-11-04 10:19:01 [subway] [2] [1]
8 39.975891 116.333441 0 -98 39756.431262 2008-11-04 10:21:01 2008-11-04 10:21:01 [subway] [2] [1]
9 39.915171 116.455808 0 656 39756.460116 2008-11-04 11:02:34 2008-11-04 11:02:34 [taxi] [2] [1]
10 39.915369 116.455791 0 620 39756.460127 2008-11-04 11:02:35 2008-11-04 11:02:35 [taxi] [2] [1]
11 39.912271 116.470686 0 95 39756.465359 2008-11-04 11:10:07 2008-11-04 11:10:07 [taxi] [2] [1]
12 39.912088 116.469958 0 246 39756.468148 2008-11-04 11:14:08 2008-11-04 11:14:08 [walk] [1] [2]
13 39.912106 116.469936 0 246 39756.468160 2008-11-04 11:14:09 2008-11-04 11:14:09 [walk] [1] [2]
14 39.912189 116.465108 0 184 39756.474167 2008-11-04 11:22:48 2008-11-04 11:22:48 [walk] [1] [2]
话虽如此(好吧,书面),我不是 100% 确定这是否符合您的确切要求,或者即使它是有效的——一步一步地手动解决问题,所以这里最欢迎熊猫重量级人物的某种形式的验证。
评论