Asked by: Peter · Asked: 10/21/2023 · Updated: 10/22/2023 · Views: 56
Feature engineering: how to create a feature in Jupyter Notebook
Q:
I have a taxi dataset with the following columns:
0 Vendor_id int64
1 tpep_pickup_datetime datetime64[ns]
2 tpep_dropoff_datetime datetime64[ns]
3 passenger_count float64
4 trip_distance float64
5 RatecodeID float64
6 store_and_fwd_flag object
7 PULocationID int64
8 DOLocationID int64
9 payment_type int64
10 fare_amount float64
11 extra float64
12 mta_tax float64
13 tip_amount float64
14 tolls_amount float64
15 improvement_surcharge float64
16 total_amount float64
17 congestion_surcharge float64
18 airport_fee float64
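I want to use feature engineering to find out whether a trip took place during rush hour. How can I do this in a Jupyter notebook? No rush-hour times are given, so perhaps we can assume rush hour runs from 8 a.m. to 10 a.m.
Thank you very much.
I am confused and cannot find an answer on how to do this or how to create a new feature. Please help me with what to do. I know this is part of machine learning, and I think it is done with sklearn? Please correct me if I am wrong.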
A:
Following your assumption that rush hour runs from 8 to 10, and assuming you have already loaded the dataset into a DataFrame named df:
import pandas as pd
# Extract the hour (0-23) from the pickup datetime
df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour
# Flag trips picked up from 08:00 up to, but not including, 10:00.
# The whole condition is parenthesized so that .astype(int) applies to the combined mask.
df['is_rush_hour'] = ((df['pickup_hour'] >= 8) & (df['pickup_hour'] < 10)).astype(int)
# Drop the 'pickup_hour' column as it was just an intermediate step
df.drop(columns='pickup_hour', inplace=True)
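The is_rush_hour column has now been added to the DataFrame, where 1 means "rush hour" and 0 means "not rush hour".
Use pandas to manipulate your data; I don't think sklearn is the right tool for this task.
The code below counts the rides in each 2-hour interval and takes the interval with the most rides as the rush-hour interval. It then defines the is_rush_hour column, a boolean variable in which True marks rides that started during that interval.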
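To try the fixed 8-to-10 rule without loading the full dataset, here is a minimal sketch on a few made-up pickup times. The helper name add_rush_hour_flag, its parameters, and the sample timestamps are assumptions for illustration only; they are not part of the original answer or the dataset.

```python
import pandas as pd


def add_rush_hour_flag(df, start_hour=8, end_hour=10,
                       column="tpep_pickup_datetime"):
    """Return a copy of df with an integer is_rush_hour column (1 or 0).

    Hypothetical helper: a pickup counts as rush hour when its hour
    satisfies start_hour <= hour < end_hour.
    """
    out = df.copy()
    hours = out[column].dt.hour
    out["is_rush_hour"] = ((hours >= start_hour) & (hours < end_hour)).astype(int)
    return out


# Made-up pickup times around the boundaries of the assumed 8-10 window
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-02 07:59:00",
        "2023-01-02 08:00:00",
        "2023-01-02 09:45:00",
        "2023-01-02 10:00:00",
    ])
})

flagged = add_rush_hour_flag(sample)
print(flagged)
```

Only the 08:00 and 09:45 pickups are flagged, because the end hour is excluded. Returning a copy leaves the original DataFrame untouched, which can be convenient when re-running notebook cells.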
import pandas as pd
df = pd.read_csv('your_dataset.csv')
df['pickup_hour'] = pd.to_datetime(df['tpep_pickup_datetime']).dt.hour
# Bin pickup hours into 2-hour intervals [0, 2), [2, 4), ..., [22, 24);
# right=False keeps hour 0 inside the first bin instead of turning it into NaN
df['pickup_hour_interval'] = pd.cut(df['pickup_hour'], bins=range(0, 25, 2), right=False)
# Count the number of rides in each 2-hour interval
hourly_counts = df.groupby('pickup_hour_interval', observed=False).size().reset_index(name='ride_count')
# Find the 2-hour interval with the maximum number of rides
max_rush_hour_interval = hourly_counts.loc[hourly_counts['ride_count'].idxmax()]
print("2-Hour Rush Hour Interval with Maximum Rides:")
print(max_rush_hour_interval)
rush_hour_start = max_rush_hour_interval['pickup_hour_interval'].left
rush_hour_end = max_rush_hour_interval['pickup_hour_interval'].right
# Match the half-open bins: include the start hour, exclude the end hour
df['is_rush_hour'] = df['pickup_hour'].between(rush_hour_start, rush_hour_end, inclusive='left')
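The same idea of picking the busiest 2-hour interval can be checked on a tiny made-up sample. This sketch swaps pd.cut for integer division on the hour, which is a different but equivalent way to form the bins; the timestamps and the names trips, bin_start, rush_start, and rush_end are assumptions for illustration.

```python
import pandas as pd

# Made-up pickup times; three of them fall between 08:00 and 10:00
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-02 00:10:00",
        "2023-01-02 08:05:00",
        "2023-01-02 08:40:00",
        "2023-01-02 09:15:00",
        "2023-01-02 13:30:00",
        "2023-01-02 17:20:00",
        "2023-01-02 17:50:00",
    ])
})

hour = trips["tpep_pickup_datetime"].dt.hour

# Map each hour to the start of its 2-hour bin: 0, 2, 4, ..., 22
bin_start = (hour // 2) * 2

# Count rides per bin and take the start of the busiest one
ride_counts = bin_start.value_counts().sort_index()
rush_start = int(ride_counts.idxmax())
rush_end = rush_start + 2

# Half-open interval: include the start hour, exclude the end hour
trips["is_rush_hour"] = (hour >= rush_start) & (hour < rush_end)

print(ride_counts)
print(f"Busiest interval: {rush_start}:00 to {rush_end}:00")
print(trips)
```

On real data it may be worth counting weekdays and weekends separately, since a single busiest interval over all days can hide two distinct peaks.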