Feature engineering: how to create features in a Jupyter notebook

Asked by: Peter  Asked: 10/21/2023  Updated: 10/22/2023  Views: 56

Q:

I have a taxi dataset with the following columns:

 0  Vendor_id              int64
 1  tpep_pickup_datetime   datetime64[ns]
 2  tpep_dropoff_datetime  datetime64[ns]
 3  passenger_count        float64
 4  trip_distance          float64
 5  RatecodeID             float64
 6  store_and_fwd_flag     object
 7  PULocationID           int64
 8  DOLocationID           int64
 9  payment_type           int64
10  fare_amount            float64
11  extra                  float64
12  mta_tax                float64
13  tip_amount             float64
14  tolls_amount           float64
15  improvement_surcharge  float64
16  total_amount           float64
17  congestion_surcharge   float64
18  airport_fee            float64

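I want to use feature engineering to flag whether a trip took place during rush hour. How can I do this in a Jupyter notebook? The rush-hour times are not given, so perhaps assume that rush hour runs from 8 am to 10 am.

Many thanks.

I am confused and cannot find an answer on how to do this or how to create a new feature. Please tell me what to do. I know this is part of machine learning, and I think it is done with sklearn? Please correct me if I am wrong.

python pandas machine-learning scikit-learn jupyter-notebook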

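Comments

0 upvotes Peter 10/21/2023
@Iskander14yo Sorry, silly question, but is it necessary to drop 'pickup_hour'? Why? If I don't drop it, will it interfere with later analysis? Thanks
0 upvotes Peter 10/21/2023
This assumes rush hour runs from 8 am to 10 am. If it were 2 pm to 4 pm instead, the condition would be (df['pickup_hour'] >= 14) & (df['pickup_hour'] < 16), is that right?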

A:

0 upvotes Iskander14yo 10/21/2023 #1

Following your reasoning that rush hour runs from 8 to 10, and assuming you have already loaded the dataset into a DataFrame named df:

import pandas as pd

# Extract the hour from the pickup datetime
df['pickup_hour'] = df['tpep_pickup_datetime'].dt.hour

# Determine if the trip is during rush hour
# (the whole condition is parenthesized so that .astype(int) applies to the combined result)
df['is_rush_hour'] = ((df['pickup_hour'] >= 8) & (df['pickup_hour'] < 10)).astype(int)

# Drop the 'pickup_hour' column as it was just an intermediate step
df.drop(columns='pickup_hour', inplace=True)

The is_rush_hour column has now been added to the DataFrame, where 1 means "rush hour" and 0 means "not rush hour".
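As a minimal sketch of how the same check extends to other windows (for example the 2 pm to 4 pm case raised in the comments), the logic can be wrapped in a small helper. The function name add_rush_hour_flag, its parameters, and the toy data are illustrative assumptions and are not part of the original answer.

```python
import pandas as pd


def add_rush_hour_flag(df, start_hour, end_hour, datetime_col="tpep_pickup_datetime"):
    """Return a copy of df with an integer is_rush_hour column.

    Hypothetical helper: a trip counts as rush hour when its pickup hour
    satisfies start_hour <= hour < end_hour.
    """
    out = df.copy()
    # Keep the hour in a local variable so no intermediate column is created
    hour = out[datetime_col].dt.hour
    out["is_rush_hour"] = ((hour >= start_hour) & (hour < end_hour)).astype(int)
    return out


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-02 07:59:00",
        "2023-01-02 08:00:00",
        "2023-01-02 09:59:00",
        "2023-01-02 10:00:00",
        "2023-01-02 14:30:00",
    ])
})

morning = add_rush_hour_flag(toy, 8, 10)
afternoon = add_rush_hour_flag(toy, 14, 16)
print(morning["is_rush_hour"].tolist())
print(afternoon["is_rush_hour"].tolist())
```

Because the hour lives only in a local variable here, there is nothing to drop afterwards. Keeping a pickup_hour column would not break anything; it is simply an extra column that may or may not be useful later.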

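0 upvotes Anna Andreeva Rogotulka 10/22/2023 #2

Use pandas to manipulate your data; I don't think sklearn is the right tool for your task.

The code below counts rides in 2-hour intervals and finds the interval with the most rides, which we treat as the rush-hour interval. It then defines the is_rush_hour column, a boolean variable: rows with True are rides during rush hour.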

import pandas as pd

df = pd.read_csv('your_dataset.csv')
df['pickup_hour'] = pd.to_datetime(df['tpep_pickup_datetime']).dt.hour

# Count rides in 2-hour intervals: [0, 2), [2, 4), ..., [22, 24)
# right=False keeps hour 0 in the first bin instead of turning it into NaN
df['pickup_hour_interval'] = pd.cut(df['pickup_hour'], bins=list(range(0, 25, 2)), right=False)
hourly_counts = df.groupby('pickup_hour_interval', observed=False).size().reset_index(name='ride_count')

# Find the 2-hour interval with the maximum number of rides
max_rush_hour_interval = hourly_counts.loc[hourly_counts['ride_count'].idxmax()]

print("2-Hour Rush Hour Interval with Maximum Rides:")
print(max_rush_hour_interval)

rush_hour_start = max_rush_hour_interval['pickup_hour_interval'].left
rush_hour_end = max_rush_hour_interval['pickup_hour_interval'].right
# The bins are closed on the left only, so the end hour is excluded
df['is_rush_hour'] = df['pickup_hour'].between(rush_hour_start, rush_hour_end, inclusive='left')
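To sanity-check the binning idea without the full dataset, here is a small sketch on made-up pickup hours. The data is synthetic and chosen so that most rides fall at 8 and 9. The sketch uses left-closed bins so that hour 0 is counted and the end hour of the busiest interval is excluded.

```python
import pandas as pd

# Synthetic pickup hours (an assumption for illustration): most rides at 8 and 9
toy = pd.DataFrame({"pickup_hour": [0, 1, 8, 8, 9, 9, 9, 10, 17, 23]})

# Left-closed 2-hour bins: [0, 2), [2, 4), ..., [22, 24)
toy["pickup_hour_interval"] = pd.cut(
    toy["pickup_hour"], bins=list(range(0, 25, 2)), right=False
)

# Count rides per interval and pick the busiest one (a pandas Interval)
counts = toy["pickup_hour_interval"].value_counts()
busiest = counts.idxmax()

# Flag rides whose hour falls inside the busiest interval, end hour excluded
toy["is_rush_hour"] = toy["pickup_hour"].between(
    busiest.left, busiest.right, inclusive="left"
)

print(busiest)
print(toy[["pickup_hour", "is_rush_hour"]])
```

Note that this picks a single interval: if two intervals tie for the maximum, only one of them is flagged.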