Asked by: Kajal Singh · Asked: 2/5/2023 · Last edited by: Kajal Singh · Updated: 2/6/2023 · Views: 105
How can I assign a tag to the smallest in one group, second smallest in another group and third smallest in the third group to a pandas dataframe?
Q:

I have the dataframe below:
ID | Group | Date_Time_1 | Date_Time_2 | Difference | New_Column |
---|---|---|---|---|---|
123 | A | 14-10-2021 15:19 | 14-10-2021 15:32 | 13 | First |
123 | A | 14-10-2021 15:19 | 14-10-2021 15:36 | 17 | NaN |
123 | A | 14-10-2021 15:19 | 14-10-2021 15:37 | 18 | NaN |
123 | A | 14-10-2021 15:19 | 14-10-2021 16:29 | 70 | NaN |
123 | A | 14-10-2021 15:19 | 14-10-2021 17:04 | 105 | NaN |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:32 | 11 | NaN |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:36 | 15 | Second |
123 | B | 14-10-2021 15:21 | 14-10-2021 15:37 | 16 | NaN |
123 | B | 14-10-2021 15:21 | 14-10-2021 16:29 | 68 | NaN |
123 | B | 14-10-2021 15:21 | 14-10-2021 17:04 | 103 | NaN |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:32 | 10 | NaN |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:36 | 14 | NaN |
123 | C | 14-10-2021 15:22 | 14-10-2021 15:37 | 15 | Third |
123 | C | 14-10-2021 15:23 | 14-10-2021 16:29 | 67 | Third_A |
123 | C | 14-10-2021 15:48 | 14-10-2021 17:04 | 102 | Third_B |
789 | A | 14-10-2021 15:19 | 14-10-2021 15:32 | 13 | First |
789 | A | 14-10-2021 15:19 | 14-10-2021 15:36 | 17 | NaN |
789 | B | 14-10-2021 15:21 | 14-10-2021 15:32 | 11 | NaN |
789 | B | 14-10-2021 15:21 | 14-10-2021 15:36 | 15 | Second |
789 | C | 14-10-2021 15:22 | 14-10-2021 15:32 | 10 | NaN |
I am trying to create a new column that assigns 'First' to the smallest 'Date_Time_2' in group 'A' and 'Second' to the second-smallest 'Date_Time_2' in group 'B'. Likewise, it should assign 'Third' to the third-smallest 'Date_Time_2' in group 'C'.
I want it to assign 'Third_A', 'Third_B', etc. once the loop reaches the last 'Group' of an 'ID'. That is, once it reaches the last 'Group' of an 'ID', it should assign 'Third' (or the n-th label; there are only three unique groups in this dataset) to the third-smallest 'Date_Time_2' not already used in the previous groups, and if it then finds another 'Date_Time_2' for a new 'Date_Time_1', it should assign 'Third_A', 'Third_B', and so on.
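A minimal snippet to rebuild a slice of the sample data above (one ID, three groups), in case someone wants to run the examples directly:

import pandas as pd

# One ID and three groups, taken from the first rows of the table above.
# The desired New_Column from the table is omitted; it is what we want to compute.
df = pd.DataFrame({
    "ID": [123] * 9,
    "Group": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "Date_Time_1": ["14-10-2021 15:19"] * 3 + ["14-10-2021 15:21"] * 3 + ["14-10-2021 15:22"] * 3,
    "Date_Time_2": ["14-10-2021 15:32", "14-10-2021 15:36", "14-10-2021 15:37"] * 3,
    "Difference": [13, 17, 18, 11, 15, 16, 10, 14, 15],
})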
I have tried the following code, but it does not work:
df.drop('New_Column', axis = 1, inplace = True)
df['New_Column'] = pd.Series()
for i, v in df['Difference'].items():
    a = 0
    b = 1
    diff = df[df['Group'] == df['Group'].unique()[a]]['Difference'].nsmallest(b).min()
    if diff == v:
        df.loc[i, 'New_Column'] = "Yes"
        b = b + 1
        a = a + 1
Any help here would be great!

A:

First, make sure the CSV values are read in correctly, i.e. the datetime values should be parsed properly, for example:
date_parse = lambda x : pd.to_datetime(x, format="%d-%m-%Y %H:%M")
df = pd.read_csv('filename.csv', parse_dates=['Date_Time_1','Date_Time_2'], date_parser= date_parse)
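Note: in pandas 2.0+ the date_parser argument is deprecated; if you are on a newer version, a roughly equivalent sketch (same hypothetical filename) uses the date_format argument instead:

# Sketch assuming pandas >= 2.0, where date_format replaces date_parser.
df = pd.read_csv(
    'filename.csv',
    parse_dates=['Date_Time_1', 'Date_Time_2'],
    date_format="%d-%m-%Y %H:%M",
)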
If you already have the dataframe, you can use the following to parse the datetime columns inside the dataframe:

df['Date_Time_1'] = pd.to_datetime(df['Date_Time_1'], format="%d-%m-%Y %H:%M")
df['Date_Time_2'] = pd.to_datetime(df['Date_Time_2'], format="%d-%m-%Y %H:%M")
Now just iterate over the distinct groups, take the sorted unique values of the Date_Time_2 column within each group and pick the appropriate index, i.e. index 0 for group 'A', index 1 for group 'B', and so on; then select the matching rows of the dataframe and update the value in the new column:

df['New_Column'] = 'NA'
for index, group in enumerate(df['Group'].unique()):
    # index-th smallest unique Date_Time_2 within this group
    unique_time = sorted(df[df['Group'] == group]['Date_Time_2'].unique())[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = index
print(df)
Note: it is much easier to assign numbers than words like 'first' or 'second'; if you need the words, create a list and assign the values from it by index, like this:

df['New_Column'] = 'NA'
number_as_string = ['first', 'second', 'third']
for index, group in enumerate(df['Group'].unique()):
    unique_time = sorted(df[df['Group'] == group]['Date_Time_2'].unique())[index]
    df.loc[(df['Group'] == group) & (df['Date_Time_2'] == unique_time), 'New_Column'] = number_as_string[index]
print(df)
A:

You could try the following:
from string import ascii_uppercase as letters

df["Date_Time_2"] = pd.to_datetime(df["Date_Time_2"])
for n, (_, gdf) in enumerate(df.sort_values("Date_Time_2").groupby("Group")):
    # subgroup number of each row within the group, in order of Date_Time_2
    nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
    df.loc[gdf[nths == n].index, "New"] = str(n + 1)
# after the loop, n, gdf and nths still refer to the last group:
# its remaining rows get a letter suffix
for i, c in zip(gdf[nths > n].index, letters):
    df.at[i, "New"] = f"{n + 1}_{c}"
- First, make sure the Date_Time_2 column contains datetimes.
- Then sort df by Date_Time_2 and group by Group.
- Then, within each group, determine the indices of the rows that belong to the n-th Date_Time_2 subgroup (counting from 0) and set n + 1 on the corresponding rows of the New column (the sketch after this list shows what the subgroup numbering looks like).
- Then take the last group and add the lettered values to the New column.
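To illustrate the subgroup numbering used above, here is a tiny made-up example (timestamps chosen arbitrarily, not from the post):

import pandas as pd

# Two distinct timestamps, one of them duplicated.
g = pd.DataFrame({
    "Date_Time_2": pd.to_datetime(
        ["2021-10-14 15:36", "2021-10-14 15:32", "2021-10-14 15:36"]
    )
})

# ngroup() numbers the Date_Time_2 subgroups in sorted order, so rows sharing a
# timestamp get the same number: 15:32 -> 0, both 15:36 rows -> 1.
print(g.groupby("Date_Time_2", as_index=False).ngroup())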
Maybe you have to replace that last part with

for k, c in zip(range(n + 1, nths.max() + 1), letters):
    df.loc[gdf[nths == k].index, "New"] = f"{n + 1}_{c}"

if the lettered values should also be grouped, i.e. rows that share the same Date_Time_2 get the same letter.
Result for the sample in the question:
ID Group Date_Time_1 Date_Time_2 Difference New_Column New
0 123 A 14-10-2021 15:19 2021-10-14 15:32:00 13 First 1
1 123 A 14-10-2021 15:19 2021-10-14 15:36:00 17 NaN NaN
2 123 A 14-10-2021 15:19 2021-10-14 15:37:00 18 NaN NaN
3 123 A 14-10-2021 15:19 2021-10-14 16:29:00 70 NaN NaN
4 123 A 14-10-2021 15:19 2021-10-14 17:04:00 105 NaN NaN
5 123 B 14-10-2021 15:21 2021-10-14 15:32:00 11 NaN NaN
6 123 B 14-10-2021 15:21 2021-10-14 15:36:00 15 Second 2
7 123 B 14-10-2021 15:21 2021-10-14 15:37:00 16 NaN NaN
8 123 B 14-10-2021 15:21 2021-10-14 16:29:00 68 NaN NaN
9 123 B 14-10-2021 15:21 2021-10-14 17:04:00 103 NaN NaN
10 123 C 14-10-2021 15:22 2021-10-14 15:32:00 10 NaN NaN
11 123 C 14-10-2021 15:22 2021-10-14 15:36:00 14 NaN NaN
12 123 C 14-10-2021 15:22 2021-10-14 15:37:00 15 Third 3
13 123 C 14-10-2021 15:23 2021-10-14 16:29:00 67 Third_A 3_A
14 123 C 14-10-2021 15:48 2021-10-14 17:04:00 102 Third_B 3_B
15 789 A 14-10-2021 15:19 2021-10-14 15:32:00 13 First 1
16 789 A 14-10-2021 15:19 2021-10-14 15:36:00 17 NaN NaN
17 789 B 14-10-2021 15:21 2021-10-14 15:32:00 11 NaN NaN
18 789 B 14-10-2021 15:21 2021-10-14 15:36:00 15 Second 2
19 789 C 14-10-2021 15:22 2021-10-14 15:32:00 10 NaN NaN
If the whole thing has to be done per ID group, then you can try
...
for _, df_id in df.sort_values("Date_Time_2").groupby("ID"):
    for n, (_, gdf) in enumerate(df_id.groupby("Group")):
        nths = gdf.groupby("Date_Time_2", as_index=False).ngroup()
        df.loc[gdf[nths == n].index, "New"] = str(n + 1)
    # letter the remaining rows of the last group of this ID
    for i, c in zip(gdf[nths > n].index, letters):
        df.at[i, "New"] = f"{n + 1}_{c}"
instead.
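If the word labels from the question are wanted rather than the numbers in New, a small sketch (assuming at most three groups, as in the sample) can rename them afterwards:

import pandas as pd

def to_word(tag):
    # "1" -> "First", "3_A" -> "Third_A"; missing values stay as they are.
    if pd.isna(tag):
        return tag
    num, _, suffix = tag.partition("_")
    word = {"1": "First", "2": "Second", "3": "Third"}[num]
    return f"{word}_{suffix}" if suffix else word

df["New"] = df["New"].map(to_word)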
It looks like you are trying to do a "dense" rank per group?

This could probably be simplified, but something like:
group = df.groupby(["ID", "Group"])
df1 = df.assign(
    group_id = group.ngroup(),
    rank = group["Date_Time_2"].rank(method="dense"),
)
# Get ranks per group
# A 1, A 2, A 3, B 4, B 5 -> A 1, A 2, A 3, B 1, B 2
df1 = df1.assign(group_id = df1.groupby("ID")["group_id"].rank(method="dense"))
df1 = df1.assign(last_group_id = df1.groupby("ID")["group_id"].transform("max"))
# Keep only 1st for 1st group - 2nd for 2nd group
# "OR" > Nth for last group
df1.loc[
    (df1["group_id"] == df1["rank"]) |
    ((df1["group_id"] == df1["last_group_id"]) & (df1["rank"] > df1["group_id"])),
    "New_Column"
] = df1["rank"]
ID Group Date_Time_1 Date_Time_2 Difference group_id rank last_group_id New_Column
0 123 A 2021-10-14 15:19:00 2021-10-14 15:32:00 13 1.0 1.0 3.0 1.0
1 123 A 2021-10-14 15:19:00 2021-10-14 15:36:00 17 1.0 2.0 3.0 NaN
2 123 A 2021-10-14 15:19:00 2021-10-14 15:37:00 18 1.0 3.0 3.0 NaN
3 123 A 2021-10-14 15:19:00 2021-10-14 16:29:00 70 1.0 4.0 3.0 NaN
4 123 A 2021-10-14 15:19:00 2021-10-14 17:04:00 105 1.0 5.0 3.0 NaN
5 123 B 2021-10-14 15:21:00 2021-10-14 15:32:00 11 2.0 1.0 3.0 NaN
6 123 B 2021-10-14 15:21:00 2021-10-14 15:36:00 15 2.0 2.0 3.0 2.0
7 123 B 2021-10-14 15:21:00 2021-10-14 15:37:00 16 2.0 3.0 3.0 NaN
8 123 B 2021-10-14 15:21:00 2021-10-14 16:29:00 68 2.0 4.0 3.0 NaN
9 123 B 2021-10-14 15:21:00 2021-10-14 17:04:00 103 2.0 5.0 3.0 NaN
10 123 C 2021-10-14 15:22:00 2021-10-14 15:32:00 10 3.0 1.0 3.0 NaN
11 123 C 2021-10-14 15:22:00 2021-10-14 15:36:00 14 3.0 2.0 3.0 NaN
12 123 C 2021-10-14 15:22:00 2021-10-14 15:37:00 15 3.0 3.0 3.0 3.0
13 123 C 2021-10-14 15:23:00 2021-10-14 16:29:00 67 3.0 4.0 3.0 4.0
14 123 C 2021-10-14 15:48:00 2021-10-14 17:04:00 102 3.0 5.0 3.0 5.0
15 789 A 2021-10-14 15:19:00 2021-10-14 15:32:00 13 1.0 1.0 3.0 1.0
16 789 A 2021-10-14 15:19:00 2021-10-14 15:36:00 17 1.0 2.0 3.0 NaN
17 789 B 2021-10-14 15:21:00 2021-10-14 15:32:00 11 2.0 1.0 3.0 NaN
18 789 B 2021-10-14 15:21:00 2021-10-14 15:36:00 15 2.0 2.0 3.0 2.0
19 789 C 2021-10-14 15:22:00 2021-10-14 15:32:00 10 3.0 1.0 3.0 NaN
Here 4.0 = Third_A and ... 5.0 = Third_B.

Is this what you are trying to achieve?
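If the string tags are needed, a minimal follow-up sketch (using df1 and last_group_id from above; the word list is an assumption for three groups) could map the numeric New_Column values to the labels from the question:

import pandas as pd
from string import ascii_uppercase

def to_tag(row):
    # 1.0 -> "First", 2.0 -> "Second", 3.0 -> "Third";
    # ranks beyond the last group's own rank get a letter: 4.0 -> "Third_A", 5.0 -> "Third_B".
    rank = row["New_Column"]
    if pd.isna(rank):
        return rank
    words = {1: "First", 2: "Second", 3: "Third"}
    last = int(row["last_group_id"])
    if rank <= last:
        return words[int(rank)]
    return f"{words[last]}_{ascii_uppercase[int(rank) - last - 1]}"

df1["New_Column"] = df1.apply(to_tag, axis=1)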