Using a lambda function, how to iterate over a column holding list values in a pandas DataFrame

Asked by: Pete Asked: 11/13/2023 Updated: 11/13/2023 Views: 57

Q:

import pandas as pd

mydata = {"Key" : [567, 568, 569, 570, 571, 572] , "Sprint" : ["Max1;Max2", "Max2", "DI001 2", "DI001 25", "DAS 100" , "DI001 101"]}

df = pd.DataFrame(mydata)
df["sprintlist"] = df["Sprint"].str.split(";")
print(df)

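From this DataFrame, for every string in each list of the "sprintlist" column, I want to extract the number that appears at the end of the string and collect those numbers into a new list column, "Sprintnumb".

The expected output was posted as an image in the original question.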

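In one of my earlier questions I learned how to extract the number when the "Sprint" column holds only a single value per row (see the line below). I tried using a lambda function to get the desired output for the list column, but it fails with the error "'str' object has no attribute 'str'".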

 df["Sprint Number"] = df.Sprint.str.extract(r"(\d+)$").astype(int)
python pandas string lambda


A:

1 upvote jezrael 11/13/2023 #1

Use Series.explode together with Series.str.extractall, convert the matches to integers, and aggregate them back into one list per original row:

df["Sprint Number"] = (df["sprintlist"].explode()
                                       .str.extractall(r"(\d+)$")[0]
                                       .astype(int)
                                       .groupby(level=0)
                                       .agg(list))
print (df)
   Key     Sprint    sprintlist Sprint Number
0  567  Max1;Max2  [Max1, Max2]        [1, 2]
1  568       Max2        [Max2]           [2]
2  569    DI001 2     [DI001 2]           [2]
3  570   DI001 25    [DI001 25]          [25]
4  571    DAS 100     [DAS 100]         [100]
5  572  DI001 101   [DI001 101]         [101]
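
To make the chained call above easier to follow, here is a minimal sketch that runs the same steps one at a time on a shortened version of the question's data. The intermediate variable names (`exploded`, `numbers`) are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Key": [567, 568, 569],
    "Sprint": ["Max1;Max2", "Max2", "DI001 25"],
})
df["sprintlist"] = df["Sprint"].str.split(";")

# Step 1: one row per list element; the original row label is repeated.
exploded = df["sprintlist"].explode()

# Step 2: extractall returns a DataFrame with one column per capture group
# (column 0 here) and an extra "match" index level.
numbers = exploded.str.extractall(r"(\d+)$")[0].astype(int)

# Step 3: group on the original row label (index level 0) and rebuild lists.
df["Sprint Number"] = numbers.groupby(level=0).agg(list)

print(df["Sprint Number"].tolist())  # [[1, 2], [2], [25]]
```

Note that a row in which no element ends with digits produces no rows in `extractall`, so after the index-aligned assignment it would hold NaN rather than an empty list.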

Or use a list comprehension with regex:

import re

df["Sprint Number"] = [[int(re.search(r'(\d+)$', y).group(0)) for y in x]
                        for x in df["sprintlist"]]
print (df)
   Key     Sprint    sprintlist Sprint Number
0  567  Max1;Max2  [Max1, Max2]        [1, 2]
1  568       Max2        [Max2]           [2]
2  569    DI001 2     [DI001 2]           [2]
3  570   DI001 25    [DI001 25]          [25]
4  571    DAS 100     [DAS 100]         [100]
5  572  DI001 101   [DI001 101]         [101]

If some strings might not end with a number, add the assignment expression operator := (Python 3.8+) with a test for None:

import re

mydata = {"Key" : [567, 568, 569, 570, 571, 572] , 
          "Sprint" : ["Max1;Max", "Max2", "DI001 2", "DI001 25", "DAS 100" , "DI001 101"]}

df = pd.DataFrame(mydata)
df ["sprintlist"]= df["Sprint"].str.split(";")

# Keep only the elements whose string ends with digits
df["Sprint Number"] = [[int(m.group(0))
                        for y in x if (m := re.search(r'(\d+)$', y)) is not None]
                       for x in df["sprintlist"]]
print (df)
   Key     Sprint   sprintlist Sprint Number
0  567   Max1;Max  [Max1, Max]           [1]
1  568       Max2       [Max2]           [2]
2  569    DI001 2    [DI001 2]           [2]
3  570   DI001 25   [DI001 25]          [25]
4  571    DAS 100    [DAS 100]         [100]
5  572  DI001 101  [DI001 101]         [101]
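
Since the question asks specifically for a lambda, the same logic can also be sketched with Series.apply. The error quoted in the question is consistent with calling the pandas `.str` accessor inside the lambda: there, each element is a plain Python list whose items are plain `str` objects, so the `re` module is used instead. This is a sketch under that assumption, not part of the original answer, and the name `TRAILING_DIGITS` is hypothetical.

```python
import re

import pandas as pd

# Hypothetical name; compiled once so the lambda stays short.
TRAILING_DIGITS = re.compile(r"(\d+)$")

df = pd.DataFrame({
    "Key": [567, 568, 569],
    "Sprint": ["Max1;Max", "Max2", "DI001 25"],
})
df["sprintlist"] = df["Sprint"].str.split(";")

# Each `items` is a plain Python list of str, so use `re`, not `.str`.
# Elements that do not end with digits are skipped.
df["Sprint Number"] = df["sprintlist"].apply(
    lambda items: [
        int(m.group(1))
        for item in items
        if (m := TRAILING_DIGITS.search(item)) is not None
    ]
)

print(df["Sprint Number"].tolist())  # [[1], [2], [25]]
```

The list comprehension above and this `apply` call do the same per-row work; `apply` mainly offers the lambda form the question asked about.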
    
0 upvotes mozway 11/13/2023 #2

Use str.findall with a lookahead:

df['Sprint'].str.findall(r'\d+(?=$|\s*;)')

Or, for a custom output format (converting to int or joining into a string):

import re

pat = re.compile(r'\d+(?=$|\s*;)')

df['Sprintbumb'] = [';'.join(pat.findall(s)) for s in df['Sprint']]

# or
df['Sprintbumb'] = [list(map(int, pat.findall(s))) for s in df['Sprint']]

Output:

   Key     Sprint Sprintbumb
0  567  Max1;Max2     [1, 2]
1  568       Max2        [2]
2  569    DI001 2        [2]
3  570   DI001 25       [25]
4  571    DAS 100      [100]
5  572  DI001 101      [101]
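
To see what the lookahead accepts and rejects, the pattern can be tried on a few strings with the standard `re` module alone. This is a small illustrative sketch; the last sample string (with a space before the semicolon) is made up to exercise the `\s*;` branch and does not come from the question's data.

```python
import re

# Digits that are followed by end-of-string, or by optional whitespace and ";".
pat = re.compile(r"\d+(?=$|\s*;)")

samples = ["Max1;Max2", "DI001 25", "DAS 100", "Max1 ;Max"]
results = {s: pat.findall(s) for s in samples}

for s, found in results.items():
    print(f"{s!r:12} -> {found}")
# 'Max1;Max2'  -> ['1', '2']
# 'DI001 25'   -> ['25']
# 'DAS 100'    -> ['100']
# 'Max1 ;Max'  -> ['1']
```

The "001" in "DI001 25" is not returned because it is followed by a space and more text, not by the end of the string or a semicolon.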
