Create new column in dataframe based on partial string matching other column

Asked by: user2948714 · Asked: 11/3/2013 · Last edited by: zx8754 · Updated: 11/13/2021 · Views: 34404

Q:

I have a dataframe with 2 columns, GL and GLDESC, and want to add a 3rd column called KIND based on some data inside the column GLDESC.

The dataframe is as follows:

      GL                             GLDESC
1 515100         Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300                           Bulk Gas
4 539991                     Area Charge In
5 551000        Repairs & Maint-Spare Parts
6 551100                 Supplies-Operating
7 551300                        Consumables

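For each row of the data table:

  • If GLDESC contains the word Payroll anywhere in the string, then I want KIND to be Payroll
  • If GLDESC contains the word Gas anywhere in the string, then I want KIND to be Materials
  • In all other cases I want KIND to be Other

I looked for similar examples on stackoverflow but could not find any. I also looked in R for Dummies at switch, grep, apply and regular expressions, trying to match only part of the GLDESC column and then fill the KIND column with the kind of account, but I was unable to make it work.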

Tags: regex, string, r, match, partial


A:

31 votes · alexis_laz · 11/3/2013 · #1

Since you have only two conditions, you can use a nested ifelse:

#random data; it wasn't easy to copy-paste yours  
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10), 
  c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12", 
     "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))

DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials", 
         ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))

DF
#   GL         GLDESC      KIND
#1   8        e gas l Materials
#2   1  c payroll12 y   Payroll
#3  10      m GaSer v Materials
#4   6       t asdf n     Other
#5   2      w qweaa t     Other
#6   4 r PayROll-12 q   Payroll
#7   9      n asdfg a     Other
#8   5     d GAS--2 w Materials
#9   7     s fghfgh e     Other
#10  3      g qweee k     Other
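
As a sketch of how the nested ifelse above would apply to the data in the question: the data frame below is retyped by hand from the post, and the case-sensitive, fixed-string matching is an assumption made here (the answer itself uses ignore.case = TRUE).

```r
# Data retyped from the question; stringsAsFactors = FALSE keeps GLDESC as character
DF <- data.frame(
  GL = c(515100, 515900, 532300, 539991, 551000, 551100, 551300),
  GLDESC = c("Payroll-Indir Salary Labor",
             "Payroll-Indir Compensated Absences",
             "Bulk Gas",
             "Area Charge In",
             "Repairs & Maint-Spare Parts",
             "Supplies-Operating",
             "Consumables"),
  stringsAsFactors = FALSE
)

# The first condition that matches decides KIND; rows matching neither get "Other".
# fixed = TRUE treats the pattern as a literal substring rather than a regex.
DF$KIND <- ifelse(grepl("Payroll", DF$GLDESC, fixed = TRUE), "Payroll",
           ifelse(grepl("Gas", DF$GLDESC, fixed = TRUE), "Materials", "Other"))

print(DF)
```

With this data, rows 1 and 2 should come out as Payroll, row 3 as Materials, and the rest as Other.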

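Edit 10/3/2016 (after receiving more attention than expected)

A possible solution for handling more patterns is to iterate over all the patterns and, whenever there is a match, progressively reduce the number of elements still to be compared: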

ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))    
    empty = seq_along(x)

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]  
        empty = empty[!greps]
    }

    return(ans)
}

ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll"   "Materials" "Other"     "Other"     "Payroll"   "Other"     "Materials" "Other"     "Other"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat1a|pat1b", "pat2", "pat3"), 
   c("1", "2", "3"), fill = "empty")
#[1] "1"     "1"     "3"     "empty"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat2", "pat1a|pat1b", "pat3"), 
   c("2", "1", "3"), fill = "empty")
#[1] "2"     "1"     "3"     "empty"

Comments

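0 votes · Manasi Shah · 3/21/2017
Hi! Thanks for sharing this regex, it is very useful. If we have to define a second variable instead of kind, i.e. I want another string, labor, to be replaced with xyz in a new variable newkind, do we have to run the regex separately for each such new variable? Also, you did not define kind in the latest iterative function. Thanks!
0 votes · alexis_laz · 3/21/2017
@ManasiShah : Do you mean something like DF$new = ff(DF$GLDESC, "labor", "xyz", "Other", ignore.case = TRUE)? The last function (ff) is a more general function whose result can be assigned to the "data.frame" in a second step. Or am I missing your point?

1 vote · tjebo · 11/13/2021 · #2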

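I personally like matching by index. You can loop grep over your new labels to get the indices of the partial matches, then use these together with a lookup table to simply reassign the values.

If you want to create new labels, use a named vector.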

DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
  c(
    "gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
    "asdfg", "GAS--2", "fghfgh", "qweee"
  ), sample(letters, 10),
  sep = " "
))


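# For each label, grep returns the row indices of the partial matches;
# stack() turns the named list into a lookup table (values = index, ind = label)
lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))

DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)  # matched rows get their label
DF$KIND[-lu$values] <- "Other"              # every remaining row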

DF
#>    GL         GLDESC     KIND
#> 1   6        x gas f Material
#> 2   3  t payroll12 q  Payroll
#> 3   5      a GaSer h Material
#> 4   4       s asdf x    Other
#> 5   1      m qweaa y    Other
#> 6  10 y PayROll-12 r  Payroll
#> 7   7      g asdfg a    Other
#> 8   2     k GAS--2 i Material
#> 9   9     e fghfgh j    Other
#> 10  8      l qweee p    Other

Created on 2021-11-13 by the reprex package (v2.0.1)
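
A hedged variant of the index lookup above, not part of the original answer. It uses lapply instead of sapply so that the result is always a list (sapply may simplify equal-length results into a vector or matrix), and it fills KIND with "Other" first so that nothing depends on negative indexing, which selects no rows at all when there are no matches. The data frame is retyped from the question, and the name patterns is only illustrative.

```r
# Descriptions retyped from the question
DF <- data.frame(
  GLDESC = c("Payroll-Indir Salary Labor",
             "Payroll-Indir Compensated Absences",
             "Bulk Gas",
             "Area Charge In",
             "Repairs & Maint-Spare Parts",
             "Supplies-Operating",
             "Consumables"),
  stringsAsFactors = FALSE
)

# Named vector: names are the new labels, values are the patterns to look for
patterns <- c(Materials = "gas", Payroll = "payroll")

# One integer vector of matching row indices per label, always kept as a list
idx <- lapply(patterns, grep, x = DF$GLDESC, ignore.case = TRUE)

# Lookup table: 'values' holds row indices, 'ind' holds the label
lu <- stack(idx)

# Default first, then overwrite only the matched rows
DF$KIND <- "Other"
DF$KIND[lu$values] <- as.character(lu$ind)

print(DF)
```

If a row matches more than one pattern, the label of the pattern listed later in patterns is the one kept, because later assignments overwrite earlier ones.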