Asked by: user2948714 Asked: 11/3/2013 Last edited by: zx8754 Updated: 11/13/2021 Views: 34404
Create new column in dataframe based on partial string matching other column
Q:
I have a dataframe with 2 columns, GL and GLDESC, and want to add a 3rd column called KIND based on some data inside column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
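For each row of the data table:
- If GLDESC contains the word Payroll anywhere in the string, then I want KIND to be Payroll
- If GLDESC contains the word Gas anywhere in the string, then I want KIND to be Materials
- In all other cases, I want KIND to be Other
I looked for similar examples on stackoverflow but could not find any. I also looked in R for Dummies at switch, grep, apply and regular expressions to try to match only part of the GLDESC column and then fill the KIND column with the kind of account, but was unable to make it work.
A:
31 votes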
alexis_laz
11/3/2013
#1
Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
  c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
    "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = TRUE), "Materials",
  ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
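For reference, here is a minimal sketch of the same nested ifelse applied to the data from the question. The data frame is retyped by hand from the question, so treat the exact values as an assumption. `stringsAsFactors = FALSE` is only there so the example behaves the same on R versions before 4.0.

```r
# Data retyped from the question (assumption: GLDESC is a plain character column)
DF <- data.frame(
  GL = c(515100, 515900, 532300, 539991, 551000, 551100, 551300),
  GLDESC = c("Payroll-Indir Salary Labor",
             "Payroll-Indir Compensated Absences",
             "Bulk Gas",
             "Area Charge In",
             "Repairs & Maint-Spare Parts",
             "Supplies-Operating",
             "Consumables"),
  stringsAsFactors = FALSE
)

# "Payroll" anywhere in GLDESC -> "Payroll"; otherwise "Gas" -> "Materials";
# everything else -> "Other"
DF$KIND <- ifelse(grepl("Payroll", DF$GLDESC, ignore.case = TRUE), "Payroll",
           ifelse(grepl("Gas", DF$GLDESC, ignore.case = TRUE), "Materials", "Other"))

DF$KIND
# rows 1-2 become "Payroll", row 3 becomes "Materials", rows 4-7 become "Other"
```

The outer test is checked first, so a description containing both words would be labelled by whichever condition appears first in the nesting.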
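EDIT October 3, 2016 (..after receiving more attention than expected)
A possible solution for handling more patterns is to iterate over all patterns and, whenever there is a match, progressively reduce the number of comparisons: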
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))  # start with every element set to `fill`
    empty = seq_along(x)                          # indices not yet matched by any pattern

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)  # test only the still-unmatched elements
        ans[empty[greps]] = replacements[[i]]
        empty = empty[!greps]                        # drop the newly matched indices
    }

    return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
   c("pat1a|pat1b", "pat2", "pat3"),
   c("1", "2", "3"), fill = "empty")
#[1] "1"     "1"     "3"     "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
   c("pat2", "pat1a|pat1b", "pat3"),
   c("2", "1", "3"), fill = "empty")
#[1] "2"     "1"     "3"     "empty"
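Since `ff` forwards `...` to `grepl`, other `grepl` options can be passed the same way as `ignore.case`. The sketch below assumes patterns that contain regex metacharacters and uses `fixed = TRUE` to match them literally. The input strings and labels are made up for illustration, and `ff` is repeated only so the block runs on its own.

```r
# Same function as above, repeated so this block is self-contained
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))
    ans = rep_len(as.character(fill), length(x))
    empty = seq_along(x)
    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]
        empty = empty[!greps]
    }
    return(ans)
}

# Hypothetical inputs: "(net)" and "A.B" mean something different as regexes
x = c("Fee (net)", "Fee net", "A.B", "AxB")

# As regular expressions: "(net)" is a group matching "net", "." matches any character
res_regex = ff(x, c("(net)", "A.B"), c("net fee", "dotted"), fill = "Other")
res_regex
# elements 1-2 become "net fee", elements 3-4 become "dotted"

# As literal strings: only exact substrings match
res_fixed = ff(x, c("(net)", "A.B"), c("net fee", "dotted"), fill = "Other", fixed = TRUE)
res_fixed
# element 1 becomes "net fee", element 3 becomes "dotted", the others become "Other"
```

Note that `grepl` ignores `ignore.case` (with a warning) when `fixed = TRUE`, so the two options should not be combined.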
Comments
0 votes
Manasi Shah
3/21/2017
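Hi! Thanks for sharing this regex, it is very useful. If we want to define a second variable instead of kind, i.e. I want labor to be replaced by another string xyz in a new variable new, do we have to run the regex separately for each such new variable? Also, you have not defined kind in the latest iterative function. Thanks!
0 votes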
alexis_laz
3/21/2017
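@ManasiShah : Do you mean something like DF$new = ff(DF$GLDESC, "labor", "xyz", "Other", ignore.case = TRUE)? The last function (ff) is a more general function whose result can be assigned, in a second step, to a column of the "data.frame". Or am I missing your point?
1 vote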
tjebo
11/13/2021
#2
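I personally like matching by index. You can loop grep over your new labels to get the indices of the partial matches, then use these with a lookup table to simply reassign the values.
If you want to create new labels, use a named vector.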
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
  c(
    "gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
    "asdfg", "GAS--2", "fghfgh", "qweee"
  ), sample(letters, 10),
  sep = " "
))
# lapply (rather than sapply) always returns a list, which is what stack() needs
lu <- stack(lapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))
# start from "Other", then overwrite the rows matched by each label
DF$KIND <- "Other"
DF$KIND[lu$values] <- as.character(lu$ind)
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)
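One thing to watch with index matching is a row that matches more than one pattern: its index then appears several times in the lookup table, and the last assignment wins. The sketch below uses a made-up vector to show this, plus one possible way to make the first label win instead (dropping repeated indices with `duplicated`). It uses `lapply` so the result is always a list; the variable names are illustrative only.

```r
# Hypothetical descriptions; the first one matches both patterns
x <- c("gas payroll", "gas", "payroll", "misc")

# Lookup table: one row per (matched index, label) pair
lu <- stack(lapply(c(Material = "gas", Payroll = "payroll"), grep, x = x, ignore.case = TRUE))

# Plain assignment: for a repeated index, the later label overwrites the earlier one
kind_last <- rep("Other", length(x))
kind_last[lu$values] <- as.character(lu$ind)
kind_last
# element 1 ends up as "Payroll"

# Keep only the first occurrence of each index, so the first label wins
lu_first <- lu[!duplicated(lu$values), ]
kind_first <- rep("Other", length(x))
kind_first[lu_first$values] <- as.character(lu_first$ind)
kind_first
# element 1 ends up as "Material"
```

Which behaviour is right depends on how the labels should be prioritised, so the order of the named vector matters in either case.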