Asked by: user1165199 Asked: 1/29/2013 Last edited by: Jaap, user1165199 Updated: 4/28/2023 Views: 159771
Paste multiple columns together
Q:
I have a bunch of columns in a data frame that I want to paste together (separated by "-"), as follows:
data <- data.frame('a' = 1:3,
'b' = c('a','b','c'),
'c' = c('d', 'e', 'f'),
'd' = c('g', 'h', 'i'))
i.e.
a b c d
1 a d g
2 b e h
3 c f i
which I want to become:
a x
1 a-d-g
2 b-e-h
3 c-f-i
I could normally do this with:
within(data, x <- paste(b,c,d,sep='-'))
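and then remove the old columns, but unfortunately I do not know the names of the columns specifically, only a collective name for all of them, e.g. I would know that cols <- c('b','c','d')
Does anyone know a way of doing this?
A: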
# your starting data..
data <- data.frame('a' = 1:3, 'b' = c('a','b','c'), 'c' = c('d', 'e', 'f'), 'd' = c('g', 'h', 'i'))
# columns to paste together
cols <- c( 'b' , 'c' , 'd' )
# create a new column `x` with the three columns collapsed together
data$x <- apply( data[ , cols ] , 1 , paste , collapse = "-" )
# remove the unnecessary columns
data <- data[ , !( names( data ) %in% cols ) ]
library(plyr)
ldply(apply(data, 1, function(x) data.frame(
x = paste(x[2:4],sep="",collapse="-"))))
# x
#1 a-d-g
#2 b-e-h
#3 c-f-i
# and with just the vector of names you have:
ldply(apply(data, 1, function(x) data.frame(
x = paste(x[c('b','c','d')],sep="",collapse="-"))))
# or equally:
mynames <-c('b','c','d')
ldply(apply(data, 1, function(x) data.frame(
x = paste(x[mynames],sep="",collapse="-"))))
I would construct a new data.frame:
d <- data.frame('a' = 1:3, 'b' = c('a','b','c'), 'c' = c('d', 'e', 'f'), 'd' = c('g', 'h', 'i'))
cols <- c( 'b' , 'c' , 'd' )
data.frame(a = d[, 'a'], x = do.call(paste, c(d[ , cols], list(sep = '-'))))
As a variant on baptiste's answer, with data defined as you have it and the columns you want to put together defined in cols:
cols <- c("b", "c", "d")
you can add the new column to data and delete the old ones with
data$x <- do.call(paste, c(data[cols], sep="-"))
for (co in cols) data[co] <- NULL
which gives
> data
a x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
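The do.call idiom used in the answers above can look opaque at first. The following is a minimal sketch of what it does, assuming the example data from the question: do.call hands the list of columns plus sep to paste, so it builds the same call you would otherwise write by hand. The helper name paste_cols and its arguments are hypothetical and not part of any answer; stringsAsFactors = FALSE is set only to keep the example independent of the R version.

```r
# Example data from the question
data <- data.frame(a = 1:3,
                   b = c("a", "b", "c"),
                   c = c("d", "e", "f"),
                   d = c("g", "h", "i"),
                   stringsAsFactors = FALSE)
cols <- c("b", "c", "d")

# data[cols] is a list of columns; appending sep = "-" yields the argument
# list for paste(), so the two calls below are equivalent.
via_do_call <- do.call(paste, c(data[cols], sep = "-"))
by_hand     <- paste(data$b, data$c, data$d, sep = "-")

# Hypothetical helper: paste the columns in `cols` into a new column `new`
# and drop the originals. unname() keeps a column that happens to be called
# "sep" or "collapse" from being matched as an argument of paste().
paste_cols <- function(df, cols, new = "x", sep = "-") {
  df[[new]] <- do.call(paste, c(unname(as.list(df[cols])), sep = sep))
  df[setdiff(names(df), cols)]
}

result <- paste_cols(data, cols)
print(result)
```

Because only the character vector cols is needed, the same helper works when the column names are not known until run time, which is the situation described in the question.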
Using the tidyr package, this can easily be handled in one function call.
data <- data.frame('a' = 1:3,
'b' = c('a','b','c'),
'c' = c('d', 'e', 'f'),
'd' = c('g', 'h', 'i'))
tidyr::unite_(data, paste(colnames(data)[-1], collapse="_"), colnames(data)[-1])
a b_c_d
1 1 a_d_g
2 2 b_e_h
3 3 c_f_i
Edit: exclude the first column, and everything else gets pasted.
# tidyr_0.6.3
unite(data, newCol, -a)
# or by column index unite(data, newCol, -1)
# a newCol
# 1 1 a_d_g
# 2 2 b_e_h
# 3 3 c_f_i
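Just to add an additional solution with Reduce, which is probably slower than do.call but probably better than apply because it avoids the matrix conversion. Also, instead of a for loop, we can use setdiff to remove the unwanted columns: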
cols <- c('b','c','d')
data$x <- Reduce(function(...) paste(..., sep = "-"), data[cols])
data[setdiff(names(data), cols)]
# a x
# 1 1 a-d-g
# 2 2 b-e-h
# 3 3 c-f-i
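The matrix conversion mentioned above is more than a speed issue. As a small sketch on hypothetical data (the data frame df below is not from the question): when a data frame mixes numeric and character columns, apply first converts it with as.matrix, which formats the numeric column to a common width, so shorter numbers come out with leading spaces. Reduce and do.call paste the columns as they are.

```r
# Hypothetical example data with one numeric and one character column
df <- data.frame(n = c(1, 10, 100),
                 s = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# apply() goes through as.matrix(), which pads the numbers to equal width
via_apply <- unname(apply(df, 1, paste, collapse = "-"))

# Reduce() pastes the original column vectors, so there is no padding
via_reduce <- Reduce(function(...) paste(..., sep = "-"), df)

print(via_apply)   # "  1-x" " 10-y" "100-z"
print(via_reduce)  # "1-x"   "10-y"  "100-z"
```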
Or we could update data in place using the data.table package (assuming fresh data)
library(data.table)
setDT(data)[, x := Reduce(function(...) paste(..., sep = "-"), .SD[, mget(cols)])]
data[, (cols) := NULL]
data
# a x
# 1: 1 a-d-g
# 2: 2 b-e-h
# 3: 3 c-f-i
Another option is to use .SDcols instead of mget, as in
setDT(data)[, x := Reduce(function(...) paste(..., sep = "-"), .SD), .SDcols = cols]
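Since .SD restricted by .SDcols is a list of the selected columns, the do.call(paste, ...) idiom from the earlier answers should work inside data.table as well. The following is a sketch under the assumption that the data.table package is installed; the object name dt is illustrative.

```r
library(data.table)

dt <- data.table(a = 1:3,
                 b = c("a", "b", "c"),
                 c = c("d", "e", "f"),
                 d = c("g", "h", "i"))
cols <- c("b", "c", "d")

# .SD holds only the columns named in .SDcols; paste them row-wise
dt[, x := do.call(paste, c(.SD, sep = "-")), .SDcols = cols]

# drop the original columns by reference
dt[, (cols) := NULL]

print(dt)
```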
I benchmarked the answers of Anthony Damico, Brian Diggs and data_steve on a small sample tbl_df and got the following results.
> data <- data.frame('a' = 1:3,
+ 'b' = c('a','b','c'),
+ 'c' = c('d', 'e', 'f'),
+ 'd' = c('g', 'h', 'i'))
> data <- tbl_df(data)
> cols <- c("b", "c", "d")
> microbenchmark(
+ do.call(paste, c(data[cols], sep="-")),
+ apply( data[ , cols ] , 1 , paste , collapse = "-" ),
+ tidyr::unite_(data, "x", cols, sep="-")$x,
+ times=1000
+ )
Unit: microseconds
expr min lq mean median uq max neval
do.call(paste, c(data[cols], sep = "-")) 65.248 78.380 93.90888 86.177 99.3090 436.220 1000
apply(data[, cols], 1, paste, collapse = "-") 223.239 263.044 313.11977 289.514 338.5520 743.583 1000
tidyr::unite_(data, "x", cols, sep = "-")$x 376.716 448.120 556.65424 501.877 606.9315 11537.846 1000
However, when I evaluated on my own tbl_df with ~1 million rows and 10 columns, the results were quite different.
> microbenchmark(
+ do.call(paste, c(data[c("a", "b")], sep="-")),
+ apply( data[ , c("a", "b") ] , 1 , paste , collapse = "-" ),
+ tidyr::unite_(data, "c", c("a", "b"), sep="-")$c,
+ times=25
+ )
Unit: milliseconds
expr min lq mean median uq max neval
do.call(paste, c(data[c("a", "b")], sep="-")) 930.7208 951.3048 1129.334 997.2744 1066.084 2169.147 25
apply( data[ , c("a", "b") ] , 1 , paste , collapse = "-" ) 9368.2800 10948.0124 11678.393 11136.3756 11878.308 17587.617 25
tidyr::unite_(data, "c", c("a", "b"), sep="-")$c 968.5861 1008.4716 1095.886 1035.8348 1082.726 1759.349 25
In my opinion the sprintf function deserves a place among these answers as well. You can use sprintf as follows:
do.call(sprintf, c(d[cols], '%s-%s-%s'))
which gives:
[1] "a-d-g" "b-e-h" "c-f-i"
And to create the required data frame:
data.frame(a = d$a, x = do.call(sprintf, c(d[cols], '%s-%s-%s')))
giving:
a x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
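Although sprintf doesn't have a clear advantage over the do.call/paste combination of @BrianDiggs, it is especially useful when you also want to pad certain parts of the desired string or when you want to specify the number of digits. See ?sprintf for the several options.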
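To illustrate the padding and digits point, here is a small sketch on the same example data. It also builds the format string from the length of cols, so it does not need to be hard-coded for three columns. The variable names fmt, pasted, padded and rounded are illustrative, and stringsAsFactors = FALSE is set only to keep the example independent of the R version.

```r
d <- data.frame(a = 1:3,
                b = c("a", "b", "c"),
                c = c("d", "e", "f"),
                d = c("g", "h", "i"),
                stringsAsFactors = FALSE)
cols <- c("b", "c", "d")

# Build "%s-%s-%s" from the number of columns instead of typing it out
fmt <- paste(rep("%s", length(cols)), collapse = "-")
pasted <- do.call(sprintf, c(list(fmt), unname(as.list(d[cols]))))

# Padding: zero-pad the integer to width 3, right-align the letter in width 2
padded <- sprintf("%03d-%2s", d$a, d$b)

# Digits: fix the number of decimals of a numeric value
rounded <- sprintf("%s-%.2f", d$b, d$a / 3)

print(pasted)   # "a-d-g" "b-e-h" "c-f-i"
print(padded)   # "001- a" "002- b" "003- c"
print(rounded)  # "a-0.33" "b-0.67" "c-1.00"
```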
Another variant would be to use pmap from purrr:
pmap(d[2:4], paste, sep = '-')
Note: this pmap solution only works when the columns aren't factors.
A benchmark on a larger dataset:
# create a larger dataset
d2 <- d[sample(1:3,1e6,TRUE),]
# benchmark
library(microbenchmark)
microbenchmark(
docp = do.call(paste, c(d2[cols], sep="-")),
appl = apply( d2[, cols ] , 1 , paste , collapse = "-" ),
tidr = tidyr::unite_(d2, "x", cols, sep="-")$x,
docs = do.call(sprintf, c(d2[cols], '%s-%s-%s')),
times=10)
results in:
Unit: milliseconds
expr min lq mean median uq max neval cld
docp 214.1786 226.2835 297.1487 241.6150 409.2495 493.5036 10 a
appl 3832.3252 4048.9320 4131.6906 4072.4235 4255.1347 4486.9787 10 c
tidr 206.9326 216.8619 275.4556 252.1381 318.4249 407.9816 10 a
docs 413.9073 443.1550 490.6520 453.1635 530.1318 659.8400 10 b
Used data:
d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i'))
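Here is a fairly unconventional (but fast) approach: use fwrite from data.table to "paste" the columns together, and fread to read the result back in. For convenience, I've written the steps as a function called fpaste: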
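library(data.table)  # provides fwrite() and fread()

fpaste <- function(dt, sep = ",") {
  x <- tempfile()
  # write the columns joined by `sep`, without a header line
  fwrite(dt, file = x, sep = sep, col.names = FALSE)
  # read every line back as a single column
  fread(x, sep = "\n", header = FALSE)
}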
Here's an example:
d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i'))
cols = c("b", "c", "d")
fpaste(d[cols], "-")
# V1
# 1: a-d-g
# 2: b-e-h
# 3: c-f-i
How does it perform?
d2 <- d[sample(1:3,1e6,TRUE),]
library(microbenchmark)
microbenchmark(
docp = do.call(paste, c(d2[cols], sep="-")),
tidr = tidyr::unite_(d2, "x", cols, sep="-")$x,
docs = do.call(sprintf, c(d2[cols], '%s-%s-%s')),
appl = apply( d2[, cols ] , 1 , paste , collapse = "-" ),
fpaste = fpaste(d2[cols], "-")$V1,
dt2 = as.data.table(d2)[, x := Reduce(function(...) paste(..., sep = "-"), .SD), .SDcols = cols][],
times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# docp 215.34536 217.22102 220.3603 221.44104 223.27224 225.0906 10
# tidr 215.19907 215.81210 220.7131 220.09636 225.32717 229.6822 10
# docs 281.16679 285.49786 289.4514 286.68738 290.17249 312.5484 10
# appl 2816.61899 3106.19944 3259.3924 3266.45186 3401.80291 3804.7263 10
# fpaste 88.57108 89.67795 101.1524 90.59217 91.76415 197.1555 10
# dt2 301.95508 310.79082 384.8247 316.29807 383.94993 874.4472 10
I know this is an old question, but I thought I should present the simple solution using the paste() function anyway, as suggested by the questioner:
data_1<-data.frame(a=data$a,"x"=paste(data$b,data$c,data$d,sep="-"))
data_1
a x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
Simple and straightforward code with unite from {tidyr} v1.2.0
Solution {tidyr v1.2.0}
library(tidyr)
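data %>% unite("x", all_of(cols), remove = TRUE, sep = "-")
"x" is the name of the new column.
all_of(cols) is the selection (<tidy-select>) of columns we want to unite. Using the column names stored in cols means nothing has to be hard-coded.
remove = TRUE removes the input columns.
sep = "-" defines the separator between the values.
If there are NA values, we could also add na.rm = TRUE.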
Output
# a x
# 1 1 a-d-g
# 2 2 b-e-h
# 3 3 c-f-i
Input data
data <- data.frame('a' = 1:3,
'b' = c('a','b','c'),
'c' = c('d', 'e', 'f'),
'd' = c('g', 'h', 'i'))
cols <- c('b','c','d')
data
# a b c d
# 1 1 a d g
# 2 2 b e h
# 3 3 c f i
*This solution differs from the ones already posted.
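The remark about na.rm = TRUE matters because plain paste turns a missing value into the literal string "NA". The sketch below shows that behaviour in base R and one possible way to skip missing values without tidyr. The data frame d_na and the helper paste_skip_na are hypothetical names introduced for illustration; the helper loops over rows, so it is meant for clarity rather than speed.

```r
# Hypothetical data with missing values
d_na <- data.frame(a = 1:3,
                   b = c("a", NA, "c"),
                   c = c("d", "e", NA),
                   stringsAsFactors = FALSE)
cols <- c("b", "c")

# paste() converts NA to the string "NA"
with_na <- do.call(paste, c(d_na[cols], sep = "-"))

# Hypothetical helper: drop the NA values of each row before pasting
paste_skip_na <- function(df, sep = "-") {
  vapply(seq_len(nrow(df)), function(i) {
    vals <- unlist(df[i, , drop = FALSE], use.names = FALSE)
    paste(vals[!is.na(vals)], collapse = sep)
  }, character(1))
}
without_na <- paste_skip_na(d_na[cols])

print(with_na)     # "a-d" "NA-e" "c-NA"
print(without_na)  # "a-d" "e" "c"
```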