将列中以逗号分隔的字符串拆分为单独的行

Split comma-separated strings in a column into separate rows

提问人:RoyalTS 提问时间:12/8/2012 最后编辑:JaapRoyalTS 更新时间:5/27/2023 访问量:52032

问:

我有一个数据框,如下所示:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

如您所见,列中的某些条目是用逗号分隔的多个名称。我想将这些条目拆分为单独的行,同时保留另一列的值。例如,上述数据框中的第一行应拆分为两行,列中各有一个名称,列中应有一个“A”。directordirectorAB

字符串 拆分 r-faq data.table dplyr tidyr

评论

3赞 Ricardo Saporta 12/8/2012
只是想问一个显而易见的问题:这些数据是你应该在互联网上发布的吗?
2赞 Matthew Lundberg 12/8/2012
它们“并不都是B级电影”。看起来无害。
29赞 RoyalTS 12/8/2012
所有这些人都是奥斯卡金像奖提名者,我几乎不认为这是秘密=)

答:

52赞 Matthew Lundberg 12/8/2012 #1

命名您的原始 data.frame ,我们有这个:v

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

请注意 to 构建新的 AB 列。此处,返回每个原始行中的名称数。repsapply

评论

1赞 IRTFM 1/4/2013
我想知道'AB=rep(v$AB, unlist(sapply(s, FUN=length )))'是否可能比更晦涩难懂的更容易掌握?这里有什么更合适的吗?vapplyvapply
8赞 Rich Scriven 2/21/2017
如今可以用 .sapply(s, length)lengths(s)
32赞 A5C1D2H2I1M1N2O1R2T1 2/3/2015 #2

迟到了,但另一个通用的替代方案是从我的“splitstackshape”包中使用,它有一个参数。将此设置为可获取指定的结果:cSplitdirection"long"

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B
147赞 Jaap 7/20/2015 #3

几种选择:

1) 的两种方式:

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) / 组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3) 仅使用 tidyr:使用 tidyr 0.5.0(及更高版本),您还可以使用:separate_rows

separate_rows(v, director, sep = ",")

您可以使用该参数自动将数字转换为数值列。convert = TRUE

tidyr_1.3.0(及更高版本)中,您可以使用(现已取代):separate_longer_delimseparate_rows

separate_longer_delim(v, director, delim = ",")

4) 使用基数 R:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))

评论

0赞 Reilstein 1/19/2019
有没有办法一次对多列执行此操作?例如,3 列,每列都有用 “;” 分隔的字符串,每列具有相同数量的字符串。即 成为?data.table(id= "X21", a = "chr1;chr1;chr1", b="123;133;134",c="234;254;268")data.table(id = c("X21","X21",X21"), a=c("chr1","chr1","chr1"), b=c("123","133","134"), c=c("234","254","268"))
1赞 Reilstein 1/19/2019
哇刚刚意识到它已经可以同时适用于多个列 - 这太棒了!
0赞 Moon_Watcher 6/14/2019
您@Reilstein能分享一下您是如何将其应用于多个专栏的吗?我有相同的用例,但不确定如何去做。
1赞 Reilstein 6/15/2019
@Moon_Watcher上面答案中的方法 1 已经适用于多个列,这就是我认为很神奇的地方。 对我有用。setDT(dt)[,lapply(.SD, function(x) unlist(tstrsplit(x, ";",fixed=TRUE))), by = ID]
2赞 Maël 3/13/2023
仅供参考,现在已弃用separate_rowsseparate_longer_delim
100赞 Uwe 4/16/2017 #4

这个老问题经常被用作欺骗目标(标记为 )。截至今天,它已经回答了三次,提供了 6 种不同的方法,但缺乏基准作为指导,哪种方法最快1.r-faq

基准解决方案包括

总的来说,使用该软件包对 6 种不同大小的数据帧进行了 8 种不同的方法基准测试(参见下面的代码)。microbenchmark

OP 给出的样本数据仅包含 20 行。为了创建更大的数据帧,这 20 行只需重复 1、10、100、1000、10000 和 100000 次,从而产生多达 200 万行的问题大小。

基准测试结果

enter image description here

基准测试结果表明,对于足够大的数据帧,所有方法都比任何其他方法都快。对于超过 5000 行的数据帧,Jaap 的方法 2 和变体是最快的,比最慢的方法快得多。data.tabledata.tableDT3

值得注意的是,这两种方法和解决方案的时间安排非常相似,以至于很难区分图表中的曲线。它们是所有数据帧大小的基准方法中最慢的。tidyversesplistackshape

对于较小的数据帧,Matt 的基本 R 解决方案和方法 4 的开销似乎比其他方法少。data.table

法典

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

为问题大小的基准运行定义函数n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

针对不同问题规模运行基准测试

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

准备用于绘图的数据

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

创建图表

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

会话信息和软件包版本(摘录)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

1我的好奇心被这个热情洋溢的评论激起了:太棒了!速度快几个数量级!一个问题整洁答案,该问题作为该问题的副本关闭。

评论

0赞 Frank 4/16/2017
好!看起来 cSplit 和 separate_rows(专门为此而设计)有改进的空间。顺便说一句,cSplit 也接受一个 fixed= arg,并且是一个基于 data.table 的包,所以不妨给它 DT 而不是 DF。另外,我不认为从 factor 到 char 的转换属于基准测试(因为它一开始应该是 char)。我检查了一下,这些变化都没有对结果产生任何定性影响。
1赞 Uwe 4/16/2017
@Frank 感谢您对改进基准和检查对结果的影响的建议。在下一个版本的 、 等发布后进行更新时会选择这个。data.tabledplyr
0赞 Ferroao 6/27/2017
我认为这些方法没有可比性,至少在所有情况下都没有,因为数据表方法只生成具有“选定”列的表,而 dplyr 生成包含所有列的结果(包括不参与分析的列,并且不必在函数中写入它们的名称)。
6赞 Tensibai 6/27/2017
@Ferroao 这是错误的,data.tables 方法就地修改了“表”,所有列都被保留了下来,当然,如果你不就地修改,你只会得到你所要求的内容的过滤副本。简而言之,data.table 方法是不生成生成的数据集,而是更新数据集,这就是 data.table 和 dplyr 之间的真正区别。
1赞 GKi 6/18/2020
真的很不错的比较!也许您可以在执行时添加matt_modjaap_dplyr。因为其他人有它,这将对时间产生影响。从 R 4.0.0 开始,创建 时的默认值是 ,因此可以删除。strsplitfixed=TRUEdata.framestringsAsFactors = FALSEas.character
2赞 zhang jing 7/27/2019 #5
devtools::install_github("yikeshu0611/onetree")

library(onetree)

dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")

head(dd)
            director AB
1       Aaron Blaise  A
2         Bob Walker  A
3     Akira Kurosawa  B
4     Alan J. Pakula  A
5        Alan Parker  A
6 Alejandro Amenabar  B
2赞 GKi 6/23/2020 #6

目前,建议使用 from base 生成的另一个基准测试将列中的逗号分隔字符串拆分为单独的行,因为它在各种大小范围内是最快的:strsplit

s <- strsplit(v$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))

请注意,使用对计时有重大影响。fixed=TRUE

Curves showing computation time over number of rows

比较方法:

met <- alist(base = {s <- strsplit(v$director, ",") #Matthew Lundberg
   s <- data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))}
 , baseLength = {s <- strsplit(v$director, ",") #Rich Scriven
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , baseLeFix = {s <- strsplit(v$director, ",", fixed=TRUE)
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , cSplit = s <- cSplit(v, "director", ",", direction = "long") #A5C1D2H2I1M1N2O1R2T1
 , dt = s <- setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, "," #Jaap
   , fixed=TRUE))), by = AB][!is.na(director)]
#, dt2 = s <- setDT(v)[, strsplit(director, "," #Jaap #Only Unique
#  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
 , dplyr = {s <- v %>%  #Jaap
    mutate(director = strsplit(director, ",", fixed=TRUE)) %>%
    unnest(director)}
 , tidyr = s <- separate_rows(v, director, sep = ",") #Jaap
 , stack = s <- stack(setNames(strsplit(v$director, ",", fixed=TRUE), v$AB)) #Jaap
#, dt3 = {s <- setDT(v)[, strsplit(director, ",", fixed=TRUE), #Uwe #Only Unique
#  by = .(AB, director)][, director := NULL][, setnames(.SD, "V1", "director")]}
 , dt4 = {s <- setDT(v)[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
 , dt5 = {s <- vT[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
   )

图书馆:

library(microbenchmark)
library(splitstackshape) #cSplit
library(data.table) #dt, dt2, dt3, dt4
#setDTthreads(1) #Looks like it has here minor effect
library(dplyr) #dplyr
library(tidyr) #dplyr, tidyr

数据:

v0 <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

计算和计时结果:

n <- 10^(0:5)
x <- lapply(n, function(n) {v <- v0[rep(seq_len(nrow(v0)), n),]
  vT <- setDT(v)
  ti <- min(100, max(3, 1e4/n))
  microbenchmark(list = met, times = ti, control=list(order="block"))})

y <- do.call(cbind, lapply(x, function(y) aggregate(time ~ expr, y, median)))
y <- cbind(y[1], y[-1][c(TRUE, FALSE)])
y[-1] <- y[-1] / 1e6 #ms
names(y)[-1] <- paste("n:", n * nrow(v0))
y #Time in ms
#         expr     n: 20    n: 200    n: 2000   n: 20000   n: 2e+05   n: 2e+06
#1        base 0.2989945 0.6002820  4.8751170  46.270246  455.89578  4508.1646
#2  baseLength 0.2754675 0.5278900  3.8066300  37.131410  442.96475  3066.8275
#3   baseLeFix 0.2160340 0.2424550  0.6674545   4.745179   52.11997   555.8610
#4      cSplit 1.7350820 2.5329525 11.6978975  99.060448 1053.53698 11338.9942
#5          dt 0.7777790 0.8420540  1.6112620   8.724586  114.22840  1037.9405
#6       dplyr 6.2425970 7.9942780 35.1920280 334.924354 4589.99796 38187.5967
#7       tidyr 4.0323765 4.5933730 14.7568235 119.790239 1294.26959 11764.1592
#8       stack 0.2931135 0.4672095  2.2264155  22.426373  289.44488  2145.8174
#9         dt4 0.5822910 0.6414900  1.2214470   6.816942   70.20041   787.9639
#10        dt5 0.5015235 0.5621240  1.1329110   6.625901   82.80803   636.1899

注意,像这样的方法

(v <- rbind(v0[1:2,], v0[1,]))
#                 director AB
#1 Aaron Blaise,Bob Walker  A
#2          Akira Kurosawa  B
#3 Aaron Blaise,Bob Walker  A

setDT(v)[, strsplit(director, "," #Jaap #Only Unique
  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
#         director AB
#1:   Aaron Blaise  A
#2:     Bob Walker  A
#3: Akira Kurosawa  B

返回 A 表示 director,并且可能与strsplitunique

tmp <- unique(v)
s <- strsplit(tmp$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(tmp$AB, lengths(s)))

但据我了解,这并没有被问到。