Asked by: edgester  Asked: 7/16/2012  Last edited by: moodymudskipper, edgester  Updated: 5/19/2023  Views: 9647
Expand ranges defined by "from" and "to" columns
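Q:
This question is also known as "converting a start-end dataset to a panel dataset".
I have a data frame that contains US presidents and the years they started and ended in office (columns "name", "from" and "to"). Here is an example: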
presidents <- data.frame(
name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
from = c(1993, 2001, 2009),
to = c(2001, 2009, 2012)
)
presidents
#> name from to
#> 1 Bill Clinton 1993 2001
#> 2 George W. Bush 2001 2009
#> 3 Barack Obama 2009 2012
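I want to create a data frame with two columns ("name" and "year"), with one row for each year that a president was in office. So for each president I need to create a regular sequence of years running from "from" to "to". Here is my expected output: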
name year
Bill Clinton 1993
Bill Clinton 1994
...
Bill Clinton 2000
Bill Clinton 2001
George W. Bush 2001
George W. Bush 2002
...
George W. Bush 2008
George W. Bush 2009
Barack Obama 2009
Barack Obama 2010
Barack Obama 2011
Barack Obama 2012
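I know that I can use data.frame(name = "Bill Clinton", year = seq(1993, 2001)) to expand things for a single president, but I can't figure out how to iterate over every president.
How do I do this? I feel that I should know this, but I'm drawing a blank.
Update 1
OK, I've tried both solutions, and I'm getting an error: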
foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1
A:
You can use the plyr package:
library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
# name year
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# [...]
If it is important that the data be sorted by year, you can use the arrange function:
df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# 3 Bill Clinton 1995
# [...]
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Edit 1: Following @edgester's "Update 1", a more appropriate approach is to use adply, which accounts for presidents with non-consecutive terms:
adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]
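To see why grouping by name fails for foo, it may help to look at the group sizes. The following is a minimal base R sketch (not part of the original answer) that rebuilds the data from "Update 1", shows that "Grover Cleveland" occupies two rows, and then expands row by row with Map(). The names years and expanded are illustrative only.

```r
# Data from "Update 1": Grover Cleveland served two non-consecutive terms.
foo <- data.frame(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1889, 1893, 1897)
)

# Grouping by name puts two rows into the "Grover Cleveland" group, so
# `from` and `to` have length 2 there and seq() rejects them.
table(foo$name)

# Working row by row avoids the problem: one sequence per row.
years <- Map(seq, foo$from, foo$to)

# Repeat each name once per generated year and flatten the sequences.
expanded <- data.frame(
  name = rep(foo$name, lengths(years)),
  year = unlist(years)
)
expanded
```

Each term is expanded separately, so both of Cleveland's terms appear in the result.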
Here is a data.table solution. It has the nice (if minor) feature of leaving the presidents in the order in which they were supplied:
library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
# name year
# 1: Bill Clinton 1993
# 2: Bill Clinton 1994
# ...
# ...
# 21: Barack Obama 2011
# 22: Barack Obama 2012
Edit: To handle presidents with non-consecutive terms, use this instead:
dt[, list(year = seq(from, to)), by = c("name", "from")]
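From the comments: if the table has additional columns, for example birthdate and state, they can be listed alongside the sequence:
dt[, list(birthdate, state, year = seq(from, to)), by = name]
To keep every column except from and to, merge the expanded years back onto the original table:
merge(dt[, !c("from", "to")], dt[, list(year = seq(from, to)), by = name], by="name")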
Here is a quick base R solution, where Df is your data.frame:
do.call(rbind, apply(Df, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
It gives some warnings about row names, but appears to return the correct data.frame.
A variant from the comments that repeats name explicitly and converts the bounds with as.numeric():
do.call(rbind, apply(presidents, 1, function(x) { data.frame(name=rep(x[1],as.numeric(x[3])-as.numeric(x[2])+1), year=x[2]:x[3])}))
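One thing to keep in mind with any apply()-based approach is that apply() first converts the data frame to a matrix. Because the name column is character, the whole matrix becomes character, so the year bounds reach the function as strings. The sketch below only illustrates that coercion; the names m, row1 and years are made up for the example.

```r
presidents <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012)
)

# apply() operates on as.matrix(presidents); mixing character and numeric
# columns yields a character matrix.
m <- as.matrix(presidents)
typeof(m)                      # "character"

# A single row, as the function passed to apply() would see it.
row1 <- m[1, ]
class(row1[["from"]])          # "character"

# Converting explicitly makes the intent clear before building the range.
years <- as.numeric(row1[["from"]]):as.numeric(row1[["to"]])
years
```

This is why explicit as.numeric() conversions are a reasonable precaution inside the function.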
Here is a dplyr solution:
library(dplyr)
# the data
presidents <-
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name",
"from", "to"), row.names = 42:44, class = "data.frame")
# the expansion of the table
presidents %>%
rowwise() %>%
do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))
# the output
Source: local data frame [22 x 2]
Groups: <by row>
name year
(chr) (dbl)
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
.. ... ...
See: https://stackoverflow.com/a/24804470/1036500
Two base solutions.
Using sequence:
len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
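The two-argument call sequence(len, d$from) relies on the from argument of sequence(), which to my knowledge was added in R 4.0.0. For older versions, an equivalent can be sketched with the one-argument form, assuming d is the presidents data frame from the question:

```r
d <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012)
)

# Number of years covered by each row (both endpoints included).
len <- d$to - d$from + 1

# sequence(len) counts 1..len[i] for each row; adding the repeated start
# year and subtracting 1 turns that into from[i]..to[i].
year <- rep(d$from, len) + sequence(len) - 1

res <- data.frame(name = rep(d$name, len), year = year)
res
```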
Using mapply:
l <- mapply(`:`, d$from, d$to)
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
# ...snip
# 8 Bill Clinton 2000
# 9 Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19 Barack Obama 2009
# 20 Barack Obama 2010
# 21 Barack Obama 2011
# 22 Barack Obama 2012
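One caveat with the mapply() version: when every range happens to have the same length, mapply() simplifies its result to a matrix rather than a list, and lengths(l) no longer gives one count per row. Passing SIMPLIFY = FALSE (or using Map()) avoids this. A minimal sketch with made-up data:

```r
# Hypothetical data in which both ranges span exactly three years.
d <- data.frame(
  name = c("a", "b"),
  from = c(2001, 2011),
  to   = c(2003, 2013)
)

# With equal-length results, mapply() returns a 3 x 2 matrix.
m <- mapply(`:`, d$from, d$to)
is.matrix(m)   # TRUE

# SIMPLIFY = FALSE always returns a list, one element per row of d.
l <- mapply(`:`, d$from, d$to, SIMPLIFY = FALSE)

res <- data.frame(
  name = d$name[rep(seq_len(nrow(d)), lengths(l))],
  year = unlist(l)
)
res
```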
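As @Esteis points out in the comments, after expanding the ranges there may well be several columns that need to be repeated (not just "name" as in the OP). In that case, instead of repeating the values of a single column, simply repeat the rows of the entire data frame, leaving out the "from" and "to" columns. A simple example: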
d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
from = c(2001, 2011), to = c(2003, 2012))
# x y names from to
# 1 1 3 a 2001 2003
# 2 2 4 b 2011 2012
len = d$to - d$from + 1
cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
year = sequence(len, d$from))
x y names year
1 1 3 a 2001
1.1 1 3 a 2002
1.2 1 3 a 2003
2 2 4 b 2011
2.1 2 4 b 2012
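The same idea can be wrapped in a small helper. The function below is only a sketch: expand_ranges() and its arguments from, to and out are hypothetical names chosen for this example, and it assumes whole-number bounds with to >= from in every row. It also resets the row names, which otherwise come out as 1, 1.1, 1.2 and so on.

```r
# Hypothetical helper: repeat every row once per value in its range and
# replace the two bound columns with a single sequence column.
expand_ranges <- function(d, from = "from", to = "to", out = "year") {
  len <- d[[to]] - d[[from]] + 1
  stopifnot(all(len >= 1))                    # assumes to >= from everywhere
  idx <- rep(seq_len(nrow(d)), len)           # row i repeated len[i] times
  res <- d[idx, setdiff(names(d), c(from, to)), drop = FALSE]
  res[[out]] <- rep(d[[from]], len) + sequence(len) - 1
  rownames(res) <- NULL                       # drop the 1, 1.1, 1.2 labels
  res
}

d <- data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
                from = c(2001, 2011), to = c(2003, 2012))
res <- expand_ranges(d)
res
```

Using drop = FALSE keeps the result a data frame even when only one column remains besides the bounds.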
Another tidyverse option is to gather the data into long format, group_by name, and create a sequence between the from and to dates.
library(tidyverse)
presidents %>%
gather(key, date, -name) %>%
group_by(name) %>%
complete(date = seq(date[1], date[2]))%>%
select(-key)
# A tibble: 22 x 2
# Groups: name [3]
# name date
# <chr> <dbl>
# 1 Barack Obama 2009
# 2 Barack Obama 2010
# 3 Barack Obama 2011
# 4 Barack Obama 2012
# 5 Bill Clinton 1993
# 6 Bill Clinton 1994
# 7 Bill Clinton 1995
# 8 Bill Clinton 1996
# 9 Bill Clinton 1997
#10 Bill Clinton 1998
# … with 12 more rows
Some alternative tidyverse approaches:
Using reframe() and mapply():
library(tidyverse)
presidents %>%
reframe(year = mapply(seq, from, to), .by = -c(from, to))
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Using map2() and unnest():
presidents %>%
mutate(year = map2(from, to, seq), .keep = "unused") %>%
unnest(year)
# name year
# 1 Bill Clinton 1993
# 2 Bill Clinton 1994
...
# 21 Barack Obama 2011
# 22 Barack Obama 2012
Use by to create a list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.
L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))
If you don't mind row names, then the last line can be reduced to:
do.call("rbind", L)
Another solution uses dplyr and tidyr. It correctly preserves any data columns you have.
library(magrittr) # for pipes
df <- data.frame(
tata = c('toto1', 'toto2'),
from = c(2000, 2004),
to = c(2001, 2009),
measure1 = rnorm(2),
measure2 = 10 * rnorm(2)
)
tata from to measure1 measure2
1 toto1 2000 2001 -0.575 -10.13
2 toto2 2004 2009 -0.258 17.37
df %>%
dplyr::rowwise() %>%
dplyr::mutate(year = list(seq(from, to))) %>%
dplyr::select(-from, -to) %>%
tidyr::unnest(c(year))
# A tibble: 8 x 4
tata measure1 measure2 year
<chr> <dbl> <dbl> <int>
1 toto1 -0.575 -10.1 2000
2 toto1 -0.575 -10.1 2001
3 toto2 -0.258 17.4 2004
4 toto2 -0.258 17.4 2005
5 toto2 -0.258 17.4 2006
6 toto2 -0.258 17.4 2007
7 toto2 -0.258 17.4 2008
8 toto2 -0.258 17.4 2009
An addition to the tidyverse solutions, where df is the presidents data frame, can be:
df %>%
uncount(to - from + 1) %>%
group_by(name) %>%
transmute(year = seq(first(from), first(to)))
name year
<chr> <dbl>
1 Bill Clinton 1993
2 Bill Clinton 1994
3 Bill Clinton 1995
4 Bill Clinton 1996
5 Bill Clinton 1997
6 Bill Clinton 1998
7 Bill Clinton 1999
8 Bill Clinton 2000
9 Bill Clinton 2001
10 George W. Bush 2001
Here is another base R solution that should be fairly fast:
a <- lapply(1:nrow(presidents),function(a){
data.frame(
name=rep(presidents$name[a],presidents$to[a]-presidents$from[a]+1),
year=presidents$from[a]:presidents$to[a]
)
})
do.call('rbind',a)
Any data you want to keep can be handled in the same way as the name column.
Benchmarked against Jason Morgan's other base R solution above, because I was curious:
f_max <- function(x){
a <- lapply(1:nrow(x),function(a){
data.frame(
name=rep(x$name[a],x$to[a]-x$from[a]+1),
year=x$from[a]:x$to[a]
)
})
do.call(rbind,a)
}
f_jason <- function(y){
do.call(rbind, apply(y, 1, function(x) {
data.frame(name=x[1], year=seq(x[2], x[3]))}))
}
f_combined <- function(y){
do.call(rbind, apply(y, 1, function(x) {
data.frame(name=rep(x[1],as.numeric(x[3])-as.numeric(x[2])+1), year=x[2]:x[3])}))
}
r <- f_jason(presidents)
all(r==f_max(presidents))
all(r==f_combined(presidents))
library(microbenchmark)
res <- microbenchmark(f_jason(presidents),f_combined(presidents),f_max(presidents))
print(res, order="mean")
expr min lq mean median uq max neval cld
f_max(presidents) 436.9 462.75 561.074 482.30 521.25 5601.0 100 a
f_combined(presidents) 566.5 605.95 796.029 639.70 723.60 7548.7 100 b
f_jason(presidents) 770.2 829.70 998.108 906.15 1011.85 4891.0 100 b