Reshaping data.frame from wide to long format

Asked by: mropa  Asked: 2/2/2010  Last edited by: NelsonGon  Updated: 11/1/2022  Views: 298951

Q:

I am having some trouble converting my data.frame from a wide table to a long table. At the moment it looks like this:

Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246

Now I would like to transform this data.frame into a long data.frame. Something like this:

Code Country        Year    Value
AFG  Afghanistan    1950    20,249
AFG  Afghanistan    1951    21,352
AFG  Afghanistan    1952    22,532
AFG  Afghanistan    1953    23,557
AFG  Afghanistan    1954    24,555
ALB  Albania        1950    8,097
ALB  Albania        1951    8,986
ALB  Albania        1952    10,058
ALB  Albania        1953    11,123
ALB  Albania        1954    12,246

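I have looked at, and already tried, the melt() and reshape() functions, as some people suggested in similar questions. However, so far I only get messy results.

If possible, I would like to do it with the reshape() function, since it looks a little nicer to handle.

Tags: r dataframe reshape r-faq data.table tidyr sqldf

Comments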

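3 upvotes Eduardo Leoni 2/3/2010
Don't know if that was the problem, but the functions in the reshape package are melt and cast (and recast).
3 upvotes IRTFM 9/16/2014
And the reshape package has been superseded by reshape2.
7 upvotes drhagen 2/15/2016
And now reshape2 has been superseded by tidyr.
4 upvotes NelsonGon 1/13/2022
And now tidyr's gather and spread have been replaced by the pivot_* functions.

Answers:

39 upvotes Shane 2/3/2010 #1

Using the reshape package: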

#data
x <- read.table(textConnection(
"Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246"), header=TRUE)

library(reshape)

x2 <- melt(x, id = c("Code", "Country"), variable_name = "Year")
x2[,"Year"] <- as.numeric(gsub("X", "" , x2[,"Year"]))

137 upvotes Aniko 2/3/2010 #2

reshape() takes a while to get used to, just like melt/cast. Here is a solution with reshape, assuming your data frame is called d:

reshape(d, 
        direction = "long",
        varying = list(names(d)[3:7]),
        v.names = "Value",
        idvar = c("Code", "Country"),
        timevar = "Year",
        times = 1950:1954)
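
As a rough end-to-end sketch of the call above: the read.table() step, deriving times from the column names, and the tidy-up at the end are assumptions added for illustration and are not part of the answer itself.

```r
# Sample data from the question; check.names = FALSE keeps "1950" instead of "X1950"
d <- read.table(text = "Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

year_cols <- names(d)[3:7]

long <- reshape(d,
                direction = "long",
                varying   = list(year_cols),      # one group of columns -> one value column
                v.names   = "Value",
                idvar     = c("Code", "Country"),
                timevar   = "Year",
                times     = as.integer(year_cols)) # assumption: derive years from the names

# Optional tidy-up (assumption, not in the answer): numeric values, sorted rows, plain row names
long$Value <- as.numeric(gsub(",", "", long$Value))
long <- long[order(long$Code, long$Year), ]
rownames(long) <- NULL
```

Taking times from the column names avoids hard-coding 1950:1954, which may be convenient if the set of year columns changes.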

228 upvotes Jaap 9/16/2014 #3

Two alternative solutions:

1) With data.table:

You can use the melt function:

library(data.table)
long <- melt(setDT(wide), id.vars = c("Code","Country"), variable.name = "year")

which gives:

> long
    Code     Country year  value
 1:  AFG Afghanistan 1950 20,249
 2:  ALB     Albania 1950  8,097
 3:  AFG Afghanistan 1951 21,352
 4:  ALB     Albania 1951  8,986
 5:  AFG Afghanistan 1952 22,532
 6:  ALB     Albania 1952 10,058
 7:  AFG Afghanistan 1953 23,557
 8:  ALB     Albania 1953 11,123
 9:  AFG Afghanistan 1954 24,555
10:  ALB     Albania 1954 12,246

Some alternative notations:

melt(setDT(wide), id.vars = 1:2, variable.name = "year")
melt(setDT(wide), measure.vars = 3:7, variable.name = "year")
melt(setDT(wide), measure.vars = as.character(1950:1954), variable.name = "year")
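
One detail that may matter downstream: by default melt() returns the variable column as a factor. A minimal sketch, assuming you want year as an integer instead (the variable.factor = FALSE choice and the conversion step are additions for illustration, not part of the answer):

```r
library(data.table)

# Sample data from the question
wide <- read.table(text = "Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

# variable.factor = FALSE keeps the former column names as character, not factor
long <- melt(setDT(wide),
             id.vars = c("Code", "Country"),
             variable.name = "year",
             variable.factor = FALSE)

# Convert the character years to integers by reference
long[, year := as.integer(year)]
```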

2) With tidyr:

Use pivot_longer():

library(tidyr)

long <- wide %>% 
  pivot_longer(
    cols = `1950`:`1954`, 
    names_to = "year",
    values_to = "value"
)

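Some notes:

  • names_to and values_to default to "name" and "value" respectively, so you could write this more succinctly as wide %>% pivot_longer(`1950`:`1954`).
  • The cols argument uses the highly flexible tidyselect DSL, so you can select the same columns using a negative selection (!c(Code, Country)), a selection helper (starts_with("19"); matches("^\\d{4}$")), numeric indices (3:7), and more.
  • tidyr::pivot_longer() is the successor to tidyr::gather() and reshape2::melt(), which are no longer under development.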
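
To make the second note concrete, here is a small sketch comparing several ways of writing cols. The variable names (by_range, by_negation, and so on) are made up for illustration. The -c(Code, Country) form selects every column except the ones listed, in the same way as !c(Code, Country).

```r
library(tidyr)

# Sample data from the question
wide <- read.table(text = "Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

by_range    <- pivot_longer(wide, cols = `1950`:`1954`)          # column range
by_negation <- pivot_longer(wide, cols = !c(Code, Country))      # everything except the ids
by_minus    <- pivot_longer(wide, cols = -c(Code, Country))      # older spelling of the same idea
by_prefix   <- pivot_longer(wide, cols = starts_with("19"))      # selection helper
by_regex    <- pivot_longer(wide, cols = matches("^\\d{4}$"))    # regular expression
by_position <- pivot_longer(wide, cols = 3:7)                    # numeric indices

# Check that every variant gives the same long table as the range version
all_same <- all(vapply(
  list(by_negation, by_minus, by_prefix, by_regex, by_position),
  function(x) isTRUE(all.equal(x, by_range)),
  logical(1)
))
```

On this sample data all_same should come out TRUE, since each selection picks the same five year columns.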

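Transforming values

Another problem with the data is that R reads the values as character values (because of the , in the numbers). You can fix that with gsub and as.numeric, either after reshaping: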

long$value <- as.numeric(gsub(",", "", long$value))

or during reshaping, with data.table or tidyr:

# data.table
long <- melt(setDT(wide),
             id.vars = c("Code","Country"),
             variable.name = "year")[, value := as.numeric(gsub(",", "", value))]

# tidyr
long <- wide %>%
  pivot_longer(
    cols = `1950`:`1954`, 
    names_to = "year",
    values_to = "value",
    values_transform = ~ as.numeric(gsub(",", "", .x))
  )
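
The year column is still character after this step. As a possible variation (an assumption, not part of the answer), names_transform can convert it while pivoting. The sketch below uses the named-list form of both transform arguments and an ordinary function rather than a formula.

```r
library(tidyr)

# Sample data from the question
wide <- read.table(text = "Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

long <- pivot_longer(
  wide,
  cols = !c(Code, Country),
  names_to = "year",
  values_to = "value",
  # convert the former column names ("1950", ...) to integers
  names_transform = list(year = as.integer),
  # strip the thousands separators, then convert to numeric
  values_transform = list(value = function(x) as.numeric(gsub(",", "", x)))
)
```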

Data:

wide <- read.table(text="Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246", header=TRUE, check.names=FALSE)

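Comments

1 upvote Jia Gao 10/19/2017
Great answer, just one more tiny reminder: don't put any variables other than id and time in your data frame; melt cannot tell what you want to do in that case.
1 upvote Jaap 10/19/2017
@JasonGoal Could you elaborate on that? As I interpret your comment, it shouldn't be a problem. Just specify both id.vars and measure.vars.
1 upvote 11/17/2019
Sorry to necro this post - could someone explain to me why 3 works? I've tested it and it works, but I don't understand what dplyr is doing when it sees -c(var1, var2)...
1 upvote Jaap 11/18/2019
@ReputableMisnomer When tidyr sees -c(var1, var2), it omits these variables when transforming the data from wide to long format.
10 upvotes Evan Rosica 4/22/2020
According to the tidyverse blog, gather is now retired and has been replaced by pivot_longer. They state: "New pivot_longer() and pivot_wider() provide modern alternatives to spread() and gather(). They have been carefully redesigned to be easier to learn and remember, and include many new features. spread() and gather() won't go away, but they've been retired which means that they're no longer under active development."

11 upvotes Mark Peterson 12/5/2016 #4

Here is another example showing the use of gather from tidyr. You can select the columns to gather either by removing them individually (as I do here) or by explicitly including the years you want.

Note that, to handle the commas (and the X's added if check.names = FALSE is not set), I am also using dplyr's mutate with parse_number from readr to convert the text values back to numbers. These are all part of the tidyverse and so can be loaded together with library(tidyverse).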

library(tidyverse)

wide %>%
  gather(Year, Value, -Code, -Country) %>%
  mutate(Year = parse_number(Year)
         , Value = parse_number(Value))

Returns:

   Code     Country Year Value
1   AFG Afghanistan 1950 20249
2   ALB     Albania 1950  8097
3   AFG Afghanistan 1951 21352
4   ALB     Albania 1951  8986
5   AFG Afghanistan 1952 22532
6   ALB     Albania 1952 10058
7   AFG Afghanistan 1953 23557
8   ALB     Albania 1953 11123
9   AFG Afghanistan 1954 24555
10  ALB     Albania 1954 12246
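
36 upvotes A5C1D2H2I1M1N2O1R2T1 1/9/2018 #5

Since this answer is tagged with r-faq, I felt it would be useful to share another alternative from base R: stack.

Note, however, that stack does not work with factors; it only works if is.vector is TRUE, and from the documentation for is.vector we find that:

is.vector returns TRUE if x is a vector of the specified mode having no attributes other than names. It returns FALSE otherwise.

I'm using the sample data from @Jaap's answer, where the values in the year columns are factors.

Here's the stack approach: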

cbind(wide[1:2], stack(lapply(wide[-c(1, 2)], as.character)))
##    Code     Country values  ind
## 1   AFG Afghanistan 20,249 1950
## 2   ALB     Albania  8,097 1950
## 3   AFG Afghanistan 21,352 1951
## 4   ALB     Albania  8,986 1951
## 5   AFG Afghanistan 22,532 1952
## 6   ALB     Albania 10,058 1952
## 7   AFG Afghanistan 23,557 1953
## 8   ALB     Albania 11,123 1953
## 9   AFG Afghanistan 24,555 1954
## 10  ALB     Albania 12,246 1954
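
The stacked result still has the generic column names values and ind, and the values are text. A minimal sketch of how it might be finished off in base R follows. The renaming, type conversion and column reordering are assumptions added for illustration, not part of the answer.

```r
# Sample data from the question
wide <- read.table(text = "Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

# stack() the year columns; as.character() keeps this working if they are factors
stacked <- stack(lapply(wide[-c(1, 2)], as.character))

# cbind() recycles the two id rows to match the ten stacked rows
long <- cbind(wide[1:2], stacked)
names(long)[names(long) == "values"] <- "Value"
names(long)[names(long) == "ind"] <- "Year"

# ind is a factor of the old column names; values still contain thousands separators
long$Year  <- as.integer(as.character(long$Year))
long$Value <- as.numeric(gsub(",", "", long$Value))

# Put the columns in the order asked for in the question
long <- long[c("Code", "Country", "Year", "Value")]
```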

Comments

0 upvotes DuckPyjamas 6/23/2023
You have saved package developers everywhere 🥳

6 upvotes M-- 4/16/2019 #6

Here's a sqldf solution:

library(sqldf)

sqldf("Select Code, Country, '1950' As Year, `1950` As Value From wide
        Union All
       Select Code, Country, '1951' As Year, `1951` As Value From wide
        Union All
       Select Code, Country, '1952' As Year, `1952` As Value From wide
        Union All
       Select Code, Country, '1953' As Year, `1953` As Value From wide
        Union All
       Select Code, Country, '1954' As Year, `1954` As Value From wide;")

To build the query without typing everything out, you can use the following:

Thanks to G. Grothendieck for implementing it.

ValCol <- tail(names(wide), -2)

s <- sprintf("Select Code, Country, '%s' As Year, `%s` As Value from wide", ValCol, ValCol)
mquery <- paste(s, collapse = "\n Union All\n")

cat(mquery) #just to show the query
 #> Select Code, Country, '1950' As Year, `1950` As Value from wide
 #>  Union All
 #> Select Code, Country, '1951' As Year, `1951` As Value from wide
 #>  Union All
 #> Select Code, Country, '1952' As Year, `1952` As Value from wide
 #>  Union All
 #> Select Code, Country, '1953' As Year, `1953` As Value from wide
 #>  Union All
 #> Select Code, Country, '1954' As Year, `1954` As Value from wide

sqldf(mquery)
 #>    Code     Country Year  Value
 #> 1   AFG Afghanistan 1950 20,249
 #> 2   ALB     Albania 1950  8,097
 #> 3   AFG Afghanistan 1951 21,352
 #> 4   ALB     Albania 1951  8,986
 #> 5   AFG Afghanistan 1952 22,532
 #> 6   ALB     Albania 1952 10,058
 #> 7   AFG Afghanistan 1953 23,557
 #> 8   ALB     Albania 1953 11,123
 #> 9   AFG Afghanistan 1954 24,555
 #> 10  ALB     Albania 1954 12,246

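Unfortunately, I don't think that PIVOT and UNPIVOT would work for R SQLite. If you want to write your query in a more sophisticated manner, you can also take a look at these posts: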

63 upvotes akrun 9/15/2019 #7

With tidyr_1.0.0, another option is pivot_longer:

library(tidyr)
pivot_longer(df1, -c(Code, Country), values_to = "Value", names_to = "Year")
# A tibble: 10 x 4
#   Code  Country     Year  Value 
#   <fct> <fct>       <chr> <fct> 
# 1 AFG   Afghanistan 1950  20,249
# 2 AFG   Afghanistan 1951  21,352
# 3 AFG   Afghanistan 1952  22,532
# 4 AFG   Afghanistan 1953  23,557
# 5 AFG   Afghanistan 1954  24,555
# 6 ALB   Albania     1950  8,097 
# 7 ALB   Albania     1951  8,986 
# 8 ALB   Albania     1952  10,058
# 9 ALB   Albania     1953  11,123
#10 ALB   Albania     1954  12,246

Data

df1 <- structure(list(Code = structure(1:2, .Label = c("AFG", "ALB"), class = "factor"), 
    Country = structure(1:2, .Label = c("Afghanistan", "Albania"
    ), class = "factor"), `1950` = structure(1:2, .Label = c("20,249", 
    "8,097"), class = "factor"), `1951` = structure(1:2, .Label = c("21,352", 
    "8,986"), class = "factor"), `1952` = structure(2:1, .Label = c("10,058", 
    "22,532"), class = "factor"), `1953` = structure(2:1, .Label = c("11,123", 
    "23,557"), class = "factor"), `1954` = structure(2:1, .Label = c("12,246", 
    "24,555"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

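Comments

12 upvotes Evan Rosica 4/22/2020
This needs more upvotes. According to the Tidyverse blog, gather is being retired and pivot_longer is now the correct way to accomplish this.
9 upvotes andschar 6/14/2021
@EvanRosica only until they decide to change the function again :p
0 upvotes trilisser 10/23/2023
What does -c(...) do? It looks like it removes the columns, but it doesn't. Could you explain the syntax?

1 upvote Karsten W. 7/19/2020 #8

You can also use the cdata package, which uses the concept of a (transformation) control table: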

# data
wide <- read.table(text="Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246", header=TRUE, check.names=FALSE)

library(cdata)
# build control table
drec <- data.frame(
    Year=as.character(1950:1954),
    Value=as.character(1950:1954),
    stringsAsFactors=FALSE
)
drec <- cdata::rowrecs_to_blocks_spec(drec, recordKeys=c("Code", "Country"))

# apply control table
cdata::layout_by(drec, wide)

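I am currently exploring that package and find it quite accessible. It is designed for much more complicated transformations and includes the back-transformation. There is a tutorial available.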