Asked by: Hotamd6 | Asked: 7/4/2012 | Last edited by: Jaap, Hotamd6 | Updated: 10/16/2022 | Views: 8513
Transpose / reshape dataframe without "timevar" from long to wide format
Q:
I have a data frame in the following long format:
Name MedName
Name1 atenolol 25mg
Name1 aspirin 81mg
Name1 sildenafil 100mg
Name2 atenolol 50mg
Name2 enalapril 20mg
and I would like to get the following (I don't care whether the columns end up named this way; I just want the data in this layout):
Name medication1 medication2 medication3
Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
Name2 atenolol 50mg enalapril 20mg NA
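Through this site I have become familiar with the reshape/reshape2 packages and have made several attempts to get this to work, but so far all have failed.
When I try dcast(dataframe, Name ~ MedName, value.var='MedName'), I just get a bunch of columns that are flags for the medication names (the transposed values are 1 or 0). Example: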
Name atenolol 25mg aspirin 81mg
Name1 1 1
Name2 0 0
I also tried dcast(dataset, Name ~ variable) after melting the dataset, but that only spits out the following (a count of how many medications each person has):
Name MedName
Name1 3
name2 2
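Finally, I tried melting the data and then reshaping with idvar="Name" and timevar="variable" (where every value of variable is MedName). That does not seem to be built for my problem: when an idvar has multiple matches, reshape takes only the first MedName and ignores the rest.
Does anyone know how to do this with reshape or another R function? I realize there is probably a messier way to do it, with some for loops and conditionals that split and re-paste the data, but I was hoping for a simpler solution. Thank you very much!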
A:
Assuming your data is in the object dataset:
library(plyr)
library(reshape2)  # provides dcast()
## Add a medication index
data_with_index <- ddply(dataset, .(Name), mutate,
                         index = paste0('medication', seq_along(Name)))
dcast(data_with_index, Name ~ index, value.var = 'MedName')
## Name medication1 medication2 medication3
## 1 Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
## 2 Name2 atenolol 50mg enalapril 20mg <NA>
Comments
Note that with ten or more medications per Name the columns come out in alphabetical order: medication1,medication10,medication11,medication12,...,medication2
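The ordering issue raised in that comment can be reproduced, and sidestepped, in a few lines of base R. This is only a sketch: the two-digit zero padding via sprintf() is an assumption chosen for illustration, not something the answer above uses.

```r
# Unpadded labels sort alphabetically, so "medication10" lands before "medication2"
unpadded <- paste0("medication", 1:12)
sort(unpadded)[1:5]

# Zero-padded labels (an illustrative choice) keep numeric and alphabetical order in sync
padded <- sprintf("medication%02d", 1:12)
identical(sort(padded), padded)
```

In the ddply() call above, the paste0() expression could be swapped for sprintf('medication%02d', seq_along(Name)) when more than nine medications per name are expected.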
You can always generate a unique timevar before using reshape. Here I use ave to apply the function seq_along "along" each "Name".
test <- data.frame(
Name=c(rep("name1",3),rep("name2",2)),
MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
"atenolol 50mg","enalapril 20mg")
)
# generate the 'timevar'
test$uniqid <- with(test, ave(as.character(Name), Name, FUN = seq_along))
# reshape!
reshape(test, idvar = "Name", timevar = "uniqid", direction = "wide")
Result:
Name MedName.1 MedName.2 MedName.3
1 name1 atenolol 25mg aspirin 81mg sildenafil 100mg
4 name2 atenolol 50mg enalapril 20mg <NA>
Comments
gsub("MedName.","medication",names(reshapedtestdata),fixed=TRUE)
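A minimal sketch of how that gsub() call might be applied to the reshaped result. The object name reshapedtestdata comes from the comment; assigning the result back with names<- is an assumption about how it was meant to be used.

```r
# Rebuild the example data and the wide result from the answer above
test <- data.frame(
  Name    = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)
test$uniqid <- with(test, ave(as.character(Name), Name, FUN = seq_along))
reshapedtestdata <- reshape(test, idvar = "Name", timevar = "uniqid",
                            direction = "wide")

# Replace the "MedName." prefix; fixed = TRUE treats the dot literally
names(reshapedtestdata) <- gsub("MedName.", "medication",
                                names(reshapedtestdata), fixed = TRUE)
names(reshapedtestdata)
```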
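@thelatemail's solution is similar to this one. When I generate the time variable, I use rle in case I am not working interactively and the Name variable needs to be dynamic.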
# start with your example data
x <-
data.frame(
Name=c(rep("name1",3),rep("name2",2)),
MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
"atenolol 50mg","enalapril 20mg")
)
# pick the id variable
id <- 'Name'
# sort the data.frame by that variable
x <- x[ order( x[ , id ] ) , ]
# construct a `time` variable on the fly
x$time <- unlist( lapply( rle( as.character( x[ , id ] ) )$lengths , seq_len ) )
# `reshape` uses that new `time` column by default
y <- reshape( x , idvar = id , direction = 'wide' )
# done
y
Comments
rle
id <- 'Name'
as.character(get(id))
This actually seems to be a fairly common problem, so I have included a function called getanID in my "splitstackshape" package.
Here is what it does:
library(splitstackshape)
getanID(test, "Name")
# Name MedName .id
# 1: name1 atenolol 25mg 1
# 2: name1 aspirin 81mg 2
# 3: name1 sildenafil 100mg 3
# 4: name2 atenolol 50mg 1
# 5: name2 enalapril 20mg 2
Since "data.table" is loaded along with "splitstackshape", you have access to dcast.data.table, so you can proceed as in @mnel's example.
dcast.data.table(getanID(test, "Name"), Name ~ .id, value.var = "MedName")
# Name 1 2 3
# 1: name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2: name2 atenolol 50mg enalapril 20mg NA
The function essentially implements sequence(.N) by the identified groups to create the "time" column.
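To see what a per-group sequence looks like without data.table, here is a rough base R analogue. It is not the package's actual source; it assumes the rows are already sorted by the grouping variable, and the vector Name is made up for illustration.

```r
# Hypothetical grouping vector, already sorted by group
Name <- c("name1", "name1", "name1", "name2", "name2")

# rle() reports the length of each run of identical values
run_lengths <- rle(Name)$lengths

# sequence() counts 1..n within each run, which is the "time" column
time <- sequence(run_lengths)
time
```

On unsorted data, ave(seq_along(Name), Name, FUN = seq_along) gives the same kind of within-group counter without requiring a sort first.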
With the data.table package, this is easily solved with the new rowid function:
library(data.table)
dcast(setDT(d1),
Name ~ rowid(Name, prefix = "medication"),
value.var = "MedName")
which gives:
    Name   medication1    medication2      medication3
1  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2  Name2 atenolol 50mg enalapril 20mg             <NA>
Another approach (commonly used before version 1.9.7):
dcast(setDT(d1)[, rn := 1:.N, by = Name],
Name ~ paste0("medication",rn),
value.var = "MedName")
which gives the same result.
library(dplyr)
library(tidyr)
d1 %>%
group_by(Name) %>%
mutate(rn = paste0("medication",row_number())) %>%
spread(rn, MedName)
which gives:
Source: local data frame [2 x 4]
Groups: Name [2]
    Name   medication1    medication2      medication3
  (fctr)         (chr)          (chr)            (chr)
1  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2  Name2 atenolol 50mg enalapril 20mg               NA
Here is a shorter way, taking advantage of how unlist deals with names:
library(dplyr)
df1 %>% group_by(Name) %>% do(as_tibble(t(unlist(.[2]))))
# # A tibble: 2 x 4
# # Groups: Name [2]
# Name MedName1 MedName2 MedName3
# <chr> <chr> <chr> <chr>
# 1 name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2 name2 atenolol 50mg enalapril 20mg <NA>
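The naming behaviour that answer relies on can be checked in isolation. The following base R sketch uses a hand-built list called meds (a made-up stand-in for the one-column slice .[2] of a single group) to show how unlist() numbers the elements.

```r
# A one-element named list, standing in for a single-column data frame slice
meds <- list(MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg"))

# unlist() appends a running number to the list name for each element
flat <- unlist(meds)
names(flat)

# t() turns the named vector into a one-row matrix with those names as columns
one_row <- t(flat)
colnames(one_row)
```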
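A clean solution involves the very useful function pivot_wider from version 1.1.0 of the tidyr package. With it, you can also specify the column names directly using the names_glue argument.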
library(tidyr)
library(dplyr)
dataframe %>%
group_by(Name) %>%
mutate(row_n = row_number()) %>%
pivot_wider(id_cols = Name, names_from = row_n, values_from = MedName, names_glue = "medication{row_n}")
Output
# A tibble: 2 x 4
# Groups: Name [2]
# Name medication1 medication2 medication3
# <chr> <chr> <chr> <chr>
# 1 Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2 Name2 atenolol 50mg enalapril 20mg NA
A solution with tidyr using chop() and unnest_wider().
library(tidyr)
df %>%
chop(-Name) %>%
unnest_wider(MedName, names_sep = "")
# # A tibble: 2 x 4
# Name MedName1 MedName2 MedName3
# <chr> <chr> <chr> <chr>
# 1 Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2 Name2 atenolol 50mg enalapril 20mg NA
The argument names_sep = "" is necessary; otherwise the new column names will be ..1, ..2, and ..3.
Data
df <- structure(list(Name = c("Name1", "Name1", "Name1", "Name2", "Name2"
), MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
"atenolol 50mg", "enalapril 20mg")), class = "data.frame", row.names = c(NA, -5L))
In base you can use split or unstack to split MedName by Name. Then set the length of each list element to the same length and rbind them.
. <- split(x$MedName, x$Name)
#. <- unstack(x[2:1]) #Alternative
do.call(rbind, lapply(., `length<-`, max(lengths(.))))
#do.call(rbind, lapply(., "[", 1:max(lengths(.)))) #Alternative
#t(sapply(., "[", 1:max(lengths(.)))) #Alternative
# [,1] [,2] [,3]
#Name1 "atenolol 25mg" "aspirin 81mg" "sildenafil 100mg"
#Name2 "atenolol 50mg" "enalapril 20mg" NA
Note that the result is a matrix. If a data.frame is needed, additionally use as.data.frame.matrix.
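A possible way to carry out that last step, sketched in base R. The column labels medication1, medication2, ... and the decision to move the row names into a Name column are assumptions made for illustration, not part of the answer.

```r
x <- data.frame(
  Name    = c("Name1", "Name1", "Name1", "Name2", "Name2"),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

# Same split / pad / rbind steps as above, giving a character matrix
groups <- split(x$MedName, x$Name)
m <- do.call(rbind, lapply(groups, `length<-`, max(lengths(groups))))

# Convert to a data.frame and label the columns (labels are illustrative)
wide <- as.data.frame(m, stringsAsFactors = FALSE)
names(wide) <- paste0("medication", seq_len(ncol(wide)))

# Move the row names into a regular Name column
wide <- data.frame(Name = rownames(wide), wide, stringsAsFactors = FALSE)
rownames(wide) <- NULL
wide
```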
Data
x <- read.table(header=TRUE, text="
Name MedName
Name1 'atenolol 25mg'
Name1 'aspirin 81mg'
Name1 'sildenafil 100mg'
Name2 'atenolol 50mg'
Name2 'enalapril 20mg'")
Benchmark
bench::mark(check=FALSE,
Jaap1 = dcast(setDT(y), Name ~ rowid(Name, prefix = "medication"), value.var = "MedName"),
Jaap2 = x %>% group_by(Name) %>% mutate(rn = paste0("medication",row_number())) %>% spread(rn, MedName),
mnel = {data_with_index <- ddply(x, .(Name), mutate, index = paste0('medication', 1:length(Name)))
dcast(setDT(data_with_index), Name ~ index, value.var = 'MedName') },
thelatemail = reshape(within(x, uniqid <- ave(as.character(Name), Name, FUN = seq_along)), idvar = "Name", timevar = "uniqid", direction = "wide"),
a5c1d2h2i1m1n2o1r2t1 = dcast.data.table(getanID(y, "Name"), Name ~ .id, value.var = "MedName"),
"Anthony Damico" = {. <- x[order(x[, "Name"]),]
.$time <- unlist( lapply( rle( as.character( .[ , "Name" ] ) )$lengths , seq_len ) )
reshape( . , idvar = "Name" , direction = 'wide' ) },
"Ric S" = x %>% group_by(Name) %>% mutate(row_n = row_number()) %>%
pivot_wider(id_cols = Name, names_from = row_n, values_from = MedName, names_glue = "medication{row_n}"),
"Darren Tsai" = x %>% chop(-Name) %>% unnest_wider(MedName, names_sep = ""),
moodymudskipper = x %>% group_by(Name) %>% do(as_tibble(t(unlist(.[2])))),
GKi = {. <- split(x$MedName, x$Name)
do.call(rbind, lapply(., "[", 1:max(lengths(.)))) },
GKi2 = {. <- split(x$MedName, x$Name)
do.call(rbind, lapply(., `length<-`, max(lengths(.))))}
)
Result
expression min median `itr/sec` mem_al…¹ gc/se…² n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by> <dbl> <int> <dbl>
1 Jaap1 910.83µs 1.5ms 645. 274.5KB 8.32 310 4
2 Jaap2 9.64ms 11.53ms 86.9 16.5KB 9.15 38 4
3 mnel 4.69ms 5.91ms 164. 280.1KB 6.31 78 3
4 thelatemail 619.74µs 898.36µs 959. 0B 12.6 458 6
5 a5c1d2h2i1m1n2o1r2t1 1.36ms 2.18ms 426. 291.4KB 8.70 196 4
6 Anthony Damico 565.34µs 794.76µs 1123. 0B 12.5 537 6
7 Ric S 13.51ms 15.59ms 62.4 64.2KB 6.69 28 3
8 Darren Tsai 3.75ms 5.02ms 199. 22.7KB 8.54 93 4
9 moodymudskipper 12.88ms 15.88ms 63.5 26.5KB 6.81 28 3
10 GKi 31.58µs 35.73µs 21448. 0B 17.2 9992 8
11 GKi2 29.41µs 32.77µs 23717. 0B 16.6 9993 7
In this case, GKi is about 15 times faster than the second-fastest method and belongs to the group that allocates no additional memory.