Asked by: Loudog3232 Asked: 10/13/2023 Last edited by: thelatemail, Loudog3232 Updated: 10/13/2023 Views: 47
How do I concatenate consecutive rows in R based on a unique set of ID variables and a consecutively increasing and resetting variable?
Q:
Example data frame:
df <- data.frame(ID = c(1, 1, 2, 2, 2, 2, 2),
                 Name = c("Alice", "Alice", "Bob", "Bob", "Bob", "Bob", "Bob"),
                 Age = c(25, 25, 30, 30, 30, 30, 30),
                 LINE = c(1, 2, 1, 2, 1, 2, 3),
                 NOTE_TEXT = c("This is the fir", "st note",
                               "This is the seco", "nd note",
                               "This is ", "the th", "ird note"))
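Essentially, because of a character limit in the pull from the data source, each full "note" is split into "NOTE_TEXT" substrings spread across multiple rows. Substrings belonging to the same note are listed consecutively and numbered by another variable called "LINE", which can be any value from 1 to 65 (the vast majority are within 4 lines). I want to combine the "NOTE_TEXT" values belonging to the same note into a single row, and create a new variable that identifies each note within the same set of ID, Name, and Age variables.
The resulting data frame would look like this: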
data.frame(ID = c(1, 2, 2),
           Name = c("Alice", "Bob", "Bob"),
           Age = c(25, 30, 30),
           Note = c(1, 1, 2),
           NOTE_TEXT = c("This is the first note",
                         "This is the second note",
                         "This is the third note"))
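I think I need some kind of for loop that iterates over "LINE" within each unique set of variables, but I'm not sure where to start. Thanks for your help!
A:
1 upvote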
Jon Spring
10/13/2023
#1
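library(dplyr)  # .by requires dplyr >= 1.1.0; the native pipe |> requires R >= 4.1

df |>
  # Number the notes within each ID/Name/Age group: the count rises at every LINE == 1
  mutate(note_num = cumsum(LINE == 1), .by = c(ID, Name, Age)) |>
  summarize(NOTE_TEXT = paste(NOTE_TEXT, collapse = ""),
            .by = c(ID, Name, Age, note_num))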
Result
  ID  Name Age note_num               NOTE_TEXT
1  1 Alice  25        1  This is the first note
2  2   Bob  30        1 This is the second note
3  2   Bob  30        2  This is the third note
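To see why cumsum(LINE == 1) numbers the notes, here is a minimal base R sketch on just the ID and LINE columns from the question. Using ave() for the per-group step is my own substitution for illustration; it is not part of the answer above, which does the grouping with .by.

```r
# Sketch only: base R illustration of the cumsum(LINE == 1) idea.
ID   <- c(1, 1, 2, 2, 2, 2, 2)
LINE <- c(1, 2, 1, 2, 1, 2, 3)

# TRUE wherever a new note begins
starts <- LINE == 1

# Running count of note starts, restarted for each ID
note_num <- ave(as.integer(starts), ID, FUN = cumsum)
note_num
# [1] 1 1 1 1 2 2 2
```

Every row up to the next LINE == 1 shares the same counter value, so the counter can serve as the grouping key for pasting the pieces together.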
Comments
0 upvotes
Loudog3232
10/14/2023
Thanks Jon, this is exactly what I needed. I appreciate your help.
1 upvote
gl00ten
10/13/2023
#2
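# Assumes rows are already ordered so that each note's lines are consecutive
# and every note starts with LINE == 1.
text <- df$NOTE_TEXT[1]            # text of the note currently being built
CONCAT_NOTE_TEXT <- character(0)   # completed notes

# seq_len(nrow(df))[-1] is empty for a one-row data frame, unlike 2:nrow(df)
for (i in seq_len(nrow(df))[-1]) {
  if (df$LINE[i] != 1) {
    # Continuation line: append to the current note
    text <- paste0(text, df$NOTE_TEXT[i])
  } else {
    # LINE reset to 1: store the finished note and start a new one
    CONCAT_NOTE_TEXT <- c(CONCAT_NOTE_TEXT, text)
    text <- df$NOTE_TEXT[i]
  }
}
CONCAT_NOTE_TEXT <- c(CONCAT_NOTE_TEXT, text)  # store the last note

# One output row per note, taking the ID columns from each note's first line
result_df <- data.frame(
  ID = df$ID[df$LINE == 1],
  Name = df$Name[df$LINE == 1],
  Age = df$Age[df$LINE == 1],
  CONCAT_NOTE_TEXT = CONCAT_NOTE_TEXT
)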
  ID  Name Age        CONCAT_NOTE_TEXT
1  1 Alice  25  This is the first note
2  2   Bob  30 This is the second note
3  2   Bob  30  This is the third note
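The loop does not produce the per-group note number the question asked for. As a rough sketch, the same result plus that number can be built in base R without an explicit loop, using ave() and aggregate(). This is an alternative I am adding for illustration, not the answerer's code. The column name Note is taken from the desired output in the question, and the rows are assumed to already be in the correct order within each note.

```r
# Sketch only: loop-free base R version that also numbers the notes.
df <- data.frame(ID = c(1, 1, 2, 2, 2, 2, 2),
                 Name = c("Alice", "Alice", "Bob", "Bob", "Bob", "Bob", "Bob"),
                 Age = c(25, 25, 30, 30, 30, 30, 30),
                 LINE = c(1, 2, 1, 2, 1, 2, 3),
                 NOTE_TEXT = c("This is the fir", "st note",
                               "This is the seco", "nd note",
                               "This is ", "the th", "ird note"))

# One grouping key per ID/Name/Age combination
grp <- interaction(df$ID, df$Name, df$Age, drop = TRUE)

# Note counter within each group: increases whenever LINE resets to 1
df$Note <- ave(as.integer(df$LINE == 1), grp, FUN = cumsum)

# Paste each note's pieces together in their existing row order
out <- aggregate(NOTE_TEXT ~ ID + Name + Age + Note, data = df,
                 FUN = paste, collapse = "")

# Sort explicitly rather than relying on aggregate()'s group ordering
out <- out[order(out$ID, out$Note), ]
rownames(out) <- NULL
out
```

The extra argument collapse = "" is forwarded by aggregate() to paste(), so each group's substrings are joined without a separator.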
Comments
1 upvote
Loudog3232
10/14/2023
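Thanks gl00ten for replying and for showing me how to do this in base R with a loop. I prefer the dplyr approach because it is more concise and scales to my actual dataset, but I learned a lot from this answer.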