Asked by: USER12345 Asked: 8/10/2023 Last edited by: PatrickUSER12345 Updated: 8/10/2023 Views: 33
Text mining in R: delete first sentence of each document
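Q:
I have several documents and do not need the first sentence of each one. So far I have not been able to find a solution.
Here is an example. The data is structured as follows: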
case_number | text |
---|---|
1 | Today is a good day. It is sunny. |
2 | Today is a bad day. It is rainy. |
So the result should look like this:
case_number | text |
---|---|
1 | It is sunny. |
2 | It is rainy. |
Here is the sample dataset:
case_number <- c(1, 2)
text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")
data <- data.frame(case_number, text)
A:
1 upvote
margusl
8/10/2023
#1
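If the sentences may contain punctuation of their own (e.g. abbreviations or numbers) and you are using a text mining library anyway, it makes perfect sense to let that library handle the tokenization.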
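To illustrate the concern, here is a minimal base R sketch (not part of the answer's approach; `drop_first_sentence` is a hypothetical helper) that cuts at the first period, exclamation mark, or question mark followed by whitespace. It works on the simple sample but splits too early on an abbreviation such as "avg.":

```r
# Naive approach: remove everything up to the first sentence terminator
# that is followed by whitespace. `drop_first_sentence` is a hypothetical
# helper name used only for this illustration.
drop_first_sentence <- function(x) {
  sub("^.*?[.!?]\\s+", "", x, perl = TRUE)
}

simple <- c("Today is a good day. It is sunny.",
            "Today is a bad day. It is rainy.")
drop_first_sentence(simple)
#> [1] "It is sunny." "It is rainy."

# An abbreviation inside the first sentence triggers a false split
tricky <- "Today is a good day, above avg. for sure, by 5.1 points. It is sunny."
drop_first_sentence(tricky)
#> [1] "for sure, by 5.1 points. It is sunny."
```

The second result still contains the tail of the first sentence, which is the kind of case a sentence tokenizer is meant to handle.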
With {tidytext}:
library(dplyr)
library(tidytext)
# example with punctuation in 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize to sentences, converting tokens to lowercase is optional
data %>%
  unnest_sentences(s, text)
#> case_number s
#> 1 1 today is a good day, above avg. for sure, by 5.1 points.
#> 2 1 it is sunny.
#> 3 2 today is a bad day.
#> 4 2 it is rainy.
# drop 1st record of every case_number group (.by requires dplyr >= 1.1.0)
data %>%
  unnest_sentences(s, text) %>%
  filter(row_number() > 1, .by = case_number)
#> case_number s
#> 1 1 it is sunny.
#> 2 2 it is rainy.
Created on 2023-08-10 with reprex v2.0.2
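The pipeline above returns one lowercased row per remaining sentence. If the one-row-per-document shape and the original capitalization are wanted, as in the question's expected output, one possible extension (a sketch, not part of the original answer) is to pass `to_lower = FALSE` and paste the remaining sentences back together per `case_number`. It assumes dplyr 1.1.0 or later for the `.by` argument.

```r
library(dplyr)
library(tidytext)

data <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day. It is sunny.",
           "Today is a bad day. It is rainy.")
)

result <- data %>%
  # one row per sentence, keeping the original capitalization
  unnest_sentences(s, text, to_lower = FALSE) %>%
  # drop the first sentence of every document
  filter(row_number() > 1, .by = case_number) %>%
  # glue the remaining sentences back into one row per document
  summarise(text = paste(s, collapse = " "), .by = case_number)

result
```

Note that a document consisting of a single sentence has no rows left after the `filter()` step, so it disappears from `result` rather than appearing with an empty `text`.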