Text mining in R: delete first sentence of each document

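Question:

I have several documents and do not need the first sentence of each one. So far I have not been able to find a solution.

Here is an example. The data is structured as follows: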

case_number	text
1	Today is a good day. It is sunny.
2	Today is a bad day. It is rainy.

So the result should look like this:

case_number	text
1	It is sunny.
2	It is rainy.

Here is the sample dataset:

case_number <- c(1, 2)

text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")

data <- data.frame(case_number, text)

Tags: r, text-mining

Answer:

library(dplyr)
library(tidytext)

# example with punctuation in the 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize to sentences, converting tokens to lowercase is optional
data %>% 
  unnest_sentences(s, text)
#>   case_number                                                        s
#> 1           1 today is a good day, above avg. for sure, by 5.1 points.
#> 2           1                                             it is sunny.
#> 3           2                                      today is a bad day.
#> 4           2                                             it is rainy.
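
The lowercasing seen in the output above is the tokenizer's default, and the question's expected result keeps the original capitalization. A minimal sketch of preserving it, assuming `unnest_sentences()` forwards `to_lower = FALSE` to `unnest_tokens()` as the tidytext documentation describes. The names `data_cased` and `sentences_cased` are only illustrative and not part of the original answer:

```r
library(dplyr)
library(tidytext)

# illustrative copy of the question's sample data
data_cased <- data.frame(case_number = c(1, 2),
                         text = c("Today is a good day. It is sunny.",
                                  "Today is a bad day. It is rainy."))

# to_lower = FALSE (assumed to be passed through to unnest_tokens)
# keeps each sentence as it was written
sentences_cased <- data_cased %>%
  unnest_sentences(s, text, to_lower = FALSE)
```
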

# drop 1st record of every case_number group
data %>% 
  unnest_sentences(s, text) %>% 
  filter(row_number() > 1, .by = case_number)
#>   case_number            s
#> 1           1 it is sunny.
#> 2           2 it is rainy.

Created on 2023-08-10 with reprex v2.0.2
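
The approach above leaves one row per remaining sentence and drops any document that has only one sentence. If one row per document is wanted, as in the question's expected result, the remaining sentences can be pasted back together. The following is a sketch under a few assumptions, not part of the original answer: the `docs`, `rest`, and `result` names and the three-row sample are hypothetical, `to_lower = FALSE` is assumed to be forwarded to `unnest_tokens()`, and `.by` needs dplyr 1.1.0 or later.

```r
library(dplyr)
library(tidytext)

# hypothetical sample: a three-sentence, a two-sentence and a one-sentence document
docs <- data.frame(case_number = c(1, 2, 3),
                   text = c("Today is a good day. It is sunny. It is warm.",
                            "Today is a bad day. It is rainy.",
                            "Only one sentence here."))

# split into sentences, drop the first one per document,
# then collapse what is left back into a single string per document
rest <- docs %>%
  unnest_sentences(s, text, to_lower = FALSE) %>%
  filter(row_number() > 1, .by = case_number) %>%
  summarise(text = paste(s, collapse = " "), .by = case_number)

# left_join brings back documents that had only one sentence;
# their text is NA because nothing remained after the filter
result <- docs %>%
  select(case_number) %>%
  left_join(rest, by = "case_number")
```

Whether a single-sentence document should end up as `NA`, as an empty string, or be dropped entirely depends on the downstream analysis, so the `left_join()` step is optional.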
