Counting the most frequent bigrams across 1500 IDs without counting repeats within a single ID

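Q:

I am trying to count the most frequent bigrams across 1500 IDs (one ID per row, each with one event) without counting a bigram more than once per ID (row). For example, given the IDs below, I want to count "work day" only once within each ID. In my analysis, the summary count for "work day" should be 2. Once "work day" has been counted for an ID, I don't want it counted again for that ID.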

ID Text
1  "The work day was horrible. On this particular work day, I made 5 mistakes....."
2  "This long work day was the best for me. I miss a long work day, because I get into a rhythm....."

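Here is my code. It gives the totals for the 40 most frequent bigrams as a bar chart showing each two-word bigram and its count. I'm not sure whether it counts a bigram more than once per ID, as in the example above, though I believe it simply takes all the "Event" text and counts how many times each two-word bigram occurs, regardless of ID.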

Sum1 %>% 
    unnest_tokens(word, "Event", token = "ngrams", n = 2) %>% 
    separate(word, c("word1", "word2"), sep = " ") %>% 
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>% 
    unite(word,word1, word2, sep = " ") %>% 
    count(word, sort = TRUE) %>% 
    slice(1:40) %>% 
    ggplot() + geom_bar(aes(x=reorder(word,n), y=n), stat = "identity", fill = "#de5833") +
    theme_minimal() +
    coord_flip()

Tags: r, nlp, text-mining

A:

0 votes I_O 8/15/2023 #1

Something like this?

    library(tidytext)
    library(dplyr)

    d <- data.frame(ID = 1:2,
                    txt = c('a particular word', 
                            'a particular word a phrase and a particular word')
                    )

## > d

      ID                                              txt
    1  1                                a particular word
    2  2 a particular word a phrase and a particular word

Use base R's strsplit and Filter to remove stop words from the original text, then keep only the distinct bigrams per ID:

d |>
  rowwise() |>
  mutate(txt = strsplit(txt, split = '\\s')[[1]] |> 
           Filter(f = \(x) !(x %in% get_stopwords()$word)) |>
           paste(collapse = ' ')
         ) |>
  unnest_tokens(input = txt, output = 'tokens',
                token = 'ngrams', n = 2) |>
  distinct(ID, tokens)

(strsplit returns a list whose single item, the word vector, must be extracted with [[1]] before Filtering.)
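To make the note about [[1]] concrete, here is a minimal base R sketch of the split-and-filter step on its own. The short `stops` vector is a hypothetical stand-in for get_stopwords()$word, used only so the result is easy to check by eye.

```r
# Hypothetical stop-word list for illustration; the answer uses get_stopwords()$word
stops <- c("a", "and", "the")

txt <- "a particular word a phrase and a particular word"

parts <- strsplit(txt, split = "\\s")  # a list of length 1
words <- parts[[1]]                    # the character vector of words inside it

# Keep only the words that are not stop words, then glue them back together
kept <- Filter(f = \(x) !(x %in% stops), words)
cleaned <- paste(kept, collapse = " ")
cleaned
```

Without the [[1]], Filter would be handed the list itself rather than the individual words, so the whole vector would be tested as a single element.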

Output:

# A tibble: 4 x 2
# Rowwise: 
     ID tokens           
  <int> <chr>            
1     1 particular word  
2     2 particular word  
3     2 word phrase      
4     2 phrase particular

Finally, count the bigrams like this:

## earlier steps (see above) 
## ... |>
count(tokens)

# A tibble: 3 x 2
# Rowwise: 
  tokens                n
  <chr>             <int>
1 particular word       2
2 phrase particular     1
3 word phrase           1
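The same distinct-then-count idea can probably be dropped straight into the pipeline from the question. The sketch below is only an illustration under stated assumptions: `Sum1` is a toy two-row stand-in for the asker's data with columns assumed to be named ID and Event, and the short `stops` vector is a hypothetical replacement for stop_words$word so the result can be checked by hand.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy stand-in for the asker's data (column names ID and Event are assumed)
Sum1 <- data.frame(
  ID = 1:2,
  Event = c(
    "The work day was horrible. On this particular work day, I made 5 mistakes.",
    "This long work day was the best for me. I miss a long work day, because I get into a rhythm."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical stop-word list; the question uses stop_words$word instead
stops <- c("the", "was", "on", "this", "i", "for", "me", "a", "because", "into")

bigram_counts <- Sum1 %>%
  unnest_tokens(word, Event, token = "ngrams", n = 2) %>%
  separate(word, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stops, !word2 %in% stops) %>%
  unite(word, word1, word2, sep = " ") %>%
  distinct(ID, word) %>%       # keep each bigram at most once per ID
  count(word, sort = TRUE)     # n is now the number of IDs containing the bigram

bigram_counts
```

Here "work day" occurs twice in each text but ends up with n = 2, one per ID. Because the stop words are filtered after tokenizing, only words that were adjacent in the original text form a bigram, whereas filtering the raw text first (as in the answer above) can join words that were separated by a stop word, such as "word phrase". The result can be passed on to slice(1:40) and the ggplot call from the question unchanged.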

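Comments

0 votes MfM 8/15/2023

The final output should have 'particular word' = 2 (once for ID 1 and once for ID 2), 'word phrase' = 1, and 'phrase particular' = 1. That way, if an ID repeats the same phrase, the final count is not inflated.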

0 votes I_O 8/15/2023

Sorry, missed that. Please see the edit.

0 votes MfM 8/15/2023

@I_O Thanks. Is |> a pipe equivalent to %>%?

0 votes MfM 8/15/2023

Also, do I need to filter out punctuation and remove all capital letters? If so, do I need to do that inside the mutate function?
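On the punctuation and capitalization question: unnest_tokens lowercases by default (to_lower = TRUE), and its ngram tokenizer drops punctuation, so the bigrams themselves come out normalized. The stop-word filter in the answer runs on the raw words before that step, though, so a word like "The" or "day," would not match a lowercase stop-word list. A minimal base R sketch of normalizing first, with a hypothetical `stops` vector standing in for get_stopwords()$word:

```r
# Hypothetical stop-word list for illustration only
stops <- c("the", "was", "on", "this", "i")

txt <- "The work day was horrible. On this particular work day, I made 5 mistakes."

# Lowercase and strip punctuation before splitting, so words match the stop list
clean <- gsub("[[:punct:]]", "", tolower(txt))
words <- strsplit(clean, split = "\\s+")[[1]]
kept <- words[!(words %in% stops)]
normalized <- paste(kept, collapse = " ")
normalized
```

The tolower and gsub calls could be placed inside the mutate step, just before strsplit.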

0 votes I_O 8/15/2023

Yes, |> is the native R pipe, which came after the tidyverse pipe %>%. You can certainly use the latter, though.
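For simple chains like the ones in this thread, the two pipes should be interchangeable. A small sketch, assuming R 4.1 or later (needed for |>) and dplyr loaded to provide %>%:

```r
library(dplyr)  # re-exports the magrittr pipe %>%

x <- c(4, 9, 16)

a <- x %>% sqrt() %>% sum()  # tidyverse (magrittr) pipe
b <- x |> sqrt() |> sum()    # native R pipe

identical(a, b)
```

Both forms pass the left-hand side as the first argument of the next call. They differ in some details, such as placeholder syntax, that do not come up here.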
