Error in tm package while topic modelling

Asked by: I_like_insights Asked: 12/29/2022 Updated: 12/29/2022 Views: 58

Q:

I am getting an error when trying to create a corpus object with the tm package in R.

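The data is scraped from a website, and I have included the full code below so you can run it and see how the data is collected and how the data frame is created. The last line of code is where I am stuck! (I have shortened the loop, so it should run in a few seconds.)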

Any help would be greatly appreciated. :)

library(tidyverse)
library(rvest)
########################################## 
# WEB SCRAPING FROM SCHOLARLYKITCHEN.COM #
##########################################

# loop that appends successive page numbers to the archive URL
# keep the loop range small for testing before the full data is pulled in
output <- character()
for (i in 1:2) { 
  
  article.links <- paste0("https://scholarlykitchen.sspnet.org/archives/page/", i ,"/") %>%
    read_html() %>%
    html_nodes(".list-article__title") %>%
    html_nodes("a") %>% 
    html_attr("href")
  
  output <- c(output, article.links) 
  
}

# get all comments
get.comments <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>% html_nodes(".comment") %>% html_text() %>% trimws(which = "both")
  return(article.comments)
}

text <- sapply(output, FUN = get.comments, USE.NAMES = FALSE)

# get all dates
get.dates <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>% html_nodes(".comment__meta__date") %>% html_text() %>% trimws(which = "both")
  return(article.comments)
}

dates <- sapply(output, FUN = get.dates, USE.NAMES = FALSE)

# create the main df for the analysis
df <- tibble(
    text = unlist(text, recursive = TRUE), # unlist is needed because sapply (for some reason) creates a list
    dates = unlist(dates, recursive = TRUE)
)

# extract dates from meta data
df$dates <- as.character(gsub(",","",df$dates))
df$dates <- as.Date(df$dates, "%B%d%Y")


###################
# TOPIC MODELLING #
###################
library(tm)
library(topicmodels)

# create df ready for topic modelling
# this needs to have very specifically named columns

df.tm <- df[-2] # create duplicate for backup (dates not needed for topic modelling yet)
df.tm$doc_id <- row.names(df) # create a unique id for each row, as required by the tm package
df.tm <- df.tm[c(2,1)] # reorder the columns so that doc_id comes first, then text

# From the comments text, create the corpus
corpus <- VCorpus(DataframeSource(df))

The error is as follows:

Error in DataframeSource(df) : 
  all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
R LDA TM topic modelling

Comments

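3 upvotes user20650 12/29/2022
Punt: did you mean to use df.tm in the last line?
0 upvotes I_like_insights 12/29/2022
My R session is currently tied up, but I will try this and get back to you as soon as I have tested it. If this is my mistake, I will be very annoyed with myself! Haha
1 upvote I_like_insights 12/29/2022
Yes, that was it! After all that code, I made the silliest mistake. Fixed, thank you.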

A:

0 upvotes Nicolás Velasquez 12/29/2022 #1

DataframeSource() requires the data frame to have a document index in its first column, and that column must be named "doc_id".
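
The condition quoted in the error message can be evaluated directly in base R to see why a data frame is rejected. The following is a minimal sketch, not tm's own code: the toy data frame and the helper name has_required_columns are made up for illustration, and only the expression inside the helper is taken from the error message.

```r
# Toy data frame standing in for the scraped comments (made-up values).
toy <- data.frame(
  text  = c("first comment", "second comment"),
  dates = as.Date(c("2022-12-28", "2022-12-29"))
)

# The condition quoted in the error message, wrapped in a helper.
# `has_required_columns` is a hypothetical name, not part of tm.
has_required_columns <- function(x) {
  all(!is.na(match(c("doc_id", "text"), names(x))))
}

has_required_columns(toy)     # FALSE: there is no doc_id column

# Add an id column and put doc_id first, text second.
toy_tm <- data.frame(
  doc_id = as.character(seq_len(nrow(toy))),
  text   = toy$text,
  stringsAsFactors = FALSE
)

has_required_columns(toy_tm)  # TRUE
```

Running the same check on df and on df.tm from the question should show which of the two objects is safe to pass to DataframeSource().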

Try:

df_with_id <- rowid_to_column(df, var = "doc_id") # Alternatively, generate a doc index that better represents your collection of documents.
corpus <- VCorpus(DataframeSource(df_with_id)) # pass the data frame that has the doc_id column, not the original df

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 1
Content:  documents: 141
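
For completeness, here is a minimal sketch of the path from a two-column data frame to a corpus, assuming the tm package is installed. The toy comments are made up. In the question's own code the equivalent step is passing df.tm (which already has doc_id and text) rather than df, as user20650's comment suggests.

```r
library(tm)

# Made-up stand-in for the scraped comments; column names and order follow
# what DataframeSource() expects (doc_id first, then text).
toy_tm <- data.frame(
  doc_id = c("1", "2", "3"),
  text   = c("Open access is growing.",
             "Peer review takes time.",
             "Preprints speed things up."),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(toy_tm))

length(corpus)        # number of documents in the corpus
content(corpus[[1]])  # text of the first document
```

Any further columns in the data frame are kept as document-level metadata, which is one way the dates could be carried along later.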