Error in tm package while topic modelling

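Question:

I am getting an error when trying to create a corpus object with the tm package in R.

The data was scraped from a website, and I have included the full code below so you can run it and see how the data is collected and how the data frame is created. The last line of code is where I am stuck! (I have modified the loop so it should run in a few seconds.)

Any help would be much appreciated. :)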

library(tidyverse)
library(rvest)
########################################## 
# WEB SCRAPING FROM SCHOLARLYKITCHEN.COM #
##########################################

# loop that appends each page number onto the archive URL
# keep the loop range small for testing before the full data is pulled in
output <- character()
for (i in 1:2) { 
  
  article.links <- paste0("https://scholarlykitchen.sspnet.org/archives/page/", i ,"/") %>%
    read_html() %>%
    html_nodes(".list-article__title") %>%
    html_nodes("a") %>% 
    html_attr("href")
  
  output <- c(output, article.links) 
  
}

# get all comments
get.comments <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>% html_nodes(".comment") %>% html_text() %>% trimws(which = "both")
  return(article.comments)
}

text <- sapply(output, FUN = get.comments, USE.NAMES = FALSE)

# get all dates
get.dates <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>% html_nodes(".comment__meta__date") %>% html_text() %>% trimws(which = "both")
  return(article.comments)
}

dates <- sapply(output, FUN = get.dates, USE.NAMES = FALSE)

# create the main df for the analysis
df <- tibble(
    text = unlist(text, recursive = TRUE), # unlist is needed because sapply (for some reason) creates a list
    dates = unlist(dates, recursive = TRUE)
)

# extract dates from meta data
df$dates <- as.character(gsub(",","",df$dates))
df$dates <- as.Date(df$dates, "%B%d%Y")


###################
# TOPIC MODELLING #
###################
library(tm)
library(topicmodels)

# create df ready for topic modelling
# this needs to have very specifically named columns

df.tm <- df[-2] # create duplicate for backup (dates not needed for topic modelling yet)
df.tm$doc_id <- row.names(df) # create a unique id for each row as is needed by the tm package
df.tm <- df.tm[c(2,1)] # reorders the columns

# From the comments text, create the corpus
corpus <- VCorpus(DataframeSource(df))

The error is as follows:

Error in DataframeSource(df) : 
  all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
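One possible reading of this error: the failing check names the columns `doc_id` and `text`, and the call above passes `df` (columns `text` and `dates`) rather than `df.tm`, which was built with exactly those two columns. Below is a minimal sketch under that assumption. It uses a small made-up data frame in place of the scraped comments, so the sample values are hypothetical, and it has not been run against the real data.

```r
library(tm)

# Hypothetical stand-in for the scraped comments (not real data)
df <- data.frame(
  text  = c("First comment.", "Second comment."),
  dates = as.Date(c("2020-01-01", "2020-01-02")),
  stringsAsFactors = FALSE
)

# Build a frame with the column names the failing check looks for:
# doc_id first, then text
df.tm <- data.frame(
  doc_id = as.character(seq_len(nrow(df))),
  text   = df$text,
  stringsAsFactors = FALSE
)

# Pass df.tm (not df) so both required names are present
corpus <- VCorpus(DataframeSource(df.tm))
length(corpus)
```

If this reading is right, changing the last line of the original script to `VCorpus(DataframeSource(df.tm))` should get past the name check.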

R LDA tm topic-modelling
