Asked by: I_like_insights  Asked: 12/29/2022  Updated: 12/29/2022  Views: 58
Error in tm package while topic modelling
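Q:
I am getting an error when trying to create a corpus object with the tm package in R.
The data is scraped from a website, and I have included the full code below so you can run it and see how the data is collected and how the data frame is built. The last line of code is where I am stuck! (I have reduced the loop range, so it should run in a few seconds.)
Any help would be greatly appreciated. :)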
library(tidyverse)
library(rvest)
##########################################
# WEB SCRAPING FROM SCHOLARLYKITCHEN.COM #
##########################################
# loop that appends each page number to the archive URL
# keep the page range small for testing before the full data is pulled in
output <- character()
for (i in 1:2) {
  article.links <- paste0("https://scholarlykitchen.sspnet.org/archives/page/", i, "/") %>%
    read_html() %>%
    html_nodes(".list-article__title") %>%
    html_nodes("a") %>%
    html_attr("href")
  output <- c(output, article.links)
}
# get all comments
get.comments <- function(output) {
  article.page <- read_html(output)
  article.comments <- article.page %>%
    html_nodes(".comment") %>%
    html_text() %>%
    trimws(which = "both")
  return(article.comments)
}
text <- sapply(output, FUN = get.comments, USE.NAMES = FALSE)
# get all dates
get.dates <- function(output) {
  article.page <- read_html(output)
  article.dates <- article.page %>%
    html_nodes(".comment__meta__date") %>%
    html_text() %>%
    trimws(which = "both")
  return(article.dates)
}
dates <- sapply(output, FUN = get.dates, USE.NAMES = FALSE)
# create the main df for the analysis
df <- tibble(
  text = unlist(text, recursive = TRUE), # unlist is needed because sapply returns a list here
  dates = unlist(dates, recursive = TRUE)
)
# extract dates from meta data
df$dates <- as.character(gsub(",","",df$dates))
df$dates <- as.Date(df$dates, "%B%d%Y")
###################
# TOPIC MODELLING #
###################
library(tm)
library(topicmodels)
# create df ready for topic modelling
# this needs to have very specifically named columns
df.tm <- df[-2] # create duplicate for backup (dates not needed for topic modelling yet)
df.tm$doc_id <- row.names(df) # create a unique id for each row, as needed by the tm package
df.tm <- df.tm[c(2,1)] # reorder the columns so doc_id comes first
# From the comments text, create the corpus
corpus <- VCorpus(DataframeSource(df))
The error is as follows:
Error in DataframeSource(df) :
all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
A:
0 votes
Nicolás Velasquez
12/29/2022
#1
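DataframeSource() requires the data frame to have a document index in its first column, and that column must be named "doc_id". The document text must be in a column named "text".
Try: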
df_with_id <- rowid_to_column(df, var = "doc_id") # Alternatively, generate a doc index that better represents your collection of documents.
corpus <- VCorpus(DataframeSource(df_with_id)) # pass the data frame that has doc_id, not the original df
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 141
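To reproduce this without running the scraper, here is a minimal sketch on a toy data frame. The two sentences and the names toy and toy_with_id are made up for illustration and are not part of the question's data. It shows the check failing when doc_id is missing and passing once the columns are doc_id followed by text.

```r
library(tm)

# Toy data standing in for the scraped comments (hypothetical values)
toy <- data.frame(
  text = c("Open access is growing.", "Peer review takes time."),
  stringsAsFactors = FALSE
)

# Without a doc_id column, DataframeSource() stops with the error from the question
res <- try(DataframeSource(toy), silent = TRUE)
inherits(res, "try-error")

# Put doc_id first and text second, then build the corpus
toy_with_id <- data.frame(
  doc_id = as.character(seq_len(nrow(toy))),
  text = toy$text,
  stringsAsFactors = FALSE
)
corpus <- VCorpus(DataframeSource(toy_with_id))
length(corpus)
```

The df.tm built in the question already has doc_id and text in that order, so passing df.tm instead of df to DataframeSource() should work the same way.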
Comments
df.tm