R Text Clustering (which words belong to which cluster?)

Asked by: Vincent ISOZ · Asked: 8/15/2023 · Updated: 8/15/2023 · Views: 81

Q:

I took the following text-clustering script from here:

https://medium.com/@SAPCAI/text-clustering-with-r-an-introduction-for-data-scientists-c406e7454e76

#we first create the dataframe structure in memory
dataframe <- data.frame(ID=character(), 
                      datetime=character(), 
                      content=character(), 
                      label=factor()) 

#we download the data (zip file) and unzip it on the hard drive
source.url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip' 
target.directory <- '/tmp/clustering-r' 
temporary.file <- tempfile() 
download.file(source.url, temporary.file) 
unzip(temporary.file, exdir = target.directory) 

#we read the files
target.directory <- paste(target.directory, 'Health-Tweets', sep = '/') 
files <- list.files(path = target.directory, pattern = '\\.txt$') #the dot is escaped so only .txt files match
#filling the dataframe by reading the text content 
for (f in files) { 
  news.filename = paste(target.directory , f, sep ='/') 
  news.label <- substr(f, 1, nchar(f) - 4) #we strip the ".txt" extension
  news.data <- read.csv(news.filename, 
                        encoding = "UTF-8", 
                        header = FALSE, 
                        quote = "", 
                        sep = "|", 
                        col.names = c("ID", "datetime", "content")) 
  #we get rid of empty rows
  news.data <- news.data[news.data$content != "", ] 
  news.data['label'] = news.label #we add a label to each tweet
  #if your computer is not powerful enough, decrease the percentage
  #from the initial dataset size
  percent_of_dataset <- 0.3
  news.data <- head(news.data, floor(nrow(news.data) * percent_of_dataset))
  dataframe <- rbind(dataframe, news.data) #row appending
} 

#deleting the temporary directory
unlink(target.directory, recursive = TRUE)

#just look at a part of the content
head(dataframe)

#a little bit of cleaning to get rid of most URLs
library("stringr")
sentences<-str_replace_all(dataframe$content,"https?://[^\\s]+", "")
#we look at a part of the result
head(sentences)

#part where we prepare the corpus
library("tm")
corpus <- tm::Corpus(tm::VectorSource(sentences)) 
corpus
#classical cleaning part
#removal of English stop-words
corpus.cleaned <- tm::tm_map(corpus, tm::removeWords, tm::stopwords("english"))
#stemming of words (note: we stem the already-cleaned corpus, not the original)
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stemDocument, language = "english")
#stripping extra whitespace
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stripWhitespace)
#we look at the resulting object
corpus.cleaned
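#(not in the original script) printing a corpus only shows a summary;
#tm's inspect() shows the actual cleaned documents, e.g. the first three:
tm::inspect(corpus.cleaned[1:3])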

tdm <- tm::DocumentTermMatrix(corpus.cleaned) 
#we look at the structure of the document-term matrix
tdm
dim(tdm) #its dimensions give an idea of the number of terms involved
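#(not in the original script) since the question is about the most frequent
#words, tm::findFreqTerms() lists the terms occurring at least `lowfreq`
#times in the document-term matrix; 50 is an arbitrary threshold here
tm::findFreqTerms(tdm, lowfreq = 50)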
#we compute the TfIdf Weight
tdm.tfidf.full <- tm::weightTfIdf(tdm)
tdm.tfidf.full

tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf.full, sparse = 0.999)
tdm.tfidf
tfidf.matrix <- as.matrix(tdm.tfidf) 
#we take a look at the content of the first 10 rows and 10 columns
tfidf.matrix[1:10, 1:10] #R indexing starts at 1, not 0

#we will now compute the cosine distance
library("proxy")
#we start a timer because it's interesting to see how long it takes
#depending on the initial percentage of the dataset
#100% takes 2 days of computation
ptm <- proc.time()
dist.matrix <- proxy::dist(tfidf.matrix, method = "cosine")
#we stop the timer
proc.time() - ptm
str(dist.matrix)

#now we run the clustering part
library("dbscan")
truth.K <- 16 #because why not...
clustering.kmeans <- kmeans(tfidf.matrix, truth.K) 
str(clustering.kmeans) #just to see the result structure
clustering.hierarchical <- hclust(dist.matrix, method = "ward.D2")
str(clustering.hierarchical) #just to see the result structure
clustering.dbscan <- dbscan::hdbscan(dist.matrix, minPts = 10)
str(clustering.dbscan) #just to see the result structure
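#(not in the original script) a quick look at how many tweets each method
#puts in each cluster; the hclust tree has to be cut into truth.K groups
table(clustering.kmeans$cluster)
table(cutree(clustering.hierarchical, k = truth.K))
table(clustering.dbscan$cluster) #cluster 0 = noise points for hdbscan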

#now comes the part where we want to know which of the top X most frequent
#words belong to which cluster, with a corresponding plot (where we can see the words)
#????

How can I know which of the top X (most frequent) words belong to which cluster for kmeans, dbscan and hclust? I tried for 2 days but failed (I am not a professional, I am doing this for fun...).

Thanks for your help.
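For reference, below is a minimal sketch of one possible reading of the missing step, not a definitive answer: take the X terms with the largest total TF-IDF weight, assign each term to the cluster of documents in which it carries the most weight, and plot the words by cluster. The names top.x and word.cluster, the choice X = 20, and the plot layout are all hypothetical, not from the original article.

#a minimal sketch, hypothetical names throughout
top.x <- 20 #hypothetical choice of X
term.weight <- colSums(tfidf.matrix) #total TF-IDF weight per term
top.terms <- names(sort(term.weight, decreasing = TRUE))[1:top.x]

#for a given document clustering, sum each term's weight per cluster
#and return the label of the cluster where the term is heaviest
word.cluster <- function(doc.clusters, weight.matrix, terms) {
  sapply(terms, function(w) {
    per.cluster <- tapply(weight.matrix[, w], doc.clusters, sum)
    names(which.max(per.cluster))
  })
}

#kmeans stores one cluster id per document in $cluster
words.kmeans <- word.cluster(clustering.kmeans$cluster, tfidf.matrix, top.terms)
#the hclust tree is cut into truth.K groups first
words.hclust <- word.cluster(cutree(clustering.hierarchical, k = truth.K),
                             tfidf.matrix, top.terms)
#hdbscan also stores one id per document (0 = noise)
words.dbscan <- word.cluster(clustering.dbscan$cluster, tfidf.matrix, top.terms)

#side-by-side comparison of the three assignments
data.frame(term = top.terms, kmeans = words.kmeans,
           hclust = words.hclust, dbscan = words.dbscan)

#quick plot for k-means: each word placed at (cluster, frequency rank)
x <- as.integer(words.kmeans)
plot(x, seq_along(top.terms), type = "n",
     xlab = "k-means cluster", ylab = "term rank (1 = heaviest)")
text(x, seq_along(top.terms), labels = top.terms)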

r k-means text-mining hierarchical-clustering dbscan

A: No answers yet