Asked by: Vincent ISOZ  Asked: 8/15/2023  Modified: 8/15/2023  Views: 81
R Text Clustering (which words belong to which cluster)
Q:
I took the following text-clustering code script from here:
https://medium.com/@SAPCAI/text-clustering-with-r-an-introduction-for-data-scientists-c406e7454e76
#we first create the data frame structure in memory
dataframe <- data.frame(ID = character(),
                        datetime = character(),
                        content = character(),
                        label = factor())
#we download the data (zip file) and unzip it to the hard drive
source.url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip'
target.directory <- '/tmp/clustering-r'
temporary.file <- tempfile()
download.file(source.url, temporary.file)
unzip(temporary.file, exdir = target.directory)
#we read the files
target.directory <- paste(target.directory, 'Health-Tweets', sep = '/')
files <- list.files(path = target.directory, pattern='.txt$')
#filling the dataframe by reading the text content
for (f in files) {
  news.filename <- paste(target.directory, f, sep = '/')
  news.label <- substr(f, 1, nchar(f) - 4) #we strip the ".txt" extension
  news.data <- read.csv(news.filename,
                        encoding = "UTF-8",
                        header = FALSE,
                        quote = "",
                        sep = "|",
                        col.names = c("ID", "datetime", "content"))
  #we get rid of empty rows
  news.data <- news.data[news.data$content != "", ]
  news.data['label'] <- news.label #we add a label to each tweet
  #if your computer is not powerful enough, decrease the percentage
  #of the initial dataset that is kept
  percent_of_dataset <- 0.3
  news.data <- head(news.data, floor(nrow(news.data) * percent_of_dataset))
  dataframe <- rbind(dataframe, news.data) #row appending
}
#deleting the temporary directory
unlink(target.directory, recursive = TRUE)
#just look at a part of the content
head(dataframe)
#a little bit of cleaning to get rid of most URLs
library("stringr")
sentences <- str_replace_all(dataframe$content, "https?://[^\\s]+", "")
#we look at a part of the result
head(sentences)
#part where we prepare the corpus
library("tm")
corpus <- tm::Corpus(tm::VectorSource(sentences))
corpus
#classical cleaning part
#removal of English stop-words
corpus.cleaned <- tm::tm_map(corpus, tm::removeWords, tm::stopwords("english"))
#stemming of words (applied to corpus.cleaned, otherwise the stop-word removal is lost)
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stemDocument, language = "english")
#removal of extra blank spaces
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stripWhitespace)
#we look at the content of the resulting object
corpus.cleaned
tdm <- tm::DocumentTermMatrix(corpus.cleaned)
#we look at the structure of the document-term matrix
tdm
dim(tdm) #its dimensions give an idea of the number of words involved
#we compute the TF-IDF weights
tdm.tfidf <- tm::weightTfIdf(tdm)
tdm.tfidf
tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf, sparse = 0.999)
tdm.tfidf
tfidf.matrix <- as.matrix(tdm.tfidf)
#we take a look at the content of the first 10 rows and 10 columns
tfidf.matrix[1:10, 1:10]
#we will now compute the cosine distance
library("proxy")
#we start a timer because it's interesting to see how long it takes
#depending on the initial percentage of the dataset
#100% takes 2 days of computation
ptm <- proc.time()
dist.matrix <- proxy::dist(tfidf.matrix, method = "cosine")
#we stop the timer
proc.time() - ptm
str(dist.matrix)
#now we run the clustering part
library("dbscan")
truth.K <- 16 #because why not...
clustering.kmeans <- kmeans(tfidf.matrix, truth.K)
str(clustering.kmeans) #just to see the result structure
clustering.hierarchical <- hclust(dist.matrix, method = "ward.D2")
str(clustering.hierarchical) #just to see the result structure
clustering.dbscan <- dbscan::hdbscan(dist.matrix, minPts = 10)
str(clustering.dbscan) #just to see the result structure
#now comes the part where we want to know which of the top X most frequent
#words belong to which cluster, with the corresponding plot (where we can see the words)
#????
How can I know which of the top X (most frequent) words belong to which cluster for kmeans, dbscan and hclust? I tried for 2 days but I failed (I am not a professional, I am doing this for fun...).
Thanks for your help.
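For what it's worth, here is a minimal sketch of one possible direction, not a verified answer: take the X most frequent terms and assign each one to the cluster in which its mean TF-IDF weight is highest. It reuses the tfidf.matrix, tdm, truth.K, clustering.kmeans, clustering.hierarchical and clustering.dbscan objects built in the script above; the cutoff X, the helper assign.terms and the "highest mean TF-IDF" rule are assumptions rather than anything from the original article.
#possible sketch (assumption): map the top X most frequent terms to clusters
X <- 20
term.counts <- colSums(as.matrix(tdm)) #raw term counts (can be memory-hungry)
term.counts <- term.counts[colnames(tfidf.matrix)] #keep only terms that survived removeSparseTerms
top.terms <- names(sort(term.counts, decreasing = TRUE))[1:X]
#document-level cluster labels from the three methods
kmeans.labels <- clustering.kmeans$cluster
hclust.labels <- cutree(clustering.hierarchical, k = truth.K)
dbscan.labels <- clustering.dbscan$cluster #0 means "noise" for hdbscan
#for a given labelling, assign each top term to the cluster where its mean TF-IDF weight is highest
assign.terms <- function(labels, mat, terms) {
  sapply(terms, function(term) {
    means <- tapply(mat[, term], labels, mean)
    names(which.max(means))
  })
}
assign.terms(kmeans.labels, tfidf.matrix, top.terms)
assign.terms(hclust.labels, tfidf.matrix, top.terms)
assign.terms(dbscan.labels, tfidf.matrix, top.terms)
Each call returns a named vector (term to cluster label); a barplot of term.counts[top.terms] coloured by those labels would be one way to get a plot where the words are visible.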
A: No answers yet