Asked by: Rei Daemondheart · Asked: 7/30/2023 · Last edited by: doneforaiur, Rei Daemondheart · Updated: 8/1/2023 · Views: 52
WordCloud: problem with displaying bigrams
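Q:
I want to build a word cloud from scraped Twitter data. The problem is that the word "states" appears 214 times and "state" only 64 times, and only one tweet contains the combination "United State". Despite this, my word cloud is formed with that combination rather than the correct one.
My code for generating the word cloud: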
raw_tweets = []
STOPWORDS = [
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'would', 'whom', 'this', 'that', 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
'very', 't', 'can', 'will', 'just', 'don', 'should', 'now'
]
for tweet in df['Tweet']:
    raw_tweets.append(tweet)
raw_string = ''.join(raw_tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)
words = no_special_characters.split(" ")
words = [w for w in words if len(w) > 2]
words = [w.lower() for w in words]
import numpy as np
import matplotlib.pyplot as plt
import re
from PIL import Image
from wordcloud import WordCloud
from IPython.display import Image as im
mask = np.array(Image.open('Logo_location'))
wc = WordCloud(background_color="white", max_words=2000, mask=mask, stopwords=STOPWORDS, relative_scaling=1)
wc.generate(','.join(words))
f = plt.figure(figsize=(13,13))
plt.imshow(wc, interpolation='bilinear')
plt.title('Twitter Generated Cloud', size=30)
plt.axis("off")
plt.show()
The generated word cloud:
A:
0 votes
doneforaiur
7/30/2023
#1
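...the word "states" appears 214 times and "state" 64 times. Only one tweet contains the combination "United State". Despite this, my word cloud is formed with this combination.
You are generating a word cloud, not a keyword cloud. The two words just happened to land side by side for this specific mask; with a different mask the result may vary. Also, you call .split(" ") on tweets that were joined with ''.join(raw_tweets), so the output is already a list of single words. (I strongly suggest you use ' '.join(raw_tweets) instead; otherwise the end of one tweet and the start of the next will fuse together.)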
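To see why the separator matters, here is a minimal sketch with two made-up tweets (the sample strings are assumptions for illustration, not data from the question):

```python
# Two hypothetical tweets, invented for illustration only.
raw_tweets = ["I live in the United States", "Stack Overflow is great"]

# Joining with an empty string glues the last word of one tweet
# to the first word of the next one.
fused = "".join(raw_tweets).split(" ")

# Joining with a space keeps the tweet boundary as a word boundary.
spaced = " ".join(raw_tweets).split(" ")

print(fused)   # contains the fused token 'StatesStack'
print(spaced)  # contains 'States' and 'Stack' as separate tokens
```

With the empty separator, "States" and "Stack" never reach the word counts as separate words, which skews the frequencies the cloud is built from.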
Your current code doesn't include two-word phrases such as "United States". If you want to include them, you can build them like this:
phrases = [words[i]+' '+words[i+1] for i in range(0, len(words)-1)]
And if you want to exclude phrases that occur only once:
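unique_phrases = set(phrases)
text = " ".join(words)  # build the joined text once instead of once per phrase
repeated_phrases = []
for phrase in unique_phrases:
    # keep only phrases that occur more than once
    if text.count(phrase) > 1:
        repeated_phrases.append(phrase)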
Combined, for the input:
tweets = ["I live in the states", "Stack Overflow", "United States of America", "Stack" ,"United States", "Overflow", "State", "States", "States of America"]
the output will be:
repeated_phrases = ['states of', 'united states', 'of america']
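Note that str.count matches substrings, so a phrase like "state of" would also be counted inside "estate of", and the whole text is rescanned for every phrase. A possible alternative, sketched here on the same sample input, counts adjacent word pairs with collections.Counter. The helper name repeated_bigrams and its min_count parameter are my own, and the cleaning is reduced to lowercasing and splitting:

```python
from collections import Counter

tweets = ["I live in the states", "Stack Overflow", "United States of America",
          "Stack", "United States", "Overflow", "State", "States",
          "States of America"]

# Simplified cleaning (an assumption): lowercase and split on whitespace.
words = " ".join(tweets).lower().split()

def repeated_bigrams(words, min_count=2):
    """Return the bigrams that occur at least min_count times, sorted."""
    # zip pairs each word with its successor; Counter tallies the pairs.
    counts = Counter(zip(words, words[1:]))
    return sorted(" ".join(pair) for pair, n in counts.items() if n >= min_count)

print(repeated_bigrams(words))
# ['of america', 'states of', 'united states']
```

As in the loop above, pairs are formed across tweet boundaries too, because the tweets are joined before pairing. Raising min_count plays the role of the threshold.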
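Finally, if you concatenate the words and repeated_phrases lists and generate the word cloud, the output will contain both "states" and "united states". You'll want to play with the threshold for repeated phrases: 1 is too low for real data, but it works for my short example.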
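One thing to keep in mind is that WordCloud.generate splits its input text into words again, so a phrase passed as plain text does not stay in one piece. A hedged sketch of one way around this is to count words and phrases yourself and hand the counts to WordCloud.generate_from_frequencies. The helper names build_frequencies and render, and the sample lists, are assumptions for illustration:

```python
from collections import Counter

def build_frequencies(words, phrases):
    """Count single words and multi-word phrases in one mapping."""
    freqs = Counter(words)
    freqs.update(phrases)  # each phrase is counted as a single entry
    return dict(freqs)

def render(freqs):
    # Imported lazily so the counting part runs without the wordcloud package.
    from wordcloud import WordCloud
    return WordCloud(background_color="white").generate_from_frequencies(freqs)

# Hypothetical sample data; pass every occurrence of a phrase,
# not the de-duplicated set, so the counts are meaningful.
words = ["united", "states", "of", "america", "united", "states"]
phrases = ["united states", "united states"]

freqs = build_frequencies(words, phrases)
print(freqs["states"], freqs["united states"])  # 2 2
```

Calling render(freqs) would then draw "united states" as a single entry sized by its own count.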
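Edit: the docs mention a collocations parameter, which generates bigrams for the given input. You can also pass your words as wc.generate(" ".join(words)); bigrams are generated by default, but there will still be a lot of visually small and meaningless bigrams such as "states of".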
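If you would rather compare the cloud with and without automatic bigrams, the same parameter can be toggled. This is a minimal sketch; the wrapper name make_cloud and its use_bigrams argument are hypothetical, while collocations and background_color are WordCloud constructor arguments:

```python
def make_cloud(words, use_bigrams=True):
    """Build a word cloud from a list of words, with or without bigrams."""
    # Imported lazily so the module loads even where wordcloud is not installed.
    from wordcloud import WordCloud
    wc = WordCloud(background_color="white", collocations=use_bigrams)
    # A space-separated string keeps neighbouring words adjacent for bigram detection.
    return wc.generate(" ".join(words))
```

For example, make_cloud(words, use_bigrams=False).words_ should list single words only, which makes it easier to check whether an odd pairing comes from the bigram detection.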
Comments
word cloud is formed with this combination
United State
United States
state