Asked by: Alexander  Asked: 7/23/2023  Last edited by: Alexander  Updated: 7/23/2023  Views: 53
Removing stopwords also removes spaces between words during frequency distribution
Q:
I want to remove stopwords from my text to improve my frequency distribution results.
My initial frequency distribution code looks like this:
# Determine the frequency distribution
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize

tokens = word_tokenize(review_comments)
fdist = FreqDist(tokens)
fdist
This returns:
FreqDist({"'": 521, ',': 494, "'the": 22, 'a': 16, "'of": 16, "'is": 12, "'to": 10, "'for": 9, "'it": 8, "'that": 8, ...})
I want to remove the stopwords using the following code:
# Filter out items that are neither alphabetic nor numeric (to eliminate punctuation marks, etc.).
filtered = [word for word in review_comments if word.isalnum()]
# Remove all the stopwords
# Download the stopword list.
nltk.download('stopwords')
from nltk.corpus import stopwords
# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))
# Create a filtered list of tokens without stopwords.
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]
# Define an empty string variable.
filtered2_string = ''
for value in filtered:
    # Add each filtered token word to the string.
    filtered2_string = filtered2_string + value + ''
Now I run fdist again:
from nltk.tokenize import word_tokenize

trial = nltk.word_tokenize(filtered2_string)
fdist1 = FreqDist(trial)
fdist1
This returns:
FreqDist({'whenitcomestoadmsscreenthespaceonthescreenitselfisatanabsolutepremiumthefactthat50ofthisspaceiswastedonartandnotterriblyinformativeorneededartaswellmakesitcompletelyuselesstheonlyreasonthatigaveit2starsandnot1wasthattechnicallyspeakingitcanatleaststillstanduptoblockyournotesanddicerollsotherthanthatitdropstheballcompletelyanopenlettertogaleforce9yourunpaintedminiaturesareverynotbadyourspellcardsaregreatyourboardgamesaremehyourdmscreenshoweverarefreakingterribleimstillwaitingforasinglescreenthatuntainted': 1})
For reference, review_comments is built by concatenating the comments from my dataframe:
review_comments = ''
for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]
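One thing worth noting about this step: concatenating with no separator can also fuse the last word of one comment with the first word of the next. A minimal sketch of a space-joined alternative, assuming (hypothetically) that newdf['tokens1'] holds one comment string per row:
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
newdf = pd.DataFrame({'tokens1': ["great product", "would buy again"]})

# Join the comments with a space so adjacent comments don't run together.
review_comments = ' '.join(newdf['tokens1'][i] for i in range(newdf.shape[0]))
print(review_comments)  # great product would buy again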
I removed the stopwords and reran the frequency distribution, hoping to get the most frequent words. How do I get the stopword removal to not delete the spaces, so that the words are counted individually?
A:
Cleaning in NLP tasks is usually performed on tokens rather than characters, to take advantage of built-in functions/methods. (You can also do it from scratch with your own character-level logic if you need to.) The stopwords in nltk come in the form of tokens, meant for cleaning a tokenized text corpus, and you can add any further tokens you need removed to the list. For example, if you need to remove English stopwords plus punctuation, do the following:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

tokens = word_tokenize(review_comments)

## Add any additional punctuation/words you want to eliminate here, like below
english_stop_plus_punct = set(stopwords.words('english') + ["call"] +
                              list(string.punctuation + "“”’"))

filtered2 = [x for x in tokens if x.lower() not in english_stop_plus_punct]
fdist1 = nltk.FreqDist(filtered2)
fdist1
#### FreqDist({'presence': 3, 'meaning': 2, 'might': 2, 'Many': 1, 'psychologists': 1, 'knowing': 1, 'life': 1, 'drive': 1, 'look': 1, ...})
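If the end goal is the most frequent words, note that FreqDist is a subclass of collections.Counter, so it exposes most_common directly. A minimal self-contained sketch, with a made-up token list standing in for the filtered2 above:
import nltk

# Made-up token list standing in for the cleaned `filtered2` above.
tokens = ["presence", "meaning", "presence", "might", "presence", "meaning"]

fdist = nltk.FreqDist(tokens)
print(fdist.most_common(2))  # [('presence', 3), ('meaning', 2)]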
Sample text, from an article on "the meaning of life":
review_comments = """ Many psychologists call knowing your life’s meaning “presence,” and the drive to look for it “search.” They are not mutually exclusive: You might or might not search, whether you already have a sense of meaning or not. Some people low in presence don’t bother searching—they are “stuck.” Some are high in presence but keep searching—we can call them “seekers.” """
Your code is tokenizing characters rather than words. Here is updated code with sample input data:
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

nltk.download('stopwords')

review_comments = "the quick brown fox jumps over the lazy dog !"
tokens = word_tokenize(review_comments)
print("tokens are",tokens)
word_freq = Counter(tokens)
print("freq",word_freq)
filtered = [word for word in tokens if word.isalnum()]
print("after alnum removal",filtered)
english_stopwords = set(stopwords.words('english'))
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]
print("after stopwords removal",filtered2)
filtered2_string = ' '.join(filtered2)
print(filtered2_string)
tokens = word_tokenize(filtered2_string)
print("tokens are",tokens)
word_freq = Counter(tokens)
print("freq",word_freq)
Output:
tokens are ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']
freq Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1, '!': 1})
after alnum removal ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
after stopwords removal ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
quick brown fox jumps lazy dog
tokens are ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
freq Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1})
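The underlying Python behavior, in a minimal sketch (plain strings, no nltk needed): iterating over a string visits individual characters, and isalnum() then discards the spaces, which is exactly what glued the original output together:
text = "the quick brown fox"

# Iterating over a string yields single characters, so isalnum()
# silently drops every space and joins the words together.
chars = [c for c in text if c.isalnum()]
print(''.join(chars))   # thequickbrownfox

# Splitting into words first (here just split()) keeps whole words,
# so filtering leaves the word boundaries intact.
words = [w for w in text.split() if w.isalnum()]
print(' '.join(words))  # the quick brown fox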