Removing stopwords also removes spaces between words during frequency distribution

Asked by: Alexander · Asked: 7/23/2023 · Last edited by: Alexander · Updated: 7/23/2023 · Views: 53

Q:

I want to remove stopwords from my text to improve my frequency distribution results.

My initial frequency distribution code looks like this:

# Determine the frequency distribution
import nltk
from nltk.tokenize import word_tokenize
from nltk import FreqDist

tokens = nltk.word_tokenize(review_comments)
fdist = FreqDist(tokens)
fdist

This returns:

FreqDist({"'": 521, ',': 494, "'the": 22, 'a': 16, "'of": 16, "'is": 12, "'to": 10, "'for": 9, "'it": 8, "'that": 8, ...})

I want to remove the stopwords using the following code:

# Filter out tokens that are not alphanumeric (to eliminate punctuation marks, etc.).
filtered = [word for word in review_comments if word.isalnum()]

# Remove all the stopwords
# Download the stopword list.
nltk.download('stopwords')
from nltk.corpus import stopwords

# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))

# Create a filtered list of tokens without stopwords.
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]

# Define an empty string variable.
filtered2_string = ''

for value in filtered:
    # Add each filtered token word to the string.
    filtered2_string = filtered2_string + value + ''
    

Now I run fdist again:

from nltk.tokenize import word_tokenize
trial= nltk.word_tokenize(filtered2_string)
fdist1 = FreqDist(trial)
fdist1

This returns:

FreqDist({'whenitcomestoadmsscreenthespaceonthescreenitselfisatanabsolutepremiumthefactthat50ofthisspaceiswastedonartandnotterriblyinformativeorneededartaswellmakesitcompletelyuselesstheonlyreasonthatigaveit2starsandnot1wasthattechnicallyspeakingitcanatleaststillstanduptoblockyournotesanddicerollsotherthanthatitdropstheballcompletelyanopenlettertogaleforce9yourunpaintedminiaturesareverynotbadyourspellcardsaregreatyourboardgamesaremehyourdmscreenshoweverarefreakingterribleimstillwaitingforasinglescreenthatuntainted': 1})

For reference, review_comments is built like this:

review_comments = ''
for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]


How do I remove the stopwords without removing the spaces, so that the words are counted individually?




I removed the stopwords and reran the frequency distribution, hoping to get the most frequent words.
python nltk token frequency stopwords

Comments

0 votes — yashaswi k 7/23/2023
Could you provide the review_comments data so the output can be reproduced?
0 votes — Alexander 7/23/2023
I have edited the question to provide review_comments.
0 votes — yashaswi k 7/23/2023
Please see the updated code for reference.

A:

2 votes — Mankind_2000 7/23/2023 #1

Cleaning in NLP tasks is usually performed on tokens rather than on the raw characters of a string, so that you can take advantage of built-in functions/methods. However, you can also do it from scratch with your own character-level logic if needed. The stopwords in nltk come as tokens and are meant for cleaning a text corpus; you can add any further tokens you want removed to the list. For example, if you need to remove English stopwords as well as punctuation, do the following:

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

tokens = word_tokenize(review_comments)

## Add any additional punctuations/ words you want to eliminate here, like below
english_stop_plus_punct = set(stopwords.words('english') + ["call"] + 
                          list(string.punctuation + "“”’"))

filtered2 = [x for x in tokens if x.lower() not in english_stop_plus_punct]

fdist1 = nltk.FreqDist(filtered2)
fdist1

#### FreqDist({'presence': 3, 'meaning': 2, 'might': 2, 'Many': 1, 'psychologists': 1, 'knowing': 1, 'life': 1, 'drive': 1, 'look': 1, ...})

Example text, taken from an article about “the meaning of life”:

review_comments = """ Many psychologists call knowing your life’s meaning “presence,” and the drive to look for it “search.” They are not mutually exclusive: You might or might not search, whether you already have a sense of meaning or not. Some people low in presence don’t bother searching—they are “stuck.” Some are high in presence but keep searching—we can call them “seekers.” """
0 votes — yashaswi k 7/23/2023 #2

Your code is tokenizing characters instead of words. Here is updated code with sample input data:

import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

review_comments="the quick brown fox jumps over the lazy dog !"

tokens = word_tokenize(review_comments)
print("tokens are",tokens)

word_freq = Counter(tokens)
print("freq",word_freq)

filtered = [word for word in tokens if word.isalnum()]
print("after alnum removal",filtered)

english_stopwords = set(stopwords.words('english'))

filtered2 = [x for x in filtered if x.lower() not in english_stopwords]
print("after stopwords removal",filtered2)

filtered2_string = ' '.join(filtered2)

print(filtered2_string)

tokens = word_tokenize(filtered2_string)
print("tokens are",tokens)

word_freq = Counter(tokens)
print("freq",word_freq)

Output:

tokens are ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']
freq Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'lazy': 1, 'dog': 1, '!': 1})
after alnum removal ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
after stopwords removal ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
quick brown fox jumps lazy dog
tokens are ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
freq Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1})
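
As a side note, the join-and-retokenize round trip above mainly mirrors the flow of the question; since filtered2 is already a clean list of word tokens, it can be counted directly. A minimal sketch:

# filtered2 is already a list of word tokens, so it can be counted without re-tokenizing.
word_freq = Counter(filtered2)
print("freq", word_freq)
# freq Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1})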