How to tokenize the list without getting extra spaces and commas (Python)

Asked by Popeye on 6/26/2023 · Last edited by Popeye · Updated 6/27/2023 · Views: 86

Q:

import pandas as pd

df = pd.DataFrame({'id' : ['a','b','c','d','e'],
                   'title' : ['amd ryzen 7 5800x cpu processor', 'amd ryzen 8 5200x cpu processor','amd ryzen 5 2400x cpu processor',
                              'amd ryzen computer accessories for processor','amd ryzen cpu processor for gamers'],
                   'gen_key' : ['amd, ryzen, processor, cpu, gamer','amd, ryzen, processor, cpu, gamer','amd, ryzen, processor, cpu, gamer',
                                'amd, ryzen, processor, cpu, gamer','amd, ryzen, processor, cpu, gamer'],
                   'elas_key' : ['ryzen-7, best processor for processing, sale now for christmas gift',
                                 'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                                 'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                                 'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                                 'processor, RYZEN, gamers best world, available on sale']})

So this is my DataFrame. I am trying to preprocess it so that the final 'elas_key' becomes a lowercase set of keywords with no stopwords, no specific punctuation, no certain claim words, no plural nouns, no duplicates of 'gen_key' and 'title', and no organisation names that are absent from the title. I have handled some of these steps, but I am somewhat stuck on tokenization: I get extra spaces and commas when tokenizing the list:

import re

from nltk.corpus import stopwords

def lower_case(new_keys):
    # lowercase every phrase in the list
    lower = [w.lower() for w in new_keys]
    return lower

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
    # drop stopwords/claim words inside each phrase
    stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
    return stop_removed

def remove_specific_punkt(new_keys):
    # strip ';', ':' and '-', and discard phrases that end up empty
    punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
    return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df
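For reference, all three helpers expect a list of phrases rather than a raw string, so the column has presumably been split on ', ' at some earlier step; a minimal sketch of that assumed ordering leading up to the remove_specific_punkt call above (the split step itself is not shown in the post):

# assumed preceding steps; the initial split on ', ' does not appear in the snippet above
df['elas_key'] = df['elas_key'].str.split(', ')   # string -> list of phrases
df['elas_key'] = df['elas_key'].apply(lower_case)
df['elas_key'] = df['elas_key'].apply(stopwords_removal)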

After removing the punctuation, I get the following table (named List1): [screenshot of the List1 table]

But when I run the tokenizing script, I get a list of lists with extra commas and spaces added. I tried strip() and replace() to remove them, but they did not give me the expected result:

from nltk import word_tokenize

def word_tokenizing(new_keys):
    # tokenize each phrase; word_tokenize keeps punctuation as separate tokens
    tokenized_words = [word_tokenize(i) for i in new_keys]
    return tokenized_words

df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df

The resulting table is as follows (named List2): [screenshot of the List2 table]
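The stray commas come from word_tokenize itself: the tokenizer emits punctuation as standalone tokens, so any comma still embedded in a phrase shows up as its own ',' token. A quick illustration (assuming NLTK's punkt data is available):

from nltk import word_tokenize

# punctuation becomes its own token
print(word_tokenize('processor, ryzen, gamers best world'))
# ['processor', ',', 'ryzen', ',', 'gamers', 'best', 'world']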

Can someone help me with this? Also, after removing the stopwords I get some rows like this:

[processor, ryzen, gamers world,]

实际列表是:

[processor, ryzen, gamers best world, available on sale]

Words like 'available', 'on', 'sale' are either stopwords or other_claims words, and even though those words are removed, I still end up with an extra ',' at the end.
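One possible fix, sketched here rather than taken from the post, is to filter out tokens that are pure punctuation or whitespace after tokenizing:

# hypothetical post-processing step, not part of the original code
tokens = ['processor', ',', 'ryzen', ',', 'gamers', 'world', ',']
cleaned = [t for t in tokens if t.strip() and t != ',']
print(cleaned)  # ['processor', 'ryzen', 'gamers', 'world']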

After removing the stopwords, punctuation and other_claims words, my expected output should look like this:

[[ryzen,7, processor,processing]]
[[ryzen,8, gamer, processor,processing]]
[[ryzen,5, processor,gamers, amd]]
[[ryzen,accessories, gamers, headsets, pro,players]]
[[processor, ryzen, gamers,world]]

So just as ryzen-7 was one word, it becomes ryzen, 7. I would be able to do this if the keywords were on multiple rows, for example:

[ryzen, 7]
[processor, processing]
[gamers, world]

That way it would be easier for me to pos_tag them.
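For context, this is what a minimal pos_tag call on such a row would look like (assuming NLTK's tagger data is downloaded; the exact tags depend on the model):

from nltk import pos_tag

# '7' is reliably tagged CD (cardinal number); 'ryzen' gets a noun-like tag
print(pos_tag(['ryzen', '7']))
# e.g. [('ryzen', 'NN'), ('7', 'CD')]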

Apologies if the question is too confusing; I am still somewhat in the learning phase.

python pandas list tokenize stop-words

Comments

1 upvote · mozway · 6/26/2023
It is not clear what you want. Assuming 'ryzen-7, best processor for processing, sale now for christmas gift' as input, what do you expect as output?
0 upvotes · Popeye · 6/26/2023
@user19077881 I do not want to remove the commas, since the normal commas should be there, but I do not want to get extra commas in the output.
0 upvotes · Popeye · 6/26/2023
@mozway Sorry, I have edited my question.

A:

0 upvotes · Timus · 6/27/2023 · #1

You could try the following:

from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    # return the filtered word list, or None (-> NaN) if nothing survives
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    # flatten each phrase into individual tokens
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()
    .str.split(', ').explode()                             # one phrase per row, original index kept
    .str.replace(r'[;:-]', r' ', regex=True).str.strip()   # punctuation -> space, trim edges
    .str.split().map(remove_stops).dropna().str.join(' ')  # drop stopwords and empty phrases
    .groupby(level=0).agg(list)                            # recombine phrases per original row
    .map(word_tokenizing)
)
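The key trick in this chain is that explode keeps the original row index, so groupby(level=0) can reassemble the phrases per row afterwards. A small standalone illustration of that round trip (hypothetical two-row series):

import pandas as pd

s = pd.Series(['ryzen-7, best processor', 'processor, RYZEN'])
exploded = s.str.split(', ').explode()   # index 0 appears twice, index 1 twice
print(exploded.groupby(level=0).agg(list))
# 0    [ryzen-7, best processor]
# 1          [processor, RYZEN]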

The result for the sample DataFrame (the elas_key column only):

                                                    elas_key  
0         [ryzen, 7, processor, processing, christmas, gift]  
1  [ryzen, 8, gamer, processor, processing, christmas, gift]  
2        [ryzen, 5, processor, gamers, christmas, gift, amd]  
3       [ryzen, accessories, gamers, headsets, pro, players]  
4                          [processor, ryzen, gamers, world]