Asked by: Popeye  Asked: 6/26/2023  Last edited by: Popeye  Updated: 6/27/2023  Views: 86
How to tokenize the list without getting extra spaces and commas (Python)
Q:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'title': ['amd ryzen 7 5800x cpu processor', 'amd ryzen 8 5200x cpu processor',
                             'amd ryzen 5 2400x cpu processor', 'amd ryzen computer accessories for processor',
                             'amd ryzen cpu processor for gamers'],
                   'gen_key': ['amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer'],
                   'elas_key': ['ryzen-7, best processor for processing, sale now for christmas gift',
                                'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                                'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                                'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                                'processor, RYZEN, gamers best world, available on sale']})
So this is my DataFrame. I am trying to preprocess it so that the final 'elas_key' ends up as a lowercased set with no stopwords, no specific punctuation marks, no certain claim words, no plural nouns, no duplicates of 'gen_key' and 'title', and no organization names that are not in the title. I have handled some of this already, but I am a bit stuck on tokenization: when tokenizing the list I get extra spaces and commas:
import re
from nltk.corpus import stopwords

def lower_case(new_keys):
    lower = list(w.lower() for w in new_keys)
    return lower

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
    stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
    return stop_removed

def remove_specific_punkt(new_keys):
    punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
    return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df
But when I run the tokenizing script, I get a list of lists with commas and spaces added. I tried strip() and replace() to remove them, but neither gave me the expected result:
from nltk import word_tokenize

def word_tokenizing(new_keys):
    tokenized_words = [word_tokenize(i) for i in new_keys]
    return tokenized_words

df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df
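The stray ',' tokens come from how word_tokenize works: punctuation is emitted as its own token, so any comma still inside a string becomes a separate ',' entry. A minimal sketch of one way around it, splitting on the comma delimiter first and then on whitespace (plain str.split is used here instead of nltk so the sketch is self-contained; this is an illustration, not the asker's code):

```python
# Sketch: split a comma-separated keyword string into per-phrase word
# lists. Because the comma is consumed by split(','), no ',' token can
# survive into the output, unlike word_tokenize on the raw string.
def tokenize_phrases(text):
    """Split a comma-separated string into per-phrase word lists."""
    phrases = [p.strip() for p in text.split(',') if p.strip()]
    return [phrase.split() for phrase in phrases]

row = 'processor, RYZEN, gamers best world, available on sale'
print(tokenize_phrases(row.lower()))
# Each phrase becomes a clean word list with no ',' tokens.
```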
Can someone help me with this? Also, after removing the stopwords I get some rows like this:
[processor, ryzen, gamers world,]
whereas the actual list is:
[processor, ryzen, gamers best world, available on sale]
Words like 'available', 'on', 'sale' are either stopwords or other_claims words, and even though those words get removed, I still end up with an extra ',' at the end.
After removing the stopwords, punctuation and other_claims, my expected output should look like this:
[[ryzen,7, processor,processing]]
[[ryzen,8, gamer, processor,processing]]
[[ryzen,5, processor,gamers, amd]]
[[ryzen,accessories, gamers, headsets, pro,players]]
[[processor, ryzen, gamers,world]]
For instance, ryzen-7 is one word and it should become ryzen, 7. I was able to do this when the keywords were on multiple rows, e.g.:
[ryzen, 7]
[processor, processing]
[gamers, world]
That would make it easier for me to pos_tag them.
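Splitting a hyphenated keyword like 'ryzen-7' into separate tokens can be sketched with a single re.split on hyphens and whitespace (one possible approach, not the only one):

```python
import re

# Sketch: split a keyword phrase on hyphens and runs of whitespace, so a
# hyphenated term like 'ryzen-7' yields two tokens. Empty strings from
# leading/trailing separators are filtered out.
def split_keyword(phrase):
    return [t for t in re.split(r'[-\s]+', phrase) if t]

print(split_keyword('ryzen-7'))            # ['ryzen', '7']
print(split_keyword('gamers best world'))  # ['gamers', 'best', 'world']
```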
Apologies if the question is too confusing; I am still at the learning stage.
A:
0 votes
Timus
6/27/2023
#1
You could try the following:
from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()
    .str.split(', ').explode()
    .str.replace(r'[;:-]', r' ', regex=True).str.strip()
    .str.split().map(remove_stops).dropna().str.join(' ')
    .groupby(level=0).agg(list)
    .map(word_tokenizing)
)
Result for the sample DataFrame (elas_key column only):
elas_key
0 [ryzen, 7, processor, processing, christmas, gift]
1 [ryzen, 8, gamer, processor, processing, christmas, gift]
2 [ryzen, 5, processor, gamers, christmas, gift, amd]
3 [ryzen, accessories, gamers, headsets, pro, players]
4 [processor, ryzen, gamers, world]
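The heart of the chain above is the explode / groupby(level=0) round trip: .str.split(', ').explode() turns each comma-separated phrase into its own row while keeping the original row index, the per-phrase cleaning runs on those rows, and .groupby(level=0).agg(list) folds the surviving phrases back into one list per original row. A toy illustration of just that round trip (made-up data, cleaning steps omitted):

```python
import pandas as pd

# Toy illustration of the explode / groupby(level=0) round trip.
s = pd.Series(['amd, ryzen-7', 'cpu'])

# One phrase per row; the original row index (0, 0, 1) is preserved.
exploded = s.str.split(', ').explode()

# Fold back: group on the preserved index, one list per original row.
regrouped = exploded.groupby(level=0).agg(list)
print(regrouped.tolist())  # [['amd', 'ryzen-7'], ['cpu']]
```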
Comments
'ryzen-7, best processor for processing, sale now for christmas gift'