Asked by: Popeye  Asked: 6/26/2023  Last edited by: Popeye  Updated: 6/27/2023  Views: 86
How to tokenize the list without getting extra spaces and commas (Python)
Q:
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'title': ['amd ryzen 7 5800x cpu processor', 'amd ryzen 8 5200x cpu processor',
                             'amd ryzen 5 2400x cpu processor', 'amd ryzen computer accessories for processor',
                             'amd ryzen cpu processor for gamers'],
                   'gen_key': ['amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer'],
                   'elas_key': ['ryzen-7, best processor for processing, sale now for christmas gift',
                                'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                                'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                                'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                                'processor, RYZEN, gamers best world, available on sale']})
So this is my DataFrame. I am trying to preprocess it so that the final 'elas_key' ends up as a lowercased set with no stopwords, no specific punctuation marks, no certain claim words, no plural nouns, no duplicates of 'gen_key' and 'title', and no organization names that are not in the title. I have handled some of this already, but I am a bit stuck on tokenization: when tokenizing the list I get extra spaces and commas:
import re
from nltk.corpus import stopwords

def lower_case(new_keys):
    lower = list(w.lower() for w in new_keys)
    return lower

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
    stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
    return stop_removed

def remove_specific_punkt(new_keys):
    punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
    return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df
But when I run the tokenizing script, I get a list of lists with commas and spaces added. I tried strip() and replace() to remove them, but neither gave me the expected result:
from nltk import word_tokenize

def word_tokenizing(new_keys):
    tokenized_words = [word_tokenize(i) for i in new_keys]
    return tokenized_words

df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df
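The stray ',' tokens come from how word_tokenize works: punctuation is emitted as its own token, so any comma still inside a string becomes a separate ',' entry. A minimal sketch of one way around it, splitting on the comma delimiter first and then on whitespace (plain str.split is used here instead of nltk so the sketch is self-contained; this is an illustration, not the asker's code):

```python
# Sketch: split a comma-separated keyword string into per-phrase word
# lists. Because the comma is consumed by split(','), no ',' token can
# survive into the output, unlike word_tokenize on the raw string.
def tokenize_phrases(text):
    """Split a comma-separated string into per-phrase word lists."""
    phrases = [p.strip() for p in text.split(',') if p.strip()]
    return [phrase.split() for phrase in phrases]

row = 'processor, RYZEN, gamers best world, available on sale'
print(tokenize_phrases(row.lower()))
# Each phrase becomes a clean word list with no ',' tokens.
```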
Can someone help me with this? Also, after removing the stopwords I get some rows like this:
[processor, ryzen, gamers world,]
whereas the actual list is:
[processor, ryzen, gamers best world, available on sale]
Words like 'available', 'on', 'sale' are either stopwords or other_claims words, and even though those words get removed, I still end up with an extra ',' at the end.
After removing the stopwords, punctuation and other_claims, my expected output should look like this:
[[ryzen,7, processor,processing]]
[[ryzen,8, gamer, processor,processing]]
[[ryzen,5, processor,gamers, amd]]
[[ryzen,accessories, gamers, headsets, pro,players]]
[[processor, ryzen, gamers,world]]
For instance, ryzen-7 is one word and it should become ryzen, 7. I was able to do this when the keywords were on multiple rows, e.g.:
[ryzen, 7]
[processor, processing]
[gamers, world]
That would make it easier for me to pos_tag them.
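Splitting a hyphenated keyword like 'ryzen-7' into separate tokens can be sketched with a single re.split on hyphens and whitespace (one possible approach, not the only one):

```python
import re

# Sketch: split a keyword phrase on hyphens and runs of whitespace, so a
# hyphenated term like 'ryzen-7' yields two tokens. Empty strings from
# leading/trailing separators are filtered out.
def split_keyword(phrase):
    return [t for t in re.split(r'[-\s]+', phrase) if t]

print(split_keyword('ryzen-7'))            # ['ryzen', '7']
print(split_keyword('gamers best world'))  # ['gamers', 'best', 'world']
```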
Apologies if the question is too confusing; I am still at the learning stage.
A:
0 votes
Timus
6/27/2023
#1
You could try the following:
from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()
    .str.split(', ').explode()
    .str.replace(r'[;:-]', r' ', regex=True).str.strip()
    .str.split().map(remove_stops).dropna().str.join(' ')
    .groupby(level=0).agg(list)
    .map(word_tokenizing)
)
Result for the sample DataFrame (elas_key column only):
elas_key
0 [ryzen, 7, processor, processing, christmas, gift]
1 [ryzen, 8, gamer, processor, processing, christmas, gift]
2 [ryzen, 5, processor, gamers, christmas, gift, amd]
3 [ryzen, accessories, gamers, headsets, pro, players]
4 [processor, ryzen, gamers, world]
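The heart of the chain above is the explode / groupby(level=0) round trip: .str.split(', ').explode() turns each comma-separated phrase into its own row while keeping the original row index, the per-phrase cleaning runs on those rows, and .groupby(level=0).agg(list) folds the surviving phrases back into one list per original row. A toy illustration of just that round trip (made-up data, cleaning steps omitted):

```python
import pandas as pd

# Toy illustration of the explode / groupby(level=0) round trip.
s = pd.Series(['amd, ryzen-7', 'cpu'])

# One phrase per row; the original row index (0, 0, 1) is preserved.
exploded = s.str.split(', ').explode()

# Fold back: group on the preserved index, one list per original row.
regrouped = exploded.groupby(level=0).agg(list)
print(regrouped.tolist())  # [['amd', 'ryzen-7'], ['cpu']]
```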
Comments
'ryzen-7, best processor for processing, sale now for christmas gift'