How to speed up pandas apply function with for loops? (Python) - 解网

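Question:

I am processing text for my project. I want to replace words in my dataset's text by looking each word up in a dictionary. My dictionary looks like this: replacement_dict = {'t1' : 'tebir', 't2':'teki', 'number':'no', ...}

Example text in the dataframe: "Hello my t1 is not okey, please help my number is bla bla" should become "Hello my tebir is not okey, please help my no is bla bla"

I wrote the following code: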

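import re
import pandas

def replacament(row, replacement_dict):
    # Lower-case the row's text, then replace each key as a whole word.
    text = row['text']
    text = text.lower()
    for i, j in replacement_dict.items():
        text = re.sub(r"\b%s\b" % i, j, text)
    return text

# Row-wise apply: one re.sub call per dictionary entry for every row.
data['text2'] = data.apply(replacament, axis=1, args=(replacement_dict,))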

But it takes 8 hours to finish. My dataset has 600,000 rows. How can I speed up this apply function? Thanks

python replace apply

import re
import swifter  # registers the .swifter accessor on pandas objects
import pandas as pd
import time

# Synthetic data: the same sentence repeated 600000 times.
data = pd.DataFrame({'text': ["t2 my t1 is not okey, number please help my  is bla bla"] * 600000})
replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

start_time = time.time()
def replace_text(text):
    # str.replace substitutes substrings, not whole words (no \b check).
    for k, v in replacement_dict.items():
        text = text.replace(k, v)
    return text

# Apply to the column only, instead of row by row over the whole frame.
data['text2_parallel'] = data['text'].swifter.apply(replace_text)
parallel_time = time.time() - start_time
print(f"Parallelized time: {parallel_time:.2f} seconds")
print(data[['text2_parallel']])

Output:

Pandas Apply: 100%|████████████████████████████████████████████████████████| 600000/600000 [00:01<00:00, 458107.50it/s]
Parallelized time: 1.43 seconds
                                           text2_parallel
0       teki my tebir is not okey, no please help my  ...
1       teki my tebir is not okey, no please help my  ...
2       teki my tebir is not okey, no please help my  ...
3       teki my tebir is not okey, no please help my  ...
4       teki my tebir is not okey, no please help my  ...
...                                                   ...
599995  teki my tebir is not okey, no please help my  ...
599996  teki my tebir is not okey, no please help my  ...
599997  teki my tebir is not okey, no please help my  ...
599998  teki my tebir is not okey, no please help my  ...
599999  teki my tebir is not okey, no please help my  ...

[600000 rows x 1 columns]
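
Note that `str.replace` in the code above substitutes substrings, so it would also rewrite `t1` inside a longer token such as `t10`, unlike the `\b` pattern in the question. A minimal sketch of one way to keep whole-word matching while scanning each text only once is to compile all keys into a single alternation and look up the replacement in a callback. The names `PATTERN` and `replace_words` are hypothetical, and the dictionary is just the three-entry sample from the question.

```python
import re

replacement_dict = {'t1': 'tebir', 't2': 'teki', 'number': 'no'}

# Longest keys first so that a longer key wins over a key that is its prefix.
keys = sorted(replacement_dict, key=len, reverse=True)

# One compiled pattern with every key escaped and wrapped in word boundaries.
PATTERN = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(k) for k in keys))


def replace_words(text):
    """Lower-case the text and replace whole-word dictionary keys in one pass."""
    return PATTERN.sub(lambda m: replacement_dict[m.group(0)], text.lower())


print(replace_words("Hello my t1 is not okey, please help my number is bla bla"))
print(replace_words("t10 is not a key"))
```

With pandas this could be applied as `data['text2'] = data['text'].map(replace_words)`. Unlike the loop in the question, a replacement value is never itself replaced by a later key, which may or may not be what you want.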

1 upvote Melcore 11/14/2023 #3

Use the pandas.DataFrame.replace function.

Use it like this, with the regex parameter set to True:

import pandas

test = pandas.DataFrame({"name": ["Good morning England", "Land and Freedom", "This is England"]})
replace_dict = {"England" : "French", "morning": "night"}

test["name"] = test["name"].replace(replace_dict, regex=True)
print(test)

The test dataframe is now:

                name
0  Good night French
1   Land and Freedom
2     This is French
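
With `regex=True` the dictionary keys are treated as regular expressions and match anywhere in the string, so a short key such as `and` would also change `Land`. If whole-word matching is wanted, as in the question, one option is to wrap each key in `\b` before passing the dictionary to `replace`. The sketch below only builds and checks such a dictionary with the standard `re` module; the names `bounded_dict` and `apply_all` and the extra key `and` are hypothetical additions for illustration.

```python
import re

replace_dict = {"England": "French", "morning": "night", "and": "or"}

# Escape each key and anchor it at word boundaries.
bounded_dict = {r"\b%s\b" % re.escape(k): v for k, v in replace_dict.items()}


def apply_all(text, mapping):
    """Apply every pattern in turn, in the order the dictionary lists them."""
    for pattern, repl in mapping.items():
        text = re.sub(pattern, repl, text)
    return text


print(apply_all("Land and Freedom", replace_dict))   # plain keys also hit "Land"
print(apply_all("Land and Freedom", bounded_dict))   # only the word "and" changes
```

The same dictionary should then be usable as `test["name"].replace(bounded_dict, regex=True)`.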
