Encoding columns in Pandas Dataframe

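Question:

I have some text columns similar to the example below. I have already tried using .decode with UTF-8, but it doesn't work. I also tried using chardet to check whether the column is UTF-8, and it confirms that it is. So far, though, nothing has converted the text. Do you know of any way to fix this? I'll leave my code below as well.

Example of the text in the column: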

SECRETARIA DE FINANÃÂÃÂÃÂÃÂAS
RECURSOS ORDINÃÂÃÂÃÂÃÂRIOS
IMPOSTOS, TAXAS E CONTRIBUIÃÂÃÂÃÂÃÂÃÂ...
INDENIZAÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂES

My code:

Detecting encoding

import chardet

column = 'orgao_nome'

def detect_encoding(cell):
    result = chardet.detect(cell.encode())
    return result['encoding']

df['detect_encoding'] = df[column].apply(detect_encoding)

Decoding

column = 'orgao_nome'

target_encoding = 'utf-8'

df[column] = df[column].str.encode(target_encoding, errors='ignore').str.decode(target_encoding)

python pandas dataframe encoding utf-8

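ÃÂÃÂÃÂÃÂAS looks a lot like UTF8 bytes that were decoded using Latin1 instead of UTF8. Characters outside the US-ASCII range are encoded using two or more bytes (two for accented Latin letters such as Ç). If those bytes are decoded as Latin1, the first byte becomes Ã. How was this data loaded? That is where the error needs to be fixed.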
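The mechanism described in this comment can be reproduced with the standard library alone. The following is a minimal sketch, assuming the original word was FINANÇAS (a guess based on the sample, not confirmed by the source); the variable names are illustrative.

```python
# Hypothetical original value, assumed for illustration only.
original = "FINANÇAS"

# UTF-8 encodes "Ç" (U+00C7) as the two bytes 0xC3 0x87.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one character:
# 0xC3 becomes "Ã" and 0x87 becomes an invisible C1 control character.
garbled = utf8_bytes.decode("latin-1")

print(repr(garbled))  # 'FINANÃ\x87AS'
```

The garbled string is one character longer than the original, and the second character of the pair is usually invisible when displayed, which may be why only Ã and Â show up in the samples.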

0 upvotes Fabio dos Santos 9/8/2023

I'm using pd.read_csv('recife-dados-receitas-2021.csv', sep=';', encoding='utf8')

0 upvotes Fabio dos Santos 9/8/2023

I realized that the source file already comes this way.

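0 upvotes Panagiotis Kanavos 9/8/2023

That means whatever generated the text used the wrong code page twice - once to convert UTF8 to Latin1 characters, producing ÃÂ, and then the corrupted text was stored as UTF8 again, resulting in 2 bytes for each character.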
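If the damage really is repeated "UTF8 bytes decoded as Latin1", it can in principle be undone by reversing each round. The sketch below assumes that no bytes were lost along the way and that the wrong code page was true Latin-1 (ISO-8859-1) rather than Windows-1252; `undo_mojibake` and `max_rounds` are hypothetical names, not part of pandas or any library, and FINANÇAS is an assumed sample value.

```python
def undo_mojibake(text: str, max_rounds: int = 5) -> str:
    """Reverse repeated 'UTF-8 bytes decoded as Latin-1' rounds."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no longer reversible this way; keep what we have
        if repaired == text:
            break  # pure ASCII: nothing left to undo
        text = repaired
    return text


# Simulate the double corruption described above on an assumed value.
damaged = "FINANÇAS"
for _ in range(2):
    damaged = damaged.encode("utf-8").decode("latin-1")

print(undo_mojibake(damaged))  # FINANÇAS
```

For the DataFrame in the question, something like df[column].map(undo_mojibake) might work on non-null string cells. If the invisible control characters were stripped somewhere, for example by errors='ignore', the round trip fails and the text is returned as far as it could be repaired. Text that was legitimately meant to contain sequences such as Ã© could also be over-corrected, so the result is worth spot-checking.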

Answers: none yet
