Ambivalent encoding of BeautifulSoup object constructed from website with duplicate meta header. How do I make sure the encoding is not mixed up?

Question:

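I have scraped data from a website using the BeautifulSoup module. I know from the meta header that the source encoding of this document is "iso-8859-1". I also know that BeautifulSoup automatically transcodes to "UTF-8" when creating the BeautifulSoup object.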

import requests
from bs4 import BeautifulSoup

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.content, 'lxml')

print(soup_data.prettify())

Unfortunately, the website contains a duplicated meta element.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

When inspecting the BeautifulSoup object with prettify, I realized that BeautifulSoup converted only one of the meta tags.

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>

Therefore, I am confused about what the actual encoding of my BeautifulSoup object is.

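Furthermore, during data processing I noticed that my PyCharm console does not display some text elements of this object correctly. These strings contain "iso-8859-1" characters. I therefore suspect that the object is either still ISO-encoded or, worse, somehow mixed up.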

['\xa0\xa0\xa0\xa0M. le président.' '\xa0\xa0\xa0\xa0M. le président.'

I first saw these ISO characters after running a numpy function.

series = np.apply_along_axis(lambda x: x[0].get_text(), 0, [df])

Any suggestions on how to get out of this situation? I would like to convert the object to UTF-8 (and be 100% sure that it is entirely UTF-8).

Tags: python html beautifulsoup encoding utf-8

Answer:

import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r = requests.get(url)

encoding = EncodingDetector.find_declared_encoding(r.content, is_html=True)
soup_data = BeautifulSoup(r.content, "lxml", from_encoding=encoding)

print(soup_data.prettify())
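
A note on the `\xa0` sequences from the question, as a minimal standard-library sketch: U+00A0 is the no-break space, an ordinary Unicode character, and `repr()` (which lists and NumPy arrays use when printing their elements) shows it as an escape. Text returned by `get_text()` is a Python `str`, which has no byte encoding of its own; UTF-8 only comes into play when the string is encoded to bytes. The sample string and the helper name `normalize_spaces` are illustrative assumptions, not part of the original code.

```python
import unicodedata

# Sample text modelled on the question's output (assumed for illustration).
text = "\xa0\xa0\xa0\xa0M. le président."

# U+00A0 is a regular Unicode character; repr() merely shows it escaped.
print(unicodedata.name(text[0]))
print(repr(text))


def normalize_spaces(s: str) -> str:
    """Hypothetical helper: replace no-break spaces and trim the ends."""
    return s.replace("\xa0", " ").strip()


clean = normalize_spaces(text)

# A str has no encoding; UTF-8 appears only when encoding to bytes.
utf8_bytes = clean.encode("utf-8")
assert utf8_bytes.decode("utf-8") == clean  # lossless round trip
print(clean)
```

If the no-break spaces are meaningful for layout, the replacement step can simply be skipped; the string is valid Unicode either way.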

Answer:

import requests
from bs4 import BeautifulSoup

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r = requests.get(url)
print(f'{r.encoding=}')
print(f'{r.apparent_encoding=}')
print()
# Default parse: the declared encoding is used
soup_data = BeautifulSoup(r.content, 'lxml')
print(repr(soup_data.find('a', href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
print(repr(soup_data.find('a', href="#", accesskey="0").text))
print()
# Using the correct encoding
soup_data = BeautifulSoup(r.content, 'lxml', from_encoding='Windows-1252')
print(repr(soup_data.find('a', href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
print(repr(soup_data.find('a', href="#", accesskey="0").text))

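Output. Note the code points in "censure…" and "d’accessibilité" in the first pair of lines. The … (U+2026) and ’ (U+2019) code points do not exist in ISO-8859-1, so the bytes 0x85 and 0x92 are decoded to U+0085 and U+0092 respectively, which are unprintable control codes. I used repr() to display them as the escape codes \x85 and \x92.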

r.encoding='ISO-8859-1'
r.apparent_encoding='Windows-1252'

'Autres scrutins solennels (déclarations, motions de censure\x85)'
'Politique d\x92accessibilité'

'Autres scrutins solennels (déclarations, motions de censure…)'
'Politique d’accessibilité'
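
The difference between the two outputs can be reproduced without any network access. The sketch below decodes the same raw bytes with both codecs, using only the standard library. The byte string is a hand-built sample modelled on the output above, and `decode_html_bytes` is a hypothetical helper. Its mapping of the `iso-8859-1` label to `windows-1252` follows what the WHATWG Encoding Standard prescribes for browsers, which is presumably why the page looks right in a browser despite its declaration.

```python
import codecs

# Hand-built sample bytes (assumption): 0x92 and 0xE9 as they would
# appear in the page's raw content.
raw = b"Politique d\x92accessibilit\xe9"

as_latin1 = raw.decode("iso-8859-1")    # 0x92 -> U+0092 (control code)
as_cp1252 = raw.decode("windows-1252")  # 0x92 -> U+2019 (right single quote)

print(repr(as_latin1))
print(repr(as_cp1252))


def decode_html_bytes(data: bytes, declared: str) -> str:
    """Hypothetical helper: treat a declared Latin-1 label as Windows-1252."""
    if codecs.lookup(declared).name == "iso8859-1":
        declared = "windows-1252"
    # Windows-1252 leaves a few bytes undefined, so replace rather than fail.
    return data.decode(declared, errors="replace")


text = decode_html_bytes(raw, "ISO-8859-1")
print(text)
```

The same idea applies to the BeautifulSoup call shown earlier: passing `from_encoding='Windows-1252'` overrides the declared label before parsing.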
