in XML parser.feed(text) xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

Asked by: Yasser Mohamed  Asked: 8/9/2023  Last edited by: Yasser Mohamed  Updated: 8/11/2023  Views: 181

Q:

I tested this code in Ping AI and it worked, but it does not work in my Vstudio.

import urllib.request
import urllib.parse
import urllib.error
import xml.etree.ElementTree as ET
import ssl
api_key = False
if api_key is False:
    api_key = 42
    service_url = 'http://py4e-data.dr-chuck.net/xml?'
else:
    service_url = 'https://maps.googleapis.com/maps/api/geocode/xml?'
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

address = 'http://py4e-data.dr-chuck.net/comments_42.xml'
parm = dict()
parm['address'] = address
if api_key is not False:
    parm['key'] = api_key
url = service_url+urllib.parse.urlencode(parm)
print('retrieve ', url)

uh = urllib.request.urlopen(url, context=ctx).read()

print('retrieved ', len(uh), 'characters')
datas = uh.decode()
tree = ET.fromstring(datas)
suum = 0
count = 0
counts = tree.findall('.//count')
for i in counts:
    suum +=int(i.text)
    count +=1
print(suum)
print(count)

My output should be:

Retrieving http://py4e-data.dr-chuck.net/comments_42.xml
Retrieved 4189 characters
Count: 50
Sum: 2...

But my output is:

retrieve http://py4e-data.dr-chuck.net/xml?address=http%3A%2F%2Fpy4e-data.dr-chuck.net%2Fcomments_42.xml&key=42
retrieved  36 characters
Traceback (most recent call last):
  File "e:\ياسر\python_file\project_etree.py", line 28, in <module>
    tree = ET.fromstring(datas)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\beaut\AppData\Local\Programs\Python\Python311\Lib\xml\etree\ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

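I don't know why I get "retrieved 36 characters". I have to use XML parsing because this is a quiz, and I am trying to understand why it does not work for me and where the problem is.

python xml parsing web-scraping beautifulsoup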

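Comments

1 vote mzjn 8/9/2023
This looks very similar to stackoverflow.com/q/76843884/407651
0 votes Yasser Mohamed 8/9/2023
It is the same question, and I still have not found out what I should do @mzjn - I tried to fix the URL (as you said) but it did not work
0 votes Parfait 8/9/2023
Please do not post nearly identical questions with nearly identical code attempts. This post should be closed as a duplicate of the previous one.
0 votes jdweng 8/9/2023
A well-formed XML file has only one element at the root. The error indicates there is an array at the root. Most XML libraries do not accept files that are not well-formed. I can parse it, but I need to see the XML.
0 votes joanis 8/9/2023
Does this answer your question? ParseError: not well-formed XML (from a URL opened with urllib.request.urlopen)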
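
The output in the question (36 characters retrieved instead of the expected 4189, then a ParseError at line 1, column 0) is consistent with the response body not being XML at all: the script wraps the file URL as an `address` query parameter for the service URL instead of requesting the file itself. The sketch below runs without network access. The helper `try_parse` and the sample non-XML body are hypothetical, made up for illustration; the body is not the actual 36-character response.

```python
import urllib.parse
import xml.etree.ElementTree as ET

service_url = 'http://py4e-data.dr-chuck.net/xml?'
address = 'http://py4e-data.dr-chuck.net/comments_42.xml'

# What the question's script requests: the file URL wrapped as a query parameter.
wrapped_url = service_url + urllib.parse.urlencode({'address': address, 'key': 42})
print('requested:', wrapped_url)
print('intended: ', address)


def try_parse(body):
    """Return the root element, or None if body is not well-formed XML (hypothetical helper)."""
    try:
        return ET.fromstring(body)
    except ET.ParseError as err:
        # Showing the start of the body usually makes the cause obvious.
        print('not XML:', err, '| body starts with:', body[:40])
        return None


# Hypothetical non-XML body, for illustration only.
not_xml = try_parse('{"error": "illustrative non-XML body"}')
is_xml = try_parse('<a><count>1</count></a>')
```

Printing the first characters of the downloaded text before calling `ET.fromstring` is a cheap way to check whether the server returned the document you expected.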

A:

0 votes Andrej Kesely 8/10/2023 #1

You can try using BeautifulSoup, which uses lxml under the hood:

import requests
from bs4 import BeautifulSoup

url = "http://py4e-data.dr-chuck.net/comments_42.xml"
soup = BeautifulSoup(requests.get(url).content, "xml")

comments = soup.select("comment")

count, s = 0, 0
for comment in comments:
    count += 1
    # add up every <count> value found inside this <comment>
    s += sum(int(node.text) for node in comment.select("count"))

print(count)
print(s)

Prints:

50
2553

Comments

0 votes Yasser Mohamed 8/10/2023
I have to use XML parsing, because it is a quiz

0 votes Hermann12 8/10/2023 #2

Pandas will show the data as a DataFrame:

import io

import requests
import pandas as pd
import numpy as np

url = r"https://py4e-data.dr-chuck.net/comments_42.xml"
response = requests.get(url)
print(f"Retrieving: {url}")
print(f"Retrieved {len(response.text)} characters")

# Wrap the text in StringIO: newer pandas versions deprecate passing a literal XML string
df = pd.read_xml(io.StringIO(response.text), xpath='//comment')
#print(df)
print("Count:", df.shape[0])

# Sum every numeric column; the result is a Series, not a scalar
sums = df.select_dtypes(np.number).sum().rename('total')
print("Sum:", sums)

Output:

Retrieving: https://py4e-data.dr-chuck.net/comments_42.xml
Retrieved 4189 characters
Count: 50
Sum: count    2553
Name: total, dtype: int64

For xml.etree.ElementTree, use a requests Session:

import requests
import xml.etree.ElementTree as ET

s = requests.Session()

url=r"https://py4e-data.dr-chuck.net/comments_42.xml"

r = s.get(url)
print(r.status_code)
print(type(r.text))

tree = ET.fromstring(r.text)
print(tree)
for elem in tree.iter():
    # do your things
    print(elem.tag)

Or, if you need urllib, that works too:

import urllib.request
import xml.etree.ElementTree as ET

url="https://py4e-data.dr-chuck.net/comments_42.xml"

with urllib.request.urlopen(url) as f:
    xml = f.read()
    # xml is a byte string
    # print(xml)

root = ET.fromstring(xml)

for elem in root.iter():
    # do what you like with the xml content
    print(elem.text)
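
To get from the urllib example to the count and sum the quiz asks for, the same download can feed a small helper. This is a minimal sketch, not the course's reference solution: the function names `count_and_sum` and `fetch` are hypothetical, and the element layout of the inline sample (`commentinfo`, `comments`) is an assumption. The sample uses two name/count pairs so the block runs offline.

```python
import urllib.request
import xml.etree.ElementTree as ET


def count_and_sum(xml_data):
    """Parse XML (str or bytes) and return (number of <count> elements, their sum)."""
    root = ET.fromstring(xml_data)
    values = [int(node.text) for node in root.findall('.//count')]
    return len(values), sum(values)


def fetch(url):
    """Download the raw bytes at url, requesting the XML file directly."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# Small inline sample (assumed structure) so the sketch runs without a network.
sample = """<commentinfo><comments>
<comment><name>Romina</name><count>97</count></comment>
<comment><name>Inika</name><count>2</count></comment>
</comments></commentinfo>"""
print(count_and_sum(sample))  # (2, 99)
```

Against the real file, `count_and_sum(fetch("https://py4e-data.dr-chuck.net/comments_42.xml"))` should match the count of 50 and sum of 2553 reported in this thread, provided the file still has that content.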

-1 votes jdweng 8/10/2023 #3

You can use PowerShell

using assembly System.Xml.Linq

$uri = 'https://py4e-data.dr-chuck.net/comments_42.xml'

$doc = [System.Xml.Linq.XDocument]::Load($uri)

$comments = $doc.Descendants('comment')

$groups = [System.Linq.Enumerable]::GroupBy($comments,  [Func[object,object]]{ param($x) $x[0].Element('name').Value}, [Func[object,object]]{ param($y) $y[0].Element('count').Value})

$dict = [System.Linq.Enumerable]::ToDictionary($groups, [Func[object,object]]{ param($x) $x.Key}, [Func[object,object]]{ param($y) $y})

$dict

Result

Key         Value
---         -----
Romina      97
Laurie      97
Bayli       90
Siyona      90
Taisha      88
Alanda      87
Ameelia     87
Prasheeta   80
Asif        79
Risa        79
Zi          78
Danyil      76
Ediomi      76
Barry       72
Lance       72
Hattie      66
Mathu       66
Bowie       65
Samara      65
Uchenna     64
Shauni      61
Georgia     61
Rivan       59
Kenan       58
Hassan      57
Isma        57
Samanthalee 54
Alexa       51
Caine       49
Grady       47
Anne        40
Rihan       38
Alexei      37
Indie       36
Rhuairidh   36
Annoushka   32
Kenzi       25
Shahd       24
Irvine      22
Carys       21
Skye        19
Atiya       18
Rohan       18
Nuala       14
Maram       12
Carlo       12
Japleen     9
Breeanna    7
Zaaine      3
Inika       2