Asked by PracticingPython on 5/25/2021 · Last edited by PracticingPython · Updated 5/25/2021 · Views: 109
Converting generator of xml elements to list produces unclosed token error
Q:
Below is the code I'm using to extract data from some xml, which is ultimately sent to a csv as a dataframe. Note that this is done in Pyspark. After fetching the initial url and using the "last" element to generate the full list of urls, I use concurrent to speed up the extraction, since most searches produce quite a few results. This produces a generator, which is then converted to a list and eventually saved as a csv.
import urllib.request
import xml.etree.ElementTree
from xml.etree.ElementTree import ParseError
from xml.etree import ElementTree
from pyspark.sql.types import *
from pyspark.sql import Row
from datetime import datetime
from urllib.error import HTTPError, URLError
import threading
from concurrent import futures
import os

startTime = datetime.now()
print(startTime)

# fetch and parse the initial url
url = "url string"
response = urllib.request.urlopen(url)
bytes_ = response.read()
root = xml.etree.ElementTree.fromstring(bytes_)

namespaces = {
    "namespace" : "./namespace/path", ...
}

# build the full list of urls from the "last" link element
next_element = root.findall("./root:link", namespaces=namespaces)
for line in next_element:
    if line.attrib["rel"] == "last":
        next_url_list = ["{}{}{}".format(line.attrib["href"].split("start=")[0], "start=", i * 10) for i in range(0, 1000)]

schema = StructType([StructField('string', StringType()), ...])
df = sqlContext.createDataFrame([], schema)

def task(next_url):
    award = []
    xpaths = [
        "./xpath/paths", ...
    ]
    _fields = [
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : '', 'name' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : '', 'name' : '', 'ID' : '', 'name' : ''},
        {'text' : '', 'name' : '', 'region' : '', 'location' : ''},
        {'text' : ''}
    ]
    schema = StructType([StructField('string', StringType()), ...])
    df = sqlContext.createDataFrame([], schema)

    # fetch and parse one page of results
    response = urllib.request.urlopen(next_url)
    bytes_ = response.read()
    root = xml.etree.ElementTree.fromstring(bytes_)

    # for each record on the page, pull the text/attributes for every xpath
    for count in range(0, len(root.findall("./root/path", namespaces=namespaces))):
        for ele, xpath in enumerate(xpaths):
            try:
                attribs = list(root.findall(xpath, namespaces=namespaces)[count].attrib.keys())
                for attrib in attribs:
                    for i in _fields[ele].keys():
                        if attrib == i:
                            _fields[ele][i] = root.findall(xpath, namespaces=namespaces)[count].attrib[attrib]
                _fields[ele]["text"] = root.findall(xpath, namespaces=namespaces)[count].text
            except IndexError:
                pass
            award.append(_fields[ele].values())
        # flatten the record into a single row and union it onto the dataframe
        award_list = [item for sublist in award for item in sublist]
        award.clear()
        myrdd = sc.parallelize([award_list])
        newRow = spark.createDataFrame(myrdd, schema)
        df = df.unionAll(newRow)
    return df

# run task across all urls concurrently, then collect the results
ex = futures.ThreadPoolExecutor(max_workers=300)
results = ex.map(task, next_url_list)
real_results = list(results)
for i in real_results:
    df = df.unionAll(i)

print(datetime.now() - startTime)
According to the console, my problem occurs at:
real_results = list(results)
Traceback (most recent call last):
  File "testing.py", line 152, in <module>
    real_results = list(results)
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "testing.py", line 124, in task
    root = xml.etree.ElementTree.fromstring(bytes_)
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\xml\etree\ElementTree.py", line 1315, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: unclosed token: line 2497, column 10
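For what it's worth, here is a minimal, self-contained sketch (with a made-up parse_page worker and hard-coded pages instead of my real task and urls) that reproduces the same pattern: the worker raises, but the error only surfaces when the generator returned by Executor.map is consumed.

from concurrent import futures
import xml.etree.ElementTree as ET

# hypothetical stand-ins: the second "page" is truncated xml
pages = ["<root><item>ok</item></root>", "<root><item>ok</item"]

def parse_page(page):
    # raises xml.etree.ElementTree.ParseError on the truncated page
    return ET.fromstring(page).find("item").text

ex = futures.ThreadPoolExecutor(max_workers=2)
results = ex.map(parse_page, pages)   # no exception raised here; results are retrieved lazily

try:
    real_results = list(results)      # ParseError is re-raised here, while consuming the generator
except ET.ParseError as e:
    print("surfaced while consuming the generator:", e)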
I understand that the website essentially failed to return the full xml for that page, but I don't understand why the error happens at this step rather than inside the task function, or, more importantly, how I can recover without having to start over completely.
Any ideas would be very helpful. Updated with the full traceback.
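The only workaround I've come up with so far (untested; safe_task and the retry count are just placeholders, and it reuses the task, ex, and next_url_list defined above) is to catch the failure inside a wrapper so one bad page doesn't abort the entire map:

def safe_task(next_url, retries=2):
    # wrapper around the existing task(); retry count is an arbitrary guess
    for attempt in range(retries + 1):
        try:
            return task(next_url)
        except (ParseError, HTTPError, URLError):
            if attempt == retries:
                return None   # give up on this url instead of killing the whole run

results = ex.map(safe_task, next_url_list)
real_results = [r for r in results if r is not None]
for i in real_results:
    df = df.unionAll(i)

I don't know whether this is the right way to recover, or whether it just hides the truncated responses.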
A: No answers yet