Converting generator of xml elements to list produces unclosed token error

Question:

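Below is the code I use to extract data from some XML, which ultimately ends up as a DataFrame written to CSV. Note that this is done in PySpark. After fetching the initial URL and using its "last" element to generate the full list of URLs, I use concurrent.futures to speed up the extraction, since most searches return quite a lot of results. This produces a generator, which is then converted to a list and finally saved as CSV.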


import urllib.request
import xml.etree.ElementTree
from xml.etree.ElementTree import ParseError
from xml.etree import ElementTree
from pyspark.sql.types import *
from pyspark.sql import Row
from datetime import datetime
from urllib.error import HTTPError, URLError
import threading
from concurrent import futures
import os

startTime = datetime.now()
print(startTime)

url =  "url string"
    

response = urllib.request.urlopen(url)
bytes_ = response.read()
root = xml.etree.ElementTree.fromstring(bytes_)

namespaces = {
        "namespace" : "./namespace/path", ...
}
next_element=root.findall("./root:link", namespaces=namespaces)    
for line in next_element:
    if line.attrib["rel"]=="last":
        next_url_list = ["{}{}{}".format(line.attrib["href"].split("start=")[0],"start=",i*10) for i in range(0,1000)]

schema = StructType([StructField('string', StringType()), ...])


df = sqlContext.createDataFrame([],schema)

def task(next_url):
    award = []
    
    xpaths = [
    "./xpath/paths", ...
    ]
    
    _fields = [
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : '','name' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : '', 'name' : '', 'ID' : '', 'name' : ''},
        {'text' : '', 'name' : '', 'region' : '', 'location' : ''},
        {'text' : ''}
        ]
    schema = StructType([StructField('string', StringType()), ...])

    df = sqlContext.createDataFrame([],schema)
    response = urllib.request.urlopen(next_url)
    bytes_ = response.read()
    root = xml.etree.ElementTree.fromstring(bytes_)

    for count in range(0,len(root.findall("./root/path", namespaces=namespaces))):
    
        for ele, xpath in enumerate(xpaths):
            try:
                attribs = list(root.findall(xpath,namespaces=namespaces)[count].attrib.keys())
            
                for attrib in attribs:
                    for i in _fields[ele].keys():
            
                        if attrib == i:
                            _fields[ele][i] = root.findall(xpath, namespaces=namespaces)[count].attrib[attrib]
                            
                _fields[ele]["text"] =root.findall(xpath, namespaces=namespaces)[count].text
               
            except IndexError:
                pass
            award.append(_fields[ele].values())
            award_list = [item for sublist in award for item in sublist]
        award.clear()
        myrdd = sc.parallelize([award_list])
        newRow = spark.createDataFrame(myrdd, schema)
        df=df.unionAll(newRow)
    return df   
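# Fetch and parse every results page in a thread pool.
ex = futures.ThreadPoolExecutor(max_workers = 300)
results = ex.map(task, next_url_list)  # returns a lazy iterator of results
real_results = list(results)  # the ParseError is reported on this line
for i in real_results:
    df=df.unionAll(i)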



print(datetime.now()-startTime)

According to the console, my problem occurs at:

real_results = list(results)

Traceback (most recent call last):
  File "testing.py", line 152, in <module>
    real_results = list(results)
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "testing.py", line 124, in task
    root = xml.etree.ElementTree.fromstring(bytes_)
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\xml\etree\ElementTree.py", line 1315, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: unclosed token: line 2497, column 10

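I understand that the site essentially failed to return the complete XML for one of the pages, but I don't understand why the error surfaces at this step rather than inside the task function or, more importantly, how I can recover without having to start over completely.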
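
One likely reading of the traceback: the ParseError does originate inside the task function (the frame `in task` is in the traceback), but `Executor.map` returns a lazy iterator, and an exception raised in a worker is stored on its future and only re-raised when that result is retrieved, which here happens inside `list(results)`. A minimal sketch of that behaviour, where `parse` and the two payloads are hypothetical stand-ins for the real task and the downloaded pages:

```python
from concurrent import futures
from xml.etree import ElementTree
from xml.etree.ElementTree import ParseError

# Hypothetical payloads standing in for downloaded pages; the second is truncated.
payloads = [b"<feed><entry/></feed>", b"<feed><entry"]

def parse(payload):
    # Raises ParseError inside the worker thread for the truncated payload.
    return ElementTree.fromstring(payload).tag

with futures.ThreadPoolExecutor(max_workers=2) as ex:
    results = ex.map(parse, payloads)  # nothing is raised at this point
    collected = []
    error = None
    try:
        for tag in results:  # the worker's exception is re-raised here
            collected.append(tag)
    except ParseError as exc:
        error = exc
```

Results that come before the failing one in input order are still delivered; iteration stops at the first result whose worker raised.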

Any ideas would be very helpful. Updated with the full traceback.
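
One possible way to recover without restarting, sketched under assumptions: catch the error per URL inside the worker, retry a few times, and report the URLs that still fail so that only those need another pass. The names `fetch_root`, `fetch_all`, the `fetch` callable, and the `retries` parameter are hypothetical and not part of the code above; in the real script `fetch` would wrap `urllib.request.urlopen(url).read()`.

```python
from concurrent import futures
from urllib.error import URLError
from xml.etree import ElementTree
from xml.etree.ElementTree import ParseError

def fetch_root(url, fetch, retries=2):
    """Return (url, root) on success, or (url, None) once all attempts fail.

    `fetch` is a hypothetical callable that returns the page bytes for `url`.
    """
    for _ in range(retries + 1):
        try:
            return url, ElementTree.fromstring(fetch(url))
        except (ParseError, URLError):
            continue  # truncated or failed download: try again
    return url, None

def fetch_all(urls, fetch, max_workers=8):
    """Parse every URL concurrently; collect failures instead of raising."""
    parsed, failed = {}, []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        pending = [ex.submit(fetch_root, u, fetch) for u in urls]
        for fut in futures.as_completed(pending):
            url, root = fut.result()
            if root is None:
                failed.append(url)
            else:
                parsed[url] = root
    return parsed, failed
```

With this shape a single truncated page no longer aborts the whole run: `parsed` keeps every page that was read successfully, and `failed` can be passed to `fetch_all` again later.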

python xml
