Converting generator of xml elements to list produces unclosed token error

Asked by: PracticingPython Asked: 5/25/2021 Last edited by: PracticingPython Updated: 5/25/2021 Views: 109

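Q:

Below is the code I am using to extract data from some XML, which ultimately goes to a CSV by way of a dataframe. Note that this is done in PySpark. After fetching the initial URL and using the "last" element to build the full list of URLs, I use concurrent.futures to speed up the extraction, since most searches return quite a few results. This produces a generator, which I then convert to a list and eventually save as a CSV.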


import urllib.request
import xml.etree.ElementTree
from xml.etree.ElementTree import ParseError
from xml.etree import ElementTree
from pyspark.sql.types import *
from pyspark.sql import Row
from datetime import datetime
from urllib.error import HTTPError, URLError
import threading
from concurrent import futures
import os

startTime = datetime.now()
print(startTime)

url =  "url string"
    

response = urllib.request.urlopen(url)
bytes_ = response.read()
root = xml.etree.ElementTree.fromstring(bytes_)

namespaces = {
        "namespace" : "./namespace/path", ...
}
next_element=root.findall("./root:link", namespaces=namespaces)    
for line in next_element:
    if line.attrib["rel"]=="last":
        next_url_list = ["{}{}{}".format(line.attrib["href"].split("start=")[0],"start=",i*10) for i in range(0,1000)]

schema = StructType([StructField('string', StringType()), ...])


df = sqlContext.createDataFrame([],schema)

def task(next_url):
    award = []
    
    xpaths = [
    "./xpath/paths", ...
    ]
    
    _fields = [
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : '','name' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : ''},
        {'text' : '', 'name' : '', 'ID' : '', 'name' : ''},
        {'text' : '', 'name' : '', 'region' : '', 'location' : ''},
        {'text' : ''}
        ]
    schema = StructType([StructField('string', StringType()), ...])

    df = sqlContext.createDataFrame([],schema)
    response = urllib.request.urlopen(next_url)
    bytes_ = response.read()
    root = xml.etree.ElementTree.fromstring(bytes_)

    for count in range(0,len(root.findall("./root/path", namespaces=namespaces))):
    
        for ele, xpath in enumerate(xpaths):
            try:
                attribs = list(root.findall(xpath,namespaces=namespaces)[count].attrib.keys())
            
                for attrib in attribs:
                    for i in _fields[ele].keys():
            
                        if attrib == i:
                            _fields[ele][i] = root.findall(xpath, namespaces=namespaces)[count].attrib[attrib]
                            
                _fields[ele]["text"] =root.findall(xpath, namespaces=namespaces)[count].text
               
            except IndexError:
                pass
            award.append(_fields[ele].values())
            award_list = [item for sublist in award for item in sublist]
        award.clear()
        myrdd = sc.parallelize([award_list])
        newRow = spark.createDataFrame(myrdd, schema)
        df=df.unionAll(newRow)
    return df   
ex = futures.ThreadPoolExecutor(max_workers = 300)
results = ex.map(task, next_url_list)
real_results = list(results)
for i in real_results:
    df=df.unionAll(i)



print(datetime.now()-startTime)

According to the console, my problem occurs at:

real_results = list(results)

Traceback (most recent call last):
  File "testing.py", line 152, in <module>
    real_results = list(results)
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "testing.py", line 124, in task
    root = xml.etree.ElementTree.fromstring(bytes_)
  File "C:\Users\123\Desktop\Tools\Enthought\.edm\envs\py36\lib\xml\etree\ElementTree.py", line 1315, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: unclosed token: line 2497, column 10

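I know the website is basically failing to return the full XML for a page, but I don't understand why the error occurs at this step rather than inside the task function or, more importantly, how I can recover without having to start over completely.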

Any ideas would be very helpful. Updated with the full traceback.
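On the "why here" part: the traceback shows the ParseError is raised inside task (testing.py line 124) and then re-raised from result_iterator. Executor.map returns its iterator immediately, and an exception from a worker only surfaces when that worker's result is pulled from the iterator, which is what list(results) does. Below is a minimal sketch of that behaviour and of one possible way to keep the pages that did parse. It assumes an in-memory PAGES dict as a hypothetical stand-in for the real downloads; PAGES, run_all and map_raises_on_iteration are illustrative names, not part of the code above, and the Spark steps are left out.

```python
from concurrent import futures
from xml.etree import ElementTree
from xml.etree.ElementTree import ParseError

# Hypothetical stand-in for the real downloads: "page2" is truncated XML.
PAGES = {
    "page0": b"<feed><entry>a</entry></feed>",
    "page1": b"<feed><entry>b</entry></feed>",
    "page2": b"<feed><entry>c</entry",  # cut off in the middle of a tag
}


def task(next_url):
    # Parsing runs in the worker thread, so ParseError is raised here.
    root = ElementTree.fromstring(PAGES[next_url])
    return [entry.text for entry in root.findall("./entry")]


def map_raises_on_iteration(urls):
    """Show that map() itself does not raise; consuming its results does."""
    with futures.ThreadPoolExecutor(max_workers=2) as ex:
        results = ex.map(task, urls)  # returns right away, nothing raised yet
        try:
            list(results)  # the worker's ParseError is re-raised here
        except ParseError:
            return True
    return False


def run_all(urls, max_workers=4):
    """Run task on every URL; return (results keyed by URL, failed URLs)."""
    done, failed = {}, []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        future_to_url = {ex.submit(task, u): u for u in urls}
        for fut in futures.as_completed(future_to_url):
            url = future_to_url[fut]
            try:
                done[url] = fut.result()  # re-raises this worker's exception
            except ParseError:
                failed.append(url)  # keep the URL so only it is retried
    return done, failed


done, failed = run_all(list(PAGES))
print(sorted(done), failed)
```

With submit and as_completed each future is checked on its own, so one truncated page no longer discards the results of the others, and the failed list can be passed to run_all again as a retry pass. Whether a retry actually returns complete XML depends on the site, which this sketch cannot show.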

python xml

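Comments

1 upvote Barmar 5/25/2021
Please show the full traceback. The error must be happening somewhere in the task function.
1 upvote mzjn 5/25/2021
Honestly, I have no idea what you are doing. Please narrow it down to a minimal reproducible example.
0 upvotes PracticingPython 5/25/2021
@Barmar, I have edited my OP to include the full traceback.
0 upvotes mzjn 5/25/2021
"Below is the code I am using to extract data from some XML..." What XML is that? "I know the website is basically failing..." What website is that? The added traceback does not help.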

A: No answers yet