Get text elements from an XML tag and the tag's optional children with ElementTree in Python

Asked by: SlowBear Asked: 7/26/2023 Last edited by: SlowBear Updated: 8/10/2023 Views: 65

Q:

I have an XML document (saved on my drive):

xml="""
<?xml version="1.0"?>
<front>
<z id="37">some text sitting here</z>
<label>&#26;</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""

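I want to extract all of the text elements and store them in a dataframe similar to the following:

id  text
37  some text sitting here
38  Another sentence to read.
... ...
40  This document is called 101...help me! Stcks guide to xml text

I used the following to build the dataframe, but it misses the text that sits inside the extra nested tags: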

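import xml.etree.ElementTree as ET

file = 'my_file_location.xml'
tree = ET.parse(file)
root = tree.getroot()

xmltext = []

for z in root.iter('z'):
    # z.text only holds the text that comes before the first child element
    txt = z.text
    xmltext.append(txt)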

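I can obviously get the "some text sitting here" and "Another sentence to read." elements, but I can't get any text from the elements nested inside the p tag, i.e. std-a, std-b, etc.

python xml xml-parsing elementtree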

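Comments

0 votes SlowBear 7/26/2023
Updated the post to include my code.

A:

2 votes Daniel Haley 7/26/2023 #1

In this case, the simplest approach is to use itertext() to get the text. It walks the element and all of its descendants, so text inside nested tags is included.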

import xml.etree.ElementTree as ET
from pprint import pprint

xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
root = ET.fromstring(xml)

xmltext = []

for z in root.iter('z'):
    txt = "".join(z.itertext())
    xmltext.append(txt)

pprint(xmltext)

Printed output...

['some text sitting here',
 'Another sentence to read.',
 'The contents of a document.',
 'This document is called 101...help me!, Stcks guide to xml text. ']
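
To see why the original loop lost text, it may help to look at how ElementTree stores mixed content. The following is a minimal sketch, assuming a trimmed-down copy of the `<z id="40">` element (the attributes on `std-id` are dropped for brevity, and the variable names are only illustrative). It compares `.text`, the children's `.tail` values, and `itertext()`:

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the <z id="40"> element from the question (assumption:
# the std-id attributes are omitted because they do not affect the text).
snippet = (
    '<z id="40">This document is called '
    '<std><std-id>101...</std-id><std-ref>help me!</std-ref></std>, '
    '<italic>Stcks guide to xml text</italic>. </z>'
)
z = ET.fromstring(snippet)

# .text holds only the text that appears before the first child element.
lead_text = z.text

# Text that follows a child's closing tag is stored on that child as .tail.
tails = [child.tail for child in z]

# itertext() yields the text and tail values of the whole subtree in document order.
pieces = list(z.itertext())
full_text = "".join(pieces)

print(repr(lead_text))
print(tails)
print(pieces)
print(repr(full_text))
```

Here `lead_text` stops at the first nested tag, which is what `z.text` returned in the question, while joining the pieces from `itertext()` recovers the full sentence.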
1 vote Hermann12 7/27/2023 #2

You can combine iterparse() with itertext(), as @Daniel Haley mentioned, and add pandas if you need a dataframe.


import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO

xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
f = StringIO(xml)

data = []
columns = ['id', 'TEXT']

for event, elem in ET.iterparse(f, events=('start', 'end', 'comment', 'pi')):
    #print(event, elem.tag, elem.attrib, elem.text, elem.tail)
    if elem.get('id') is not None and event == 'end':
        if elem.get('id').isnumeric() and elem.text:
            # I gave @Daniel a vote
            txt = "".join(elem.itertext())
            print(elem.get('id'), txt)
            row = elem.get('id'), txt
            data.append(row)
            

print()           
df = pd.DataFrame(data, columns=columns)
print(df.to_string(index=False))

Output:

37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text. 

id                                                              TEXT
37                                            some text sitting here
38                                         Another sentence to read.
39                                       The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
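
The last row of the printed list still carries a trailing space that comes from the source XML. If that matters, one option is to normalize whitespace while collecting the rows. This is only a sketch under assumptions: it uses a shortened two-element copy of the sample XML, sticks to the standard library, and the `rows` name is illustrative rather than taken from either answer.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (assumption: two <z> elements are enough
# to show the idea, and the std-id attributes are omitted).
xml = """<front>
<z id="37">some text sitting here</z>
<z id="40">This document is called <std><std-id>101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</front>
"""
root = ET.fromstring(xml)

rows = []
for z in root.iter("z"):
    # Join all nested text, then collapse runs of whitespace and trim both ends.
    text = " ".join("".join(z.itertext()).split())
    rows.append((z.get("id"), text))

for row in rows:
    print(row)
```

The resulting list of `(id, text)` tuples could then be passed to `pd.DataFrame(rows, columns=['id', 'TEXT'])` in the same way as the `data` list above.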