Get text elements from xml tag and tag's optional children with ElemTree Python

Q:

I have an XML document (saved on my drive):

xml="""
<?xml version="1.0">
<front>
<z id="37">some text sitting here</z>
<label>&#26;</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z></front>
"""

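I want to extract all the text elements and store them in a dataframe similar to the following:

id	text
37	some text sitting here
38	Another sentence to read.
...	...
40	This document is called 101...help me!, Stcks guide to xml text.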

I used this to generate the dataframe, but it misses the text that sits inside the extra tags:

import xml.etree.ElementTree as ET

file = ('[my_file_location.xml')
tree = ET.parse(file)
root = tree.getroot()

xmltext = []

for z in root.iter('z'):
    txt = z.text
    xmltext.append(txt)

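I can obviously get the "some text sitting here" and "Another sentence to read" elements, but I can't get any text from the elements nested inside the p tags, i.e. std-a, std-b, etc.

python xml xml-parsing elementtree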

import xml.etree.ElementTree as ET
from pprint import pprint

xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
root = ET.fromstring(xml)

xmltext = []

for z in root.iter('z'):
    txt = "".join(z.itertext())
    xmltext.append(txt)

pprint(xmltext)

Printed output...

['some text sitting here',
 'Another sentence to read.',
 'The contents of a document.',
 'This document is called 101...help me!, Stcks guide to xml text. ']
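The original loop missed the nested text because, in ElementTree, `.text` holds only the text before an element's first child, while the text that follows a child's closing tag is stored on that child's `.tail`. `itertext()` walks both in document order. A minimal sketch of the difference, using a shortened copy of the `z id="40"` element (the names `snippet`, `z` and `std` are just for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <z id="40"> element from the question (illustrative only).
snippet = (
    '<z id="40">This document is called '
    '<std><std-ref>help me!</std-ref></std>, <italic>guide</italic>. </z>'
)
z = ET.fromstring(snippet)

# .text stops at the first child element.
print(repr(z.text))

# Text that follows a child's closing tag is stored on that child's .tail.
std = z.find('std')
print(repr(std.tail))

# itertext() yields the text and tail values of all descendants in document order.
print(repr("".join(z.itertext())))
```

This is why joining `itertext()` picks up the contents of `std-ref` and `italic`, whereas `z.text` alone ends right before `<std>`.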

1 upvote Hermann12 7/27/2023 #2

You can use iterparse() together with itertext(), as @Daniel Haley mentioned, and combine it with pandas if you want:


import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO

xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z> 
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
f = StringIO(xml)

data = []
columns = ['id', 'TEXT']

for event, elem in ET.iterparse(f, events=('start', 'end', 'comment', 'pi')):
    #print(event, elem.tag, elem.attrib, elem.text, elem.tail)
    if elem.get('id') is not None and event == 'end':
        if elem.get('id').isnumeric() and elem.text:
            # I gave @Daniel a vote
            txt = "".join(elem.itertext())
            print(elem.get('id'), txt)
            row = elem.get('id'), txt
            data.append(row)
            

print()           
df = pd.DataFrame(data, columns=columns)
print(df.to_string(index=False))

Output:

37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text. 

id                                                              TEXT
37                                            some text sitting here
38                                         Another sentence to read.
39                                       The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
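One caveat with the `elem.text` check: an element whose content starts directly with a child tag has `text` set to `None` and would be skipped, even though its children contain text. A possible variation, offered only as a sketch and not part of the answer itself, is to test the joined text instead and collapse stray whitespace at the same time. The helper name `collect_rows` and the shortened `sample` document are hypothetical:

```python
import xml.etree.ElementTree as ET

def collect_rows(root, tag='z'):
    """Return (id, text) tuples for every <tag> element under root."""
    rows = []
    for elem in root.iter(tag):
        # Join all nested text, then collapse runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            rows.append((elem.get('id'), text))
    return rows

# Shortened, hypothetical sample: <z id="40"> starts directly with a child tag.
sample = """<front>
<z id="37">some text sitting here</z>
<sec id="sec-introduction">
<z id="40"><std><std-ref>help me!</std-ref></std>, <italic>Stcks guide</italic>. </z>
</sec>
</front>"""

rows = collect_rows(ET.fromstring(sample))
print(rows)
```

The resulting list of tuples can be handed to `pd.DataFrame(rows, columns=['id', 'TEXT'])` in the same way as `data` in the code above.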
