Asked by: SlowBear Asked: 7/26/2023 Last edited by: SlowBear Updated: 8/10/2023 Views: 65
Get text elements from xml tag and tag's optional children with ElementTree in Python
Q:
I have an XML document (saved on my drive):
xml="""
<?xml version="1.0"?>
<front>
<z id="37">some text sitting here</z>
<label></label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
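I want to extract all of the text elements and store them in a dataframe similar to the following:
id | text |
---|---|
37 | some text sitting here |
38 | Another sentence to read. |
... | ... |
40 | This document is called 101...help me! Stcks guide to xml text |
I use the following to build the dataframe, but it misses the text that sits inside the extra nested tags: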
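import xml.etree.ElementTree as ET

file = ('[my_file_location.xml')
tree = ET.parse(file)
root = tree.getroot()
xmltext = []
for z in root.iter('z'):
    # .text only holds the text that comes before the first child element
    txt = z.text
    xmltext.append(txt)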
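I can obviously get the "some text sitting here" and "Another sentence to read." elements, but I can't get any text from the elements nested inside the p tag, i.e. std-a, std-b, etc.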
A:
2 votes
Daniel Haley
7/26/2023
#1
The easiest thing to do in this case is to use itertext() to get the text.
import xml.etree.ElementTree as ET
from pprint import pprint
xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
root = ET.fromstring(xml)
xmltext = []
for z in root.iter('z'):
    txt = "".join(z.itertext())
    xmltext.append(txt)
pprint(xmltext)
Printed output...
['some text sitting here',
'Another sentence to read.',
'The contents of a document.',
'This document is called 101...help me!, Stcks guide to xml text. ']
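For context on why z.text alone falls short: in ElementTree, an element's text attribute holds only the characters before its first child, and any text that follows a child's closing tag is stored on that child's tail attribute. The sketch below compares the two on a trimmed-down copy of the z id="40" element (an illustrative fragment, not the full document from the question).

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the z id="40" element, for illustration only.
z = ET.fromstring(
    '<z id="40">This document is called <std><std-ref>help me!</std-ref></std>, '
    '<italic>Stcks guide to xml text</italic>. </z>'
)

# .text stops at the first child element (<std>).
text_only = z.text

# Text after a child's closing tag is stored on that child's .tail.
std_tail = z.find('std').tail

# itertext() yields every text and tail inside the element, in document order.
full_text = "".join(z.itertext())

print(repr(text_only))
print(repr(std_tail))
print(repr(full_text))
```

This is why joining the pieces from itertext() recovers the full sentence, while z.text returns only its opening words.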
1 vote
Hermann12
7/27/2023
#2
You can use iterparse() together with itertext(), as mentioned by @Daniel Haley, and add pandas if needed.
import pandas as pd
import xml.etree.ElementTree as ET
from io import StringIO
xml = """<front>
<z id="37">some text sitting here</z>
<label>foo</label>
<z id="38">Another sentence to read.</z>
<z id="39">The contents of a document.</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-id std-id-link-type="urn" std-id-type="undated">101...</std-id><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""
f = StringIO(xml)
data = []
columns = ['id', 'TEXT']
for event, elem in ET.iterparse(f, events=('start', 'end', 'comment', 'pi')):
    #print(event, elem.tag, elem.attrib, elem.text, elem.tail)
    if elem.get('id') is not None and event == 'end':
        if elem.get('id').isnumeric() and elem.text:
            # I gave @Daniel a vote
            txt = "".join(elem.itertext())
            print(elem.get('id'), txt)
            row = elem.get('id'), txt
            data.append(row)
print()
df = pd.DataFrame(data, columns=columns)
print(df.to_string(index=False))
Output:
37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
id TEXT
37 some text sitting here
38 Another sentence to read.
39 The contents of a document.
40 This document is called 101...help me!, Stcks guide to xml text.
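If the whole document fits in memory, streaming with iterparse() is not strictly required. A minimal sketch, assuming a shortened copy of the sample XML and assuming that only z elements are of interest (the code above selects on numeric id attributes instead), could collect the same id/text pairs with root.iter('z'). The strip() call is an optional addition here that drops the trailing space left by the markup.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, for illustration only.
xml = """<front>
<z id="37">some text sitting here</z>
<sec id="sec-introduction" sec-type="intro">
<z id="40">This document is called <std><std-ref>help me!</std-ref></std>, <italic>Stcks guide to xml text</italic>. </z>
</sec>
</front>
"""

root = ET.fromstring(xml)

# One (id, text) pair per <z>, found at any depth below the root.
rows = [(z.get('id'), "".join(z.itertext()).strip()) for z in root.iter('z')]

for row in rows:
    print(row)
```

The resulting rows list can be passed to pd.DataFrame(rows, columns=['id', 'TEXT']) in the same way as the data list above.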