Export text from XML with self-closing tags

Question:

I have a set of TEI XML files containing transcriptions of documents. I would like to parse these XML files and extract only the text.

My XML looks like this:

<?xml version='1.0' encoding='UTF8'?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <ab>
        <pb n="page1"/>
          <cb n="1"/>
            <lb xml:id="DD1" n="1"/>my sentence 1
            <lb xml:id="DD2" n="2"/>my sentence 2
            <lb xml:id="DD3" n="3"/>my sentence 3
          <cb n="2"/>
            <lb xml:id="DD1" n="1"/>my sentence 4
            <lb xml:id="DD2" n="2"/>my sentence 5
        <pb n="page2"/>
          <cb n="1"/>
            <lb xml:id="DD1" n="1"/>my sentence 1
            <lb xml:id="DD2" n="2"/>my sentence 2
          <cb n="2"/>
            <lb xml:id="DD1" n="1"/>my sentence 3
            <lb xml:id="DD1" n="2"/>my sentence 4
      </ab>
    </body>
  </text>
</TEI>

I tried to access the information with lxml like this:

with open(file,'r') as my_file:
    
    root = ET.parse(my_file, parser = ET.XMLParser(encoding = 'utf-8'))
    list_pages = root.findall('.//{http://www.tei-c.org/ns/1.0}pb')
    for page in list_pages:
        liste_text = page.findall('.//{http://www.tei-c.org/ns/1.0}lb')
    
    final_text = []
    
    for content in liste_text:
        final_text.append(content.text)

In the end I would like to get something like this:

page1
my sentence 1
my sentence 2
my sentence 3
my sentence 4
my sentence 5
page2
my sentence 1
my sentence 2
my sentence 3
my sentence 4

I can reach the lb elements, but they carry no text. Could you help me extract this information? Thanks.

python-3.x xml parsing lxml tei

Answer:

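from lxml import etree

# my_file is the path to one TEI file (or an open file object)
root = etree.parse(my_file)
# name() compares the tag as written, so the TEI namespace can be ignored
for p in root.xpath('//*[name()="pb"]'):
    print(p.get('n').strip())
    # walk the elements that follow this <pb/>, skipping column breaks
    for lb in p.xpath('./following-sibling::*[not(name()="cb")]'):
        if lb.xpath('name()') == "pb":
            break  # the next page starts here
        # the sentence is the tail of the self-closing <lb/>, not its text
        print((lb.tail or '').strip())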

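This should print the expected output. The text that follows a self-closing element such as <lb/> is stored in that element's tail attribute, not in text, which is why content.text came back empty in your attempt.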
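
The same idea can be sketched without XPath by walking the tree in document order and reading each lb element's tail. This is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree so it runs without extra dependencies (lxml.etree offers the same iter(), get() and tail, so swapping the import should work too), the sample is a shortened copy of the XML from the question, and extract_pages, TEI_NS and SAMPLE are names made up for this example.

```python
import xml.etree.ElementTree as ET

# TEI namespace taken from the question's XML.
TEI_NS = "http://www.tei-c.org/ns/1.0"
PB = f"{{{TEI_NS}}}pb"
LB = f"{{{TEI_NS}}}lb"

# Shortened copy of the sample document (assumption: same structure).
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><ab>
    <pb n="page1"/>
      <cb n="1"/>
        <lb n="1"/>my sentence 1
        <lb n="2"/>my sentence 2
      <cb n="2"/>
        <lb n="1"/>my sentence 3
    <pb n="page2"/>
      <cb n="1"/>
        <lb n="1"/>my sentence 1
  </ab></body></text>
</TEI>"""


def extract_pages(root):
    """Return a list of (page_name, [lines]) pairs in document order."""
    pages = []
    # iter() visits every element in document order.
    for el in root.iter():
        if el.tag == PB:
            # A page break opens a new page.
            pages.append((el.get("n"), []))
        elif el.tag == LB and pages:
            # Text after a self-closing <lb/> is its tail, not its text.
            line = (el.tail or "").strip()
            if line:
                pages[-1][1].append(line)
    return pages


root = ET.fromstring(SAMPLE)
pages = extract_pages(root)
for name, lines in pages:
    print(name)
    for line in lines:
        print(line)
```

Collecting the lines per page instead of printing them directly keeps the result reusable, for example to write one text file per page. Column breaks (cb) are simply ignored because only pb and lb tags are examined.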
