Parsing Word documents in Python with zipfile and ElementTree, but failing to extract paragraphs up to the next heading

Question:

I am using zipfile and ElementTree to extract headings (styled as Heading2) together with the paragraphs that belong to them. Here is my code:

import zipfile
import xml.etree.ElementTree as ET

doc = zipfile.ZipFile('./data/test.docx').read('word/document.xml')
root = ET.fromstring(doc)


ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
body = root.find('w:body', ns)  # find the XML "body" tag
p_sections = body.findall('w:p', ns)  # under the body tag, find all the paragraph sections

for p in p_sections:
    text_elems = p.findall('.//w:t', ns)
    print(''.join([t.text for t in text_elems]))
    print()

def is_heading2_section(p):
    """Returns True if the given paragraph section has been styled as a Heading2"""
    return_val = False
    heading_style_elem = p.find(".//w:pStyle[@w:val='Heading2']", ns)
    if heading_style_elem is not None:
        return_val = True
    return return_val
 
 
def get_section_text(p):
    """Returns the joined text of the text elements under the given paragraph tag"""
    return_val = ''
    text_elems = p.findall('.//w:t', ns)
    if text_elems is not None:
        return_val = ''.join([t.text for t in text_elems])
    return return_val
 

section_labels = [get_section_text(s) if is_heading2_section(s) else '' for s in p_sections]
section_text = [{'title': t, 'text': get_section_text(p_sections[i+1])} for i, t in enumerate(section_labels) if len(t) > 0]

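However, if my Word document looks like this:

Heading 2

Paragraph 1

Paragraph 2

Another Heading 2

Paragraph 3

then the variable section_text built on the last line of code only captures the first paragraph after each heading: "Heading 2: Paragraph 1" and "Another Heading 2: Paragraph 3". Paragraph 2 is missing.

Is there a way to make section_text on the last line of code contain:

Heading 2: Paragraph 1 and Paragraph 2

Another Heading 2: Paragraph 3

Thanks.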
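One possible approach, offered as a minimal sketch rather than a definitive answer: instead of looking only at p_sections[i+1], walk the paragraphs in order and append each non-heading paragraph to the most recent Heading2 section. The names group_by_heading2, is_heading2, get_text, and the inline SAMPLE XML are assumptions made for illustration; SAMPLE stands in for the word/document.xml read from the .docx file in the question.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
ns = {'w': W_NS}


def is_heading2(p):
    """True if the paragraph carries the Heading2 paragraph style."""
    style = p.find('w:pPr/w:pStyle', ns)
    return style is not None and style.get(f'{{{W_NS}}}val') == 'Heading2'


def get_text(p):
    """Joined text of all w:t elements in the paragraph."""
    return ''.join(t.text or '' for t in p.findall('.//w:t', ns))


def group_by_heading2(paragraphs):
    """Collect every paragraph under the most recent Heading2 paragraph."""
    sections = []
    for p in paragraphs:
        if is_heading2(p):
            # A new heading starts a new section.
            sections.append({'title': get_text(p), 'text': []})
        elif sections:
            # Paragraphs before the first heading are skipped.
            text = get_text(p)
            if text:  # ignore empty paragraphs
                sections[-1]['text'].append(text)
    return sections


# Hypothetical stand-in for word/document.xml (assumption, not the real file).
SAMPLE = f'''<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Heading 2</w:t></w:r></w:p>
<w:p><w:r><w:t>Paragraph 1</w:t></w:r></w:p>
<w:p><w:r><w:t>Paragraph 2</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Another Heading 2</w:t></w:r></w:p>
<w:p><w:r><w:t>Paragraph 3</w:t></w:r></w:p>
</w:body></w:document>'''

body = ET.fromstring(SAMPLE).find('w:body', ns)
section_text = group_by_heading2(body.findall('w:p', ns))
print(section_text)
```

With the real document, the body element would come from ET.fromstring(doc) as in the question. Each section's 'text' is kept as a list of paragraphs here; ' '.join(section['text']) would turn it into a single string if that is preferred.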

python xml parsing
