Conditionally replace a node in XML (a Word doc) with Python?

Asked by: DOR | Asked: 9/26/2023 | Last edited by: DOR | Updated: 9/27/2023 | Views: 66

Q:

How do I replace XML conditional on the content of a nearby tag?

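I have a long Word document that contains many developer content controls, specifically dropdown lists. I'd like to change the options in some (but not all) of the dropdown lists, conditional on the tag next to each dropdown. Specifically, if the nearby tag ends with "_marker", I want to replace the list options. I've made some progress using lxml and xpath: I iterate across nodes to find the tags that meet the condition (i.e. end with "_marker"), then paste in the appropriate XML for the dropdown options and delete the old XML. However, only one "_marker" dropdown (the last one in the document) is updated correctly. In all the other "_marker" dropdowns the old dropdown content is deleted as intended, but the new content is never added. So this is a partial solution. How do I get the rest of the "_marker" dropdowns to update correctly?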

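My code is at the bottom. Below is a sample of the XML extracted from the Word document. I've added hash comments at some of the places I do and do not want changed: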

<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh wp14">
  <w:body>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="test_drop_down"/>
        <w:tag w:val="test_drop_down"/>
        <w:id w:val="-387181634"/>
        <w:placeholder>
          <w:docPart w:val="F84271CC265B4C44BFD0FEF4977C3363"/>
        </w:placeholder>
        <w:showingPlcHdr/>
        <w:dropDownList> ## Here is a dropdown tag without '_marker', so no changes necessary to the nearby dropdown
          <w:listItem w:value="Choose an item."/>
          <w:listItem w:displayText="1" w:value="1"/>
          <w:listItem w:displayText="2" w:value="2"/>
          <w:listItem w:displayText="3" w:value="3"/>
          <w:listItem w:displayText="4" w:value="4"/>
          <w:listItem w:displayText="5" w:value="5"/>
        </w:dropDownList>
      </w:sdtPr>
      <w:sdtEndPr/>
      <w:sdtContent>
        <w:p w14:paraId="3E667F97" w14:textId="40FCA028" w:rsidR="00DD3971" w:rsidRDefault="00DD3971">
          <w:r w:rsidRPr="00547EE5">
            <w:rPr>
              <w:rStyle w:val="PlaceholderText"/>
            </w:rPr>
            <w:t>Choose an item.</w:t>
          </w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="test_drop_down_marker"/>
        <w:tag w:val="test_drop_down_marker"/>
        <w:id w:val="1273827251"/>
        <w:placeholder>
          <w:docPart w:val="7B1BB8A989B0431A9916F716166BD235"/>
        </w:placeholder>
        <w:showingPlcHdr/>
        <w:dropDownList> ## Here is a dropdown tag with "_marker", so I'd like to change the next list of dropdowns a few lines below
          <w:listItem w:value="Choose an item."/>
          <w:listItem w:displayText="1" w:value="1"/>
          <w:listItem w:displayText="2" w:value="2"/>
          <w:listItem w:displayText="3" w:value="3"/>
          <w:listItem w:displayText="4" w:value="4"/>
          <w:listItem w:displayText="5" w:value="5"/>
        </w:dropDownList>
      </w:sdtPr>
      <w:sdtEndPr/>
      <w:sdtContent>
        <w:p w14:paraId="491EDFBD" w14:textId="77777777" w:rsidR="000C66BB" w:rsidRDefault="000C66BB" w:rsidP="000C66BB">
          <w:r w:rsidRPr="00547EE5">
            <w:rPr>
              <w:rStyle w:val="PlaceholderText"/>
            </w:rPr>
            <w:t>Choose an item.</w:t>
          </w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="test_drop_down_marker"/>
        <w:tag w:val="test_drop_down_marker"/>
        <w:id w:val="1947423421"/>
        <w:placeholder>
          <w:docPart w:val="CAC6472BBA50466E8618F7940D1F2320"/>
        </w:placeholder>
        <w:showingPlcHdr/>
        <w:dropDownList>
          <w:listItem w:value="Choose an item."/>
          <w:listItem w:displayText="1" w:value="1"/>
          <w:listItem w:displayText="2" w:value="2"/>
          <w:listItem w:displayText="3" w:value="3"/>
          <w:listItem w:displayText="4" w:value="4"/>
          <w:listItem w:displayText="5" w:value="5"/>
        </w:dropDownList>
      </w:sdtPr>
      <w:sdtEndPr/>
      <w:sdtContent>
        <w:p w14:paraId="60B5F611" w14:textId="79AEBD34" w:rsidR="00F55481" w:rsidRDefault="000C66BB">
          <w:r w:rsidRPr="00547EE5">
            <w:rPr>
              <w:rStyle w:val="PlaceholderText"/>
            </w:rPr>
            <w:t>Choose an item.</w:t>
          </w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
    <w:sectPr w:rsidR="00F55481">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

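Here is my code, which gets part of the way there. It removes every unwanted dropdown list near a "_marker" tag, but it only adds newxml3 (which changes the dropdown options from 1-5 to 0-4) to the last dropdown that meets the condition. I built the code from various attempts at using this solution (How to replace whole XML elements in a Word docx in Python as if they were strings) and this one (Appending an element after another element using lxml):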

ns = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

for node in xml_file.xpath('//w:tag[contains(@w:val, "_marker")]/parent::*', namespaces=ns):
    contentnav = node.xpath(".//w:dropDownList", namespaces=ns)[0]
    contentnav.addprevious(newxml3)
    node.remove(contentnav)

Here is the output I get, with hash comments marking where it worked and where it didn't:

<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh wp14">
  <w:body>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="test_drop_down"/>
        <w:tag w:val="test_drop_down"/>
        <w:id w:val="-387181634"/>
        <w:placeholder>
          <w:docPart w:val="F84271CC265B4C44BFD0FEF4977C3363"/>
        </w:placeholder>
        <w:showingPlcHdr/>
        <w:dropDownList> ## This dropdown list is untouched. Good! the tag does not have "_marker" so that part is working.
          <w:listItem w:value="Choose an item."/>
          <w:listItem w:displayText="1" w:value="1"/>
          <w:listItem w:displayText="2" w:value="2"/>
          <w:listItem w:displayText="3" w:value="3"/>
          <w:listItem w:displayText="4" w:value="4"/>
          <w:listItem w:displayText="5" w:value="5"/>
        </w:dropDownList>
      </w:sdtPr>
      <w:sdtEndPr/>
      <w:sdtContent>
        <w:p w14:paraId="3E667F97" w14:textId="40FCA028" w:rsidR="00DD3971" w:rsidRDefault="00DD3971">
          <w:r w:rsidRPr="00547EE5">
            <w:rPr>
              <w:rStyle w:val="PlaceholderText"/>
            </w:rPr>
            <w:t>Choose an item.</w:t>
          </w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="test_drop_down_marker"/>
        <w:tag w:val="test_drop_down_marker"/>
        <w:id w:val="1273827251"/>
        <w:placeholder>
          <w:docPart w:val="7B1BB8A989B0431A9916F716166BD235"/>
        </w:placeholder>
        <w:showingPlcHdr/> ## The old dropDownList node was here but is gone now, which is correct since the nearby tag is "_marker" but the new dropDownList was not added. Only part of the code worked.
      </w:sdtPr>
      <w:sdtEndPr/>
      <w:sdtContent>
        <w:p w14:paraId="491EDFBD" w14:textId="77777777" w:rsidR="000C66BB" w:rsidRDefault="000C66BB" w:rsidP="000C66BB">
          <w:r w:rsidRPr="00547EE5">
            <w:rPr>
              <w:rStyle w:val="PlaceholderText"/>
            </w:rPr>
            <w:t>Choose an item.</w:t>
          </w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="test_drop_down_marker"/>
        <w:tag w:val="test_drop_down_marker"/>
        <w:id w:val="1947423421"/>
        <w:placeholder>
          <w:docPart w:val="CAC6472BBA50466E8618F7940D1F2320"/>
        </w:placeholder>
        <w:showingPlcHdr/>
        <w:dropDownList> ## The old dropDownList node is gone here AND the new dropDownList is added. So the code worked perfectly here.
          <w:listItem w:value="Choose an item."/>
          <w:listItem w:displayText="NA" w:value="NA"/>
          <w:listItem w:displayText="0" w:value="0"/>
          <w:listItem w:displayText="1" w:value="1"/>
          <w:listItem w:displayText="2" w:value="2"/>
          <w:listItem w:displayText="3" w:value="3"/>
          <w:listItem w:displayText="4" w:value="4"/>
        </w:dropDownList>
      </w:sdtPr>
      <w:sdtEndPr/>
      <w:sdtContent>
        <w:p w14:paraId="60B5F611" w14:textId="79AEBD34" w:rsidR="00F55481" w:rsidRDefault="000C66BB">
          <w:r w:rsidRPr="00547EE5">
            <w:rPr>
              <w:rStyle w:val="PlaceholderText"/>
            </w:rPr>
            <w:t>Choose an item.</w:t>
          </w:r>
        </w:p>
      </w:sdtContent>
    </w:sdt>
    <w:sectPr w:rsidR="00F55481">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
python xml xpath lxml

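Comments

1 upvote | Hermann12 | 9/27/2023
This XML is missing its namespace declarations. Please share the root tag together with the namespace definitions.
0 upvotes | DOR | 9/27/2023
I've just updated the XML text with the namespace definitions. They are the standard definitions for a Word document. Would you like me to remove the ## hash comments from the XML? I added them to show where I want changes made, and where the code I developed does and doesn't work.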

A:

0 upvotes | Hermann12 | 9/27/2023 | #1

Search for the dropdown lists and remove them:

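import xml.etree.ElementTree as ET

# Note: 'word.docx' must contain the XML shown in the question (i.e. the
# extracted document part). A real .docx is a zip archive and cannot be
# parsed directly with ET.parse.
tree = ET.parse('word.docx')
root = tree.getroot()

# Collect the prefix -> URI namespace map declared in the file
ns = dict(node for _, node in ET.iterparse('word.docx', events=['start-ns']))

# Register the prefixes so the output keeps w:, w14:, ... instead of ns0:, ns1:, ...
for prefix, uri in ns.items():
    ET.register_namespace(prefix, uri)

# Fully qualified name of the w:val attribute
w_val = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'

for sdt_pr in root.findall('.//w:sdtPr', namespaces=ns):
    tag = sdt_pr.find('w:tag', namespaces=ns)
    # Only touch controls whose tag is the marker one
    if tag is not None and tag.get(w_val) == 'test_drop_down_marker':
        ddl = sdt_pr.find('.//w:dropDownList', namespaces=ns)
        if ddl is not None:
            sdt_pr.remove(ddl)

ET.dump(root)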
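
The snippet above only removes the list. To also insert the new options, as the question asks, note that in lxml an element has exactly one parent: calling addprevious() with the same newxml3 object on every iteration moves that one element from place to place, so it ends up only at the last match. That is consistent with the output shown in the question. A minimal sketch of one way around this is to insert a deepcopy of the template each time. The helper names (qn, build_dropdown, replace_marker_dropdowns) and the option values are assumptions for illustration, not part of the original code, and the sketch falls back to the standard library when lxml is not installed:

```python
from copy import deepcopy

try:
    from lxml import etree as ET  # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback; same calls used below

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def qn(local):
    """Return the fully qualified w: name for a local name (hypothetical helper)."""
    return f"{{{W}}}{local}"


def build_dropdown(values):
    """Build a w:dropDownList template: a placeholder item plus one item per value."""
    ddl = ET.Element(qn("dropDownList"))
    ET.SubElement(ddl, qn("listItem")).set(qn("value"), "Choose an item.")
    for value in values:
        item = ET.SubElement(ddl, qn("listItem"))
        item.set(qn("displayText"), value)
        item.set(qn("value"), value)
    return ddl


def replace_marker_dropdowns(root, template, suffix="_marker"):
    """Replace the dropDownList of every sdtPr whose w:tag value ends with suffix.

    Returns the number of lists replaced.
    """
    replaced = 0
    # Materialize the matches first so the tree is not modified while iterating
    for sdt_pr in list(root.iter(qn("sdtPr"))):
        tag = sdt_pr.find("w:tag", NS)
        if tag is None or not tag.get(qn("val"), "").endswith(suffix):
            continue
        for index, child in enumerate(list(sdt_pr)):
            if child.tag == qn("dropDownList"):
                fresh = deepcopy(template)  # a new copy for every insertion
                fresh.tail = child.tail     # keep the original indentation
                sdt_pr.remove(child)
                sdt_pr.insert(index, fresh)
                replaced += 1
    return replaced
```

Assuming root is the parsed w:document element, a call such as replace_marker_dropdowns(root, build_dropdown(["NA", "0", "1", "2", "3", "4"])) should leave the non-marker dropdown untouched and give each "_marker" dropdown its own copy of the new list. The suffix check is done in Python because XPath 1.0 has no ends-with() function.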