提问人:E-Flo 提问时间:9/6/2023 更新时间:9/6/2023 访问量:51
添加/替换 XML 标记的 Python 脚本
Python script that adds/replaces XML tags
问:
我有这个 Python 脚本,它应该在 XML 文档中查找现有标签,并用新的、更具描述性的标签替换它们。问题是,在我运行脚本后,它似乎只捕获我输入的文本字符串的每几个实例。我确信它为什么会这样,这背后有一些原因,但我似乎无法弄清楚。
import xml.etree.ElementTree as ET
from lxml import etree
def replace_specific_line_tags(input_file, output_file, replacements):
# Parse the XML file using lxml
tree = etree.parse(input_file)
root = tree.getroot()
for target_text, replacement_tag in replacements:
# Find all <line> tags with the specific target text under <content> and replace them with the new tag
for line_tag in root.xpath('.//content/page/line[contains(., "{}")]'.format(target_text)):
parent = line_tag.getparent()
# Create the new tag with the desired tag name
new_tag = etree.Element(replacement_tag)
# Copy the attributes of the original <line> tag to the new tag
for attr, value in line_tag.attrib.items():
new_tag.set(attr, value)
# Copy the text of the original <line> tag to the new tag
new_tag.text = line_tag.text
# Replace the original <line> tag with the new tag
parent.replace(line_tag, new_tag)
# Write the updated XML back to the file
with open(output_file, 'wb') as f:
tree.write(f, encoding='utf-8', xml_declaration=True)
if __name__ == '__main__':
input_file_name = 'beforeTagEdits.xml'
output_file_name = 'afterTagEdits.xml'
# List of target texts and their corresponding replacement tags
replacements = [
('The Washington Post', 'title'),
# Add more target texts and their replacement tags as needed
]
replace_specific_line_tags(input_file_name, output_file_name, replacements)
由于代码正在工作,只是没有完全预期,我尝试更改一些文本字符串以匹配原始文件中已知的确切字符串,但这似乎并不能解决问题。下面是当前 XML 文档的示例:
<root>
<content>
<line>The Washington Post</line>
<line>The Washington Post</line>
</content>
</root>
答:
0赞
Hermann12
9/6/2023
#1
您可以在找到搜索到的文本后对树进行 iter() 并重命名标签:
import xml.etree.ElementTree as ET
xml= """<root>
<content>
<line>The Washington Post</line>
<line>The Washington Post</line>
<tag>Der Spiegel</tag>
</content>
</root>"""
root = ET.fromstring(xml)
pattern ={'title':['The Washington Post', 'Der Spiegel']}
for k, v in pattern.items():
for elem in root.iter():
if elem.text in v:
elem.tag = k
ET.dump(root)
输出:
<root>
<content>
<title>The Washington Post</title>
<title>The Washington Post</title>
<title>Der Spiegel</title>
</content>
</root>
评论
beautifulsoup
content/page/line
content/line