Asked by: Suwandi Cahyadi · Asked: 3/20/2021 · Last edited by: Suwandi Cahyadi · Updated: 3/21/2021 · Views: 291
Linux script to extract XML parent tags having certain values in one of its child tag
Q:
I have an XML file in the following format:
<SaleEvent>
...
<Extention>
<ReceiptId>111</ReceiptId>
...
</Extention>
...
</SaleEvent>
<SaleEvent>
...
<Extention>
<ReceiptId>123</ReceiptId>
...
</Extention>
...
</SaleEvent>
<SaleEvent>
...
<Extention>
<ReceiptId>456</ReceiptId>
...
</Extention>
...
</SaleEvent>
<RefundEvent>
...
<Extention>
<ReceiptId>789</ReceiptId>
...
</Extention>
...
</RefundEvent>
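I want to extract every whole SaleEvent/RefundEvent element (including its children) whose ReceiptId is 123 or 789. ReceiptId is one of its child tags, and I have a list of the ReceiptIds to extract.
I tried the following awk command: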
for k in `grep -v -F -x -f $file1.csv $file2.csv`
do
awk -v pattern="$k" 'BEGIN {RS="<SaleEvent" ;FS="<"} $0 ~ pattern && ($NF == "/SaleEvent>") {print RS $0}' $input.xml >> $output.xml
done
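The command gets the list of ReceiptIds that are in file2.csv but not in file1.csv and loops over them as k. For each k, it tries to extract the event with that ReceiptId from $input.xml into $output.xml. It still fails for some receipts, and I don't know why.
Is something missing from the command that leaves some receipts unextracted? Is there another command I could use for this?
The actual input XML file is minified, so everything is on one line. Something like: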
<SaleEvent><Extention><ReceiptId>111</ReceiptId></Extention></SaleEvent><SaleEvent><Extention><ReceiptId>123</ReceiptId></Extention></SaleEvent> ...
The expected output is (it doesn't have to be pretty-printed; that is only for readability):
<SaleEvent>
...
<Extention>
<ReceiptId>123</ReceiptId>
...
</Extention>
...
</SaleEvent>
<RefundEvent>
...
<Extention>
<ReceiptId>789</ReceiptId>
...
</Extention>
...
</RefundEvent>
By the way, I am running the script in Cygwin on Windows.
Thanks. Regards
A:
I would go with Python, something like this (calling your document doc.xml). You need to install lxml.
import pathlib
from lxml import etree
# I put an <all></all> around the file so it has a single root
doc = etree.fromstring(b'<all>'+pathlib.Path("doc.xml").read_bytes()+b'</all>')
# then find things
parents=[]
for i in doc.xpath('//*[text()="123"]'):
    # ancestors run from the nearest parent up to <all>; [-2] is the top-level event
    parents.append(list(i.iterancestors())[-2])
print(parents)
When you want to write the "parents" back to a file, you can find help here (I did):
http://www.troubleshooters.com/codecorn/python/lxml.htm#writing_an_xhtml_to_epub_converter_program
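As a rough sketch of that write-back step, the following uses the standard library's xml.etree.ElementTree rather than lxml (a swapped-in choice so that nothing needs installing). The names extract_events, wanted_ids and sample are made up for illustration, and the input is assumed to be a well-formed sequence of SaleEvent/RefundEvent elements without a single root.

```python
import xml.etree.ElementTree as ET


def extract_events(xml_fragment, wanted_ids):
    """Return the serialized top-level events whose ReceiptId is in wanted_ids."""
    # The file has no single root element, so wrap it in a temporary one.
    root = ET.fromstring("<all>" + xml_fragment + "</all>")
    kept = []
    for event in root:  # direct children: SaleEvent / RefundEvent
        # Collect every ReceiptId found anywhere below this event.
        ids = {(node.text or "").strip() for node in event.iter("ReceiptId")}
        if ids & set(wanted_ids):
            event.tail = None  # drop any whitespace that followed the element
            kept.append(ET.tostring(event, encoding="unicode"))
    return kept


# Hypothetical sample data modelled on the minified input in the question.
sample = (
    "<SaleEvent><Extention><ReceiptId>111</ReceiptId></Extention></SaleEvent>"
    "<SaleEvent><Extention><ReceiptId>123</ReceiptId></Extention></SaleEvent>"
    "<RefundEvent><Extention><ReceiptId>789</ReceiptId></Extention></RefundEvent>"
)
result = extract_events(sample, {"123", "789"})
print("\n".join(result))
```

Writing the result to disk could then be a single pathlib.Path("output.xml").write_text("\n".join(result)) call, where output.xml is only a placeholder name.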
Please have a look at this solution (improved with help from Ed Morton, see the comments below), which works with any awk:
awk '/<(Sale|Refund)Event>/{f=1} f{i=i $0 ORS} /<\/(Sale|Refund)Event>/{if(i ~ /<ReceiptId>(789|123)<\//){printf "%s", i} i=f=""}' input.xml
Output
<SaleEvent>
...
<Extention>
<ReceiptId>123</ReceiptId>
...
</Extention>
...
</SaleEvent>
<RefundEvent>
...
<Extention>
<ReceiptId>789</ReceiptId>
...
</Extention>
...
</RefundEvent>
Explanation
awk '
/<(Sale|Refund)Event>/ {                  # when an opening event tag matches
    f=1                                   # set the flag f to true
}
f { i = i $0 ORS }                        # while f is true, append the line plus ORS to i
/<\/(Sale|Refund)Event>/ {                # when a closing event tag matches
    if (i ~ /<ReceiptId>(789|123)<\//) {  # if the ReceiptId satisfies the condition
        printf "%s", i                    # print the collected lines
    }
    i=f=""                                # reset both i and f
}' input.xml
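For readers less used to awk, the same flag-and-buffer logic can be sketched in Python. This is only an illustration of the idea, not part of the original answer; the names filter_events, collecting, buffer and pretty are made up, and the input is assumed to be pretty-printed with one tag per line.

```python
import re

START = re.compile(r"<(Sale|Refund)Event>")
END = re.compile(r"</(Sale|Refund)Event>")
WANTED = re.compile(r"<ReceiptId>(789|123)</")


def filter_events(lines):
    """Yield each buffered event block that contains a wanted ReceiptId."""
    collecting = False  # plays the role of f in the awk script
    buffer = []         # plays the role of i in the awk script
    for line in lines:
        if START.search(line):
            collecting = True
        if collecting:
            buffer.append(line)
        if END.search(line):
            block = "\n".join(buffer) + "\n"
            if WANTED.search(block):
                yield block
            # Reset the state for the next event.
            collecting = False
            buffer = []


# Hypothetical pretty-printed input with one unwanted and one wanted event.
pretty = """<SaleEvent>
<Extention>
<ReceiptId>111</ReceiptId>
</Extention>
</SaleEvent>
<RefundEvent>
<Extention>
<ReceiptId>789</ReceiptId>
</Extention>
</RefundEvent>"""
blocks = list(filter_events(pretty.splitlines()))
print("".join(blocks), end="")
```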
Update for minified XML and a dynamic variable (save the following as tst.awk):
BEGIN {
    rec_id = "<ReceiptId>"var"</"         # build the regexp from the variable, to match later
}
/<(Sale|Refund)Event>/ {                  # when an opening event tag matches
    f=1                                   # set the flag f to true
}
f { i = i $0 ORS }                        # while f is true, append the line plus ORS to i
/<\/(Sale|Refund)Event>/ {                # when a closing event tag matches
    if (match(i, rec_id)) {               # if rec_id matches
        printf "%s", i                    # print the collected lines
    }
    i=f=""                                # reset both i and f
}
To handle the minified XML, run it through tidy first:
tidy -iq -xml input.xml | awk -v var="$k" -f tst.awk
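If tidy is not available, one alternative (not from the original answer) is to match whole events with a non-greedy regular expression, which does not depend on line breaks. This is only a sketch: it assumes events are never nested inside one another and that the tags carry no attributes. The names EVENT, select_events and minified are made up for illustration.

```python
import re

# One whole event: an opening tag, the shortest possible body, and the
# closing tag of the same kind (\1 refers back to "Sale" or "Refund").
EVENT = re.compile(r"<(Sale|Refund)Event>.*?</\1Event>", re.DOTALL)


def select_events(text, receipt_id):
    """Return every event in text whose ReceiptId equals receipt_id exactly."""
    needle = "<ReceiptId>" + receipt_id + "</ReceiptId>"
    return [m.group(0) for m in EVENT.finditer(text) if needle in m.group(0)]


# Hypothetical sample data modelled on the minified input in the question.
minified = (
    "<SaleEvent><Extention><ReceiptId>111</ReceiptId></Extention></SaleEvent>"
    "<SaleEvent><Extention><ReceiptId>123</ReceiptId></Extention></SaleEvent>"
    "<RefundEvent><Extention><ReceiptId>789</ReceiptId></Extention></RefundEvent>"
)
for wanted in ("123", "789"):
    print("\n".join(select_events(minified, wanted)))
```

Comparing against the full <ReceiptId>...</ReceiptId> string, rather than a prefix, keeps an id such as 12 from selecting receipt 123.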
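Comments
<(SaleEvent|RefundEvent)> can be written as <(Sale|Refund)Event> (the same goes for the /<\/.../ closing-tag pattern), and more importantly /<ReceiptId>(789|123)/ needs to be anchored as /<ReceiptId>(789|123)</, otherwise it would falsely match <ReceiptId>123456.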
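The anchoring point from the comment is easy to check with a small experiment. The snippet below uses Python's re module purely as a stand-in for awk's regexp matching; the variable names are illustrative.

```python
import re

# Without the trailing "<", the pattern only checks how the id starts.
loose = re.compile(r"<ReceiptId>(789|123)")
# With the trailing "<", the id must end right where the closing tag begins.
anchored = re.compile(r"<ReceiptId>(789|123)<")

other_receipt = "<ReceiptId>123456</ReceiptId>"
wanted_receipt = "<ReceiptId>123</ReceiptId>"

loose_hits_other = loose.search(other_receipt) is not None
anchored_hits_other = anchored.search(other_receipt) is not None
anchored_hits_wanted = anchored.search(wanted_receipt) is not None
print(loose_hits_other, anchored_hits_other, anchored_hits_wanted)
# prints: True False True
```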
If your input always has exactly the structure you've shown, then using GNU awk for multi-char RS and RT:
$ awk -v RS='</(Sale|Refund)Event>[[:blank:]]*\n' -v ORS= '/<ReceiptId>(123|789)</{print $0 RT}' file
<SaleEvent>
...
<Extention>
<ReceiptId>123</ReceiptId>
...
</Extention>
...
</SaleEvent>
<RefundEvent>
...
<Extention>
<ReceiptId>789</ReceiptId>
...
</Extention>
...
</RefundEvent>