Asked by: Aiyu Sheng Asked: 11/19/2022 Updated: 11/19/2022 Views: 58
XML parsing with extra '\n' and whitespace using the lxml library
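Q:
I wrote a Python program with the lxml library that parses an XML file using XPath. The XPath and the value are both correct, but the returned text contains many '\n' characters and spaces, mirroring the layout of the XML file.
Here is my code: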
from lxml import etree
from xml.dom import minidom
#data = minidom.parse('D:/LocalSpark/bitmap.xml')
sigxml = etree.parse('D:/LocalSpark/bitmap.xml',etree.XMLParser(remove_blank_text=True, load_dtd=True))
xpath = '/OneMessage[@Name="NR RRCReconfiguration"]/BalongMessage/Content/L3MessageContent/DL-DCCH-Message/message/c1/rrcReconfiguration/criticalExtensions/rrcReconfiguration/measConfig/measObjectToAddModList/MeasObjectToAddMod/measObject/measObjectNR/referenceSignalConfig/ssb-ConfigMobility/ssb-ToMeasure/setup/mediumBitmap'
info = 10000000
for node in sigxml.xpath(xpath):
    print('node: ', node)
    print('node.tag: ',node.tag)
    print('node.text:',node.text)
    print('node.item:',node.items())
    print('node.attrib:',node.attrib)
    if info == node.text:
        print("%s info do exist!"%info)
    else:
        print("%s info do not exist!!!"%info)
Here is the XML file:
<OneMessage Name="NR RRCReconfiguration" MsgTimeStamp="1668594368290"><BalongMessage><Header><usRsvd>4608</usRsvd><ucbMdmId>0</ucbMdmId><ucbMsgType>3</ucbMsgType><ucbRsvd>0</ucbRsvd><ulMsgClsID>26080000</ulMsgClsID><ullbTimeStamp>1853637.763054</ullbTimeStamp><ullbCpuTransID>38693</ullbCpuTransID><usSocpTransID>20388</usSocpTransID><ullLocalTime>133129368818699187</ullLocalTime><ulTransNo>6107</ulTransNo><ulSendPID>131072</ulSendPID><ulRecvPID>0</ulRecvPID><ulPrimID>00000003</ulPrimID><ucbOtaDirect>DL(1)</ucbOtaDirect><ucbPrintLevel>63</ucbPrintLevel><ulDataSize>56</ulDataSize></Header><Content><L3MessageContent><DL-DCCH-Message>
<message>
<c1>
<rrcReconfiguration>
<criticalExtensions>
<rrcReconfiguration>
<measConfig>
<measObjectToAddModList>
<MeasObjectToAddMod>
<measObject>
<measObjectNR>
<referenceSignalConfig>
<ssb-ConfigMobility>
<ssb-ToMeasure>
<setup>
<mediumBitmap>
10000000
</mediumBitmap>
</setup>
</ssb-ToMeasure>
</ssb-ConfigMobility>
</referenceSignalConfig>
</measObjectNR>
</measObject>
</MeasObjectToAddMod>
</measObjectToAddModList>
</measConfig>
</rrcReconfiguration>
</criticalExtensions>
</rrcReconfiguration>
</c1>
</message>
</DL-DCCH-Message>
</L3MessageContent></Content></BalongMessage></OneMessage>
The result is:
node: <Element mediumBitmap at 0x22e3c645f80>
node.tag: mediumBitmap
node.text:
10000000
node.item: []
node.attrib: {}
10000000 info do not exist!!!
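My problem is that the code clearly finds and reads the mediumBitmap element, but, just as it appears in the XML file, the value has a \n before and after it. So when the program continues, the text value of mediumBitmap comes back as
\n 10000000 \n
rather than just 10000000.
This is a standard XML file from the project, so I cannot edit it.
I tried passing remove_blank_text=True to the parser, and I also tried minidom. Both failed.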
A:
0 votes
ScottC
11/19/2022
#1
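There are many ways to strip whitespace and newlines; one simple technique is to remove them with a regular expression.
The key line is:
int(re.sub(r'\s+', '', node.text))
It finds every whitespace character in node.text, including newlines and carriage returns, and replaces it with the empty string ''. The result is then cast to int so that it can be compared with the info variable, which is also an int.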
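As a minimal sketch of that idea on a plain string, the snippet below compares a few ways to normalize the value. The name `raw` is only an illustrative stand-in for `node.text`, and the options are alternatives to consider rather than the answer's exact code. Note that `int()` already ignores surrounding whitespace on its own.

```python
import re

# Illustrative stand-in for node.text as read from <mediumBitmap>.
raw = "\n10000000\n"
info = 10000000

# Option 1: remove every whitespace character (spaces, tabs, newlines).
cleaned_regex = re.sub(r"\s+", "", raw)

# Option 2: trim leading and trailing whitespace only.
cleaned_strip = raw.strip()

# Option 3: int() tolerates surrounding whitespace, so no cleanup is needed
# when the goal is a numeric comparison.
value = int(raw)

print(repr(cleaned_regex), repr(cleaned_strip), value == info)
```

If the value must stay a string (for example, a bitmap whose leading zeros matter), comparing `raw.strip()` against a string is probably the safer choice than casting to `int`.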
The full code is as follows:
from lxml import etree
from xml.dom import minidom
import re
#data = minidom.parse('D:/LocalSpark/bitmap.xml')
sigxml = etree.parse('D:/LocalSpark/bitmap.xml',etree.XMLParser(remove_blank_text=True, load_dtd=True))
xpath = '/OneMessage[@Name="NR RRCReconfiguration"]/BalongMessage/Content/L3MessageContent/DL-DCCH-Message/message/c1/rrcReconfiguration/criticalExtensions/rrcReconfiguration/measConfig/measObjectToAddModList/MeasObjectToAddMod/measObject/measObjectNR/referenceSignalConfig/ssb-ConfigMobility/ssb-ToMeasure/setup/mediumBitmap'
info = 10000000
for node in sigxml.xpath(xpath):
    print('node: ', node)
    print('node.tag: ',node.tag)
    print('node.text:',node.text)
    print('node.item:',node.items())
    print('node.attrib:',node.attrib)
    # Strip all whitespace from the text, then compare as integers.
    if info == int(re.sub(r'\s+', '', node.text)):
        print("%s info do exist!"%info)
    else:
        print("%s info do not exist!!!"%info)
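For context on why `remove_blank_text=True` did not help in the question: that parser option discards text nodes that consist only of whitespace, so it leaves the padding around a real value untouched. The sketch below illustrates this on a trimmed-down, hypothetical stand-in for `bitmap.xml` (the short `/setup/mediumBitmap` path is an assumption made to keep the example small) and shows two ways to trim the value, `str.strip()` and the XPath `normalize-space()` function.

```python
from lxml import etree

# Trimmed-down, hypothetical stand-in for the project's bitmap.xml.
xml = b"""<setup>
  <mediumBitmap>
    10000000
  </mediumBitmap>
</setup>"""

parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser)

node = root.xpath("/setup/mediumBitmap")[0]

# remove_blank_text only drops whitespace-only text nodes, so the
# newlines and spaces around the value are still present here.
padded = node.text

# Option 1: trim in Python.
stripped = node.text.strip()

# Option 2: let XPath trim; normalize-space() returns a string, not an element.
normalized = root.xpath("normalize-space(/setup/mediumBitmap)")

info = 10000000
exists = int(stripped) == info
print(repr(padded), repr(stripped), repr(normalized), exists)
```

With the real file, the same `normalize-space(...)` wrapper could presumably be placed around the full XPath from the question, in which case `xpath()` returns a single string instead of a list of elements.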