Asked by: pyj Asked: 4/21/2022 Updated: 4/21/2022 Views: 186
Python XML finding the specific location of a tag
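Q:
I am currently parsing an XML file with lxml.etree in Python, and I am having trouble extracting the text around certain element tags.
Below is sample XML that illustrates my problem.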
<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>
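My problem is as follows:
When a title exists, I use the first P tag to capture the title of each body tag. The title is (in most cases) the first P tag immediately after the opening body tag (as in lines 1 and 4 of the sample). I do not have a fixed list of title names, which is why I capture titles this way.
The problem arises when a body has no title but does contain a P tag somewhere other than immediately after the opening body tag (as in lines 2 and 3 of the sample). The program then takes that first P tag and its text as the title. Here the P tag is not a title and should not be treated as one, but because it is, any text before that P tag is ignored and never written to the new text file.
To illustrate further, this is what currently gets written to the text file: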
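For context on why this happens, here is a minimal sketch. The original parsing code is not shown, so the `body.find("P")` lookup is an assumption about how the title is being located. In lxml, the text between an opening tag and its first child is stored in the parent's `.text`, and the text that follows a child's closing tag is stored in that child's `.tail`, so looking only at the first P and its tail silently drops whatever sits in `body.text`.

```python
from lxml import etree

# Second sample line from the question: text precedes the first P tag.
body = etree.fromstring(
    "<body> sample text sample text sample text sample text "
    "<P> not a title </P> This is sample text after a p tag "
    "that does not contain a title. </body>"
)

# Assumed lookup: find() returns the first P child wherever it sits in body.
first_p = body.find("P")

print(repr(first_p.text))   # text inside the P tag
print(repr(first_p.tail))   # text between </P> and the next tag
print(repr(body.text))      # text between <body> and the first child

# Non-whitespace text in front of the first P means it does not
# directly follow the opening body tag, so it is not a title.
p_follows_body = not (body.text or "").strip()
print(p_follows_body)
```

For this sample `p_follows_body` is False, which is the signal the title check needs.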
Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.
Desired output to the text file:
Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.
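Possible solution:
1. Is there a way to find the position of the first P tag? If the first P tag comes immediately after the opening body tag, I want to keep it. Any other P tags I want to strip while keeping their text, which I can do with the strip_tags() function in lxml.etree.
Any insight into this problem, or other possible solutions, would be much appreciated. Thanks in advance!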
A:
1 upvote
mbg131
4/21/2022
#1
I was able to identify the titles with BeautifulSoup and a regular expression.
from bs4 import BeautifulSoup
import re

markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""

# html.parser lowercases tag names, so <P> is serialized as <p>.
soup = BeautifulSoup(markup, 'html.parser')
bodies = soup.select('body')
for body in bodies:
    # A title exists only when <p> directly follows <body> (whitespace aside).
    has_title = re.match(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text)
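The same positional check can also be done with lxml alone, which is what the question was using. The following is a minimal sketch, not a definitive implementation. It assumes each body element is parsed separately (the four sample lines do not form a single well-formed document), and `format_body` is a hypothetical helper name. A P counts as a title only if it is the first child and `body.text` holds nothing but whitespace. Any remaining P tags are unwrapped with `etree.strip_tags`, which keeps their text.

```python
from lxml import etree


def format_body(body):
    """Return 'title : text' when the first P directly follows <body>, else the plain text."""
    title = None
    first = body[0] if len(body) else None
    # A title P must be the first child, with only whitespace before it.
    if first is not None and first.tag == "P" and not (body.text or "").strip():
        title = " ".join("".join(first.itertext()).split())
        tail = first.tail      # text after </P>; remove() would discard it
        body.remove(first)
        body.text = tail
    # Unwrap every remaining P tag but keep its text in place.
    etree.strip_tags(body, "P")
    text = " ".join("".join(body.itertext()).split())
    return f"{title} : {text}" if title is not None else text


samples = [
    "<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>",
    "<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>",
    "<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>",
    "<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>",
]

for line in samples:
    print(format_body(etree.fromstring(line)))
```

On the four sample lines this prints the desired output listed in the question. Whitespace is collapsed with `split()` and `join()`, so the spacing around the stripped tags does not leak into the result.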