Python XML: finding the specific location of a tag

Asked by: pyj Asked: 4/21/2022 Updated: 4/21/2022 Views: 186

Q:

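I am currently parsing an XML file with lxml.etree in Python, and I have run into a problem extracting the text inside element tags.

Below is sample markup that illustrates my current problem.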

<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>

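My problem is as follows:

When a title exists, I use the first P tag to capture the title of each body tag. The title is (in most cases) the first P tag immediately after the body tag (as in lines 1 and 4 of the sample). I don't have a fixed list of title names, which is why I use this approach to capture titles.

The problem arises when a body has no title but contains a P tag somewhere other than immediately after the body tag (as in lines 2 and 3 of the sample): the program takes the first P tag and its text as the title. In this case the P tag is not a title and should not be treated as one, but because it is, any text before that P tag is ignored and never written to the new text file.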

To illustrate further, here is what is currently written to the text file.

Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Desired output to the text file:

Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Possible solution:

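1. Is there a way to find the position of the first P tag? If the first P tag immediately follows the body tag, I want to keep it. Any other P tags I want to strip while keeping their text, which I can do with the strip_tags() function provided by lxml.etree.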
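The idea in item 1 could be sketched with lxml roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name `split_title` and the shortened two-body sample are hypothetical, and the bodies are wrapped in an assumed `<root>` element so they parse as a single XML document. It relies on lxml's text/tail model: `body.text` holds whatever precedes the first child, so the first P tag immediately follows the body tag only when that text is empty or whitespace.

```python
from lxml import etree


def split_title(body):
    """Return (title, text) for one <body>; title is None when there is no title."""
    # body.text is whatever precedes the first child element, so the first
    # <P> "immediately follows" <body> only when that text is empty or whitespace.
    has_title = (
        len(body) > 0
        and body[0].tag == "P"
        and not (body.text or "").strip()
    )
    title = None
    if has_title:
        first = body[0]
        title = " ".join("".join(first.itertext()).split())
        # remove() takes the element's tail text with it, so save the tail
        # and put it back as the body's leading text.
        tail = first.tail
        body.remove(first)
        body.text = tail
    # Unwrap every remaining <P>, keeping its text and tail in place.
    etree.strip_tags(body, "P")
    text = " ".join("".join(body.itertext()).split())
    return title, text


# Hypothetical, shortened version of the sample bodies, wrapped in one root.
sample = (
    "<root>"
    "<body> <P> Title 1 </P> Text after the title. </body>"
    "<body> leading text <P> not a title </P> trailing text. </body>"
    "</root>"
)

lines = []
for body in etree.fromstring(sample).iter("body"):
    title, text = split_title(body)
    lines.append(f"{title} : {text}" if title else text)

print("\n".join(lines))
```

Note that the function modifies the parsed tree in place; if the original tree is still needed, it may be safer to work on a copy (for example via `copy.deepcopy`).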

Any insight into this problem, or other possible solutions, would be greatly appreciated. Thanks in advance!

python xml xml-parsing strip-tags

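Comments

0 votes Nacho R 4/21/2022
I'm sure there is a function for this, but I don't know it! In any case, you could take the str of the body and compare it against the first <p> occurrence; if the body starts with the title, keep track of it.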

A:

1 vote mbg131 4/21/2022 #1

I was able to identify the titles using BeautifulSoup and a regular expression.

import re

from bs4 import BeautifulSoup


markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


# html.parser lowercases tag names, so <P> becomes <p> in the parsed tree
parsed = BeautifulSoup(markup, 'html.parser')

bodies = parsed.select('body')

for body in bodies:
    # A title exists only when <p> is the first thing inside <body>
    has_title = re.match(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text)
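The snippet above only prints the titles. To build the output lines the question asks for, the same regex check could be extended along these lines. This is a hedged sketch rather than part of the original answer: the helper name `format_body` and the shortened sample are hypothetical, and it assumes P tags carry no attributes, since the regex matches a bare `<p>` only.

```python
import re

from bs4 import BeautifulSoup


def format_body(body):
    """Build one output line for a parsed <body> tag."""
    if re.match(r"<body>\s*<p>", str(body)):
        # extract() detaches the title <p> from the tree and returns it.
        title = " ".join(body.p.extract().get_text().split())
        rest = " ".join(body.get_text().split())
        return f"{title} : {rest}"
    # No title: keep all text, including text inside any <p> tags.
    return " ".join(body.get_text().split())


# Hypothetical, shortened version of the sample bodies from the question.
sample = (
    "<body> <P> Title 1 </P> Text after the title. </body>\n"
    "<body> leading text <P> not a title </P> trailing text. </body>"
)

parsed = BeautifulSoup(sample, "html.parser")
lines = [format_body(body) for body in parsed.select("body")]
print("\n".join(lines))
```

Joining on `split()` collapses the extra whitespace around each tag, so the lines come out in the same single-spaced form as the desired output.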