Python XML: finding the specific location of a tag

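Question:

I am currently parsing an XML file with lxml.etree in Python, and I am having trouble extracting the text inside element tags.

Below is sample data that illustrates the problem.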

<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>

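My problem is as follows:

When a title is present, I use the first P tag to capture the title of each body tag. In most cases the title is the first P tag immediately after the opening body tag (lines 1 and 4 of the sample above). I have no fixed list of title names, which is why I capture titles this way.

The problem arises when a body has no title but contains a P tag somewhere that does not immediately follow the opening body tag (lines 2 and 3 of the sample). The program then takes that first P tag and its text as the title. In this case the P tag is not a title and should not be treated as one. Because it is treated as one, any text before that P tag is ignored and never written to the new text file.

To illustrate, here is what is currently written to the text file: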

Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Desired output in the text file:

Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

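Possible solution:

1. Is there a way to find the position of the first P tag? If the first P tag immediately follows the opening body tag, I want to keep it. For any other P tag, I want to strip the tag but keep its text. I could do the stripping with this function from lxml.etree: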

strip_tags()

Any insight into this problem, or other possible solutions, would be much appreciated. Thanks in advance!

Tags: python, xml, xml-parsing, strip-tags

Answer:

from bs4 import BeautifulSoup
import re


markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


# html.parser lowercases tag names, so <P> is serialized back as <p>
soup = BeautifulSoup(markup, 'html.parser')

bodies = soup.select('body')

for body in bodies:
    # A body has a title only when <p> directly follows <body>,
    # with nothing but whitespace in between
    has_title = re.match(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text)
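
The snippet above only prints the titles. As a rough sketch of how the full desired output could be produced with the element API the question already uses, the position check can be done on the tree itself: the first P is a title only if it is the first child of body and the text before it (body.text) is empty or whitespace. The helper names squeeze and format_body and the wrapping root element are illustrative assumptions, not part of the original post.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the standard library, which shares the
    # same .text / .tail model for the calls used below.
    import xml.etree.ElementTree as etree

markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


def squeeze(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def format_body(body):
    """Return 'title : text' when the first P opens the body, else plain text."""
    leading = body.text or ""
    has_title = len(body) > 0 and body[0].tag == "P" and not leading.strip()
    if not has_title:
        # No title: keep all the text and drop the P tags themselves.
        return squeeze("".join(body.itertext()))
    title_el = body[0]
    title = squeeze("".join(title_el.itertext()))
    # Text after the title: its tail plus the text and tail of later children.
    parts = [title_el.tail or ""]
    for child in body[1:]:
        parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return f"{title} : {squeeze(''.join(parts))}"


# The sample has several top-level elements, so wrap it in a single root.
root = etree.fromstring(f"<root>{markup}</root>")
lines = [format_body(body) for body in root.findall("body")]
print("\n".join(lines))
```

With lxml, calling etree.strip_tags(body, "P") and then reading body.text should flatten a body without a title in much the same way. The sketch uses itertext() instead so that the tree is left unmodified.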
