Get all text nodes between iterable nodes

Asked by: Minki Choe  Asked: 1/21/2022  Last edited by: Minki Choe  Updated: 1/22/2022  Views: 43

Q:

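How can I get the text nodes between iterable nodes using Python 3 and the lxml library?
I am trying to get all of the text that belongs to each <b> iteration.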

The result I want:

[
    ("A1", "Attr1: A1", "Attr2: B1", "Attr3: C1", "D1"),
    ("A2", "Attr1: A2", "Attr2: B2", "Attr3: C2", "D2"),
    ("A3", "Attr1: A3", "Attr2: B3", "Attr3: C3", "D3"),
]

Example HTML:

<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
...
</div>

The code I tried:

from lxml.html import fromstring

with open("filename.html", "r") as f:
    root = fromstring(f.read())
    heads = root.xpath("//b[a[starts-with(., 'A')]]")
    for head in heads:
        for text in head.xpath(
            "./following-sibling::text()[preceding-sibling::b[not(self)]]"
        ):
            print(text)

----
[stdout]

      Attr1: A1

      Attr2: B1

      Attr3: C1

      D1

      

      

      

      

      Attr1: A2

      Attr2: B2

      Attr3: C2

      D2

      

      

      

      

      Attr1: A3

      Attr2: B3

      Attr3: C3

      D3

      

    

      

      

      Attr1: A2

      Attr2: B2

      Attr3: C2

      D2

      

      

      

      

      Attr1: A3

      Attr2: B3

      Attr3: C3

      D3

      

    

      

      

      Attr1: A3

      Attr2: B3

      Attr3: C3

      D3

Edit: I don't think newline characters can serve as a parsing identifier in the real HTML source.

python html xpath xml parsing lxml


A:

0 votes HedgeHog 1/21/2022 #1

With BeautifulSoup, you can select every <b> that contains an <a> and iterate over each of its next_siblings until the next <b> is reached. To get rid of empty strings, simply use filter():

data = []

for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))

from bs4 import BeautifulSoup

html = '''
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
  <b><a href="">A3</a></b>
  <br/>
  <br/>
  Attr1: A3<br/>
  Attr2: B3<br/>
  Attr3: C3<br/>
  D3<br/>
  <br/><br/><br/>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

data = []

for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))

data

Output

[('A1', 'Attr1: A1', 'Attr2: B1', 'Attr3: C1', 'D1'),
 ('A2', 'Attr1: A2', 'Attr2: B2', 'Attr3: C2', 'D2'),
 ('A3', 'Attr1: A3', 'Attr2: B3', 'Attr3: C3', 'D3')]
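
The same sibling walk can also be sketched with lxml alone, which is the library named in the question. This is only a minimal sketch under a few assumptions: the function name collect_records is made up for illustration, the sample is shortened to two records, and every record is assumed to start with a <b> that wraps an <a>. It relies on lxml storing the text that follows an element in that element's tail, so it does not depend on newlines in the source.

```python
from lxml.html import fromstring

# Shortened copy of the sample HTML from the question (two records only).
html = """
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
</div>
"""


def collect_records(root):
    """Hypothetical helper: one tuple of non-empty strings per <b><a> heading."""
    records = []
    for head in root.xpath("//b[a]"):
        # The heading text, then the text directly after the <b> (its tail).
        parts = [head.text_content(), head.tail]
        for sibling in head.itersiblings():
            if sibling.tag == "b":
                break  # the next heading starts a new record
            # Text that follows a <br/> is stored in that element's tail.
            parts.append(sibling.tail)
        cleaned = [p.strip() for p in parts if p and p.strip()]
        records.append(tuple(cleaned))
    return records


records = collect_records(fromstring(html))
print(records)
```

Stopping at the next <b> sibling is what keeps each record separate, mirroring the break in the BeautifulSoup loop above.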

0 votes Granitosaurus 1/21/2022 #2

You can achieve this with XPath's union feature:

//b/a/text() | //b/following-sibling::text()

This will output:

A1

  

  

  Attr1: A1

  Attr2: B1

  Attr3: C1

  D1

  

  
A2

  

  

  Attr1: A2

  Attr2: B2

  Attr3: C2

  D2

  

  
A3

  

  

  Attr1: A3

  Attr2: B3

  Attr3: C3

  D3

All you have to do is clean up the whitespace and reformat the output in your script.

See this live tester example.
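
The clean-up step could look roughly like the sketch below, which runs the union expression with lxml and regroups the flat result into one tuple per heading. It is an illustration under stated assumptions rather than a definitive implementation: group_by_heading is a hypothetical name, the sample is shortened to two records, and a heading is recognised as a string that is the direct text of an <a> element, using the is_text attribute and getparent() method that lxml attaches to XPath string results.

```python
from lxml.html import fromstring

# Shortened copy of the sample HTML from the question (two records only).
html = """
<div>
  <b><a href="">A1</a></b>
  <br/>
  <br/>
  Attr1: A1<br/>
  Attr2: B1<br/>
  Attr3: C1<br/>
  D1<br/>
  <br/><br/><br/>
  <b><a href="">A2</a></b>
  <br/>
  <br/>
  Attr1: A2<br/>
  Attr2: B2<br/>
  Attr3: C2<br/>
  D2<br/>
  <br/><br/><br/>
</div>
"""


def group_by_heading(root):
    """Hypothetical helper: regroup the flat union result by heading."""
    nodes = root.xpath("//b/a/text() | //b/following-sibling::text()")
    groups = []
    for node in nodes:
        value = node.strip()
        if not value:
            continue  # skip whitespace-only text nodes
        if node.is_text and node.getparent().tag == "a":
            groups.append([value])  # heading text opens a new group
        elif groups:
            groups[-1].append(value)  # any other text joins the current group
    return [tuple(group) for group in groups]


result = group_by_heading(fromstring(html))
print(result)
```

The union returns nodes in document order without duplicates, so a single pass is enough to split the flat list at each heading.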