Asked by: Minki Choe  Asked: 1/21/2022  Last edited by: Minki Choe  Updated: 1/22/2022  Views: 43
Get all text nodes between iterable nodes
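Q:
How can I get the text nodes between iterable nodes using python3 and the lxml library?
I'm trying to get all the text that belongs to each <b> iteration.
The result I want: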
[
("A1", "Attr1: A1", "Attr2: B1", "Attr3: C1", "D1"),
("A2", "Attr1: A2", "Attr2: B2", "Attr3: C2", "D2"),
("A3", "Attr1: A3", "Attr2: B3", "Attr3: C3", "D3"),
]
HTML example:
<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
Attr3: C1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
Attr3: C2<br/>
D2<br/>
<br/><br/><br/>
<b><a href="">A3</a></b>
<br/>
<br/>
Attr1: A3<br/>
Attr2: B3<br/>
Attr3: C3<br/>
D3<br/>
<br/><br/><br/>
...
</div>
The code I tried:
from lxml.html import fromstring

with open("filename.html", "r") as f:
    root = fromstring(f.read())

heads = root.xpath("//b[a[starts-with(., 'A')]]")
for head in heads:
    # Intended: only the text nodes between this <b> and the next one.
    for text in head.xpath(
        "./following-sibling::text()[preceding-sibling::b[not(self)]]"
    ):
        print(text)
----
[stdout]
Attr1: A1
Attr2: B1
Attr3: C1
D1
Attr1: A2
Attr2: B2
Attr3: C2
D2
Attr1: A3
Attr2: B3
Attr3: C3
D3
Attr1: A2
Attr2: B2
Attr3: C2
D2
Attr1: A3
Attr2: B3
Attr3: C3
D3
Attr1: A3
Attr2: B3
Attr3: C3
D3
Edit: I don't think line breaks can serve as a parsing delimiter in the real HTML source.
Answers:
0 votes
HedgeHog
1/21/2022
#1
With BeautifulSoup you can select every <b> that contains an <a> and iterate over each of its next_siblings until the next <b> is reached. To get rid of the empty strings, simply use filter():
# `soup` is the BeautifulSoup object built from the page (see the full example).
data = []
for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))
Example
from bs4 import BeautifulSoup
html = '''
<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
Attr3: C1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
Attr3: C2<br/>
D2<br/>
<br/><br/><br/>
<b><a href="">A3</a></b>
<br/>
<br/>
Attr1: A3<br/>
Attr2: B3<br/>
Attr3: C3<br/>
D3<br/>
<br/><br/><br/>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

data = []
for tag in soup.select('b:has(a)'):
    str_list = [tag.text]
    for e in tag.next_siblings:
        if e.name != 'b':
            str_list.append(e.text.strip())
        else:
            break
    data.append(tuple(filter(None, str_list)))

data
Output
[('A1', 'Attr1: A1', 'Attr2: B1', 'Attr3: C1', 'D1'),
('A2', 'Attr1: A2', 'Attr2: B2', 'Attr3: C2', 'D2'),
('A3', 'Attr1: A3', 'Attr2: B3', 'Attr3: C3', 'D3')]
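Since the question asks about lxml specifically, the same sibling walk can probably be done without BeautifulSoup. The following is only a minimal sketch under a few assumptions: the helper name collect_groups and the shortened HTML sample are made up for illustration, and it assumes every <b> heading is a direct sibling of the text it introduces, as in the sample. In lxml, text that follows an element is stored in that element's tail, so the loop reads the tail of each sibling until the next <b> appears.

```python
from lxml.html import fromstring

# Shortened copy of the sample HTML from the question (two groups only).
HTML = """
<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
Attr2: B1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
Attr2: B2<br/>
D2<br/>
<br/><br/><br/>
</div>
"""


def collect_groups(root):
    """Return one tuple per <b><a>..</a></b> heading: the heading text,
    then the non-empty text nodes that follow it up to the next <b>."""
    groups = []
    for head in root.xpath(".//b[a]"):
        texts = [head.text_content().strip()]
        # Text directly after </b> is stored as the tail of <b> itself.
        if head.tail and head.tail.strip():
            texts.append(head.tail.strip())
        for sibling in head.itersiblings():
            if sibling.tag == "b":
                break  # the next heading starts a new group
            # Text after a <br/> is stored as that element's tail.
            if sibling.tail and sibling.tail.strip():
                texts.append(sibling.tail.strip())
        groups.append(tuple(texts))
    return groups


root = fromstring(HTML)
print(collect_groups(root))
```

Stopping at the next <b> is what prevents the repeated groups seen in the question's output, where each heading collected every text node after it.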
0 votes
Granitosaurus
1/21/2022
#2
You can achieve this with XPath's union operator (|):
//b/a/text() | //b/following-sibling::text()
This will output:
A1
Attr1: A1
Attr2: B1
Attr3: C1
D1
A2
Attr1: A2
Attr2: B2
Attr3: C2
D2
A3
Attr1: A3
Attr2: B3
Attr3: C3
D3
All that's left is to clean up the whitespace and reformat the output in your script.
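One possible way to do that clean-up with lxml is sketched below. It relies on lxml returning XPath text results as "smart strings" that expose getparent() and is_text, so a text node that is the text of an <a> can be treated as the start of a new group. The function name group_union_result and the shortened HTML sample are illustrative assumptions, not part of the answer above.

```python
from lxml.html import fromstring

# Shortened copy of the sample HTML from the question (two groups only).
HTML = """
<div>
<b><a href="">A1</a></b>
<br/>
<br/>
Attr1: A1<br/>
D1<br/>
<br/><br/><br/>
<b><a href="">A2</a></b>
<br/>
<br/>
Attr1: A2<br/>
D2<br/>
<br/><br/><br/>
</div>
"""


def group_union_result(root):
    """Split the flat result of the union XPath into one tuple per heading."""
    nodes = root.xpath("//b/a/text() | //b/following-sibling::text()")
    groups = []
    for node in nodes:
        text = node.strip()
        if not text:
            continue  # skip whitespace-only text nodes
        # A text node that is the .text of an <a> marks a new heading.
        if node.is_text and node.getparent().tag == "a":
            groups.append([text])
        elif groups:
            groups[-1].append(text)
    return [tuple(group) for group in groups]


root = fromstring(HTML)
print(group_union_result(root))
```

This depends on the union result coming back in document order, which is how the output listed above is arranged.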