Asked by: Vanzy M Asked: 1/19/2023 Last edited by: Vanzy M Updated: 1/19/2023 Views: 36
Using BeautifulSoup to parse html, I am getting unwanted prints. Why is that?
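Q:
I am using Beautiful Soup to parse an HTML document in a Jupyter Notebook. Below is a sample from the file. Note that the same HTML sample is repeated many times. The table tags below are siblings of one another and are surrounded by other tags.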
<table class="tableBorder" width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td colspan="2" width="100%" valign="top" bgcolor="#f0f0f0">
<h3 class="formtitle"> Title <a href="somelink">Title</a>
<span class="subText"> Date: 21/Dec/22 </span>
</h3>
</td>
</tr>
<tr>
<td width="20%"><b>Status</b></td>
<td width="80%">shipping</td>
</tr>
</tbody>
</table>
<table class="grid" width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data</b></td>
<td width="30%" valign="top" bgcolor="#ffffff"> some data </td>
<td bgcolor="#f0f0f0"> <b>some data:</b>some data</td>
<td valign="top" nowrap="" bgcolor="#ffffff">vsome data </td>
</tr>
<tr>
<td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data:</b> </td>
</tr>
</tbody>
</table>
<table class="grid" width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="20%" valign="top" bgcolor="#f0f0f0">
<b>Sections</b>
</td>
<td class="noPadding" valign="top" bgcolor="#ffffff">
<table class="blank" width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td colspan="4" bgcolor="#f0f0f0"> <b>Section 1</b> </td>
</tr>
<tr>
<td> Test 1 </td>
<td> <a href="somelink"> Test 1 Code </a> </td>
<td> Test 1 Description </td>
<td> Test 1 Extended Description </td>
</tr>
<tr>
<td colspan="4" bgcolor="#f0f0f0"> <b>Section 2</b> </td>
</tr>
<tr>
<td> Test 2 </td>
<td> <a href="somelink"> Test 2 Code </a> </td>
<td> Test 2 Description </td>
<td> Test 2 Extended Description </td>
</tr>
<tr>
<td> Test 3 </td>
<td> <a href="somelink"> Test 3 Code </a> </td>
<td> Test 3 Description </td>
<td> Test 3 Extended Description </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
I have the following Python code, which prints unwanted results (duplicates) when I run it. I'm not sure what I'm doing wrong.
mainHtml = soup.find_all('table', class_='tableBorder')
for main in mainHtml:
    print()
    print("URL : ", main.tbody.tr.td.h3.a["href"])
    print("Title : ", main.tbody.tr.td.h3.a.text)
    print("Status : ", main.tbody.select('tr')[1].select('td')[1].text)
    linked = main.find_next_sibling('table', class_='grid')
    if linked:
        linked = linked.find_next_sibling('table', class_='grid')
        if linked:
            rows = linked.find_all('tr')
            # Iterate through the rows and extract the information
            for row in rows:
                cells = row.find_all('td')
                if len(cells) >= 4:
                    # Extract the information from the cells
                    a = cells[0].text.strip()
                    b = cells[1].text.strip()
                    c = cells[2].text.strip()
                    d = cells[3].text.strip()
                    print(a, b, c, d)
The output, including the unwanted prints, is as follows:
Test 1
Test 1 Code
Test 1 Description
Test 1 Extended Description
Test 2
Test 2 Code
Test 2 Description
Test 2 Extended Description
Test 3
Test 3 Code
Test 3 Description
Test 3 Extended Description
Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
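Since I have a single print statement at the end, the only output I want is in the following format. I do get it, but only after the unwanted prints. What is the cause, and is there a way to fix it?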
Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
A:
0 votes
Andrej Kesely
1/19/2023
#1
My take on this problem is to search "backwards": find the table with the descriptions, then search backwards for the URL/title/status:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc contains your HTML snippet from the question
for table in soup.select('table:has(b:-soup-contains(Sections))'):
    url = table.find_previous('h3').a['href']
    title = table.find_previous('h3').a.text
    status = table.find_previous(lambda tag: tag.name=='b' and tag.text=='Status').find_next('td').text
    print(url)
    print(title)
    print(status)
    print()
    for row in table.select('tr:not(:has([colspan]))'):
        print(' '.join(td.text.strip() for td in row.select('td')))
Prints:
somelink
Title
shipping
Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
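As a side note on the unwanted prints in the question, a likely cause is that find_all searches all descendants by default. The "Sections" table wraps a nested table, so its outer row also passes the len(cells) >= 4 check, and the text of its second cell is the entire nested table. The sketch below uses a trimmed-down, hypothetical html_doc (not the asker's full file) and shows one possible fix: counting only the direct td children of each row with recursive=False.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the "Sections" table from the question.
html_doc = """
<table class="grid">
<tr>
<td><b>Sections</b></td>
<td>
<table class="blank">
<tr><td colspan="4"><b>Section 1</b></td></tr>
<tr><td>Test 1</td><td><a href="somelink">Test 1 Code</a></td><td>Test 1 Description</td><td>Test 1 Extended Description</td></tr>
</table>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
outer = soup.find("table", class_="grid")

# With the default recursive search, the outer row "owns" every nested cell too.
recursive_counts = [len(row.find_all("td")) for row in outer.find_all("tr")]

# Counting only direct children leaves the outer row with two cells, so it is skipped.
lines = []
for row in outer.find_all("tr"):
    cells = row.find_all("td", recursive=False)
    if len(cells) >= 4:
        lines.append(" ".join(cell.get_text(strip=True) for cell in cells))

print(recursive_counts)
print(lines)
```

The selector in the answer above, tr:not(:has([colspan])), avoids the same outer row in a different way, because that row contains a cell with a colspan attribute.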
Comments
0 votes
Andrej Kesely
1/19/2023
@VanzyM Maybe change the lambda to: lambda tag: tag.name=='b' and 'Status' in tag.text
0 votes
Vanzy M
1/19/2023
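Thank you, @Andrej. This is interesting. However, your solution also picks up data from other tables that have the same attributes. The table I'm interested in is the second one with the grid class. Can we identify it somehow?
0 votes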
Andrej Kesely
1/19/2023
@VanzyM To select the table that precedes the table with the descriptions, you can do: previous_table = table.find_previous('table')
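The two suggestions from these comments can be tried together in a small, self-contained sketch. The html_doc here is hypothetical and much shorter than the asker's file. The label is deliberately written as "Status:" on the assumption that the real markup differs slightly from the exact string "Status", which is presumably why the substring test was suggested.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup; note the label is "Status:" rather than "Status".
html_doc = """
<div>
<table class="tableBorder">
<tr><td><h3><a href="somelink">Title</a></h3></td></tr>
<tr><td><b>Status:</b></td><td>shipping</td></tr>
</table>
<table class="grid"><tr><td><b>some data</b></td><td>first grid</td></tr></table>
<table class="grid"><tr><td><b>Sections</b></td><td>second grid</td></tr></table>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.select_one("table:has(b:-soup-contains(Sections))")

# The exact comparison misses "Status:"; the substring test still matches it.
exact = table.find_previous(lambda tag: tag.name == "b" and tag.text == "Status")
loose = table.find_previous(lambda tag: tag.name == "b" and "Status" in tag.text)
status = loose.find_next("td").text

# The nearest table that starts before `table` in document order.
previous_table = table.find_previous("table")

print(exact)
print(status)
print(previous_table.get_text(" ", strip=True))
```

Because find_previous walks backwards through the document, previous_table is the grid table immediately before the "Sections" table here, not the tableBorder table.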