Using BeautifulSoup to parse HTML, I am getting unwanted prints. Why is that?

Asked by: Vanzy M | Asked: 1/19/2023 | Last edited by: Vanzy M | Updated: 1/19/2023 | Views: 36

Q:

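I am using Beautiful Soup to parse an HTML document in a Jupyter Notebook. Below is a sample from the file. Note that the same HTML sample repeats many times in the document. The table tags shown below are siblings of one another and are surrounded by other tags.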

<table class="tableBorder" width="100%" cellspacing="0" cellpadding="0" border="0">
   <tbody>
      <tr>
         <td colspan="2" width="100%" valign="top" bgcolor="#f0f0f0">
            <h3 class="formtitle"> Title <a href="somelink">Title</a>
               <span class="subText"> Date: 21/Dec/22 </span>
            </h3>
         </td>
      </tr>
      <tr>
         <td width="20%"><b>Status</b></td>
         <td width="80%">shipping</td>
      </tr>
   </tbody>
</table>
      
<table class="grid" width="100%" cellspacing="0" cellpadding="0" border="0">
   <tbody>
      <tr>
         <td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data</b></td>
         <td width="30%" valign="top" bgcolor="#ffffff"> some data </td>
         <td bgcolor="#f0f0f0"> <b>some data:</b>some data</td>
         <td valign="top" nowrap="" bgcolor="#ffffff">vsome data </td>
      </tr>
      <tr>
         <td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data:</b> </td>
      </tr>
   </tbody>
</table>

<table class="grid" width="100%" cellspacing="0" cellpadding="0" border="0">
   <tbody>
      <tr>
         <td width="20%" valign="top" bgcolor="#f0f0f0">
            <b>Sections</b>
         </td>
         <td class="noPadding" valign="top" bgcolor="#ffffff">
            <table class="blank" width="100%" cellspacing="0" cellpadding="0" border="0">
               <tbody>
                  <tr>
                     <td colspan="4" bgcolor="#f0f0f0"> <b>Section 1</b> </td>
                  </tr>
                  <tr>
                     <td> Test 1 </td>
                     <td> <a href="somelink"> Test 1 Code </a> </td>
                     <td> Test 1 Description </td>
                     <td> Test 1 Extended Description </td>
                  </tr>
                  <tr>
                     <td colspan="4" bgcolor="#f0f0f0"> <b>Section 2</b> </td>
                  </tr>
                  <tr>
                     <td> Test 2 </td>
                     <td> <a href="somelink"> Test 2 Code </a> </td>
                     <td> Test 2 Description </td>
                     <td> Test 2 Extended Description </td>
                  </tr>
                  <tr>
                     <td> Test 3 </td>
                     <td> <a href="somelink"> Test 3 Code </a> </td>
                     <td> Test 3 Description </td>
                     <td> Test 3 Extended Description </td>
                  </tr>
               </tbody>
            </table>
         </td>
      </tr>
   </tbody>
</table>

I have the following Python code, and when I run it, it prints unwanted results (duplicates). I am not sure what I am doing wrong.

mainHtml = soup.find_all('table', class_='tableBorder')

for main in mainHtml:

    print()
    print("URL : ", main.tbody.tr.td.h3.a["href"])
    print("Title : ", main.tbody.tr.td.h3.a.text)
    print("Status : ", main.tbody.select('tr')[1].select('td')[1].text)

    linked = main.find_next_sibling('table', class_='grid')
    if linked:
        linked = linked.find_next_sibling('table', class_='grid')

    if linked:
        rows = linked.find_all('tr')

        # Iterate through the rows and extract the information
        for row in rows:

            cells = row.find_all('td')

            if len(cells) >= 4:

                # Extract the information from the cells
                a = cells[0].text.strip()
                b = cells[1].text.strip()
                c = cells[2].text.strip()
                d = cells[3].text.strip()

                print(a, b, c, d)

The output, including the unwanted prints, looks like this:

Test 1 
Test 1 Code 
Test 1 Description 
Test 1 Extended Description

Test 2 
Test 2 Code 
Test 2 Description 
Test 2 Extended Description

Test 3 
Test 3 Code 
Test 3 Description 
Test 3 Extended Description

Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description

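Since I only have a single print statement at the end, I expect only the format below, yet it appears after the unwanted prints. What causes this, and is there a way to fix it?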

Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description

Tags: python parsing beautifulsoup html-parsing


A:

0 votes | Andrej Kesely | 1/19/2023 | #1

My take on this problem is to search "backwards": find the table with the descriptions, then search backwards from it for the URL/title/status:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc contains your HTML snippet from the question

for table in soup.select('table:has(b:-soup-contains(Sections))'):
    url = table.find_previous('h3').a['href']
    title = table.find_previous('h3').a.text
    status = table.find_previous(lambda tag: tag.name=='b' and tag.text=='Status').find_next('td').text

    print(url)
    print(title)
    print(status)

    print()

    for row in table.select('tr:not(:has([colspan]))'):
        print(' '.join(td.text.strip() for td in row.select('td')))

Prints:

somelink
Title
shipping

Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
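
As for why the original loop prints the extra block: this is a reading of the posted HTML, not something confirmed in the thread. find_all() searches recursively by default, so the outer row of the "Sections" table also matches and collects every nested td. Its second cell is the td with class noPadding, whose text is the whole inner table. The sketch below uses a trimmed-down stand-in for that table (the names html_doc, outer, inner and rows are illustrative) and limits the loop to the inner table and to each row's own cells:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the third table in the question (an assumption,
# not the full document).
html_doc = """
<table class="grid">
  <tbody>
    <tr>
      <td><b>Sections</b></td>
      <td class="noPadding">
        <table class="blank">
          <tbody>
            <tr><td colspan="4"><b>Section 1</b></td></tr>
            <tr>
              <td> Test 1 </td>
              <td> <a href="somelink"> Test 1 Code </a> </td>
              <td> Test 1 Description </td>
              <td> Test 1 Extended Description </td>
            </tr>
            <tr>
              <td> Test 2 </td>
              <td> <a href="somelink"> Test 2 Code </a> </td>
              <td> Test 2 Description </td>
              <td> Test 2 Extended Description </td>
            </tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
outer = soup.find("table", class_="grid")

# The first <tr> in document order is the outer row. A recursive search from
# it also picks up every <td> of the nested table.
outer_row = outer.find("tr")
print(len(outer_row.find_all("td")))                   # 11: nested cells included
print(len(outer_row.find_all("td", recursive=False)))  # 2: direct children only

# Loop over the inner table only, and look at each row's own cells.
inner = outer.find("table", class_="blank")
rows = []
for row in inner.find_all("tr"):
    cells = row.find_all("td", recursive=False)
    if len(cells) >= 4:
        rows.append(" ".join(cell.get_text(strip=True) for cell in cells))

print("\n".join(rows))
```

With recursive=False the outer row has only two cells, so the len(cells) >= 4 check skips it, and the section header rows (a single colspan cell) are skipped for the same reason.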

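Comments

0 votes | Andrej Kesely | 1/19/2023
@VanzyM Maybe change the lambda to: lambda tag: tag.name=='b' and 'Status' in tag.text
0 votes | Vanzy M | 1/19/2023
Thank you @Andrej. That's interesting. However, your solution also pulls data from other tables that have the same attributes. The table I am interested in is the second one with the grid class. Can we identify it somehow?
0 votes | Andrej Kesely | 1/19/2023
@VanzyM To select the table that precedes the one with the descriptions, you can do: previous_table = table.find_previous('table')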
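
A minimal sketch of that last suggestion, assuming the three-table layout from the question. The HTML below is a shortened stand-in and html_doc is an illustrative name. Starting from the table that contains "Sections", find_previous('table') walks backwards through the document and returns the nearest earlier table, which here is the first table with class grid:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's layout (an assumption): one
# tableBorder table followed by two sibling grid tables.
html_doc = """
<table class="tableBorder"><tbody>
  <tr><td><h3 class="formtitle">Title <a href="somelink">Title</a></h3></td></tr>
  <tr><td><b>Status</b></td><td>shipping</td></tr>
</tbody></table>
<table class="grid"><tbody>
  <tr><td><b>some data</b></td><td>some data</td></tr>
</tbody></table>
<table class="grid"><tbody>
  <tr>
    <td><b>Sections</b></td>
    <td class="noPadding"><table class="blank"><tbody>
      <tr><td>Test 1</td><td>Test 1 Code</td></tr>
    </tbody></table></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Anchor on the table that contains the "Sections" label, as in the answer.
table = soup.select_one("table:has(b:-soup-contains(Sections))")

# Nearest table that starts earlier in the document.
previous_table = table.find_previous("table")

print(previous_table.get("class"))
print(previous_table.find("b").get_text(strip=True))
```

If the real page has nested tables inside the preceding sibling, find_previous('table') may return one of those inner tables instead. In that case table.find_previous_sibling('table') might be the safer choice.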