Python BeautifulSoup: traverse to a table with particular text content in its innerHTML, then get the contents up to (but not including) a particular element

Asked by: schnydszch | Asked: 3/19/2022 | Last edited by: schnydszch | Updated: 3/21/2022 | Views: 185

Q:

I have an HTML file containing many tables to traverse, as shown below:

<html>
 .. omitted parts since I am interested on the HTML table..
 <table>
  <tbody>
   <tr>
    <td>
     <table>
      <tbody>
       <tr>
        <td class="labeltitle">
         <tbody>
          <tr>
           <td class="labeltitle">
            <font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
           </td>
           <td class="labelplain">&nbsp;&nbsp;&nbsp;</td>
          </tr>
         </tbody>
        </td>
       </tr>
      </tbody>
     </table>
    </td>
   </tr>
  </tbody>
 </table>
 <table>
  ... omitted just to show the td that I am interested to scrape ...
         <td class="labelplain">&nbsp;Senator(s)</td>
         <td class="labelplain">
          <table>
           <tbody>
            <tr> 
             <td class="labelplain">VILLAR JR., MANUEL B.<br></td>
            </tr>
           </tbody>
          </table>
         </td>
    ... 
 </table>
 <table>
    ... More tables like the table above (the one with VILLAR Jr.)
 </table>
 <table>
  <tbody>
   <tr> 
    <td class="labeltitle">
     <table>
      <tbody>
       <tr> 
        <td class="labeltitle">&nbsp;<font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
        <td class="labelplain">&nbsp;&nbsp;&nbsp;</td>
       </tr>
      </tbody>
     </table>
    </td>
   </tr>
  </tbody>
 </table>   
   
 ... more tables
 
</html>

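The table I want to start from is the one containing a td with class "labeltitle" whose child "font" element has the text "Floor Activity". For every table below it, I want to get the HTML code, up to (but not including) the table that has a td with class="labeltitle" whose child "font" element has the text "Vote(s)". I am trying to use xpath like this: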

    table = dom.xpath("//table[8]/tbody/tr/td")
    print (table)

But to no avail: I get an empty list. Any approach will do (with or without xpath).
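
Two general XPath rules may explain the empty result, and the sketch below illustrates them on a made-up, well-formed miniature of the page. First, `//table[8]` selects tables that are the eighth `table` child of their own parent, not the eighth table in the document; with nested tables that is rarely the same thing. Second, a path that names `tbody` matches nothing if the raw file has no `tbody` tags, and the BeautifulSoup output quoted later in this question shows `<table ...>` followed directly by `<tr>`. The sketch uses the standard library's `xml.etree.ElementTree` as a stand-in for the `dom` object above (it supports only a subset of XPath), and the `SNIPPET` markup is an assumption, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed miniature of the page: nested tables, no <tbody>.
SNIPPET = """
<html><body>
  <table><tr><td>
    <table><tr><td>Floor Activity</td></tr></table>
  </td></tr></table>
  <table><tr><td>Senator(s)</td></tr></table>
</body></html>
"""

root = ET.fromstring(SNIPPET)

# A path that names tbody finds nothing when the source has no tbody tags.
with_tbody = root.findall(".//table/tbody/tr/td")

# table[3] means "third table child of its parent". No element here has
# three table children, so this is empty even though three tables exist.
third_child_table = root.findall(".//table[3]")

# Counting tables in document order instead does reach the third table.
tables_in_order = list(root.iter("table"))
third_in_document = tables_in_order[2]

print(with_tbody)
print(third_child_table)
print(third_in_document[0][0].text)
```

In full XPath 1.0, the document-order form is written `(//table)[3]`. Selecting by content, for example by the `#jump_fa` anchor, is usually less brittle than counting tables.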

I also tried the following:

rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')

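With this I was able to reach the table whose content is "Floor Activity". However, the code above only gives me the contents of that particular parent table. This is the exact output I get: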

<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor 
                                Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%"> 
                                   </td>
</tr>
</table></td>
</tr>

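I am trying the approach from "Find next siblings until a certain one using beautifulsoup", because it seems to fit my use case. The problem is that I get the error "'NoneType' object has no attribute 'next_sibling'", which is to be expected since the update2 script does not include the other tables, so the update2 code is out of the equation.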
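
For reference, the sibling-walking idea can be sketched as follows. In BeautifulSoup, `find()` returns `None` when nothing matches, and chaining `.next_sibling` onto that `None` raises exactly this AttributeError, so the sketch checks each lookup before using it. The `SAMPLE` markup and the helper names `header_table` and `tables_between` are hypothetical. The sketch assumes that the "Floor Activity" header table, the data tables, and the "Vote(s)" header table are all siblings under one parent, and that the header table sits two `table` levels above the anchor, as in the `find_parent` attempt above. Neither assumption has been checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page; the real file may nest things differently.
SAMPLE = """
<html><body>
<table><tr><td class="labeltitle">
  <table><tr><td class="labeltitle"><font color="FFD700">Floor Activity<a name="#jump_fa"></a></font></td></tr></table>
</td></tr></table>
<table><tr><td class="labelplain">Senator(s)</td><td class="labelplain">
  <table><tr><td class="labelplain">VILLAR JR., MANUEL B.<br></td></tr></table>
</td></tr></table>
<table><tr><td class="labeltitle">
  <table><tr><td class="labeltitle"><font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td></tr></table>
</td></tr></table>
<table><tr><td>after the stop marker</td></tr></table>
</body></html>
"""


def header_table(anchor):
    """Climb two <table> levels from the anchor; return None if either is missing."""
    inner = anchor.find_parent("table")
    return inner.find_parent("table") if inner is not None else None


def tables_between(soup, start_name="#jump_fa", stop_name="#jump_vote"):
    """Collect sibling tables after the start header, stopping before the stop header."""
    start_anchor = soup.find("a", attrs={"name": start_name})
    if start_anchor is None:
        return []
    start = header_table(start_anchor)
    if start is None:
        return []
    collected = []
    for sibling in start.find_next_siblings("table"):
        # Stop at the first sibling table that contains the stop anchor.
        if sibling.find("a", attrs={"name": stop_name}) is not None:
            break
        collected.append(sibling)
    return collected


soup = BeautifulSoup(SAMPLE, "html.parser")
tables = tables_between(soup)
flooract = "".join(str(table) for table in tables)
print(flooract)
```

If the real tables turn out not to be siblings, the same stop condition could be applied while walking `find_all_next("table")` instead, at the cost of also visiting nested tables.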

My expected output is a JSON file (with special characters escaped), for example:

{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}

*where flooract is the HTML code of the tables with special characters escaped. Sample snippet:

<table>\n<tbody>\n<tr>\n<td class=\"labelplain\">&nbsp;Status Date<\/td><td class=\"labelplain\">&nbsp;10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\">&nbsp;Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
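
Rather than concatenating the JSON string by hand, the escaping can be left to the standard library's `json` module. The sketch below is a minimal illustration, not a definitive implementation: `build_record`, the title value, and the HTML fragment are hypothetical stand-ins for `var` and `flooract`. `json.dumps` escapes double quotes and newlines; it does not write `/` as `\/`, but both spellings decode to the same character, so the result is equivalent JSON.

```python
import json


def build_record(title, body_html):
    """Serialize the title and HTML body; json.dumps handles all escaping."""
    return json.dumps({"title": title, "body": body_html})


# Hypothetical stand-ins for the question's var and flooract values.
title = "example title"
flooract_html = (
    '<table>\n<tbody>\n<tr>\n'
    '<td class="labelplain">&nbsp;Status Date</td>'
    '<td class="labelplain">&nbsp;10/12/2005</td>\n'
    '</tr>\n</tbody>\n</table>'
)

record = build_record(title, flooract_html)
print(record)

# Decoding the record gives back the original, unescaped HTML.
round_trip = json.loads(record)
```

To write a file instead of a string, `json.dump(obj, file_handle)` takes the same dictionary.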

Link to a sample file: https://issuances-library.senate.gov.ph/54629.html I have attached images of the website:

Screenshot 3, where I have circled in red the only content I want to get from the HTML file:

Tags: python beautifulsoup html-table traversal dom-traversal

A: No answers yet
