I want to iterate through the nested tags in HTML code like a list or JSON file

Question:

For example, I have the following code:


<div class = "las">
    <div class = "asas">
      <table style="width:100%">
        <tr>
          <th>Firstname</th>
          <th>Lastname</th> 
          <th>Age</th>
        </tr>
        <tr>
          <td>Jill</td>
          <td>Smith</td>
          <td>50</td>
        </tr>
        <tr>
          <td>Eve</td>
          <td>Jackson</td>
          <td>94</td>
        </tr>
        <tr>
          <td>John</td>
          <td>Doe</td>
          <td>80</td>
        </tr>
      </table>
    </div>
</div>

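I have saved this in a variable named "code". How can I access a tag such as <td>Smith</td> with something like code[0][0][1][1]? I'm using Beautiful Soup, and the only way I know to traverse nested tags is with .parents and .children, which gets very messy.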

python beautifulsoup html parsing

from bs4 import BeautifulSoup

code_str = '''
<div class = "las">
    <div class = "asas">
      <table style="width:100%">
        <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
        <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
        <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
        <tr><td>John</td><td>Doe</td><td>80</td></tr>
      </table>
    </div>
</div>
''' 
code = BeautifulSoup(code_str, 'html.parser').div  # the outer <div class="las">

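How can I access the <td>Smith</td> tag with something like code[0][0][1][1]?

I'm using Beautiful Soup, and the only way I know to traverse nested tags is with .parents and .children, which gets very messy.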

So I'm guessing you wouldn't be satisfied with just code.div.table.select('tr')[1].select('td')[1] or code.select_one('div>table>tr:nth-child(2)>td:nth-child(2)') or something similar.
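For reference, here is a small sketch of those two lookups run against the sample markup. The explicit 'html.parser' argument is an assumption added here (the original call names no parser); it keeps the tree as written, without an inserted <tbody>.

```python
from bs4 import BeautifulSoup

code_str = '''
<div class="las">
    <div class="asas">
      <table style="width:100%">
        <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
        <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
        <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
        <tr><td>John</td><td>Doe</td><td>80</td></tr>
      </table>
    </div>
</div>
'''
# Assumption: 'html.parser' (the standard-library parser) is used explicitly.
code = BeautifulSoup(code_str, 'html.parser').div

# Chained lookups: the second <tr>, then its second <td>.
by_chain = code.div.table.select('tr')[1].select('td')[1]

# The same cell located with a single CSS selector.
by_css = code.select_one('div>table>tr:nth-child(2)>td:nth-child(2)')

print(by_chain)  # <td>Smith</td>
print(by_css)    # <td>Smith</td>
```

Both expressions land on the same cell; the selector version is shorter, but it depends on the exact nesting of the markup.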

There's also .contents [which returns a list instead of a generator like .children], but I'd be careful with code.contents[0].contents[0].contents[1].contents[1], because .contents can include whitespace (look at code.table.contents, for example).
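A quick way to see the whitespace issue, sketched on a shortened version of the sample table (the shortened markup and the variable name tags_only are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample markup, indented the same way as the original.
code_str = '''<div class="las">
    <div class="asas">
      <table>
        <tr><th>Firstname</th><th>Lastname</th></tr>
        <tr><td>Jill</td><td>Smith</td></tr>
      </table>
    </div>
</div>'''
code = BeautifulSoup(code_str, 'html.parser').div

# The first child of the outer <div> is a whitespace-only string, not a tag.
print(repr(code.contents[0]))

# Keeping only Tag children makes positional indexing predictable.
tags_only = [c for c in code.table.contents if isinstance(c, Tag)]
print(len(code.table.contents), len(tags_only))  # 5 2

# find_all(recursive=False) also returns direct child tags only.
print(tags_only[1].find_all(recursive=False)[1])  # <td>Smith</td>
```

Here the table has five children in .contents but only two of them are rows, which is why plain numeric indexing into .contents is fragile.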

You can use a regular expression to remove the whitespace between tags:

import re
from bs4 import BeautifulSoup

# Collapse whitespace between adjacent tags so .contents holds only tags.
code = BeautifulSoup(re.sub(r'>\s*<', '><', code_str), 'html.parser').div

Then code.contents[0].contents[0].contents[1].contents[1] should return <td>Smith</td>.

Alternatively, you could write a class that converts .contents into nested lists:

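import bs4

class indexableTag:
    """Wrap a bs4 Tag so that its children can be reached by integer index."""

    def __init__(self, origTag: bs4.element.Tag, ignore_whitespace=True):
        self.tag = origTag
        # Wrap child tags recursively; keep strings as-is, optionally
        # dropping whitespace-only strings.
        self.tag_contents = [
            indexableTag(c, ignore_whitespace) if isinstance(c, bs4.element.Tag) else c
            for c in origTag.children
            if not (ignore_whitespace and isinstance(c, str) and not c.strip())
        ]

    def __getitem__(self, key):
        return self.tag_contents[key]

code = indexableTag(bs4.BeautifulSoup(code_str, 'html.parser').div)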

Then code[0][0][1][1].tag should return the <td>Smith</td> bs4 tag.
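If a wrapper class feels like too much, a lighter variant of the same idea is a small helper that follows a path of child-tag indices. This is only a sketch: the name tag_at is hypothetical, and it relies on find_all(recursive=False) returning direct child tags while skipping text nodes.

```python
from bs4 import BeautifulSoup

def tag_at(root, *path):
    """Follow child-tag indices from root, ignoring text nodes at each level."""
    node = root
    for i in path:
        # Direct child tags only; whitespace strings are not counted.
        node = node.find_all(recursive=False)[i]
    return node

code_str = '''
<div class="las">
    <div class="asas">
      <table style="width:100%">
        <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
        <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
        <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
        <tr><td>John</td><td>Doe</td><td>80</td></tr>
      </table>
    </div>
</div>
'''
code = BeautifulSoup(code_str, 'html.parser').div

# Inner div -> table -> second row -> second cell.
cell = tag_at(code, 0, 0, 1, 1)
print(cell)  # <td>Smith</td>
```

Unlike the class above, this returns ordinary bs4 tags at every step, so no .tag attribute is needed, but it rescans the children on each call.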

Maybe you can convert the table into a pandas DataFrame and select the data from there.
– comment by AndrejKesely

You can use read_html to convert the table into a DataFrame:

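from io import StringIO
import pandas as pd

# read_html returns a list of DataFrames, one per <table>; StringIO avoids the
# deprecation warning newer pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(code_str))[0]  # df.loc[0, 'Lastname'] == 'Smith'
df_dict = df.to_dict()                    # df_dict['Lastname'][0] == 'Smith'
df_recs = df.to_dict('records')           # df_recs[0]['Lastname'] == 'Smith'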
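If pandas is not available, a similar JSON-like list of records can be sketched with Beautiful Soup alone. This assumes the first row holds the headers, as it does in the sample table; the names headers and records are illustrative. Note that every value stays a string here, whereas read_html infers numeric types.

```python
import json
from bs4 import BeautifulSoup

code_str = '''
<div class="las">
    <div class="asas">
      <table style="width:100%">
        <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
        <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
        <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
        <tr><td>John</td><td>Doe</td><td>80</td></tr>
      </table>
    </div>
</div>
'''
table = BeautifulSoup(code_str, 'html.parser').table
rows = table.find_all('tr')

# Assumption: the first row contains the column headers.
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

# One dict per data row, keyed by header text.
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all('td'))))
    for row in rows[1:]
]

print(records[0]['Lastname'])  # Smith
print(json.dumps(records[0]))
```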
