Parse HTML document with Beautiful Soup

Asked by: ValeTrut Asked: 12/2/2022 Updated: 12/2/2022 Views: 28

Q:

I am new to parsing HTML documents, and I have run into this problem.

Given an HTML document like the following:

<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMMainThread.h</h3>
<table class="metricstable" width="100%">
<h4>Function: ::OMMainThread::destroyThread()</h4>
<table class="metricstable" width="100%">
<tr><td class="lightheader" align="left">Metric</td><td class="lightheader" align="right">CALLS (STCAL)</td><td class="lightheader" align="right">v(G) (STCYC)</td><td class="lightheader" align="right">GOTO (STGTO)</td><td class="lightheader" align="right">RETURN (STM19)</td><td class="lightheader" align="right">LEVEL (STMIF)</td><td class="lightheader" align="right">PARAM (STPAR)</td><td class="lightheader" align="right">PATH (STPTH)</td><td class="lightheader" align="right">STMT (STST3)</td></tr>
<tr><td class="lightheader" align="left">Values</td><td align="right">1</td><td align="right">1</td><td align="right">0</td><td align="right">0</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">1</td></tr>
</table>
<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h</h3>
<table class="metricstable" width="100%">
<h4>Function: ::OMNullValue<p{c::Ping}>::get()</h4>
<table class="metricstable" width="100%">
<tr><td class="lightheader" align="left">Metric</td><td class="lightheader" align="right">CALLS (STCAL)</td><td class="lightheader" align="right">v(G) (STCYC)</td><td class="lightheader" align="right">GOTO (STGTO)</td><td class="lightheader" align="right">RETURN (STM19)</td><td class="lightheader" align="right">LEVEL (STMIF)</td><td class="lightheader" align="right">PARAM (STPAR)</td><td class="lightheader" align="right">PATH (STPTH)</td><td class="lightheader" align="right">STMT (STST3)</td></tr>
<tr><td class="lightheader" align="left">Values</td><td align="right">1</td><td align="right">1</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">2</td></tr>
</table>
<h4>Function: ::OMNullValue<p{c::Ping}>::initNullBlock()</h4>
<table class="metricstable" width="100%">
<tr><td class="lightheader" align="left">Metric</td><td class="lightheader" align="right">CALLS (STCAL)</td><td class="lightheader" align="right">v(G) (STCYC)</td><td class="lightheader" align="right">GOTO (STGTO)</td><td class="lightheader" align="right">RETURN (STM19)</td><td class="lightheader" align="right">LEVEL (STMIF)</td><td class="lightheader" align="right">PARAM (STPAR)</td><td class="lightheader" align="right">PATH (STPTH)</td><td class="lightheader" align="right">STMT (STST3)</td></tr>
<tr><td class="lightheader" align="left">Values</td><td align="right">0</td><td align="right">2</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">2</td><td align="right">5</td></tr>
</table>
<h4>Function: ::OMNullValue<p{c::Pong}>::get()</h4>
<table class="metricstable" width="100%">
<tr><td class="lightheader" align="left">Metric</td><td class="lightheader" align="right">CALLS (STCAL)</td><td class="lightheader" align="right">v(G) (STCYC)</td><td class="lightheader" align="right">GOTO (STGTO)</td><td class="lightheader" align="right">RETURN (STM19)</td><td class="lightheader" align="right">LEVEL (STMIF)</td><td class="lightheader" align="right">PARAM (STPAR)</td><td class="lightheader" align="right">PATH (STPTH)</td><td class="lightheader" align="right">STMT (STST3)</td></tr>
<tr><td class="lightheader" align="left">Values</td><td align="right">1</td><td align="right">1</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">2</td></tr>
</table>
<h4>Function: ::OMNullValue<p{c::Pong}>::initNullBlock()</h4>
<table class="metricstable" width="100%">
<tr><td class="lightheader" align="left">Metric</td><td class="lightheader" align="right">CALLS (STCAL)</td><td class="lightheader" align="right">v(G) (STCYC)</td><td class="lightheader" align="right">GOTO (STGTO)</td><td class="lightheader" align="right">RETURN (STM19)</td><td class="lightheader" align="right">LEVEL (STMIF)</td><td class="lightheader" align="right">PARAM (STPAR)</td><td class="lightheader" align="right">PATH (STPTH)</td><td class="lightheader" align="right">STMT (STST3)</td></tr>
<tr><td class="lightheader" align="left">Values</td><td align="right">0</td><td align="right">2</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">0</td><td align="right">2</td><td align="right">5</td></tr>
</table>
<h3>File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMStaticArray.h</h3>
<table class="metricstable" width="100%">
<h4>Function: ::OMStaticArray<p{c::Ping}>::@constructor(,ni)</h4>
<table class="metricstable" width="100%">
<tr><td class="lightheader" align="left">Metric</td><td class="lightheader" align="right">CALLS (STCAL)</td><td class="lightheader" align="right">v(G) (STCYC)</td><td class="lightheader" align="right">GOTO (STGTO)</td><td class="lightheader" align="right">RETURN (STM19)</td><td class="lightheader" align="right">LEVEL (STMIF)</td><td class="lightheader" align="right">PARAM (STPAR)</td><td class="lightheader" align="right">PATH (STPTH)</td><td class="lightheader" align="right">STMT (STST3)</td></tr>
<tr><td class="lightheader" align="left">Values</td><td align="right">4</td><td align="right">2</td><td align="right">0</td><td align="right">0</td><td align="right">1</td><td align="right">1</td><td align="right">2</td><td align="right">2</td></tr>
</table>

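What I need is to create a data structure like this:

<file name, function (belonging to that file), STCYC value of that function>

I tried iterating like this: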

for files_and_functions in soup.find_all(['h3','h4','table']):
        for elem in files_and_functions:
            valore = elem.text

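and asking each elem whether it is a function, a file, or an STCYC value, but I can't get anywhere with it. Can anyone help me extract this information from this horrible HTML? Thanks!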

html python-3.x beautifulsoup html-parsing


A:

0 votes thunderkill 12/2/2022 #1

You can try using this:

try:
    from BeautifulSoup import BeautifulSoup  # legacy Beautiful Soup 3
except ImportError:
    from bs4 import BeautifulSoup  # Beautiful Soup 4
html = ""  # replace with the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)
0 votes Andrej Kesely 12/2/2022 #2

If html_doc contains the HTML snippet from the question, you can do the following:

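from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

# Select only the innermost tables (those that do not contain another table).
for t in soup.select("table.metricstable:not(:has(table))"):
    # The first row holds the metric names, the following row the values.
    k = [td.text for td in t.tr.find_all("td")]
    v = [td.text for td in t.tr.find_next("tr").find_all("td")]

    d = dict(zip(k, v))

    # The nearest preceding <h3> and <h4> give the file and the function.
    filename = t.find_previous("h3").text
    function = t.find_previous("h4").text
    styc = d["v(G) (STCYC)"]

    print("{:<50} {:<10} {}".format(function, styc, filename))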

Prints:

Function: ::OMMainThread::destroyThread()          1          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMMainThread.h
Function: ::OMNullValue::get()                     1          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMNullValue::initNullBlock()           2          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMNullValue::get()                     1          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMNullValue::initNullBlock()           2          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h
Function: ::OMStaticArray::@constructor(,ni)       2          File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMStaticArray.h
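
The question asks for a data structure rather than printed lines. As a minimal sketch, the three values obtained for each table could be stored as records and grouped by file. The names FunctionMetric and group_by_file are hypothetical, chosen here for illustration, and the sample records are copied from the first three lines of the output above.

```python
from collections import defaultdict
from typing import NamedTuple

MAIN_H = "File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMMainThread.h"
NULL_H = "File: /home/finxadm/XMW.SET.OXF.CPP/LangCpp/oxf/OMNullValue.h"


class FunctionMetric(NamedTuple):
    """One record: <file name, function, STCYC value>."""
    filename: str
    function: str
    stcyc: int


def group_by_file(records):
    """Map each file name to a list of (function, stcyc) pairs, in input order."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.filename].append((rec.function, rec.stcyc))
    return dict(grouped)


# In the parsing loop, instead of calling print(...), one could append:
#     records.append(FunctionMetric(filename, function, int(styc)))
# The sample records below stand in for that loop.
records = [
    FunctionMetric(MAIN_H, "Function: ::OMMainThread::destroyThread()", 1),
    FunctionMetric(NULL_H, "Function: ::OMNullValue::get()", 1),
    FunctionMetric(NULL_H, "Function: ::OMNullValue::initNullBlock()", 2),
]

by_file = group_by_file(records)
for filename, functions in by_file.items():
    print(filename, functions)
```

Grouping by file keeps the "functions related to that file" relationship explicit; a flat list of records would work equally well if only row-wise access is needed.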

Comments

0 votes ValeTrut 12/2/2022
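Thank you very much, this helped me a lot. However, even though I copied and pasted your snippet, I get the following error: styc = d["v(G) (STCYC)"] KeyError: 'v(G) (STCYC)'. I really don't know why; am I missing something?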
0 votes Andrej Kesely 12/2/2022
@ValeTrut Try styc = d.get("v(G) (STCYC)", 'Not Found')
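
The KeyError means that, for at least one table, no header cell had exactly the text "v(G) (STCYC)". The thread does not say why. Two possible causes, both assumptions, are tables in the real document with a different set of columns, and header text carrying extra whitespace. A minimal sketch of a more tolerant lookup follows; normalize and lookup_metric are hypothetical helper names, and the lists k and v stand in for the cell texts collected in the answer's loop.

```python
def normalize(text):
    """Collapse runs of whitespace (including non-breaking spaces) to single spaces."""
    return " ".join(text.split())


def lookup_metric(headers, values, key, default="Not Found"):
    """Pair header cells with value cells and look up `key` without raising KeyError."""
    d = dict(zip((normalize(h) for h in headers), (normalize(v) for v in values)))
    return d.get(normalize(key), default)


# Hypothetical cell texts: the third header has stray whitespace around and inside it.
k = ["Metric", "CALLS (STCAL)", " v(G)\xa0(STCYC)\n"]
v = ["Values", "1", "2"]

print(lookup_metric(k, v, "v(G) (STCYC)"))  # found despite the stray whitespace
print(lookup_metric(k, v, "PATH (STPTH)"))  # column absent, so the default is returned
```

Printing the keys of d for a failing table is a quick way to see which of the two causes applies.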