How can I parse repeated HTML elements by their class attribute?

Asked: 2/6/2022 Last edited by: Lance U. Matthews Updated: 2/6/2022 Views: 81

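Q:

I'm trying to parse an HTML file in which every row uses essentially the same tags.

I would like to get this output:

BTC - Bitcoin, BEP20(BSC), Bitcoin(SegWit)

ETH - ERC20, BEP20(BSC), POLYGON, ARBITRUM, AURORA, METISEVM

USDT - OMNI, TRC20, ERC20, BEP20(BSC), HECO, POLYGON, FTM, AVAX-C, ARBITRUM, METISEVM

QASH - ERC20

Here is a sample of the HTML: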

<div data-v-326d86f4="" class="table-box">
   <table data-v-326d86f4="">
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">BTC</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">Bitcoin</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
            <div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">Bitcoin</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">Bitcoin(SegWit)</span></div>
         </td>
         <td data-v-326d86f4="">0.001</td>
         <td data-v-326d86f4="">0.002</td>
      </tr>
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">ETH</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">ERC20</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
            <div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">ERC20</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">POLYGON</span><span data-v-326d86f4="">ARBITRUM</span><span data-v-326d86f4="">AURORA</span><span data-v-326d86f4="">METISEVM</span></div>
         </td>
         <td data-v-326d86f4="">0.012</td>
         <td data-v-326d86f4="">0.024</td>
      </tr>
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">USDT</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">OMNI</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
            <div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">OMNI</span><span data-v-326d86f4="">TRC20</span><span data-v-326d86f4="">ERC20</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">HECO</span><span data-v-326d86f4="">POLYGON</span><span data-v-326d86f4="">FTM</span><span data-v-326d86f4="">AVAX-C</span><span data-v-326d86f4="">ARBITRUM</span><span data-v-326d86f4="">METISEVM</span></div>
         </td>
         <td data-v-326d86f4="">30</td>
         <td data-v-326d86f4="">50</td>
      </tr>
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">QASH</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box">
               <span data-v-326d86f4="" class="chain_name">ERC20</span> <!---->
            </div>
            <!---->
         </td>
         <td data-v-326d86f4="">513</td>
         <td data-v-326d86f4="">1026</td>
      </tr>
      <!-- ... -->

I'm using the HtmlAgilityPack library, without success:

Dim arqHtml As String = "C:\Users\Mattia\Desktop\ready.html"
Dim myHtml As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
myHtml.Load(arqHtml)
Dim myTable As HtmlAgilityPack.HtmlNode = myHtml.DocumentNode.SelectSingleNode("//table")
Dim myRows As HtmlAgilityPack.HtmlNodeCollection = myTable.SelectNodes("tr")
For Each tmpRow As HtmlAgilityPack.HtmlNode In myRows
    Dim myCells As HtmlAgilityPack.HtmlNodeCollection = tmpRow.SelectNodes("td")
    If myCells IsNot Nothing Then
        Dim myToken As String = myCells(0).InnerText
        Dim mySpans As HtmlAgilityPack.HtmlNodeCollection = myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
        If mySpans IsNot Nothing Then
            Dim myListBChain As New List(Of String)
            For Each mySpan As HtmlAgilityPack.HtmlNode In mySpans
                RichTextBox1.Text += mySpan.InnerText
            Next
            Dim allItensAsString = String.Join(", ", richtextbox1.text)
        End If
    End If
Next

This returns the following output:

BitcoinBEP20(BSC)Bitcoin(SegWit)ERC20BEP20(BSC)POLYGONARBITRUMAURORAMETISEVMOMNITRC20ERC20BEP20(BSC)HECOPOLYGONFTMAVAX-CARBITRUMMETISEVMEOSBEP20(BSC)ERC20BEP20(BSC)TRC20BEP20(BSC)ZILBEP20(BSC)NEOLEGACYNEON3ERC20POLYGONERC20DAGBEP2BEP20(BSC)FTMAVAX-CERC20BEP20(BSC)ERC20BEP20(BSC)ERC20HECOBEP20(BSC)ERC20HECOERC20POLYGONERC20HECOERC20POLYGONERC20BEP20(BSC)BCHBEP20(BSC)ERC20LOOPPOLYGONBEP20(BSC)FTMAVAX-CMETISEVMERC20TOLERC20METAERC20BEP20(BSC)

How can I make it return the output I want?

html .net vb.net html-parsing html-agility-pack

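Comments

0 upvotes Lance U. Matthews 2/6/2022
Please post a minimal example of the HTML to be processed in the question itself. Externally hosted content is a poor fit for Stack Overflow because links can rot, and the service you used in the now-deleted link even stated "This document will expire in 21 hours", which would have made this question useless after that.
0 upvotes Lance U. Matthews 2/6/2022
In the last <tr> of the example, the first <td> contains QASH and the second <td> does not contain a <div ... class="select-list">, so myCells(1).SelectNodes("div[contains(@class,'select-list')]/span") returns Nothing, hence the exception.
0 upvotes 2/6/2022
I just updated the code because I realized that some rows have no spans, so that has to be checked before continuing. What you said is correct: in this case the coin QASH supports only one chain, so it doesn't need a select list.
0 upvotes Lance U. Matthews 2/6/2022
I think you just need to swap a couple of lines: replace RichTextBox1.Text += mySpan.InnerText with myListBChain.Add(mySpan.InnerText), and Dim allItensAsString = String.Join(", ", myListBChain) with RichTextBox1.Text += String.Join(", ", myListBChain). However, that won't properly join together (using a newline or ,) the values of RichTextBox1.Text for multiple tmpRows.
0 upvotes 2/6/2022
I just tried what you suggested, but the output is still unclear, like Bitcoin, BEP20(BSC), Bitcoin(SegWit)ERC20, BEP20(BSC), POLYGON, ARBITRUM, AURORA, METISEVMOMNI, TRC20, ERC20, BEP20(BSC), HECO, POLYGON, FTM, AVAX-C, ARBITRUM, METISEVMEOS, BEP20(BSC)ERC20, BEP20(BSC)TRC20, BEP20(BSC)ZIL, BEP20(BSC)NEOLEGACY, NEON3ERC20, POLYGONERC20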

A:

0 upvotes Lance U. Matthews 2/6/2022 #1

To incorporate my comment on the original question: in the last <tr> of the example...

<tr data-v-326d86f4="">
    <td data-v-326d86f4="">QASH</td>
    <td data-v-326d86f4="" class="block-chain">
    <div data-v-326d86f4="" class="chain_box">
        <span data-v-326d86f4="" class="chain_name">ERC20</span> <!---->
    </div>
    <!---->
    </td>
    <td data-v-326d86f4="">513</td>
    <td data-v-326d86f4="">1026</td>
</tr>

...the second <td> does not contain a <div class="select-list" ... >, so...

myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")

...returns Nothing, hence the NullReferenceException.

As for building the output you want, first you need to test whether such a <div class="select-list" ... > element exists...

If mySpans Is Nothing Then

If it doesn't, save the content of the <div class="chain_box" ... ><span class="chain_name" ... > element...

Dim chainTextNode As HtmlAgilityPack.HtmlNode = myCells(1).SelectSingleNode(
    "div[contains(@class, 'chain_box')]/span[contains(@class, 'chain_name')]"
)

chainText = If(chainTextNode Is Nothing OrElse String.IsNullOrWhiteSpace(chainTextNode.InnerText), "(unknown)", chainTextNode.InnerText)

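I added some extra handling in case that element doesn't exist or has no value.

If there is a <div class="select-list" ... > element, save the values of its child <span ... > elements, separated by commas...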

chainText = String.Join(", ", mySpans.Select(Function(span) span.InnerText))
' Alternative: chainText = String.Join(", ", From span In mySpans Select span.InnerText)

Finally, build a new line and append it to the text box...

RichTextBox1.Text &= $"{myToken} - {chainText}{Environment.NewLine}"

The complete code looks like this...

Dim arqHtml As String = "C:\Users\Mattia\Desktop\ready.html"
Dim myHtml As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
myHtml.Load(arqHtml)
Dim myTable As HtmlAgilityPack.HtmlNode = myHtml.DocumentNode.SelectSingleNode("//table")

Dim myRows As HtmlAgilityPack.HtmlNodeCollection = myTable.SelectNodes("tr")
For Each tmpRow As HtmlAgilityPack.HtmlNode In myRows
    Dim myCells As HtmlAgilityPack.HtmlNodeCollection = tmpRow.SelectNodes("td")
    If myCells IsNot Nothing Then
        Dim myToken As String = myCells(0).InnerText
        Dim mySpans As HtmlAgilityPack.HtmlNodeCollection = myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
        Dim chainText As String

        If mySpans Is Nothing Then
            Dim chainTextNode As HtmlAgilityPack.HtmlNode = myCells(1).SelectSingleNode(
                "div[contains(@class, 'chain_box')]/span[contains(@class, 'chain_name')]"
            )

            chainText = If(chainTextNode Is Nothing OrElse String.IsNullOrWhiteSpace(chainTextNode.InnerText), "(unknown)", chainTextNode.InnerText)
        Else
            chainText = String.Join(", ", mySpans.Select(Function(span) span.InnerText))
            ' Alternative: chainText = String.Join(", ", From span In mySpans Select span.InnerText)
        End If

        RichTextBox1.Text &= $"{myToken} - {chainText}{Environment.NewLine}"
    End If
Next
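
To try the same logic without a form or a file on disk, a minimal console sketch along these lines might help. It assumes the HtmlAgilityPack package is referenced. The module name ChainListDemo and the trimmed-down inline HTML are hypothetical stand-ins for the real project and for ready.html.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module ChainListDemo
    Sub Main()
        ' Hypothetical inline sample standing in for ready.html:
        ' one row with a select-list, one row (QASH) without it.
        Dim html As String =
            "<table>" &
            "<tr><td>BTC</td><td class=""block-chain"">" &
            "<div class=""chain_box""><span class=""chain_name"">Bitcoin</span></div>" &
            "<div class=""select-list""><span>Bitcoin</span><span>BEP20(BSC)</span></div>" &
            "</td></tr>" &
            "<tr><td>QASH</td><td class=""block-chain"">" &
            "<div class=""chain_box""><span class=""chain_name"">ERC20</span></div>" &
            "</td></tr>" &
            "</table>"

        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        For Each row As HtmlNode In doc.DocumentNode.SelectNodes("//table/tr")
            Dim cells As HtmlNodeCollection = row.SelectNodes("td")
            If cells Is Nothing OrElse cells.Count < 2 Then Continue For

            ' SelectNodes returns Nothing when no node matches, so test before use.
            Dim spans As HtmlNodeCollection =
                cells(1).SelectNodes("div[contains(@class,'select-list')]/span")
            Dim chainText As String

            If spans Is Nothing Then
                ' Fall back to the single chain name shown in chain_box.
                Dim nameNode As HtmlNode = cells(1).SelectSingleNode(
                    "div[contains(@class,'chain_box')]/span[contains(@class,'chain_name')]")
                chainText = If(nameNode Is Nothing OrElse String.IsNullOrWhiteSpace(nameNode.InnerText),
                               "(unknown)", nameNode.InnerText)
            Else
                chainText = String.Join(", ", spans.Select(Function(span) span.InnerText))
            End If

            Console.WriteLine($"{cells(0).InnerText} - {chainText}")
        Next
    End Sub
End Module
```

With that sample input, this should print "BTC - Bitcoin, BEP20(BSC)" followed by "QASH - ERC20" on separate lines. HtmlDocument is fully qualified because a Windows Forms project typically also imports System.Windows.Forms, which has its own HtmlDocument type.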

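If you have a very large input HTML file, you might consider...

...however, in exchange for the improved performance, using either or both of those approaches means no output is displayed in RichTextBox1 until the HTML has been fully processed.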
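
As one illustration of that kind of trade-off, and purely as an assumption about what such an approach could look like, the rows can be accumulated in a StringBuilder and assigned to the text box once at the end, instead of appending to RichTextBox1.Text on every row. The module ChainReport and the function BuildChainReport are hypothetical names introduced for this sketch.

```vbnet
Imports System.Linq
Imports System.Text
Imports HtmlAgilityPack

Module ChainReport
    ' Hypothetical helper: builds the whole report in memory and returns it,
    ' so the caller updates the UI a single time.
    Function BuildChainReport(htmlPath As String) As String
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.Load(htmlPath)

        Dim output As New StringBuilder()
        Dim rows As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//table/tr")
        If rows Is Nothing Then Return String.Empty

        For Each row As HtmlNode In rows
            Dim cells As HtmlNodeCollection = row.SelectNodes("td")
            If cells Is Nothing OrElse cells.Count < 2 Then Continue For

            Dim spans As HtmlNodeCollection =
                cells(1).SelectNodes("div[contains(@class,'select-list')]/span")
            Dim chainText As String

            If spans Is Nothing Then
                ' No select-list: use the single chain name from chain_box.
                Dim nameNode As HtmlNode = cells(1).SelectSingleNode(
                    "div[contains(@class,'chain_box')]/span[contains(@class,'chain_name')]")
                chainText = If(nameNode Is Nothing OrElse String.IsNullOrWhiteSpace(nameNode.InnerText),
                               "(unknown)", nameNode.InnerText)
            Else
                chainText = String.Join(", ", spans.Select(Function(span) span.InnerText))
            End If

            ' Append to the in-memory buffer rather than to the control.
            output.AppendLine($"{cells(0).InnerText} - {chainText}")
        Next

        Return output.ToString()
    End Function
End Module
```

In the form, the call would then be a single assignment such as RichTextBox1.Text = ChainReport.BuildChainReport(arqHtml). Nothing appears in the text box until the function returns, which matches the caveat above.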