How to iterate through HTML and parse specific data?

Asked by: Marco Almeida  Asked: 3/10/2023  Updated: 3/10/2023  Views: 180

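Q:

My Python code extracts specific data from the HTML, but it only works when the HTML contains a single instance.

What I need is code that iterates over HTML containing multiple instances and retrieves the specific information from each one. How can I do that?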

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
<title>Exported Data</title>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="css/style.css" rel="stylesheet"/>
  <script src="js/script.js" type="text/javascript">
  </script>
 </head>
 <body onload="CheckLocation();">
  <div class="page_wrap">
   <div class="page_header">
    <div class="content">
     <div class="text bold">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
     </div>
    </div>
   </div>
   <div class="page_body chat_page">
    <div class="history">
     <div class="message service" id="message-1">
      <div class="body details">
9 March 2023
      </div>
     </div>
     <div class="message default clearfix" id="message3984">
      <div class="pull_left userpic_wrap">
       <div class="userpic userpic2" style="width: 42px; height: 42px">
        <div class="initials" style="line-height: 42px">
?
        </div>
       </div>
      </div>
      <div class="body">
       <div class="pull_right date details" title="09.03.2023 00:27:10 UTC-03:00">
00:27
       </div>
       <div class="from_name">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
       </div>
       <div class="text">
Easy Bot - Over 2.5<br><br>🏆 Liga: Premiership<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 03:30  03:33  03:36 ( 03:39)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
       </div>
      </div>
     </div>
     <div class="message default clearfix" id="message3985">
      <div class="pull_left userpic_wrap">
       <div class="userpic userpic2" style="width: 42px; height: 42px">
        <div class="initials" style="line-height: 42px">
?
        </div>
       </div>
      </div>
      <div class="body">
       <div class="pull_right date details" title="09.03.2023 00:45:16 UTC-03:00">
00:45
       </div>
       <div class="from_name">
🤖🥇 𝑬𝒂𝒔𝒚 𝑩𝒐𝒕 - 𝑶𝒗𝒆𝒓 2.5 
       </div>
       <div class="text">
Easy Bot - Over 2.5<br><br>🏆 Liga: Premiership<br>🚦 Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 03:48  03:51  03:54 ( 03:57)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>🍀 24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
       </div>
      </div>
     </div>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>
python-3.x beautifulsoup html parsing

Comments

0 votes Tranbi 3/10/2023
Take a look at BeautifulSoup

A:

1 vote Jack Fleeting 3/10/2023 #1

Well, this question is a bit more complicated than your previous one, so it takes a little more acrobatics:

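from bs4 import BeautifulSoup

# Assumption: the exported page is saved locally; "messages.html" is a placeholder name.
with open("messages.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# The exact match class="body" skips the date-separator div, whose class is "body details".
for b in soup.select('div[class="body"]'):
    # The title attribute looks like "09.03.2023 00:27:10 UTC-03:00".
    d_str = b.select_one('div.date.details')['title']
    calendar = d_str.split(" ")
    print("Date: ", calendar[0])
    print("Time: ", calendar[1])
    targets = b.select('div.text')
    for target in targets:
        for sts in target.stripped_strings:
            if "⚽ Jogos: " in sts:
                # Split the games line into tokens, turning "( 03:39)" into "(03:39)".
                jugos = [elem for elem in sts.split('⚽ Jogos: ')[1].replace('( ', "(").split(" ") if elem]
                if "✅" in jugos:
                    # The checkmark precedes the game it marks, so index + 1 is that game's 1-based position.
                    ind = jugos.index('✅') + 1
                    print("Checkmarked: ", ind)
                    jugos.remove("✅")
                    print(jugos)
                else:
                    print(jugos)
                    print("Checkmarked: NA")
        print('------------------------------------')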

Output:

Date:  09.03.2023
Time:  00:27:10
Checkmarked:  1
['03:30', '03:33', '03:36', '(03:39)']
------------------------------------
Date:  09.03.2023
Time:  00:45:16
Checkmarked:  1
['03:48', '03:51', '03:54', '(03:57)']
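
If you would rather collect the values than print them, the two string-handling steps in the loop can be pulled out into small helpers. This is only a sketch using the standard library: `parse_title` and `parse_jogos` are names I made up, not part of BeautifulSoup, and it assumes every title has the form "DD.MM.YYYY HH:MM:SS UTC±HH:MM" and every game time is written as HH:MM. Unlike the loop above, it returns the parenthesised time without its parentheses.

```python
import re
from datetime import datetime

CHECK = "✅"


def parse_title(title):
    """Turn '09.03.2023 00:27:10 UTC-03:00' into a timezone-aware datetime."""
    date_part, time_part, offset_part = title.split(" ")
    # Drop the leading 'UTC'; %z accepts offsets such as '-03:00' on Python 3.7+.
    offset = offset_part[3:]
    return datetime.strptime(f"{date_part} {time_part} {offset}", "%d.%m.%Y %H:%M:%S %z")


def parse_jogos(line):
    """Return (times, checked) for a line such as '⚽ Jogos: ✅ 03:30  03:33 ( 03:39)'.

    checked is the 1-based position of the game that follows the first
    checkmark, or None when the line has no checkmark.
    """
    payload = line.split("Jogos:", 1)[1]
    times, checked = [], None
    for token in re.findall(CHECK + r"|\d{2}:\d{2}", payload):
        if token == CHECK:
            if checked is None:
                checked = len(times) + 1
        else:
            times.append(token)
    return times, checked


when = parse_title("09.03.2023 00:27:10 UTC-03:00")
times, checked = parse_jogos("⚽ Jogos: ✅ 03:30  03:33  03:36 ( 03:39)")
print(when.isoformat(), times, checked)
```

Inside the loop you would pass `b.select_one('div.date.details')['title']` to `parse_title` and each matching string from `stripped_strings` to `parse_jogos`, then append the results to a list instead of printing them.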