Append information in the th tags to td rows

Asked by: the_special_none · Asked: 5/31/2022 · Last edited by: the_special_none · Updated: 5/31/2022 · Views: 63

Q:

I am an economist struggling with coding and data scraping. I am scraping data from the main (and only) table on this web page (https://www.oddsportal.com/basketball/europe/euroleague-2013-2014/results/). With Python and Selenium I can retrieve all the information in the td HTML tags by referencing the class attributes. The same goes for the th tags that store the match date and stage information. In my final dataset, I would like the information stored in the th tags to appear as two extra columns (date and match stage) alongside the other rows of the table. Basically, for each match I want the date and stage of the match on the same row, rather than as a header for each group of matches. The only solution I have come up with is to index all the rows (both th and td tags) and build a while loop that appends the information in a th tag to every td row whose index is lower than the index of the next th tag. I hope I have made myself clear (if not, I will try to give a more vivid explanation). However, given my poor coding skills, I am not able to write such a logical structure, and I do not know whether I need two loops to iterate over the different tags (td and th) or how to do it. If you have any simpler solution, it is very welcome! Thanks in advance for your precious help!

Here is the code:

from selenium import webdriver
import time
import pandas as pd

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016','2016-2017', '2017-2018', '2018-2019']

# Define empty data
data_keys = ["Season", "Match_Time", "Home_Team", "Away_Team", "Home_Odd", "Away_Odd", "Home_Score",
             "Away_Score", "OT", "N_Bookmakers"]
data = dict()
for key in data_keys:
    data[key] = list()
del data_keys
    
# Define 'driver' variable and launch browser
#path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
#path office pc
path = "C:/Users/aldi/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)

# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1
                       
        # Get url and navigate it
        page_str = (1 - len(str(page_num)))* '0' + str(page_num)
        url ="https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)
        
        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break
        
        try:      
            # Teams
            for el in driver.find_elements_by_class_name('name.table-participant'):
                el = el.text.strip().split(" - ")
                data["Home_Team"].append(el[0])
                data["Away_Team"].append(el[1])
                data["Season"].append(season_filt)
            
            # Scores
            for el in driver.find_elements_by_class_name('center.bold.table-odds.table-score'):
                el = el.text.split(":")
                if el[1][-3:] == " OT":
                    data["OT"].append(True)
                    el[1] = el[1][:-3]
                else:
                    data["OT"].append(False)
                data["Home_Score"].append(el[0])
                data["Away_Score"].append(el[1])
            
            # Match times
            for el in driver.find_elements_by_class_name("table-time"):
                data["Match_Time"].append(el.text)
            
            # Odds
            i = 0
            for el in driver.find_elements_by_class_name("odds-nowrp"):
                i += 1
                if i%2 == 0:
                    data["Away_Odd"].append(el.text)
                else:
                    data["Home_Odd"].append(el.text)
                    
            # N_Bookmakers
            for el in driver.find_elements_by_class_name("center.info-value"):
                data["N_Bookmakers"].append(el.text)
            
            # TODO think of inserting the dates list in the dataframe even if it has a different size (19 rows and not 50)

        except:
            pass

driver.quit()

data = pd.DataFrame(data)
data.to_csv("data_odds.csv", index = False)

I would like to add this information as two extra columns (date and stage) to my dataset:

for el in driver.find_elements_by_class_name("first2.tl")[1:]:
    el = el.text.strip().split(" - ")
    data["date"].append(el[0])
    data["stage"].append(el[1])
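
Roughly, the carry-forward logic I have in mind looks like the sketch below, although I am not sure it is correct. It assumes the table's id is tournamentTable and that "date" and "stage" have been added to data_keys:

# Walk the table rows in document order, remember the last th header seen,
# and attach its date/stage to every following match (td) row.
current_date, current_stage = None, None
for row in driver.find_elements_by_css_selector("#tournamentTable tr"):
    headers = row.find_elements_by_class_name("first2.tl")
    if headers:
        # a th header row: update the values to carry forward
        parts = headers[0].text.strip().split(" - ")
        if len(parts) == 2:
            current_date, current_stage = parts
    elif row.find_elements_by_class_name("name.table-participant"):
        # a td match row: attach the date/stage of the header above it
        data["date"].append(current_date)
        data["stage"].append(current_stage)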



html pandas selenium selenium-webdriver web-scraping

Comments

0 votes · sound wave · 5/31/2022
Could you provide the full code you have written?
0 votes · the_special_none · 5/31/2022
I added the code to the post 😊

A:

1 vote · chitown88 · 5/31/2022 · #1

There are a few things I would change here.

  1. Don't overwrite your variables. You store the elements in el, then overwrite el with a string. It may work for you here, but that practice can get you into trouble later, especially since you are iterating over those elements. It also makes debugging difficult.

  2. I know Selenium has ways to parse the html, but I personally find BeautifulSoup easier to parse with and a little more intuitive if you just want to pull data out of the html. So I went with BeautifulSoup's .find_previous() to get the th tag that comes before each game, which basically gets you your date and stage content (there is a short self-contained sketch of this right after the list).

  3. Lastly, I like to construct a list of dictionaries to build the dataframe. Each item in the list is a dictionary of key:value pairs, where the key is the column name and the value is the data. You are doing sort of the opposite by creating a dictionary of lists. There is nothing wrong with that, but if the lists end up with different lengths, you will hit an error when you try to create the dataframe. With a list of dictionaries, if a value is missing for some reason, the dataframe is still created and the missing data simply shows up as null/nan.
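
To make points 2 and 3 concrete, here is a tiny self-contained sketch. The html below is made up for illustration, but it has the same shape as the rows on that page: .find_previous() walks backwards from a match row to the closest th header above it, and building the dataframe from a list of dictionaries means a missing key simply becomes NaN instead of raising an error:

import pandas as pd
from bs4 import BeautifulSoup

# Toy html with the same shape as the odds table: a th header row
# followed by the td match rows it applies to.
html = """
<table>
  <tr><th class="first2 tl">18 May 2014 - Final Four</th></tr>
  <tr class="deactivate"><td class="name table-participant">Real Madrid - Maccabi Tel Aviv</td></tr>
  <tr><th class="first2 tl">16 May 2014 - Final Four</th></tr>
  <tr class="deactivate"><td class="name table-participant">Barcelona - Real Madrid</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find_all('tr', {'class': 'deactivate'}):
    teams = tr.find('td', {'class': 'name table-participant'}).text.split(' - ')
    # .find_previous() returns the closest th that appears before this row
    date, stage = tr.find_previous('th', {'class': 'first2 tl'}).text.split(' - ')
    rows.append({'Home_Team': teams[0], 'Away_Team': teams[1],
                 'Date': date, 'Stage': stage})

# Missing keys in any of the dicts would just show up as NaN here,
# whereas a dict of unequal-length lists raises a ValueError.
print(pd.DataFrame(rows))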

You will probably need to do a little more work on the code to navigate through the pages, but this gets the data in the form you want.

Code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup
import re

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016','2016-2017', '2017-2018', '2018-2019']

    
# Define 'driver' variable and launch browser
path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(path)

rows = []
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1
                       
        # Get url and navigate it
        page_str = (1 - len(str(page_num)))* '0' + str(page_num)
        url ="https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)
        
        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break
        
        try:      
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            table = soup.find('table', {'id':'tournamentTable'})
            
            trs = table.find_all('tr', {'class':re.compile('.*deactivate.*')})
            for each in trs:
                teams = each.find('td', {'class':'name table-participant'}).text.split(' - ')
                scores = each.find('td', {'class':re.compile('.*table-score.*')}).text.split(':')
                ot = False
                for score in scores:
                    if 'OT' in score:
                        ot = True
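                # '\xa0' is the non-breaking space that sits between the score and the 'OT' marker in the page source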
                scores = [x.replace('\xa0OT','') for x in scores]
                matchTime = each.find('td', {'class':re.compile('.*table-time.*')}).text
                
                # Odds
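                # the two odds-nowrp cells in each row are home odds first, then away odds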
                i = 0
                for each_odd in each.find_all('td',{'class':"odds-nowrp"}):
                    i += 1
                    if i%2 == 0:
                        away_odd = each_odd.text
                    else:
                        home_odd = each_odd.text
                        
                n_bookmakers = each.find('td',{'class':'center info-value'}).text  # look inside the current row, not the whole page
                
                date_stage = each.find_previous('th', {'class':'first2 tl'}).text.split(' - ')
                date = date_stage[0]
                stage = date_stage[1]
                
                
                row = {'Season':season_filt,
                 'Home_Team':teams[0],
                 'Away_Team':teams[1],
                 'Home_Score':scores[0],
                 'Away_Score':scores[1],
                 'OT':ot,
                 'Match_Time':matchTime,
                 'Home_Odd':home_odd,
                 'Away_Odd':away_odd,
                 'N_Bookmakers':n_bookmakers,
                 'Date':date,
                 'Stage':stage}
                
                rows.append(row)
                
                

        except:
            pass

driver.quit()

data = pd.DataFrame(rows)
data.to_csv("data_odds.csv", index = False)

Output:

print(data.head(15).to_string())
       Season         Home_Team          Away_Team Home_Score Away_Score     OT Match_Time Home_Odd Away_Odd N_Bookmakers         Date       Stage
0   2013-2014       Real Madrid   Maccabi Tel Aviv         86         98  False      18:00     -667     +493            7  18 May 2014  Final Four
1   2013-2014         Barcelona        CSKA Moscow         93         78  False      15:00     -135     +112            7  18 May 2014  Final Four
2   2013-2014         Barcelona        Real Madrid         62        100  False      19:00     +134     -161            7  16 May 2014  Final Four
3   2013-2014       CSKA Moscow   Maccabi Tel Aviv         67         68  False      16:00     -278     +224            7  16 May 2014  Final Four
4   2013-2014       Real Madrid        Olympiacos          83         69  False      18:45     -500     +374            7  25 Apr 2014   Play Offs
5   2013-2014       CSKA Moscow     Panathinaikos          74         44  False      16:00     -370     +295            7  25 Apr 2014   Play Offs
6   2013-2014        Olympiacos       Real Madrid          71         62  False      18:45     +127     -152            7  23 Apr 2014   Play Offs
7   2013-2014  Maccabi Tel Aviv    Olimpia Milano          86         66  False      17:45     -217     +179            7  23 Apr 2014   Play Offs
8   2013-2014     Panathinaikos       CSKA Moscow          73         72  False      16:30     -106     -112            7  23 Apr 2014   Play Offs
9   2013-2014     Panathinaikos       CSKA Moscow          65         59  False      18:45     -125     +104            7  21 Apr 2014   Play Offs
10  2013-2014  Maccabi Tel Aviv    Olimpia Milano          75         63  False      18:15     -189     +156            7  21 Apr 2014   Play Offs
11  2013-2014        Olympiacos       Real Madrid          78         76  False      17:00     +104     -125            7  21 Apr 2014   Play Offs
12  2013-2014       Galatasaray         Barcelona          75         78  False      17:00     +264     -333            7  20 Apr 2014   Play Offs
13  2013-2014    Olimpia Milano  Maccabi Tel Aviv          91         77  False      18:45     -286     +227            7  18 Apr 2014   Play Offs
14  2013-2014       CSKA Moscow     Panathinaikos          77         51  False      16:15     -303     +247            7  18 Apr 2014   Play Offs

Comments

0 votes · the_special_none · 5/31/2022
Thank you so much! You gave me a lot of useful insights! I really appreciate it! 😊