Pandas not printing all of the table

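Asked by: Kanta Asked: 7/5/2023 Last edited by: Kanta Updated: 7/5/2023 Views: 33

Q:

This is my first post, so I hope I haven't forgotten anything.

I am trying to scrape all UFC events to look at certain fighter statistics, and I am trying to use Pandas for it.

This is where my problem starts. When I import the page with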


import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/UFC_168')

print(tables[2])

I get this output:

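Main card ...
Weight class Unnamed: 1_level_1 ... Time Notes
0 Middleweight Chris Weidman (c) ... 1:16 [a]
1 Women's Bantamweight Ronda Rousey (c) ... 0:58 [b]
2 Heavyweight Travis Browne ... 1:00 NaN
3 Lightweight Jim Miller ... 3:42 NaN
4 Catchweight (151.5 lb) Dustin Poirier ... 4:54 NaN
5 Preliminary card (Fox Sports 1) Preliminary card (Fox Sports 1) ... Preliminary card (Fox Sports 1) Preliminary card (Fox Sports 1)
6 Middleweight Uriah Hall ... 5:00 [c]
7 Lightweight Michael Johnson ... 1:32 [d]
8 Featherweight Dennis Siver ... 5:00 [e]
9 Welterweight John Howard ... 5:00 NaN
10 Preliminary card (online) Preliminary card (online) ... Preliminary card (online) Preliminary card (online)
11 Welterweight William Macário ... 5:00 NaN
12 Featherweight Robbie Peralta ... 0:12 NaN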

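This output is missing three columns that are key to my research: the opponent, the method of finish, and the round in which the fight ended. If you have any idea how or why these parts are missing, please let me know. Thanks.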

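Python Pandas web scraping dataset Wikipedia data

Comments

0 upvotes William 7/5/2023
Try using a code block for your output.

A:

1 upvote Marc 7/5/2023 #1

Pandas' web scraping is not as powerful as BeautifulSoup (the best open-source scraper, imo). BeautifulSoup also gives you finer control over each variable or piece of structured data you extract. So I would solve your problem with the following code: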

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/UFC_168'
response = requests.get(url)

# Parse HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table containing fight data
table = soup.find('table', {'class': 'toccolours'})

# Iterate through the rows in the table
for row in table.find_all('tr')[1:]: # Skip the header row
    columns = row.find_all('td')
    
    # Fight rows have at least 7 cells; indices 0-6 are read below
    if len(columns) >= 7:
        weight_class = columns[0].get_text(strip=True)
        fighter = columns[1].get_text(strip=True)
        rel = columns[2].get_text(strip=True)
        opponent = columns[3].get_text(strip=True)
        method = columns[4].get_text(strip=True)
        round_finished = columns[5].get_text(strip=True)
        time = columns[6].get_text(strip=True)
        
        print(f'{weight_class} | {fighter} {rel} {opponent} | {method} | {round_finished} | {time}')

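This gives you access to the key variables of interest (including the opponent, the method of finish, and the round in which the fight ended) and produces the following output: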

Middleweight | Chris Weidman(c) def. Anderson Silva | TKO (leg injury) | 2 | 1:16
Women's Bantamweight | Ronda Rousey(c) def. Miesha Tate | Submission (armbar) | 3 | 0:58
Heavyweight | Travis Browne def. Josh Barnett | KO (elbows) | 1 | 1:00
Lightweight | Jim Miller def. Fabrício Camões | Submission (armbar) | 1 | 3:42
Catchweight (151.5 lb) | Dustin Poirier def. Diego Brandão | KO (punches) | 1 | 4:54
Middleweight | Uriah Hall def. Chris Leben | TKO (retirement) | 1 | 5:00
Lightweight | Michael Johnson def. Gleison Tibau | KO (punches) | 2 | 1:32
Featherweight | Dennis Siver vs. Manny Gamburyan | No Contest (overturned) | 3 | 5:00
Welterweight | John Howard def. Siyar Bahadurzada | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Welterweight | William Macário def. Bobby Voelker | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Featherweight | Robbie Peralta def. Estevan Payan | KO (punches) | 3 | 0:12
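
If you want to keep the fields for later analysis rather than print them, the same row loop can collect each fight into a dictionary. This is only a minimal sketch: SAMPLE_HTML is a hypothetical two-row excerpt shaped like the fight table (not the live page), and parse_fights and FIELDS are names made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt shaped like the fight table; not the live Wikipedia page.
SAMPLE_HTML = """
<table class="toccolours">
  <tr><th colspan="8">Main card</th></tr>
  <tr><td>Middleweight</td><td>Chris Weidman (c)</td><td>def.</td>
      <td>Anderson Silva</td><td>TKO (leg injury)</td><td>2</td>
      <td>1:16</td><td>[a]</td></tr>
  <tr><td>Heavyweight</td><td>Travis Browne</td><td>def.</td>
      <td>Josh Barnett</td><td>KO (elbows)</td><td>1</td>
      <td>1:00</td><td></td></tr>
</table>
"""

# Illustrative field names for the first seven cells of a fight row.
FIELDS = ("weight_class", "fighter", "result", "opponent",
          "method", "round", "time")


def parse_fights(html):
    """Return one dict per fight row found in the 'toccolours' table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "toccolours"})
    if table is None:  # Page layout changed or the table is absent
        return []

    fights = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Header and section rows contain no (or too few) <td> cells
        if len(cells) >= len(FIELDS):
            fights.append(dict(zip(FIELDS, cells)))
    return fights


fights = parse_fights(SAMPLE_HTML)
for fight in fights:
    print(fight["fighter"], fight["result"], fight["opponent"])
```

A list of dictionaries like this can be passed straight to pandas.DataFrame(fights) if you still want to do the analysis in Pandas.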

Note that you need to install two Python packages, Requests and BeautifulSoup, for the code above to work. For convenience, here is the command:

pip install -U requests beautifulsoup4 
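
A side note on the Pandas output in the question: the "..." there is how Pandas abbreviates a wide DataFrame when printing it, so the opponent, method, and round columns may well be present in the frame and only hidden by the display settings. The sketch below assumes that is the case. The fights frame is a hypothetical stand-in for tables[2], with made-up column names, so that the example runs without a network request.

```python
import pandas as pd

# Hypothetical stand-in for tables[2]; the real frame comes from pd.read_html.
fights = pd.DataFrame(
    {
        "Weight class": ["Middleweight", "Heavyweight"],
        "Fighter": ["Chris Weidman (c)", "Travis Browne"],
        "Result": ["def.", "def."],
        "Opponent": ["Anderson Silva", "Josh Barnett"],
        "Method": ["TKO (leg injury)", "KO (elbows)"],
        "Round": [2, 1],
        "Time": ["1:16", "1:00"],
    }
)

# With a low column limit, the middle columns are replaced by "...".
with pd.option_context("display.max_columns", 4):
    truncated_text = repr(fights)

# Lifting the limit (and widening the line) shows every column.
# option_context restores the previous settings when the block ends.
with pd.option_context("display.max_columns", None, "display.width", 200):
    full_text = repr(fights)

print(truncated_text)
print(full_text)
```

To change the setting for the whole session, pd.set_option("display.max_columns", None) does the same thing without the with block. Printing tables[2].columns is another quick way to check which columns were actually parsed.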

Comments

1 upvote Kanta 7/5/2023
This is awesome, thank you so much! I would upvote, but I need 15+ reputation :(