Asked by: Kanta · Asked: 7/5/2023 · Last edited by: Kanta · Updated: 7/5/2023 · Views: 33
Pandas not printing all of the table
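Q:
This is my first post, so I hope I haven't forgotten anything.
I'm trying to scrape all UFC events to look at certain fighter statistics, and I'm trying to use Pandas.
This is where my problem starts. When I import the site: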
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/UFC_168')
print(tables[2])
With this, I get the following output:
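                          Main card  ...
                       Weight class               Unnamed: 1_level_1  ...                             Time                            Notes
0                      Middleweight                Chris Weidman (c)  ...                             1:16                              [a]
1              Women's Bantamweight                 Ronda Rousey (c)  ...                             0:58                              [b]
2                       Heavyweight                    Travis Browne  ...                             1:00                              NaN
3                       Lightweight                       Jim Miller  ...                             3:42                              NaN
4            Catchweight (151.5 lb)                   Dustin Poirier  ...                             4:54                              NaN
5   Preliminary card (Fox Sports 1)  Preliminary card (Fox Sports 1)  ...  Preliminary card (Fox Sports 1)  Preliminary card (Fox Sports 1)
6                      Middleweight                       Uriah Hall  ...                             5:00                              [c]
7                       Lightweight                  Michael Johnson  ...                             1:32                              [d]
8                     Featherweight                     Dennis Siver  ...                             5:00                              [e]
9                      Welterweight                      John Howard  ...                             5:00                              NaN
10        Preliminary card (Online)        Preliminary card (Online)  ...        Preliminary card (Online)        Preliminary card (Online)
11                     Welterweight                  William Macário  ...                             5:00                              NaN
12                    Featherweight                   Robbie Peralta  ...                             0:12                              NaN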
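This output is missing 3 columns that are key to my research: the opponent, the method of finish, and the round in which the fight ended. If you have any idea how or why these parts are missing, please let me know. Thanks.
A:
Web scraping with Pandas is not as powerful as with BeautifulSoup (the best open-source scraper, imo), which also gives you finer control over each variable or piece of structured data you extract. I would therefore solve your problem with the following code: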
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/UFC_168'
response = requests.get(url)

# Parse HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table containing fight data
table = soup.find('table', {'class': 'toccolours'})

# Iterate through the rows in the table
for row in table.find_all('tr')[1:]:  # Skip the header row
    columns = row.find_all('td')
    # Indices 0..6 are read below, so require at least 7 cells;
    # rows with fewer cells (headers, card separators) are skipped
    if len(columns) >= 7:
        weight_class = columns[0].get_text(strip=True)
        fighter = columns[1].get_text(strip=True)
        rel = columns[2].get_text(strip=True)
        opponent = columns[3].get_text(strip=True)
        method = columns[4].get_text(strip=True)
        round_finished = columns[5].get_text(strip=True)
        time = columns[6].get_text(strip=True)
        print(f'{weight_class} | {fighter} {rel} {opponent} | {method} | {round_finished} | {time}')
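This gives you access to the key variables of interest (including the opponent, the method of finish, and the round in which the fight ended) and produces the following output: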
Middleweight | Chris Weidman(c) def. Anderson Silva | TKO (leg injury) | 2 | 1:16
Women's Bantamweight | Ronda Rousey(c) def. Miesha Tate | Submission (armbar) | 3 | 0:58
Heavyweight | Travis Browne def. Josh Barnett | KO (elbows) | 1 | 1:00
Lightweight | Jim Miller def. Fabrício Camões | Submission (armbar) | 1 | 3:42
Catchweight (151.5 lb) | Dustin Poirier def. Diego Brandão | KO (punches) | 1 | 4:54
Middleweight | Uriah Hall def. Chris Leben | TKO (retirement) | 1 | 5:00
Lightweight | Michael Johnson def. Gleison Tibau | KO (punches) | 2 | 1:32
Featherweight | Dennis Siver vs. Manny Gamburyan | No Contest (overturned) | 3 | 5:00
Welterweight | John Howard def. Siyar Bahadurzada | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Welterweight | William Macário def. Bobby Voelker | Decision (unanimous) (30–27, 30–27, 30–27) | 3 | 5:00
Featherweight | Robbie Peralta def. Estevan Payan | KO (punches) | 3 | 0:12
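If the goal is further analysis rather than printing, the same row filter can collect each fight into a dictionary instead. The following is only a sketch: the HTML fragment is a simplified, hand-written stand-in for the Wikipedia table (its structure is an assumption), and `FIELDS` and `parse_fights` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for the Wikipedia table (assumed structure):
# section rows use <th>, fight rows use <td>.
html = """
<table class="toccolours">
  <tr><th colspan="8">Main card</th></tr>
  <tr><td>Middleweight</td><td>Chris Weidman (c)</td><td>def.</td><td>Anderson Silva</td>
      <td>TKO (leg injury)</td><td>2</td><td>1:16</td><td></td></tr>
  <tr><th colspan="8">Preliminary card (Fox Sports 1)</th></tr>
  <tr><td>Middleweight</td><td>Uriah Hall</td><td>def.</td><td>Chris Leben</td>
      <td>TKO (retirement)</td><td>1</td><td>5:00</td><td></td></tr>
</table>
"""

# Hypothetical field names, in the same order as the cells read in the answer's code.
FIELDS = ["weight_class", "fighter", "result", "opponent", "method", "round", "time"]


def parse_fights(html_text):
    """Return one dict per fight row; rows with too few <td> cells are skipped."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", class_="toccolours")
    fights = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= len(FIELDS):
            # zip stops after the last field, so any extra cells (e.g. notes) are ignored
            fights.append(dict(zip(FIELDS, cells)))
    return fights


fights = parse_fights(html)
for fight in fights:
    print(fight)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(fights)` if you want to continue in Pandas.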
Note that you must install 2 Python packages, Requests and BeautifulSoup, for the above code to work. For convenience, here is the command:
pip install -U requests beautifulsoup4
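A side note on the original symptom: the `...` in the question's printed output is how pandas marks columns it has hidden to fit the display, so the columns are most likely still present in `tables[2]` and only the printout is abbreviated. A minimal sketch of this behavior, using a small hand-built DataFrame as a stand-in for the scraped table (the column names here are hypothetical; the values are copied from the output above):

```python
import pandas as pd

# Hypothetical stand-in for tables[2]; column names are assumed for illustration.
df = pd.DataFrame(
    {
        "Weight class": ["Middleweight", "Heavyweight"],
        "Fighter": ["Chris Weidman (c)", "Travis Browne"],
        "Result": ["def.", "def."],
        "Opponent": ["Anderson Silva", "Josh Barnett"],
        "Method": ["TKO (leg injury)", "KO (elbows)"],
        "Round": [2, 1],
        "Time": ["1:16", "1:00"],
    }
)

# With a low column limit, the repr hides the middle columns behind "..."
with pd.option_context("display.max_columns", 4):
    truncated = repr(df)

# With no column limit and a generous width, every column is printed
with pd.option_context("display.max_columns", None, "display.width", 200):
    full = repr(df)

print(truncated)
print(full)
```

With the real table, the same idea would be to call `pd.set_option('display.max_columns', None)` before `print(tables[2])`, or to use `print(tables[2].to_string())`.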