Asked by: Yas · Asked: 11/7/2023 · Updated: 11/7/2023 · Views: 86
Web Scraping from Wikipedia into Pandas
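Q:
I am trying to get the row data from a Wikipedia page. So far I have been able to extract the column headers, but extracting the row data does not work. Here is what I have working so far: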
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/List_of_most-followed_Instagram_accounts'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html')
celeb_titles = soup.find('tr')
celeb_table_titles = [title.text.strip() for title in celeb_titles] #list comp. to get titles in a list. needed help for this
while ('' in celeb_table_titles): #remove empty strings
    celeb_table_titles.remove('')
import pandas as pd
df = pd.DataFrame(columns = celeb_table_titles)
column_data = soup.find_all('th')
This is where I run into problems:
for row in column_data:
    row_data = (row.find_all('td'))
    individual_row_data = [data.text.strip() for data in row_data]
    print(individual_row_data)
I expect to get the data for each row in a list. The list should contain the same values as the Wikipedia table.
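A note on why the loop above prints only empty lists: `soup.find_all('th')` returns header cells, and a `<th>` element does not contain any `<td>` elements, so `row.find_all('td')` comes back empty every time. Iterating over the `<tr>` rows instead gives one list per row. Below is a minimal sketch on a small inline table; the HTML snippet is made up for illustration and only mimics the layout of the Wikipedia table, with sample values taken from the outputs shown in the answers.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the example runs offline.
html = """
<table>
  <tr><th>Rank</th><th>Username</th><th>Owner</th></tr>
  <tr><td>1</td><td>@instagram</td><td>Instagram</td></tr>
  <tr><td>2</td><td>@cristiano</td><td>Cristiano Ronaldo</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Header cells live in <th>; data cells live in <td>.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td>, so it yields an empty list
        rows.append(cells)

print(headers)
print(rows)
```

On the real page the same loop would run over the first table of the parsed response, and the resulting lists could be passed to `pd.DataFrame(rows, columns=headers)`, provided every row has as many cells as there are headers.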
Answers:
1 upvote
HedgeHog
11/7/2023
#1
Since you are using pandas anyway, you can use pandas.read_html() to scrape the table:
import pandas as pd
pd.read_html('https://en.wikipedia.org/wiki/List_of_most-followed_Instagram_accounts')[0][:-1].dropna(axis=1, how='all')
 | Rank | Username | Owner | Followers (millions) | Profession/Activity | Country |
---|---|---|---|---|---|---|
0 | 1 | @instagram | Instagram | 661 | Social media platform | United States |
1 | 2 | @cristiano | Cristiano Ronaldo | 609 | Footballer | Portugal |
2 | 3 | @leomessi | Lionel Messi | 490 | Footballer | Argentina |
... | | | | | | |
46 | 47 | @snoopdogg | Snoop Dogg | 82.3 | Musician | United States |
47 | 48 | @jennierubyjane | Jennie | 81.7 | Musician | South Korea |
48 | 49 | @khaby00 | Khaby Lame | 80.9 | Media personality | Italy Senegal |
49 | 50 | @narendramodi | Narendra Modi | 80.5 | Prime Minister of India | India |
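Since the question asks for the rows as a list, the DataFrame can be converted once it has been loaded. A small sketch, assuming a hand-built DataFrame that stands in for the result of `pd.read_html(...)[0]` (the values are copied from the table above):

```python
import pandas as pd

# Stand-in for the DataFrame returned by pd.read_html(...)[0].
df = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Username": ["@instagram", "@cristiano", "@leomessi"],
        "Owner": ["Instagram", "Cristiano Ronaldo", "Lionel Messi"],
    }
)

# One inner list per table row, in column order.
rows = df.values.tolist()
print(rows)
```

If only one column is needed, `df["Username"].tolist()` gives a flat list instead.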
0 upvotes
Maria K
11/7/2023
#2
You can use pandas.read_html (here page is the requests response from the question's code):
import pandas as pd
df = pd.read_html(page.text)[0]
print(df.head())
Output:
Rank Username Owner Brand account Followers (millions) \
0 1 @instagram Instagram NaN 661
1 2 @cristiano Cristiano Ronaldo NaN 609
2 3 @leomessi Lionel Messi NaN 490
3 4 @selenagomez Selena Gomez NaN 430
4 5 @kyliejenner Kylie Jenner NaN 399
Profession/Activity Country
0 Social media platform United States
1 Footballer Portugal
2 Footballer Argentina
3 Musician and actress United States
4 Media personality United States
2 upvotes
Andrej Kesely
11/7/2023
#3
As @HedgeHog said, just use pandas.read_html. But here is a version that also corrects the Brand account column:
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_most-followed_Instagram_accounts"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
# correct Brand Account column for pd.read_html:
for t in soup.select('[title="Yes"]'):
    t.replace_with("Yes")
df = pd.read_html(StringIO(str(soup)))[0].fillna("")
print(df)
Prints:
Rank Username Owner Brand account Followers (millions) Profession/Activity Country
0 1 @instagram Instagram Yes 661 Social media platform United States
1 2 @cristiano Cristiano Ronaldo 609 Footballer Portugal
2 3 @leomessi Lionel Messi 490 Footballer Argentina
3 4 @selenagomez Selena Gomez 430 Musician and actress United States
4 5 @kyliejenner Kylie Jenner 399 Media personality United States
5 6 @therock Dwayne Johnson 391 Actor and professional wrestler United States
6 7 @arianagrande Ariana Grande 380 Musician and actress United States
7 8 @kimkardashian Kim Kardashian 364 Media personality United States
8 9 @beyonce Beyoncé 318 Musician United States
9 10 @khloekardashian Khloé Kardashian 311 Media personality United States
10 11 @nike Nike Yes 305 Sportswear multinational United States
11 12 @kendalljenner Kendall Jenner 294 Media personality United States
12 13 @justinbieber Justin Bieber 293 Musician Canada
13 14 @natgeo National Geographic Yes 283 Magazine United States
14 15 @taylorswift Taylor Swift 274 Musician United States
15 16 @virat.kohli Virat Kohli 261 Cricketer India
16 17 @jlo Jennifer Lopez 252 Musician and actress United States
17 18 @nickiminaj Nicki Minaj 227 Musician Trinidad and Tobago United States
18 19 @kourtneykardash Kourtney Kardashian 224 Media personality United States
19 20 @mileycyrus Miley Cyrus 215 Musician and actress United States
20 21 @neymarjr Neymar 215 Footballer Brazil
21 22 @katyperry Katy Perry 206 Musician United States
22 23 @zendaya Zendaya 185 Actress and singer United States
23 24 @kevinhart4real Kevin Hart 179 Comedian and actor United States
24 25 @iamcardib Cardi B 168 Musician and actress United States
25 26 @kingjames LeBron James 158 Basketball player United States
26 27 @ddlovato Demi Lovato 157 Musician and actress United States
27 28 @badgalriri Rihanna 152 Musician Barbados
28 29 @realmadrid Real Madrid CF Yes 148 Football club Spain
29 30 @chrisbrownofficial Chris Brown 144 Musician United States
30 31 @champagnepapi Drake 143 Musician Canada
31 32 @ellendegeneres Ellen DeGeneres 139 Comedian and television host United States
32 33 @fcbarcelona FC Barcelona Yes 124 Football club Spain
33 34 @billieeilish Billie Eilish 110 Musician United States
34 35 @championsleague UEFA Champions League Yes 110 Club football competition Europe
35 36 @gal_gadot Gal Gadot 109 Actress Israel
36 37 @k.mbappe Kylian Mbappé 109 Footballer France
37 38 @vindiesel Vin Diesel 101 Actor United States
38 39 @lalalalisa_m Lisa 98.3 Musician Thailand
39 40 @nasa NASA Yes 96.6 Space agency United States
40 41 @priyankachopra Priyanka Chopra 89.7 Actress India
41 42 @shakira Shakira 89.5 Musician Colombia
42 43 @dualipa Dua Lipa 88.7 Musician United Kingdom Albania
43 44 @davidbeckham David Beckham 84.6 Former footballer United Kingdom
44 45 @shraddhakapoor Shraddha Kapoor 83.9 Actress India
45 46 @nba NBA Yes 83.6 Professional basketball league United States Canada
46 47 @snoopdogg Snoop Dogg 82.3 Musician United States
47 48 @jennierubyjane Jennie 81.7 Musician South Korea
48 49 @khaby00 Khaby Lame 80.9 Media personality Italy Senegal
49 50 @narendramodi Narendra Modi 80.5 Prime Minister of India India
50 As of November 2023 As of November 2023 As of November 2023 As of November 2023 As of November 2023 As of November 2023 As of November 2023
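The last row of the output above (index 50) is the table's "As of November 2023" footer rather than an account. The first answer drops it by slicing with `[:-1]`; another option is to keep only the rows whose Rank parses as a number. A sketch on a small hand-built DataFrame, which is a stand-in for the scraped one with values taken from the output above:

```python
import pandas as pd

# Stand-in for the scraped table, including the footer row.
df = pd.DataFrame(
    {
        "Rank": ["1", "2", "As of November 2023"],
        "Username": ["@instagram", "@cristiano", "As of November 2023"],
        "Followers (millions)": ["661", "609", "As of November 2023"],
    }
)

# Non-numeric ranks become NaN, which marks the footer row.
rank = pd.to_numeric(df["Rank"], errors="coerce")
clean = df[rank.notna()].copy()

# With the footer gone, the numeric columns can be given numeric dtypes.
clean["Rank"] = clean["Rank"].astype(int)
clean["Followers (millions)"] = pd.to_numeric(clean["Followers (millions)"])
print(clean)
```

Filtering on the column content does not depend on the footer being the last row, which may matter if the page layout changes.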
Comments
Why are you looking for an empty string with '' in celeb_table_titles?
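On the empty-string check: iterating over a `<tr>` tag typically also yields the whitespace between the cells, which becomes `''` after `strip()`, hence the cleanup loop in the question. A list comprehension does the same filtering in a single pass. A sketch with made-up input values:

```python
# Example of what the stripped titles can look like before cleanup
# (illustrative values, not actual scraped output).
raw_titles = ["", "Rank", "", "Username", "", "Owner", ""]

# Keep only the non-empty strings.
celeb_table_titles = [title for title in raw_titles if title]
print(celeb_table_titles)
```

Selecting the header cells directly, for example with `find_all('th')` on the header row, would avoid producing the empty strings in the first place.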