Extracting hrefs or specific tags from a page

Asked by: Paul Corcoran  Asked: 3/8/2023  Last edited by: GeorgePaul Corcoran  Updated: 3/8/2023  Views: 41

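Q:

I have been trying multiple approaches, but this site is proving difficult to scrape with bs4.

I am trying to extract the href values shown in the screenshot for one of the matches. I'd like to extract all the hrefs on the page into a list. Nothing is being returned; the ideal result would be a list containing all the hrefs, e.g. //www.premierleague.com/match/74911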


import warnings
import numpy as np
from datetime import datetime
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings('ignore')

# Set up an empty list to store scraped data; errors collects any matches that fail to scrape.
dataframe = []
errors = []

url = "https://www.premierleague.com/results"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

matches = {}

soup.find_all("div", {"class": "competitionContainer"})

python web-scraping beautifulsoup

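A:

1 upvote  Andrej Kesely  3/8/2023  #1

The data you see on the page is loaded from an external source via JavaScript, so it is not in the HTML that requests receives. (Open the web developer tools in your browser, go to the Network tab, and start scrolling down the page; you should see the Ajax requests there.) You can query that source directly: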

import json
import requests

api_url = "https://footballapi.pulselive.com/football/fixtures"

params = {
    "comps": "1",
    "compSeasons": "489",
    "teams": "127,1,2,130,131,4,6,7,34,9,26,10,11,12,23,15,20,21,25,38",
    "page": "1",
    "pageSize": "40",
    "sort": "desc",
    "statuses": "C",
    "altIds": "true",
}

headers = {
    'Origin': 'https://www.premierleague.com',
}

page = 0
while True:
    params['page'] = page
    data = requests.get(api_url, params=params, headers=headers).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for c in data['content']:
        team1, team2 = c['teams'][0]['team']['name'], c['teams'][1]['team']['name']
        print(f'{team1:<30} {team2:<30} https://www.premierleague.com/match/{int(c["id"])}')

    # Pages appear to be zero-indexed, so the last valid page is numPages - 1.
    if page >= data['pageInfo']['numPages'] - 1:
        break

    page += 1

Prints:


...

Chelsea                        Tottenham Hotspur              https://www.premierleague.com/match/74925
Nottingham Forest              West Ham United                https://www.premierleague.com/match/74928
Brentford                      Manchester United              https://www.premierleague.com/match/74923
Arsenal                        Leicester City                 https://www.premierleague.com/match/74921
Brighton & Hove Albion         Newcastle United               https://www.premierleague.com/match/74924
Manchester City                Bournemouth                    https://www.premierleague.com/match/74927
Southampton                    Leeds United                   https://www.premierleague.com/match/74929
Wolverhampton Wanderers        Fulham                         https://www.premierleague.com/match/74930
Aston Villa                    Everton                        https://www.premierleague.com/match/74922
West Ham United                Manchester City                https://www.premierleague.com/match/74920
Leicester City                 Brentford                      https://www.premierleague.com/match/74916
Manchester United              Brighton & Hove Albion         https://www.premierleague.com/match/74919
Everton                        Chelsea                        https://www.premierleague.com/match/74913
Bournemouth                    Aston Villa                    https://www.premierleague.com/match/74912
Leeds United                   Wolverhampton Wanderers        https://www.premierleague.com/match/74915
Newcastle United               Nottingham Forest              https://www.premierleague.com/match/74917
Tottenham Hotspur              Southampton                    https://www.premierleague.com/match/74918
Fulham                         Liverpool                      https://www.premierleague.com/match/74914
Crystal Palace                 Arsenal                        https://www.premierleague.com/match/74911
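
Since the question asks for a list of hrefs rather than printed lines, the same loop can be sketched so that it collects the URLs. This is only an illustration: `match_urls`, `collect_match_urls`, and the `fetch_page` callable are hypothetical names, and the sketch assumes the JSON shape used in the code above (`content` entries with a numeric `id`, plus `pageInfo.numPages`) and zero-indexed pages.

```python
from typing import Callable

MATCH_URL = "https://www.premierleague.com/match/{}"


def match_urls(payload: dict) -> list[str]:
    """Build match URLs from one page of the fixtures JSON.

    Assumes payload["content"] is a list of fixtures, each with a numeric "id".
    """
    return [MATCH_URL.format(int(fixture["id"])) for fixture in payload["content"]]


def collect_match_urls(fetch_page: Callable[[int], dict]) -> list[str]:
    """Walk every page and return all match URLs as a single list.

    fetch_page is a hypothetical callable: page number in, decoded JSON out.
    Pages are assumed to be zero-indexed, matching the loop that starts at 0.
    """
    urls: list[str] = []
    page = 0
    while True:
        payload = fetch_page(page)
        urls.extend(match_urls(payload))
        page += 1
        # Stop on an empty page or once the reported page count is reached.
        if not payload["content"] or page >= payload["pageInfo"]["numPages"]:
            break
    return urls
```

To use it with the request shown above, pass something like `lambda page: requests.get(api_url, params={**params, "page": page}, headers=headers).json()` as `fetch_page`. Keeping the HTTP call outside the function makes the paging logic easy to check with canned data.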