
How to solve server error 500 while web scraping using beautiful soup?

Asked by: Newbie  Asked: 11/16/2023  Last edited by: Newbie  Updated: 11/16/2023  Views: 16

Q:

import requests
from bs4 import BeautifulSoup
import os
import time

# Define the URL of the webpage
url = 'https://mahasldc.in/home.php/weekly-deviation-statements/'

# Create a session to maintain state across requests
session = requests.Session()

# Send browser-like headers on every request; some servers answer
# 500/403 to clients they do not recognize as browsers
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})

# Send an initial request to get the page content
response = session.get(url, timeout=30)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    yr_select = soup.find('select', id='yr')
    mon_select = soup.find('select', id='mon')
    wk_select = soup.find('select', id='wk')
    rep_select = soup.find('select', id='rep')

    download_dir = "downloaded_files"
    os.makedirs(download_dir, exist_ok=True)

    not_available_files = []

    # Maximum number of retries
    max_retries = 3

    for yr_option in yr_select.find_all('option'):
        for mon_option in mon_select.find_all('option'):
            for wk_option in wk_select.find_all('option'):
                for rep_option in rep_select.find_all('option'):
                    retry_count = 0
                    while retry_count < max_retries:
                        try:
                            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0'}
                            response = session.post(url, data={'yr': yr_option['value'],
                                    'mon': mon_option['value'],
                                    'wk': wk_option['value'],
                                    'rep': rep_option['value']}, headers=headers)

                            if response.status_code == 200:
                                # Guard against a missing <div class="data"> so .text
                                # does not raise AttributeError on None
                                data_div = BeautifulSoup(response.text, 'html.parser').find('div', class_='data')
                                if "Not Found" not in response.text and data_div is not None:
                                    xml_data = data_div.text

                                    filename = f"{yr_option['value']}_{mon_option['value']}_{wk_option['value']}_{rep_option['value']}.xml"
                                    with open(os.path.join(download_dir, filename), 'w', encoding='utf-8') as file:
                                        file.write(xml_data)

                                    print(f"Downloaded {filename}")
                                else:
                                    not_available_files.append(f"yr={yr_option['value']}, mon={mon_option['value']}, wk={wk_option['value']}, rep={rep_option['value']}")
                            else:
                                print(f"Got status {response.status_code} for yr={yr_option['value']}, mon={mon_option['value']}, wk={wk_option['value']}, rep={rep_option['value']}")
                                # Raise so 5xx responses go through the retry path below
                                response.raise_for_status()
                        except requests.RequestException as e:
                            print(f"Request failed. Retrying... ({retry_count + 1}/{max_retries})")
                            time.sleep(6)  # Add a short delay before retrying
                            retry_count += 1
                            continue
                        break  # Break out of the retry loop if successful

    print("Not available files:")
    for file_info in not_available_files:
        print(file_info)

else:
    print(f"Failed to access the webpage. Status code: {response.status_code}")

I tried retrying after a delay, but it doesn't work and I keep getting the following error:

Failed to access the webpage. Status code: 500

I have also tried Chromedriver, and I tried sending custom headers to get past the error, but neither worked. Is there any way to fix this, or should I use a different approach to scrape the site?
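For reference, the retry-with-headers setup I have been experimenting with looks roughly like this. It is only a sketch, not verified against this site: it lets `requests` retry 5xx responses automatically via `urllib3`'s `Retry`, and the Chrome-style `User-Agent` string is just my guess at a "browser-like" value:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff=2):
    """Build a session that retries transient server errors (500/502/503/504)."""
    session = requests.Session()
    # Browser-like headers; some servers return 500 to unfamiliar clients
    session.headers.update({
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/119.0 Safari/537.36'),
        'Accept': 'text/html,application/xhtml+xml',
    })
    retry = Retry(total=total_retries,
                  backoff_factor=backoff,           # 0s, 2s, 4s, ... between tries
                  status_forcelist=[500, 502, 503, 504],
                  allowed_methods=["GET", "POST"])  # POST is not retried by default
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session


session = make_session()
# response = session.get('https://mahasldc.in/home.php/weekly-deviation-statements/', timeout=30)
```

With this, a 500 raises `requests.exceptions.RetryError` only after all retries are exhausted, so the calling code no longer needs its own retry loop.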

web-scraping beautifulsoup error-handling

Comments


A: No answers yet