Asked by: Newbie  Asked on: 11/16/2023  Last edited by: Newbie  Updated: 11/16/2023  Views: 16
How to solve server error 500 while web scraping using Beautiful Soup?
Q:
import requests
from bs4 import BeautifulSoup
import os
import time

# Define the URL of the webpage
url = 'https://mahasldc.in/home.php/weekly-deviation-statements/'

# Create a session to maintain state across requests
session = requests.Session()

# Send an initial request to get the page content
response = session.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    yr_select = soup.find('select', id='yr')
    mon_select = soup.find('select', id='mon')
    wk_select = soup.find('select', id='wk')
    rep_select = soup.find('select', id='rep')

    download_dir = "downloaded_files"
    os.makedirs(download_dir, exist_ok=True)
    not_available_files = []

    # Maximum number of retries
    max_retries = 3

    for yr_option in yr_select.find_all('option'):
        for mon_option in mon_select.find_all('option'):
            for wk_option in wk_select.find_all('option'):
                for rep_option in rep_select.find_all('option'):
                    retry_count = 0
                    while retry_count < max_retries:
                        try:
                            # Note: these assignments only edit the parsed HTML
                            # tree in memory; they have no effect on the POST below.
                            yr_select['value'] = yr_option['value']
                            mon_select['value'] = mon_option['value']
                            wk_select['value'] = wk_option['value']
                            rep_select['value'] = rep_option['value']

                            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0'}
                            response = session.post(url, data={'yr': yr_option['value'],
                                                               'mon': mon_option['value'],
                                                               'wk': wk_option['value'],
                                                               'rep': rep_option['value']}, headers=headers)
                            if response.status_code == 200:
                                if "Not Found" not in response.text:
                                    xml_data = BeautifulSoup(response.text, 'html.parser').find('div', class_='data').text
                                    filename = f"{yr_option['value']}_{mon_option['value']}_{wk_option['value']}_{rep_option['value']}.xml"
                                    with open(os.path.join(download_dir, filename), 'w', encoding='utf-8') as file:
                                        file.write(xml_data)
                                    print(f"Downloaded {filename}")
                                else:
                                    not_available_files.append(f"yr={yr_option['value']}, mon={mon_option['value']}, wk={wk_option['value']}, rep={rep_option['value']}")
                            else:
                                print(f"Failed to download data for yr={yr_option['value']}, mon={mon_option['value']}, wk={wk_option['value']}, rep={rep_option['value']}")
                        except requests.RequestException:
                            print(f"Request failed. Retrying... ({retry_count + 1}/{max_retries})")
                            time.sleep(6)  # Add a short delay before retrying
                            retry_count += 1
                            continue
                        break  # Break out of the retry loop if successful

    print("Not available files:")
    for file_info in not_available_files:
        print(file_info)
else:
    print(f"Failed to access the webpage. Status code: {response.status_code}")
I tried the retry-after-a-delay approach, but it doesn't work and it keeps showing the following error:

Failed to access the webpage. Status code: 500
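For reference, the same retry-after-a-delay idea can also be written with urllib3's built-in Retry mounted on the session, which retries the 500 status itself rather than only network exceptions. This is only a minimal sketch; the retry count, backoff factor, and status list below are example values:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://mahasldc.in/home.php/weekly-deviation-statements/'

# Mount a transport adapter that retries automatically.
session = requests.Session()
retries = Retry(
    total=3,                                # up to 3 retries per request
    backoff_factor=2,                       # sleep 0s, 2s, 4s between attempts
    status_forcelist=[500, 502, 503, 504],  # retry server errors too
    allowed_methods=["GET", "POST"],        # retry POSTs as well as GETs
    raise_on_status=False,                  # return the last response instead of raising
)
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

response = session.get(url, timeout=30)
print(response.status_code)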
I have also tried using Chromedriver. I tried using headers to get past the error, but that didn't work either. Is there any way to solve this problem, or should I use a different approach to scrape this site?
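On the headers attempt: the User-Agent in the code above identifies itself as WindowsPowerShell, which some servers reject outright. A fuller, browser-like header set would look something like the sketch below; the header values and the form values are only illustrative, and the most reliable source is the failing request as captured in the browser's DevTools Network tab:

import requests

url = 'https://mahasldc.in/home.php/weekly-deviation-statements/'

# Headers copied from a real browser request are more likely to be accepted.
# These values are illustrative, not the site's required set.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/119.0.0.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': url,
}

# Example form values (one hypothetical combination from the page's dropdowns).
payload = {'yr': '2023', 'mon': '11', 'wk': '1', 'rep': '1'}

with requests.Session() as session:
    session.get(url, headers=headers, timeout=30)  # pick up any cookies first
    response = session.post(url, data=payload, headers=headers, timeout=30)
    print(response.status_code)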
A: No answers yet