Asked by: fschn · Asked: 9/18/2023 · Last edited by: fschn · Updated: 9/18/2023 · Views: 28
scrapy requests for heute.at always with 403 responses
Q:
I am trying to scrape www.heute.at with Scrapy for a personal data science project. I am also using scrapy-rotating-proxies together with the Tor proxies described below. However, I only get 403 responses. I also used the Tor Browser to check whether it is possible to reach the site via Tor at all (yes, it is!) and then tried to mimic the Tor Browser's requests (copying its headers into Scrapy), but without success. Please see my setup in detail below. Any help or clue is greatly appreciated:
403 https://www.heute.at/>: HTTP status code is not handled or not allowed
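Side note on narrowing this down: a quick check outside Scrapy can compare a direct request with a proxied one, to tell whether the 403 is tied to the Tor exit IPs or to the request fingerprint. This is only a minimal sketch, assuming the requests package, with localhost:8118 standing in for one of the Tor HTTP proxies defined further below:

import requests

# headers roughly matching the Tor Browser request shown further below
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
# hypothetical: one of the dockerized Tor HTTP proxies (see docker-compose.yml below)
tor_proxy = {"http": "http://localhost:8118", "https": "http://localhost:8118"}

# if only the proxied request returns 403, the exit IP is the problem, not the headers
for label, proxies in (("direct", None), ("via tor", tor_proxy)):
    r = requests.get("https://www.heute.at/", headers=headers, proxies=proxies, timeout=30)
    print(label, r.status_code)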
This is my spider, which is supposed to simply collect the links to all articles:
import scrapy
from scrapy.loader import ItemLoader
from HEUTE.items import heuteLinkItems
from dotenv import load_dotenv
class heuteLinks(scrapy.Spider):
    name = "heuteLinks"
    start_urls = ['https://www.heute.at/']

    # parses data
    def parse(self, response):
        for item in response.xpath('//*[contains(@class, "link")]/@href'):
            zacken = ItemLoader(item=heuteLinkItems(), selector=item)
            zacken.add_value('mainPage', response.url)
            zacken.add_value('link', item.get())
            yield zacken.load_item()
        for link in response.xpath('//*[contains(@class, "mainmenu")]//@href'):
            url = link.get()
            yield scrapy.Request(url, self.parse2)

    # parses data passed on from first parse
    def parse2(self, response):
        for item in response.xpath('//*[contains(@class, "link")]/@href'):
            zacken = ItemLoader(item=heuteLinkItems(), selector=item)
            zacken.add_value('mainPage', response.url)
            zacken.add_value('link', item.get())
            yield zacken.load_item()
This is the items.py:
import scrapy
from itemloaders.processors import TakeFirst, Join, MapCompose
from scrapy.exceptions import DropItem
def urlMaker(x):
    if '/s/' in x:
        return 'https://www.heute.at' + x

class heuteLinkItems(scrapy.Item):
    mainPage = scrapy.Field(output_processor=TakeFirst(),)
    link = scrapy.Field(input_processor=MapCompose(urlMaker), output_processor=TakeFirst())
This is the settings.py:
from dotenv import load_dotenv
import os
import random
load_dotenv("../SETUP/.env")
ip = os.environ.get("server_domain")
BOT_NAME = "HEUTE"
SPIDER_MODULES = ["HEUTE.spiders"]
NEWSPIDER_MODULE = "HEUTE.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "HEUTE (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = random.randint(1,3)
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    'Cookie': 'ioam2018=00014b3f0e4ceb55c65081931:1725355954229:1695029554229:.heute.at:9:at_w_atheute:RedCont/Homepage/Homepage:noevent:1695047962431:g96nzp; dakt_2_uuid=76c9c244b122d37b4bfc4089ca8207a6; dakt_2_uuid_ts=1695029555113; dakt_2_version=2.1.61; _pbjs_userid_consent_data=3524755945110770; __gads=ID=a7625cd4974c024b:T=1695029556:RT=1695047964:S=ALNI_MYz08UbrntABhw-fNYFwC0Fve4kXQ; __gpi=UID=00000c782856d0ce:T=1695029556:RT=1695047964:S=ALNI_MZC5e8mon2kgCOPwmy8suXyIFzxEg; cto_bundle=MiDme19ZaUNLcUdlY0s1RUtYMG8lMkZCdll5Nkd4QXhvZXVvaCUyRml2cHAlMkIlMkZjUExoZnJTS3lWejMxUnNmT3hwYVNWcm1uMCUyRk8wVGhqREYySjdURjVmNHZ1bnNnJTJCcVZ1JTJCeDhFSWNtV1QxQSUyQldYMVY2dGFxNWp2MldvZ2g4aTElMkZJM2pnJTJCQlBz; dakt_2_session_id=1171e864c3d2baf83d6a6e6fad954d06',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'TE': 'trailers'
}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
ROTATING_PROXY_LIST = [
    f'{ip}:8118',
    f'{ip}:8119',
    f'{ip}:8120'
]
ROTATING_PROXY_BAN_POLICY = 'HEUTE.policy.BanPolicy'
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0"
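A side note on the delay: DOWNLOAD_DELAY = random.randint(1,3) is evaluated once at import time, so the delay is fixed for the whole crawl (the log below shows 'DOWNLOAD_DELAY': 1). Scrapy can jitter the delay per request on its own; a sketch of how that could be configured instead:

DOWNLOAD_DELAY = 2
# Scrapy's built-in per-request jitter: waits between 0.5x and 1.5x DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True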
When running scrapy crawl heuteLinks, all I get are 403 responses:
403 https://www.heute.at/>: HTTP status code is not handled or not allowed
2023-09-18 13:16:34 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'BOT_NAME': 'HEUTE',
'CONCURRENT_REQUESTS': 1,
'DOWNLOAD_DELAY': 1,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'HEUTE.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['HEUTE.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 '
'Firefox/102.0'}
2023-09-18 13:16:34 [asyncio] DEBUG: Using selector: EpollSelector
2023-09-18 13:16:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-09-18 13:16:34 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-09-18 13:16:34 [scrapy.extensions.telnet] INFO: Telnet Password: 825c4fdec07d4a54
2023-09-18 13:16:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2023-09-18 13:16:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'rotating_proxies.middlewares.RotatingProxyMiddleware',
'rotating_proxies.middlewares.BanDetectionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-09-18 13:16:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-09-18 13:16:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-09-18 13:16:34 [scrapy.core.engine] INFO: Spider opened
2023-09-18 13:16:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-18 13:16:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-18 13:16:34 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 10, reanimated: 0, mean backoff time: 0s)
2023-09-18 13:16:35 [stem] DEBUG: GETCONF __owningcontrollerprocess (runtime: 0.0003)
2023-09-18 13:16:35 [rotating_proxies.expire] DEBUG: Proxy <http://my.proxy.link:8123> is DEAD
2023-09-18 13:16:35 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.heute.at/> with another proxy (failed 1 times, max retries: 5)
2023-09-18 13:16:38 [rotating_proxies.expire] DEBUG: Proxy <http://my.proxy.link:8124> is DEAD
2023-09-18 13:16:38 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.heute.at/> with another proxy (failed 2 times, max retries: 5)
2023-09-18 13:16:44 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2023-09-18 13:16:45 [rotating_proxies.expire] DEBUG: Proxy <http://my.proxy.link:8118> is DEAD
2023-09-18 13:16:45 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.heute.at/> with another proxy (failed 3 times, max retries: 5)
2023-09-18 13:16:54 [rotating_proxies.expire] DEBUG: Proxy <http://my.proxy.link:8126> is DEAD
2023-09-18 13:16:54 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.heute.at/> with another proxy (failed 4 times, max retries: 5)
2023-09-18 13:16:59 [rotating_proxies.expire] DEBUG: Proxy <http://my.proxy.link:8123> is DEAD
2023-09-18 13:16:59 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.heute.at/> with another proxy (failed 5 times, max retries: 5)
2023-09-18 13:17:04 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 4, unchecked: 6, reanimated: 0, mean backoff time: 188s)
2023-09-18 13:17:06 [rotating_proxies.expire] DEBUG: Proxy <http://my.proxy.link:8127> is DEAD
2023-09-18 13:17:06 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET https://www.heute.at/> (failed 6 times with different proxies)
2023-09-18 13:17:06 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.heute.at/> (referer: None)
2023-09-18 13:17:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.heute.at/>: HTTP status code is not handled or not allowed
2023-09-18 13:17:06 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-18 13:17:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/status/403': 6,
'downloader/request_bytes': 9396,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 32301,
'downloader/response_count': 6,
'downloader/response_status_count/403': 6,
'elapsed_time_seconds': 31.850255,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 9, 18, 11, 17, 6, 707687),
'httpcompression/response_bytes': 6444,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 18,
'log_count/INFO': 13,
'memusage/max': 73097216,
'memusage/startup': 73097216,
'proxies/dead': 5,
'proxies/mean_backoff': 188.23789066016286,
'proxies/reanimated': 0,
'proxies/unchecked': 6,
'response_received_count': 1,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'spider_name': 'heuteLinks',
'start_time': datetime.datetime(2023, 9, 18, 11, 16, 34, 857432),
'urls_failed': '',
'urls_requested': ''}
2023-09-18 13:17:06 [scrapy.core.engine] INFO: Spider closed (finished)
This is the custom ban policy for scrapy-rotating-proxies, which after a ban signals the Tor proxy to switch its circuit / exit node:
import scrapy
from rotating_proxies.policy import BanDetectionPolicy
from stem import Signal
from stem.control import Controller
import stem.util
from dotenv import load_dotenv
import os
import socket
load_dotenv("../SETUP/.env")
class BanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        ban = super(BanPolicy, self).response_is_ban(request, response)
        address = socket.gethostbyname(os.environ.get('server_domain'))  # getting proxy ip
        port = int(os.environ.get(f"torproxy_{request.meta.get('proxy')[-4:]}").split(",")[1])  # getting proxy control port
        with Controller.from_port(address=address, port=port) as controller:  # connecting to proxy
            controller.authenticate(os.environ.get("torproxy_controller_pass"))  # authenticating
            stem.util.log.get_logger().propagate = False  # disable logging, as stem's noisy info log pollutes the scrapy log; currently a workaround based on: https://github.com/torproject/stem/issues/112#
            controller.signal(Signal.NEWNYM)  # telling the proxy to change circuit / endpoint ip
            controller.close()
        return ban
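To double-check that the NEWNYM signal actually rotates the exit node, a sketch along these lines compares the exit IP before and after the signal (proxy address, control port, and password here are placeholders for my .env values):

import requests
from stem import Signal
from stem.control import Controller

PROXY = {"https": "http://localhost:8118"}  # placeholder HTTP proxy endpoint

def exit_ip():
    # httpbin echoes the IP the request arrives from, i.e. the current exit node
    return requests.get("https://httpbin.org/ip", proxies=PROXY, timeout=30).json()["origin"]

before = exit_ip()
with Controller.from_port(address="localhost", port=9051) as controller:  # placeholder control port
    controller.authenticate("torproxy_controller_pass_value")  # placeholder password
    controller.signal(Signal.NEWNYM)
# note: Tor rate-limits NEWNYM, so the new circuit may take a few seconds to appear
print(before, "->", exit_ip())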
When deactivating the rotating proxies in the settings:
DOWNLOADER_MIDDLEWARES = {
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
everything works fine: Scrapy accesses the site and scrapes items without any trouble.
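That points at the exit IPs rather than at Scrapy itself: Tor exit nodes are public, and many sites block the whole list. A sketch to check the current exit against the published list (both URLs are the official check.torproject.org endpoints; the proxy address is a placeholder):

import requests

proxy = {"https": "http://localhost:8118"}  # placeholder for one Tor proxy

# the published list of Tor exit addresses that sites can block wholesale
exits = set(requests.get("https://check.torproject.org/torbulkexitlist", timeout=30).text.split())
current = requests.get("https://check.torproject.org/api/ip", proxies=proxy, timeout=30).json()["IP"]
print(current, "on public exit list:", current in exits)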
The Tor proxies run via Docker. The docker-compose.yml file:
version: '3'
services:
  tor_proxy_1: &proxy_template
    image: dperson/torproxy
    container_name: tor_proxy_1
    environment:
      - PASSWORD=${torproxy_controller_pass}
      - BW=0
      - EXITNOTE=0
      - TOR_NewCircuitPeriod=1
    ports:
      - 8118:8118
      - 9050:9050
      - 9051:9051 # control port
    networks:
      - scrapernetwork
    restart: unless-stopped
  tor_proxy_2:
    <<: *proxy_template
    container_name: tor_proxy_2
    ports:
      - 8119:8118
      - 9052:9050
      - 9053:9051 # control port
  tor_proxy_3:
    <<: *proxy_template
    container_name: tor_proxy_3
    ports:
      - 8120:8118
      - 9054:9050
      - 9055:9051 # control port
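Once the stack is up, the three published proxy ports can be sanity-checked from the host. A sketch that confirms each one actually routes through Tor (check.torproject.org/api/ip reports whether the request arrived via Tor):

import requests

# the three HTTP proxy ports published in the docker-compose.yml above
for port in (8118, 8119, 8120):
    proxy = {"http": f"http://localhost:{port}", "https": f"http://localhost:{port}"}
    info = requests.get("https://check.torproject.org/api/ip", proxies=proxy, timeout=30).json()
    print(port, info)  # expect {"IsTor": true, "IP": "<exit ip>"} for each port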
I tested accessing www.heute.at via the Tor Browser to see whether it works at all. It does. I then copied the curl command for the request fetching the HTML document from the Network tab of the developer tools. To reproduce it on that level:
curl 'https://www.heute.at/' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Connection: keep-alive' -H 'Cookie: ioam2018=00014b3f0e4ceb55c65081931:1725355954229:1695029554229:.heute.at:9:at_w_atheute:RedCont/Homepage/Homepage:noevent:1695047962431:g96nzp; dakt_2_uuid=76c9c244b122d37b4bfc4089ca8207a6; dakt_2_uuid_ts=1695029555113; dakt_2_version=2.1.61; _pbjs_userid_consent_data=3524755945110770; __gads=ID=a7625cd4974c024b:T=1695029556:RT=1695047964:S=ALNI_MYz08UbrntABhw-fNYFwC0Fve4kXQ; __gpi=UID=00000c782856d0ce:T=1695029556:RT=1695047964:S=ALNI_MZC5e8mon2kgCOPwmy8suXyIFzxEg; cto_bundle=MiDme19ZaUNLcUdlY0s1RUtYMG8lMkZCdll5Nkd4QXhvZXVvaCUyRml2cHAlMkIlMkZjUExoZnJTS3lWejMxUnNmT3hwYVNWcm1uMCUyRk8wVGhqREYySjdURjVmNHZ1bnNnJTJCcVZ1JTJCeDhFSWNtV1QxQSUyQldYMVY2dGFxNWp2MldvZ2g4aTElMkZJM2pnJTJCQlBz; dakt_2_session_id=1171e864c3d2baf83d6a6e6fad954d06' -H 'Upgrade-Insecure-Requests: 1' -H 'Sec-Fetch-Dest: document' -H 'Sec-Fetch-Mode: navigate' -H 'Sec-Fetch-Site: cross-site' -H 'If-Modified-Since: Mon, 18 Sep 2023 14:22:03 GMT' -H 'TE: trailers'
This, of course, also works fine and returns nice-looking HTML.
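To rule out that Scrapy sends these headers differently from curl (e.g. through the merging of defaults), a one-off spider could pin the exact curl headers directly on the request. A sketch (the spider name and the trimmed header set are my own, not part of the project):

import scrapy

class CurlReplay(scrapy.Spider):
    # hypothetical one-off spider replaying the working curl request
    name = "curlReplay"

    def start_requests(self):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "cross-site",
            "TE": "trailers",
        }
        # headers set on the request take precedence over DEFAULT_REQUEST_HEADERS
        yield scrapy.Request("https://www.heute.at/", headers=headers, dont_filter=True)

    def parse(self, response):
        self.logger.info("got status %s", response.status)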
With this information, I updated Scrapy's request headers in the settings.py already included above:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0"
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    'Cookie': 'ioam2018=00014b3f0e4ceb55c65081931:1725355954229:1695029554229:.heute.at:9:at_w_atheute:RedCont/Homepage/Homepage:noevent:1695047962431:g96nzp; dakt_2_uuid=76c9c244b122d37b4bfc4089ca8207a6; dakt_2_uuid_ts=1695029555113; dakt_2_version=2.1.61; _pbjs_userid_consent_data=3524755945110770; __gads=ID=a7625cd4974c024b:T=1695029556:RT=1695047964:S=ALNI_MYz08UbrntABhw-fNYFwC0Fve4kXQ; __gpi=UID=00000c782856d0ce:T=1695029556:RT=1695047964:S=ALNI_MZC5e8mon2kgCOPwmy8suXyIFzxEg; cto_bundle=MiDme19ZaUNLcUdlY0s1RUtYMG8lMkZCdll5Nkd4QXhvZXVvaCUyRml2cHAlMkIlMkZjUExoZnJTS3lWejMxUnNmT3hwYVNWcm1uMCUyRk8wVGhqREYySjdURjVmNHZ1bnNnJTJCcVZ1JTJCeDhFSWNtV1QxQSUyQldYMVY2dGFxNWp2MldvZ2g4aTElMkZJM2pnJTJCQlBz; dakt_2_session_id=1171e864c3d2baf83d6a6e6fad954d06',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'TE': 'trailers'
}
But it does not work either and only returns 403 responses...
Thanks for reading this far; any help or clue is greatly appreciated!
A: No answers yet