Asked by: Darwin  Asked: 11/6/2023  Updated: 11/6/2023  Views: 10
Scrapy-Playwright program only loads peripheral page elements
Q:
My program, written with scrapy and scrapy-playwright, seems to load only the peripheral elements of the page. The "meat of the page" stays blank, and unfortunately that is exactly the information I am trying to scrape from:
https://chrome.google.com/webstore/category/ext/22-accessibility
import scrapy
from scrapy_playwright.page import PageMethod
import asyncio


class ExtensionSpider(scrapy.Spider):
    name = "extension"
    allowed_domains = ["chrome.google.com"]

    def start_requests(self):
        yield scrapy.Request(
            url='https://chrome.google.com/webstore/category/ext/22-accessibility',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_method': [
                    PageMethod('wait_for_selector', '//h1'),
                    PageMethod('evaluate', 'window.scrollBy(0, document.body.scrollHeight)'),
                    PageMethod('wait_for_timeout', 30000),
                ],
                'errback': self.errback,
            },
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36'
            },
            callback=self.parse,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        h1_element = response.xpath('//h1/text()').get()
        grids = response.xpath('//div[@role="grid"]').getall()
        screenshot = await page.screenshot(path="example.png", full_page=True)
        await page.close()
        yield {
            'H1 Loaded': h1_element,
            'Number of grids': len(grids)
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
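One detail in this request that may matter (based on the scrapy-playwright README; offered as a possibility, not a confirmed fix): the meta key for page methods is spelled playwright_page_methods (plural), and errback is normally passed as a keyword argument to scrapy.Request rather than inside meta. With the singular key, the PageMethod list would be silently ignored, so the page would never scroll or wait before the response is captured. A minimal request under that assumption:

import scrapy
from scrapy_playwright.page import PageMethod


class ExtensionSpider(scrapy.Spider):
    name = "extension"
    allowed_domains = ["chrome.google.com"]

    def start_requests(self):
        yield scrapy.Request(
            url='https://chrome.google.com/webstore/category/ext/22-accessibility',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                # plural key: the singular 'playwright_page_method' is silently ignored
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', '//h1'),
                    PageMethod('evaluate', 'window.scrollBy(0, document.body.scrollHeight)'),
                    PageMethod('wait_for_timeout', 30000),
                ],
            },
            errback=self.errback,  # a Request argument, not a meta entry
            callback=self.parse,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        yield {'H1 Loaded': response.xpath('//h1/text()').get()}

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()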
I am running into a strange issue with my code: essentially only the peripheral elements of the page load (see attached image). I want to access the "meat of the page", but no matter what I pass to wait_for_selector or wait_for_timeout, I only get the page's sidebar and header.
As the screenshot of my page shows, only the header and sidebar load.
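A quick way to isolate the cause (a debugging sketch independent of Scrapy; the 5-second wait and the output path are arbitrary choices, not anything from the original code) is to load the same URL with plain Playwright and take a screenshot. If the grid area is blank here too, the site is likely detecting the headless browser, rather than anything being wrong on the Scrapy side:

import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://chrome.google.com/webstore/category/ext/22-accessibility')
        # wait for network activity to settle, then scroll to trigger lazy loading
        await page.wait_for_load_state('networkidle')
        await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')
        await page.wait_for_timeout(5000)
        await page.screenshot(path='standalone.png', full_page=True)
        await browser.close()

asyncio.run(main())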
A: No answers yet