Asked by Олександр Митровка on 6/26/2023 · Last edited by Олександр Митровка · Updated 6/27/2023 · Views: 36
How to scrape categories and subcategories with Scrapy
Q:
I don't understand how to call a callback to parse subcategories.
I'm using the following code as an example.
I want to parse categories recursively, e.g.: main category ---> sub_category ---> if a subcategory has categories of its own, parse them and add the links, until we reach a final category that contains products, then parse the products.
I want the JSON output to look like:
{
    "url": "Category URL",
    "category_name": "Category name",
    "subcategories": [
        {
            "url": "Subcategory URL",
            "subcategory_name": "Subcategory name",
            "subcategories": [
            ]
        },
        ...
    ]
}
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

"""
name_category = item.css('a span::text').get()
url_category = item.css('a::attr(href)').get()
"""


class CategoryLinkSpider(scrapy.Spider):
    name = "category_link"
    allowed_domains = ["illyushatoys.com.ua"]
    start_urls = ["https://illyushatoys.com.ua/?categoryID=184864"]

    def start_requests(self):
        category_item = {}
        yield scrapy.Request(self.start_urls[0], self.parse, meta={'category_item': category_item})

    def parse(self, response):
        category_item = response.meta.get('category_item')
        links = LinkExtractor(allow=r'.*category/\d+/$', restrict_css='div.inmenu').extract_links(response)
        if not links:
            print('empty_links')
        else:
            for link in links:
                category_item['url'] = link.url
                category_item['title_category'] = link.text
                category_item['subcategories'] = []
                yield response.follow(link, callback=self.parse_subcategory, meta={'category_item': category_item})

    def parse_subcategory(self, response):
        category_item = response.meta['category_item']
        links = LinkExtractor(allow=r'.*category/\d+/$', restrict_css='div.inmenu').extract_links(response)
        if links:
            for link in links:
                category_item['subcategories'].append({
                    'url': link.url,
                    'title_category': link.text
                })
                yield response.follow(link, callback=self.parse_subcategory, meta={'category_item': category_item})
        yield category_item
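For context: the code above appears to share a single category_item dict across every request, so each link overwrites the previous one and the nested structure is never built up. Below is a minimal sketch of one way the recursion could be wired instead, under a few assumptions: a fresh dict per category, passed down via cb_kwargs (available in Scrapy >= 1.7), plus a pending-request counter so a root item is only yielded once its whole subtree has been visited. The spider name, the counter approach, and the strip() calls are illustrative additions; the link-extractor pattern, CSS selector, and start URL are copied from the question. This is not a definitive implementation, just one possible shape of the solution.

import scrapy
from scrapy.linkextractors import LinkExtractor


class CategoryTreeSpider(scrapy.Spider):
    # hypothetical spider name; URL pattern and selector taken from the question
    name = "category_tree"
    allowed_domains = ["illyushatoys.com.ua"]
    start_urls = ["https://illyushatoys.com.ua/?categoryID=184864"]

    link_extractor = LinkExtractor(allow=r'.*category/\d+/$', restrict_css='div.inmenu')

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # a fresh dict per top-level category, so requests do not share state
            item = {
                'url': link.url,
                'category_name': link.text.strip(),
                'subcategories': [],
            }
            # 'pending' counts requests still outstanding for this root item
            state = {'item': item, 'pending': 1}
            yield response.follow(
                link,
                callback=self.parse_subcategory,
                cb_kwargs={'state': state, 'children': item['subcategories']},
            )

    def parse_subcategory(self, response, state, children):
        links = [
            link for link in self.link_extractor.extract_links(response)
            if link.url != response.url  # skip the self-link to avoid endless recursion
        ]
        for link in links:
            child = {
                'url': link.url,
                'subcategory_name': link.text.strip(),
                'subcategories': [],
            }
            children.append(child)
            state['pending'] += 1
            yield response.follow(
                link,
                callback=self.parse_subcategory,
                cb_kwargs={'state': state, 'children': child['subcategories']},
            )
        # this page is fully processed; when no requests remain, the tree is complete
        state['pending'] -= 1
        if state['pending'] == 0:
            yield state['item']

Run from a Scrapy project with something like `scrapy crawl category_tree -o categories.json`, this would emit one nested item per top-level category. One caveat: if Scrapy's duplicate filter drops a scheduled request (e.g. the same category appears in several menus), the counter never reaches zero for that branch, so deduplicated links may need extra handling.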
A: No answers yet