How to scrape categories and subcategories with Scrapy

Asked by: Олександр Митровка  Asked: 6/26/2023  Last edited by: Олександр Митровка  Updated: 6/27/2023  Views: 36

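Q:

I don't understand how to invoke a callback to parse subcategories.

I'm using the code below as an example.

I want to parse the categories level by level, for example: main category ---> sub_category ---> if a subcategory has categories of its own, parse them and add their links, until we reach a final category that contains products, then parse the products.

I'd like the JSON output to look like this: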

{
  "url": "Category URL",
  "category_name": "Category name",
  "subcategories": [
    {
      "url": "Subcategory URL",
      "subcategory_name": "Subcategory name",
      "subcategories": [
        
      ]
    },
    ...
  ]
}
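A nested structure like this does not have to be built while crawling. Scrapy schedules requests asynchronously, so an item yielded from one callback may not yet hold what later callbacks add to it. One option, sketched here under assumptions, is to have the spider emit one flat record per category and nest the records afterwards. The `parent_url` field (`None` for a top-level category), the `build_tree` helper, and the sample records are all hypothetical and not part of Scrapy; the sketch also uses `category_name` at every level for simplicity.

```python
import json


def build_tree(records):
    """Nest flat category records by their (hypothetical) parent_url field."""
    # One node per record, keyed by URL so parents can be looked up.
    nodes = {
        r["url"]: {"url": r["url"], "category_name": r["name"], "subcategories": []}
        for r in records
    }
    roots = []
    for r in records:
        node = nodes[r["url"]]
        parent = nodes.get(r["parent_url"])
        if parent is None:
            roots.append(node)  # no known parent: treat as top-level
        else:
            parent["subcategories"].append(node)
    return roots


# Made-up records standing in for what a spider might yield.
records = [
    {"url": "/category/1/", "name": "Toys", "parent_url": None},
    {"url": "/category/2/", "name": "Cars", "parent_url": "/category/1/"},
    {"url": "/category/3/", "name": "Trucks", "parent_url": "/category/2/"},
]

tree = build_tree(records)
print(json.dumps(tree, indent=2))
```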

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
"""
            name_category = item.css('a span::text').get()
            url_category = item.css('a::attr(href)').get()
"""

class CategoryLinkSpider(scrapy.Spider):
    name = "category_link"
    allowed_domains = ["illyushatoys.com.ua"]
    start_urls = ["https://illyushatoys.com.ua/?categoryID=184864"]

    def start_requests(self):
        category_item = {}
        yield scrapy.Request(self.start_urls[0], self.parse, meta={'category_item': category_item})


    def parse(self, response):
        category_item = response.meta.get('category_item')
        links = LinkExtractor(allow=r'.*category/\d+/$', restrict_css='div.inmenu').extract_links(response)
        if not links:
            print('empty_links')
        else:
            for link in links:
                category_item['url'] = link.url
                category_item['title_category'] = link.text
                category_item['subcategories'] = []
                yield response.follow(link, callback=self.parse_subcategory, meta={'category_item': category_item})

    def parse_subcategory(self, response):
        category_item = response.meta['category_item']
        links = LinkExtractor(allow=r'.*category/\d+/$', restrict_css='div.inmenu').extract_links(response)
        if links:
            for link in links:
                category_item['subcategories'].append({
                    'url': link.url,
                    'title_category': link.text
                })
                yield response.follow(link, callback=self.parse_subcategory, meta={'category_item': category_item})
            yield category_item
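One thing that may contribute to the wrong JSON: `parse` fills the same `category_item` dict on every loop iteration and passes that one object to every request, so each link overwrites the values set for the previous one. Below is a minimal pure-Python sketch of the effect, with no Scrapy involved; the URLs and the variable names `shared_meta` and `fresh_meta` are made up for illustration.

```python
# Hypothetical category URLs, used only for illustration.
links = ["/category/1/", "/category/2/", "/category/3/"]

# Pattern from the spider: one dict reused for every link.
shared_item = {}
shared_meta = []
for url in links:
    shared_item["url"] = url
    shared_meta.append(shared_item)  # every entry is the same object

# Possible fix: build a fresh dict per link.
fresh_meta = []
for url in links:
    fresh_meta.append({"url": url, "subcategories": []})

print([m["url"] for m in shared_meta])  # ['/category/3/', '/category/3/', '/category/3/']
print([m["url"] for m in fresh_meta])   # ['/category/1/', '/category/2/', '/category/3/']
```

In the spider, this would probably mean creating the dict inside the `for link in links:` loop and passing that new dict through `meta` (or `cb_kwargs`) instead of reusing the one received from the previous callback.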

json python-3.x scrapy html parsing

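Comments

0 upvotes Alexander 6/27/2023
What is wrong with your current implementation?
0 upvotes Олександр Митровка 6/27/2023
My current implementation generates the wrong JSON, and I can't figure out how to parse the site into JSON like this: link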

A: No answers yet