Crawl data from URLs by passing the URLs to Scrapy from another *.py file

Asked by: Claire Duong  Asked: 6/14/2020  Updated: 6/14/2020  Views: 140

Q:

I am using Scrapy to scrape data from a website. This is my code in spider.py, inside the Scrapy spiders folder:

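import json
import scrapy
# readInputData is the asker's own helper; its import is not shown in the question

class ThumbSpider(scrapy.Spider):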
    userInput = readInputData('input/user_input.json')
    name = 'thumb'
    # start_urls = ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']

    def __init__(self, *args, **kwargs): 
        super(ThumbSpider, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get('start_urls')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for cssThumb in self.userInput['cssThumb']: # browse each cssThumb which user provides
            items = response.css('{0}::attr(href)'.format(cssThumb)).getall() # access it

            for item in items:
                item = response.urljoin(item)
                yield scrapy.Request(url=item, callback=self.parse_details)

    def parse_details(self, response):
        data = response.css('div.vnnews-text-post p span::text').extract()

        with open('result/page_content.txt', 'a') as outfile:
            json.dump(data, outfile)

        yield data
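
A side note on parse_details above: opening the file in append mode and calling json.dump once per page writes several JSON arrays back to back with no separator, so the file as a whole cannot be parsed as one JSON document later. One common alternative is to write one JSON document per line (JSON Lines). The following is only a minimal sketch of that idea; the helper names append_record and read_records, the .jsonl file name, and the sample paragraphs are made up for illustration and are not part of the asker's project.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one JSON document per line so the file stays parseable."""
    with open(path, "a", encoding="utf-8") as outfile:
        json.dump(record, outfile, ensure_ascii=False)
        outfile.write("\n")


def read_records(path):
    """Read the file back, one JSON document per non-empty line."""
    with open(path, encoding="utf-8") as infile:
        return [json.loads(line) for line in infile if line.strip()]


# Demonstration with a temporary file and made-up paragraph lists.
demo_path = os.path.join(tempfile.mkdtemp(), "page_content.jsonl")
append_record(demo_path, ["first page, paragraph 1", "first page, paragraph 2"])
append_record(demo_path, ["second page, paragraph 1"])
records = read_records(demo_path)
print(len(records))
```

In the spider, the same append_record call could stand in for the json.dump line, with the list returned by response.css(...).extract() as the record.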

I call the ThumbSpider class in main.py and run that file from the terminal:

import json
import os
import modules.misc as msc
from scrapy.crawler import CrawlerProcess
from week_7.spiders.spider import NaviSpider, ThumbSpider

process2 = CrawlerProcess()

process2.crawl(ThumbSpider, start_urls=['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'])
process2.start()

My program does not get anything from the 2 URLs. But when I uncomment start_urls = ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'], delete the __init__ and start_requests methods from the ThumbSpider class, and in main.py change process2.crawl(ThumbSpider, start_urls=msc.getUserChoices()) to process2.crawl(ThumbSpider), it runs fine. I don't know what is happening. Can anyone help me? Thanks a lot.

Python, Scrapy, data science, web mining

Comments

0 votes Gallaecio 6/15/2020
Does it work if you pass them to the spider under a parameter name other than start_urls?
0 votes Claire Duong 6/15/2020
If the code looks like this, it still works fine: process2.crawl(ThumbSpider, start_urls=['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']). But when I change it to process2.crawl(ThumbSpider, start_urls=msc.getUserChoices()), it does not work. getUserChoices() reads data from a JSON file and returns a list of URLs.
0 votes Gallaecio 6/15/2020
Does msc.getUserChoices() return ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'], or something else?
0 votes Claire Duong 6/15/2020
Yes, msc.getUserChoices() returns ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']
0 votes Claire Duong 6/15/2020
['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'] is contained in the JSON file
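
Since the thread turns on what msc.getUserChoices() actually returns, one way to narrow it down is to check the value just before it is handed to the spider. The sketch below is an assumption-laden illustration: check_start_urls is a hypothetical helper, and the raw string stands in for the JSON file, whose real layout is not shown in the question. It rejects values that merely print like a list (for example a single string, or a nested list) and URLs that are not absolute http(s) addresses.

```python
import json
from urllib.parse import urlparse


def check_start_urls(value):
    """Return value as a list of http(s) URL strings, or raise with a descriptive message."""
    if not isinstance(value, (list, tuple)):
        raise TypeError(f"expected a list of URLs, got {type(value).__name__}: {value!r}")
    cleaned = []
    for url in value:
        if not isinstance(url, str):
            raise TypeError(f"expected str URLs, got {type(url).__name__}: {url!r}")
        url = url.strip()
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        cleaned.append(url)
    return cleaned


# Stand-in for the JSON file's content (assumed layout: a plain array of strings).
raw = '["https://vietnamnews.vn/politics-laws", "https://vietnamnews.vn/society"]'
urls = check_start_urls(json.loads(raw))
print(type(urls).__name__, urls)
```

In main.py this could wrap the existing call, as in process2.crawl(ThumbSpider, start_urls=check_start_urls(msc.getUserChoices())), so that an unexpected return value fails loudly instead of producing an empty crawl.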

A: No answers yet