Calling the same spider programmatically

2019-09-19 05:10发布

问题:

I have a spider which crawls links for the websites passed. I want to start the same spider again when its execution is finished with different set of data. How to restart the same crawler again? The websites are passed through database. I want the crawler to run in a unlimited loop until all the websites are crawled. Currently I have to start the crawler scrapy crawl first all the time. Is there any way to start the crawler once and it will stop when all the websites are crawled?

I searched for the same, and found a solution of handling the crawler once its closed/finished. But I don't know how to call the spider form the closed_handler method programmatically.

The following is my code:

 class MySpider(CrawlSpider):
        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            SignalManager(dispatcher.Any).connect(
                self.closed_handler, signal=signals.spider_closed)

        def closed_handler(self, spider):
            reactor.stop()
            settings = Settings()
            crawler = Crawler(settings)
            crawler.signals.connect(spider.spider_closing, signal=signals.spider_closed)
            crawler.configure()
            crawler.crawl(MySpider())
            crawler.start()
            reactor.run()

        # code for getting the websites from the database
        name = "first"
        def parse_url(self, response):
            ...

I am getting the error:

Error caught on signal handler: <bound method ?.closed_handler of <MySpider 'first' at 0x40f8c70>>

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "c:\python27\lib\site-packages\scrapy\xlib\pydispatch\robustapply.py", line 57, in robustApply
    return receiver(*arguments, **named)
  File "G:\Scrapy\web_link_crawler\web_link_crawler\spiders\first.py", line 72, in closed_handler
    crawler = Crawler(settings)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 32, in __init__
    self.spidercls.update_settings(self.settings)
AttributeError: 'Settings' object has no attribute 'update_settings'

Is this the right way to get this done? Or is there any other way? Please help!

Thank You

回答1:

Another way to do it would be making a new script where you select the links from the database and save them to a file and then call the scrapy script this way

os.system("scrapy crawl first")

and load the links from the file onto your spider and work from there on.

If you want to constantly check the database for new links, in the first script just call the database from time to time in an infinite loop and make the scrapy call whenever there are new links!