Tuesday, 15 April 2014

python - How to add try exception in scrapy spider?


I built a simple crawler application using urllib2 and BeautifulSoup, and I am now planning to turn it into a Scrapy spider. How can I handle errors while the crawler is running? My current application has code like this:

    error_file = open('errors.txt', 'a')
    finish_file = open('finishlink.txt', 'a')
    try:
        # code to process each link
        # if the link finished successfully, store it in the 'finish.txt' file
        pass
    except Exception as e:
        # write the link to the 'errors.txt' file along with the error code
        pass

So when processing thousands of links, the successfully processed links are stored in finish.txt and the failures in errors.txt, and I can re-run the links in errors.txt later until they are all processed. How can I accomplish this in the following code?

    import scrapy


    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/computers/programming/languages/python/books/",
            "http://www.dmoz.org/computers/programming/languages/python/resources/"
        ]

        def parse(self, response):
            # name the output file after the last path segment of the url
            filename = response.url.split("/")[-2]
            with open(filename + '.txt', 'wb') as f:
                f.write(response.body)

You can create a spider middleware and override its process_spider_exception() method, saving the links to a file there.

A spider middleware is a way to extend Scrapy's behavior. Here is a complete example that you can modify as needed for your purpose:

    from scrapy import signals


    class SaveErrorsMiddleware(object):
        def __init__(self, crawler):
            # open and close the output file together with the spider
            crawler.signals.connect(self.close_spider, signals.spider_closed)
            crawler.signals.connect(self.open_spider, signals.spider_opened)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def open_spider(self, spider):
            self.output_file = open('somefile.txt', 'a')

        def close_spider(self, spider):
            self.output_file.close()

        def process_spider_exception(self, response, exception, spider):
            # record the url of the response whose callback raised
            self.output_file.write(response.url + '\n')

Put the SaveErrorsMiddleware class in a module and enable it in settings.py:

    SPIDER_MIDDLEWARES = {
        'myproject.middleware.SaveErrorsMiddleware': 1000,
    }

This code will run together with your spider, triggering the open_spider(), close_spider() and process_spider_exception() methods when appropriate.
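
The question also asks for the error itself to be recorded and for the failed links to be run again later. A minimal sketch of one way to do that follows. The names SaveErrorsWithReasonMiddleware and load_failed_urls, the errors.txt path and the tab-separated line format are all assumptions made for illustration, not part of Scrapy. Also note that, as far as I know, process_spider_exception() only sees exceptions raised inside spider callbacks; download failures are reported to a request's errback instead, so they would not end up in this file.

```python
# Sketch only: the class, function and file names are illustrative assumptions.

class SaveErrorsWithReasonMiddleware(object):
    """Spider middleware variant that logs the failing url and the exception."""

    def __init__(self, path='errors.txt'):
        self.path = path

    def process_spider_exception(self, response, exception, spider):
        # Append one "url<TAB>exception" line per failure.
        with open(self.path, 'a') as error_file:
            error_file.write('%s\t%r\n' % (response.url, exception))
        # Returning None lets Scrapy continue its normal exception handling.
        return None


def load_failed_urls(path='errors.txt'):
    """Return the unique urls recorded in the error file, in first-seen order."""
    urls = []
    seen = set()
    try:
        with open(path) as error_file:
            for line in error_file:
                # The url is everything before the first tab on the line.
                url = line.split('\t', 1)[0].strip()
                if url and url not in seen:
                    seen.add(url)
                    urls.append(url)
    except IOError:
        # No error file yet means there is nothing to retry.
        pass
    return urls
```

A retry spider could then set start_urls = load_failed_urls('errors.txt'). Since the file is opened in append mode, you would probably want to rename or clear it before each retry run so that links which succeed on the second attempt are not read back again.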

Read more:

Spider middlewares
Signals in Scrapy
Example middleware in the Scrapy source code

