Monday, 15 March 2010

python - curly bracket issue when inserting scrapy crawler results to postgresql


When using the Scrapy shell:

scrapy shell "http://blogs.reuters.com/us/"

and trying to extract the title of the URL:

response.xpath('(//title/text())').extract()

I get:

[u'analysis & sentiment | reuters']

and when I run the crawler, this is what ends up in the PostgreSQL database:

{"analysis & sentiment | reuters"}

What I want is:

analysis & sentiment | reuters

How can I make this happen? Also, here's the crawler I'm using, in case it helps:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.items import TargetsItem

class MySpider(CrawlSpider):
    name = 'reuters'
    allowed_domains = ['blogs.reuters.com']
    start_urls = ['http://blogs.reuters.com/us/']

    rules = (
        Rule(LinkExtractor(allow_domains=('blogs.reuters.com',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = TargetsItem()
        item['title'] = response.xpath('(//title/text())').extract()
        item['link'] = response.url
        return item
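The curly brackets appear because `.extract()` returns a *list* of strings, and database drivers such as psycopg2 adapt a Python list to a PostgreSQL array literal, which is written with `{...}`. As a rough illustration of that adaptation (this is a hand-rolled sketch, not psycopg2's actual code):

```python
# What response.xpath('//title/text()').extract() returns: a list, not a string.
extracted = [u'analysis & sentiment | reuters']

def as_pg_array_literal(values):
    # Sketch of the PostgreSQL array literal a driver would produce for a list.
    return '{' + ','.join('"%s"' % v for v in values) + '}'

print(as_pg_array_literal(extracted))  # -> {"analysis & sentiment | reuters"}
print(extracted[0])                    # -> analysis & sentiment | reuters
```

So storing the list directly yields the braced value, while storing the first element of the list yields the plain string the asker wants.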

The best alternative is to use Item Loaders with input and output processors:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In particular, use the TakeFirst() output processor. Define a loader:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose

class TargetLoader(ItemLoader):
    default_output_processor = TakeFirst()
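To make the effect of that one-line loader concrete, here is a tiny pure-Python sketch of what a TakeFirst-style output processor does (modeled on Scrapy's TakeFirst, which returns the first value from the extracted list that is neither null nor an empty string):

```python
def take_first(values):
    # Return the first non-None, non-empty value, mirroring the behaviour
    # of a TakeFirst-style output processor; None if nothing qualifies.
    for value in values:
        if value is not None and value != '':
            return value
    return None

print(take_first([u'analysis & sentiment | reuters']))  # -> the plain string
print(take_first(['', None, u'fallback']))              # empties are skipped
```

This is why the loaded item ends up holding a plain string instead of a one-element list, which in turn is what gets inserted into PostgreSQL.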

And load the items using the loader:

def parse_item(self, response):
    l = TargetLoader(TargetsItem(), response)
    l.add_xpath('title', '//title/text()')
    l.add_value('link', response.url)
    return l.load_item()

Demo:

$ scrapy shell http://blogs.reuters.com
>>> import scrapy
>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.contrib.loader.processor import TakeFirst, MapCompose
>>> class TargetItem(scrapy.Item):
...     title = scrapy.Field()
...     link = scrapy.Field()
...
>>> class TargetLoader(ItemLoader):
...     default_output_processor = TakeFirst()
...
>>> l = TargetLoader(TargetItem(), response)
>>> l.add_xpath('title', '//title/text()')
>>> l.add_value('link', response.url)
>>> l.load_item()
{'link': 'http://blogs.reuters.com/us/', 'title': u'analysis & sentiment | reuters'}

python web-scraping scrapy web-crawler
