Python - Curly bracket issue when inserting Scrapy crawler results into PostgreSQL

When using the Scrapy shell:

    scrapy shell "http://blogs.reuters.com/us/"

and trying to extract the title of the URL:

    response.xpath('(//title/text())').extract()

I get:

    [u'analysis & sentiment | reuters']

And when I run the crawler, I get the following in my PostgreSQL database:

    {"analysis & sentiment | reuters"}

What I want is:

    analysis & sentiment | reuters

How can I make this happen? Also, here's the spider I'm using, if it helps:
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    from targets.items import TargetsItem


    class MySpider(CrawlSpider):
        name = 'reuters'
        allowed_domains = ['blogs.reuters.com']
        start_urls = [
            'http://blogs.reuters.com/us/'
        ]

        rules = (
            Rule(LinkExtractor(allow_domains=('blogs.reuters.com', )), callback='parse_item'),
        )

        def parse_item(self, response):
            item = TargetsItem()
            # extract() returns a list of strings, not a single string
            item['title'] = response.xpath('(//title/text())').extract()
            item['link'] = response.url
            return item
The best option is to use Item Loaders with input and output processors:

"Item Loaders provide a convenient mechanism for populating scraped items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it."

In particular, use the TakeFirst() processor. Define a loader:
    from scrapy.contrib.loader import ItemLoader
    from scrapy.contrib.loader.processor import TakeFirst, MapCompose


    class TargetLoader(ItemLoader):
        # Return the first non-empty value for every field instead of a list
        default_output_processor = TakeFirst()

And load the items using the loader:
    def parse_item(self, response):
        l = TargetLoader(TargetsItem(), response)
        l.add_xpath('title', '//title/text()')
        l.add_value('link', response.url)
        return l.load_item()

Demo:
    $ scrapy shell http://blogs.reuters.com
    >>> import scrapy
    >>> from scrapy.contrib.loader import ItemLoader
    >>> from scrapy.contrib.loader.processor import TakeFirst, MapCompose
    >>> class TargetItem(scrapy.Item):
    ...     title = scrapy.Field()
    ...     link = scrapy.Field()
    ...
    >>> class TargetLoader(ItemLoader):
    ...     default_output_processor = TakeFirst()
    ...
    >>> l = TargetLoader(TargetItem(), response)
    >>> l.add_xpath('title', '//title/text()')
    >>> l.add_value('link', response.url)
    >>> l.load_item()
    {'link': 'http://blogs.reuters.com/us/', 'title': u'analysis & sentiment | reuters'}
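To see why this removes the curly brackets, here is a minimal pure-Python sketch that does not need Scrapy or a database. Both helpers are hypothetical stand-ins written for illustration: take_first() imitates what the TakeFirst() processor is understood to do (return the first value that is neither None nor an empty string), and as_pg_array_text() is a simplified approximation of how a list of strings tends to look once it has been stored in a PostgreSQL text column as an array literal. It always quotes elements, which the real server only does when needed.

```python
def take_first(values):
    """Hypothetical stand-in for TakeFirst(): first non-None, non-empty value."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def as_pg_array_text(values):
    """Simplified approximation of a PostgreSQL array literal for strings."""
    quoted = ['"%s"' % v.replace("\\", "\\\\").replace('"', '\\"') for v in values]
    return "{" + ",".join(quoted) + "}"


# extract() hands back a list, even when only one node matched
extracted = ["analysis & sentiment | reuters"]

# Storing the list itself is what produces the braces
print(as_pg_array_text(extracted))   # {"analysis & sentiment | reuters"}

# Storing the first element gives the plain string
print(take_first(extracted))         # analysis & sentiment | reuters
```

So the braces come from the list, not from the title text: once the item field holds a single string rather than a one-element list, the database receives a plain value.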