Wednesday, 15 June 2011

python 2.7 - Scrapy: Extract number from page text with regex




I have been looking for a few hours at how to search the text on a page and, if it matches a regex, extract it. I have my spider set up as follows:

def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"units: (\d)")
    print title, units

I would like to pull out the number after "units: " on the pages. When I run Scrapy on a page with units: 351 in the body, I get the title of the page with a bunch of escapes before and after it, and nothing for units.

I am new to Scrapy and have little Python experience. Any help on how to extract the integer after units: and remove the escape characters "u'\r\n\t..." from the title would be much appreciated.

Edit: as per the comment, here is a partial HTML extract of an illustrative page. Note that the text could be within different tags aside from the p in this example:

<body>
<div> content, multiple divs here
<div>
<h1>this count dala</h1>
<p><strong>number of units:</strong> 801</p>
<p>we have other content here, more divs beyond</p>
</body>

Based on the reply below I got most of the way there. I am still working on removing "units: " and the escape characters.

units = response.xpath('string(//body)').re("(units: [\d]+)")
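For reference, the label stays in the result because the parentheses wrap the whole match. As far as I understand it, Scrapy's .re() returns the captured groups in much the same way as re.findall, so the standard re module is used here as a stand-in, and the sample text is assumed from the HTML extract above. A small sketch:

```python
import re

# Sample text assumed from the HTML extract in the question.
text = "Number of units: 801"

# Group around the whole match: the label comes back with the digits.
with_label = re.findall(r"(units: [\d]+)", text)

# Group around the digits only: just the number comes back.
digits_only = re.findall(r"units: (\d+)", text)

print(with_label)
print(digits_only)
```

Moving the parentheses so they enclose only the digits drops the "units: " prefix without any extra string handling.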

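Try matching against the string value of the whole body, with a + after \d so that every digit is captured rather than only the first:

response.xpath('string(//body)').re(r"units: (\d+)")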
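To illustrate why string(//body) finds the number where //body/text() does not, here is a rough sketch that uses only the standard library. ElementTree stands in for Scrapy's selectors, and the PAGE markup and the extract_units helper are hypothetical, made up for this illustration. The direct text nodes of body correspond to //body/text(), and the concatenated text of all descendants corresponds to string(//body). The same sketch shows that the "escape characters" in the title are ordinary whitespace that strip() removes.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in the question.
PAGE = """<html><head><title>
\t Unit count page \r\n</title></head>
<body><div><h1>Count data</h1>
<p><strong>Number of units:</strong> 801</p></div></body></html>"""

root = ET.fromstring(PAGE)
body = root.find("body")

# Text nodes that are direct children of <body>, like //body/text().
direct_text = "".join([body.text or ""] + [child.tail or "" for child in body])

# Text of <body> and all of its descendants, like string(//body).
full_text = "".join(body.itertext())


def extract_units(text):
    """Return the integer after 'units:', or None when there is no match."""
    match = re.search(r"units:\s*(\d+)", text)
    return int(match.group(1)) if match else None


units_direct = extract_units(direct_text)  # the number sits inside <p>, so no match
units_full = extract_units(full_text)

# The \r\n\t noise around the title is plain whitespace.
title = root.find("head/title").text.strip()

print(units_direct, units_full, title)
```

In the spider itself, the same idea would be to take the first string from the title's extract() result and call .strip() on it, and to convert the first item returned by .re() with int().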

