python 2.7 - Scrapy: Extract number from page text with regex
I have been looking for a few hours at how to search the text on a page and, if it matches a regex, extract it. I have my spider set up as follows:
    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        units = response.xpath('//body/text()').re(r"units: (\d)")
        print title, units
I would like to pull out the number after "units: " on the pages. When I run Scrapy on a page with units: 351 in the body, I get the title of the page with a bunch of escapes before and after it, and nothing for units.
I am new to Scrapy and have little Python experience. Any help on how to extract the integer after units: and remove the escape characters "u'\r\n\t..." from the title would be much appreciated.
Edit: As per the comment, here is a partial HTML extract of an illustration page. Note that the text could be within different tags aside from the p in this example:
    <body>
    <div> content , multiple divs here
    <div>
    <h1>this count dala</h1>
    <p><strong>number of units:</strong> 801</p>
    <p>we have other content here , more divs beyond</p>
    </body>
Based on the reply below, I got most of the way there. I am still working on removing "units: " and the escape characters.
units = response.xpath('string(//body)').re("(units: [\d]+)")
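A possible way to finish this, as a minimal sketch rather than a definitive fix: the parentheses in the pattern above wrap the whole match, so "units: " is returned along with the digits. Moving the group so it surrounds only the digits returns just the number. The "u'\r\n\t..." in the title is the printed form of a list of unicode strings with surrounding whitespace, so joining and stripping should clean it. The helper names `extract_units` and `clean_title` and the sample strings are hypothetical, made up for illustration.

```python
import re


def extract_units(text):
    """Return the integer after 'units:' in text, or None if it is absent."""
    # Only the digits are inside the capture group, so the prefix is dropped.
    match = re.search(r"units:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def clean_title(parts):
    """Join the strings returned by .extract() and trim surrounding whitespace."""
    return u"".join(parts).strip()


# Hypothetical sample data standing in for what the spider would see.
sample_body = u"number of units: 801\r\n"
sample_title = [u"\r\n\t\tThis is the count data\r\n\t"]

units = extract_units(sample_body)
title = clean_title(sample_title)

# Inside a spider, the same pattern could be passed to the selector directly:
#     units = response.xpath('string(//body)').re(r"units:\s*(\d+)")
# which returns a list of strings that can then be converted with int().
```

The code avoids the `print` statement so it runs unchanged on Python 2.7 and Python 3.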
Try:
response.xpath('string(//body)').re(r"units: (\d+)")
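Why `string(//body)` finds the number when `//body/text()` does not can be illustrated without Scrapy. `//body/text()` selects only text nodes that are direct children of `<body>`, while `string(//body)` concatenates the text of all descendants, including the `<p>` that holds the number. The sketch below imitates both with the standard library's `xml.etree.ElementTree`; it is an analogy under stated assumptions, not Scrapy's own code path. The `SAMPLE` markup is a well-formed stand-in for the page in the question, with closing tags added here, and the pattern uses `\d+` because a single `\d` would match only one digit.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's HTML (closing tags are an assumption).
SAMPLE = (
    "<body><div><h1>this count dala</h1>"
    "<p><strong>number of units:</strong> 801</p>"
    "<p>we have other content here</p></div></body>"
)

body = ET.fromstring(SAMPLE)

# Roughly what //body/text() sees: text nodes directly under <body>.
direct_text = "".join([body.text or ""] + [child.tail or "" for child in body])

# Roughly what string(//body) sees: the text of every descendant, joined.
full_text = "".join(body.itertext())

pattern = r"units:\s*(\d+)"
direct_matches = re.findall(pattern, direct_text)
full_matches = re.findall(pattern, full_text)
```

Here `direct_matches` is empty because the number sits inside nested tags, while `full_matches` contains the digits as a string.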
regex python-2.7 scrapy