Monday, 15 April 2013

python - Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds




I noticed an annoying bug: BeautifulSoup4 (package: bs4) finds fewer tags than the previous version (package: BeautifulSoup).

Here's a reproducible instance of the issue:

    # Python 2 (BeautifulSoup 3 does not support Python 3)
    import requests
    import bs4
    import BeautifulSoup

    r = requests.get('http://wordpress.org/download/release-archive/')

    # Parse the same page with both versions
    s4 = bs4.BeautifulSoup(r.text)
    s3 = BeautifulSoup.BeautifulSoup(r.text)

    # Count the <a> tags each version finds
    print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
    print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

    With BeautifulSoup 4 : 557
    With BeautifulSoup 3 : 1701

The difference is not minor, as you can see.

Here are the exact versions of the modules, in case anyone is wondering:

    In [20]: bs4.__version__
    Out[20]: '4.2.1'

    In [21]: BeautifulSoup.__version__
    Out[21]: '3.2.1'

You have lxml installed, which means that BeautifulSoup 4 will use that parser in preference to the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may be to blame here. You may have to upgrade those instead, or as well. See the lxml requirements page; libxml2 2.7.8 or newer is recommended.
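If you are not sure which versions you actually have, a rough sketch like the one below should tell you. It is written for Python 3, the helper name `lxml_versions` is made up for illustration, and the (2, 7, 8) threshold is simply the recommendation quoted above, not something the library enforces.

```python
def lxml_versions():
    """Return the lxml / libxml2 / libxslt versions as tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,                       # version of lxml itself
        "libxml2": etree.LIBXML_VERSION,                  # libxml2 loaded at runtime
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,  # libxml2 that lxml was built against
        "libxslt": etree.LIBXSLT_VERSION,                 # libxslt loaded at runtime
    }


if __name__ == "__main__":
    versions = lxml_versions()
    if versions is None:
        print("lxml is not installed; bs4 will fall back to another parser")
    else:
        for name, version in versions.items():
            print(name, ".".join(str(part) for part in version))
        # Assumption: compare against the recommended minimum mentioned above.
        if versions["libxml2"][:3] < (2, 7, 8):
            print("libxml2 is older than 2.7.8; consider upgrading it")
```

If the runtime and compiled libxml2 versions differ, that alone can be worth a look before blaming Beautiful Soup.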

Or explicitly specify the other parser when parsing the soup:

    s4 = bs4.BeautifulSoup(r.text, 'html.parser')
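To see whether the parser is really what makes the difference, you could count the links with each parser side by side. This is only a sketch in Python 3 syntax: the function name `count_links_per_parser` and the tiny inline `SAMPLE_HTML` are assumptions for illustration, and on the real page you would pass `r.text` instead.

```python
import bs4

# Tiny stand-in document (an assumption); pass r.text to test the real page.
SAMPLE_HTML = "<html><body><p><a href='/a'>a</a> <a href='/b'>b</a></p></body></html>"


def count_links_per_parser(html, parsers=("lxml", "html.parser", "html5lib")):
    """Parse the same markup with each named parser and count the <a> tags found."""
    counts = {}
    for name in parsers:
        try:
            soup = bs4.BeautifulSoup(html, name)
        except bs4.FeatureNotFound:
            # That parser is not installed here; record it as unavailable.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


if __name__ == "__main__":
    for parser, count in count_links_per_parser(SAMPLE_HTML).items():
        print(parser, "not installed" if count is None else count)
```

If the counts disagree on your page, the markup is probably malformed somewhere and each parser is recovering from it differently.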

