Tuesday, 15 January 2013

python - How to ignore a th tag while parsing html table? -




Hello, I am quite new to parsing HTML tables with Python and BeautifulSoup4. It had been going well until I ran into a weird table that uses a 'th' tag midway through the table, causing my parse to quit and throw an 'index out of range' error. I've tried searching, including on Google, to no avail. My question is: how do I ignore or strip this rogue 'th' tag while parsing the table?

Here is the code I have so far:

    from mechanize import Browser
    from bs4 import BeautifulSoup

    mech = Browser()
    url = 'https://www.moscone.com/site/do/event/list'
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)

    table = soup.find('table', {'id': 'list'})
    for row in table.findAll('tr')[3:]:
        col = row.findAll('td')
        date = col[0].string
        name = col[1].string
        location = col[2].string
        record = (name, date, location)
        final = ','.join(record)
        print(final)

Here is a little snippet of the HTML that causes the error:

    <td> Convention </td>
    </tr>
    <tr>
    <th class="title" colspan="4"> Mon Dec 01 00:00:00 PST 2014 </th>
    </tr>
    <tr>
    <td> 12/06/14 - 12/09/14 </td>

I want the info above and below this rogue 'th', which indicates the start of a new month in the table.
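To see why the loop fails, here is a minimal sketch that reproduces the error offline. The HTML fragment is a made-up stand-in modelled on the snippet above (the event names and halls are hypothetical), and it uses the built-in "html.parser" backend rather than fetching the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the snippet in the question.
html = """
<table id="list">
  <tr><td>11/23/14 - 11/25/14</td><td>Event A</td><td>Hall 1</td></tr>
  <tr><th class="title" colspan="4">Mon Dec 01 00:00:00 PST 2014</th></tr>
  <tr><td>12/06/14 - 12/09/14</td><td>Event B</td><td>Hall 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", {"id": "list"}).find_all("tr")

# Count the <td> cells in each row: the month header row has none.
td_counts = [len(row.find_all("td")) for row in rows]
print(td_counts)  # [3, 0, 3]

# Indexing the empty result for the header row is what raises the error.
try:
    rows[1].find_all("td")[0]
    error = None
except IndexError as exc:
    error = exc
print(type(error).__name__)  # IndexError
```

The row that holds only a 'th' yields an empty list of 'td' cells, so col[0] has nothing to index.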

You can check whether a th is in the row and parse the row's content only if there is none, like this:

    for row in table.findAll('tr')[3:]:
        # make sure th is not in the row
        if not row.find_all('th'):
            col = row.findAll('td')
            date = col[0].string
            name = col[1].string
            location = col[2].string
            record = (name, date, location)
            final = ','.join(record)
            print(final)

This prints the results for the provided URL without an IndexError:

    Out & Equal Workplace,11/03/14 - 11/06/14,Moscone West
    Samsung Developer Conference,11/11/14 - 11/13/14,Moscone West
    North American Spine Society (NASS) Annual Meeting,11/12/14 - 11/15/14,Moscone South and Esplanade Ballroom
    San Francisco International Auto Show,11/22/14 - 11/29/14,Moscone North & South
    67th Annual Meeting of APS Division of Fluid Dynamics,11/23/14 - 11/25/14,Moscone North, South and West
    American Society of Hematology,12/06/14 - 12/09/14,Moscone North, South and West
    California School Boards Association,12/12/14 - 12/16/14,Moscone North & Esplanade Ballroom
    American Geophysical Union,12/15/14 - 12/19/14,Moscone North & South
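For experimenting without hitting the live site, the same check can be sketched against an inline fragment. This is only an illustration under stated assumptions: the HTML, the event names and the parse_events helper are hypothetical and not part of the original answer. It also goes one small step further than the answer by remembering the text of the most recent 'th' as the month label instead of discarding it, and it uses get_text(strip=True) instead of .string, since .string returns None for a cell with more than one child and ','.join would then fail.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two month header rows, each followed by one event row.
html = """
<table id="list">
  <tr><th class="title" colspan="4"> Sat Nov 01 00:00:00 PDT 2014 </th></tr>
  <tr><td> 11/23/14 - 11/25/14 </td><td> Event A </td><td> Hall 1 </td></tr>
  <tr><th class="title" colspan="4"> Mon Dec 01 00:00:00 PST 2014 </th></tr>
  <tr><td> 12/06/14 - 12/09/14 </td><td> Event B </td><td> Hall 2 </td></tr>
</table>
"""


def parse_events(html):
    """Return (month, name, date, location) tuples, skipping th-only rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "list"})
    records = []
    month = ""  # text of the most recent month header seen so far
    for row in table.find_all("tr"):
        header = row.find("th")
        if header is not None:
            # Month separator row: remember its label, do not parse it as data.
            month = header.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 3:
            # Defensive: ignore any other row that lacks the three expected cells.
            continue
        date, name, location = cells[:3]
        records.append((month, name, date, location))
    return records


records = parse_events(html)
for record in records:
    print(",".join(record))
```

If the month label is not needed, dropping the month variable leaves the same skip-the-header logic as the answer above.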

python html beautifulsoup
