Monday, 15 September 2014

Logging HTML content in a library environment with Python




I have a library that third-party developers use to obtain information from a few specific websites. The library is responsible for connecting to the website, grabbing pages, parsing the necessary information, and returning it to the developer.

However, I'm having issues coming up with an acceptable way to handle storing potentially malformed HTML. Since I can only account for so many things when testing, parsing may fail in the future, and it would be helpful if I could find a way to store the HTML that failed parsing for future bug fixing.

Right now I'm using Python's built-in logging module to handle logging in the library. I'm allowing the third-party developer to supply a configuration dictionary to configure how the logging outputs error data. However, printing the HTML to the console or to a file is not ideal to me, as I think it would clutter the terminal or the error log. I considered storing the HTML as files on the local hard drive, but that seems extremely intrusive.
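To make the setup concrete, here is a minimal sketch of how such a configuration dictionary could be applied with the standard library's `logging.config.dictConfig`. The logger name `scraperlib`, the `configure_logging` helper, and the example dictionary are assumptions for illustration and not part of the actual library.

```python
import logging
import logging.config

# Hypothetical logger name for the library (an assumption, not from the post).
LIBRARY_LOGGER = "scraperlib"

# Libraries conventionally attach a NullHandler so that nothing is emitted
# unless the application chooses to configure logging.
logging.getLogger(LIBRARY_LOGGER).addHandler(logging.NullHandler())


def configure_logging(config):
    """Apply a developer-supplied logging configuration dictionary."""
    logging.config.dictConfig(config)


# Example of a dictionary a third-party developer might supply.
example_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "brief": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "brief",
            "level": "ERROR",
        },
    },
    "loggers": {
        LIBRARY_LOGGER: {"handlers": ["console"], "level": "ERROR"},
    },
}

configure_logging(example_config)
```

With a setup like this the developer decides where error records go, which is exactly why dumping whole HTML pages into those records is unattractive.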

I've determined how I'm going to pass the HTML internally. My plan is to pass it via the parameters of an exception and grab it with a filter. However, deciding where to store it is troubling me.
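A rough sketch of that plan, assuming an exception class that carries the HTML and a `logging.Filter` that pulls it off the log record. `ParseError`, `HtmlCaptureFilter`, the `sink` callback, the logger name, and the `url` extra field are all hypothetical names chosen for illustration.

```python
import logging


class ParseError(Exception):
    """Hypothetical exception that carries the HTML that failed to parse."""

    def __init__(self, message, html):
        super().__init__(message)
        self.html = html


class HtmlCaptureFilter(logging.Filter):
    """Pull the HTML off a logged ParseError and hand it to a callback."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink  # any callable accepting (url, html)

    def filter(self, record):
        exc = record.exc_info[1] if record.exc_info else None
        if isinstance(exc, ParseError):
            self.sink(getattr(record, "url", None), exc.html)
        return True  # never suppress the record itself


captured = []
logger = logging.getLogger("scraperlib.parser")
logger.addHandler(logging.NullHandler())
logger.addFilter(HtmlCaptureFilter(lambda url, html: captured.append((url, html))))

try:
    raise ParseError("missing <title>", "<html><body>oops</body>")
except ParseError:
    # logger.exception attaches the active exception to the record.
    logger.exception("parsing failed", extra={"url": "http://example.com/page"})
```

The log message itself stays short, while the filter receives the full page and can forward it to whatever storage is chosen. Note that a filter attached to a logger only sees records logged on that logger directly, so it may be more practical to attach it to a handler instead.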

Any feedback on a method to accomplish this would be appreciated.

Services based on websites you don't control are fragile, so storing the HTML to avoid recrawling in the event of parsing problems makes perfect sense to me. Since uncompressed HTML can consume a lot of space on disk, you might want to store it in compressed form in a database.

I've found MongoDB to be convenient for this. The underlying storage format is BSON (i.e. binary JSON). It's also easy to install and use.

Here's a toy illustration using PyMongo to store a page in MongoDB:

    import time
    from urllib.request import urlopen

    from pymongo import MongoClient

    # Timestamp to be stored in the document
    ts = time.time()
    url = 'http://stackoverflow.com/questions/26683772/logging-html-content-in-a-library-environment-with-python'
    # Decode the response so the page is stored as text rather than raw bytes
    html = urlopen(url).read().decode('utf-8', errors='replace')

    # Create a dict and store it in MongoDB
    htmldict = {'url': url, 'ts': ts, 'html': html}
    client = MongoClient()  # connects to localhost:27017 by default
    db = client.html_log
    collection = db.html
    collection.insert_one(htmldict)

Check to see that the document is stored in MongoDB:

    $ mongo
    > use html_log;
    > db.html.find()
    { "_id" : ObjectId("54544d96164a1b22d3afd887"), "url" : "http://stackoverflow.com/questions/26683772/logging-html-content-in-a-library-environment-with-python", "html" : "<!doctype html> [...] </html>", "ts" : 1414810778.001168 }
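The toy example stores the page uncompressed. As a sketch of the compressed form suggested earlier, the page could be run through the standard library's `zlib` before it goes into the document. The helper names `build_document` and `restore_html` and the `html_z` field are assumptions made up for this illustration.

```python
import time
import zlib


def build_document(url, html):
    """Return a dict holding the page as zlib-compressed UTF-8 bytes."""
    return {
        "url": url,
        "ts": time.time(),
        "html_z": zlib.compress(html.encode("utf-8")),
    }


def restore_html(document):
    """Recover the original HTML text from a stored document."""
    return zlib.decompress(document["html_z"]).decode("utf-8")


# Repetitive markup, as a stand-in for a real page.
page = "<html><body>" + "<p>row</p>" * 500 + "</body></html>"
doc = build_document("http://example.com/page", page)
print(len(page), "characters ->", len(doc["html_z"]), "compressed bytes")
```

Under Python 3, PyMongo should store the `bytes` value as BSON binary data, so a dict like this could be passed to `collection.insert_one` in place of the uncompressed one. The trade-off is that the page can no longer be read directly in the mongo shell.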

python html logging
