python - Saving the web pages by reading the list of urls from file -
I can only think of one way of solving this problem, and it has the limitations listed below. Can you suggest another way of solving it?
We are given a text file with 999999 URLs. I have to write a Python program that reads the file and saves the web pages in a folder called 'saved_page'.
I have tried to solve the problem like this:
import os
import urllib

save_path = 'c:/test/home_page/'
name = os.path.join(save_path, "test.txt")

file = open('soop.txt', 'r')   # all the urls are in the soop.txt file
for line in file:
    info = urllib.urlopen(line)
    data = info.readlines()
    for line in data:
        f = open(name)
        lines = f.readlines()
        f.close()
        lines.append(line)
        f = open(name, "w")
        f.writelines(lines)
        f.close()
file.close()
Here are the limitations of this code:
1) If the network goes down, the code has to be restarted from the beginning.
2) If it comes across a bad URL - i.e. a server that doesn't respond - the code gets stuck.
3) I am downloading in sequence, which is quite slow for a big number of URLs.
So can you suggest a solution that addresses these problems as well?
Some remarks:
Points 1 and 2 can be fixed with a restart-point method. In the script, loop until the download succeeds or a maximum number of attempts is reached, under the line for line in file containing the read part, and only write if the download succeeded. You still have to decide what to do with a file that is not downloadable: either log the error and go on to the next file, or abort the whole job.
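A minimal sketch of that retry loop, assuming Python 3 (urllib.request instead of the urllib.urlopen used above); the fetch helper and MAX_ATTEMPTS are names I am introducing here, not part of the original code:

import logging
import urllib.request

MAX_ATTEMPTS = 3

def fetch(url, max_attempts=MAX_ATTEMPTS):
    # Try to download one URL, retrying up to max_attempts times.
    for attempt in range(1, max_attempts + 1):
        try:
            # A timeout keeps a non-responding server from blocking forever.
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()
        except Exception as exc:
            logging.warning("attempt %d/%d failed for %s: %s",
                            attempt, max_attempts, url, exc)
    return None   # not downloadable: caller logs and skips, or aborts

with open('soop.txt') as urls:
    for line in urls:
        url = line.strip()
        if not url:
            continue
        data = fetch(url)
        if data is None:
            logging.error("giving up on %s", url)
            continue          # or raise here to abort the whole job
        # ... append data to the output file here ...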
If you want to be able to restart a failed job later, you should maintain somewhere (a state.txt file) the list of downloaded files. Write to it (and flush) after each file is fetched and written. To be bullet proof, you should write one element after getting the file, and one element after writing it. That way, on restart, you can know whether the output file may contain a partially written file (power outage, break, ...) by testing the presence of the state file and its content.
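One way that state file could look, as a sketch; the GOT/WRITTEN event names and the load_state/record helpers are illustrative, and fetch is the retry helper from the previous sketch:

import os

STATE_FILE = 'state.txt'

def load_state():
    # URLs whose 'WRITTEN' event made it into the state file are done.
    done = set()
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            for line in f:
                event, _, url = line.rstrip('\n').partition(' ')
                if event == 'WRITTEN':
                    done.add(url)
    return done

def record(state, event, url):
    # Append one event and flush so it survives a crash or power outage.
    state.write('%s %s\n' % (event, url))
    state.flush()
    os.fsync(state.fileno())

already_done = load_state()
with open(STATE_FILE, 'a') as state, open('soop.txt') as urls, \
        open('c:/test/home_page/test.txt', 'ab') as out:
    for line in urls:
        url = line.strip()
        if not url or url in already_done:
            continue               # skip work finished in a previous run
        data = fetch(url)          # retry helper from the previous sketch
        if data is None:
            continue
        record(state, 'GOT', url)       # fetched, but not yet written
        out.write(data)
        out.flush()
        record(state, 'WRITTEN', url)   # a GOT without a WRITTEN on restart
                                        # means the last write may be partial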
Point 3 is much more tricky. To allow parallel downloads, you have to use threads or asyncio. You then have to synchronize to ensure the files are written to the output file in the proper order. If you can afford to keep everything in memory, a simple way is to first download everything using a parallelized method (the link given by J.F. Sebastian can help), and then write it all out in order.
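A sketch of the thread-based version of that idea, keeping all pages in memory and writing them afterwards in the original order; fetch is again the retry helper assumed above, and the worker count is arbitrary:

from concurrent.futures import ThreadPoolExecutor

def download_all(urls, workers=16):
    # pool.map yields results in the same order as the input URLs,
    # even though the downloads themselves run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

with open('soop.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

pages = download_all(urls)          # everything is held in memory here

with open('c:/test/home_page/test.txt', 'wb') as out:
    for page in pages:
        if page is not None:        # None marks a URL that was given up on
            out.write(page)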
python urllib2