Thursday, 15 January 2015

web scraping - How to split a list of URLs by pattern?




I crawled a list of URLs from a website. I want to cluster these URLs into groups so that I can generate a sitemap for the site. Similar URLs should go to the same group.

In [1]: http://www.example.org/s/daily/2013-12-09/1392994518.html
Out[1]: http://www.example.org/s/daily/${date:%Y-%m-%d}/${date:%s}.html

In [2]: http://www.example.org/torvalds/linux/commit/3bd7bf1f0fe14f591c089ae61bbfa9bd356f178a
Out[2]: http://www.example.org/torvalds/linux/commit/${sha1}
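To make the desired In/Out mapping concrete, here is a minimal sketch of one possible approach: replace each path segment that looks like a date, a Unix timestamp, or a SHA-1 hash with a placeholder, then group URLs by the resulting pattern. The rule table `SEGMENT_RULES`, the `${int}` placeholder, and the function names are assumptions made for illustration, not an existing package; the sketch writes the year as `%Y` because the sample uses a four-digit year.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical rule table: (regex for a whole path segment, placeholder).
# Order matters: more specific rules come before the generic integer rule.
SEGMENT_RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "${date:%Y-%m-%d}"),
    (re.compile(r"[0-9a-f]{40}"), "${sha1}"),
    (re.compile(r"\d{10}"), "${date:%s}"),  # 10-digit Unix timestamp
    (re.compile(r"\d+"), "${int}"),
]


def generalize_segment(segment):
    """Replace one path segment with a placeholder if a rule matches it."""
    stem, dot, ext = segment.rpartition(".")
    if not dot:  # no file extension: the whole segment is the stem
        stem, ext = segment, ""
    for pattern, placeholder in SEGMENT_RULES:
        if pattern.fullmatch(stem):
            return placeholder + dot + ext
    return segment


def url_pattern(url):
    """Return the URL with variable-looking path segments generalized."""
    parts = urlsplit(url)
    segments = [generalize_segment(s) for s in parts.path.split("/")]
    return f"{parts.scheme}://{parts.netloc}" + "/".join(segments)


def group_urls(urls):
    """Group URLs that share the same generalized pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)
```

Calling `group_urls` on the crawled list gives one entry per pattern, which can serve as a first draft of the sitemap. Fixed rules like these only catch formats you anticipate; anything else stays in its own group.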

Do you have any ideas? Is there some software package I can use?

You want to find the URLs that have a high frequency of flow into them. Once you've identified these, eliminate the ones that have low (or no) flow to other pages on the site. The latter group will be things like the terms of use and the privacy policy.
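As a rough sketch of that first step, assume the crawl produced a dictionary `links` mapping each page to the list of pages it links to (a hypothetical structure; adapt it to whatever your crawler stores). Counting inbound and outbound links then separates candidate anchor pages from dead ends. The thresholds `min_inbound` and `min_outbound` are arbitrary assumptions you would tune per site.

```python
from collections import Counter


def link_flow(links):
    """Count distinct inbound and outbound links per page, ignoring self-links."""
    inbound = Counter()
    outbound = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                inbound[target] += 1
                outbound[source] += 1
    return inbound, outbound


def split_by_flow(links, min_inbound=3, min_outbound=1):
    """Split heavily linked pages into anchors and dead ends.

    Anchors receive many links and also link onward; dead ends receive
    many links but link to few or no other pages (terms of use, etc.).
    """
    inbound, outbound = link_flow(links)
    popular = [page for page, n in inbound.items() if n >= min_inbound]
    anchors = sorted(p for p in popular if outbound[p] >= min_outbound)
    dead_ends = sorted(p for p in popular if outbound[p] < min_outbound)
    return anchors, dead_ends
```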

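The former are the anchor points for each partition of the site. Go to the anchor pages and use the text in the title line as the name of the division. Then check which URLs flow out of the anchors to other pages on the site. If they don't flow to another anchor point, they belong to that division.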
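The assignment step could look something like the sketch below. It assumes the same hypothetical `links` dictionary (page to list of linked pages), a list of `anchors` already chosen, and a `titles` dictionary holding each anchor page's title text; all three names are illustrative. It reads the rule literally: a page linked from an anchor joins that anchor's division unless it links to a different anchor.

```python
def assign_divisions(links, anchors, titles):
    """Map each division name to the pages that belong to it.

    A page linked from an anchor belongs to that anchor's division
    unless the page itself links to some other anchor.
    """
    anchor_set = set(anchors)
    divisions = {}
    for anchor in anchors:
        # Fall back to the anchor URL when no title text is available.
        name = titles.get(anchor, anchor)
        members = []
        for page in sorted(set(links.get(anchor, []))):
            if page in anchor_set:
                continue  # another anchor starts its own division
            other_anchors = (set(links.get(page, [])) & anchor_set) - {anchor}
            if not other_anchors:
                members.append(page)
        divisions[name] = members
    return divisions
```

Pages that link to more than one anchor are left unassigned here; whether to drop them, duplicate them, or assign them to the anchor that links to them most is a design choice the answer leaves open.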

web-scraping pattern-recognition
