Friday, 15 April 2011

web scraping - How do I scrape a web page using C? -




So I've written a website scraper program in C# using Html Agility Pack. That was fairly straightforward. Even accounting for inconsistencies in formatting on the web page, it still only took me a couple of hours to get working.

Now I have to re-implement this program in C so it can run in a Linux environment. This is a major nightmare.

I'm able to pull the page down, but when it comes to tracking through it to pull out the parts I'm interested in, I'm drawing a lot of blanks. Originally, I was dead set on trying to implement a solution similar to my Html Agility Pack version in C#, except using Tidy and some other XML library, so I could keep my logic more or less the same.

This hasn't worked out so well. The XML library I have access to doesn't appear to support XPath, and I'm not able to install one that does. So I've resorted to trying to figure out a way to read through the page using string matching to find the data I want. I can't help but feel there has to be a better way to do this.

Here is what I have:

    #define html_page "codes.html"

    void got_code(char *html);

    int extract(void)
    {
        FILE *html;
        int found = 0;
        char buffer[1000];
        char searchfor[80], *cp;

        html = fopen(html_page, "r");
        if (html) {
            // error prone: if the buffer cuts off halfway through the
            // string we are looking for, the match fails!
            while (fgets(buffer, 999, html)) {
                trim(buffer);
                if (!found) {
                    sprintf(searchfor, "<strong>");
                    cp = strstr(buffer, searchfor);
                    if (!cp)
                        continue;
                    if (strncmp(cp + strlen(searchfor), "co1", 3) == 0 ||
                        strncmp(cp + strlen(searchfor), "co2", 3) == 0) {
                        got_code(cp + strlen(searchfor));
                    }
                }
            }
            fclose(html); // only close the file if it was opened
        }
        return 0;
    }

    void got_code(char *html)
    {
        char code[8];
        char *endtag;
        struct _code_st *currcode;
        int i;

        endtag = strstr(html, "</strong>");
        if (!endtag)
            return;

        sprintf(code, "%.7s", html);
        for (i = 0; i < data.codes; i++)
            if (strcasecmp(data.code[i].code, code) == 0)
                return; // already in the list

        add_to_list(currcode, _code_st, data.code, data.codes);
        currcode->code = strdup(code);
        printf("code: %s\n", code);
    }

The above doesn't work properly. I get a lot of the codes I'm interested in, but as I mention above, if the buffer cuts off at the wrong spots I miss some.

I did try just reading the entire chunk of HTML I'm interested in into a string, but I wasn't able to figure out how to cycle through it; I couldn't get any codes displayed.

Does anyone know how I can solve this issue?

Edit: I've been thinking about this some more. Is there a way I can look ahead in the file, search for the end of each 'block' of text I'm parsing, and set the buffer size to that before I read it? Would I need another file pointer to the same file? This would (hopefully) prevent the problem of the buffer cutting off at inconvenient places.
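One way to sidestep the chunk-boundary problem raised in the edit is to drop the fixed-size fgets buffer and load the whole file into a single NUL-terminated buffer, so that strstr can never see a tag split in two. The following is only a minimal sketch, assuming the page is a regular file that fits comfortably in memory; slurp_file is a hypothetical helper name, not part of any library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read an entire file into a malloc'd,
 * NUL-terminated buffer. Returns NULL on failure; the caller frees it. */
char *slurp_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");
    char *buf = NULL;
    long size;

    if (!fp)
        return NULL;

    /* Find the file size by seeking to the end, then rewind. */
    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0 ||
        fseek(fp, 0, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    buf = malloc((size_t)size + 1);
    if (buf) {
        size_t got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';        /* terminate so strstr() is safe to use */
        if (out_len)
            *out_len = got;
    }
    fclose(fp);
    return buf;
}
```

With the page in one buffer, a single pointer can be walked forward with strstr from one <strong> block to the next, and no second file pointer is needed.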

Okay, after much banging of my head against the wall trying to come up with a way to make my above code work, I decided to try a slightly different approach.

Since I knew the data on the page I'm scraping is contained on one huge line, I changed my code to search through the file until it found it. Then I progress down the line looking for the blocks I wanted. This worked surprisingly well, and once I had the code reading some of the blocks, it was easy to make minor modifications to account for inconsistencies in the HTML. The part that took the longest was figuring out how to bail out once I reached the end of the line, and I solved that by peeking ahead to make sure there was another block to read.

Here is my code (which is ugly but functional):

    #define html_page "codes.html"
    #define start_block "<strong>"
    #define end_block "</strong>"

    void got_code(char *html);

    int extract(void)
    {
        FILE *html;
        int found = 0;
        char *line = NULL, *starttag;
        size_t len = 0;

        html = fopen(html_page, "r");
        if (html) {
            while (getline(&line, &len, html) != -1) {
                if (found) { // found the line with the codes we are interested in
                    char *ptr = line;

                    while (ptr != NULL) {
                        starttag = strstr(ptr, start_block);
                        if (!starttag)
                            break; // no more blocks on this line
                        ptr = starttag + strlen(start_block);
                        if (strncmp(ptr, "co1", 3) != 0 &&
                            strncmp(ptr, "co2", 3) != 0)
                            continue; // not a code block, keep scanning
                        got_code(ptr);

                        ptr = strstr(ptr, end_block);
                        if (!ptr)
                            break;
                        ptr += strlen(end_block);

                        // look ahead to make sure we have more to pull out
                        if (!strstr(ptr, end_block))
                            break;
                    }
                    break;
                }

                // find the section of the downloaded page we care about;
                // the next line read is the blob containing the HTML we want
                if (strstr(line, "wiki-content") != NULL)
                    found = 1;
            }
            free(line); // getline() allocates the buffer
            fclose(html);
        }
        return 0;
    }

    void got_code(char *html)
    {
        char code[8];
        char *endtag;
        struct _code_st *currcode;
        int i;

        endtag = strstr(html, "</strong>");
        if (!endtag)
            return;

        sprintf(code, "%.7s", html);
        for (i = 0; i < data.codes; i++)
            if (strcasecmp(data.code[i].code, code) == 0)
                return; // already in the list

        add_to_list(currcode, _code_st, data.code, data.codes);
        currcode->code = strdup(code);
        printf("code: %s\n", code);
    }

It's not as elegant or robust as my C# program, but at least it pulls the data I want.
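The walk along the line can also be pulled out into a small reusable function, which keeps the bail-out logic in one place. This is a rough sketch rather than a drop-in replacement: next_strong is a hypothetical helper, the tag names are hard-coded to match the page described above, and outsz is assumed to be at least 1. Unlike the "%.7s" copy in got_code, it stops copying at the closing tag, so a code shorter than seven characters would not pick up stray markup.

```c
#include <string.h>

/* Hypothetical helper: find the next <strong>...</strong> block at or
 * after *cursor, copy its text into out (truncated to outsz - 1 chars),
 * and advance *cursor past the closing tag.
 * Returns 1 if a block was found, 0 when there are no more. */
int next_strong(const char **cursor, char *out, size_t outsz)
{
    static const char open_tag[] = "<strong>";
    static const char close_tag[] = "</strong>";
    const char *start, *end;
    size_t n;

    start = strstr(*cursor, open_tag);
    if (!start)
        return 0;
    start += strlen(open_tag);

    end = strstr(start, close_tag);
    if (!end)
        return 0;               /* unterminated block: stop scanning */

    n = (size_t)(end - start);
    if (n >= outsz)
        n = outsz - 1;          /* truncate to fit the output buffer */
    memcpy(out, start, n);
    out[n] = '\0';

    *cursor = end + strlen(close_tag);
    return 1;
}
```

Called in a loop such as while (next_strong(&p, code, sizeof code)), it visits each block in turn; the caller can then filter on the co1 / co2 prefix before adding the code to its list.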

c web-scraping
