My Blog: web scraping - How do I scrape a web page using C? -

Friday, 15 April 2011

web scraping - How do I scrape a web page using C? -

So I've written a web site scraper program in C# using the HTML Agility Pack. That was fairly straightforward. Even accounting for inconsistencies in formatting on the web page, it still only took me a couple of hours to get working.

Now I have to re-implement this program in C so it can run in a Linux environment. This has been a major nightmare.

I'm able to pull the page down, but when it comes to tracking through it to pull out the parts I'm interested in, I'm drawing a lot of blanks. Originally, I was dead set on implementing a solution similar to my HTML Agility Pack version in C#, except using Tidy and some other XML library, so I could keep my logic more or less the same.

This hasn't worked out so well. The XML library I have access to doesn't appear to support XPath, and I'm not able to install one that does. So I've resorted to trying to figure out a way to read through the page using string matching to find the data I want. I can't help feeling there has to be a better way to do this.

Here is what I have:

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>   /* strcasecmp */

    #define html_page "codes.html"

    /* trim(), data, _code_st and add_to_list() are defined elsewhere in the program */

    void got_code(char *html);

    int extract()
    {
        FILE *html;
        int found = 0;
        char buffer[1000];
        char searchfor[80], *cp;

        html = fopen(html_page, "r");

        if (html)
        {
            // error prone: if the buffer cuts off halfway through the string we are looking for, this will fail!
            while (fgets(buffer, sizeof buffer, html))
            {
                trim(buffer);

                if (!found)
                {
                    sprintf(searchfor, "<strong>");
                    cp = strstr(buffer, searchfor);
                    if (!cp) continue;

                    if (strncmp(cp + strlen(searchfor), "co1", 3) == 0 ||
                        strncmp(cp + strlen(searchfor), "co2", 3) == 0)
                    {
                        got_code(cp + strlen(searchfor));
                    }
                }
            }

            fclose(html);   // only close the stream if it was actually opened
        }

        return 0;
    }

    void got_code(char *html)
    {
        char    code[8];
        char    *endtag;
        struct  _code_st    *currcode;
        int i;

        // ignore the match unless the closing tag is in the same buffer
        endtag = strstr(html, "</strong>");
        if (!endtag) return;

        // codes are at most 7 characters long
        sprintf(code, "%.7s", html);

        // skip codes we have already stored
        for (i = 0; i < data.codes; i++)
            if (strcasecmp(data.code[i].code, code) == 0)
                return;

        add_to_list(currcode, _code_st, data.code, data.codes);
        currcode->code = strdup(code);

        printf("code: %s\n", code);
    }

The above doesn't work properly. I get a lot of the codes I'm interested in but, as I mention above, if the buffer cuts off at the wrong spots I miss some.

I did try reading the entire chunk of HTML I'm interested in into a single string, but I wasn't able to figure out how to cycle through it. I couldn't get any codes displayed.
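For what it's worth, cycling through one big string mostly comes down to calling strstr() repeatedly and restarting each search where the previous match ended. The following is only a rough sketch of that idea, not the code from this post: slurp_file(), next_block() and print_blocks() are hypothetical helper names, and it assumes the whole page fits comfortably in memory.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read a whole file into a NUL-terminated heap buffer.
 * Returns NULL on failure; the caller frees the result. */
char *slurp_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    char *buf = NULL;
    long size;

    if (!fp) return NULL;
    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0) {
        rewind(fp);
        buf = malloc((size_t)size + 1);
        if (buf) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0';
        }
    }
    fclose(fp);
    return buf;
}

/* Hypothetical helper: find the next open...close pair at or after `from`.
 * Returns a pointer to the text just after `open` and stores its length
 * in *len, or returns NULL when there is no further complete block. */
const char *next_block(const char *from, const char *open,
                       const char *close, size_t *len)
{
    const char *start = strstr(from, open);
    const char *end;

    if (!start) return NULL;
    start += strlen(open);
    end = strstr(start, close);
    if (!end) return NULL;

    *len = (size_t)(end - start);
    return start;
}

/* Print the contents of every <strong>...</strong> block in a file.
 * Returns the number of blocks found, or -1 if the file can't be read. */
int print_blocks(const char *path)
{
    char *html = slurp_file(path);
    const char *p = html;
    size_t len;
    int count = 0;

    if (!html) return -1;
    while ((p = next_block(p, "<strong>", "</strong>", &len)) != NULL) {
        printf("code: %.*s\n", (int)len, p);
        p += len;   /* resume searching after this block's contents */
        count++;
    }
    free(html);
    return count;
}
```

Because the whole page is in one buffer, no match can be split across two reads, which is the failure mode of the fgets() version.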

Does anyone know how I can solve this issue?

EDIT: I've been thinking about this more. Is there any way I can look ahead in the file, search for the end of each 'block' of text I'm parsing, and set the buffer size to match before I read it? Would I need another file pointer to the same file? This would (hopefully) prevent the problem of the buffer cutting off at inconvenient places.
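As far as I can tell, a second file pointer isn't needed for this kind of look-ahead: ftell() can remember where a block starts and fseek() can jump back there once the end has been found. Here is a minimal sketch under those assumptions. read_through() is a hypothetical name, and the stream is assumed to be seekable (a regular file, not a pipe).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return everything from the current file position up
 * to and including the next occurrence of `marker`, in a heap buffer sized
 * exactly for that block. Returns NULL if the marker is not found before
 * EOF or on error; the caller frees the result. */
char *read_through(FILE *fp, const char *marker)
{
    size_t m = strlen(marker), seen = 0;
    long start = ftell(fp), end;
    char *tail, *block = NULL;
    int c, found = 0;

    if (start < 0 || m == 0) return NULL;
    tail = malloc(m);
    if (!tail) return NULL;

    /* Pass 1: look ahead, keeping the last m bytes read in `tail`. */
    while (!found && (c = fgetc(fp)) != EOF) {
        if (seen < m) {
            tail[seen++] = (char)c;
        } else {
            memmove(tail, tail + 1, m - 1);
            tail[m - 1] = (char)c;
        }
        found = (seen == m && memcmp(tail, marker, m) == 0);
    }
    free(tail);
    end = ftell(fp);

    /* Pass 2: seek back to the start and read the block in one go. */
    if (found && end >= start && fseek(fp, start, SEEK_SET) == 0) {
        size_t n = (size_t)(end - start);
        block = malloc(n + 1);
        if (block) {
            size_t got = fread(block, 1, n, fp);
            block[got] = '\0';
        }
    }
    return block;
}
```

Calling read_through(fp, "</strong>") in a loop would hand back one complete block per call, however long it is. On POSIX systems getline() offers a simpler route when the blocks of interest all sit on one line, since it grows its buffer to fit.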

Okay, after much banging of my head against the wall trying to come up with a way to make my above code work, I decided to try a slightly different approach.

Since I knew the data on the page I'm scraping is contained on one huge line, I changed my code to search through the file until it found that line. Then I progress down the line looking for the blocks I wanted. This worked surprisingly well, and once I had the code reading most of the blocks, it was easy to make minor modifications to account for inconsistencies in the HTML. The part that took the longest was figuring out how to bail out once I reached the end of the line, and I solved that by peeking ahead to make sure there was another block to read.

Here is my code (which is ugly but functional):

    #define _GNU_SOURCE      /* getline() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>     /* strcasecmp */
    #include <sys/types.h>   /* ssize_t */

    #define html_page "codes.html"
    #define start_block "<strong>"
    #define end_block "</strong>"

    /* data, _code_st and add_to_list() are defined elsewhere in the program */

    void got_code(char *html);

    int extract()
    {
        FILE *html;
        int found = 0;
        char *line = NULL, *endtag, *starttag;
        size_t len = 0;
        ssize_t read;

        html = fopen(html_page, "r");

        if (html)
        {
            while ((read = getline(&line, &len, html)) != -1)
            {
                if (found) // found the line with the codes we are interested in
                {
                    char *ptr = line;

                    while (ptr != NULL)
                    {
                        starttag = strstr(ptr, start_block);
                        if (!starttag) break;   // no more blocks on this line

                        if (strncmp(starttag + strlen(start_block), "co1", 3) == 0 ||
                            strncmp(starttag + strlen(start_block), "co2", 3) == 0)
                            got_code(starttag + strlen(start_block));

                        // move past this block's end tag; stop if it is missing
                        ptr = strstr(starttag, end_block);
                        if (!ptr) break;
                        ptr += strlen(end_block);

                        // look ahead to make sure we have more to pull out
                        endtag = strstr(ptr, end_block);
                        if (!endtag) break;
                    }

                    found = 0;
                    break;
                }

                // find the section of the downloaded page we care about;
                // the next line read will be the blob containing the html we want
                if (strstr(line, "wiki-content") != NULL)
                {
                    found = 1;
                }
            }

            free(line);   // getline() allocates the buffer
            fclose(html);
        }

        return 0;
    }

    void got_code(char *html)
    {
        char    code[8];
        char    *endtag;
        struct  _code_st    *currcode;
        int i;

        endtag = strstr(html, "</strong>");
        if (!endtag) return;

        // codes are at most 7 characters long
        sprintf(code, "%.7s", html);

        // skip codes we have already stored
        for (i = 0; i < data.codes; i++)
            if (strcasecmp(data.code[i].code, code) == 0)
                return;

        add_to_list(currcode, _code_st, data.code, data.codes);
        currcode->code = strdup(code);

        printf("code: %s\n", code);
    }

It's not as elegant or robust as my C# program, but at least it pulls all the information I want.

c web-scraping
