Monday, 15 March 2010

Process UTF-8 characters in C from a text file




I need to read UTF-8 characters from a text file and process them, for instance to calculate the frequency of occurrence of a certain character. Ordinary characters are fine. The problem occurs with characters like ü or ğ. The following code checks whether a certain character occurs by comparing the ASCII code of the incoming character:

FILE *fin;
FILE *fout;
wint_t c;  /* wint_t rather than wchar_t, so that WEOF can be represented */
fin = fopen("input.txt", "r");
fout = fopen("out.txt", "w");
int frequency = 0;
while ((c = fgetwc(fin)) != WEOF) {
    if (c == some_number) {
        frequency++;
    }
}

I can't figure out what some_number should be for these characters. In fact, those characters print out as 5 different numbers when I try printing them as decimal. For the character 'a', for illustration, I can write: if(c == 97){ frequency++; } since the ASCII code of 'a' is 97. Is there any way to identify special characters in C?

P.S. Working with ordinary char (not wchar_t) creates the same problem, except that this time printing the decimal equivalent of the incoming character prints 5 different negative numbers for the special characters. The problem stands.

A modern C platform should provide everything you need for such a task.

The first thing you have to do is make sure that your program runs under a locale that can handle UTF-8. Your environment should already be set up for that, so the only thing you have to do in your code is

setlocale(LC_ALL, "");

to switch from the "C" locale to the native environment.
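As a rough illustration of how the counting loop from the question could look once the locale is set, here is a minimal sketch. The helper name count_wide_char is made up for this example, and it assumes the caller has already called setlocale(LC_ALL, "") under a UTF-8 environment and that the stream has not been used with byte-oriented functions.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count how often `target` occurs in the stream `in`.
   Assumes setlocale(LC_ALL, "") was called beforehand, so that fgetwc
   decodes the file according to the environment's encoding. */
long count_wide_char(FILE *in, wchar_t target)
{
    long frequency = 0;
    wint_t c;   /* wint_t, not wchar_t, so that WEOF is representable */

    while ((c = fgetwc(in)) != WEOF) {
        if (c == (wint_t)target)
            frequency++;
    }
    return frequency;
}
```

A caller would open the file with fopen("input.txt", "r") and pass it in together with a wide character constant such as L'a'. Note that fgetwc also returns WEOF on a decoding error, so a more careful version might check ferror and errno afterwards.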

Then you can read strings as usual with fgets, for example. To do comparisons with accented characters and the like, you'd have to convert such a string to a wide character string (mbsrtowcs), as you mention. The encoding of such wide characters is implementation-defined, but you don't need to know the encoding to do such checks.
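The conversion step might be sketched as follows. The function name count_in_line is hypothetical, and the sketch assumes the line was read with fgets and that the current locale matches the file's encoding.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count occurrences of `target` in the multibyte
   string `line`, interpreted according to the current locale.
   Returns -1 if the string is invalid in that locale or memory runs out. */
long count_in_line(const char *line, wchar_t target)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);

    /* First pass with a NULL destination only measures the length. */
    const char *src = line;
    size_t len = mbsrtowcs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return -1;

    wchar_t *wide = malloc((len + 1) * sizeof *wide);
    if (wide == NULL)
        return -1;

    /* Second pass does the actual conversion, from a fresh state. */
    src = line;
    memset(&state, 0, sizeof state);
    mbsrtowcs(wide, &src, len + 1, &state);

    long count = 0;
    for (size_t i = 0; i < len; i++)
        if (wide[i] == target)
            count++;

    free(wide);
    return count;
}
```

Converting in two passes avoids guessing a buffer size; for short lines a fixed-size wchar_t array would work just as well.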

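Usually L'ä' should work, as long as the platform on which you compile and execute is not completely screwed up. If you need codes that you can't type on your keyboard, you can use the L'\uXXXX' notation from the C standard (C99 and later), as didierc mentions in his answer. (L'\uXXXX' covers the "basic" characters; if you have something weird you'd use L'\UXXXXXXXX', with a capital U and 8 hex digits.)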
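For the two characters from the question, such a comparison could look like the sketch below. The function name is_special is invented for this example; U+00FC and U+011F are the Unicode code points of ü and ğ.

```c
#include <wchar.h>

/* Hypothetical predicate: is `c` one of the two characters from the
   question? Universal character names keep the source file pure ASCII,
   so the result does not depend on the encoding of the source file. */
int is_special(wchar_t c)
{
    return c == L'\u00fc'    /* ü, U+00FC */
        || c == L'\u011f';   /* ğ, U+011F */
}
```

The same constants can be used directly in place of some_number in the loop from the question.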

As said, the encoding for wide characters is implementation-defined, but chances are that it is either UTF-16 or UTF-32, which you can check with sizeof(wchar_t) and the predefined macro __STDC_ISO_10646__. Even if your platform only supports UTF-16 (which may have 2-word "characters"), the use case you describe shouldn't cause problems, since all your characters can be coded in the L'\uXXXX' form.
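One way to perform that check is sketched here. The two function names are made up for illustration; only sizeof(wchar_t), CHAR_BIT and __STDC_ISO_10646__ come from the language and the standard library.

```c
#include <limits.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t values are ISO 10646 (Unicode) code points, 0 otherwise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: width of wchar_t in bits. A 16-bit wchar_t
   suggests UTF-16, a 32-bit one suggests UTF-32. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}
```

A program could print both values once at startup to see what the platform provides. If the macro is not defined, nothing is guaranteed about the encoding, although it may still be Unicode in practice.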

c file input utf-8
