Process UTF-8 characters in C from a text file
I need to read UTF-8 characters from a text file and process them, for instance to calculate the frequency of occurrence of each character. Ordinary characters are fine; the problem occurs with characters like ü or ğ. The following code checks whether a character occurs by comparing the code of the incoming character:
    FILE *fin;
    FILE *fout;
    wint_t c;                       /* wint_t, so WEOF can be distinguished */
    fin = fopen("input.txt", "r");
    fout = fopen("out.txt", "w");
    int frequency = 0;
    while ((c = fgetwc(fin)) != WEOF) {
        if (c == SOME_NUMBER) {
            frequency++;
        }
    }
I can't figure out what to use as SOME_NUMBER for these characters. In fact those characters print out as 5 different numbers when I try to print them in decimal, whereas for example the character 'a' works as

    if (c == 97) { frequency++; }

since the ASCII code of 'a' is 97. Is there any way to identify these special characters in C?
P.S. Working with an ordinary char (not wchar_t) creates the same problem; this time, printing the decimal equivalent of the incoming character prints 5 different negative numbers for the special characters. So the problem stands.
A modern C platform should provide everything you need for such a task.

The first thing you have to make sure of is that your program runs under a locale that can handle UTF-8. Your environment should already be set up for that; the only thing you need in your code is

    setlocale(LC_ALL, "");

to switch from the default "C" locale to the native environment.
Then you can read strings with the usual fgets. For comparisons with accented characters and the like, you'd have to convert such a string to a wide character string (mbsrtowcs) as you mention. The encoding of such wide characters is implementation defined, but you don't need to know the encoding for these checks.
Usually L'ä' will work as long as the platform on which you compile and execute is not screwed up. If you need codes that you can't enter on your keyboard, you can use the L'\uXXXX' notation from C11 that didierc mentions in his answer. (L'\uXXXX' is for "basic" characters; if you have something weirder you'd use L'\UXXXXXXXX', with a capital U and 8 hex digits.)
As I said, the encoding of wide characters is implementation defined, but the chances are it is either UTF-16 or UTF-32; you can check with sizeof(wchar_t) and the predefined macro __STDC_ISO_10646__. If your platform uses UTF-16 (which may have 2-unit "characters"), the use case you describe shouldn't cause problems, since the characters in question can all be coded in the L'\uXXXX' form.
Tags: c, file, input, utf-8