Tuesday, 15 June 2010

Repeated letters when reading Arabic PDF text with PDFBox




How can I avoid repeated characters when reading Arabic PDF text using PDFBox? For illustration: the word بهم is read as بهرم.

To make analysis of the issue easier, the OP provided a sample file and, in a comment, pointed to a word illustrating the issue:

The first word in the first line of the PDF file, counted from the left, has 3 letters, but the word becomes 4 letters in the output (there is an additional letter in the word!). Please check whether the word in the PDF stream has 3 letters or 4.

The first line of the PDF as displayed in Adobe Reader looks like this:

The first line of text as extracted by PDFBox with sorting activated is:

طعم الرضرا والسرعادة، ويحيرا حيراة طيبرة مرع غيرره، فيسرعد بهرم

which in Total Commander looks like this:

The first line copied and pasted from Adobe Reader results in:

طعم الرضرا والسرعادة، ويحيرا حيراة طيبرة مرع غيرره، فيسرعد بهرم

Thus, it looks like PDFBox and Adobe Reader agree on the contained text, at least to my eyes, which have no experience with Arabic writing. As Adobe Reader is quite good at text extraction, this is a hint that PDFBox extracts the text as the PDF claims it to be.

Looking at the innards of the PDF, we find this operation outputting the left-most part of the line (the glyphs are rendered left-to-right here):

[<000303e10467>-2<03ec03910003>-3<03a903cc>3<0467>-2<03b3>3<03f3>-3<03d3>] TJ

Thus, the left-most glyphs are represented by these codes:

0003 03e1 0467 03ec 0391 0003
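
As a rough illustration of how these codes fall out of the operation above, here is a minimal Java sketch that cuts the hex strings of the array into 4-digit codes. It assumes every glyph code is exactly two bytes long (which the `<0000> <ffff>` codespace range of the font's mapping suggests) and simply skips the numbers between the strings, which are only positioning adjustments. The class and method names are made up for this example; this is not how PDFBox itself parses content streams.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TjCodes {
    // Collects the hex strings of a TJ array and cuts each one into
    // 4-digit (2-byte) glyph codes. Assumption: all codes are 2 bytes long.
    static List<String> split(String tjArray) {
        List<String> codes = new ArrayList<>();
        Matcher m = Pattern.compile("<([0-9A-Fa-f]+)>").matcher(tjArray);
        while (m.find()) {
            String hex = m.group(1);
            for (int i = 0; i + 4 <= hex.length(); i += 4) {
                codes.add(hex.substring(i, i + 4).toLowerCase());
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        String tj = "[<000303e10467>-2<03ec03910003>-3<03a903cc>3<0467>-2<03b3>3<03f3>-3<03d3>]";
        // The first six codes are the left-most glyphs of the line.
        System.out.println(split(tj).subList(0, 6));
        // prints [0003, 03e1, 0467, 03ec, 0391, 0003]
    }
}
```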

Using the ToUnicode mapping of the font used here, these glyphs correspond to the following Unicode characters:

Glyph code  Unicode  Name   Character
0003        0020     space
03e1        0645     meem   م
0467        0631     reh    ر
03ec        0647     heh    ه
0391        0628     beh    ب
0003        0020     space

So, indeed, this seems to be a 4-letter word.

This furthermore matches the last characters of the first line of text as extracted by PDFBox:

0020 0628 0647 0631 0645 0020
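
The comparison can be reproduced with a few lines of Java. The sketch below hard-codes the five glyph-to-Unicode pairs listed earlier (it does not read them from the PDF), maps the glyph codes in their left-to-right drawing order, and then reverses the result, because the Arabic reading order runs right-to-left. The class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VisualToLogical {
    // Glyph code -> Unicode code point, copied by hand from the list above.
    static final Map<String, Integer> TO_UNICODE = new LinkedHashMap<>();
    static {
        TO_UNICODE.put("0003", 0x0020); // space
        TO_UNICODE.put("03e1", 0x0645); // meem
        TO_UNICODE.put("0467", 0x0631); // reh
        TO_UNICODE.put("03ec", 0x0647); // heh
        TO_UNICODE.put("0391", 0x0628); // beh
    }

    // Takes glyph codes in left-to-right drawing order and returns the
    // mapped code points in reading order (reversed), as hex text.
    static String toLogicalHex(String[] visualCodes) {
        StringBuilder sb = new StringBuilder();
        for (int i = visualCodes.length - 1; i >= 0; i--) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", TO_UNICODE.get(visualCodes[i])));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] visual = {"0003", "03e1", "0467", "03ec", "0391", "0003"};
        System.out.println(toLogicalHex(visual));
        // prints 0020 0628 0647 0631 0645 0020
    }
}
```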

At least in the case of the word pointed out by the OP, therefore, the information extracted by PDFBox is exactly what the PDF contents express. If the glyphs drawn according to the font program read differently, the PDF information is inconsistent.

PS: In a comment the OP asked for

the ToUnicode mapping of the font

It is this:

/CIDInit /ProcSet findresource begin
26 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0005> <0022>
endbfchar
2 beginbfrange
<000b> <000d> [<0029> <0028> <002a>]
<0010> <0012> <002d>
endbfrange
4 beginbfchar
<001d> <003a>
<003e> <005d>
<0040> <005b>
<00b1> <2013>
endbfchar
5 beginbfrange
<02ec> <02f4> [<060c> <0020> <061f> <0621> <0640> <064b> <0020> <0020> <0020>]
<02f5> <02f6> <064f>
<02f7> <02f8> [<0020> <0652>]
<0348> <0349> [<0020> <0647>]
<034a> <034b> <0651064f>
endbfrange
1 beginbfchar
<037f> <0631>
endbfchar
4 beginbfrange
<0381> <0385> [<0622> <062e> <0623> <0647> <0624>]
<0387> <038b> [<0625> <0645> <0020> <0626> <0626>]
<038d> <038d> [<0627>]
<038e> <038f> <0627>
endbfrange
1 beginbfchar
<0391> <0628>
endbfchar
2 beginbfrange
<0393> <0393> [<0629>]
<0394> <0395> <0629>
endbfrange
3 beginbfchar
<0397> <062a>
<0399> <062b>
<039b> <062b>
endbfchar
3 beginbfrange
<039d> <039f> [<062c> <062c> <062c>]
<03a1> <03a3> [<062d> <062d> <062d>]
<03a6> <03a7> [<062e> <062e>]
endbfrange
14 beginbfchar
<03a9> <062f>
<03ab> <0630>
<03ad> <0631>
<03af> <0632>
<03b1> <0633>
<03b3> <0633>
<03b5> <0634>
<03b7> <0634>
<03b9> <0635>
<03bb> <0635>
<03bd> <0636>
<03bf> <0636>
<03c1> <0637>
<03c5> <0638>
endbfchar
5 beginbfrange
<03c9> <03d1> [<0639> <0639> <0639> <0639> <0020> <0629> <063a> <063a> <0641>]
<03d3> <03d3> [<0641>]
<03d4> <03d5> <0641>
<03d7> <03d7> [<0642>]
<03d8> <03d9> <0642>
endbfrange
2 beginbfchar
<03db> <0643>
<03dd> <0644>
endbfchar
2 beginbfrange
<03df> <03df> [<0644>]
<03e0> <03e1> <0644>
endbfrange
3 beginbfchar
<03e3> <0645>
<03e5> <0646>
<03e7> <0646>
endbfchar
4 beginbfrange
<03e9> <03eb> [<0647> <0647> <0647>]
<03ec> <03ed> <0647>
<03ef> <03f3> [<0649> <0649> <0020> <0020> <064a>]
<03f5> <03fc> [<06440622> <0626> <06440623> <06440623> <06440625> <06440625> <06440627> <06440627>]
endbfrange
3 beginbfchar
<0467> <0631>
<0b09> <0650>
<0bcc> <0627064406440651064e>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
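
To see how individual glyph codes resolve against such a mapping, here is a small Java sketch of a `bfchar`/`bfrange` reader. It is a simplified illustration, not PDFBox's CMap parser: it assumes 2-byte codes, ignores the codespace range, and for the non-array `bfrange` form it simply increments the last UTF-16 code unit of the start value without any overflow handling. The class name and the `SAMPLE` excerpt (a few entries copied from the mapping above) are chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToUnicodeSketch {
    static final Pattern HEX = Pattern.compile("<([0-9A-Fa-f]+)>");
    static final Pattern CHAR = Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");
    static final Pattern RANGE = Pattern.compile(
            "<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>\\s*(?:<([0-9A-Fa-f]+)>|\\[([^\\]]*)\\])");
    static final Pattern SECTION =
            Pattern.compile("begin(bfchar|bfrange)(.*?)end\\1", Pattern.DOTALL);

    // A few entries copied from the mapping above.
    static final String SAMPLE =
            "1 beginbfchar\n<0391> <0628>\nendbfchar\n"
            + "2 beginbfrange\n<03df> <03df> [<0644>]\n<03e0> <03e1> <0644>\nendbfrange\n"
            + "2 beginbfrange\n<03e9> <03eb> [<0647> <0647> <0647>]\n<03ec> <03ed> <0647>\nendbfrange\n"
            + "1 beginbfchar\n<0467> <0631>\nendbfchar\n";

    // Decodes a hex string as UTF-16BE code units.
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    // Builds a glyph code -> text map from all bfchar and bfrange sections.
    static Map<Integer, String> parse(String cmap) {
        Map<Integer, String> map = new HashMap<>();
        Matcher section = SECTION.matcher(cmap);
        while (section.find()) {
            String body = section.group(2);
            if (section.group(1).equals("bfchar")) {
                Matcher m = CHAR.matcher(body);
                while (m.find()) {
                    map.put(Integer.parseInt(m.group(1), 16), decode(m.group(2)));
                }
            } else {
                parseRanges(body, map);
            }
        }
        return map;
    }

    static void parseRanges(String body, Map<Integer, String> map) {
        Matcher m = RANGE.matcher(body);
        while (m.find()) {
            int lo = Integer.parseInt(m.group(1), 16);
            int hi = Integer.parseInt(m.group(2), 16);
            if (m.group(3) != null) {
                // Single start value: each following code gets the next value.
                String base = decode(m.group(3));
                for (int code = lo; code <= hi; code++) {
                    char last = (char) (base.charAt(base.length() - 1) + (code - lo));
                    map.put(code, base.substring(0, base.length() - 1) + last);
                }
            } else {
                // Array form: one explicit value per code in the range.
                Matcher item = HEX.matcher(m.group(4));
                for (int code = lo; code <= hi && item.find(); code++) {
                    map.put(code, decode(item.group(1)));
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> map = parse(SAMPLE);
        for (int code : new int[] {0x03e1, 0x0467, 0x03ec, 0x0391}) {
            System.out.printf("%04x -> U+%04X%n", code, (int) map.get(code).charAt(0));
        }
    }
}
```

Note how 03e1 is not listed directly: it resolves to 0645 only because the range `<03e0> <03e1> <0644>` counts up from 0644, while 0467 is a plain `bfchar` entry pointing at 0631 (reh).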
