unicode - PDF: Duplicate font names with different ToUnicode CMaps
I'm parsing a PDF file and extracting its text, and I've run into a situation where I encounter a font dictionary named "c2_0", which contains a CIDFont (Type 0) with a ToUnicode CMap. So far, no problem: I have tools to parse the ToUnicode CMap and map the 2-byte character codes to Unicode values.
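For readers who want to see what that mapping step involves, here is a minimal sketch, assuming the ToUnicode stream has already been decompressed into text. The function name `parse_tounicode` and the sample data are made up for illustration; it only reads `bfchar` and `bfrange` sections and ignores everything else in the CMap.

```python
import re

def _decode(hex_digits):
    # Destination values are UTF-16BE code units written as hex digits.
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar/bfrange sections."""
    mapping = {}

    # bfchar: pairs of <source code> <destination>.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            mapping[int(src, 16)] = _decode(dst)

    # bfrange: <lo> <hi> followed by either one <destination> or an array.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        tokens = re.findall(r"<[0-9A-Fa-f]+>|\[|\]", section)
        i = 0
        while i + 2 < len(tokens):
            lo = int(tokens[i][1:-1], 16)
            hi = int(tokens[i + 1][1:-1], 16)
            i += 2
            if tokens[i] == "[":
                # Array form: one explicit destination per code in the range.
                end = tokens.index("]", i)
                dsts = [t[1:-1] for t in tokens[i + 1:end]]
                i = end + 1
            else:
                # Single destination: increment it once per code in the range.
                base = tokens[i][1:-1]
                width = len(base)
                dsts = [format(int(base, 16) + k, "0%dx" % width)
                        for k in range(hi - lo + 1)]
                i += 1
            for code, dst in zip(range(lo, hi + 1), dsts):
                mapping[code] = _decode(dst)
    return mapping

# Hypothetical CMap fragment, not taken from the file in question.
sample = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
2 beginbfrange
<0025> <0027> <0042>
<0030> <0031> [<0058> <0059>]
endbfrange
"""
table = parse_tounicode(sample)
```

A 2-byte code read from a string operand can then be looked up with `table.get(code)`.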
But the PDF file later includes another font dictionary object, also called "c2_0", which contains a different ToUnicode CMap. I didn't know how I should handle the second CMap, so I guessed and combined the entries from both CMaps. That worked, and the text was extracted correctly.
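The combining step can be sketched roughly as follows, assuming each CMap has already been reduced to a plain dictionary from character code to Unicode string. The helper name `merge_tounicode` is hypothetical, and letting the second CMap win on a conflict is an arbitrary choice for the sketch, not something the PDF reference prescribes.

```python
def merge_tounicode(first, second):
    """Combine two code-to-Unicode tables and report disagreements."""
    merged = dict(first)
    conflicts = {}
    for code, text in second.items():
        if code in merged and merged[code] != text:
            # Both CMaps map this code, but to different text.
            conflicts[code] = (merged[code], text)
        merged[code] = text  # assumption: the later CMap takes precedence
    return merged, conflicts

# Hypothetical tables for illustration only.
first = {0x0003: " ", 0x0024: "A"}
second = {0x0024: "B", 0x0025: "C"}
merged, conflicts = merge_tounicode(first, second)
```

If `conflicts` comes back empty, the two CMaps cover disjoint or agreeing codes, which would explain why a plain merge extracts the text correctly.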
However, I can't find anything in the PDF reference manual that says this is allowed, or that addresses this situation. I would have thought duplicate font names would lead to unspecified behavior, or at the very least that the second would override the first. I tried combining them as a long-shot guess and was surprised that it worked.
Does anyone have experience with this? Does anyone know whether a PDF is allowed to have duplicate font names that refer to different objects with different CMaps, which "combine" when invoked by the Tf operator?
c2_0 is a symbolic name in the /Font resource dictionary and it has local scope: it is used only in the content stream that the resource dictionary belongs to. If c2_0 appears in several /Font resource dictionaries, that's not a problem. If you have two c2_0 entries in the same /Font resource dictionary, such as /c2_0 X 0 R /c2_0 Y 0 R, then you do have a problem, because the behavior is undefined and it is up to you how to handle the situation.

Symbolic name resolution works like this: if you are in a page content stream, you search for the font's symbolic name (the Tf operand) in the page's Resources dictionary. If you cannot locate it there, you go up the page tree and search the Resources dictionary (if one exists) of each parent page node. If you reach the top of the page tree without finding the font, the behavior is undefined.

At that point you can implement various fallback strategies: you can use a default font, you can search the resources included in the form XObjects on the page, or you can search the Resources dictionaries of other pages.
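The lookup described above can be sketched as follows. This is only an illustration: PDF objects are modelled as plain Python dictionaries with "Resources", "Font" and "Parent" keys already resolved from indirect references, and `resolve_font` is a made-up name rather than any library's API.

```python
def resolve_font(page, font_name):
    """Walk from the page up the page tree until font_name is found."""
    node = page
    while node is not None:
        # A node may have no Resources entry, or no Font subdictionary.
        fonts = node.get("Resources", {}).get("Font", {})
        if font_name in fonts:
            return fonts[font_name]
        node = node.get("Parent")  # the page tree root has no Parent
    # Not found anywhere: the caller decides on a fallback strategy.
    return None

# Hypothetical page tree: two pages that each define their own c2_0.
root = {"Resources": {"Font": {"F1": "font object 10"}}}
page1 = {"Parent": root, "Resources": {"Font": {"c2_0": "font object 20"}}}
page2 = {"Parent": root, "Resources": {"Font": {"c2_0": "font object 30"}}}

font_on_page1 = resolve_font(page1, "c2_0")
font_on_page2 = resolve_font(page2, "c2_0")
inherited = resolve_font(page2, "F1")
missing = resolve_font(page2, "c9_9")
```

Because the name is resolved per content stream, the same name on two pages yields two different font objects, each with its own ToUnicode CMap, so a parser would normally key its CMap cache on the resolved font object and not on the name alone.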