Overview
Apache FOP treats a surrogate pair as two separate code points. Most methods accept or return a single char (e.g. Typeface.mapChar(char c)), which means they can only deal with BMP characters (<= 0xFFFF).
In order to handle non-BMP characters correctly (e.g. emoji, mathematical symbols, ancient scripts, CJK extensions), Apache FOP should deal with int rather than char. A single int can represent the whole Unicode range, while a single UTF-16 char cannot.
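To make the difference concrete, here is a minimal sketch that uses only standard java.lang calls. The class name SurrogateDemo and the choice of U+1F600 as the sample code point are assumptions made for illustration; nothing here is Apache FOP code.

```java
// Illustrative sketch (not FOP code): a non-BMP code point needs two
// UTF-16 chars in a String but fits in a single int.
class SurrogateDemo {
    // U+1F600 lies outside the BMP (it is greater than 0xFFFF).
    static final int CODE_POINT = 0x1F600;

    // Builds the UTF-16 representation of the code point.
    static String asString() {
        return new String(Character.toChars(CODE_POINT));
    }

    public static void main(String[] args) {
        String s = asString();
        // Number of UTF-16 chars: prints 2
        System.out.println(s.length());
        // Number of code points: prints 1
        System.out.println(s.codePointCount(0, s.length()));
        // The two chars are the high and low surrogate of the pair
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        System.out.println(Character.isLowSurrogate(s.charAt(1)));
    }
}
```

A method that takes a single char can only ever see one half of this pair, which is why the int-based signatures are needed.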
These are the main aspects of this modification:
- Read the non-BMP glyphs from the font
- Make the API use int instead of char
- Convert surrogate pairs to a single int
- Adapt the renderer
Read the non-BMP glyphs from the font
The glyph information is stored in one of the font's cmap tables. The implemented one is:
- PlatformID: 3 (Microsoft)
- EncodingID: 10 (Unicode UCS-4)
- CMAP format: 12 (Formats 8, 10, and 12, 13, and 14 are used for mixed 16/32-bit and pure 32-bit mappings. This supports text encoded with surrogates in Unicode 2.0 and later)
Apple TrueType Reference Manual: (https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html)
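A format 12 subtable is a sorted list of groups, each mapping a contiguous range of character codes to consecutive glyph IDs. The following is a minimal sketch of such a lookup. The class Cmap12Lookup, its Group type, and the sample ranges are hypothetical and are not taken from the FOP font reader.

```java
// Illustrative sketch of a cmap format 12 lookup; all names are hypothetical.
class Cmap12Lookup {
    // One sequential map group: character codes start..end (inclusive)
    // map to consecutive glyph IDs beginning at startGlyphId.
    static final class Group {
        final int startCharCode;
        final int endCharCode;
        final int startGlyphId;

        Group(int startCharCode, int endCharCode, int startGlyphId) {
            this.startCharCode = startCharCode;
            this.endCharCode = endCharCode;
            this.startGlyphId = startGlyphId;
        }
    }

    // Groups are assumed to be sorted by startCharCode and non-overlapping.
    private final Group[] groups;

    Cmap12Lookup(Group[] groups) {
        this.groups = groups;
    }

    // Binary search over the groups; returns 0 (the missing glyph) if unmapped.
    int glyphIndex(int codePoint) {
        int lo = 0;
        int hi = groups.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Group g = groups[mid];
            if (codePoint < g.startCharCode) {
                hi = mid - 1;
            } else if (codePoint > g.endCharCode) {
                lo = mid + 1;
            } else {
                return g.startGlyphId + (codePoint - g.startCharCode);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Cmap12Lookup cmap = new Cmap12Lookup(new Group[] {
            new Group(0x41, 0x5A, 10),        // made-up BMP range
            new Group(0x1F600, 0x1F64F, 500)  // made-up non-BMP range
        });
        System.out.println(cmap.glyphIndex(0x1F601));
    }
}
```

Because the character codes in the groups are 32-bit values, the lookup key has to be an int; a char could never reach the second group.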
Make the API use int instead of char
Such a modification would mainly affect the Font class hierarchy. The Typeface class is one of the base classes of that hierarchy and is one of the classes that should be modified. However, it has (as of September 2016) approximately 27 subclasses/implementations, which would make the scope of the modification very large.
Since not all the font classes are supposed to deal with non-BMP code points, it is possible to narrow the scope of the modification down to a smaller number of classes. This is meant to be just a first step that provides at least one working path for handling surrogate pairs.
The class identified as a good starting point is CIDFont, as CID fonts were designed to handle huge character sets. From the Adobe documentation: "CID fonts are a new format of composite (multibyte) Type 1 fonts that better address the requirements of Far East markets. Adobe developed the CID-keyed font file format to support large character set fonts..." (src: http://www.adobe.com/products/postscript/pdfs/cid.pdf).
FontMetrics
  Typeface
    SystemFontMetricsMapper
    LazyFont
    AFPFont
      RasterFont
      AbstractOutlineFont
        DoubleByteFont
        OutlineFont
          AFPTrueTypeFont (in AFPFontConfig)
    Base14Font
      Helvetica
      Symbol
      HelveticaBoldOblique
      HelveticaOblique
      HelveticaBold
      ZapfDingbats
      Courier
      CourierBold
      TimesBold
      TimesBoldItalic
      TimesItalic
      TimesRoman
      CourierOblique
      CourierBoldOblique
    CustomFontMetricsMapper
    CustomFont
      SingleByteFont
      CIDFont <------------------------ + hasCodePoint(int):boolean
                                        + mapCodePoint(int):int
        MultiByteFont

CIDSet (Used by CIDFont) <------------- ~ getUnicode(int):char -> getUnicode(int):int
                                        + mapCodePoint(int, int):int
  CIDSubset
  CIDFull

Font <--------------------------------- + hasCodePoint(int):boolean
                                        + mapCodePoint(int):int
getUnicode()
: is defined in CIDSet (it is not a member of the Typeface class or of one of its subclasses). I changed the signature of this method to handle int instead of char because it is semantically incorrect to represent a Unicode code point with a single UTF-16 char. As you can see from the CIDSet hierarchy, the change affects only 3 classes.
getUnicodeFromGID()
: this method is defined in CustomFont and CIDSet. It never gets called from the MultiByteFont path, probably because getUnicode is used instead. That is why I'm downcasting the return value from int to char in CIDFull and CIDSubset. Probably the best thing to do would be to get rid of this method or make it handle int, but again the change would affect more classes than the ones in scope.
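The downcast is only lossless while the value stays inside the BMP. The short sketch below shows what a narrowing cast from int to char does to a larger value. The helper name narrow is hypothetical and the class is not part of FOP.

```java
// Illustrative sketch (not FOP code): effect of narrowing an int code point to char.
class DowncastDemo {
    // A narrowing cast keeps only the low 16 bits of the value.
    static char narrow(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        // BMP code point: the value survives the cast unchanged.
        System.out.println(Integer.toHexString(narrow(0x20AC)));
        // Non-BMP code point: the upper bits are silently dropped.
        System.out.println(Integer.toHexString(narrow(0x1F600)));
    }
}
```

This is acceptable only as long as the method is never reached with a non-BMP value, which is the assumption the note above relies on.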
Convert surrogate pairs to a single int
The data arrives as a String, and non-BMP characters are represented as surrogate pairs. Every time an operation is performed on the data (e.g. Font.mapCodePoint(int)), surrogate pairs should be converted to the corresponding code point.
The current implementation makes this conversion inside the for loops used to process the data:
    for (int i = 0; i < text.length(); i++) {
        int cp = text.charAt(i);
        // Throws an exception if the surrogate pair is ill-formed
        if (CharUtilities.containsSurrogatePairAt(text, i)) {
            cp = Character.toCodePoint((char) cp, text.charAt(++i));
        }
        [...]
    }
or
    for (int i = 0; i < text.length(); i++) {
        // Java API: does NOT throw an error if the surrogate pair is ill-formed
        int cp = text.codePointAt(i);
        i += CharUtilities.incrementIfNonBMP(cp);
        [...]
    }
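The practical difference between the two loops is how they treat an unpaired surrogate. The sketch below contrasts the two behaviours using standard Java calls only. The class SurrogateDecoding and its strict method are hypothetical stand-ins written for this illustration; they are not the FOP CharUtilities implementation.

```java
// Illustrative sketch (not FOP code): lenient versus strict surrogate decoding.
class SurrogateDecoding {
    // Lenient: String.codePointAt returns the lone surrogate value itself
    // when the char at index i is not part of a well-formed pair.
    static int lenient(String text, int i) {
        return text.codePointAt(i);
    }

    // Strict: rejects any surrogate that is not part of a well-formed pair.
    // The caller is expected to skip the low surrogate after a successful
    // pair decode, as the loops above do.
    static int strict(String text, int i) {
        char c = text.charAt(i);
        if (Character.isHighSurrogate(c)) {
            if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                return Character.toCodePoint(c, text.charAt(i + 1));
            }
            throw new IllegalArgumentException("Unpaired high surrogate at index " + i);
        }
        if (Character.isLowSurrogate(c)) {
            throw new IllegalArgumentException("Unpaired low surrogate at index " + i);
        }
        return c;
    }

    public static void main(String[] args) {
        String wellFormed = "\uD83D\uDE00";
        String illFormed = "\uD83Dx";
        System.out.println(Integer.toHexString(strict(wellFormed, 0)));
        System.out.println(Integer.toHexString(lenient(illFormed, 0)));
    }
}
```

Which behaviour is preferable depends on whether malformed input should be reported or passed through to the font lookup, where it will typically map to the missing glyph.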
The best thing to do in the future is to implement something like the Java 8 API String.codePoints(), which allows you to iterate directly over a stream/array of code points and avoids boilerplate code.
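For reference, this is roughly what iteration looks like with the Java 8 method itself. It is a minimal sketch, assuming Java 8 or later is available; the class and method names are hypothetical.

```java
// Illustrative sketch (requires Java 8+): iterating code points without index bookkeeping.
class CodePointIteration {
    // Collects the code points of a String into an int array.
    static int[] toCodePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "a", U+1F600 (written as a surrogate pair), "b"
        String text = "a\uD83D\uDE00b";
        for (int cp : toCodePoints(text)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Note that codePoints() passes unpaired surrogates through unchanged, so it has the same lenient behaviour as codePointAt; a strict variant would still need its own check.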
Adapt the renderer
Every Apache FOP output format has its own way to represent data, which means that each renderer needs to be adapted to handle non-BMP code points.
The adapted renderers/painters are:
- PDFPainter
- PSPainter
- Java2DPainter
- Java2DRenderer