Overview

Apache FOP treats a surrogate pair as two separate code points. Most methods accept or return a single char (e.g. Typeface.mapChar(char c)), which means they can handle only BMP characters (<= 0xFFFF).

To handle non-BMP characters correctly (e.g. emoji, mathematical symbols, ancient scripts, CJK extensions), Apache FOP should work with int rather than char. A single int can represent the whole Unicode range, while a single UTF-16 char cannot.
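The difference between the two representations can be shown with the standard Java API alone. The snippet below is a minimal illustration and not FOP code; the class name CodePointDemo and its helper methods are hypothetical.

```java
final class CodePointDemo {

    /** Number of UTF-16 code units (Java chars) in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(charCount(emoji));       // counts the high and the low surrogate
        System.out.println(codePointCount(emoji));  // counts the single code point
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));
    }
}
```

Any API that hands this string over one char at a time sees two unrelated values, neither of which is the character itself.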

These are the main aspects of this modification:

  1. Read the non-BMP glyphs from the font
  2. Make the API use int instead of char
  3. Convert surrogate pairs to a single int
  4. Adapt the renderer

Read the non-BMP glyphs from the font

The glyph information is stored in one of the font's cmap subtables. The one implemented is:

  • PlatformID: 3 (Microsoft)
  • EncodingID: 10 (Unicode UCS-4)
  • CMAP format: 12 (Formats 8, 10, and 12, 13, and 14 are used for mixed 16/32-bit and pure 32-bit mappings. This supports text encoded with surrogates in Unicode 2.0 and later)

Apple TrueType Reference Manual: (https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html)
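A format 12 subtable is a sorted list of groups, each holding a start character code, an end character code and a start glyph ID; a code point inside a group maps to the start glyph ID plus its offset from the start character code. The sketch below shows only that lookup over groups that have already been read into arrays. It is not the FOP font reader: the class Cmap12Lookup and its fields are hypothetical names, and returning 0 (the missing glyph) for unmapped code points is an assumption of this example.

```java
final class Cmap12Lookup {

    // Parallel arrays, one entry per group, sorted by startCharCode.
    private final int[] startCharCode;
    private final int[] endCharCode;
    private final int[] startGlyphId;

    Cmap12Lookup(int[] startCharCode, int[] endCharCode, int[] startGlyphId) {
        this.startCharCode = startCharCode;
        this.endCharCode = endCharCode;
        this.startGlyphId = startGlyphId;
    }

    /** Returns the glyph index for the code point, or 0 if no group covers it. */
    int glyphIndex(int codePoint) {
        int lo = 0;
        int hi = startCharCode.length - 1;
        // Binary search for the group whose range contains the code point.
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < startCharCode[mid]) {
                hi = mid - 1;
            } else if (codePoint > endCharCode[mid]) {
                lo = mid + 1;
            } else {
                return startGlyphId[mid] + (codePoint - startCharCode[mid]);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Two made-up groups: A..Z and a block of emoji.
        Cmap12Lookup cmap = new Cmap12Lookup(
                new int[] {0x41, 0x1F600},
                new int[] {0x5A, 0x1F64F},
                new int[] {10, 500});
        System.out.println(cmap.glyphIndex(0x1F603));
    }
}
```

Because the character codes are 32-bit values, the lookup takes the code point as an int, which is what makes this format usable for non-BMP glyphs.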

Make the API use int instead of char

Such a modification would mainly affect the Font class hierarchy. The Typeface class is one of the base classes of that hierarchy and is one of the classes that should be modified. However, it has (as of September 2016) approximately 27 subclasses/implementations, which would make the scope of the modification very large.

Since not all the font classes are supposed to deal with non-BMP code points, it is possible to narrow the scope of the modification down to a smaller number of classes. This is meant as a first step that provides at least one working path for handling surrogate pairs.

The class identified as a good starting point is CIDFont, since CID fonts were designed to handle very large character sets. From the Adobe documentation: "CID fonts are a new format of composite (multibyte) Type 1 fonts that better address the requirements of Far East markets. Adobe developed the CID-keyed font file format to support large character set fonts..." (src: http://www.adobe.com/products/postscript/pdfs/cid.pdf).

FontMetrics
    Typeface
        SystemFontMetricsMapper
        LazyFont
        AFPFont
            RasterFont
            AbstractOutlineFont
                DoubleByteFont
                OutlineFont
                    AFPTrueTypeFont in AFPFontConfig
        Base14Font
            Helvetica
            Symbol
            HelveticaBoldOblique
            HelveticaOblique
            HelveticaBold
            ZapfDingbats
            Courier
            CourierBold
            TimesBold
            TimesBoldItalic
            TimesItalic
            TimesRoman
            CourierOblique
            CourierBoldOblique
        CustomFontMetricsMapper
        CustomFont
            SingleByteFont                + hasCodePoint(int):boolean
            CIDFont   <------------------ + mapCodePoint(int):int
                MultiByteFont

                                         ~ getUnicode(int):char -> getUnicode(int):int
CIDSet (Used by CIDFont) <-------------- + mapCodePoint(int, int):int
    CIDSubset
    CIDFull
                                          + hasCodePoint(int):boolean
Font <----------------------------------- + mapCodePoint(int):int

getUnicode(): this method is defined in CIDSet (it is not a member of the Typeface class or of any of its subclasses). I changed the signature of this method to return int instead of char, because a single UTF-16 char cannot represent every Unicode code point. As the CIDSet hierarchy shows, the change affects only 3 classes.

getUnicodeFromGID(): this method is defined in CustomFont and CIDSet. It is never called from the MultiByteFont path, probably because getUnicode is used instead. That is why I am down casting the return value from int to char in CIDFull and CIDSubset. The best solution would probably be to remove this method or make it handle int, but again the change would affect more classes than the ones in scope.
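The down cast is acceptable only as long as the path never carries a non-BMP value, because casting an int to char keeps just the low 16 bits. The following plain Java snippet illustrates the effect; DownCastDemo is a hypothetical class and not part of FOP.

```java
final class DownCastDemo {

    /** Narrowing cast from a code point to a UTF-16 char (keeps the low 16 bits). */
    static char downCast(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        // A BMP code point survives the cast unchanged.
        System.out.println(Integer.toHexString(downCast(0x00E9)));

        // A non-BMP code point is silently truncated to an unrelated BMP value.
        System.out.println(Integer.toHexString(downCast(0x1F600)));
    }
}
```

No exception is raised in the second case, so a truncated value would go unnoticed if the method were ever reached with a non-BMP code point.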

Convert surrogate pairs to a single int

The data arrives as a String, in which non-BMP characters are represented as surrogate pairs. Every time an operation is performed on the data (e.g. Font.mapCodePoint(int)), each surrogate pair should first be converted to the corresponding code point.

The current implementation performs this conversion inside the for loops that process the data:

for (int i = 0; i < text.length(); i++) {
    int cp = text.charAt(i);

    if (CharUtilities.containsSurrogatePairAt(text, i)) { // Throws an exception if the surrogate pair is ill-formed
        // Combine the high surrogate at i with the low surrogate at i + 1
        cp = Character.toCodePoint((char) cp, text.charAt(++i));
    }
    [...]
}

or

for (int i = 0; i < text.length(); i++) {
    int cp = text.codePointAt(i); // Java API, does NOT throw if the surrogate pair is ill-formed

    // Skip the low surrogate when the code point is outside the BMP
    i += CharUtilities.incrementIfNonBMP(cp);
    [...]
}

The best approach in the future would be to use something like the Java 8 API String.codePoints(), which lets you iterate directly over a stream/array of code points and avoids the boilerplate code.
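A minimal sketch of that approach, assuming Java 8 or later is available, could look like this. The class CodePointIteration and the method toCodePoints are hypothetical names used only for illustration. Note that codePoints() passes unpaired surrogates through unchanged instead of throwing, so it behaves like the codePointAt variant above with respect to ill-formed input.

```java
final class CodePointIteration {

    /** Converts the text to an array of code points, joining each surrogate pair. */
    static int[] toCodePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        String text = "A" + new String(Character.toChars(0x1F600)) + "B";

        // Each element is a full code point, so no index arithmetic is needed.
        for (int cp : toCodePoints(text)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

The loop body then receives one int per character, which matches the int-based methods such as Font.mapCodePoint(int).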

Adapt the Renderer

Every Apache FOP output format has its own way to represent data, which means that each renderer needs to be adapted to handle non-BMP code points.

The adapted renderers/painters are:

  • PDFPainter
  • PSPainter
  • Java2DPainter
  • Java2DRenderer