Automatic Hyphenation
FOP uses Frank Liang's famous pattern based algorithm for automated hyphenation of words. This page will describe
- The core algorithm for hyphenation, in an abstract way.
- Data structures used in an efficient implementation
- Refinements and embedding of the algorithm
- How to generate the patterns used in the algorithm
- Related problems
- Various notes and other work in the area of automated hyphenation and related problems.
Motivation
The simplest solution to automated hyphenation is to make a list of all possible hyphenations of all words of the language in question. With most languages sporting some ten thousand base forms, which easily explode into several hundred thousand or a few million inflected and derived forms, this may be impratical, even if clever tricks for compact storage are used.
The other extreme, dissecting a given word algorithmically, may be difficult to formulate and slow too.
Frank Liang's idea is cool because it breaks out of the box in several ways:
- use patterns which have no linguistical justification
- learn these patterns automatically
- don't strive for perfection, good enough is good enough
The resulting algorithm works surprisingly well not only for english but for a wide range of languages, even if they pose problems like arbitrary word compositions. It is also fast, not all that hard to implement efficiently, and doesn't eat lots and lots of memory. Especially the latter was important in the early 1980's, and should be a virtue even today. Of course, it's not perfect. There is a trade-off between missing possible hyphenation points and spurious hyphenations. More than one news paper noticed 'funny hyphenations' for words which were outside the scope of the people which provided the patterns for standard typesetting software. The original patgen program allowed to weight words and specific hyphenation position, so that the resulting patterns would always hyphenate frequently used words correctly, while rare words might miss some positions or even get a wrong hyphenation position. There was also already a short "exception list" for words which are reasonably common but were incorrectly handled by the patterns. Apparently removing these words from the corpus used for pattern generation reduced the number of patterns significantly.
A few more problems:
- Changes in spelling due to hyphenation. Famous examples are the ck -> k-k transformation in german (classic spelling) and a few other languages: strecken -> strek-ken, and consonant resurrection in certain composite nouns in german (classic spelling): Schiffahrt -> Schiff-fahrt.
- Different hyphenations in homonyms due to different semantics. Famous example, also from german: Wachstube, either Wach-stube (guard's room) or Wachs-tube (wax tube).
Fortunately, these kinds of words are rare enough. In particular, the german spelling reform apparently placed some emphasis on getting rid of the irregularities which complicate computer programs.
The Core Algorithm
The algorithm works roughly as follows:
setup the hyphenation position vector (fill with zeros) for each substring of the word lookup the substring in the pattern store if you got a result // merge the result into the hyphenation position vector for each element of the result if the element is greater than the corresponding element in the hyphenation position vector then replace the element in the hyphenation position vector by the element from the lookup result end if end for end if end for
After this, odd integers in the hyphenation position vector mark valid hyphenation points, while even integers (including zero) denote positions unavailable for hyphenation.
Data Details
The hyphenation position vector is an array of integers with one element more than the number of characters in the word which is hyphenated. Each element in the hyphenation position vector represents a potential hyphenation position. The first element in the hyphenation position vector represents the position before the first character in the word, the secont element represents the position between the first and the second character of the word and so on, and the last element represents the position after the last character of the word. The first and last element in the hyphenation position vector are already there for the sake of an efficient implementation.
The pattern store maps strings (the patterns) to vectors of integers, which have one element more then the number of characters in the associated pattern. The integer values in these vectors are called weights. Similarly to the pattern hyphenation vector, the first element represents the weight for the position before the first character of the pattern etc.
A sample pattern store (taken from the english LaTeX hyphenation patterns):
+---+---+ +---+---+---+ | a | m | -> | 2 | 2 | 0 | +---+---+ +---+---+---+ +---+---+---+---+ +---+---+---+---+---+ | a | r | a | m | -> | 0 | 0 | 0 | 3 | 0 | +---+---+---+---+ +---+---+---+---+---+
Merging The Pattern Weights
Let's say we want to hyphenate the word "parameter". First setup the hyphenation position vector for the word:
p a r a m e t e r +---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+
Now we scan the substrings of the word starting with the first character: p, pa, par, para, param, parame, ... no hits, continue with the substrings starting at the second character: a, ar, ara, aram ... hit! Merge it into the hyphenation position vector:
p a r a m e t e r +---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+ a r a m +---+---+---+---+---+ | 0 | 0 | 0 | 3 | 0 | +---+---+---+---+---+ | v +---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+
Scanning further, we find a second hit:
p a r a m e t e r +---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+ a m +---+---+---+ | 2 | 2 | 0 | +---+---+---+ | x v x +---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+
The result is a possible hyphenation as para-meter, indicated by the weight 3 at the 5th position, while all the zeroes and the weight 2 at position 4 inhibit hyphenation.
The order of substring matching doesn't matter, scanning the substrings e.g. by substring length rather than by start position leads to the same result.
The example also demonstrates why the hyphenation position vector extends left and right beyond the boundaries of the word: patters may have weights associated with the position before or after the end of the pattern, and extending the hyphenation position vector saves explicit testing whether such a pattern matches at the beginning or the end of the word. The positions are clipped after the weights of all matching patterns have been merged, which is much more efficient.
Anchoring Patterns
Like for regular expressions, it may be useful to anchor hyphenation patters to the beginning or the end of a word. The common, established way is to use sentinel characters, conventionally a dot. The core algorithm isn't changed at all, there are just the sentinel characters added to the word.
Here a somewhat fictious example, which uses a high weight to force a hyphenation of a "pa-ra" prefix:
. p a r a m e t e r . +---+---+---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+---+---+ . p a r a +---+---+---+---+---+---+ | 0 | 0 | 0 | 7 | 0 | 0 | +---+---+---+---+---+---+ | v +---+---+---+---+---+---+---+---+---+---+---+---+ | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+---+---+---+---+---+---+---+
Of course, this requires that the sentinel character never occurs in a word.
External representation Of Patterns
Liang and Knuth also invented a convenient, compact representation to write down patterns. It basically inserts the non-zero weights as single digits between the characters of the patterns, and uses the dot for anchoring a pattern to the begin or the end of the word. The zero weights are implicit and are not represented.
The patterns used so far can be written as
ara3m 2a2m .pa7ra
A (fictious) pattern anchored at the end of the word would be
ete8r.
The representation has never been standardized, and obvious extensions (multi-digit weights, non-ASCII characters) are likely to create some confusion. Here an attempt at formalization in form of an EBNF grammar:
pattern ::= [start-of-word-anchor character] { [digit]? character }+ [digit] [character end-of-word-anchor] digit ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" start-of-word-anchor ::= "." end-of-word-anchor ::= "." character ::= "any Unicode character except a dot"
Note that this grammar allows a pattern like "a44b" which should probably be interpreted as a pattern string "a4b" with a weight 4 for the position between the a and the 4.
Overlapping Patterns
Well, what is actually meant is: is it useful to have patters where a substring from the beginning of the pattern is also a pattern? In other words, is it useful to have e.g. both "para" and "param" as keys in the pattern storage? This has implications for the implementation of the pattern storage (see below). It seems the original patgen program wont generate such pattern sets (see further below)
Efficient Pattern Storage Implementation
Tries, Trie Implementations and Derived Data Structures
The pattern storage can be basically any data structure which can map strings to integer arrays, for example a hash map. However, a word has O( length(word)^2 ) substrings, and the word "parameter" requires 66 lookups (remember the sentinel characters, which count too). On the other hand, matches are rare. There are some obvious ways to reduce the number of lookups. For example, if there are no strings starting with "para" in the map, we don't have to look up "param", "parame", "paramet" etc.
A data structure specifically designed for storing a mapping of strings to arbitrary values is the trie (Wikipedia link). While tries look like trees, the key strings are not stored in the tree nodes; rather, a key string is represented by the path from the tree root to a leaf. An example of a trie storing the strings "am", "aram" and ".para":
root -*- a -*- m -> leaf | | | +- r -*- a -*- m -> leaf | +- . -*- p -*- a -*- r -*- a -> leaf
The nodes are represented as a "*".
There are multiple ways to implement the concept of a trie. In the original implementation by Liang and Knuth in TeX, a trie node is basically a sorted array of bytes which holds the downpath characters and a corresponding array of C pointers to the corresponding nodes. The implementors faced the usual tradeoff between space and access time and between insert time and find time complexity. In natural languages, there is a high correlation between successive characters in words. Some of the trie node character arrays would be quite full, in particular near the root, but further down the tree node usually forks only into a few nodes further downpath.
The FOP implementation has been called "ternary tree", and a trie node here is an ordinary binary tree itself, with the characters of the trie node as keys. The FOP ternary tree merges both the pointer structure of the trie and the binary tree of the trie nodes. BTW there are no mechanisms to keep the binary trees of the trie nodes balanced. The FOP hyphenation compiler first inserts the patterns as they come from the file (which leads to grossly unbalanced trees), then the data is copied using a median strategy, which generates balanced binary trees for the trie nodes. Finally, in order to make the confusion complete, the FOP ternary tree doesn't use Java objects to represent the tree nodes but integer and character arrays.
As already noted, when storing substrings of natural language words in a trie, the trie nodes further down are likely sparsely populated. The extreme case, only one child node, occurs quite often. In the example above, there are only two nodes with more than a child: the root node and the node below the "a" arc from the root node. It may seem wastful, both in terms of storage and possibly lookup time (pointer dereference and perhaps the non-locality of memory access), to use a full blown node structure in this special case. A natural optimization would be to store sub-tries with only one leaf or intermediate sequences of trie nodes with only one child in a more compact form. This leads to PATRICIA trees Wikipedia link and other specializations.
Search FreshMeat for sample implementations of tries. Google will turn up some more.
Alphabet Transformation
The hyphenation patterns for a certain language usually use only a small subset of all Unicode characters, often from a single Unicode block. Storing the characters 1:1 uses a Java char (16 bit), which may seem to be a waste. The usual advice is to map the necessary characters onto small numbers which will fit into a byte. This transformation must be implemented both for generating patterns (or reading/compiling human readable patterns into the pattern storage) and for the lookup. Fortunately, the transformation can be conveniently combined with the case normalization step (see below).
Weight Compression
Conventionally, weights in hyphenation patterns are limited to single digits. Some experiments indicate that high weigths are rare and there is not much to gain by allowing arbitrary high weights. Again, storing weights as Java integer (32 bit) or even as byte may seem wasteful. Many implementations store weights as packed BCD, two weights in a single byte, which already halves the storage requirements. Unpacking takes place while merging the weight vector into the hyphenation position vector. There is no noticable speed penalty due to this unpacking.
Also, most weights are zero. It would be an idea to use huffmann compression, run length encoding of sequences of zero or some other more advanced compression method to save even more space. AFAIK nobody explored these ideas yet.
Refinements And Embedding Of the Algorithm
In order to use the algorithm above, some more surrounding code is necessary.
- The text has to be parsed into words. A robust algorithm would cooperate with established line breaking algorithms (UAX 14) and would not be confused by punctuation, numbers and words from other languages (aka "funny characters").
- The parsed words have to be normalized (Unicode normalization).
- Filter out soft hyphens (U+00AD) as well as joiners (U+200D) and non-joiners (U+200C), which might have been inserted in order to control ligatures. This is not part of UAX 15. There might be more characters to filter, perhaps dependent on the laguage or the script.
- Get rid of differences in the word representation which are insignificant for hyphenation. For western languages, this mainly means character case normalization (read: downcase). This step may also include getting rid of accents or other character decorations (exposé -> expose). However, there's the usual trade-off between the complexity of normalizing minor spelling variations and simply using more patterns.
Character Classes
Classical TeX implementations conveniently ignore steps 2 and 3 and fold steps 1 and 4 into a single step. So called "character classes" both guide the word parser and describe the normalization in step 4.
Parsing and Unicode Normalization
http://www.unicode.org/reports/tr29/#Word_Boundaries
icu4j
The Exception List
As noted above, the algorithm can't cope with spelling changes due to hyphenation.
How To Generate Patterns
Keywords: patgen description, patgen parameters, word weights, hyphenation position weights
Related Problems
Keywords: ligatures, character shaping (e.g. arabic), thai word extraction
Further work
Known Problems Automated Hyphenation
Keywords: word compositions, agglutinating languages, unmarked foreign words, ambiguities, context
Other Work and Projects
Keywords: Petr Sojka, SiSiSi, deco-cow, track down more people and projects