FOP's line breaking and page breaking algorithms both implement Knuth's breaking algorithm.
The detailed description of the model and the algorithm can be found on the paper "Breaking Paragraphs into Lines" by Donald E. Knuth, published in the book "Digital Typography" (Stanford, California: Center for the Study of Language and Information, 1999), (CSLI Lecture Notes, no. 78.), ISBN 1-57586-010-4 (which is surely a book worth reading).
This is just a short summary meant to give a quick overview of the content representation that is the input of the breaking algorithm, such allowing interested people to understand FOP's behaviour even without the need to know how exactly the breaks are actually computed.
The best way to know how the algorithm examines this representation and find the breaks is reading Knuth's paper cited above and look at its implementation inside FOP, which follows quite closely the description.
It is important to note that this representation is not bound to a particular algorithm: it's very easy to use it as a input for a completely different breaking algorithm, such as a first-fit or best-fit one.
Knuth's original algorithm solves the line breaking problem, so this page will speak in terms of "paragraphs", "lines" and "words", but most of the concepts apply to the page breaking problem too: it's enough to substitute "paragraphs" with "page sequences", "lines" with "pages", and "words" with "lines". Just as the line breaking algorithm breaks the content of paragraphs into lines (containing words), the page breaking algorithm breaks the content of page-sequences into pages (containing lines).
Formal Representation of the Problem
A paragraph can be modelled as a sequence of elements
x1 x2 x3 ... xn
where each xi can be either a box element, a glue element or a penalty element.
All elements have a width w; if xi is the i-th element in the sequence, its width will be wi.
A box element represents an unbreakable piece of content with fixed width: for example a character, a syllable (but only if letter spacing is constant), an inline image, ...
In the context of page breaking, a box element can represent a line.
A glue element represents some kind of space between boxes: for example, a space character between two words.
A glue element xi, besides its optimal width wi has a stretchability yi and a shrinkability zi; this means that the breaking algorithm can set the element's actual width to any value in the range [wi - zi, wi + yi], although it will try to choose a value as close as possible to wi.
In the context of page breaking, a glue element can represent the space between two paragraphs.
A penalty element represents information about a breaking point; it does not represent any piece of content.
The width wi of a penalty element xi is considered only if the breaking algorithm chooses that penalty element as a break. For example, in the context of line breaking the width of the hyphen character "-" must be considered only if a word is actually hyphenated.
A penalty element xi has a penalty value pi representing the "aesthetic cost" of the break: a positive cost suggests the breaking algorithm not to use that break point, while a negative cost signals an appealing position for a break. In particular, if pi is +infinity it forbids a break, while if it is -infinity it forces a break (in other words, there cannot be a better place for a break).
Moreover, a penalty element has a flagged boolean value: this value marks penalties that should not be chosen to end consecutive lines. For example, in the context of line breaking the hyphenation points inside a word are represented using flagged penalties, so the algorithm tries to avoid the creation of consecutive lines ending with a hyphen.
Starting and ending elements
... coming soon ....
Having decided where the lines end, it's time to decide where they begin.
... coming soon ...
A paragraph may be broken only at given places, and such that the resulting lines are not too loose or too tight. The following definitions help model those concerns:
- Legal Breakpoint: this is the only place where a paragraph may be broken; xi is a legal breakpoint if and only if:
- it is a penalty item with penalty < +infinity;
- it is a glue item and xi-1 is a box item.
- demerits of a line (or page): represents its aesthetic cost. It is computed by using a formula mixing the following parameters:
- adjustment ratio r: how much the line has to be shrinked (r < 0) or stretched (r > 0) to fit in the given length. r may not be < -1 or superior to a given threshold
- badness β: it is related to the adjustment ratio of this line: the more it has to be shrinked/stretched, the higher is the badness. β is infinite if r < -1 or r is undefined. r is undefined if the total shrinkability of stretchability for the line is negative or null.
- penalty π: value of the penalty if this line ends at a penalty item; otherwise 0.
- flagged demerits: additional demerits due to consecutive lines ending at flagged penalty items (e.g., hypenated words)
- Feasible Breakpoint: a legal breakpoint such that the paragraph may be broken into a set of lines, one of those lines being ended by this breakpoint, each line having finite demerits. That is, no line of this set is too much shrinked or stretched.
- Active Breakpoint: when the algorithm is running, a feasible breakpoint which might be a candidate for future breaks. That is, there might exist a feasible breakpoint later in the (not yet considered) Knuth sequence, ending the line which would start after this breakpoint. When we become sure that no such breakpoint exists, this breakpoint will be deactivated.
It follows from those definitions that the set of active breakpoints is a subset of the feasible breakpoints, which is a subset of the legal breakpoints, which is a subset of the Knuth elements.
The goal of the breaking algorithm is to find an ordered set of indices
b0 < b1 < b2 < ... < bk
- the first index, b0, is conventionally 0 (note that the first element in the sequence has index = 1)
- the other indices point to an element that is a feasible breakpoint
The algorithm must respect the forced breaks: this means that this set must contain the indices of all the penalties whose value is -infinity; in particular bk = n.