FOP's line breaking and page breaking algorithms both implement Knuth's breaking algorithm.

The detailed description of the model and the algorithm can be found on the paper "Breaking Paragraphs into Lines" by Donald E. Knuth, published in the book "Digital Typography" (Stanford, California: Center for the Study of Language and Information, 1999), (CSLI Lecture Notes, no. 78.), ISBN 1-57586-010-4 (which is surely a book worth reading).

This is just a short summary meant to give a quick overview of the content representation that is the input of the breaking algorithm, such allowing interested people to understand FOP's behaviour even without the need to know how exactly the breaks are actually computed.

The best way to know how the algorithm examines this representation and find the breaks is reading Knuth's paper cited above and look at its implementation inside FOP, which follows quite closely the description.

It is important to note that this representation is not bound to a particular algorithm: it's very easy to use it as a input for a completely different breaking algorithm, such as a first-fit or best-fit one.

Knuth's original algorithm solves the line breaking problem, so this page will speak in terms of "paragraphs", "lines" and "words", but most of the concepts apply to the page breaking problem too: it's enough to substitute "paragraphs" with "page sequences", "lines" with "pages", and "words" with "lines". Just as the line breaking algorithm breaks the content of paragraphs into lines (containing words), the page breaking algorithm breaks the content of page-sequences into pages (containing lines).

Formal Representation of the Problem

Paragraph Modeling

A paragraph can be modelled as a sequence of elements

x₁ x₂ x₃ ... x_n

where each x_i can be either a box element, a glue element or a penalty element.

All elements have a width w; if x_i is the i-th element in the sequence, its width will be w_i.

Box elements

A box element represents an unbreakable piece of content with fixed width: for example a character, a syllable (but only if letter spacing is constant), an inline image, ...

In the context of page breaking, a box element can represent a line.

Glue elements

A glue element represents some kind of space between boxes: for example, a space character between two words.

A glue element x_i, besides its optimal width w_i has a stretchability y_i and a shrinkability z_i; this means that the breaking algorithm can set the element's actual width to any value in the range [w_i - z_i, w_i + y_i], although it will try to choose a value as close as possible to w_i.

In the context of page breaking, a glue element can represent the space between two paragraphs.

Penalty elements

A penalty element represents information about a breaking point; it does not represent any piece of content.

The width w_i of a penalty element x_i is considered only if the breaking algorithm chooses that penalty element as a break. For example, in the context of line breaking the width of the hyphen character "-" must be considered only if a word is actually hyphenated.

A penalty element x_i has a penalty value p_i representing the "aesthetic cost" of the break: a positive cost suggests the breaking algorithm not to use that break point, while a negative cost signals an appealing position for a break. In particular, if p_i is +infinity it forbids a break, while if it is -infinity it forces a break (in other words, there cannot be a better place for a break).

Moreover, a penalty element has a flagged boolean value: this value marks penalties that should not be chosen to end consecutive lines. For example, in the context of line breaking the hyphenation points inside a word are represented using flagged penalties, so the algorithm tries to avoid the creation of consecutive lines ending with a hyphen.

Starting and ending elements

... coming soon ....

Element suppression

Having decided where the lines end, it's time to decide where they begin.

... coming soon ...

Breaking Rules

Definitions

A paragraph may be broken only at given places, and such that the resulting lines are not too loose or too tight. The following definitions help model those concerns:

Legal Breakpoint: this is the only place where a paragraph may be broken; x_i is a legal breakpoint if and only if:
- it is a penalty item with penalty < +infinity;
- it is a glue item and x_i-1 is a box item.
demerits of a line (or page): represents its aesthetic cost. It is computed by using a formula mixing the following parameters:
- adjustment ratio r: how much the line has to be shrinked (r < 0) or stretched (r > 0) to fit in the given length. r may not be < -1 or superior to a given threshold
- badness β: it is related to the adjustment ratio of this line: the more it has to be shrinked/stretched, the higher is the badness. β is infinite if r < -1 or r is undefined. r is undefined if the total shrinkability of stretchability for the line is negative or null.
- penalty π: value of the penalty if this line ends at a penalty item; otherwise 0.
- flagged demerits: additional demerits due to consecutive lines ending at flagged penalty items (e.g., hypenated words)
Feasible Breakpoint: a legal breakpoint such that the paragraph may be broken into a set of lines, one of those lines being ended by this breakpoint, each line having finite demerits. That is, no line of this set is too much shrinked or stretched.
Active Breakpoint: when the algorithm is running, a feasible breakpoint which might be a candidate for future breaks. That is, there might exist a feasible breakpoint later in the (not yet considered) Knuth sequence, ending the line which would start after this breakpoint. When we become sure that no such breakpoint exists, this breakpoint will be deactivated.

It follows from those definitions that the set of active breakpoints is a subset of the feasible breakpoints, which is a subset of the legal breakpoints, which is a subset of the Knuth elements.

Algorithm

The goal of the breaking algorithm is to find an ordered set of indices

b₀ < b₁ < b₂ < ... < b_k

such that

the first index, b₀, is conventionally 0 (note that the first element in the sequence has index = 1)
the other indices point to an element that is a feasible breakpoint

The algorithm must respect the forced breaks: this means that this set must contain the indices of all the penalties whose value is -infinity; in particular b_k = n.

Page tree

KnuthsModel