PDF/A Conformance Notes
This document discusses what needs to be done to make Apache FOP conformant to PDF/A (ISO 19005). PDF/A is an ISO standard that defines additional requirements and restrictions on PDF documents to make them useful for long-term preservation.
References:
- PDF/A-1:
- PDF/A-2:
- XMP Specification
Implementing Support for PDF/A-1
Conformance Levels
PDF/A-1 defines two conformance levels: A and B. These are discussed separately below. The first goal is to make FOP level B conformant. Level A is a superset of level B and involves preserving the structural and semantic properties of the source document ("Tagged PDF").
Level B Conformance
The primary purpose of level B conformance is to define a file format based on PDF that preserves the visual appearance of electronic documents over time, independent of the tools and systems used for creating, storing or rendering the files. This puts some constraints on the application generating the PDF files. Examples of such constraints are:
- No PostScript XObjects allowed
- No LZW compression allowed
- No encryption allowed
- All fonts need to be embedded, even the Base 14 fonts
- No file attachments allowed
- The use of XMP metadata is required
- etc. (for details, please see ISO 19005-1:2005(E))
Implementation in Apache FOP
Outputting PDF/A-1 should be an optional feature, as it restricts the feature set of Apache FOP. For example, certain applications may want EPS files embedded directly in PDF files, but as the list of constraints above shows, PDF/A-1 prohibits this.
The class PDFDocument should get a flag that turns on PDF/A-1 functionality. The PDF library itself should check conformance wherever possible and throw an exception if a breach of PDF/A-1 conformance is detected. The PDF library cannot detect everything, however; violations inside a page stream are one example. Therefore, the PDFRenderer (and probably PDFGraphics2D, too) needs to perform similar checks when PDF/A-1 conformance is activated.
Tasks identified so far for making FOP PDF/A-1b compliant are:
- Support XMP metadata
- Add checks in the PDF library, PDFRenderer and PDFGraphics2D (and helper classes) to check for violations of PDF/A-1.
- Review PDF generation concerning color handling based on requirements of PDF/A-1.
- Review PDF generation concerning font handling based on requirements of PDF/A-1.
- Verify that all currently supported fonts are fully embedded, even Base 14 fonts if PDF/A-1 is activated.
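The flag and the library-level checks described in this section could look roughly like the following. This is only a minimal sketch: the names `PdfAModeSketch`, `Document` and `ConformanceException` are hypothetical stand-ins and not FOP's actual classes, and only three of the level B constraints are covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a PDF/A-1 flag with library-level checks (not FOP's API). */
final class PdfAModeSketch {

    /** Thrown when an operation would break PDF/A-1 conformance. */
    static final class ConformanceException extends RuntimeException {
        ConformanceException(String message) {
            super(message);
        }
    }

    /** Stand-in for the document object that carries the PDF/A-1 flag. */
    static final class Document {
        private final boolean pdfA1Active;
        private final List<String> fonts = new ArrayList<>();

        Document(boolean pdfA1Active) {
            this.pdfA1Active = pdfA1Active;
        }

        /** Encryption is forbidden in PDF/A-1, so reject it when the flag is set. */
        void enableEncryption() {
            if (pdfA1Active) {
                throw new ConformanceException("PDF/A-1 does not allow encryption");
            }
            // Assumption: the real encryption setup would happen here.
        }

        /** LZW compression is forbidden; other filters pass through unchecked. */
        void addStreamFilter(String filterName) {
            if (pdfA1Active && "LZWDecode".equals(filterName)) {
                throw new ConformanceException("PDF/A-1 does not allow LZW compression");
            }
        }

        /** Every font must be embedded, including the Base 14 fonts. */
        void addFont(String name, boolean embedded) {
            if (pdfA1Active && !embedded) {
                throw new ConformanceException(
                        "PDF/A-1 requires font '" + name + "' to be embedded");
            }
            fonts.add(name);
        }

        int fontCount() {
            return fonts.size();
        }
    }
}
```

With the flag off, the same calls succeed, so existing users are unaffected. Checks on page stream content would still have to live in the renderer, as noted above.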
Level A Conformance
Level A adds requirements so that the textual content and its structure can be recovered from a PDF file. This means supporting "Tagged PDF". Tasks identified so far, in addition to the above, for making FOP PDF/A-1a compliant are:
- Support for ToUnicode maps
- Support for Tagged PDF
Implementing Support for PDF/A-2
PDF/A-2 is an updated version of PDF/A based on PDF 1.7 (ISO 32000-1). It relieves some limitations imposed by PDF/A-1 and allows constructs that appeared in newer versions of PDF.
The main element of interest in the context of FOP is the possibility to use transparency. Although transparency was already available in PDF 1.4, PDF/A-1 forbade it because the transparency model was not yet fully defined. PDF 1.7 defines the model properly, so PDF/A-2 allows transparency.
Because PDF is backwards compatible, any PDF/A-1 compliant file should normally also be PDF/A-2 compliant (at the same conformance level).
Conformance Levels
PDF/A-2 introduces a new conformance level, level U. This is essentially level B plus the presence of ToUnicode maps. Therefore, level A is a superset of level U, which in turn is a superset of level B.
Some confusion can occur when mixing PDF/A-1 and PDF/A-2:
- If a file is PDF/A-1 compliant, then it is also PDF/A-2 compliant (but the opposite is not necessarily true!).
- If a file is PDF/A-2a compliant, then it is also PDF/A-2u compliant.
- If a file is PDF/A-2u compliant, then it is also PDF/A-2b compliant.
- However, a file may be PDF/A-2u compliant but not PDF/A-1b compliant!
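The relationships listed above can be written down as a small predicate. The sketch below only encodes the rules as stated in these notes; the class and method names are hypothetical, and the PDF/A part is represented as a plain integer for brevity.

```java
/** Hypothetical sketch: which PDF/A variant implies which, per the rules in these notes. */
final class PdfAConformanceSketch {

    /** Conformance levels ordered from weakest to strongest. */
    enum Level { B, U, A }

    /** Level U only exists from PDF/A-2 on; only parts 1 and 2 are considered here. */
    static boolean isDefined(int part, Level level) {
        return (part == 1 && level != Level.U) || part == 2;
    }

    /**
     * Returns true if a file compliant with (fromPart, fromLevel) is also
     * compliant with (toPart, toLevel): the part may stay the same or go up,
     * and the level may stay the same or go down.
     */
    static boolean implies(int fromPart, Level fromLevel, int toPart, Level toLevel) {
        if (!isDefined(fromPart, fromLevel) || !isDefined(toPart, toLevel)) {
            throw new IllegalArgumentException("Undefined PDF/A variant");
        }
        return fromPart <= toPart && fromLevel.compareTo(toLevel) >= 0;
    }
}
```

For example, `implies(2, Level.U, 2, Level.B)` holds, while `implies(2, Level.U, 1, Level.B)` does not, which matches the last bullet above.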
Implementation
We can largely rely on the current implementation of PDF/A-1. We just need to add the constants for PDF/A-2, and relieve the constraint on transparency when targeting PDF/A-2.
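As an illustration of this change, the set of constants and the relaxed transparency check might look like the sketch below. The enum and its method names are hypothetical and are not meant to describe the existing FOP code.

```java
/** Hypothetical sketch: PDF/A variants as constants, with a per-part transparency rule. */
final class PdfAVariantSketch {

    enum Variant {
        DISABLED(0, ""),
        PDFA_1A(1, "A"),
        PDFA_1B(1, "B"),
        PDFA_2A(2, "A"),
        PDFA_2B(2, "B"),
        PDFA_2U(2, "U");

        private final int part;
        private final String level;

        Variant(int part, String level) {
            this.part = part;
            this.level = level;
        }

        int part() {
            return part;
        }

        String level() {
            return level;
        }

        boolean isEnabled() {
            return part > 0;
        }

        /** Only PDF/A-1 forbids transparency; PDF/A-2 and plain PDF allow it. */
        boolean allowsTransparency() {
            return part != 1;
        }
    }

    /** Check to call before transparency is written to the output. */
    static void checkTransparencyAllowed(Variant variant) {
        if (!variant.allowsTransparency()) {
            throw new IllegalStateException("Transparency is not allowed in " + variant);
        }
    }
}
```

Keeping the rule on the constant itself means the existing PDF/A-1 checks stay in place and only the transparency check needs to consult the part number.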
The user should be able to choose between conformance levels B and U. From FOP's point of view the two are equivalent, since ToUnicode maps are always generated, but it is better if the level the user selected is the one recorded in the XMP metadata.
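The selected part and level end up in the XMP packet through the PDF/A identification schema, whose properties are `pdfaid:part` and `pdfaid:conformance`. The following sketch builds just that fragment as a string. The class and method names are hypothetical, and a real implementation would go through an XMP library and embed the fragment in a complete packet.

```java
/** Hypothetical sketch: builds the PDF/A identification fragment of an XMP packet. */
final class PdfAIdSketch {

    /** Namespace URI of the PDF/A identification schema. */
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    /**
     * Returns an rdf:Description element carrying the PDF/A part (1 or 2)
     * and the conformance level ('A', 'B', or 'U' for part 2 only).
     */
    static String pdfaIdDescription(int part, char conformance) {
        boolean validPart = part == 1 || part == 2;
        boolean validLevel = conformance == 'A' || conformance == 'B'
                || (conformance == 'U' && part == 2);
        if (!validPart || !validLevel) {
            throw new IllegalArgumentException(
                    "Unsupported PDF/A variant: " + part + conformance);
        }
        return "<rdf:Description rdf:about=\"\"\n"
                + "    xmlns:pdfaid=\"" + PDFAID_NS + "\">\n"
                + "  <pdfaid:part>" + part + "</pdfaid:part>\n"
                + "  <pdfaid:conformance>" + conformance + "</pdfaid:conformance>\n"
                + "</rdf:Description>\n";
    }
}
```

A user who asks for level U would thus get `<pdfaid:conformance>U</pdfaid:conformance>`, even though the generated page content is the same as for level B.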
Problems
A major nuisance is that ISO 19005-1:2005(E) is not freely available. You have to buy a license from the International Organization for Standardization (ISO). The price for a single-user license is 114 CHF (around 87 USD). This may make it difficult to maintain PDF/A-1 compliance once it has been implemented, as not every committer and contributor may have access to a copy of the specification. Drafts of the standard are freely available on the net. Please note that there may be differences between these drafts and the currently valid ISO document (ISO 19005-1:2005(E), corrected version, 2005-12-01).
Publicly available copies (found by using public search engines):
- http://www.aiim.org/documents/standards/ISO_19005-1_(E).doc
- http://www.archivists.org.au/pubs/ISO_DIS_19005-1.pdf
- https://committees.standards.org.au/COMMITTEES/IT-021/N0001/ISO_19005-1-2005.pdf