Currently, the character encoding for source files needs to be configured individually for every plugin that processes source files. In this context, a source file is a plain text file that, unlike an XML file, lacks intrinsic means to specify its encoding. Java source files are the most prominent example of such text files; Velocity templates, BeanShell scripts and APT documents are further examples. This proposal does not apply to XML files, as their encoding can be determined from the file itself; see XML encoding for further information.
Life would become easier if there were a dedicated POM element, referenced as ${project.build.sourceEncoding}, which could be used to specify the encoding once for the entire project. Every plugin could use it as the default value:
    /**
     * @parameter expression="${encoding}" default-value="${project.build.sourceEncoding}"
     */
    private String encoding;
Adding this element to the POM structure can only happen in Maven 4+ (tracked as issue MNG-2216 and referenced in the POM Model version 5.0.0 proposal):
    <project>
      ...
      <build>
        <!-- NOTE: This is just a vision for the future, it's not yet implemented: see MNG-2216 -->
        <sourceEncoding>UTF-8</sourceEncoding>
        ...
      </build>
      ...
    </project>
For Maven 2 & 3, the value can be defined as an equivalent property:
    <project>
      ...
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        ...
      </properties>
      ...
    </project>
Thus plugins could immediately be modified to use the ${project.build.sourceEncoding} expression, whatever Maven version is used.
Motivation
Why bother with file encoding at all? Well, a file encoding (aka charset) is required to solve the following discrepancy: A file stored on disk or transmitted via network is merely a stream of bytes/octets. In contrast, text is a stream of characters. However, a character is not a byte.
To further illustrate this, just consider the Unicode standard chosen for a Java String. Unicode defines more than 65,000 characters, which obviously cannot be mapped to a single byte each. Hence, one needs a reversible transformation that defines how to map a character to bytes and vice versa. This transformation is called a file/character encoding.
Now, there are different encodings, each potentially yielding different bytes for the same character. For example, the common encoding ASCII maps the character 'A' to the byte with the hex code 0x41. The same character is mapped to the byte 0xC1 when using the encoding EBCDIC. Another example is the character 'ü' (small letter u with umlaut), which maps to the single byte 0xFC when using ISO-8859-1 but to the two-byte sequence 0xC3 0xBC when using UTF-8.
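The byte values quoted above can be checked with a few lines of Java. This is only an illustrative sketch: the class name EncodingBytes and the helper hex are made up for this example, and EBCDIC is left out because only the charsets in StandardCharsets are guaranteed to be present on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any Maven plugin.
public class EncodingBytes {

    /** Encodes the text with the given charset and formats the bytes as hex, e.g. "C3 BC". */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so that bytes above 0x7F are not printed as negative ints.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'ü' is written as a Unicode escape so that this source file itself
        // compiles identically regardless of the encoding used to read it.
        String umlaut = "\u00FC";
        System.out.println(hex("A", StandardCharsets.US_ASCII));      // 41
        System.out.println(hex(umlaut, StandardCharsets.ISO_8859_1)); // FC
        System.out.println(hex(umlaut, StandardCharsets.UTF_8));      // C3 BC
    }
}
```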
It should be clear by now that encoding a character with one encoding and later on decoding it with a different encoding can corrupt the character. To avoid such errors, it is crucial that all developers of a project have agreed to use the same encoding when editing the project sources and running the build.
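The corruption described above is easy to reproduce. The following sketch (class and method names are hypothetical) writes 'ü' as UTF-8 and reads the bytes back as ISO-8859-1, which turns one character into two unrelated ones. A round trip with the same encoding on both sides leaves the text intact.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing an encoding mismatch.
public class EncodingMismatch {

    /** Encodes as UTF-8 but (wrongly) decodes as ISO-8859-1. */
    static String writeUtf8ReadLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same charset, so the text survives. */
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC"; // 'ü'
        String broken = writeUtf8ReadLatin1(umlaut);
        // The two UTF-8 bytes 0xC3 0xBC are decoded as two separate characters.
        System.out.println(broken.length());                       // 2
        System.out.println(broken.equals("\u00C3\u00BC"));         // true
        System.out.println(roundTripUtf8(umlaut).equals(umlaut));  // true
    }
}
```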
Default Value
Following a user poll on the mailing list and the numerous comments on this article, this proposal has been revised: plugins should use the platform default encoding if no explicit file encoding has been provided in the plugin configuration.
Since usage of the platform encoding yields platform-dependent and hence potentially irreproducible builds, plugins should output a warning to inform the user about this threat, e.g.:
[WARNING] Using platform encoding (Cp1252 actually) to copy filtered resources, i.e. build is platform dependent!
This way, users can smoothly update their POMs to follow best practices.
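The fallback behaviour described in this section could look roughly like the sketch below. It is not taken from any existing plugin: the class EncodingResolver, the method resolveEncoding and the Consumer-based warning callback are assumptions made to keep the example free of Maven dependencies. A real mojo would presumably route the message through its own logger instead.

```java
import java.nio.charset.Charset;
import java.util.function.Consumer;

// Hypothetical helper, independent of the Maven API.
public class EncodingResolver {

    /**
     * Returns the configured encoding, or the platform default (after
     * emitting a warning) if none was configured.
     */
    static String resolveEncoding(String configured, Consumer<String> warn) {
        if (configured != null && !configured.trim().isEmpty()) {
            return configured.trim();
        }
        String platform = Charset.defaultCharset().name();
        warn.accept("Using platform encoding (" + platform
                + " actually) to process sources, i.e. build is platform dependent!");
        return platform;
    }

    public static void main(String[] args) {
        // Explicit configuration: no warning is printed.
        System.out.println(resolveEncoding("UTF-8", msg -> System.err.println("[WARNING] " + msg)));
        // Missing configuration: a warning is printed and the platform default is used.
        System.out.println(resolveEncoding(null, msg -> System.err.println("[WARNING] " + msg)));
    }
}
```

Once a project sets project.build.sourceEncoding, the configured branch is taken and the warning disappears.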
Code Spots to Review for Proper Encoding Handling
The following classes and/or methods indicate usage of the JVM's default encoding and hence should be reviewed:
String(byte[])
String.getBytes()
FileReader
FileWriter
PrintWriter(File) (new in JDK 5)
PrintWriter(OutputStream)
InputStreamReader(InputStream)
OutputStreamWriter(OutputStream)
ReaderFactory.newPlatformReader()
WriterFactory.newPlatformWriter()
FileUtils.fileRead(String)
FileUtils.fileRead(File)
FileUtils.fileWrite(String, String)
FileUtils.fileAppend(String, String)
IOUtils.toString(InputStream)
IOUtils.toString(InputStream, int)
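For the JDK entries in the list above, the usual remedy is to switch to the overload that takes the encoding explicitly. The sketch below shows this for OutputStreamWriter and InputStreamReader. The class ExplicitEncodingIo and its two helper methods are invented for illustration, and in-memory streams stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: always passes the charset instead of relying on the JVM default.
public class ExplicitEncodingIo {

    /** Writes text through an OutputStreamWriter with an explicit charset. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads text through an InputStreamReader with an explicit charset. */
    static String read(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00FC"; // 'ü'
        byte[] bytes = write(text, StandardCharsets.UTF_8);
        System.out.println(bytes.length);                                     // 2
        System.out.println(read(bytes, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The same pattern applies to new String(byte[], Charset) and String.getBytes(Charset), which replace the first two entries of the list.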
Plugins to Modify
Build plugins are highlighted, since the impact of the change on the built artifact is more critical for them than for reporting plugins.
Affected Apache plugins:
- maven-changes-plugin (velocity template for announcement): MCHANGES-71, done in 2.1
- maven-checkstyle-plugin (source analysis): MCHECKSTYLE-95, done in 2.2
- maven-compiler-plugin (source processing): MCOMPILER-70, done in 2.1
- maven-invoker-plugin (beanshell script evaluation): MINVOKER-30, done in 1.2
- maven-javadoc-plugin (source processing): MJAVADOC-182, done in 2.5
- maven-jxr-plugin (source processing): JXR-60, done in 2.2
- maven-plugin-plugin (javadoc extraction, java source generation): MPLUGIN-101, MPLUGIN-100, done in 2.5
- maven-pmd-plugin (source analysis): MPMD-76, done in 2.4
- maven-resources-plugin (contents filtering): MRESOURCES-57, done in 2.3
- maven-site-plugin (apt sources): MSITE-314, done in 2.0-beta-7
Affected Codehaus plugins:
- findbugs-maven-plugin: (no Jira issue), done in 2.2
- jalopy-maven-plugin: MOJO-1138, done in 1.0-alpha-2-SNAPSHOT
- javancss-maven-plugin: MJNCSS-31
- modello-maven-plugin/modello-core (java source generation): MODELLO-109, done in 1.0-alpha-19
- native2ascii-maven-plugin
- plexus-component-metadata (formerly plexus-maven-plugin) (javadoc extraction): PLX-371, done in 1.0-beta-3.0.4
- shitty-maven-plugin (groovy script evaluation)
- simian-maven-plugin
- taglist-maven-plugin (javadoc extraction): MTAGLIST-27, done in 2.3
References
Please see [0] for the related thread from the mailing list, [1] for some further descriptions and [2] for a similar feature request in JIRA. Also note a related proposal for the output encoding of reports [3].
[0] http://www.nabble.com/POM-Element-for-Source-File-Encoding-to14930345s177.html
[1] http://www.nabble.com/Re%3A-Maven-and-File-Encoding-p16301958s177.html
[2] MNG-2216