THIS IS GOING TO CHANGE in 1.1
This entry does not try to cover all aspects of localization. It merely describes how to set the encoding of a markup file and how the encoding of the output html is determined.
- Each markup file associated with a component may have its own encoding, which need not equal the HTTP response's encoding. Characters are transformed automatically where possible.
- The application setting setDefaultMarkupEncoding() allows you to set a default markup file encoding to be used instead of the value the JVM inherited from the operating system's process environment. If set to null, the OS default is used.
- The XML parser uses an InputStream and a Reader, initially applying the encoding determined as described above. If <?xml encoding="..." ?> is found, a new Reader with the "correct" encoding is attached for the remaining characters of the markup file. That is, beyond what (File)Reader already offers, we deliberately do not interpret the first two bytes of the file, which are allowed to contain information about the text file's encoding (a byte order mark).
- The encoding of a markup file's content is determined by a declaration such as
<?xml version="1.0" encoding="utf-8"?>. See http://www.w3.org/TR/2000/REC-xml-20001006#charencoding for more details. This applies to HTML as well, even though HTML is not 100% XML compliant.
- The XML declaration string <?xml ..?> of a page's markup is passed through to the HTTP response unchanged, whereas the XML declaration of a component's markup is not. This is inconvenient if you need to support IE in quirks mode (see http://www.wellstyled.com/html-doctype-and-browser-mode.html), which is why setStripXmlDeclarationFromOutput() is supported.
- If the page's markup does NOT contain an XML declaration with encoding information, the encoding of the HTTP response is determined by the session's locale. Please read the note below for more details.
- If the page's markup DOES contain an XML declaration with encoding information, the encoding in the HTTP header is modified accordingly (Page.configureResponse).
- Wicket does not automatically extend the HTTP Content-Type header with
charset=..., except as described in the previous note. See below for how you can do it yourself.
- Wicket has no built-in means to automatically create/detect/maintain HTML meta tags. See below for more details.
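The encoding-detection step described in the list above (reading the encoding pseudo-attribute from the XML prolog) can be sketched in plain Java. The class and method names here are hypothetical, not Wicket's actual parser API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlPrologEncoding {
    // Matches the encoding pseudo-attribute inside an <?xml ... ?> declaration.
    private static final Pattern ENCODING =
        Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /**
     * Returns the encoding declared in the XML prolog of the given markup,
     * or the supplied default if no declaration is present.
     */
    public static String detectEncoding(String markupStart, String defaultEncoding) {
        Matcher m = ENCODING.matcher(markupStart);
        return m.find() ? m.group(1) : defaultEncoding;
    }

    public static void main(String[] args) {
        System.out.println(detectEncoding(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><html>", "ISO-8859-1"));
    }
}
```

In a real parser the first bytes of the file would be read with a provisional single-byte encoding, the prolog inspected as above, and the stream re-wrapped in a Reader for the declared charset.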
Note on WebResponse.setLocale(): output stream encoding. Servlet 2.4 only: if the deployment descriptor contains a locale-encoding-mapping-list element, and that element provides a mapping for the given locale, that mapping is used. Otherwise, the mapping from locale to character encoding is container dependent; the default is ISO-8859-1. See the javadoc for javax.servlet.ServletResponse#setLocale(java.util.Locale) as well.
Because only Servlet 2.4 supports the web.xml locale-encoding-mapping-list deployment descriptor, a workaround exists for Servlet 2.3: you'll find CharSetUtil.java in the contrib's encoding package.
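CharSetUtil's actual implementation is not shown here, but the idea behind such a Servlet 2.3 workaround — mapping locales to charsets in application code instead of in web.xml — might look roughly like this (the class name and mapping table are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LocaleCharsetMapping {
    // Hypothetical mapping table; real code would typically load this from a
    // properties file, mirroring what a locale-encoding-mapping-list element
    // does in a Servlet 2.4 web.xml.
    private static final Map<String, String> MAPPING = new HashMap<String, String>();
    static {
        MAPPING.put("ja", "Shift_JIS");
        MAPPING.put("ru", "KOI8-R");
        MAPPING.put("zh", "GB2312");
    }

    /** Returns the charset for the locale, falling back to the servlet default. */
    public static String charsetFor(Locale locale) {
        String charset = MAPPING.get(locale.getLanguage());
        return charset != null ? charset : "ISO-8859-1";
    }
}
```

The result would then be used to set the charset on the response's Content-Type header explicitly, instead of relying on the container's setLocale() behaviour.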
Note: Wicket has no built-in means to automatically create/detect/maintain an HTML meta tag such as
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">.
Currently this is completely up to you. http://www.w3.org/TR/REC-html40/charset.html explains why you might want to do it.
Gaaaarrrrrr.... Yet another standards-related problem to keep me scratching my head for hours. I developed and now run the website Rural Escapes. This is great, I really enjoy doing it even though at the moment it's not earning me much, but it also means I get the job of debugging and fixing it when things go wrong. Today's little problem is foreign characters not displaying correctly. While developing the site I went to great lengths to make sure that I was able to correctly display all the characters that are used in Europe, as Rural Escapes is a Europe-wide service. To this end I made sure that all pages sent back stated UTF-8 as their character encoding. Everything seemed to be working. All the foreign characters I entered seemed to be displayed correctly, so I was happy. What I forgot to check, though, was what happens when non-ASCII characters are sent up as form data. Well, it turns out that Tomcat (and probably every other container and webserver) interprets them using whatever default encoding it is set to use. In the case of Tomcat this seems to be ISO-8859-1, which means that it mangles characters such as £. The reason for this monumental screw-up? You guessed it: IE.
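The mangling described above is easy to reproduce: a £ sign is two bytes in UTF-8, and a container that assumes ISO-8859-1 decodes each of those bytes as a separate character. A minimal demonstration in plain Java:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "\u00A3";              // the pound sign, £
        // What the browser actually sends: the UTF-8 bytes 0xC2 0xA3.
        byte[] onTheWire = original.getBytes(StandardCharsets.UTF_8);
        // What a container defaulting to ISO-8859-1 makes of those bytes:
        String mangled = new String(onTheWire, StandardCharsets.ISO_8859_1);
        System.out.println(mangled);             // prints the two characters "Â£"
    }
}
```

Every non-ASCII character suffers the same fate, which is why a page full of European names comes back subtly corrupted after a form round-trip.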
There is a header called Content-Type which is sent up with POST data which should have the format
Content-type: application/x-www-form-urlencoded; charset=UTF-8
however, back at the dawn of time Microsoft, when developing IE, left off the "; charset=UTF-8" parameter. At the time this wasn't so bad, because basically everywhere used ISO-8859-1, so you could be pretty sure everything would be interpreted correctly. Now, though, a multitude of different character sets are in use and you can't rely on the data being sent using one particular encoding, which leads to problems.
There is a partial solution to this problem, but it's not pretty. Mozilla and IE (at least) can now include an extra parameter in a POST request called "charset" (as described in this Mozilla bug report) which can be used to determine the character set of the posted data. There are some potential problems with parameter name clashes, but they are probably quite minimal. To include this extra parameter you simply add the attribute "accept-charset" to your form element, cross your fingers, and place a hidden form field called "charset" in the form. The browser will then fill in that field when it sends up the data. It is described in this W3C document. A word of warning, though: although this works, the Java Servlet API doesn't recognise this as a valid way of specifying the form data character set, and therefore a call to request.getCharacterEncoding() still returns null.
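Since the Servlet API ignores the hidden "charset" field, making use of it means decoding the form body yourself. A rough sketch of that approach, assuming the hidden field is literally named "charset" (the class name is hypothetical):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormCharsetDecoder {
    /**
     * Decodes an application/x-www-form-urlencoded body, using the value of a
     * hidden "charset" field (if present) as the character encoding for all
     * parameters. Falls back to ISO-8859-1, the servlet-spec default.
     */
    public static Map<String, String> decode(String body)
            throws UnsupportedEncodingException {
        // First pass: find the charset hint without decoding anything else.
        String charset = "ISO-8859-1";
        for (String pair : body.split("&")) {
            if (pair.startsWith("charset=")) {
                charset = pair.substring("charset=".length());
            }
        }
        // Second pass: decode every parameter with the discovered charset.
        Map<String, String> params = new LinkedHashMap<String, String>();
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue;
            params.put(URLDecoder.decode(pair.substring(0, eq), charset),
                       URLDecoder.decode(pair.substring(eq + 1), charset));
        }
        return params;
    }
}
```

In a servlet you would read the raw body from the request's input stream before the container parses any parameters; once getParameter() has been called, the ISO-8859-1 damage is already done.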
You might be wondering why Microsoft and Mozilla don't just correctly implement the specification. It's quite simple really: Microsoft have been doing it wrong for so long that numerous websites now rely on the incorrectly specified header and handle the correct header very badly. The Mozilla team tried to include the correct header, ran into serious compatibility problems, and so removed it again.
So are there any nice solutions? There is a de facto standard solution. Both IE and Mozilla send back form data encoded using whatever encoding the page was supplied in, even though they don't actually set the header correctly. This, ironically, is the root cause of my problems. The data has been coming up as UTF-8, but because the header isn't set the servlet spec has been decoding it as ISO-8859-1 and screwing it up. If you want to read more about this problem I suggest you have a look here for a great review. That page describes using UTF-8 (well, Unicode) with Linux and other POSIX systems and details a number of problems.
If you are using Java Servlets to process forms, the simplest, and probably most effective, way to ensure you get all the correct characters is to make sure you set the page character set everywhere you can (headers and meta data) and then rely on the browser using that character set when submitting the form. This will work as long as the user doesn't change the character encoding before submitting the form, but as most users don't know what "character encoding" means, they will leave that setting well alone. The only modification you have to make to your code is to ensure that you call request.setCharacterEncoding() BEFORE you read ANY parameters from the request. This technique is discussed here.
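If you are stuck with a parameter that has already been decoded with the wrong charset (i.e. setCharacterEncoding() was called too late, or not at all), there is a well-known byte-level recovery trick: because ISO-8859-1 maps bytes 1:1 to the first 256 code points, getBytes("ISO-8859-1") gives back exactly the bytes the browser sent, which can then be re-decoded with the real charset. A minimal sketch, outside any servlet API (the class name is illustrative):

```java
import java.io.UnsupportedEncodingException;

public class ParameterRecovery {
    /**
     * Undoes an erroneous ISO-8859-1 decoding. The round-trip through
     * ISO-8859-1 is lossless, so this recovers the raw request bytes and
     * decodes them with the charset the browser actually used.
     */
    public static String recover(String mangled, String realCharset)
            throws UnsupportedEncodingException {
        return new String(mangled.getBytes("ISO-8859-1"), realCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "Â£" is what a UTF-8 pound sign looks like after ISO-8859-1 decoding.
        System.out.println(recover("\u00C2\u00A3", "UTF-8")); // prints the pound sign
    }
}
```

Calling setCharacterEncoding() up front is still the cleaner fix; this trick is a last resort for containers and code paths you don't control.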