Permalink to this page: https://cwiki.apache.org/confluence/x/liklBg
If a character encoding is not specified, the Servlet specification requires that an encoding of ISO-8859-1 is used. The character encoding for the body of an HTTP message (request or response) is specified in the Content-Type header field. An example of such a header is Content-Type: text/html; charset=ISO-8859-1
References: HTTP 1.1 Specification, Section 3.7.1
The above general rules apply to Servlets. The behaviour of JSP pages is further specified by the JSP specification. The request character encoding handling is the same, but response character encoding behaves a bit differently. See chapter "JSP.4.2 Response Character Encoding". For JSP pages in standard syntax the default response charset is the usual ISO-8859-1, but for the ones in XML syntax it is UTF-8.
Everything covered in this page comes down to practical interpretation of a number of specifications. When working with Java servlets, the Java Servlet Specification is the primary reference, but the servlet spec itself relies on older specifications such as HTTP for its foundation. Here are a couple of references before we cover exactly where these items are located in them. A more detailed list can be found on the Specifications page.
See 'Default Encoding for POST' below.
The character set for HTTP query strings (that's the technical term for 'GET parameters') can be found in sections 2 and 2.1 of the "URI Syntax" specification. The character set is defined to be US-ASCII. Any character that does not map to US-ASCII must be encoded in some way. Section 2.1 of the URI Syntax specification says that characters outside of US-ASCII must be encoded using % escape sequences: each octet is encoded as a literal % followed by the two hexadecimal digits of its value. Thus, the letter a (US-ASCII character code 97 = 0x61) is equivalent to %61. Although the URI specification does not mandate a default encoding for percent-encoded octets, it recommends UTF-8 especially for new URI schemes, and most modern user agents have settled on UTF-8 for percent-encoding URI characters.
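The percent-encoding rule described above can be sketched by hand in a few lines of Java. This is only an illustration of the mechanism, assuming UTF-8 octets and well-formed input; the class and method names are hypothetical and are not part of Tomcat or the Servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PercentEncodingSketch {

    /** Percent-encodes the UTF-8 octets of text, leaving unreserved ASCII characters as-is. */
    public static String percentEncode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            boolean unreserved = (octet >= 'a' && octet <= 'z')
                    || (octet >= 'A' && octet <= 'Z')
                    || (octet >= '0' && octet <= '9')
                    || octet == '-' || octet == '.' || octet == '_' || octet == '~';
            if (unreserved) {
                out.append((char) octet);
            } else {
                // A literal '%' followed by the two hexadecimal digits of the octet.
                out.append(String.format("%%%02X", octet));
            }
        }
        return out.toString();
    }

    /** Reverses percentEncode. Assumes every '%' is followed by two hex digits. */
    public static String percentDecode(String encoded) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                octets.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                octets.write(c); // assumes the unescaped characters are ASCII
            }
        }
        return new String(octets.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(percentDecode("%61"));    // the escaped form of the letter a
        System.out.println(percentEncode("\u00e9")); // two UTF-8 octets, two escapes
    }
}
```

A non-ASCII character such as U+00E9 occupies two octets in UTF-8, so it produces two escape sequences; the recipient has to know the charset to turn those octets back into a character.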
Some notes about the character encoding of URIs:
Older versions of the HTTP/1.1 specification (e.g. RFC 2616) indicated that ISO-8859-1 is the default charset for text-based HTTP request and response bodies if no charset is indicated. Although RFC 7231 removed this default, the servlet specification continues to follow the older rule. Thus the servlet specification indicates that if a POST request does not indicate an encoding, it must be processed as ISO-8859-1, except for application/x-www-form-urlencoded, which by default should be interpreted as US-ASCII (as it by definition should contain only characters within the ASCII range to begin with).
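As a rough illustration of the default rule above, the following sketch picks the charset for a request body from a Content-Type header value. The helper name effectiveCharset is hypothetical and the parsing is deliberately simplified; a real container does considerably more validation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class RequestCharsetDefaults {

    /**
     * Returns the charset named in a Content-Type header value, or the default
     * described in the text when no charset parameter is present.
     * Hypothetical helper for illustration; not part of the Servlet API.
     */
    public static Charset effectiveCharset(String contentType) {
        if (contentType == null) {
            return StandardCharsets.ISO_8859_1;
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                // Throws an unchecked exception for unknown or illegal charset names.
                return Charset.forName(name);
            }
        }
        // No explicit charset: apply the defaults described above.
        return mediaType.equals("application/x-www-form-urlencoded")
                ? StandardCharsets.US_ASCII
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset("text/plain; charset=UTF-8"));
        System.out.println(effectiveCharset("application/x-www-form-urlencoded"));
        System.out.println(effectiveCharset("text/plain"));
    }
}
```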
Some notes about the character encoding of a POST request:
The recipient of an HTTP message must respect the character encoding specified by the sender in the Content-Type header if the encoding is supported. A missing charset allows the recipient to "guess" what encoding is appropriate.
application/x-www-form-urlencoded
The HTML 4.01 specification indicated that percent-encoding of any non-alphanumeric characters of application/x-www-form-urlencoded (the default content type for HTML form submissions) should be performed using US-ASCII byte sequences. However, HTML 5 changed this to use UTF-8 byte sequences, matching the modern percent encoding for URLs. Modern browsers therefore percent-encode UTF-8 sequences when submitting forms using application/x-www-form-urlencoded.
The servlet specification, however, requires servlet containers to interpret percent-encoded sequences in application/x-www-form-urlencoded as ISO-8859-1, which in a default configuration will result in corrupted content because of the charset mismatch. See below for how this can be reconfigured in Tomcat.
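The corruption caused by this mismatch is easy to reproduce outside a container. The sketch below uses the standard java.net.URLEncoder and java.net.URLDecoder classes (the Charset overloads need Java 10 or later) to stand in for the browser and the container; the class and method names are made up for the example and do not reflect how Tomcat is implemented.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FormCharsetMismatchDemo {

    /** What a modern browser sends: percent-encoded UTF-8 octets. */
    public static String browserEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** What a container does with the servlet-specification default. */
    public static String decodeAsLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    /** What happens once the request character encoding is set to UTF-8. */
    public static String decodeAsUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = browserEncode("caf\u00e9");
        System.out.println(sent);                 // caf%C3%A9
        System.out.println(decodeAsLatin1(sent)); // each UTF-8 octet becomes its own character
        System.out.println(decodeAsUtf8(sent));   // the original text
    }
}
```

With the ISO-8859-1 default, the two octets 0xC3 0xA9 are read as two separate characters (U+00C3 and U+00A9), which is the typical garbled output seen when the request encoding has not been configured.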
Section 3.1 of the ARPA Internet Text Messages spec states that headers are always in US-ASCII encoding. Anything outside of that needs to be encoded. See the section above regarding query strings in URIs.
Tomcat will use ISO-8859-1 as the default character encoding of the entire URL, including the query string ("GET parameters") (though see Tomcat 8 notice below).
There are two ways to specify how GET parameters are interpreted:
1. Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
2. Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.

In Tomcat 8 starting with 8.0.0 (8.0.0-RC3, to be specific), the default value of the URIEncoding attribute on the <Connector> element depends on the "strict servlet compliance" setting. The default value (strict compliance is off) of URIEncoding is now UTF-8. If "strict servlet compliance" is enabled, the default value is ISO-8859-1.
References: Tomcat 7 HTTP Connector, Tomcat 7 AJP Connector, Tomcat 8.5 HTTP Connector, Tomcat 8.5 AJP Connector
POST requests should specify the encoding of the parameters and values they send. Since many clients fail to set an explicit encoding, the default used is US-ASCII for application/x-www-form-urlencoded and ISO-8859-1 for all other content types.
In addition, the servlet specification requires that percent-encoded sequences of application/x-www-form-urlencoded be interpreted as ISO-8859-1 by default which, as explained above, does not match the HTML 5 specification and modern user agent practice of using UTF-8 to percent-encode characters. Nevertheless, the servlet specification requires the servlet container's interpretation of percent-encoded sequences of application/x-www-form-urlencoded to follow the configured request character encoding, so correct interpretation of these byte sequences can be achieved by setting the request character encoding to UTF-8.
The container-agnostic approach for specifying the request character encoding for applications using Servlet 4.0 or later (which would correspond to Tomcat 9.0 and later) is to set the <request-character-encoding> element in the web application web.xml file:
<request-character-encoding>UTF-8</request-character-encoding>
Note: If you are using the Eclipse integrated development environment, as of Eclipse Enterprise Java Developers 2019-03 M1 (4.11.0 M1) the IDE does not recognize the <request-character-encoding> setting and will temporarily freeze the IDE and generate errors with any edit of web application files. You can track the latest status of this problem at Eclipse Bug 543377.
Otherwise one can employ a javax.servlet.Filter. Writing such a filter is trivial.
6.x, 7.x::
Tomcat already comes with such an example filter. Please take a look at
5.5.36+, 6.0.36+, 7.0.20+, 8.x and later::
Since Tomcat 7.0.20, 6.0.36 and 5.5.36 the filter has been a first-class citizen: it was moved from the examples into core Tomcat and is available to any web application without the need to compile and bundle it separately. However, if the filter is defined in the web application's own web.xml file, the web application cannot be deployed in non-Tomcat servlet containers that do not have this filter available. See the documentation for the list of filters provided by Tomcat. The class name is org.apache.catalina.filters.SetCharacterEncodingFilter.
It is also possible to define such a filter in the Tomcat installation configuration file conf/web.xml, which would set the request character encoding across all web applications without the need for any web.xml modifications. In fact the latest Tomcat versions come with sections in conf/web.xml that already configure a filter to set the request character encoding to UTF-8. Simply edit conf/web.xml and uncomment both the definition and the mapping of the filter named setCharacterEncodingFilter.
Note: The request encoding setting is effective only if it is applied before the parameters are parsed. Once parsing happens, there is no way back. Parameter parsing is triggered by the first method that asks for a parameter name or value. Make sure that the filter is positioned before any other filters that ask for request parameters. The positioning depends on the order of filter-mapping declarations in the WEB-INF/web.xml file, though since the Servlet 3.0 specification there are additional options to control the order. To check the actual order you can throw an Exception from your page and check its stack trace for filter names.
Tomcat 9.x and later: do not use a <filter> at all and instead specify <request-character-encoding> in your application's web.xml file.
Using UTF-8 as your character encoding for everything is a safe bet. This should work for pretty much every situation.
In order to completely switch to using UTF-8, you need to make the following changes:
1. Set URIEncoding="UTF-8" on your <Connector> in server.xml.
2. Set the request character encoding in the Tomcat conf/web.xml file or in the web app web.xml file; either by setting <request-character-encoding> (for applications using Servlet 4.0 / Tomcat 9.x+) or by using a character encoding filter.
3. In your JSPs, include the charset in the contentType: use <%@page contentType="text/html; charset=UTF-8" %> for the usual JSP pages and <jsp:directive.page contentType="text/html; charset=UTF-8" /> for the pages in XML syntax (aka JSP Documents).
4. In your servlets, set the response charset with response.setContentType("text/html; charset=UTF-8") or response.setCharacterEncoding("UTF-8").

The following sample JSP should work on a clean Tomcat install for any input. If you set URIEncoding="UTF-8" on the connector, it will also work with method="GET".
<%@ page contentType="text/html; charset=UTF-8" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Character encoding test page</title>
  </head>
  <body>
    <p>Data posted to this form was:
    <%
      request.setCharacterEncoding("UTF-8");
      out.print(request.getParameter("mydata"));
    %>
    </p>
    <form method="POST" action="index.jsp">
      <input type="text" name="mydata">
      <input type="submit" value="Submit" />
      <input type="reset" value="Reset" />
    </form>
  </body>
</html>
You have to encode them in some way before you insert them into a header. Using URL-encoding (% followed by the two hexadecimal digits of each byte) would be a good idea.
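One possible way to apply this suggestion is sketched below, on the assumption that the values in question are non-ASCII strings destined for a header. It uses java.net.URLEncoder, which implements form-style URL-encoding (a space becomes +), and checks that the result is pure US-ASCII. The class and method names are hypothetical, and the receiving side has to agree to decode the value the same way.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class HeaderValueEncoding {

    /** URL-encodes the UTF-8 octets of value so that only US-ASCII characters remain. */
    public static String toHeaderSafe(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Reverses toHeaderSafe on the receiving side. */
    public static String fromHeaderSafe(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    /** True if every character of text can be represented in US-ASCII. */
    public static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String safe = toHeaderSafe("Gr\u00fc\u00dfe");
        System.out.println(safe + " ascii=" + isAscii(safe));
        System.out.println(fromHeaderSafe(safe));
    }
}
```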
If a web application is configured to use the BASIC authentication scheme (e.g. configured with <auth-method>BASIC</auth-method> in its web.xml file), it means that an instance of BasicAuthenticator will be automatically created and inserted into the chain of Valves for this web application (this Context), unless any other Authenticator valve has already been explicitly configured.
To enable support for UTF-8 in a BasicAuthenticator, you can configure it explicitly, by inserting the following line into the Context configuration file of your web application (usually META-INF/context.xml):
<Valve className="org.apache.catalina.authenticator.BasicAuthenticator" charset="UTF-8" />
If you do so, the BasicAuthenticator will append "charset=UTF-8" to the value of the WWW-Authenticate header that it sends and will interpret the values sent by clients as UTF-8.
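To see why the charset matters for BASIC authentication, recall that the client sends the Base64 form of the octets of user:password. The sketch below is a stand-alone illustration using java.util.Base64; it is not Tomcat's BasicAuthenticator code, and the class and method names are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicCredentialsSketch {

    /** Builds an Authorization header value from credentials encoded in the given charset. */
    public static String authorizationValue(String user, String password, Charset charset) {
        byte[] raw = (user + ":" + password).getBytes(charset);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    /** Decodes the credentials, interpreting the octets in the given charset. */
    public static String decodeCredentials(String headerValue, Charset charset) {
        String base64 = headerValue.substring("Basic ".length());
        return new String(Base64.getDecoder().decode(base64), charset);
    }

    public static void main(String[] args) {
        String header = authorizationValue("j\u00f6rg", "secret", StandardCharsets.UTF_8);
        // Same octets, two interpretations: only the UTF-8 one recovers the user name.
        System.out.println(decodeCredentials(header, StandardCharsets.UTF_8));
        System.out.println(decodeCredentials(header, StandardCharsets.ISO_8859_1));
    }
}
```

If the client encodes the credentials as UTF-8 and the server decodes them with a different charset, a user name containing non-ASCII characters will not match the stored one, which is the situation the charset attribute is meant to avoid.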
See also:
In Tomcat 5 - there have been issues reported with respect to character encoding (usually of the form "request.setCharacterEncoding(String) doesn't work"). Odds are, it's not a bug. Before filing a bug report, see these bug reports as well as any bug reports linked to these bug reports: