Servlet receiving data both in ISO-8859-1 and UTF-8. How to URL-decode?

Developer (https://www.devze.com), 2023-01-01 14:42. Source: web
I have a web application (well, in fact it is just a servlet) that receives data from three different sources:

  • Source A is an HTML document written in UTF-8, and sends the data via <form method="get">.
  • Source B is written in ISO-8859-1, and sends the data via <form method="get">, too.
  • Source C is written in ISO-8859-1, and sends the data via <a href="http://my-servlet-url?param=value&param2=value2&etc">.

The servlet receives the request params and URL-decodes them using UTF-8. As you can expect, A works without problems, while B and C fail (you can't URL-decode in UTF-8 something that's encoded in ISO-8859-1...).
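To make the failure concrete, here is a minimal sketch using only java.net.URLDecoder. The class name DecodeMismatchDemo and the sample value "café" are made up for illustration; the two encoded strings are what a UTF-8 page and an ISO-8859-1 page would typically send for that value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of the servlet in the question.
final class DecodeMismatchDemo {

    // Thin wrapper so callers do not have to handle the checked exception.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String fromUtf8Page = "caf%C3%A9";   // "café" as sent by source A
        String fromLatin1Page = "caf%E9";    // "café" as sent by sources B and C

        // Matching charset: both decode cleanly.
        System.out.println(decode(fromUtf8Page, "UTF-8"));
        System.out.println(decode(fromLatin1Page, "ISO-8859-1"));

        // Mismatch: the lone byte 0xE9 is not valid UTF-8, so the decoder
        // substitutes the replacement character U+FFFD instead of 'é'.
        String broken = decode(fromLatin1Page, "UTF-8");
        System.out.println(broken.contains("\uFFFD"));
    }
}
```

Note that the mismatch does not throw; it silently produces a damaged string, which is why the problem usually shows up later as garbled characters rather than as an error.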

I can make slight modifications to B and C, but I am not allowed to change them from ISO-8859-1 to UTF-8, which would solve all the problems.

In B, I've been able to solve the problem by adding accept-charset="UTF-8" to the <form>. That way it sends the data in UTF-8 even though the page itself is ISO-8859-1.

What can I do to fix C?

Alternatively, is there any way to determine the charset on the servlet, so I can call URL-decode with the right encoding in each case?


Edit: I've just found this, which seems to solve my problem. I still have to run some tests to determine whether it impacts performance, but I think I'll stick with that solution.


The browser will by default send the data in the same encoding as the requested page was returned in. This is controllable by the HTTP Content-Type header which you can also set using the HTML <meta> tag.

The accept-charset attribute of the HTML <form> element should be avoided since it's broken in MSIE. Almost all non-UTF-8 encodings are ignored and will be sent in platform default encoding (which is usually CP-1252 in case of Windows).

To fix A and B (when sent via POST) you basically need to call HttpServletRequest#setCharacterEncoding() before gathering request parameters; it only affects the request body. Keep in mind that this is a one-time task: you cannot get a parameter, then change the encoding, and then "re-get" the parameters.

To fix C (GET) you basically need to set the request URI encoding in the server configuration. Since it's unclear which server you're using, here's a Tomcat-targeted example: in the HTTP connector set the following attribute:

<Connector (...) URIEncoding="ISO-8859-1" />

However, this is already the default encoding in many servers (for example Tomcat 7 and earlier; Tomcat 8 and later default to UTF-8). So you may not need to do anything for C.

As an alternative, you can grab the raw, still URL-encoded data from the request body (in case of POST) by HttpServletRequest#getInputStream() or from the query string (in case of GET) by HttpServletRequest#getQueryString(), guess the encoding yourself based on the characters available in the parameters, and then URL-decode accordingly using the guessed encoding. A hidden input element with a specific character whose encoded form differs between UTF-8 and ISO-8859-1 may help a lot in this.
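One way to do the guessing, sketched here with the JDK only: percent-decode the raw value to bytes first, try a strict UTF-8 decode, and fall back to ISO-8859-1 if the bytes are not valid UTF-8. The class and method names (QueryCharsetGuess, percentDecodeToBytes, decodeGuessing) are hypothetical, and the sketch assumes each value has already been split out of the string returned by getQueryString().

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Servlet API.
final class QueryCharsetGuess {

    // Turns "%XX" escapes and '+' into raw bytes without choosing a charset yet.
    // Assumes the input is ASCII, as a raw query string normally is.
    static byte[] percentDecodeToBytes(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+') {
                out.write(' '); // form encoding uses '+' for a space
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Strict UTF-8 first; any malformed sequence means "treat as ISO-8859-1".
    static String decodeGuessing(String encoded) {
        byte[] raw = percentDecodeToBytes(encoded);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO-8859-1, so this cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }
}
```

For example, decodeGuessing("caf%C3%A9") and decodeGuessing("caf%E9") both return "café". The heuristic is not airtight: an ISO-8859-1 value whose bytes happen to form valid UTF-8 would be misread, which is where the hidden sentinel input mentioned above gives a more reliable signal.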


I'm answering myself in order to mark the question as solved:

I found this question, which covers exactly the same problem I was facing. The javax.servlet.Filter was the solution for me.
