How do I specify which version of UTF-8 I want (in Java)?

开发者 https://www.devze.com 2023-03-18 23:49 Source: Internet
Due to some awkward legacy code, I need to pass some non-English text around as ansi/ascii strings that are visibly UTF-8 encoded. For the most part, this is working alright (I'm using URLEncoder). However, now I need it to be able to output different versions of UTF-8 in different circumstances, and I don't know how to do that.

For example, this character (大) can be UTF-8 encoded these ways:

%u5927
&#22823;
%E5%A4%A7

But nothing seems to talk about the different versions, as though there is no difference. I know URLEncoder does not do the second version, because the & is a reserved character, but the second one is what I need in some instances. How can I convert the text to the specific version I want?

Specifically, it's being passed to a .jsp that contains a library called displaytag, which handles the data and displays a table without much developer input, but it doesn't seem to have any options for setting the encoding. I know the second encoding in the above list (passed as ansi/ascii) displays correctly without changing the .jsp, though, which is the safest option for me. I just need to get it that way.


The first is the Unicode code point in hex, in a URL-encoded style (the non-standard %u form). The second is the same code point in decimal, in the HTML/XML numeric entity form.

I have never used it for your purpose, but I think escapeHtml or escapeXml from StringEscapeUtils should give you the second form.

BTW the second form also has a hex version: &#x5927;

The third looks like a conversion by a function that is not UTF-8 aware and has converted separately each of the three bytes that make up the single code point in UTF-8. In my view the third is incorrect, because you cannot tell whether it stands for three separate ASCII-range bytes or is in fact UTF-8.
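To make the three forms concrete, here is a minimal Java sketch that produces each of them for the character from the question. The class and method names (EncodingForms, percentU, decimalEntity, percentUtf8Bytes, urlEncodeUtf8) are made up for illustration and do not come from the thread or from any library. It also shows that URLEncoder, when asked for UTF-8, yields the same text as percent-encoding the three UTF-8 bytes one by one.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; none of these names come from the thread.
class EncodingForms {

    // Non-standard "%uXXXX" form: each UTF-16 unit written as four hex digits.
    // Simplification: every unit is escaped, including plain ASCII.
    static String percentU(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            out.append(String.format("%%u%04X", (int) c));
        }
        return out.toString();
    }

    // Decimal numeric entity form: one "&#NNNN;" per code point.
    static String decimalEntity(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append("&#").append(cp).append(';'));
        return out.toString();
    }

    // Percent-encodes every UTF-8 byte, to show where "%E5%A4%A7" comes from.
    static String percentUtf8Bytes(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    // What URLEncoder itself produces when asked for UTF-8.
    static String urlEncodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String dai = "\u5927"; // the character from the question
        System.out.println(percentU(dai));         // %u5927
        System.out.println(decimalEntity(dai));    // &#22823;
        System.out.println(percentUtf8Bytes(dai)); // %E5%A4%A7
        System.out.println(urlEncodeUtf8(dai));    // %E5%A4%A7
    }
}
```

Only the second form is a numeric character reference that a browser resolves on its own inside HTML. The other two need a decoder on the receiving side that knows which convention was used.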


From what I can gather from the question, all you really want to ultimately do is display text.

You already understand that what is stored in memory or in files is byte sequences, pure and simple, and somehow you have the three-byte sequence e5 a4 a7, because that is the way the character OOKII HAJIME OOINI (大) is encoded in UTF-8.

To put this character in a URL using Java then yes, you use URLEncoder and you will get %E5%A4%A7. But if you want to display it on a JSP, then I would certainly recommend the HTML entity &#x5927;, because if you send the raw UTF-8 bytes instead, your byte stream is at the mercy of however end users have set their browser's character encoding.

How you do this depends on whether your data is stored as a byte array or a real Java string. Generally, to output HTML numeric entities, you can do this programmatically by turning each character with codepoint above 7F into characters of the form

& # x codepoint ;

or search the web for a library that does that for you. It is probably more work if you are processing a byte array, but it can be done. Commons Lang's StringEscapeUtils handles known named entities, but I do not believe it will create numeric HTML entities for characters with large codepoints.
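The programmatic route described above could look roughly like the sketch below. It assumes that only code points above 7F need converting and that a byte array, if that is what you have, holds UTF-8. The class name NumericEntities and the method name toNumericEntities are hypothetical, not part of Commons Lang or displaytag. It walks the string by code point rather than by char, so characters outside the Basic Multilingual Plane become a single entity.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
class NumericEntities {

    // Replaces every code point above 0x7F with a hex numeric entity.
    // ASCII is copied through unchanged, so '&', '<' and '>' are NOT escaped here.
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                out.append(String.format("&#x%X;", cp));
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance 1 or 2 chars per code point
        }
        return out.toString();
    }

    // Byte-array variant: decode as UTF-8 (an assumption), then convert.
    static String toNumericEntities(byte[] utf8Bytes) {
        return toNumericEntities(new String(utf8Bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE5, (byte) 0xA4, (byte) 0xA7}; // e5 a4 a7
        System.out.println(toNumericEntities("a\u5927b")); // a&#x5927;b
        System.out.println(toNumericEntities(raw));        // &#x5927;
    }
}
```

The result is pure ASCII, so it can travel through the legacy code as an ordinary ansi/ascii string. If the text may also contain markup characters, they would need separate escaping.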
