How can I convert an international (e.g. Russian) String to \u
numbers (Unicode numbers), such as
\u041e\u041a
for OK
?
There is a JDK tool that can be run from the command line as follows:
native2ascii -encoding utf8 src.txt output.txt
Example:
src.txt
بسم الله الرحمن الرحيم
output.txt
\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062d\u0645\u0646 \u0627\u0644\u0631\u062d\u064a\u0645
If you want to use it in your Java application, you can wrap this command line like this:
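String pathSrc = "./tmp/src.txt";
String pathOut = "./tmp/output.txt";
// Pass the arguments separately so that paths containing spaces are handled correctly.
Process process = new ProcessBuilder("native2ascii", "-encoding", "utf8",
        new File(pathSrc).getAbsolutePath(), new File(pathOut).getAbsolutePath()).start();
process.waitFor(); // wait until the tool has finished writing output.txt
System.out.println("THE END");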
Then read the content of the new file.
You could use escapeJava
from org.apache.commons.lang.StringEscapeUtils
(escapeJavaStyleString is the private helper it delegates to).
I also had this problem. I had some Portuguese text with special characters, but these characters were already in Unicode escape format (e.g. \u00e3
).
So I wanted to convert S\u00e3o
to São
.
I did it using Apache Commons StringEscapeUtils, as @sorin-sbarnea said; it is part of the Apache Commons Lang library.
Use the method unescapeJava
, like this:
// Double backslash: the string holds the literal escape sequence, e.g. as read from a file.
String text = "S\\u00e3o";
text = StringEscapeUtils.unescapeJava(text);
System.out.println("text " + text);
(There is also the method escapeJava
, which does the reverse: it replaces the non-ASCII characters in the string with \u escapes.)
If anyone knows a pure Java solution, please tell us.
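One possible pure Java approach is a small regular-expression pass over the text. This is only a sketch: the class and method names (UnicodeUnescape, unescape) are made up for illustration, and it assumes the input contains only four-digit escapes and no escaped backslashes, which unescapeJava handles more carefully.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // A literal backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each four-digit escape sequence with the character it denotes.
    public static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or a backslash in the decoded text
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The double backslash keeps the escape literal inside the Java string.
        System.out.println(unescape("S\\u00e3o"));
    }
}
```

Characters outside the Basic Multilingual Plane arrive as two consecutive escapes (a surrogate pair), and decoding each one separately still yields the right string.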
Here's an improved version of ArtB's answer:
StringBuilder b = new StringBuilder();
for (char c : input.toCharArray()) {
    if (c >= 128) {
        // Non-ASCII: emit a zero-padded, four-digit escape
        b.append("\\u").append(String.format("%04X", (int) c));
    } else {
        b.append(c);
    }
}
return b.toString();
This version escapes all non-ASCII chars and works correctly for low Unicode code points like Ä
.
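The point about low code points comes down to zero-padding: Integer.toHexString does not pad its result, while a %04X format does. A small sketch of the difference (the class and method names are hypothetical):

```java
public class HexPaddingDemo {
    // No padding: yields fewer than four hex digits for values below 0x1000.
    public static String unpadded(char c) {
        return "\\u" + Integer.toHexString(c);
    }

    // Always four hex digits, which is what a Java-style escape requires.
    public static String padded(char c) {
        return "\\u" + String.format("%04X", (int) c);
    }

    public static void main(String[] args) {
        char c = '\u00c4'; // LATIN CAPITAL LETTER A WITH DIAERESIS
        System.out.println(unpadded(c));
        System.out.println(padded(c));
    }
}
```

For Ä the first call gives \uc4, which is not a valid escape, and the second gives \u00C4.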
There are three parts to the answer:
- Get the Unicode for each character
- Determine if it is in the Cyrillic Page
- Convert to Hexadecimal.
To get each character you can iterate through the String using the charAt()
or toCharArray()
methods.
for( char c : s.toCharArray() )
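The value of the char is its UTF-16 code unit, which for characters in the Basic Multilingual Plane (including all the Cyrillic blocks below) is the Unicode code point.
The Cyrillic Unicode characters are those in the following ranges: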
Cyrillic: U+0400–U+04FF ( 1024 - 1279)
Cyrillic Supplement: U+0500–U+052F ( 1280 - 1327)
Cyrillic Extended-A: U+2DE0–U+2DFF (11744 - 11775)
Cyrillic Extended-B: U+A640–U+A69F (42560 - 42655)
If it is in one of these ranges, it is Cyrillic; a simple if check is enough. If it is in range, format the value as four hexadecimal digits (Integer.toHexString() does not zero-pad, so String.format("%04x", ...) is the safer choice) and prepend "\\u". Put together it should look something like this:
final int[][] ranges = new int[][]{
    { 1024, 1279 },   // Cyrillic
    { 1280, 1327 },   // Cyrillic Supplement
    { 11744, 11775 }, // Cyrillic Extended-A
    { 42560, 42655 }, // Cyrillic Extended-B
};
StringBuilder b = new StringBuilder();
for( char c : s.toCharArray() ){
    int[] insideRange = null;
    for( int[] range : ranges ){
        if( range[0] <= c && c <= range[1] ){
            insideRange = range;
            break;
        }
    }
    if( insideRange != null ){
        // zero-pad to four hex digits so the escape is always well-formed
        b.append( "\\u" ).append( String.format("%04x", (int) c) );
    }else{
        b.append( c );
    }
}
return b.toString();
Edit: probably should make the check c < 128
and reverse the if
and the else
bodies; you probably should escape everything that isn't ASCII. I was probably too literal in my reading of your question.
There's a command-line tool called native2ascii that ships with the JDK (it was removed in JDK 9, so it is only available in older releases). It converts Unicode files to ASCII-escaped files. I've found that this is a necessary step for generating .properties files for localization.
In case you need this to write a .properties
file, you can just add the Strings to a Properties object and then save it to a file. It will take care of the conversion.
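A minimal sketch of that approach, writing to an in-memory stream instead of a file so the result can be inspected directly. The class and method names are hypothetical. It relies on Properties.store(OutputStream, String), which writes ISO-8859-1 and escapes other characters; the store(Writer, String) overload does not escape.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEscapeDemo {
    // Serialises a single key/value pair the way a .properties file would hold it.
    public static String toPropertiesText(String key, String value) {
        Properties props = new Properties();
        props.setProperty(key, value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // The OutputStream overload performs the escaping of non-Latin-1 text.
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The value is the Cyrillic word from the question.
        System.out.print(toPropertiesText("greeting", "\u041e\u041a"));
    }
}
```

The output starts with a timestamp comment line, followed by the entry greeting=\u041E\u041A.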
Apache Commons StringEscapeUtils.escapeEcmaScript(String)
returns a string with Unicode characters escaped using the \u
notation.