
Problem with UTF-8 String and binary data

Developer https://www.devze.com 2023-01-08 02:16 Source: the web
Background: Java regular expression for binary string

I can extract a substring with binary data I need, but when I use

   String s = matcher.group(1);

the data seems to be corrupted.

To be exact, only the chars that belong to the extended ASCII table (probably 128 to 255) are corrupted; all other chars are left untouched. What I basically mean is that I need to transform this string s into a byte array, but this:

String s2 = new String(s.getBytes(), "US-ASCII")

or this

String s2 = new String(s.getBytes(), "ISO-8859-1") 

and later,

 fileOutputStream.write(s2.getBytes())

replaces all chars from the extended ASCII table with "?", while others like \0 or 'A' are kept uncorrupted.

How can I interpret a String as plain binary symbols in the range [0-255]?

PS: I solved it. One should use

    String encoding = "ISO-8859-1";

to encode/decode byte arrays, and everything works perfectly.
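
A minimal sketch of why ISO-8859-1 works here, assuming the binary data starts out as a byte[]: this charset maps every byte value from 0 to 255 onto the char with the same numeric value, so decoding and re-encoding with it loses nothing. The class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Decode raw bytes as ISO-8859-1: byte value n becomes char U+00nn.
    static String bytesToString(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Encode with the same charset to get the original bytes back.
    static byte[] stringToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build a buffer holding every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = bytesToString(all);
        // prints "true": the round trip is lossless
        System.out.println(Arrays.equals(all, stringToBytes(s)));
    }
}
```

The same charset has to be used on both sides. Calling s.getBytes() without an argument falls back to the platform default charset, which is where the "?" replacements come from.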


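Quoting the question: "What I basically mean is that I need to transform this string s into a byte array."

Answering this directly (note that calling array() on the encoded buffer directly may return a backing array longer than the encoded data, so copy only the first limit() bytes):

    ByteBuffer bb = Charset.forName("utf-8").encode(CharBuffer.wrap(s));
    byte[] array = Arrays.copyOf(bb.array(), bb.limit());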

Edit:
String has a helper function added that does the same thing as above with a bit less code:

byte[] array = s.getBytes(Charset.forName("utf-8"));


Java only knows general Unicode Strings. Whenever you care about the underlying byte values of letters, you are dealing with bytes and should be using byte arrays. You can only convert Java Strings to byte arrays for a specific encoding (it may be an implicit default argument, but it's always there). You CANNOT use the String data type and expect your particular encoding to be preserved. You really must specify it each and every time you read data from outside Java or export it elsewhere (such as text field inputs or the file system).
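
To illustrate that an encoding is always involved, here is a small sketch (EncodingMatters and encode are hypothetical names) that turns the same one-character String into bytes with three different charsets. It also shows where the "?" in the question comes from: getBytes substitutes the charset's replacement byte for any char the charset cannot represent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMatters {
    // Thin wrapper so the charset is always passed explicitly.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // a single char with value 233 (0xE9)

        byte[] latin1 = encode(s, StandardCharsets.ISO_8859_1); // one byte: 0xE9
        byte[] utf8 = encode(s, StandardCharsets.UTF_8);        // two bytes: 0xC3 0xA9
        byte[] ascii = encode(s, StandardCharsets.US_ASCII);    // unmappable, becomes '?'

        System.out.println(latin1.length); // 1
        System.out.println(utf8.length);   // 2
        System.out.println((char) ascii[0]); // ?
    }
}
```

The StandardCharsets constants (Java 7 and later) also avoid the checked UnsupportedEncodingException that the String-named overloads declare.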

Using byte arrays means that you cannot use Java's built-in support for regular expressions directly. That's kind of a pain, but as you have seen, it wouldn't give correct results anyway, and that's not an accident - it CANNOT work correctly for what you want to do. You really must use something else to manipulate byte streams, because Strings are encoding-agnostic, and always will be.
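
As a rough sketch of what "something else" could look like in the simplest case, a fixed byte sequence rather than a full regular expression, a naive search over a byte array can be written by hand. ByteSearch and indexOf are hypothetical names, not part of the JDK, and this is only one possible approach.

```java
import java.util.Arrays;

public class ByteSearch {
    // Returns the index of the first occurrence of pattern in data, or -1 if absent.
    static int indexOf(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return 0;
        }
        for (int i = 0; i <= data.length - pattern.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, 0x41, (byte) 0x80, 0x42};
        byte[] pattern = {0x41, (byte) 0x80};

        int pos = indexOf(data, pattern);
        System.out.println(pos); // 2

        // Extract the matched bytes without ever going through a String.
        byte[] found = Arrays.copyOfRange(data, pos, pos + pattern.length);
        System.out.println(Arrays.equals(found, pattern)); // true
    }
}
```

Because no String is ever created, no charset is involved and byte values above 127 pass through unchanged.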


You can also do this with a little less code than what Gunslinger47 showed:

byte[] utf8Bytes = s.getBytes("UTF8");