Background: Java regular expression for binary string
I can extract a substring with binary data I need, but when I use
String s = matcher.group(1);
it seems that the data gets corrupted. To be exact, only the chars that belong to the extended ASCII table (probably 128 to 255) are corrupted; all other chars are kept untouched. What I basically mean is that I need to transform this string s into a byte array, but this:
String s2 = new String(s.getBytes(), "US-ASCII")
or this
String s2 = new String(s.getBytes(), "ISO-8859-1")
and later,
fileOutputStream.write(s2.getBytes())
replaces all chars from the extended ASCII table with "?", while others like \0 or 'A' are kept uncorrupted.
How can I interpret a String as plain [0-255] ASCII binary symbols?
PS: I solved it. One should use
String encoding = "ISO-8859-1";
to encode/decode byte arrays, and everything works perfectly.
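A minimal sketch of why this works, assuming the binary data is decoded and encoded with the same charset on both sides: ISO-8859-1 maps each byte value 0-255 to exactly one char and back, so the round trip is lossless. The class and method names here are only illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Decode raw bytes into a String in which each char holds one byte value (0-255).
    static String toBinaryString(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Encode such a String back into the original bytes.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build an array containing every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        byte[] back = toBytes(toBinaryString(all));
        System.out.println(Arrays.equals(all, back)); // prints "true"
    }
}
```

Note that a plain s.getBytes() with no argument uses the platform default charset, which is where the "?" replacements can come from.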
What I basically mean is that I need to transform this string s into a byte array
Answering this directly:
byte[] array = Charset.forName("utf-8").encode(CharBuffer.wrap(s)).array();
Edit:
String has a helper function added that does the same thing as above with a bit less code:
byte[] array = s.getBytes(Charset.forName("utf-8"));
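One caveat about the first variant, shown as a hedged sketch: ByteBuffer.array() returns the whole backing array, which may be larger than the encoded content, so it seems safer to copy only the bytes between the buffer's position and limit. The class and method names below are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class EncodeExact {
    // Encode a String to UTF-8 and return exactly the encoded bytes,
    // without any unused tail of the buffer's backing array.
    static byte[] encodeUtf8(String s) {
        ByteBuffer buf = StandardCharsets.UTF_8.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUtf8("binary data");
        System.out.println(bytes.length); // prints 11
    }
}
```

The s.getBytes(...) helper does not have this issue, which is another reason to prefer it.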
Java only knows general Unicode Strings. Whenever you care about the underlying byte values of letters, you are dealing with bytes and should be using byte arrays. You can only convert Java Strings to byte arrays for a specific encoding (it may be an implicit default argument, but it's always there). You CANNOT use the String data type and expect your particular encoding to be preserved; you really must specify it each and every time you read data from outside Java or export it elsewhere (such as text field inputs or the file system).
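To illustrate the point about the encoding always being there, here is a small sketch that encodes the same one-char String with three charsets. The class name and the sample character (U+00E9) are just choices for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    // A single char above 127, written as an escape so the
    // source file's own encoding does not matter.
    static final String S = "\u00e9";

    static byte[] latin1() { return S.getBytes(StandardCharsets.ISO_8859_1); }
    static byte[] utf8()   { return S.getBytes(StandardCharsets.UTF_8); }
    static byte[] ascii()  { return S.getBytes(StandardCharsets.US_ASCII); }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(latin1())); // [-23]       one byte, 0xE9
        System.out.println(Arrays.toString(utf8()));   // [-61, -87]  two bytes, 0xC3 0xA9
        System.out.println(Arrays.toString(ascii()));  // [63]        unmappable, replaced by '?'
    }
}
```

The same String yields three different byte sequences, and US-ASCII silently substitutes "?" for anything it cannot represent, which matches the corruption described in the question.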
Using byte arrays means that you cannot use Java's built-in support for regular expressions directly. That's kind of a pain, but as you have seen, it wouldn't give correct results anyway, and that's not an accident - it CANNOT work correctly for what you want to do. You really must use something else to manipulate byte streams, because Strings are encoding-agnostic, and always will be.
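As one possible "something else", a fixed byte pattern can be located directly in a byte array without going through String at all. This is only a naive sketch (the indexOf helper is hypothetical, not a JDK method), and it handles literal patterns, not full regular expressions.

```java
public class ByteSearch {
    // Returns the index of the first occurrence of pattern in data, or -1 if absent.
    // An empty pattern matches at index 0.
    static int indexOf(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return 0;
        }
        for (int i = 0; i <= data.length - pattern.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = { 0x00, (byte) 0xFF, 0x41, (byte) 0x80, 0x42 };
        byte[] pattern = { 0x41, (byte) 0x80 };
        System.out.println(indexOf(data, pattern)); // prints 2
    }
}
```

Once the start and end offsets are known, java.util.Arrays.copyOfRange can extract the slice as a byte array, so bytes above 127 never pass through a charset.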
You can also do this with a little less code than what Gunslinger47 showed us:
byte[] utf8Bytes = s.getBytes("UTF8");