开发者

Isn't the size of character in Java 2 bytes?

开发者 https://www.devze.com 2023-02-12 20:21 出处:网络
I used RandomAccessFile to read a byte from a text file. public static void readFile(RandomAccessFile fr) {

I used RandomAccessFile to read a byte from a text file.

public static void readFile(RandomAccessFile fr) {
    byte[] cbuff = new byte[1];
    fr.read(cbuff,0,1);
    System.out.println(new String(cbuff));
}

Why am I seeing one fu开发者_Go百科ll character being read by this?


A char represents a character in Java (*). It is 2 bytes large (or 16 bits).

That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).

When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.

If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).

That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.

(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.

(**) Note that on Android the default character set is always UTF-8 and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way)


Java stores all it's "chars" internally as two bytes. However, when they become strings etc, the number of bytes will depend on your encoding.

Some characters (ASCII) are single byte, but many others are multi-byte.

Java supports Unicode, thus according to:

Java Character Docs

The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).


The constructor String(byte[] bytes) takes the bytes from the buffer and encodes them to characters.

It uses the platform default charset to encode bytes to characters. If you know, your file contains text, that is encoded in a different charset, you can use the String(byte[] bytes, String charsetName) to use the correct encoding (from bytes to characters).


In ASCII text file each character is just one byte


Looks like your file contains ASCII characters, which are encoded in just 1 byte. If text file was containing non-ASCII character, e.g. 2-byte UTF-8, then you get just the first byte, not whole character.


There are some great answers here but I wanted to point out the jvm is free to store a char value in any size space >= 2 bytes.

On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing

It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".

I also want to point out that java characters are often misused. People don't realize they are writing code that won't properly handle codepoints over 16 bits in length.


Java allocates 2 of 2 bytes for character as it follows UTF-16. It occupies minimum 2 bytes while storing a character, and maximum of 4 bytes. There is no 1 byte or 3 bytes of storage for character.


The Java char is 2 bytes. But the file encoding may be different.

So first you should know what encoding your file uses. For example, the file could be UTF-8 or ASCII encoded, then you will retrieve the right chars by reading one byte at a time.

If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little endian. For example, the little endian UTF-16 for A is [65, 0]. Then when you read the first byte, it returns 65. After padding with 0 for the second byte, you will get A.

0

精彩评论

暂无评论...
验证码 换一张
取 消