Unicode Byte sequence/convert a char to bytes array

Developer https://www.devze.com 2023-03-10 23:50 Source: Internet
I am trying to write a simple program for this interview question:

Write a function that checks for valid unicode byte sequence. A unicode sequence is encoded as:
- first byte indicates number of subsequent bytes; '11110000' means 4 subsequent data bytes
- data bytes start with a '10xxxxxx'

public static void main(String[] args)
{
    System.out.println(checkUnicode(new byte[] {(byte)'c'}));
}

    /**
     * Write a function that checks for valid unicode byte sequence. A unicode
     * sequence is encoded as: - first byte indicates number of subsequent bytes
     * '11110000' means 4 subsequent data bytes - data bytes start with a
     * '10xxxxxx'
     * 
     * @param unicodeChar
     * @return
     */
 public static boolean checkUnicode(byte[] unicodeChar)
{
    byte b = unicodeChar[0];
    int len = 0;

    int temp = (int)b<<1;
    while((int)temp<<1 == 0)
    {
        len++;
    }
    System.out.println(len);

    if (unicodeChar.length == len) 
    {
        for(int i = 1 ; i < len; i++)
        {
            // Check if Most significant 2 bits in the byte are '10'
            // c0, in base 16, is 11000000 in binary
            // 10000000, in base 2, is 128 in decimal
            if( ( (int)unicodeChar[i]&0Xc0 )==128 )
            {
                continue;
            }
            else
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}

The output I get is   
99
false  

Changed the conversion from char to byte array based on Chris Jester-Young's comment.

Can someone point me in the right direction?

Thanks

Made some modifications based on input from Ted Hopp.

P.S:

I got the question from a forum, and I think it wasn't posted correctly there. I still decided to solve it and use it as is rather than obscure it further, since I did not completely understand it either!


Here's an enterprise level solution for your enterprise level job:

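import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.BitSet;

public static void main(String[] args) {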
    if (args.length == 0 || args[0] == null || (args[0] = args[0].trim()).isEmpty()) {
        System.out.println("No argument passed or argument empty!");
        return;
    }

    String arg = args[0];
    System.out.println("arg: " + arg + ", arg len: " + arg.length());

    BitSet bs = new BitSet(arg.length());
    for (int i = 0; i < arg.length(); i++) {
        if (arg.charAt(i) == '1') {
            bs.set(i, true); 
        }
    }
    ByteBuffer bb = ByteBuffer.wrap(bs.toByteArray());
    Charset cs = Charset.forName("UTF-8");
    CharsetDecoder csd = cs.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    try {
        CharBuffer cb = csd.decode(bb);
        String uns = cb.toString();
        System.out.println("Got unicode string of len " + uns.length() + ": " + uns + " from " + arg + " -- no errors!");
    } catch (CharacterCodingException cce) {
        System.out.println("Invalid UTF-8 unicode string! " + cce.getMessage());
    }
}

Verification:

public static void test() {
    StringBuilder sb = new StringBuilder();
     byte[] byt = new String("stupid interview").getBytes();
     BitSet byt1 = fromByteArray(byt);
     for (int i = 0; i < byt1.size(); i++) {
         sb.append(byt1.get(i) ? "1" : "0");
     }
     String[] st = new String[1];
     st[0] = sb.toString();
     main(st);
}

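public static BitSet fromByteArray(byte[] bytes) {
    BitSet bits = new BitSet();
    for (int i = 0; i < bytes.length * 8; i++) {
        // Bit i of the set is bit (i % 8) of byte (i / 8), the same
        // little-endian layout that BitSet.toByteArray() produces in main().
        if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
            bits.set(i);
        }
    }
    return bits;
}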

Output:

11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110
arg: 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110, arg len: 128
{0, 1, 4, 5, 6, 10, 12, 13, 14, 16, 18, 20, 21, 22, 28, 29, 30, 32, 35, 37, 38, 42, 45, 46, 53, 56, 59, 61, 62, 65, 66, 67, 69, 70, 74, 76, 77, 78, 80, 82, 85, 86, 89, 92, 93, 94, 97, 98, 100, 101, 102, 104, 107, 109, 110, 112, 114, 117, 118, 120, 121, 122, 124, 125, 126}
Got unicode string of len 16: stupid interview from 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110 -- no errors!


First, the description of UTF-8 provided in the question is wrong. There is no such thing as "a valid Unicode byte sequence" without specifying the encoding; a safe assumption is that they meant UTF-8. Second (and more important), 11110000 does not indicate 4 more bytes of data. The four "1" bits before the first "0" bit indicate a total of 4 bytes (that is, the lead byte plus 3 subsequent bytes, not 4, each of which starts with "10"). The rules are described well in the Wikipedia article on UTF-8.

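Third, converting a character to a string and calling getBytes is a good approach, but you need to specify the encoding as an argument to getBytes. (However, for the character 'c' this isn't going to make a difference, because 'c' is ASCII and encodes to the same single byte in UTF-8 and in the usual ASCII-compatible default encodings.)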
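To illustrate the point about naming the encoding explicitly, here is a minimal sketch. The class name Utf8Bytes and the helper utf8BytesOf are made up for this example; only String.valueOf, String.getBytes(Charset) and StandardCharsets.UTF_8 come from the JDK.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {

    // Hypothetical helper: encodes a single char as UTF-8, independent of
    // whatever the platform default charset happens to be.
    static byte[] utf8BytesOf(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'c' is ASCII, so it encodes to the single byte 0x63.
        // The euro sign U+20AC needs three bytes in UTF-8: E2 82 AC.
        for (char c : new char[] {'c', '\u20AC'}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8BytesOf(c)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(hex.toString().trim());
        }
    }
}
```

With the no-argument getBytes(), the second result would depend on the default charset of the machine running the code, which is why passing the charset is the safer habit.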

I don't know what you are trying to do in your code, but you need to count how many '1' bits there are before the first '0' bit. Your code doesn't do anything like that.
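The counting described here could look roughly like the sketch below. The names Utf8SequenceCheck, leadingOnes and isSingleUtf8Sequence are invented for illustration. It assumes the input is meant to hold exactly one UTF-8 encoded character, and it checks only the bit structure: it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```java
final class Utf8SequenceCheck {

    /** Counts the '1' bits before the first '0' bit, e.g. 11110000 gives 4. */
    static int leadingOnes(byte b) {
        int count = 0;
        for (int mask = 0x80; mask != 0 && (b & mask) != 0; mask >>= 1) {
            count++;
        }
        return count;
    }

    /** Structural check of a single UTF-8 sequence (lead byte plus continuation bytes). */
    static boolean isSingleUtf8Sequence(byte[] seq) {
        if (seq == null || seq.length == 0) {
            return false;
        }
        int ones = leadingOnes(seq[0]);
        int total;
        if (ones == 0) {
            total = 1;               // 0xxxxxxx: a one-byte (ASCII) character
        } else if (ones >= 2 && ones <= 4) {
            total = ones;            // 110xxxxx, 1110xxxx, 11110xxx: total length
        } else {
            return false;            // 10xxxxxx cannot lead; 11111xxx is not UTF-8
        }
        if (seq.length != total) {
            return false;
        }
        for (int i = 1; i < total; i++) {
            // Continuation bytes must have '10' as their two most significant bits.
            if ((seq[i] & 0xC0) != 0x80) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSingleUtf8Sequence(new byte[] {(byte) 'c'}));
        System.out.println(isSingleUtf8Sequence(
                new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}));
    }
}
```

Note that the lead byte gives the total length of the sequence, so a byte starting with 1110 is followed by two continuation bytes, not three.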

UPDATE: I wouldn't actually bother trying to analyze the bit structure. I'd just feed the bytes to a CharsetDecoder and see if it chokes:

public static boolean checkUnicode(byte[] unicodeChar)
{
    try {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        // test only for malformed input, ignore unknown Unicode characters
        decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.decode(ByteBuffer.wrap(unicodeChar));
        return true;
    }
    catch (CharacterCodingException ex)
    {
        // decode() declares the checked CharacterCodingException;
        // MalformedInputException is the subclass thrown for invalid bytes
        return false;
    }
}
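One thing to keep in mind with the decoder approach is that it accepts any number of characters, not just one. If the interview question really means a single encoded character, a variant along the following lines might help. This is only a sketch: the class Utf8DecoderCheck and its methods are hypothetical names, and it relies on the JDK's UTF-8 decoder treating overlong forms and encoded surrogates as malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8DecoderCheck {

    /** True if the whole array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        return decodeOrNull(bytes) != null;
    }

    /** True if the array is valid UTF-8 and holds exactly one code point. */
    static boolean isSingleCodePoint(byte[] bytes) {
        String s = decodeOrNull(bytes);
        return s != null && s.codePointCount(0, s.length()) == 1;
    }

    // Returns the decoded text, or null if the decoder reports an error.
    private static String decodeOrNull(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        byte[] overlongNul = {(byte) 0xC0, (byte) 0x80};
        System.out.println(isSingleCodePoint(euro));
        System.out.println(isValidUtf8(overlongNul));
    }
}
```

The overlong sequence C0 80 has a well-formed bit pattern (110xxxxx followed by 10xxxxxx), so a pure bit-structure check would let it through, while the decoder rejects it.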


As for how to convert your characters to bytes, you can just cast directly:

byte[] b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xac};

Or, as a shorthand:

byte[] b = {(byte) 0xe2, (byte) 0x82, (byte) 0xac};


You can use Character.toCodePoint() to combine a surrogate pair into an int code point, and from there turning the int into bytes is a matter of applying the encoding's bit layout.
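For the int-to-bytes step, a hand-rolled UTF-8 encoder might look like the sketch below. CodePointEncoder and encodeUtf8 are hypothetical names chosen for this example; in real code, new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8) does the same job, and the main method here compares the two.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CodePointEncoder {

    /** Encodes one Unicode code point as UTF-8 (1 to 4 bytes). */
    static byte[] encodeUtf8(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp < 0x80) {
            return new byte[] {(byte) cp};                       // 0xxxxxxx
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                       // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))};                    // 10xxxxxx
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                      // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                          // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        // U+1F600 is stored in a Java String as the surrogate pair D83D DE00.
        int cp = Character.toCodePoint('\uD83D', '\uDE00');
        byte[] manual = encodeUtf8(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library));
    }
}
```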
