'a' char appended when reading unicode text from txt file in android_问答_开发者

'a' char appended when reading unicode text from txt file in android

开发者 https://www.devze.com 2023-03-05 18:38 出处：网络

Hello I am trying to read a UTF-8 encoded txt files with Hebrew chars on my android application, and now after managing doing for some reason the \'a\' char is always appended at the beginning of the

Hello I am trying to read a UTF-8 encoded txt files with Hebrew chars on my android application, and now after managing doing for some reason the 'a' char is always appended at the beginning of the String i read.. and I wonder why

Here is my code:

        void Read(){
        try {
            File fileDir = new File("/sdcard/test.txt");

            BufferedReader in = new Buffe开发者_StackOverflow中文版redReader( new InputStreamReader(
                          new FileInputStream(fileDir), "UTF8"));

            String str;

            while ((str = in.readLine()) != null) {
                    Log.i("TEST",str);
            }

                    in.close();
            } 
            catch (UnsupportedEncodingException e) 
            {
                System.out.println(e.getMessage());
            } 
            catch (IOException e) 
            {
                System.out.println(e.getMessage());
            }
            catch (Exception e)
            {
                System.out.println(e.getMessage());
            }
        }

this is the result i get

05-15 01:53:25.269: INFO/TEST(16236): אבגדהוזחטיכלמנסעפצקרשתa

In order to get a better answer, I need two questions answered:

What is the exact code point of the character in question (your "a")?
What is the exact byte sequence in your file, around the questionable area?

I'm going to take a guess here: You say the character is the first thing in the file ("appended at the beginning of the String") and that you got back it's in the Arabic Presentation Forms B block. The last character of Arabic Presentation Forms B, which oddly has nothing to do with Arabic, is U+FFEF, or the byte order mark (BOM). It usually appears at the beginning of UTF-16 or UTF-32 encoded files, and identifies the "endianess" of the encoding (whether the file is UTF-16LE or UTF-16BE encoded, likewise for UTF-32). It typically does not appear, however, in UTF-8 data, as UTF-8 has no notion of "byte order". That said, some brain-dead Windows programs will stick it there, and then have an additional option of "UTF-8 without BOM". (The BOM is used then to identify a file as likely being encoded in UTF-8.) My guess is you have a BOM in your data, and your program is reading it and passing it on to you.

IF this is your problem, and your file is genuinely encoded in UTF-8, you should be able to find the following byte sequence near the beginning of the file: EF BB BF — this is the UTF-8 representation of U+FFEF.