NSXMLParser and BOM bytes

Developer https://www.devze.com 2022-12-16 15:54 Source: Internet
I'm getting my XML file as the result of a PHP query to a server. When I print the resulting data to the console, I see a well-structured XML file. When I try to parse it using NSXMLParser, it fails with an error in NSXMLParserErrorDomain, code 4 (empty document). I noticed that the XML files it can't parse have a BOM (byte order mark) sequence right after the closing '>' of the XML header. The question is how to get rid of the BOM sequence. I tried to create a string with those BOM bytes like this:

    const UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
    NSString *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
    NSString *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];

but it doesn't work for some reason. There are also XML files that have this sequence after the root element; in that case NSXMLParser parses the XML successfully. Safari ignores those characters, and so does the Xcode debugger. Please help!

Thanks,

Nava


I tried to create a string with those BOM bytes like that:

const   UInt8 bom[3] = {0xEF, 0xBB, 0xBF};
NSString    *bomString = [[NSString alloc] initWithData:[NSData dataWithBytes:(const void *)bom length:3] encoding:NSUTF8StringEncoding];
NSString    *noBOMString = [theResult stringByReplacingOccurrencesOfString:bomString withString:@" "];

but it doesn't work for some reason.

Make sure you gave the correct encoding when instantiating the string you search in (theResult, from which noBOMString is derived). If the document data was UTF-8, make sure you instantiated the string as UTF-8. Likewise, if the data was UTF-16, make sure you instantiated the string as UTF-16.

If you pass the wrong encoding, either the string won't instantiate at all (I'm assuming that isn't your problem) or some characters will be wrong. The BOM would be one of these: If the input is UTF-8 and you interpret it as MacRoman or ISOLatin1, it'll appear in the string as three separate characters. These three separate characters won't compare equal to the single character that is the BOM.
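
To make the point about mismatched encodings concrete, here is a minimal C sketch of what happens to the three BOM bytes under each interpretation. The helper names decode_latin1 and decode_utf8_3byte are made up for this illustration and are not part of any Apple API; the UTF-8 decoder is deliberately simplified and only handles a single three-byte sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a buffer as ISO Latin-1, where every byte
   becomes its own code point. Returns the number of code points written;
   out must have room for at least len entries. */
size_t decode_latin1(const uint8_t *bytes, size_t len, uint32_t *out)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = bytes[i];
    }
    return len;
}

/* Hypothetical helper: decode one three-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx). Returns 1 and stores the code point on
   success, 0 otherwise. Simplified: no overlong or surrogate checks. */
int decode_utf8_3byte(const uint8_t *bytes, size_t len, uint32_t *out)
{
    if (len < 3) return 0;
    if ((bytes[0] & 0xF0) != 0xE0) return 0;
    if ((bytes[1] & 0xC0) != 0x80 || (bytes[2] & 0xC0) != 0x80) return 0;
    *out = ((uint32_t)(bytes[0] & 0x0F) << 12)
         | ((uint32_t)(bytes[1] & 0x3F) << 6)
         |  (uint32_t)(bytes[2] & 0x3F);
    return 1;
}
```

Fed the bytes EF BB BF, the UTF-8 path yields the single code point U+FEFF, while the Latin-1 path yields three code points (U+00EF, U+00BB, U+00BF). A search for the one-character BOM string can never match the second result.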


I'm not certain that this is the issue. I've had a very similar experience where the file was encoded as UTF-8, but the XML header claimed it was UTF-16.

As a result of the mismatch I was unable to parse it and got the same error you had. Changing the encoding in the XML header from UTF-16 to UTF-8 fixed the issue for me.

You may be experiencing a similar issue.
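
One way to check for this kind of mismatch is to look at the first bytes of the data and compare what they suggest with the encoding named in the XML header. Below is a rough C sketch; the names bom_kind and sniff_bom are invented for this example. It only recognizes a BOM at the very start of the buffer, so a file without a BOM reports BOM_NONE and tells you nothing either way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE } bom_kind;

/* Inspect the start of a buffer for a byte order mark.
   Note: FF FE is also the prefix of the UTF-32 LE BOM (FF FE 00 00);
   this sketch does not try to tell the two apart. */
bom_kind sniff_bom(const uint8_t *bytes, size_t len)
{
    if (len >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        return BOM_UTF8;
    }
    if (len >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        return BOM_UTF16_BE;
    }
    if (len >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        return BOM_UTF16_LE;
    }
    return BOM_NONE;
}
```

If sniff_bom reports BOM_UTF8 but the header says encoding="UTF-16", you are likely looking at the situation described above.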


Well, maybe this is not the best approach to getting rid of BOM bytes, but it works. For those who, like me, spent hours trying to make NSXMLParser swallow BOMs: suppose you receive your data through NSURLConnection and store it in NSMutableData *webData.

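    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};

    // NSData bytes are not NUL-terminated, so track the length explicitly
    // instead of relying on strstr()/strlen().
    unsigned char *data = [webData mutableBytes];
    NSUInteger length = webData.length;
    NSUInteger i = 0;
    while (i + 3 <= length) {
        if (memcmp(data + i, bom, 3) == 0) {
            // Shift the tail left over the BOM; memmove() is safe for overlapping ranges.
            memmove(data + i, data + i + 3, length - i - 3);
            length -= 3;
        } else {
            i++;
        }
    }

    NSMutableData   *newData = [[NSMutableData alloc] initWithBytes:data length:length];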

Then you create your parser with newData and it JUST WORKS! I'll be glad to get any comments/improvements to this code
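
As one possible improvement, the byte stripping can be pulled out into a plain C helper that works on an explicit length and makes a single pass, which also makes it easy to test outside of Cocoa. This is only a sketch; strip_utf8_boms is a made-up name. It removes every EF BB BF sequence, so it would also drop a U+FEFF (zero-width no-break space) that was meant to be part of the content.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove every UTF-8 BOM (EF BB BF) from buf in place.
   Returns the new length; bytes past that length are left unspecified. */
size_t strip_utf8_boms(unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
    size_t r = 0; /* read position */
    size_t w = 0; /* write position, never ahead of r */

    while (r < len) {
        if (len - r >= 3 && memcmp(buf + r, bom, 3) == 0) {
            r += 3;              /* skip the BOM */
        } else {
            buf[w++] = buf[r++]; /* keep this byte, compacting as we go */
        }
    }
    return w;
}
```

With NSMutableData you could call it as strip_utf8_boms([webData mutableBytes], webData.length) and then shorten the data with setLength: using the returned value.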
