iPhone SDK - stringWithContentsOfURL ASCII characters in HTML source

Developer https://www.devze.com 2022-12-21 08:06 Source: Internet
When I fetch the source of any web page, no matter the encoding I use, I always end up with &#-style entity references (such as &copy; or &reg;) instead of the actual characters themselves. This goes for foreign characters as well (such as åäö in Swedish), which I have to parse from "&Aring;" and such.

I'm using

+stringWithContentsOfURL:encoding:error:

to fetch the source and have tried several different encodings such as NSUTF8StringEncoding and NSASCIIStringEncoding, but nothing seems to affect the end result string.

Any ideas, tips, or solutions are greatly appreciated! I'd rather not have to implement the entire ASCII table and replace all occurrences of every character... Thanks in advance!

Regards


I'm using

+stringWithContentsOfURL:encoding:error:

to fetch the source and have tried several different encodings such as NSUTF8StringEncoding and NSASCIIStringEncoding, but nothing seems to affect the end result string.

You're misunderstanding the purpose of that encoding: argument. The method needs to convert bytes into characters somehow; the encoding tells it what sequences of bytes describe which characters. You need to make sure the encoding matches that of the resource data.

The entity references are an SGML/XML thing. SGML and XML are not encodings; they are markup language syntaxes. stringWithContentsOfURL:encoding:error: and its cousins do not attempt to parse sequences of characters (syntax) in any way, which is what they would have to do to convert one sequence of characters (an entity reference) into a different one (the entity, in practice meaning single character, that is referenced).
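To make the distinction concrete, here is a minimal sketch (assuming ARC and a plain Foundation command-line target, neither of which comes from the question). The byte values and variable names are only illustrative. It decodes the same two bytes with two encodings, then decodes the bytes of an entity reference, which no encoding will turn into a single character:

```objc
#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // 0xC3 0x85 is the UTF-8 byte sequence for the single character "Å".
        const unsigned char bytes[] = { 0xC3, 0x85 };
        NSData *data = [NSData dataWithBytes:bytes length:sizeof bytes];

        // Matching encoding: the two bytes become one character.
        NSString *asUTF8 = [[NSString alloc] initWithData:data
                                                 encoding:NSUTF8StringEncoding];

        // Wrong encoding: Latin-1 maps every byte to its own character.
        NSString *asLatin1 = [[NSString alloc] initWithData:data
                                                   encoding:NSISOLatin1StringEncoding];

        // An entity reference is ordinary ASCII text in the page source,
        // so decoding its bytes just gives back the same seven characters.
        NSData *entityBytes = [@"&Aring;" dataUsingEncoding:NSASCIIStringEncoding];
        NSString *entity = [[NSString alloc] initWithData:entityBytes
                                                 encoding:NSUTF8StringEncoding];

        // Logs: 1 2 &Aring;
        NSLog(@"%lu %lu %@", (unsigned long)[asUTF8 length],
              (unsigned long)[asLatin1 length], entity);
    }
    return 0;
}
```

The encoding: argument only decides the first two results. Turning the third string into "Å" is a parsing job, not a decoding job.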

You can convert the entity references to un-escaped characters using the CFXMLCreateStringByUnescapingEntities function. It takes a CFString, which an NSString is (toll-free bridging), and returns a CFString, which is an NSString.
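A rough sketch of how the call might look under ARC. UnescapeEntities and the three-entry table are hypothetical, made up for illustration. As far as I know, passing NULL for the dictionary limits the function to the five standard XML entities, and extra entities are supplied as name-to-replacement pairs with the names written without & and ;. This also assumes the SDK you build against declares the function in CoreFoundation/CFXMLParser.h, so check the headers for your target:

```objc
#import <Foundation/Foundation.h>
#import <CoreFoundation/CoreFoundation.h>

// Hypothetical helper, not part of any SDK.
static NSString *UnescapeEntities(NSString *escaped)
{
    if (!escaped)
        return nil;

    // Hypothetical, deliberately tiny table of extra entities.
    // Keys are entity names without the leading "&" and trailing ";".
    NSDictionary *extraEntities = @{ @"Aring": @"Å",
                                     @"copy":  @"©",
                                     @"reg":   @"®" };

    CFStringRef unescaped = CFXMLCreateStringByUnescapingEntities(
        kCFAllocatorDefault,
        (__bridge CFStringRef)escaped,
        (__bridge CFDictionaryRef)extraEntities);

    // "Create" functions return an owned reference; hand ownership to ARC.
    return CFBridgingRelease(unescaped);
}
```

You would call it on the fetched source, for example UnescapeEntities(source). With manual reference counting, drop the __bridge casts and autorelease the returned string instead of using CFBridgingRelease.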


Are you sure they are not in &Aring; form originally? Try viewing the source in a browser first.


That really, really sucks. I wanted to convert it directly and the above solution isn't really a good one, so I just wrote my own ASCII-table converter (static) class. It works the way it should have worked natively (though I have to fill in the ASCII table myself...)

Ideas for optimization? ("ASCII" is a static NSDictionary)

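@implementation InternetHelper

// Downloads the page at the given URL string and decodes it as UTF-8.
// Returns nil if the URL is malformed or the page cannot be loaded or decoded.
+(NSString *)HTMLSourceFromUrlWithString:(NSString *)str convertASCII:(BOOL)state
{
    NSURL *url = [NSURL URLWithString:str];
    if (!url)
        return nil;

    NSError *error = nil;
    NSString *source = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&error];
    if (!source)
    {
        NSLog(@"Could not load %@: %@", str, error);
        return nil;
    }

    if (state)
        source = [InternetHelper ConvertASCIICharactersInString:source];

    return source;
}

// Replaces every key of the ASCII table (an entity reference) found in str
// with the character stored for it.
+(NSString *)ConvertASCIICharactersInString:(NSString *)str
{
    // stringWithString: raises an exception when passed nil, so bail out early.
    if (!str)
        return nil;

    NSString *ret = [NSString stringWithString:str];

    // Load the lookup table from the app bundle the first time it is needed.
    if (!ASCII)
    {
        NSString *path = [[NSBundle mainBundle] pathForResource:kASCIICharacterTableFilename ofType:kFileFormat];
        ASCII = [[NSDictionary alloc] initWithContentsOfFile:path];
    }

    // One full pass over the string per table entry.
    for (id key in ASCII)
    {
        ret = [ret stringByReplacingOccurrencesOfString:key withString:[ASCII objectForKey:key]];
    }

    return ret;
}

@end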
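One possible direction for the optimization question, offered as a sketch rather than a drop-in replacement. The loop above rescans the whole string once per table entry. Because dictionary enumeration order is undefined, it can also decode twice: "&amp;lt;" may end up as "<" if the "&amp;" entry happens to run before "&lt;". Walking the string once with NSScanner avoids both. ReplaceEntitiesInOnePass is a hypothetical helper. It assumes the table keys are complete references such as "&Aring;", which is what the loop above implies but the post does not state:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces known "&name;" references while walking
// the input from left to right. Text already emitted is never rescanned.
static NSString *ReplaceEntitiesInOnePass(NSString *input, NSDictionary *table)
{
    if (!input)
        return nil;

    NSMutableString *result = [NSMutableString stringWithCapacity:[input length]];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];   // do not silently drop whitespace

    while (![scanner isAtEnd])
    {
        // Copy everything up to the next ampersand unchanged.
        NSString *text = nil;
        if ([scanner scanUpToString:@"&" intoString:&text])
            [result appendString:text];
        if ([scanner isAtEnd])
            break;

        // The scanner now sits on "&". Try to read up to and including ";".
        NSUInteger start = [scanner scanLocation];
        NSString *candidate = nil;
        NSString *replacement = nil;
        if ([scanner scanUpToString:@";" intoString:&candidate] &&
            [scanner scanString:@";" intoString:NULL])
        {
            NSString *key = [candidate stringByAppendingString:@";"];
            replacement = [table objectForKey:key];
        }

        if (replacement)
        {
            [result appendString:replacement];
        }
        else
        {
            // Not a known reference: keep the "&" and resume right after it.
            [result appendString:@"&"];
            [scanner setScanLocation:start + 1];
        }
    }

    return result;
}
```

Inside ConvertASCIICharactersInString: this could replace the for loop with return ReplaceEntitiesInOnePass(str, ASCII);. Input with many stray ampersands and few semicolons still causes some rescanning, so capping the candidate length would be a sensible refinement.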