Kof Kof - 1 month ago 10
HTTP Question

Detect HTML encoding when NSURLResponse returns nil for textEncodingName

I'm loading a website's HTML with this call:

NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
[request setValue:@"utf-8" forHTTPHeaderField:@"Accept-Charset"];
[request setValue:@"text/html" forHTTPHeaderField:@"Accept"];
[NSURLConnection sendAsynchronousRequest:request
                                   queue:[NSOperationQueue currentQueue]
                       completionHandler:^(NSURLResponse *response, NSData *data, NSError *error) { ... }];


and then, to convert NSData into NSString, I need to know the encoding, so I call -

NSString *textEncoding = [response textEncodingName];


inside the completion block, but it returns nil for websites that don't specify a charset in the "Content-Type" header field.

If I don't know the encoding,
[[NSString alloc] initWithData:data encoding:responseEncoding]
won't give me readable HTML.
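
When textEncodingName is not nil, the name still has to be mapped to an NSStringEncoding before it can be passed to initWithData:encoding:. A minimal sketch of that step, assuming ARC; the helper name EncodingFromIANAName is hypothetical and not part of the original post:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): maps an IANA charset
// name such as @"utf-8" to an NSStringEncoding.
// Returns 0 when the name is nil, empty, or not recognized.
static NSStringEncoding EncodingFromIANAName(NSString *name)
{
    if (name.length == 0) {
        return 0;
    }

    // Core Foundation knows the IANA charset registry names.
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return 0;
    }

    // Translate the CFStringEncoding into the value NSString expects.
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
```

Inside the completion block this could be used as `NSStringEncoding responseEncoding = EncodingFromIANAName([response textEncodingName]);`, falling back to detection when the result is 0.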

How can I detect the right encoding for websites that don't send a charset in the "Content-Type" header field?

Kof Kof
Answer

It is possible to try different encodings and see which one results in readable text:

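// NSStringEncoding is an NSUInteger; several of these constants do not fit in an int
static const NSStringEncoding encodingPriority[] = {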
    NSUTF8StringEncoding,
    NSASCIIStringEncoding,
    NSISOLatin1StringEncoding,         /* decodes any byte sequence, so the single-byte code pages below are rarely reached */
    NSISOLatin2StringEncoding,
    NSUnicodeStringEncoding,
    NSWindowsCP1251StringEncoding,
    NSWindowsCP1252StringEncoding,
    NSWindowsCP1253StringEncoding,
    NSWindowsCP1254StringEncoding,
    NSWindowsCP1250StringEncoding,
    NSNEXTSTEPStringEncoding,
    NSJapaneseEUCStringEncoding,
    NSNonLossyASCIIStringEncoding,
    NSShiftJISStringEncoding,          /* kCFStringEncodingDOSJapanese */
    NSISO2022JPStringEncoding,        /* ISO 2022 Japanese encoding for e-mail */
    NSMacOSRomanStringEncoding,
    NSUTF16BigEndianStringEncoding,
    NSUTF16LittleEndianStringEncoding,
    NSUTF32StringEncoding,
    NSUTF32BigEndianStringEncoding,
    NSUTF32LittleEndianStringEncoding
};

#define REQUIRED_HTML_STRING    @"<html"

- (NSString *)htmlStringForUnknownEncodingData:(NSData *)data detectedEncoding:(NSStringEncoding *)detectedEncoding
{
    NSStringEncoding encoding;
    NSString *html;

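    // sizeof gives the size in bytes, so divide by the element size to get the count
    for (NSUInteger i = 0; i < sizeof(encodingPriority) / sizeof(encodingPriority[0]); i++) {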
        encoding = encodingPriority[i];

        // try this encoding
        html = [[NSString alloc] initWithData:data encoding:encoding];

        // a wrong encoding either fails to decode or yields unreadable text, so require a known marker
        if (html && [html rangeOfString:REQUIRED_HTML_STRING options:NSCaseInsensitiveSearch].location != NSNotFound) {
            if (detectedEncoding) {
                *detectedEncoding = encoding;
            }
            return html;
        }
    }
    return nil;
}

Then, to detect which encoding the HTML in your NSData uses, call:

NSStringEncoding encoding;
NSString *html = [self htmlStringForUnknownEncodingData:data detectedEncoding:&encoding];

if (html)
    NSLog(@"Encoding detected!");
else
    NSLog(@"No encoding detected");
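
As an alternative to trying encodings by hand, Foundation has a built-in detector, +[NSString stringEncodingForData:encodingOptions:convertedString:usedLossyConversion:], available on iOS 8 / OS X 10.10 and later. The sketch below wraps it; the function name HTMLStringByGuessingEncoding is hypothetical, and the detector is heuristic, so its guess can be wrong for short or ambiguous data.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical wrapper (not from the original answer) around Foundation's
// built-in encoding detector. Returns the decoded string, or nil when no
// encoding could be determined. detectedEncoding may be NULL.
static NSString *HTMLStringByGuessingEncoding(NSData *data,
                                              NSStringEncoding *detectedEncoding)
{
    NSString *html = nil;
    BOOL usedLossyConversion = NO;

    // Disallow lossy conversion so the decoded text is not silently altered.
    NSDictionary *options = @{ NSStringEncodingDetectionAllowLossyKey : @NO };

    NSStringEncoding encoding =
        [NSString stringEncodingForData:data
                        encodingOptions:options
                        convertedString:&html
                    usedLossyConversion:&usedLossyConversion];

    // The detector returns 0 when it cannot determine an encoding.
    if (encoding == 0) {
        return nil;
    }
    if (detectedEncoding) {
        *detectedEncoding = encoding;
    }
    return html;
}
```

A reasonable order is to trust the charset from the response when it is present, and only fall back to detection (this wrapper or the priority list above) when textEncodingName is nil.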