PHP: Convert curl_exec output to UTF8

I would like to only work with UTF8. The problem is I don't know the charset of every webpage. How can I detect it and convert to UTF8?

<?php
$url = "http://vkontakte.ru";
$ch = curl_init($url);
$options = array(
    CURLOPT_RETURNTRANSFER => true,
);
curl_setopt_array($ch, $options);
$data = curl_exec($ch);

// $data = magic($data);

print $data;

See this at: http://paulisageek.com/tmp/curl-utf8

What is magic()?

Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8

/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
    if ( isset( $matches[3] ) )
        $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
        if ( isset( $matches[3] ) ) {
            $charset = $matches[3];
            /* In case we want to do further processing downstream: */
            $data = preg_replace('@(<meta\s+http-equiv="Content-Type"\s+content="[\w/]+\s*;\s*charset=)([^\s"]+)@i', '$1utf-8', $data, 1);
        }
    }

    /* 3: XML declaration (<?xml ... encoding="..."?>) in the page */
    if (!isset($charset)) {
        preg_match( '@<\?xml.+encoding="([^\s"]+)@si', $data, $matches );
        if ( isset( $matches[1] ) ) {
            $charset = $matches[1];
            /* In case we want to do further processing downstream: */
            $data = preg_replace('@(<\?xml.+encoding=")([^\s"]+)@si', '$1utf-8', $data, 1);
        }
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding)
            $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        /* strstr() never returns 0, so use strpos() to test for a prefix */
        if (strpos((string) $content_type, "text/html") === 0)
            $charset = "ISO-8859-1";
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8"  to "UTF-8//IGNORE" to 
       ignore conversion errors and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}

The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type

Converting is easy. Detecting is the hard part. You could try mb_detect_encoding, but that is a very shaky method: it is literally guessing the encoding, and as @troelskn points out in the comments, it can tell apart only rough differences at best (is it a multi-byte encoding?) and fails to distinguish between similar character sets.

The proper way would be IMO:

Interpreting any content-type Meta tags in the page
Interpreting any content-type headers sent by the server
If that yields nothing, try to "sniff" the encoding using mb_detect_encoding()
If that yields nothing, fall back to a defined default (maybe ISO-8859-1, maybe UTF-8).
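For the header step, here is a minimal sketch of pulling the charset parameter out of a Content-Type value. The helper name charset_from_content_type is made up for illustration, and the pattern assumes the usual "type/subtype; charset=value" shape.

```php
<?php
/**
 * Hypothetical helper: extract the charset parameter from a Content-Type
 * header value such as "text/html; charset=windows-1251".
 * Returns null when no charset parameter is present.
 */
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType === null || $contentType === '') {
        return null;
    }
    // The value may be bare or quoted; stop at whitespace, a quote, or ';'.
    if (preg_match('/;\s*charset\s*=\s*["\']?([^\s"\';]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}
```

With cURL you would feed it the result of curl_getinfo($ch, CURLINFO_CONTENT_TYPE) after curl_exec() has run.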

Contrary to the guidelines outlined in @Gumbo's answer, I personally think Meta tags should have priority over server headers: if a Meta tag is present, it is probably a more reliable indicator of the page's actual encoding than a server setting that some site operators don't even know how to change. The correct way, however, seems to be to give content-type headers higher priority.

For the former, I think you can use get_meta_tags(). Note, though, that it takes a filename or URL rather than a string and only returns tags that have a name attribute, so an http-equiv declaration may need a regular expression instead. The latter you should be getting from curl already; you would just have to parse it. Here is a full example on how to systematically process response headers served by cURL.

The conversion would then be using iconv:

$new_content = iconv("incoming-charset", "utf-8", $content);
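Since iconv() returns false when it does not know the charset name or hits an illegal byte sequence, it may be worth wrapping the call. This is only a sketch: the name to_utf8 and the choice to return null on failure are my own assumptions, not anything iconv prescribes.

```php
<?php
/**
 * Hypothetical wrapper around iconv(): returns the converted string,
 * or null when the charset name is unknown or the input cannot be converted.
 */
function to_utf8(string $content, string $charset): ?string
{
    // Already labelled as UTF-8: nothing to convert.
    if (strcasecmp($charset, 'UTF-8') === 0 || strcasecmp($charset, 'UTF8') === 0) {
        return $content;
    }
    // iconv() emits a notice or warning and returns false on failure,
    // so suppress the message and check the return value instead.
    $converted = @iconv($charset, 'UTF-8', $content);
    return $converted === false ? null : $converted;
}
```

For example, to_utf8("\xE9", "ISO-8859-1") turns the single Latin-1 byte for "é" into its two-byte UTF-8 form.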

I was extremely happy to find this answer, but noticed there's a flaw in the <meta> tag detection. It simply didn't seem to match any content-type tags, and it's not yet equipped for the new HTML5 style tags: <meta charset="UTF-8">. So I wrote this, hope it helps you guys, and thanks again for this excellent solution!

/* 2: <meta> element in the page */
if (!isset($charset)) {
    preg_match('/<[\s]*meta[^>]*charset="?([^\s"]+)\s?"/i', $data, $matches);

    if (isset($matches[1])) {
        $charset = $matches[1];
    }
}

(P.S. I couldn't figure out how to post this as a comment, as it's obviously not a full answer.)
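If you also want to catch single-quoted and unquoted values (for example <meta charset=utf-8>), a slightly looser variant of the pattern could look like this. The function name charset_from_meta is hypothetical, and a regex like this is still only an approximation of real HTML parsing.

```php
<?php
/**
 * Hypothetical helper: find a charset declared in a <meta> tag, whether the
 * value is double-quoted, single-quoted, or unquoted. Covers both the
 * http-equiv form and the HTML5 <meta charset="..."> form.
 */
function charset_from_meta(string $html): ?string
{
    // Capture the charset value up to whitespace, a quote, '/', '>' or ';'.
    $pattern = '/<meta\b[^>]*?\bcharset\s*=\s*["\']?([^\s"\'\/>;]+)/i';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}
```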

You can try and use something like:

http://www.php.net/manual/en/function.mb-detect-encoding.php

http://www.php.net/manual/en/function.mb-convert-encoding.php

This is not foolproof, though.
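A rough sketch of that idea follows. Note that I have swapped mb_detect_encoding for the more predictable mb_check_encoding: the bytes are kept if they are already valid UTF-8, and otherwise a fallback encoding is simply assumed. The function name guess_to_utf8 and the ISO-8859-1 fallback are assumptions for illustration.

```php
<?php
/**
 * Hypothetical fallback for when no charset was declared anywhere:
 * keep the bytes if they are already valid UTF-8, otherwise assume a
 * single-byte fallback encoding (an assumption, not a real detection).
 */
function guess_to_utf8(string $data, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}
```

Any byte string is "valid" ISO-8859-1, so the fallback branch never fails, but it can still produce the wrong characters when the page was really in some other single-byte encoding.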

HTML defines an order of precedence for determining a document's character encoding:

[…] conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

An HTTP "charset" parameter in a "Content-Type" field.

A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".

The charset attribute set on an element that designates an external resource.

If no character encoding declaration is present, HTTP defines ISO 8859-1 as default character encoding. You can either use that as default character encoding for HTML too or simply refuse to process the response.

For XHTML you additionally have the XML declaration as source for the encoding:

In an XML document, the character encoding of the document is specified on the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include both the XML declaration an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />). In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence.

If there is no character encoding declaration, XML defines UTF-8 and UTF-16 as the default character encodings:

Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

So, to sum up, the order is:

An HTTP "charset" parameter in a "Content-Type" field.
XML declaration with encoding attribute.
A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".

If no character encoding declaration is present, you may assume ISO 8859-1 as default encoding for HTML and must assume UTF-8 or UTF-16 as default encoding for XHTML.
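The summed-up order can be sketched as a simple fallback chain. Both function names are hypothetical, the HTTP and META values are assumed to have been extracted elsewhere, and picking UTF-8 rather than UTF-16 as the XHTML default is my own simplification.

```php
<?php
/** Hypothetical helper: read the encoding from a leading XML declaration, if any. */
function xml_declaration_encoding(string $data): ?string
{
    // Only look inside the declaration at the start of the document.
    $pattern = '/^\s*<\?xml\b[^>]*?\bencoding\s*=\s*["\']([^"\']+)["\']/i';
    return preg_match($pattern, $data, $m) ? $m[1] : null;
}

/**
 * Hypothetical helper: apply the precedence HTTP header, then XML
 * declaration, then META declaration, then a default.
 */
function pick_charset(
    ?string $httpCharset,
    ?string $xmlEncoding,
    ?string $metaCharset,
    bool $isXhtml = false
): string {
    return $httpCharset
        ?? $xmlEncoding
        ?? $metaCharset
        ?? ($isXhtml ? 'UTF-8' : 'ISO-8859-1'); // defaults as summed up above
}
```

The first non-null source wins, so a charset sent in the HTTP header overrides whatever the document itself declares.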