I would like to only work with UTF8. The problem is I don't know the charset of every webpage. How can I detect it and convert to UTF8?
<?php
$url = "http://vkontakte.ru";
$ch = curl_init($url);
$options = array(
CURLOPT_RETURNTRANSFER => true,
);
curl_setopt_array($ch, $options);
$data = curl_exec($ch);
// $data = magic($data);
print $data;
See this at: http://paulisageek.com/tmp/curl-utf8
What is magic()?
Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8:
/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
$data = curl_exec($ch);
if (!is_string($data)) return $data;
unset($charset);
$content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
/* 1: HTTP Content-Type: header */
preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
if ( isset( $matches[3] ) )
$charset = $matches[3];
/* 2: <meta> element in the page */
if (!isset($charset)) {
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
if ( isset( $matches[3] ) ) {
$charset = $matches[3];
/* In case we want to do further processing downstream: */
$data = preg_replace('@(<meta\s+http-equiv="Content-Type"\s+content="[\w/]+\s*;\s*charset=)([^\s"]+)@i', '$1utf-8', $data, 1);
}
}
/* 3: <xml> element in the page */
if (!isset($charset)) {
preg_match( '@<\?xml.+encoding="([^\s"]+)@si', $data, $matches );
if ( isset( $matches[1] ) ) {
$charset = $matches[1];
/* In case we want to do further processing downstream: */
$data = preg_replace('@(<\?xml.+encoding=")([^\s"]+)@si', '$1utf-8', $data, 1);
}
}
/* 4: PHP's heuristic detection */
if (!isset($charset)) {
$encoding = mb_detect_encoding($data);
if ($encoding)
$charset = $encoding;
}
/* 5: Default for HTML */
if (!isset($charset)) {
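/* strstr() never returns 0, so the original check could not match; stripos() returns the match offset */
if (stripos((string) $content_type, "text/html") === 0)
$charset = "ISO-8859-1";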
}
/* Convert it if it is anything but UTF-8 */
/* You can change "UTF-8" to "UTF-8//IGNORE" to
ignore conversion errors and still output something reasonable */
if (isset($charset) && strtoupper($charset) != "UTF-8")
$data = iconv($charset, 'UTF-8', $data);
return $data;
}
The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type
The converting is easy. The detecting is the hard part. You could try mb_detect_encoding(), but that is a very shaky method: it is literally guessing the encoding and, as @troelskn highlights in the comments, can at best detect rough differences (is it a multi-byte encoding?) while failing to tell similar character sets apart.
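To illustrate why sniffing is best kept as a last resort, here is a minimal sketch. It assumes the mbstring extension is loaded, and is_valid_utf8() and guess_charset() are hypothetical helper names, not part of PHP. Checking whether the bytes form valid UTF-8 is fairly reliable; choosing between single-byte charsets is not, because almost any byte sequence is valid in all of them.

```php
<?php
// Hypothetical helper: true when $data is a well-formed UTF-8 byte sequence.
function is_valid_utf8(string $data): bool {
    return mb_check_encoding($data, 'UTF-8');
}

// Hypothetical helper: guess among an explicit candidate list in strict mode.
// Strict mode rejects candidates for which the bytes are not valid, but it
// cannot tell apart single-byte charsets such as ISO-8859-1 and Windows-1251.
function guess_charset(string $data, array $candidates = ['UTF-8', 'ISO-8859-1']): ?string {
    $guess = mb_detect_encoding($data, $candidates, true);
    return $guess === false ? null : $guess;
}
```

For example, the lone byte 0xE9 ("é" in ISO-8859-1) is not valid UTF-8, so the guess falls through to the next candidate. The same byte is a Cyrillic letter in Windows-1251, and nothing in the data can tell you which one was meant.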
The proper way would be IMO:
- Interpreting any content-type Meta tags in the page
- Interpreting any content-type headers sent by the server
- If that yields nothing, try to "sniff" the encoding using mb_detect_encoding()
- If that yields nothing, fall back to a defined default (maybe ISO-8859-1, maybe UTF-8).
Contrary to the guidelines outlined in @Gumbo's answer, I personally think Meta tags should have priority over server headers, because I'm pretty sure that if a Meta tag is present, it is a more reliable indicator of the actual encoding of the page than a server setting that some site operators don't even know how to change. The correct way, however, seems to be to give content-type headers higher priority.
For the former, I think you can use get_meta_tags(). The latter you should be getting from curl already, you would just have to parse it. Here is a full example on how to systematically process response headers served by cURL.
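Parsing the header value that cURL hands back (via curl_getinfo($ch, CURLINFO_CONTENT_TYPE)) could look roughly like this. It is only a sketch: charset_from_content_type() is a hypothetical name, and the regex is my own assumption about what typical headers look like.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value such as "text/html; charset=windows-1251".
// Returns null when the header carries no charset parameter.
function charset_from_content_type(string $header): ?string {
    // Allow optional whitespace and an optional opening double quote.
    if (preg_match('/;\s*charset\s*=\s*"?([^\s";]+)/i', $header, $m)) {
        return $m[1];
    }
    return null;
}
```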
The conversion would then be using iconv:
$new_content = iconv("incoming-charset", "utf-8", $content);
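Since iconv() returns false when the conversion fails, it may be worth wrapping the call. A minimal sketch, assuming the iconv extension is available; to_utf8() is a hypothetical wrapper name:

```php
<?php
// Hypothetical wrapper: convert $content from $charset to UTF-8.
// Returns null instead of false when iconv() cannot convert the input.
function to_utf8(string $content, string $charset): ?string {
    // Nothing to do if the source is already declared as UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $content;
    }
    // Suppress the notice iconv() raises on failure; we check the result instead.
    $converted = @iconv($charset, 'UTF-8', $content);
    return $converted === false ? null : $converted;
}
```

The caller can then decide what to do with a null result, for example retry with "UTF-8//IGNORE" as the target or fall back to a default charset.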
I was extremely happy to find this answer, but noticed there's a flaw in the <meta> tag detection. It simply didn't seem to match any content-type tags, and it's not yet equipped for the new HTML5 style tags: <meta charset="UTF-8">. So I wrote this, hope it helps you guys, and thanks again for this excellent solution!
/* 2: <meta> element in the page */
if (!isset($charset)) {
preg_match('/<[\s]*meta[^>]*charset="?([^\s"]+)\s?"/i', $data, $matches);
if (isset($matches[1])) {
$charset = $matches[1];
}
}
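Note that the pattern above expects a closing double quote after the charset value. A slightly more permissive variant, written as a standalone function, might look like the sketch below. charset_from_meta() is a hypothetical name and the regex is an assumption of mine, not a full HTML parser; it also accepts unquoted and single-quoted values.

```php
<?php
// Hypothetical helper: find a charset declared in a <meta> tag, covering
// both <meta http-equiv="Content-Type" content="...; charset=X"> and the
// HTML5 form <meta charset="X"> (quoted or unquoted).
function charset_from_meta(string $html): ?string {
    $pattern = '~<meta\b[^>]+charset\s*=\s*["\']?([^\s"\'/>]+)~i';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return null;
}
```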
(P.S. I couldn't figure out how to post this as a comment, as it's obviously not a full answer.)
You can try and use something like:
http://www.php.net/manual/en/function.mb-detect-encoding.php
http://www.php.net/manual/en/function.mb-convert-encoding.php
Although this is not foolproof.
There is a defined order how to specify the character encoding in HTML:
[…] conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
- An HTTP "charset" parameter in a "Content-Type" field.
- A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
- The charset attribute set on an element that designates an external resource.
If no character encoding declaration is present, HTTP defines ISO 8859-1 as default character encoding. You can either use that as default character encoding for HTML too or simply refuse to process the response.
For XHTML you additionally have the XML declaration as source for the encoding:
In an XML document, the character encoding of the document is specified on the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include both the XML declaration an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />). In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence.
If there is no character encoding declaration, XML defines UTF-8 and UTF-16 as the default character encodings:
Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
So, to sum up, the order is:
- An HTTP "charset" parameter in a "Content-Type" field.
- XML declaration with encoding attribute.
- A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
If no character encoding declaration is present, you may assume ISO 8859-1 as default encoding for HTML and must assume UTF-8 or UTF-16 as default encoding for XHTML.
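The order summed up above can be sketched in PHP roughly as follows. Both function names are hypothetical, the XML declaration regex is a simplifying assumption (it ignores a leading byte order mark, for instance), and UTF-8 is chosen here as the XHTML default even though UTF-16 is equally allowed.

```php
<?php
// Hypothetical helper: read the encoding from an XML declaration such as
// <?xml version="1.0" encoding="EUC-JP"?> at the start of the document.
function xml_declared_encoding(string $body): ?string {
    $pattern = '~^\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([^"\']+)["\']~i';
    return preg_match($pattern, $body, $m) ? $m[1] : null;
}

// Hypothetical helper: apply the priority order
// HTTP header > XML declaration > META declaration > default.
function pick_charset(?string $http, ?string $xml, ?string $meta, bool $isXhtml): string {
    return $http ?? $xml ?? $meta ?? ($isXhtml ? 'UTF-8' : 'ISO-8859-1');
}
```

The three inputs to pick_charset() are whatever your header, XML and meta parsing produced, with null meaning "not declared".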