How can I get a webpage's charset encoding from the HTML as a string, not as a DOM?
I get the HTML string like this:
$html = file_get_contents($url);
I know about preg_match_all (string pattern, string subject, array matches, int flags)
but I don't know regex, and I need to find out the webpage charset (UTF-8/windows-255/etc.). Thanks,
preg_match('~charset=([-a-z0-9_]+)~i',$html,$charset);
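The one-liner above stops at a quote, so it would miss the HTML5 form <meta charset="utf-8">. As a minimal sketch, assuming you only want the first declaration in the string, the pattern can be loosened to allow optional whitespace and an optional quote. The function name charsetFromHtml is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): return the first charset
// declaration found in a raw HTML string, or null if there is none.
function charsetFromHtml(string $html): ?string
{
    // Allow optional whitespace and an optional quote after "charset=",
    // so both <meta charset="utf-8"> and
    // content="text/html; charset=utf-8" are matched.
    if (preg_match('~charset\s*=\s*["\']?([-a-z0-9_:.]+)~i', $html, $m)) {
        return strtolower($m[1]); // normalise case for later comparisons
    }
    return null; // no declaration found
}
```

Calling charsetFromHtml($html) on the string from file_get_contents() gives a lower-cased charset name or null. Note that this still matches "charset=" anywhere in the page, including body text, so treat the result as a hint.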
First, check the Content-Type header.
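//add error handling
$f = fopen($url, "r");
$md = stream_get_meta_data($f);
$wd = $md["wrapper_data"]; // response header lines for http:// URLs
foreach ($wd as $response) {
    // "~" is the delimiter, so the "/" in the media type needs no escaping
    if (preg_match('~^content-type: .+?/.+?;\\s?charset=([^;"\\s]+|"[^;"]+")~i',
            $response, $matches)) {
        $charset = trim($matches[1], '"'); // drop optional surrounding quotes
        break;
    }
}
$data = stream_get_contents($f);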
You can then fall back on the meta element. That's been answered before here.
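The order of precedence described here (header first, then the meta element) could be sketched roughly as follows. The function name resolveCharset, the default of UTF-8, and the choice to scan only the first 1024 bytes are assumptions of this sketch, not something from the answer.

```php
<?php
// Hypothetical helper: choose a charset, preferring the HTTP header
// over whatever the markup declares.
function resolveCharset(?string $headerCharset, string $html, string $default = 'UTF-8'): string
{
    // 1. A charset taken from the Content-Type header wins when present.
    if ($headerCharset !== null && $headerCharset !== '') {
        return $headerCharset;
    }
    // 2. Otherwise look for a charset inside a <meta> tag near the top
    //    of the document (the 1024-byte window is an assumption here).
    $head = substr($html, 0, 1024);
    if (preg_match('~<meta\b[^>]*\bcharset\s*=\s*["\']?([-a-z0-9_:.]+)~i', $head, $m)) {
        return $m[1];
    }
    // 3. Nothing declared anywhere: use the caller's default.
    return $default;
}
```

With the variables from the snippet above, this would be called as resolveCharset($charset ?? null, $data).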
More complex version of header parsing to please the audience:
if (preg_match('~^content-type: .+?/[^;]+?(.*)~i', $response, $matches)) {
    if (preg_match_all('~;\\s?(?P<key>[^()<>@,;:\"/[\\]?={}\\s]+)'.
            '=(?P<value>[^;"\\s]+|"[^;"]+")\\s*~i', $matches[1], $m)) {
        for ($i = 0; $i < count($m['key']); $i++) {
            if (strtolower($m['key'][$i]) == "charset") {
                $charset = trim($m['value'][$i], '"');
            }
        }
    }
}
You could use
mb_detect_encoding($html);
but it is generally a bad idea, because it only guesses the encoding from the bytes. It is better to use curl instead and look at the Content-Type header.
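A rough sketch of the curl route, assuming the curl extension is loaded. The names charsetFromContentType and fetchWithCharset are made up for illustration; the curl_* calls and the CURLINFO_CONTENT_TYPE constant are the standard ones.

```php
<?php
// Hypothetical helper: extract the charset parameter from a
// Content-Type value such as "text/html; charset=UTF-8".
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('~;\s*charset\s*=\s*"?([^";\s]+)~i', $contentType, $m)) {
        return $m[1];
    }
    return null; // header missing or without a charset parameter
}

// Hypothetical helper: fetch a page with cURL and return
// [body, charset-from-header-or-null].
function fetchWithCharset(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the final page
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    // Content-Type of the last response; not a string if the server sent none.
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$html, charsetFromContentType(is_string($contentType) ? $contentType : null)];
}
```

Usage would look like [$html, $charset] = fetchWithCharset($url); a null $charset means the header did not declare one, and you would then need to inspect the markup itself.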