Proper character encoding to display "”"?

开发者 (https://www.devze.com), 2023-03-14 18:23, source: the web
I'm having some nasty character encoding problems that I just can't figure out.

Essentially, I'm screen scraping some HTML off a site using PHP, then running it through PHP's DOMDocument to change out some URLs, etc. When it's done, it outputs HTML with some weird things in it. For example, where there should be an end quote, it puts out ”

I have the page's meta tag for charset set to utf-8, but then the ” characters show up as † on the site. I'm not sure whether I just don't understand character encoding, or what.

Any suggestions on the best way to resolve this? Something client side with a meta tag, or some kind of server-side PHP conversion?


Sometimes setting the charset in the HTML or the response header isn't enough. If not everything on your server is set up for UTF-8, your text may get incorrectly converted somewhere along the way. You may need to enable UTF-8 encoding for both Apache and PHP in their config files. (If you're not using Apache, skip that step.)
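
To see what "incorrectly converted" can look like, here is a minimal PHP sketch. It assumes the mbstring extension is loaded, and it assumes the mix-up is UTF-8 bytes being read as Windows-1252, which is a common cause of this kind of garbage but not the only possible one.

```php
<?php
// U+201D (right double quotation mark) takes three bytes in UTF-8.
$quote = "\u{201D}";
$bytes = bin2hex($quote);

// Read as Windows-1252, every byte becomes a character of its own.
// Only the first two bytes are decoded here, because the third (0x9D)
// has no character assigned in Windows-1252.
$misread = mb_convert_encoding("\xE2\x80", 'UTF-8', 'Windows-1252');

echo $bytes, "\n";   // hex dump of the original quote character
echo $misread, "\n"; // the start of the garbled sequence
```

Each extra misreading along the way (scraper, PHP, web server, browser) makes the sequence longer, which is why every layer needs to agree on UTF-8.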

Apache UTF-8 setup:

Edit your charset.conf (preferred) or your httpd.conf file and add this line at the end:

AddDefaultCharset utf-8

(If you don't have access to Apache's config files, you can create a ".htaccess" file in your site's root directory containing that same line.)

PHP UTF-8 setup:

Edit your php.ini file, searching for "default_charset", and change it to:

default_charset = "utf-8"

Restart Apache:

Depending on your server type, this command may do the trick via command line:

sudo service apache2 restart
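
If you can't edit php.ini or the Apache config at all, the same defaults can usually be set per script instead. This is only a sketch: it has to run before any output is sent, and whether ini_set is allowed depends on your host.

```php
<?php
// Per-script equivalent of default_charset = "utf-8" in php.ini.
ini_set('default_charset', 'UTF-8');

// Explicit Content-Type header, similar in effect to AddDefaultCharset utf-8.
// headers_sent() guards against calling header() after output has started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
```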


I think you should link to or post the page (or part of it) that you are having problems with, along with some of your code, to get better feedback.

A few suggestions: try converting the page string you got from the encoding specified in its meta tag (or from the document's real encoding, if the two differ) to UTF-8, and/or force the document encoding on the DOMDocument object (as ~mario described, or via its properties). DOMDocument seems to honor the encoding meta tag only when it is the first thing in the HTML head tag.
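
A rough sketch of the conversion step, assuming the mbstring and dom extensions are available. The sample HTML, the ISO-8859-1 source encoding and the /old and /new URLs are made up for illustration; use whatever encoding the scraped page really has. Instead of relying on where the meta tag sits, this version turns every non-ASCII character into a numeric entity before parsing, so DOMDocument has nothing to guess. That is a different technique from forcing the encoding on the object, chosen here because it doesn't depend on the parser's detection rules.

```php
<?php
// Hypothetical scraped page, stored as ISO-8859-1 bytes (0xE9 is "é").
$scraped = "<html><head><title>Caf\xE9</title></head>"
         . "<body><a href=\"/old\">Caf\xE9</a></body></html>";

// 1. Convert from the page's real encoding to UTF-8.
$utf8 = mb_convert_encoding($scraped, 'UTF-8', 'ISO-8859-1');

// 2. Replace every non-ASCII character with a numeric entity,
//    so the markup handed to the parser is plain ASCII.
$ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// 3. Parse, collecting warnings about sloppy HTML instead of printing them.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($ascii);
libxml_clear_errors();

// 4. Change the URLs.
foreach ($doc->getElementsByTagName('a') as $link) {
    $link->setAttribute('href', '/new');
}

// 5. Strings read back from the DOM are UTF-8.
$first = $doc->getElementsByTagName('a')->item(0);
$linkText = $first->textContent;
$newHref = $first->getAttribute('href');

echo $doc->saveHTML();
```

Depending on the document, saveHTML() may write non-ASCII characters back out as entities, which browsers render correctly whatever charset the page declares.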

You can also try disabling entity conversion or some of the other properties, since you don't need them for simple URL changes.
