I am somewhat confused by this whole character set thingy. Everything seems fine when the data is entered manually into the web sites and database tables. But when data is entered by copying and pasting, the character sets begin to get screwy.
I asked several clients where they are getting this data from; the majority seems to come either from another web site or from an MS document.
The characters that seem to be messing up are common characters like the following:
‘ © "
What gets inserted instead is the black triangle with the dreaded question mark! On my server I have the following settings.
PHP Tidy to clean the text before input to web page or database: output-encoding > UTF-8
Each web page has a meta tag > charset=UTF-8
The database tables default > latin1_swedish_ci
I assumed at first it was a database problem, until I noticed that the same issue occurs with static web pages that are not database driven.
Help?
It's not really a good solution to replace away the smart quotes. If you can't cope with smart quotes or the copyright symbol, you can't cope with any other non-ASCII characters either, leaving you with an ASCII-only application (which these days is a pretty sad thing).
Instead you should ideally ensure that your web application uses UTF-8 throughout, which means:
- Serve all your pages as UTF-8, using a header('Content-Type: text/html; charset=utf-8'); and/or a <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>.
- Ensure your .php source files are saved as UTF-8, if they contain any non-ASCII characters themselves.
- Use mysql_set_charset('utf8') when connecting to the database. Note that MySQL spells the character set name without a hyphen.
- Ensure your MySQL tables are created with a UTF-8 CHARACTER SET/COLLATION. They won't be by default if you didn't specify one when you created them. In this case you would need to ALTER TABLE on each text column to change it.
- If you use htmlentities() to HTML-escape database content when putting it into the page, you need to pass in 'UTF-8' for the $charset argument, or it will mangle all non-ASCII characters by treating them as ISO-8859-1 (its default before PHP 5.4, and never the proper encoding here). Better: use htmlspecialchars() instead, which doesn't touch non-ASCII characters so doesn't care.
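To make the escaping point concrete, here is a minimal sketch comparing htmlentities() with the wrong charset, htmlentities() with the right one, and htmlspecialchars(). The sample string and the variable names are made up for illustration, and the PHP_SAPI guard around header() is just an assumption so the script can also be run from the command line.

```php
<?php
// Sample text with the characters from the question, written as explicit
// UTF-8 byte sequences so this file's own encoding does not matter.
// It reads: ‘Tom & Jerry’ ©
$text = "\xE2\x80\x98Tom & Jerry\xE2\x80\x99 \xC2\xA9";

// Wrong: the UTF-8 bytes are treated as ISO-8859-1, so every byte is
// escaped on its own. The two bytes of © come out as "&Acirc;&copy;".
$mangled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

// Right: tell htmlentities() the real encoding.
// Result: &lsquo;Tom &amp; Jerry&rsquo; &copy;
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

// Simpler: escape only the HTML-special characters and leave the
// non-ASCII characters as they are (only "&" changes here).
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Declare the encoding to the browser. Skipped on the command line
// (illustrative guard, not part of the original advice).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo $mangled, "\n", $entities, "\n", $escaped, "\n";
```

The third form is usually the easiest to live with: as long as the page really is served as UTF-8, the browser displays the raw characters correctly and nothing needs to be turned into entities.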