
UTF-8 input: validate it?

Developer https://www.devze.com 2023-02-26 03:16 Source: web

Never trust the input. But is that also true for the character encoding? Is it good practice to check the encoding of a received string, to avoid unexpected errors? Some people use preg_match to detect invalid strings. Others validate the string byte by byte. And some normalize it using iconv. What is the fastest and safest way to do this check?

edit

I noticed that if I try to save a corrupted UTF-8 string in my MySQL database, the string is truncated without warning. Are there countermeasures for this?


Is it good practice to check the encoding of a received string, to avoid unexpected errors?

No. There is no reliable way to detect the incoming data's encoding*, so the common practice is to define which encoding is expected:

  • If you are exposing an API of some sort, or a script that gets requests from third party sites, you will usually point out in the documentation what encoding you are expecting.

  • If you have forms on your site that are submitted to scripts, you will usually have a site-wide convention of which character set is used.

The possibility that broken data comes in is always there, if the declared encoding doesn't match the data's actual encoding. In that case, your application should be designed so there are no errors except that a character gets displayed the wrong way.
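One way to get that behavior, sketched here in Python under the assumption that the site-wide convention is UTF-8, is to decode leniently and let undecodable bytes become the replacement character U+FFFD. The function name `decode_leniently` is hypothetical. In PHP, iconv and the mbstring functions mentioned in this thread would play a similar role.

```python
def decode_leniently(data: bytes, expected: str = "utf-8") -> str:
    """Decode bytes in the expected encoding, never raising.

    Bytes that are invalid in `expected` are replaced with U+FFFD,
    so broken input shows up as a wrong character instead of an error.
    """
    return data.decode(expected, errors="replace")


# A Latin-1 encoded "café" arriving where UTF-8 was expected:
broken = "café".encode("iso-8859-1")
cleaned = decode_leniently(broken)

# The result is well-formed text again and can be re-encoded safely.
safe_bytes = cleaned.encode("utf-8")
```

Because the cleaned string re-encodes to well-formed UTF-8, this may also help with the silent truncation mentioned in the question, although that depends on how the database is configured.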

Looking at the encoding that the request declares the incoming data to be in like @Ignacio suggests is a very interesting idea, but I have never seen it implemented in the PHP world. That is not saying anything against it, but you were asking about common practices.

*: It is often possible to verify whether incoming data is valid in one specific encoding. For example, UTF-8 has specific byte values that can't stand on their own, but only form part of a multi-byte character. ISO-8859-1 special characters overlap with those values, and will therefore usually be detected as invalid in UTF-8. But detecting a completely unknown encoding from an arbitrary set of data is close to impossible.
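A minimal sketch of such a check, written in Python for illustration: a strict decode either succeeds or raises, which is enough to tell whether the bytes are valid UTF-8. The helper name `is_valid_utf8` is hypothetical. In PHP, `mb_check_encoding($s, 'UTF-8')` should serve the same purpose.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8, False otherwise."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")          # é becomes two bytes: 0xC3 0xA9
latin1_bytes = "café".encode("iso-8859-1")   # é becomes one byte: 0xE9

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

In the ISO-8859-1 case the byte 0xE9 announces a multi-byte sequence that never follows, so the check fails. Note that this only answers "is it valid UTF-8?", not "which encoding is it?".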


Look at the charset specified in the request.
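As a rough illustration of this idea, the following Python sketch reads the charset parameter from a Content-Type header value using the standard library. The function name `declared_charset` and the UTF-8 fallback are assumptions made for the example. The charset parameter is optional, so some default is needed.

```python
from email.message import Message


def declared_charset(content_type: str, default: str = "utf-8") -> str:
    """Return the charset declared in a Content-Type header value.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


charset = declared_charset("application/x-www-form-urlencoded; charset=ISO-8859-1")
body = "café".encode("iso-8859-1")
text = body.decode(charset)
```

The declared charset is still only a claim made by the client, so decoding with it can fail or produce wrong characters if the claim does not match the data.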


Your site publishes the web service or produces the form, so you can specify which encoding you expect. If the input passes your validation, everything is OK. If it doesn't, you don't need to care why it failed. If it was due to a wrong encoding, that is not your fault.
