I was looking into multi-byte characters and how they are used, but how many different identifiers/patterns are there for representing them?
e.g. &nbsp;, &#nbsp;, U+0026, %20
How many different identifiers such as &, &#, U+, %, etc. are there?
I'm trying to inspect inputs: if they contain words more than 255 characters long, it's probably a multi-byte hack attempt. I could then check whether the word, once split, contains one of these multi-byte identifiers, and if so stop the hack attempt.
%.. format - a URL-encoded value for embedding into URLs, e.g. %20 is a space (ASCII hex 20, decimal 32)
&nbsp; - a named character entity, a non-breaking space in this case
U+0026 - a Unicode code point in hex notation, an & in this case
&#...; - a numbered character entity in decimal (base 10): &#38; = &
&#x...; - a numbered character entity in hex (base 16): &#x26; = &
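To make the list above concrete, here is a minimal PHP sketch that decodes each notation back to the character it stands for. It uses only PHP built-ins (rawurldecode, html_entity_decode, bin2hex, and the "\u{...}" string escape from PHP 7.0 onward). The helper name decode_forms() is made up for this illustration, and the sketch assumes UTF-8 throughout.

```php
<?php
// Hypothetical helper: returns the decoded form of each notation for "&".
function decode_forms(): array
{
    $flags = ENT_QUOTES | ENT_HTML5;

    return [
        // URL encoding: % followed by the byte value in hex
        'url'        => rawurldecode('%26'),
        // Named character entity
        'named'      => html_entity_decode('&amp;', $flags, 'UTF-8'),
        // Numbered entity, decimal
        'decimal'    => html_entity_decode('&#38;', $flags, 'UTF-8'),
        // Numbered entity, hex
        'hex'        => html_entity_decode('&#x26;', $flags, 'UTF-8'),
        // U+0026 written with PHP's own code point escape
        'code_point' => "\u{0026}",
    ];
}

// A non-breaking space is one character but two bytes in UTF-8.
$nbsp = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo bin2hex($nbsp), "\n"; // c2a0
```

All five notations for & decode to the same single byte, while &nbsp; decodes to a two-byte sequence, so the notation used says nothing by itself about whether the resulting character is multi-byte.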
Are you trying to avoid homoglyph-based spoofing? Does "identifier" mean username here?
If yes, and if your users use a Latin alphabet, just allow only ASCII letters and digits:
$identifier = preg_replace('#[^A-Za-z0-9]+#', '', $identifier);
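As a rough illustration of what that whitelist does to a spoofed name, the sketch below wraps it in a function and runs it on a string containing a Cyrillic "о" (U+043E) in place of the Latin "o". The function names sanitize_identifier() and is_plain_identifier() are hypothetical. The second one is an assumed stricter variant that rejects the input instead of silently rewriting it.

```php
<?php
// Hypothetical wrapper around the whitelist shown above.
function sanitize_identifier(string $identifier): string
{
    // Remove every byte that is not an ASCII letter or digit.
    return preg_replace('#[^A-Za-z0-9]+#', '', $identifier);
}

// Hypothetical stricter variant: accept only if nothing would be removed.
function is_plain_identifier(string $identifier): bool
{
    return preg_match('#\A[A-Za-z0-9]+\z#', $identifier) === 1;
}

$spoofed = "j\u{043E}hn"; // looks like "john", but the "o" is Cyrillic

echo sanitize_identifier($spoofed), "\n"; // jhn
var_dump(is_plain_identifier($spoofed));  // bool(false)
var_dump(is_plain_identifier('john42'));  // bool(true)
```

Stripping turns the spoofed name into a different string ("jhn"), which may collide with an existing username, so rejecting the input outright may be the safer choice depending on the application.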