
Would this regex be multibyte safe?

devze.com (https://www.devze.com), 2023-02-17 20:27. Source: web
I'm using the following regex to check that an image filename contains only alphanumeric characters, underscores, hyphens and decimal points:

preg_match('!^[\w.-]*$!',$filename) 

This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent undefined behaviour, or will this regex reliably reject multibyte filenames?


PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, there is no way to create a "multibyte-character string" as such. Instead, you choose to treat a native string as a multibyte-character string by passing it to the special "mbstring" functions. In other words, a PHP string is just a sequence of bytes that does not know its own character encoding, so you have to keep track of the encoding manually.
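To make the "a string does not know its own encoding" point concrete, here is a minimal sketch. It assumes PHP 7.0 or later (for the "\u{...}" escape) and treats mbstring as optional; the variable names are mine, not from the question.

```php
<?php
// "\u{00DF}" is the character ß (U+00DF). The escape always produces the
// UTF-8 byte sequence, regardless of how this source file is saved.
$s = "\u{00DF}";

// strlen() has no notion of encoding: it counts bytes.
$byteLength = strlen($s);

// mb_strlen() counts characters, but only because we tell it the encoding.
// It exists only when the mbstring extension is loaded, hence the guard.
$charLength = function_exists('mb_strlen') ? mb_strlen($s, 'UTF-8') : null;

var_dump($byteLength, $charLength);
```

strlen() reports 2 for this single character because it counts bytes, whereas mb_strlen() reports 1 once it is told the string is UTF-8.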

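You may be able to get away with it so long as you use UTF-8 (or a similarly ASCII-compatible encoding). UTF-8 always encodes multibyte characters using only "high" bytes, that is, bytes of 0x80 and above (for instance, ß is encoded as 0xc3 0x9f), so no byte of a multibyte character can be mistaken for an ASCII character, and a byte-oriented ASCII pattern will simply fail to match them. You would not be able to rely on this with an encoding that might encode part of a multibyte character as a "special" byte in the ASCII range, such as 0x22, the double-quote symbol, or 0x5c, the backslash, which Shift_JIS can use as a trailing byte.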
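The "high bytes" property can be checked directly. This is only a sketch, assuming PHP 7.0 or later; the sample filename is made up. I use an explicit ASCII class here rather than \w, because in byte mode the meaning of \w depends on PCRE's character tables and can vary with the locale.

```php
<?php
$filename = "stra\u{00DF}e.jpg";   // "straße.jpg" encoded as UTF-8

// Collect every byte at or above 0x80. In UTF-8 these are exactly the
// bytes that make up multibyte characters.
$highBytes = [];
foreach (str_split($filename) as $byte) {
    if (ord($byte) >= 0x80) {
        $highBytes[] = bin2hex($byte);
    }
}

// Byte-mode match (no u modifier) against an ASCII-only class: the two
// high bytes of ß are not in the class, so the match fails.
$asciiOnly = preg_match('!^[A-Za-z0-9_.-]*$!', $filename);

var_dump($highBytes, $asciiOnly);
```

The only high bytes found are c3 and 9f, and the match result is 0, so a UTF-8 filename containing ß is rejected rather than misread.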

The only regular expression functions in PHP that know how to deal with multibyte characters across a range of different character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.

PCRE-based regular expression functions such as preg_match support UTF-8 (and only UTF-8) through the u modifier (PCRE_UTF8).
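Here is a small sketch of what the u modifier changes for the pattern in the question. It assumes PHP 7.0 or later and a PCRE build with Unicode property support (the bundled one has it, as far as I know); the sample filenames are invented.

```php
<?php
$unicodeName = "stra\u{00DF}e.jpg";  // valid UTF-8, contains ß
$brokenName  = "stra\xC3e.jpg";      // 0xC3 without a continuation byte: invalid UTF-8

// The question's pattern plus the u modifier. In PHP the u modifier also
// makes \w Unicode-aware, so ß now counts as a word character.
$unicodeWord = preg_match('!^[\w.-]*$!u', $unicodeName);

// An explicit ASCII class keeps the check ASCII-only even in UTF-8 mode.
$asciiOnly = preg_match('!^[A-Za-z0-9_.-]*$!u', $unicodeName);

// With the u modifier, a subject that is not valid UTF-8 makes
// preg_match() return false instead of 0 or 1.
$broken      = preg_match('!^[\w.-]*$!u', $brokenName);
$brokenError = preg_last_error();

var_dump($unicodeWord, $asciiOnly, $broken, $brokenError === PREG_BAD_UTF8_ERROR);
```

So adding u alone would make the original pattern accept "straße.jpg". If the goal is to reject non-ASCII filenames, spell out the ASCII class, and compare the result with === 1 so that the false returned for invalid UTF-8 is also treated as a rejection.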

But of course, as described above, PHP strings don't know their own encoding, so before calling the mb_ereg family you first need to tell the "mbstring" library which encoding to assume, using the mb_regex_encoding function. Note that this function specifies the encoding of the string you're matching against, not of the string containing the regular expression itself.
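A rough sketch of the mb_ereg route follows. The helper name mb_filename_is_ascii() is hypothetical, and I am assuming mbstring was built with regex support, so the function returns null when mb_ereg is unavailable. I also swapped ^ and $ for \A and \z, because in the Ruby-style syntax that mb_ereg uses by default, ^ and $ match at line boundaries rather than only at the ends of the string.

```php
<?php
// Hypothetical helper: true if $filename consists only of ASCII letters,
// digits, underscore, hyphen and dot; null if mb_ereg() is not available.
function mb_filename_is_ascii(string $filename, string $encoding = 'UTF-8'): ?bool
{
    if (!function_exists('mb_ereg')) {
        return null; // mbstring (with regex support) is not loaded
    }

    $previous = mb_regex_encoding();  // remember the current setting
    mb_regex_encoding($encoding);     // encoding of the subject string

    // mb_ereg() patterns take no delimiters. \A and \z anchor the whole
    // string, not just a single line.
    $matched = mb_ereg('\A[A-Za-z0-9_.-]*\z', $filename);

    mb_regex_encoding($previous);     // restore the previous setting

    // PHP 8 returns bool; PHP 7 returns an int on success, false on failure.
    return $matched !== false;
}

var_dump(mb_filename_is_ascii('photo_01.jpg'));
var_dump(mb_filename_is_ascii("stra\u{00DF}e.jpg"));
```

Restoring the previous encoding is a design choice on my part, since mb_regex_encoding() changes global state that other code in the same request may rely on.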
