I'm using the following regex to check that an image filename contains only alphanumeric characters, underscores, hyphens, and decimal points:
preg_match('!^[\w.-]*$!',$filename)
This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent undefined behavior, or will this regex reliably reject multibyte filenames as it stands?
PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, there appears to be no way to create a "multibyte-character string" as such; rather, you choose to treat a native string as a multibyte-character string by using the special "mbstring" functions. In other words, a PHP string does not know its own character encoding, so you have to keep track of it manually.
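To illustrate the point that a PHP string is just a sequence of bytes, here is a minimal sketch. It assumes the "mbstring" extension is loaded; the filename is made up, and the ß is written with explicit byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "straße.png" with ß spelled out as its two UTF-8 bytes (0xC3 0x9F).
$name = "stra\xC3\x9Fe.png";

// strlen() counts bytes; it has no idea the string is UTF-8.
var_dump(strlen($name));             // int(11)

// mb_strlen() counts characters, but only because we tell it the encoding.
var_dump(mb_strlen($name, 'UTF-8')); // int(10)
```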
You may be able to get away with it so long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes multibyte characters using only "high" bytes, 0x80 and above (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other character. You would not be able to use an encoding that might encode part of a multibyte character as a byte that is "special" to PHP, such as 0x22, the "double-quote" symbol.
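As a rough sketch of what this means for the filename check in the question: because every byte of a UTF-8 multibyte character is a high byte, an ASCII-only whitelist will not match it. The helper name is_ascii_filename is hypothetical, and the character class is spelled out instead of using \w so the result does not depend on the locale; \z is used so a trailing newline is not accepted.

```php
<?php
// ß (U+00DF) as explicit UTF-8 bytes, independent of the script file's encoding.
$eszett = "\xC3\x9F";
echo bin2hex($eszett), "\n"; // c39f

// Hypothetical helper: accepts only ASCII letters, digits, "_", "." and "-".
function is_ascii_filename(string $filename): bool
{
    // No "u" modifier: the subject is matched byte by byte.
    return preg_match('/^[A-Za-z0-9_.-]*\z/', $filename) === 1;
}

var_dump(is_ascii_filename("photo_01.png"));       // bool(true)
var_dump(is_ascii_filename("stra{$eszett}e.png")); // bool(false)
```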
The only regular expression functions in PHP that can handle multibyte characters across a range of character encodings are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE-based regular expression functions like preg_match support UTF-8 through the u modifier (PCRE_UTF8).
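A small sketch of the difference the u modifier makes, using a made-up filename with ß written as explicit UTF-8 bytes. Without the modifier the pattern sees bytes; with it, the pattern sees code points, and a subject that is not valid UTF-8 makes preg_match fail outright.

```php
<?php
$name = "stra\xC3\x9Fe.png"; // 10 characters, 11 bytes in UTF-8

// Byte mode: "." matches one byte, so an 11-byte string is not 10 "characters".
var_dump(preg_match('/^.{10}$/', $name));  // int(0)

// UTF-8 mode: "." matches one code point.
var_dump(preg_match('/^.{10}$/u', $name)); // int(1)

// With the u modifier, an invalid UTF-8 subject is an error, not a non-match.
var_dump(preg_match('/^.*$/u', "\xFF"));              // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

If the u modifier were simply added to the pattern from the question, \w would likely start accepting non-ASCII letters as well, since PHP normally enables Unicode character properties together with UTF-8 mode. For a strict ASCII whitelist that is probably not what is wanted.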
For the mb_ereg family, however, remember that PHP strings don't know their own encoding, as described above, so you first need to tell the "mbstring" library which encoding to assume by calling the mb_regex_encoding function. Note that this function specifies the encoding of the string you're matching, not of the string containing the regular expression itself.
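A minimal sketch of that workflow, assuming the "mbstring" extension is available and the subject is UTF-8 (the filename is made up). The return value of mb_ereg differs between PHP versions (an integer before PHP 8, a boolean from PHP 8 on), so the result is compared against false instead of being used directly.

```php
<?php
// Tell mbstring which encoding the mb_ereg* functions should assume.
mb_regex_encoding('UTF-8');

$name = "stra\xC3\x9Fe.png"; // ß as explicit UTF-8 bytes: 10 characters, 11 bytes

// "." now matches one UTF-8 character, so the 10-character name matches.
$matched = mb_ereg('^.{10}$', $name) !== false;
var_dump($matched); // bool(true)
```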