I have the following code:
preg_replace('/[^\w-]/u','.','Bréánná MÓÚLÍN');
Which on server A (PHP 5.3.5) returns:
"Bréánná.Móúlín" (as it should)
However, on server B (PHP 5.2.11) it returns:
"Br..n..M..l.n" (not what I want at all)
Am I right in thinking that this is down to whether or not PCRE_UCP was set when PHP was compiled?
Is there any way of overriding this if this is the case?
Failing that, is there any way of easily replacing such characters with a 'standard' equivalent? (Like utf8_decode but more expansive)
I am not sure whether PCRE_UCP being defined during compilation affects preg_replace(), but a workaround for your problem is to use the multibyte string function mb_ereg_replace():
<?php
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
echo mb_ereg_replace('[^0-9A-Za-zÀ-ÖØ-öø-˿Ͱ-ͽͿ--⁰-Ⰰ-、-豈-﷏ﷰ-�̀-ͯ‿-⁀\\-]','.','Bréánná MÓÚLÍN');
PHP 5.2 results: http://codepad.viper-7.com/UnZeyf
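On the last part of the question (replacing accented characters with a "standard" equivalent), one common idea is to decompose each character into a base letter plus combining marks and then drop the marks. The following is only a minimal sketch of that idea in Java, not PHP; the class name AsciiFold and the method stripAccents are made up for illustration, and letters that have no decomposition (such as "ø" or "ß") would pass through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper: folds accented Latin letters to their base letters.
class AsciiFold {
    static String stripAccents(String input) {
        // NFD splits e.g. U+00E9 into "e" followed by U+0301 (combining acute accent).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches combining marks; removing them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The sample string from the question, written with escapes to avoid
        // depending on the source file encoding.
        System.out.println(stripAccents("Br\u00E9\u00E1nn\u00E1 M\u00D3\u00DAL\u00CDN"));
    }
}
```

For the sample string this should give "Breanna MOULIN", which could then be passed through a plain ASCII pattern.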
EDIT: I originally thought that the multibyte ereg functions supported Unicode character type escapes, but this turns out not to be true. Instead, you need to decide which ranges of characters you consider "letters" and list them explicitly. I used the character ranges from the XML standard's definition of NameChar, together with the following Java program, to generate the regular expression string (apparently the multibyte ereg functions do not support Unicode character escape sequences either, so the characters have to appear literally in the pattern):
import java.io.*;

public class SO7456963 {
    public static void main(String[] args) throws Throwable {
        // Write the pattern as UTF-8 so the literal characters can be pasted into the PHP source.
        Writer w = new OutputStreamWriter(new FileOutputStream("SO7456963.txt"), "UTF-8");
        // "\\\\-" puts \\- into the file, which a single-quoted PHP string reads as \- (a literal hyphen).
        w.write("[^0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\u0300-\u036F\u203F-\u2040\\\\-]");
        w.close();
    }
}
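Before pasting the generated pattern into PHP, it may be worth checking that the ranges behave as intended. The sketch below applies the same character class with java.util.regex to the sample string from the question; the class name NameCharCheck and the method sanitize are made up for illustration, and Java's regex engine is only a stand-in here, not a guarantee of what mb_ereg_replace() does.

```java
import java.util.regex.Pattern;

// Hypothetical check: same ranges as the generator above, applied in Java.
class NameCharCheck {
    // Negated class: anything that is not a digit, ASCII letter, one of the
    // listed Unicode ranges, or a hyphen. "\\-" is a literal hyphen in the class.
    static final Pattern NOT_NAME_CHAR = Pattern.compile(
        "[^0-9A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
        + "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
        + "\uF900-\uFDCF\uFDF0-\uFFFD\u0300-\u036F\u203F-\u2040\\-]");

    static String sanitize(String input) {
        // Replace every character outside the allowed set with a dot.
        return NOT_NAME_CHAR.matcher(input).replaceAll(".");
    }

    public static void main(String[] args) {
        // Sample string from the question, written with escapes to avoid
        // depending on the source file encoding.
        System.out.println(sanitize("Br\u00E9\u00E1nn\u00E1 M\u00D3\u00DAL\u00CDN"));
    }
}
```

Only the space in the sample string falls outside the listed ranges, so the accented letters should survive and the result should be "Bréánná.MÓÚLÍN". Note that, unlike \w, the class as written does not include the underscore, so "_" is replaced as well; add it to the class if that matters.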