I have some text files that contain non-ASCII characters. I want to remove them but keep the formatting characters (newlines and so on).
I tried
$description = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $description);
However, that appeared to strip out newlines and other formatting, and it also had problems with some Hebrew text, converting this
משפטים נוספים מהמומחה. נסו ותהנו! חג חנוכה שמח **************************************** חדש - האפליקציה היחידה שאומרת לך מה מצב הסוללה שלך ** NEW to version 1.1 - the expert talks!!! *
to this
1.4 :", ..."" ..."" 50 ..." . , . ! **************************************** - ** NEW to version 1.1 - the expert talks!!! *
That pattern doesn't only remove non-ASCII characters. ASCII characters are inside the range 0-127, and \x00-\x1F covers ASCII control characters such as tab and newline, which is why your formatting disappears. Beyond that, what you're basically trying to do is write a regex to convert one character set to another, which is a lot harder than just replacing some of the characters.
As for what you want to do, I think you want the iconv function. You'll need to know the input encoding, but once you do, you can tell it to ignore non-representable characters:
$text = iconv('UTF-8', 'ASCII//IGNORE', $text);
You could also use ISO-8859-1, or any other target character set you want.
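A slightly more defensive sketch of the iconv call above, assuming the input really is UTF-8. The helper name to_ascii_with_iconv is made up for illustration. How //IGNORE behaves depends on the system's iconv implementation, and some builds return false rather than the filtered string, so the return value is checked instead of trusted:

```php
<?php
// Hypothetical helper (not part of PHP): convert UTF-8 text to ASCII,
// dropping characters that ASCII cannot represent.
function to_ascii_with_iconv(string $text): ?string
{
    // @ silences the notice/warning that some iconv builds raise.
    $converted = @iconv('UTF-8', 'ASCII//IGNORE', $text);

    // iconv() returns false on failure; report that to the caller as null.
    return $converted === false ? null : $converted;
}

$result = to_ascii_with_iconv("Line one\nLine two");
var_dump($result);
```

If the helper returns null on your platform, a regex-based filter is the more portable route.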
What you're doing won't work because you're treating a UTF-8 string as if it were a single-byte encoding. Without UTF-8 mode the pattern matches individual bytes, so you are removing the bytes that make up multi-byte characters rather than whole characters. You must add the u flag to the regex to activate UTF-8 mode.
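To see the difference between byte mode and UTF-8 mode, here is a small illustration. It assumes PHP 7 or later for the "\u{...}" escape and uses a single Hebrew letter as sample input; the variable names are only for the example:

```php
<?php
// "\u{05E9}" is the Hebrew letter shin, encoded in UTF-8 as two bytes (D7 A9).
$shin = "\u{05E9}";

// strlen() counts bytes, not characters.
$byteLength = strlen($shin);

// Without the u flag, "." matches one byte at a time.
$byteMatches = preg_match_all('/./', $shin);

// With the u flag, "." matches one whole UTF-8 character.
$charMatches = preg_match_all('/./u', $shin);

printf(
    "%d bytes, %d byte matches, %d character matches\n",
    $byteLength,
    $byteMatches,
    $charMatches
);
// prints "2 bytes, 2 byte matches, 1 character matches"
```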
Since you want to keep only the control characters and the rest of the ASCII range, you have to replace everything else with ''. So:
$description = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $description);
which gives for your input:
. ! ********************* - * NEW to version 1.1 - the expert talks!!! *
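As a rough sketch, the same pattern can be wrapped in a function that also handles the case where preg_replace fails. The function name strip_non_ascii and the sample input are made up for illustration, and the "\u{...}" escapes assume PHP 7 or later:

```php
<?php
// Hypothetical helper: remove every character outside the ASCII range
// while keeping ASCII control characters such as "\n" and "\t".
function strip_non_ascii(string $text): string
{
    // The u flag makes PCRE treat pattern and subject as UTF-8,
    // so each multi-byte character is matched (and removed) as a whole.
    $result = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $text);

    // preg_replace() returns null on error, for example malformed UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed; the input may not be valid UTF-8'
        );
    }

    return $result;
}

// Four Hebrew letters, a newline, then ASCII text containing a tab.
$input = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}\nNEW to version 1.1\t- the expert talks!";
echo strip_non_ascii($input);
```

The newline and the tab survive because they fall inside \x{0000}-\x{007F}; only the Hebrew letters are dropped.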