I need to define a PCRE regexp for certain spam-ish words in the Arabic/Persian alphabet, to be used in the Drupal spam module. The problem is that an ordinary PCRE regexp is apparently unable to find patterns in the Arabic alphabet.
For example, /bad word/ flags instances of 'bad word', but
/کلمه بد/i
is unable to flag 'کلمه بد'.
I have no problem with that if I use the u (Unicode) PCRE modifier:
$string = 'کلمه بد';
if (preg_match('~\p{Arabic}~u', $string) > 0)
{
    var_dump('contains Arabic characters');
    if (preg_match('~کلمه بد~ui', $string) > 0)
    {
        var_dump('contains spam-ish Arabic characters');
    }
}
string(26) "contains Arabic characters"
string(35) "contains spam-ish Arabic characters"
It runs just fine on IDEOne.com too. Be sure to save your source files as UTF-8 and to convert any input data to UTF-8 before matching.
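If you have several words to check, something along these lines might work. It is only a sketch: contains_spam_word() is a hypothetical helper name, not part of the Drupal spam module, and it assumes both the word list and the input are already UTF-8. It quotes each word with preg_quote(), joins the words into one alternation, and applies the u and i modifiers as above.

```php
<?php
// Hypothetical helper for illustration; not part of the Drupal spam module.
function contains_spam_word(string $text, array $words): bool
{
    if (count($words) === 0) {
        return false; // an empty alternation would match everything
    }
    // Escape each word so regex metacharacters in it are matched literally.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '~');
    }, $words);
    // u: treat pattern and subject as UTF-8; i: case-insensitive.
    $pattern = '~' . implode('|', $quoted) . '~ui';
    $result = preg_match($pattern, $text);
    if ($result === false) {
        // With the u modifier, preg_match() fails on a subject that is not valid UTF-8.
        $reason = preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'input is not valid UTF-8'
            : 'preg_match() failed';
        throw new InvalidArgumentException($reason);
    }
    return $result === 1;
}

$words = ['کلمه بد', 'bad word'];
var_dump(contains_spam_word('این یک کلمه بد است', $words)); // bool(true)
var_dump(contains_spam_word('A BAD WORD here', $words));    // bool(true)
var_dump(contains_spam_word('hello world', $words));        // bool(false)
```

Throwing on invalid UTF-8, rather than returning false, keeps a badly encoded message from silently passing as clean; whether that is the right behaviour for your filter is a judgment call.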
Literal Unicode text in Perl source will only be recognized properly if the source file has use utf8; in it.
You can do /\x{644}/
and you can do
open my $fh, '<:utf8', 'somefile.txt' or die "blah blah";
my $bad_thing = <$fh>;
chomp $bad_thing;  # drop the trailing newline so it is not part of the pattern
/$bad_thing/;
and either will work without the utf8 pragma if your data is properly decoded, but if you want to do /ل/ then you need use utf8. Make sense?