Perl regex choking on multiple instances of character sets

Developer https://www.devze.com 2023-01-26 04:13 Source: Internet

I started out with some crazy failures using preg_replace in PHP and boiled it down to one problem case: a pattern containing more than one character class that holds both the Turkish dotted "i" and the undotted "ı". Here is a simple test case in PHP:

<?php
    echo 'match single normal i: ';
    $str = 'mi';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match single undotted ı: ';
    $str = 'mı';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match double normal i: ';
    $str = 'misir';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";

    echo 'match double undotted ı: ';
    $str = 'mısır';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";
?>

And the same test case again in Perl:

#!/usr/bin/perl

$str = 'mi';
$str =~ m/m[ıi]/ && print "match single normal i\n";

$str = 'mı';
$str =~ m/m[ıi]/ && print "match single undotted ı\n";

$str = 'misir';
$str =~ m/m[ıi]s[ıi]r/ && print "match double normal i\n";

$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

The first three tests work fine. The last one does not match.

Why does the character class work once but not a second time in the same expression? How do I write an expression that matches a word like this no matter which combination of the two letters it is written with?

Edit: Background on the language problem I'm trying to program for.

Edit 2: Adding a use utf8; directive does fix the Perl version. Since my original problem was with a PHP program and I only switched to Perl to see if it was a bug in PHP, that doesn't help me a whole lot. Does anybody know the directive to make PHP not choke on this?


You may need to tell Perl that your source file contains UTF-8 characters. Try:

#!/usr/bin/perl

use utf8;   # **** Add this line

$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

That doesn't help you with PHP, but there may be a similar directive in PHP. Otherwise, try using some form of escape sequence to avoid putting the literal character in your source code. I know nothing about PHP so I can't help with that.
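
One way to follow the escape-sequence suggestion in PHP, as a sketch rather than a confirmed fix for the original program: PCRE's u pattern modifier makes preg_match treat both pattern and subject as UTF-8, and \x{0131} names the dotless ı by code point, so no literal character is needed in the source. The test words below are built from byte escapes and are illustrative only. This assumes the input really is valid UTF-8.

```php
<?php
// \x{0131} is the code point of dotless i. With the u modifier PCRE reads
// pattern and subject as UTF-8, so each class holds two characters.
$pattern = '!m[\x{0131}i]s[\x{0131}i]r!u';

// "\xc4\xb1" is the UTF-8 encoding of U+0131, written as bytes so that the
// encoding of this source file does not matter.
$words = ['misir', "m\xc4\xb1s\xc4\xb1r", "m\xc4\xb1sir"];

foreach ($words as $word) {
    echo $word, ': ', preg_match($pattern, $word) === 1 ? "ok\n" : "fail\n";
}
```

If the subject is not valid UTF-8, preg_match with the u modifier returns false instead of 0 or 1, which is why the sketch compares against 1 explicitly.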

Edit
I'm reading that PHP strings have no native Unicode support. Therefore, the Unicode input you pass it is likely treated as the string of bytes that the text was encoded as.

If you can be sure that your input is coming in as UTF-8, then you can match the UTF-8 byte sequence for ı, which is \xc4 \xb1, as in:

$str = 'mısır';  # Make sure this source-file is encoded as utf-8 or this match will fail
echo (preg_match('!m(i|\xc4\xb1)s(i|\xc4\xb1)r!', $str)) ? "ok\n" : "fail\n";

Does that work?

Edit again:
I can explain why your first three tests pass. Let's pretend that in your encoding, ı is encoded as ABCDE. Then PHP sees the following:

echo 'match single normal i: ';
$str = 'mi';
echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

echo 'match single undotted ABCDE: ';
$str = 'mABCDE';
echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

echo 'match double normal i: ';
$str = 'misir';
echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

echo 'match double undotted ABCDE: ';
$str = 'mABCDEsABCDEr';
echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

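which makes it obvious why the first three tests pass and the last one fails: a character class matches exactly one of A, B, C, D, E or i, so the second test matches only mA, and the fourth fails because B follows mA where s is expected. If you use a start/end anchor ^...$ I think you'll find that only the plain ASCII tests (the first and third) pass.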
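
The same reasoning can be checked against the real bytes. This is a minimal sketch, assuming UTF-8 input, in which ı is the two bytes \xc4 \xb1; the variable names are illustrative only.

```php
<?php
$dotless = "\xc4\xb1";             // UTF-8 bytes of U+0131 (dotless i)
$class   = '[' . $dotless . 'i]';  // without the u modifier: a class of three single bytes
$word    = 'm' . $dotless . 's' . $dotless . 'r';

// Test 2: the class consumes only the first byte of the two-byte letter.
preg_match('!m' . $class . '!', 'm' . $dotless, $m);
$partial = bin2hex($m[0]);         // hex of the bytes actually matched

// Test 4: after "m" and one class byte, the next byte is 0xb1, not "s".
$double = preg_match('!m' . $class . 's' . $class . 'r!', $word);

echo strlen($word), "\n";          // 7: strlen counts bytes, not letters
echo $partial, "\n";               // 6dc4: "m" plus half of the letter
echo $double, "\n";                // 0: no match
```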


Multibyte sequences won’t do what you want in bracketed char classes if the UTF-8 is being misinterpreted as a sequence of 8-bit bytes. Think about it. If [nñm] is misconstrued not as three logical characters but as four physical bytes, you would only match a single byte whose value is 6E or C3 or B1 or 6D.

For some purposes, you might get away with rewriting [nñm] as (?:n|ñ|m). It just depends what you’re doing. Case-insensitive matching won’t work.
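
Applied to the Turkish case, the alternation rewrite might look like the sketch below. The helper turkish_i_pattern is hypothetical (it is not part of PHP or of this thread) and assumes the word is UTF-8 and contains no NUL byte.

```php
<?php
// Hypothetical helper: builds a pattern in which every dotted or dotless i
// in $word may match either letter. Assumes $word is UTF-8 and contains no
// NUL byte, which is used here as a temporary placeholder.
function turkish_i_pattern(string $word): string
{
    $dotless     = "\xc4\xb1";                // UTF-8 bytes of U+0131
    $alternation = '(?:i|' . $dotless . ')';  // whole byte sequences, not a class

    // Mark both letters with the placeholder, then split on it.
    $parts  = explode("\0", str_replace([$dotless, 'i'], "\0", $word));
    $quoted = array_map(function ($part) {
        return preg_quote($part, '!');        // escape any regex metacharacters
    }, $parts);

    return '!' . implode($alternation, $quoted) . '!';
}

$pattern = turkish_i_pattern('misir');
foreach (['misir', "m\xc4\xb1s\xc4\xb1r", "mis\xc4\xb1r", 'masar'] as $word) {
    echo $word, ': ', preg_match($pattern, $word) === 1 ? "ok\n" : "fail\n";
}
```

Because the alternation compares whole byte sequences, it works without any UTF-8 mode, but it covers only the two lowercase letters; the capital forms I and İ would need their own alternatives.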

Also, Unicode has special casing rules for a Turkish dotless i.

Sounds like PHP just isn’t modern enough. Sigh.
