What's this regex doing?

Developer https://www.devze.com 2023-03-29 06:46 Source: Internet

I've found this regex in a script I'm customizing. Can someone tell me what it's doing?

function test( $text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}


Inside of the capturing group there are four options:

  1. [\x00-\x7F]
  2. [\xC0-\xDF][\x80-\xBF]
  3. [\xE0-\xEF][\x80-\xBF]{2}
  4. [\xF0-\xF7][\x80-\xBF]{3}

If none of these patterns matches at a given position, the . outside the capturing group matches whatever single character is there. Because the pattern has no /u modifier, it works on bytes, so "character" here means one byte.

The preg_replace call will iterate over $text finding all non-overlapping matches, replacing each match with whatever was captured.

There are two possibilities here: either the entire match was inside the capturing group, so the replacement doesn't change $text, or the . at the end matched a single character, and that character is removed from $text.

Here are some basic examples:

  • If a character in the range \xF8-\xFF appears in the text, it will always be removed
  • A character in \xC0-\xDF will be removed unless followed by a character in \x80-\xBF
  • A character in \xE0-\xEF will be removed unless followed by two characters in \x80-\xBF
  • A character in \xF0-\xF7 will be removed unless followed by three characters in \x80-\xBF
  • A character in \x80-\xBF will be removed unless it was matched as a part of one of the above cases
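
The cases above can be checked with a short script. This is a minimal sketch that reuses the function from the question; the sample strings are made up for illustration, and bin2hex() is used only to make the surviving bytes visible.

```php
<?php
// Same function as in the question.
function test($text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

// "p" is 0x70 and "q" is 0x71; they mark the start and end of each sample.
echo bin2hex(test("p\xFFq")), "\n";             // 7071: \xFF is always removed
echo bin2hex(test("p\xC3\xA9q")), "\n";         // 70c3a971: \xC3 plus one continuation byte is kept
echo bin2hex(test("p\xC3q")), "\n";             // 7071: \xC3 without a continuation byte is removed
echo bin2hex(test("p\xE2\x82q")), "\n";         // 7071: \xE2 needs two continuation bytes, so both bytes go
echo bin2hex(test("p\x80q")), "\n";             // 7071: stray continuation byte is removed
echo bin2hex(test("p\xF0\x9F\x98\x80q")), "\n"; // 70f09f988071: complete four-byte sequence is kept
```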


The purpose appears to be to "clean" UTF-8 encoded text. The part in the capturing group,

( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} )

...roughly matches a valid UTF-8 byte sequence, which may be one to four bytes long. The value of the first byte determines how long that particular byte sequence should be.
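
To illustrate how the first byte fixes the length, here is a small sketch. The helper name expected_sequence_length is hypothetical and not part of the original script; it simply mirrors the four ranges used in the regex. Those ranges are somewhat wider than strict UTF-8, which only allows lead bytes \xC2-\xDF, \xE0-\xEF and \xF0-\xF4, and that is part of why the match is only "rough".

```php
<?php
// Hypothetical helper (not in the original script): maps a lead byte value
// (assumed to be in 0..255) to the sequence length the regex expects,
// using the same ranges as the pattern in the question.
function expected_sequence_length(int $lead): int {
    if ($lead <= 0x7F) return 1;                   // [\x00-\x7F]
    if ($lead >= 0xC0 && $lead <= 0xDF) return 2;  // [\xC0-\xDF] + 1 continuation byte
    if ($lead >= 0xE0 && $lead <= 0xEF) return 3;  // [\xE0-\xEF] + 2 continuation bytes
    if ($lead >= 0xF0 && $lead <= 0xF7) return 4;  // [\xF0-\xF7] + 3 continuation bytes
    return 0; // \x80-\xBF (continuation) or \xF8-\xFF: never starts a sequence
}

echo expected_sequence_length(ord("A")), "\n";    // 1
echo expected_sequence_length(ord("\xC3")), "\n"; // 2
echo expected_sequence_length(ord("\xE2")), "\n"; // 3
echo expected_sequence_length(ord("\xF0")), "\n"; // 4
echo expected_sequence_length(ord("\x80")), "\n"; // 0
```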

Since the replacement is simply '$1', valid byte sequences will be plugged right back into the output. Any byte that's not matched by that part will instead be matched by the dot (.) and effectively removed.

The most important thing to know about this technique is that you should never have to use it. If you find invalid UTF-8 byte sequences in your UTF-8 encoded text, it means one of two things: it's not really UTF-8, or it's been corrupted. Instead of "cleaning" it, you should find out how it got dirty and fix that problem.
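
If the goal is to find out where bad input comes from, one option is to detect it rather than strip it silently. The sketch below is an assumption about how that could look, not something from the original script: the helper name is_valid_utf8 is hypothetical, and it relies on preg_match() returning false when the /u modifier is used on a malformed subject.

```php
<?php
// Hypothetical helper: reports whether $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false on malformed input;
// the empty pattern matches any well-formed string, returning 1.
function is_valid_utf8(string $text): bool {
    return preg_match('//u', $text) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "é" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): "é" as a Latin-1 byte, not UTF-8
var_dump(is_valid_utf8("\xC0\x80"));    // bool(false): overlong encoding
```

The last input would pass through the regex in the question unchanged, since \xC0 followed by \x80 satisfies its second alternative. A check like this could be logged at the point where text enters the script, to help trace which source is producing the bad bytes.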
