php + vim - बंगलौर (Bangalore) has a break before the last character र

Developer https://www.devze.com 2023-01-14 03:59 Source: Internet
I used http://translate.google.com/#en|hi|Bangalore to get the Hindi for Bangalore, which gave me बंगलौर.

But when I pasted it into vim, there was a break before the last character, र.

I am using preg_replace with the regex pattern /[^\p{L}\p{Nd}\p{Mn}_]/u to replace every character that is not part of a word. But this is treating the last character as a separate word.

This is my input string: मैनेजमेंट, बंगलौर. I am expecting the output to be मैनेजमेंट बंगलौर after the preg_replace:

$cleanedString = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

But the output I am getting is मैनेजमेंट बंगल र. What am I doing wrong here? I guess the problem starts with how vim handled the text I pasted.


Try this regex: "/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u"

The vowel sign ौ (U+094C, "au") in लौ takes up extra horizontal space, unlike the vowel sign ै (U+0948, "ai") in मै. The Unicode class \p{Mn} matches only non-spacing marks, so ौ falls outside your character class and gets replaced by a space. Use \p{Mc} to match spacing marks as well, or use \p{M} to match all combining marks: "/[^\p{L}\p{Nd}\p{M}_]/u"
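As a rough sketch of the difference between the two patterns, the snippet below runs both on the input string from the question. The function name clean_name and the whitespace-collapsing step are illustrative additions, not part of the original code, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: replace every character that is not a letter,
// decimal digit, combining mark or underscore with a space.
function clean_name(string $name, string $pattern): string
{
    $replaced = preg_replace($pattern, ' ', $name);
    // Collapse runs of whitespace left behind by ", " (an addition, not in the question).
    return trim(preg_replace('/\s+/u', ' ', $replaced));
}

$name = 'मैनेजमेंट, बंगलौर';

// Pattern from the question: only non-spacing marks (Mn) are kept.
$broken = clean_name($name, '/[^\p{L}\p{Nd}\p{Mn}_]/u');

// Pattern from the answer: \p{M} keeps Mn, Mc and Me marks.
$cleaned = clean_name($name, '/[^\p{L}\p{Nd}\p{M}_]/u');

echo $broken, "\n";
echo $cleaned, "\n";
```

With the \p{M} pattern the second word should come back intact, while the \p{Mn} pattern splits it at the spacing vowel sign.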

From regular-expressions.info/unicode:

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

  • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
  • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
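
To see which of these sub-categories each character of a word falls into, something like the following can help. It is only a sketch: mark_category is a hypothetical helper name, the file is assumed to be UTF-8, and mb_ord needs the mbstring extension on PHP 7.2 or later.

```php
<?php
// Hypothetical helper: report the mark sub-category of a single UTF-8 character.
function mark_category(string $char): string
{
    if (preg_match('/^\p{Mn}$/u', $char)) {
        return 'Mn'; // non-spacing mark
    }
    if (preg_match('/^\p{Mc}$/u', $char)) {
        return 'Mc'; // spacing combining mark
    }
    if (preg_match('/^\p{Me}$/u', $char)) {
        return 'Me'; // enclosing mark
    }
    return 'not a mark';
}

// Split the word into individual code points and print the category of each.
foreach (preg_split('//u', 'बंगलौर', -1, PREG_SPLIT_NO_EMPTY) as $char) {
    printf("U+%04X %s\n", mb_ord($char, 'UTF-8'), mark_category($char));
}
```

Any character reported as Mc is one that the original \p{Mn}-only pattern would have replaced with a space.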