
\w doesn't match enough, what should I use instead?

Developer https://www.devze.com 2023-03-08 00:47 Source: Internet
(in PHP) I have the following string:

$string = '<!--:fr--><p>Mamá lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ut est et tortor sagittis auctor id ut urna. Etiam quañ justo, pharetra sed bibendum at, vulputate et augue.</p> <p>Curabitur cursus mi vel quam placerat malesuada. Fusce euismod mollis tincidunt. Sed cursus, sem et porta dictum, elit purus facilisis massa, eget consectetur nisi libero eget leo. Vivamus vitae mattis nulla. varius fermentum.</p><!--:-->'

I want to remove <!--:fr--> and <!--:--> using

preg_replace('/<!--:[a-z]{2}-->(\w+)<!--:-->/', '${1}', $string)

But it returns the same $string. What is the problem?


Your string contains characters that fall outside of [a-zA-Z0-9_] (which is what \w matches), such as spaces, punctuation, and the < and > of the tags. You can match them with [\s\S], which means any whitespace or non-whitespace character (i.e. everything).

You could also use . with the s flag, which makes the dot match newlines as well.

Try this...

preg_replace('/<!--:[a-z]{2}-->([\s\S]+?)<!--:-->/', '${1}', $string);
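As a rough sketch of the difference, the snippet below runs the original \w+ pattern, the [\s\S]+? pattern, and the . with s flag variant side by side. The input is a shortened stand-in for the string in the question (an assumption made for brevity), and the variable names are only illustrative.

```php
<?php
// Shortened stand-in for the string in the question (an assumption for brevity).
$string = '<!--:fr--><p>Mamá lorem ipsum, quañ justo.</p> <p>Vivamus vitae.</p><!--:-->';

// Original attempt: \w+ cannot cross "<", spaces or punctuation, so the
// pattern never matches and preg_replace returns the input unchanged.
$original = preg_replace('/<!--:[a-z]{2}-->(\w+)<!--:-->/', '${1}', $string);

// [\s\S]+? lazily matches any character, including newlines.
$anyChar = preg_replace('/<!--:[a-z]{2}-->([\s\S]+?)<!--:-->/', '${1}', $string);

// Same idea using "." together with the s (dot-all) modifier.
$dotAll = preg_replace('/<!--:[a-z]{2}-->(.+?)<!--:-->/s', '${1}', $string);

var_dump($original === $string); // was the original pattern a no-op?
var_dump($anyChar);              // content with the markers stripped
var_dump($anyChar === $dotAll);  // do both fixes agree?
```

The lazy quantifier (+?) matters if the string contains several marker pairs: a greedy + would run to the last <!--:--> instead of the nearest one.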



The other possibility is that you just remove the part you don't want.

preg_replace('/<!--:(?:[a-z]{2})?-->/', '', $string);

This matches only the unwanted parts: in <!--:(?:[a-z]{2})?-->, the group (?:[a-z]{2})? is an optional pair of lowercase letters, so the pattern matches both the opening and the closing marker.
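A minimal sketch of this approach follows. The input with a second "en" block is hypothetical (the question only shows "fr"); it is there to show that any two-letter code is handled.

```php
<?php
// Hypothetical input with two language blocks; only "fr" appears in the question.
$string = '<!--:fr--><p>Mamá</p><!--:--><!--:en--><p>Mum</p><!--:-->';

// Strip every opening marker (<!--:xx-->) and every closing marker (<!--:-->).
$clean = preg_replace('/<!--:(?:[a-z]{2})?-->/', '', $string);

echo $clean, PHP_EOL;
```

Because nothing between the markers has to be matched, this version does not depend on what characters the content contains.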


To solve your problem, you only need a simple regex like <!--:(fr)?--> and PHP code like:

$string = preg_replace('/<!--:(fr)?-->/', '', $string);

To answer the question: \w is a very limited and not recommended shortcut. For example, it will not match the ñ in your input, nor will it match the comma. PHP has good support for Unicode: with the u modifier on a UTF-8 string, the shortcut \p{L} matches any letter from any language. There are also shortcuts for punctuation and other categories, and they can be combined in a character class. For example, to match one or more letters (including French and Spanish letters), dots, or commas in any sequence, you can write:

[\p{L}.,]+

There is some information on how this works here:

  • http://www.regular-expressions.info/unicode.html
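To illustrate the character class above, here is a small sketch that applies it with the u modifier to a short sample. The sample text and variable names are assumptions made for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Short sample built from words in the question (an assumption for brevity).
$text = 'Mamá, quañ.';

// Plain \w+ without the u modifier: in the default "C" locale this is
// ASCII-only, so accented letters are not treated as word characters.
preg_match_all('/\w+/', $text, $ascii);

// With the u modifier, \p{L} matches any Unicode letter; the class also
// allows dots and commas, so only the space splits the matches.
preg_match_all('/[\p{L}.,]+/u', $text, $unicode);

print_r($ascii[0]);
print_r($unicode[0]);
```

Spaces are not part of the class, so each word (with its trailing punctuation) comes back as a separate match; adding \s to the class would extend the match across whitespace.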