Regexp word boundaries in non-ASCII situations

开发者 https://www.devze.com 2023-02-24 14:21 Source: the web
I have a regular expression in my PHP script like this:

/(\b$term|$term\b)(?!([^<]+)?>)/iu

This matches the word contained in $term, as long as there's a word boundary before or after it and it's not inside an HTML tag.

However, this doesn't work in non-ASCII cases, for example with Russian text. Is there a way to make it work?

I can get an almost equally good result with

/(\s$term|$term\s)(?!([^<]+)?>)/iu

but this is obviously more limited and since this regexp is about highlighting search terms, it has the problem of including the space in the highlight.

I've read this StackOverflow question about the problem, but it doesn't help: it doesn't work correctly. In that example the captures are the other way around (it captures the text outside the search term, when I need to capture the search term).

Any way to make this work? Thanks!


You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters?
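
A minimal sketch of that idea, assuming UTF-8 input and a PCRE build with Unicode property support. The function name highlight_term and the <mark> wrapper are hypothetical, chosen only for illustration. It keeps the "boundary before or after" behaviour of the original pattern, but replaces \b with explicit \p{L} lookarounds so the result does not depend on how \b treats non-ASCII letters:

```php
<?php
// Hypothetical helper (not from the original post): wraps occurrences of
// $term in <mark> tags when there is no letter directly before it OR no
// letter directly after it, and the match is not inside an HTML tag.
function highlight_term(string $html, string $term): string
{
    if ($term === '') {
        return $html; // nothing to highlight
    }

    // Escape regex metacharacters in the term, including the "/" delimiter.
    $quoted = preg_quote($term, '/');

    // (?<!\p{L})  no Unicode letter immediately before the term
    // (?!\p{L})   no Unicode letter immediately after the term
    // (?![^<>]*>) the next angle bracket is not ">", i.e. not inside a tag
    $pattern = '/(?:(?<!\p{L})' . $quoted . '|' . $quoted . '(?!\p{L}))(?![^<>]*>)/iu';

    $result = preg_replace($pattern, '<mark>$0</mark>', $html);

    // preg_replace returns null on error (for example invalid UTF-8 input);
    // fall back to the unmodified text in that case.
    return $result ?? $html;
}

echo highlight_term('Привет, мир!', 'мир'), "\n";
```

To require a boundary on both sides instead (whole-word matches only), put the lookbehind and the lookahead around a single copy of the term rather than using the alternation. If digits and underscores should also count as word characters, the class [\p{L}\p{N}_] can stand in for \p{L}.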


The \b is certainly defined to work perfectly well on Unicode, as is required by UTS#18. What are you saying it is not doing? What are the exact text strings involved?
