
Regex negation - word parsing

开发者 https://www.devze.com 2023-01-14 05:56 Source: Internet
I am trying to parse a phrase and exclude common words.

For instance, in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns". I tried:

(\w+(?!the|as))

Doesn't work. Feedback appreciated.


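The lookahead should come first, so that it is tested at the start of each word rather than after \w+ has already consumed the word: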

(\b(?!(the|as)\b)\w+\b)

I have also added word boundaries so that only whole words are matched. Without them, the pattern would correctly refuse to match the complete word "as", but it would then match the letter "s" of that word.
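
A quick way to see the difference is to run the three variants side by side. The following is a minimal sketch using Python's re module; the choice of Python and the variable names are assumptions for illustration, since the question does not say which regex engine is in use. Non-capturing groups are used so that re.findall returns plain strings.

```python
import re

phrase = "as the world turns"

# Original attempt: the lookahead runs after \w+ has consumed the word,
# so it only inspects the text that follows and excludes nothing here.
original = re.findall(r"\w+(?!the|as)", phrase)

# Lookahead first, but no word boundaries: a match can start inside a word.
no_boundaries = re.findall(r"(?!the|as)\w+", phrase)

# Lookahead first, with word boundaries: whole words other than "the"/"as".
fixed = re.findall(r"\b(?!(?:the|as)\b)\w+\b", phrase)

print(original)       # ['as', 'the', 'world', 'turns']
print(no_boundaries)  # ['s', 'he', 'world', 'turns']
print(fixed)          # ['world', 'turns']
```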

You might also want to consider what \w matches and whether that meets your needs. If you are looking for English words, you are probably interested in letters but not digits, and you may wish to include some punctuation characters that \w excludes, such as apostrophes. You could try something like this instead (Rubular):

/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
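
As a rough illustration of what the letters-only pattern buys you, here is the same expression in Python, where the trailing /i flag becomes re.IGNORECASE. The sample sentence and the name WORD_RE are made up for this sketch.

```python
import re

# Same pattern as above; the /i flag is expressed as re.IGNORECASE.
WORD_RE = re.compile(r"\b(?!(?:the|as)\b)[a-z'-]+\b", re.IGNORECASE)

sample = "As the world's well-known clock turns 24"

words = WORD_RE.findall(sample)

# Apostrophes and hyphens stay inside words; the number is skipped
# because digits are not in the character class.
print(words)  # ["world's", 'well-known', 'clock', 'turns']
```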

To match words more accurately in a human language, you could consider using a natural language parsing library instead of regular expressions.
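
If the list of common words grows, it may be easier to keep the regex simple and do the exclusion in ordinary code. The sketch below is a deliberately simpler stand-in, not a natural language library: it extracts candidate words and filters them against a set. The names STOP_WORDS and filter_words are hypothetical.

```python
import re

# Hypothetical stop-word set; extend as needed.
STOP_WORDS = {"the", "as"}

def filter_words(phrase):
    """Return the words in phrase that are not in STOP_WORDS (case-insensitive)."""
    candidates = re.findall(r"[A-Za-z'-]+", phrase)
    return [w for w in candidates if w.lower() not in STOP_WORDS]

print(filter_words("As the world turns"))  # ['world', 'turns']
```

Keeping the word list in a set means adding or removing a word does not require touching the pattern.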


Use word boundaries so that only whole words match. You can do this either with a look-ahead assertion:

(\b(?!(?:the|as)\b)\w+\b)

Or with a look-behind assertion:

(\b\w+\b(?<!\b(?:the|as)))
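
Whether the look-behind form works as written depends on the engine. Python's re module, for example, requires look-behind patterns to have a fixed width, so an alternation of "the" (three characters) and "as" (two) is rejected there. A minimal sketch of one workaround, assuming Python, is to split it into two fixed-width look-behinds; the variable names are illustrative only.

```python
import re

phrase = "as the world turns bathe"

# Variable-width alternation inside a look-behind is rejected by Python's re.
try:
    re.compile(r"\b\w+\b(?<!\b(?:the|as))")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Equivalent check using one fixed-width look-behind per excluded word.
pattern = re.compile(r"\b\w+\b(?<!\bthe)(?<!\bas)")
kept = pattern.findall(phrase)

print(variable_width_ok)  # False
print(kept)               # ['world', 'turns', 'bathe']
```

Note that "bathe" is kept even though it ends in "the", because the \b inside the look-behind requires "the" to be a whole word.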