
Regex: How to exclude characters from a match?

Developer https://www.devze.com 2023-03-23 19:03 Source: Internet

I'm trying to parse the following string, similar to how google treats search operators:

type1:words in key1 type2:word in key2 type3:key3

To produce groups as key-value pairs, e.g.

type1 -> words in key1 
type2 -> word in key2 
type3 -> key3

This is what I've got so far, but the end of the match overlaps with the next pair, so I only get the first group.

([\w\^]+):(.*?) \w+: 

type1 -> words in key1 

I have a feeling this should be done with backreferences, but my attempts so far have failed. What's the right approach?


(\w+):([^:]*)(?=\s|$)

works on all your sample data.

(\w+)    # Match a keyword
:        # Match :
([^:]*)  # Match as many non-colon characters as possible
(?=      # Lookahead assertion: backtrack to
 \s      # the closest space
|        # or
 $       # don't backtrack at all if we're at the end of the string
)        # End of lookahead

Example Python program:

>>> import re
>>> r = re.compile(r"(\w+):([^:]*)(?=\s|$)")
>>> test = "type1:words in key1 type2:word in key2 type3:key3 type4:yet another key"
>>> for match in r.finditer(test):
...     print("{} -> {}".format(match.group(1), match.group(2)))
...
type1 -> words in key1
type2 -> word in key2
type3 -> key3
type4 -> yet another key
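
The commented breakdown above can also live in the code itself. This is a minimal sketch, assuming Python's re.VERBOSE flag and a hypothetical helper named parse_pairs that collects the pairs into a dict. It is one possible arrangement, not necessarily how the answer's author would structure it.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments
# can stay next to each part of the expression.
PAIR = re.compile(
    r"""
    (\w+)    # keyword
    :        # literal colon
    ([^:]*)  # value: as many non-colon characters as possible
    (?=      # lookahead: the value must be followed by
      \s     #   whitespace
      |      #   or
      $      #   the end of the string
    )
    """,
    re.VERBOSE,
)


def parse_pairs(text):
    """Return a dict mapping each keyword to its value (hypothetical helper)."""
    return {m.group(1): m.group(2) for m in PAIR.finditer(text)}


result = parse_pairs("type1:words in key1 type2:word in key2 type3:key3")
print(result)
```

Because the value is matched with [^:]*, this sketch assumes values never contain a colon themselves.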


To avoid eating the beginning of the next part, make the last \w+: part of your regex non-consuming. This is called lookahead:

(?=re) matches re via zero-width positive lookahead (without consuming it)

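So your regex should look like this, with the separating space placed inside the lookahead so that the last pair, which has no trailing space, still matches:

([\w\^]+):(.*?)(?=\s\w+:|$)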
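
A minimal Python sketch of the lookahead idea, assuming the separating whitespace is checked inside the lookahead so that the final pair (which has no trailing space) is also matched. The variable names are illustrative only.

```python
import re

# Lazy value, followed by a lookahead for whitespace + "key:" or end of string.
# Nothing inside the lookahead is consumed, so the next key stays available
# for the following match.
PAIR = re.compile(r"([\w\^]+):(.*?)(?=\s\w+:|$)")

text = "type1:words in key1 type2:word in key2 type3:key3"
pairs = [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]

for key, value in pairs:
    print("{} -> {}".format(key, value))
```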


It might be easier to split the input on the pattern

\s(?=\w+:\w)

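Or, if your regex engine supports right-to-left matching (for example .NET's RegexOptions.RightToLeft), you can evaluate from right to left and match the following, although this reverses the order of the matches: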

\w+:\w.*?
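
The split approach could be sketched in Python roughly as follows. It assumes each resulting chunk has the form key:value and separates the two at the first colon. Only the split pattern comes from the answer; the rest is illustrative.

```python
import re

text = "type1:words in key1 type2:word in key2 type3:key3"

# Split at any whitespace character that is immediately followed by
# "key:" plus a word character. The lookahead keeps the key in the next chunk.
chunks = re.split(r"\s(?=\w+:\w)", text)

# Each chunk looks like "key:value"; split once, on the first colon only.
pairs = [tuple(chunk.split(":", 1)) for chunk in chunks]

for key, value in pairs:
    print("{} -> {}".format(key, value))
```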


My attempt in PHP:

preg_match_all( '/([\w\^]+?):(.+?)\s?(?=\w+:|$)/', 'type1:words in key1 type2:word in key2 type3:key3', $matches );
var_dump( $matches );

Results:

array(3) {
  [0]=>
  array(3) {
    [0]=>
    string(20) "type1:words in key1 "
    [1]=>
    string(19) "type2:word in key2 "
    [2]=>
    string(10) "type3:key3"
  }
  [1]=>
  array(3) {
    [0]=>
    string(5) "type1"
    [1]=>
    string(5) "type2"
    [2]=>
    string(5) "type3"
  }
  [2]=>
  array(3) {
    [0]=>
    string(13) "words in key1"
    [1]=>
    string(12) "word in key2"
    [2]=>
    string(4) "key3"
  }
}
