I have a regular expression that grabs BBcode tags. It works great except for a minor glitch.
Here is the current expression:
\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]
Here is some text it successfully matches against and the groups it builds:
[url=http://www.google.com]Go to google![/url]
1: url 2: http://www.google.com 3: Go to google!
[img]http://www.somesite.com/someimage.jpg[/img]
1: img 2: NULL 3: http://www.somesite.com/someimage.jpg
[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
1: quote 2: NULL 3: [quote]first nested quote[/quote][quote]second nested quote[/quote]
All of this is great. I can handle nested tags by running the 3rd match group against the same regex and recursively handle all tags that are nested. The problem is with the example using the [quote] tags. Notice that the 3rd match group is a set of two quote tags, so we would expect two matches. However, we get one match, like this:
[quote]first nested quote[/quote][quote]second nested quote[/quote]
1: quote 2: NULL 3: first nested quote[/quote][quote]second nested quote
Ahhhh! That's not what we wanted at all. There is a fairly simple way to fix it: I modify the regex from this:
\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]
To this:
\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]
By adding ((?!\[/\1\]).) we stop the 3rd match group from running past a closing BBcode tag, so each match ends at the first closing tag it meets. So now this works, and we get two matches:
[quote]first nested quote[/quote][quote]second nested quote[/quote]
[quote]first nested quote[/quote]
1: quote 2: NULL 3: first nested quote
[quote]second nested quote[/quote]
1: quote 2: NULL 3: second nested quote
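To make the difference concrete, here is a minimal sketch that runs both expressions over the two sibling quote tags. It uses Python's re module purely for illustration (the question does not say which regex engine is in use), and the helper name contents is made up for this example.

```python
import re

# The two expressions from the question, copied unchanged.
GREEDY = re.compile(r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]""")
TEMPERED = re.compile(
    r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"""
)


def contents(pattern, text):
    """Return group 3 (the tag body) of every match of pattern in text."""
    return [m.group(3) for m in pattern.finditer(text)]


siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"

# Greedy .+ runs to the last closing tag, so both quotes collapse into one match.
greedy_bodies = contents(GREEDY, siblings)

# The negative lookahead stops group 3 at the first closing tag.
tempered_bodies = contents(TEMPERED, siblings)

print(greedy_bodies)
print(tempered_bodies)
```

The greedy pattern returns a single body that still contains [/quote][quote], while the lookahead version returns the two bodies separately.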
I was happy that fixed it, but now we have another problem. This new regex fails on the first one where we nest the two quote tags under one larger quote tag. We get two matches instead of one:
[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]
[quote][quote]first nested quote[/quote]
1: quote 2: NULL 3: [quote]first nested quote
[quote]second nested quote[/quote]
1: quote 2: NULL 3: second nested quote
The first match is all wrong and the second match, while well-formed, is not a desired match. We wanted one big match with the 3rd match group being the two nested quote tags, like when we used the first expression.
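The trade-off can be reproduced with a short sketch, again assuming Python's re module only as a convenient test bed. Neither expression counts how deep the nesting is: the greedy one always runs to the last closing tag, and the lookahead one always stops at the first.

```python
import re

# The two expressions from the question, copied unchanged.
GREEDY = re.compile(r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]""")
TEMPERED = re.compile(
    r"""\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"""
)

nested = (
    "[quote][quote]first nested quote[/quote]"
    "[quote]second nested quote[/quote][/quote]"
)

# Full text of every match each expression finds in the nested input.
greedy_matches = [m.group(0) for m in GREEDY.finditer(nested)]
tempered_matches = [m.group(0) for m in TEMPERED.finditer(nested)]

print(greedy_matches)    # one match: the whole input
print(tempered_matches)  # two matches; the first pairs the outer tag with an inner close
```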
Any suggestions? If I can just cross this gap I should have a fairly powerful BBcode expression.
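Using balancing groups (a .NET regex feature) you can construct a regex like this. It is written with free-spacing, so it needs RegexOptions.IgnorePatternWhitespace: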
(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
]
)
(?<content>
(?>
\[(?<innertag>[^][/=\s]+)[^][]*]
|
\[/(?<-innertag>\k<innertag>)]
|
[^][]+
)*
(?(innertag)(?!))
)
\[/\k<tag>]
Simplified according to Kobi's example.
In the following:
[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]
It finds these matches:
[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]
Full example at http://ideone.com/uULOs
(Old version http://ideone.com/AXzxW)
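If your engine has no balancing groups (Python's re module, for instance, does not), the same counting idea can be approximated with an explicit stack. The following is a rough sketch of that alternative, not a port of the regex above: the names TAG and top_level_tags are invented for this example, and clearing the stack on a mismatched closing tag is just one possible policy.

```python
import re

# One token: either a closing tag "[/name]" or an opening tag "[name]" / "[name=value]".
TAG = re.compile(
    r"\[(?:/(?P<close>[^\]\[/=\s]+)"
    r"|(?P<open>[^\]\[/=\s]+)\s*(?:=\s*(?P<val>[^\]\[]*))?)\]"
)


def top_level_tags(text):
    """Return (name, value, content) for each balanced outermost tag in text."""
    results = []
    stack = []  # entries: (name, value, index where the tag's content starts)
    for m in TAG.finditer(text):
        if m.group("open"):
            stack.append((m.group("open"), m.group("val"), m.end()))
        elif stack and stack[-1][0] == m.group("close"):
            name, value, start = stack.pop()
            if not stack:  # we just closed an outermost tag
                results.append((name, value, text[start:m.start()]))
        else:
            stack.clear()  # mismatched closing tag: drop the tags opened so far
    return results


nested = (
    "[quote][quote]first nested quote[/quote]"
    "[quote]second nested quote[/quote][/quote]"
)
outer = top_level_tags(nested)
inner = top_level_tags(outer[0][2])  # recurse into the content, as in the question
print(outer)
print(inner)
```

On the nested quote example this gives one outer quote whose content is the two inner tags, and feeding that content back in gives the two inner quotes, which is the recursive handling the question describes.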