How to fix a BBcode regular expression

开发者 https://www.devze.com 2023-03-27 11:12 Source: Internet
I have a regular expression that grabs BBcode tags. It works great except for a minor glitch.

Here is the current expression:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

Here is some text it successfully matches against and the groups it builds:

[url=http://www.google.com]Go to google![/url]

1: url

2: http://www.google.com

3: Go to google!

[img]http://www.somesite.com/someimage.jpg[/img]

1: img

2: NULL

3: http://www.somesite.com/someimage.jpg

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]

1: quote

2: NULL

3: [quote]first nested quote[/quote][quote]second nested quote[/quote]

All of this is great. I can handle nested tags by running the 3rd match group against the same regex and recursively handle all tags that are nested. The problem is with the example using the [quote] tags. Notice that the 3rd match group is a set of two quote tags, so we would expect two matches. However, we get one match, like this:

[quote]first nested quote[/quote][quote]second nested quote[/quote]

1: quote

2: NULL

3: first nested quote[/quote][quote]second nested quote
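
For anyone who wants to reproduce this, here is a minimal sketch in Python (my choice of language, the question does not name an engine). `GREEDY` and `matches` are names made up for the sketch. Note that Python reports the empty 2nd group as `''` rather than NULL.

```python
import re

# The expression from the question, unchanged.
GREEDY = re.compile(r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]")

def matches(text):
    """Return (tag, value, content) for each non-overlapping match."""
    return [m.group(1, 2, 3) for m in GREEDY.finditer(text)]

siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"

# The greedy (.+) runs to the end of the input and backtracks only as far
# as the LAST [/quote], so the two sibling tags come back as one match.
for tag, value, content in matches(siblings):
    print(tag, repr(value), content)
```

The greedy `(.+)` is what makes the outer nested case work and the sibling case fail: it always pairs the first opening tag with the last matching closing tag.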

Ahhhh! That's not what we wanted at all. There is a fairly simple way to fix it. I modify the regex from this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](.+)\[/\1\]

To this:

\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]

By adding ((?!\[/\1\]).) we check, before consuming each character, that the closing BBcode tag does not start there, so the 3rd match group can never contain that closing tag. Now this works and we get two matches:

[quote]first nested quote[/quote][quote]second nested quote[/quote]

[quote]first nested quote[/quote]

1: quote

2: NULL

3: first nested quote

[quote]second nested quote[/quote]

1: quote

2: NULL

3: second nested quote
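
To see what the added fragment does in isolation, here is a small Python sketch. It assumes the backreference `\1` is replaced by the literal `quote` so the fragment can run on its own; `TEMPERED_DOT` and `GREEDY_DOT` are illustrative names.

```python
import re

# The tempered-dot fragment alone: consume one character at a time, but only
# while the closing tag does not start at the current position.
TEMPERED_DOT = re.compile(r"(?:(?!\[/quote\]).)+")

# The original greedy version, for comparison.
GREEDY_DOT = re.compile(r"(.+)\[/quote\]")

text = "first nested quote[/quote][quote]second nested quote[/quote]"

print(TEMPERED_DOT.match(text).group(0))  # stops before the first [/quote]
print(GREEDY_DOT.match(text).group(1))    # runs up to the last [/quote]
```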

I was happy that fixed it, but now we have another problem. This new regex fails on the first one where we nest the two quote tags under one larger quote tag. We get two matches instead of one:

[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]

[quote][quote]first nested quote[/quote]

1: quote

2: NULL

3: [quote]first nested quote

[quote]second nested quote[/quote]

1: quote

2: NULL

3: second nested quote

The first match is all wrong and the second match, while well-formed, is not a desired match. We wanted one big match with the 3rd match group being the two nested quote tags, like when we used the first expression.
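
The trade-off can be reproduced with a short Python sketch (again my choice of language; `TEMPERED` and `contents` are made-up names). The modified expression gets the sibling case right and the nested case wrong, because a lookahead can only stop at the first closing tag and has no way to count how many tags are still open.

```python
import re

# The modified expression from the question, unchanged.
TEMPERED = re.compile(
    r"\[([^=\[\]]+)[=\x22']*([^ \[\]]*)['\x22]*\](((?!\[/\1\]).)+)\[/\1\]"
)

def contents(text):
    """Return the 3rd match group of every non-overlapping match."""
    return [m.group(3) for m in TEMPERED.finditer(text)]

siblings = "[quote]first nested quote[/quote][quote]second nested quote[/quote]"
nested = "[quote]" + siblings + "[/quote]"

print(contents(siblings))  # two well-formed matches
print(contents(nested))    # two matches again, the first one unbalanced
```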

Any suggestions? If I can just cross this gap I should have a fairly powerful BBcode expression.


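Using balancing groups (a .NET regex feature) you can construct a regex like this. It is written in free-spacing form, so it needs RegexOptions.IgnorePatternWhitespace: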

(?>
  \[ (?<tag>[^][/=\s]+) \s*
  (?: = \s* (?<val>[^][]*) \s*)?
  ]
)

(?<content>
  (?>
    \[(?<innertag>[^][/=\s]+)[^][]*]
    |
    \[/(?<-innertag>\k<innertag>)]
    |
    [^][]+
  )*
  (?(innertag)(?!))
)

\[/\k<tag>]

Simplified according to Kobi's example.


In the following:

[foo=bar]baz[/foo]
[b]foo[/b]
[i][i][foo=bar]baz[/foo]foo[/i][/i]
[i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
[quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]

It finds these matches:

  • [foo=bar]baz[/foo]
  • [b]foo[/b]
  • [i][i][foo=bar]baz[/foo]foo[/i][/i]
  • [i][i][i][i]foo[/i][/i][/i][i][i]foo[/i][/i][/i]
  • [quote][quote][b][img]foo[/img][b]bold[/b][b][b]deep[/b][/b][/b][/quote]bar[quote]baz[/quote][/quote]
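
Balancing groups are specific to .NET, so the pattern above will not compile in engines such as Python's `re`. As a rough sketch of the same depth-tracking idea done by hand (a swapped-in technique, not the answer's regex), the Python below tokenizes the tags and keeps an explicit stack. `TOKEN` and `top_level_tags` are hypothetical names, and the handling of a stray closing tag is a simplifying assumption.

```python
import re

# One BBcode token: an opening tag "[name]" or "[name=value]",
# or a closing tag "[/name]".
TOKEN = re.compile(r"\[(/?)([^\]\[/=\s]+)(?:\s*=\s*([^\]\[]*))?\s*\]")

def top_level_tags(text):
    """Yield (tag, value, content) for each balanced top-level tag."""
    stack = []  # entries: (tag name, value, index where its content starts)
    for tok in TOKEN.finditer(text):
        closing, name, value = tok.groups()
        if not closing:
            stack.append((name, value, tok.end()))
        elif stack and stack[-1][0] == name:
            _, open_value, start = stack.pop()
            if not stack:  # depth is back to zero: a complete top-level tag
                yield name, open_value, text[start:tok.start()]
        else:
            # Stray or mismatched closing tag. Assumption for this sketch:
            # drop everything that is currently open.
            stack.clear()

nested = "[quote][quote]first nested quote[/quote][quote]second nested quote[/quote][/quote]"
for tag, value, content in top_level_tags(nested):
    print(tag, value, content)
```

Feeding each `content` back into `top_level_tags` gives the recursive handling of nested tags described in the question.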

Full example at http://ideone.com/uULOs

(Old version http://ideone.com/AXzxW)
