开发者

Recursive regex for div tags(not trying to parse html with regex)

开发者 https://www.devze.com 2023-03-06 06:40 出处:网络
I have a bunch of wiki markup, sometimes people just throw random html down in the middle of wiki markup and somehow wikipedia just rolls with it, as it does for all kinds of other badly formed wiki m

I have a bunch of wiki markup, sometimes people just throw random html down in the middle of wiki markup and somehow wikipedia just rolls with it, as it does for all kinds of other badly formed wiki markup. I want to match everything inside the divs.

I need to recursively find all the <div>blah</div> tags including div tags with other div tags inside them. I am trying to match the div tags and everything inside of them. I have this which I believe almost works:

new Regex(@"\<div.*?\> (?<DEPTH>)                   # opening 
            (?>                # now match...
               [^(\<div.*?\>)(\<\/div\>)]+          # any characters except divs
            |                  # or
               \<div.*?\>  (?<DEPTH>)  # a opening div, increasing the depth counter
            |                  # or
               \<\/div\>  (?<-DEPTH>) # a closing div, decreasing the depth counter
            )*                 # any number of times
            (?(DEPTH)(?!))     # until the depth counter is zero again
          \<\/div\>                   # then match the closing fix",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

Maybe I should be using another methodology to parse this but at this point this is the final regex statement that I need.

Here is an example:

<div class="infobox sisterproject" style="font-size: 90%; padding: .5em 1em 1em 1em;">
<div style="text-align:center;">
Find more about '''{{{display|{{{1|{{PAGENAME}}}}}}}}''' on Wikipedia's [[Wikipedia:Wikimedia sister projects|sister projects]]:
</div><!--
-->{{#ifeq:{{{wikt}}}|no||<!--
-->[[File:Wiktionary-logo-en.svg|25px|link=wikt:Specia开发者_开发百科l:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Search Wiktionary]] [[wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Definitions]] from Wiktionary<br />}}<!--
-->{{#ifeq:{{{b}}}|no||<!--
-->[[File:Wikibooks-logo.svg|25px|link=b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Search Wikibooks]] [[b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Textbooks]] from Wikibooks<br />}}<!--
-->{{#ifeq:{{{q}}}|no||<!--
-->[[File:Wikiquote-logo.svg|25px|link=q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Search Wikiquote]] [[q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Quotations]] from Wikiquote<br />}}<!--
-->{{#ifeq:{{{s}}}|no||{{#ifeq:{{{author|no}}}|yes|<!--
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />|<!--
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />}}}}<!--
-->{{#ifeq:{{{commons}}}|no||<!--
-->[[File:Commons-logo.svg|25px|link=commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Search Commons]] [[commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Images and media]] from Commons<br />}}<!--
-->{{#ifeq:{{{n}}}|no||<!--
-->[[File:Wikinews-logo.svg|25px|link=n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|Search Wikinews]] [[n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|News stories]] from Wikinews<br />}}<!--
-->{{#ifeq:{{{v}}}|no||<!--
-->[[File:Wikiversity-logo-Snorky.svg|25px|link=v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Search Wikiversity]] [[v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Learning resources]] from Wikiversity<br />}}<!--
-->{{#ifeq:{{{species<includeonly>|no</includeonly>}}}|no||<!--
-->[[File:Wikispecies-logo.svg|25px|link=species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|Search Wikispecies]] [[species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}]] from Wikispecies}}
</div><noinclude>

Thanks


I think it is not a good idea to parse the html with regex you could use the Html Agility pack


 new Regex(@"<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>", RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

In the time it took me to fix my expression I would not even be half way done with getting html agility pack up and working.

0

精彩评论

暂无评论...
验证码 换一张
取 消