Regular expression to get html without comments

Developer https://www.devze.com 2022-12-11 09:42 Source: Internet
I need to extract some HTML from a webpage. The page contains comments, and I need to get the HTML that lies between those comments. I hope the example below helps. I need to do this in C#.

<!--get html from here-->
<div><p>some text in a tag</p></div>
<!--get html from here-->

I want it to return

<div><p>some text in a tag</p></div>

How would I do this?


What about finding the index of the first delimiter, then the index of the second delimiter, and "cropping" the string in between? That sounds much simpler and is probably just as effective.


If all the instances are similarly formatted, an expression like this

<!--.*?-->(.*?)<!--.*?-->

would retrieve everything between two comments. If your "get html from here" text in your comments is well defined, you could be more specific:

<!--get html from here-->(.*)<!--get html from here-->

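When you run the regex over the string, Groups[1] of the resulting Match contains the HTML between the comments. Because the HTML in the example spans several lines, pass RegexOptions.Singleline so that . also matches newlines.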
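
As a minimal sketch of the more specific expression in C#, assuming the marker text is exactly "get html from here": the class and method names below are hypothetical, and the capture group is made lazy, (.*?) rather than (.*), so that it stops at the first closing marker if the page contains more than one pair.

```csharp
using System.Text.RegularExpressions;

public static class CommentRegexSketch
{
    // Hypothetical helper: returns the text between two identical marker
    // comments, or null when the pattern does not match.
    public static string ExtractBetweenMarkers(string html)
    {
        // Singleline lets '.' match newlines; '(.*?)' is lazy, so the
        // capture ends at the first closing marker.
        Match m = Regex.Match(
            html,
            "<!--get html from here-->(.*?)<!--get html from here-->",
            RegexOptions.Singleline);

        // Groups[0] is the whole match; Groups[1] is the first capture group.
        // Trim() drops the line breaks that surround the captured HTML.
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

Calling ExtractBetweenMarkers on the example from the question should give back the div element without the surrounding line breaks.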


Regexes are not ideal for HTML. If you really do want to process the HTML in all its glory, consider HtmlAgilityPack, as discussed in the question "Looking for C# HTML parser".

The Simplest Thing That Could Possibly Work is:

string pageBuffer=...;
string wrapping = "<!--get html from here-->";
// Index of the first character after the opening marker
int firstHitIndex = pageBuffer.IndexOf(wrapping) + wrapping.Length;
// Take everything up to the start of the closing marker
return pageBuffer.Substring(firstHitIndex, pageBuffer.IndexOf(wrapping, firstHitIndex) - firstHitIndex);

(with error checking that both markers are present)
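
The error checking mentioned above could look roughly like this. It is only a sketch: the TryCrop name and its Try-style signature are assumptions made for illustration, not part of the original answer.

```csharp
using System;

public static class MarkerCropSketch
{
    // Hypothetical helper: returns false instead of throwing when either
    // marker is missing from the page.
    public static bool TryCrop(string pageBuffer, string wrapping, out string inner)
    {
        inner = null;

        int firstMarker = pageBuffer.IndexOf(wrapping, StringComparison.Ordinal);
        if (firstMarker < 0) return false;          // opening marker missing

        int start = firstMarker + wrapping.Length;  // first character after the opening marker
        int end = pageBuffer.IndexOf(wrapping, start, StringComparison.Ordinal);
        if (end < 0) return false;                  // closing marker missing

        inner = pageBuffer.Substring(start, end - start);
        return true;
    }
}
```

Checking both IndexOf results before calling Substring avoids the ArgumentOutOfRangeException that the unchecked version would throw on a page with fewer than two markers.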

Depending on your context, WatiN might be useful (not on a server, but on the client side, if you are doing something more involved that could benefit from full HTML parsing).


I ran into a similar requirement: stripping HTML comments. I was looking for a regular-expression-based solution that works out of the box with free-form comments containing any kind of character.

I tried the expression below, and it worked for single-line comments, multi-line comments, and comments containing Unicode characters and symbols.

<!--[\u0000-\u2C7F]*?-->
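
A short sketch of how this pattern might be applied in C# to strip comments; the class and method names are hypothetical. Note that the character class stops at U+2C7F, so a comment containing a character above that code point (most CJK text, for example) would not be matched.

```csharp
using System.Text.RegularExpressions;

public static class CommentStripSketch
{
    // Hypothetical helper: removes every <!-- ... --> comment that the
    // pattern matches and returns the remaining text.
    public static string StripComments(string html)
    {
        // The lazy quantifier *? ends each match at the nearest "-->".
        // The range \u0000-\u2C7F includes line breaks, so multi-line
        // comments match without RegexOptions.Singleline.
        return Regex.Replace(html, @"<!--[\u0000-\u2C7F]*?-->", string.Empty);
    }
}
```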