How to extract string between 2 markers using Regex in .NET?

Developer https://www.devze.com 2023-04-06 07:59 Source: web
I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.

I've tried the following with no success:

var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");

It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.

What am I missing?


I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.

The latest version even supports LINQ, so you can get your content like this:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;


Regex is not meant for this kind of HTML handling, as many here would say. Without your sample web page / HTML, all I can suggest is removing the non-greedy ? quantifier from (.*?) and trying again. After all, an HTML page will have only one head and one body.
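To illustrate what switching from lazy to greedy changes, here is a minimal sketch. The sample document is hypothetical (it hides a second </body></html> inside a comment), and the GreedyVsLazy class and Extract helper are names made up for this example, not anything from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical input: the closing marker also appears inside a comment.
    public const string Sample =
        "<html><head></head><body>one<!-- </body></html> -->two</body></html>";

    // Builds the question's pattern (angle brackets unescaped) with either
    // a greedy or a lazy quantifier and returns the matched text.
    public static string Extract(string html, bool greedy)
    {
        string middle = greedy ? "(.*)" : "(.*?)";
        string pattern = "(?<=</head><body>)" + middle + "(?=</body></html>)";
        return Regex.Match(html, pattern, RegexOptions.Singleline).Value;
    }

    public static void Main()
    {
        // Lazy stops at the FIRST closing marker, here the one in the comment.
        Console.WriteLine("[" + Extract(Sample, greedy: false) + "]");
        // Greedy runs to the LAST closing marker in the input.
        Console.WriteLine("[" + Extract(Sample, greedy: true) + "]");
    }
}
```

If the real page contains the closing marker only once, both variants return the same text, so this change only helps when an earlier copy of the marker exists somewhere in the body.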


Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:

  1. Un-escape the angle brackets. With the @ before your string, the backslashes are passed through to the regex engine unchanged, and angle brackets do not need to be escaped in a .NET regex.
  2. With your regex, the head/body tag combinations cannot have any white-space between them.
  3. With your regex, the body tag cannot have any attributes.

I would suggest something more like:

(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)

This seems to work for me on the source of this page!
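As a rough sketch of how the pattern above might be applied: in .NET, `.` does not match a newline unless RegexOptions.Singleline is set, so on a multi-line page the option matters. Adding Singleline and IgnoreCase is an assumption of this example rather than part of the suggestion above, and the BodyRegexDemo class, TryExtractBody helper, and sample page are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyRegexDemo
{
    // The pattern suggested above, unchanged.
    public const string Pattern =
        @"(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)";

    // Hypothetical multi-line page with white-space and a body attribute.
    public const string Sample =
        "<html>\n<head><title>t</title></head>\n<body class=\"main\">\n<p>Hello</p>\n</body>\n</html>";

    // Returns true and the body text when the pattern matches.
    public static bool TryExtractBody(string html, RegexOptions options, out string body)
    {
        Match m = Regex.Match(html, Pattern, options);
        body = m.Success ? m.Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        // Without Singleline, '.' cannot cross the line breaks, so nothing matches.
        Console.WriteLine(TryExtractBody(Sample, RegexOptions.None, out _));

        // With Singleline, the lazy group can span lines up to the closing tags.
        var options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
        if (TryExtractBody(Sample, options, out string body))
            Console.WriteLine(body.Trim());
    }
}
```

The lookbehind here is variable-length, which the .NET engine accepts; many other regex flavors would reject it.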


As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.

First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
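A quick way to check this yourself is a small sketch like the following. The AngleBracketCheck class name and the test strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AngleBracketCheck
{
    // Thin wrapper so each case below reads as input, pattern.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        // Escaped and unescaped angle brackets both match the literal characters.
        Console.WriteLine(Matches("<body>", @"\<body\>")); // True
        Console.WriteLine(Matches("<body>", "<body>"));    // True

        // If \< and \> were word boundaries, this would match. In .NET it does not,
        // because the pattern still requires literal '<' and '>' characters.
        Console.WriteLine(Matches("body", @"\<body\>"));   // False

        // The .NET word-boundary assertion is \b.
        Console.WriteLine(Matches("body", @"\bbody\b"));   // True
    }
}
```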

That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.

As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?
