Regex to remove HTML-head-tag

Developer https://www.devze.com 2023-02-22 16:42 Source: web
How can I remove the entire head tag from an HTML file with NSRegularExpression? Can someone give me a regex?

Thanks in advance, Ph99Ph


There is none! In the Chomsky hierarchy, HTML is a type-2 (context-free) language, so it cannot be parsed with a regular expression, which only recognizes type-3 (regular) languages.

See this wiki article in case of doubt.

Lots of people use regex for parsing or editing HTML. This works quite well in simple cases but is very error-prone.

This being said: You should have fairly reliable results with this regex:

<head>.+?</head>

This requires "." to also match line breaks (with NSRegularExpression, that is the NSRegularExpressionDotMatchesLineSeparators option). If it doesn't, then use this:

<head>(?:.|\n|\r)+?</head>
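
As a rough illustration of the line-break point, here is a small sketch in Python rather than Objective-C. The sample markup is made up for the example, and `re.DOTALL` plays the role that NSRegularExpressionDotMatchesLineSeparators plays in NSRegularExpression.

```python
import re

# Hypothetical sample document; the <head> element spans several lines.
html = """<html>
<head>
  <title>Demo</title>
</head>
<body><p>Hello</p></body>
</html>"""

# Without DOTALL, "." stops at line breaks, so the pattern finds nothing.
no_dotall = re.sub(r"<head>.+?</head>", "", html)

# With DOTALL, "." also matches "\n" and "\r", so the whole element is removed.
with_dotall = re.sub(r"<head>.+?</head>", "", html, flags=re.DOTALL)

# The explicit alternation needs no flag and gives the same result.
with_alternation = re.sub(r"<head>(?:.|\n|\r)+?</head>", "", html)

print(with_dotall)
```

The first call leaves the input untouched, which is the usual symptom when the head tag "is not removed" on a multi-line file.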

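Again: This is error-prone, don't do it.

What you should use is an XML parser such as NSXMLParser. Bear in mind that NSXMLParser expects well-formed XML, so this only works when the HTML is valid XHTML.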
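
To give an idea of what the parser-based route looks like, here is a minimal sketch that uses Python's standard `html.parser` as a stand-in for NSXMLParser. It is a different library with a different API, and the names `HeadStripper` and `strip_head` are invented for this example. The idea is the same event-driven one: track whether the parser is inside the head element and re-emit everything else.

```python
from html.parser import HTMLParser


class HeadStripper(HTMLParser):
    """Re-emits the markup it is fed, skipping everything inside <head>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []          # pieces of markup to keep
        self.in_head = False   # True while inside <head>...</head>

    def keep(self, text):
        if not self.in_head:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        else:
            self.keep(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.keep(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        else:
            self.keep(f"</{tag}>")

    def handle_data(self, data):
        self.keep(data)

    def handle_comment(self, data):
        self.keep(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.keep(f"<!{decl}>")

    def handle_entityref(self, name):
        self.keep(f"&{name};")

    def handle_charref(self, name):
        self.keep(f"&#{name};")


def strip_head(html):
    parser = HeadStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical input: attributes on <head> and a "</head>" inside a script string.
sample = ('<!DOCTYPE html><html><head profile="x"><title>T</title>'
          '<script>var s = "</head>";</script></head>'
          '<body><p>Hi &amp; bye</p></body></html>')
print(strip_head(sample))
```

Because the parser knows that the script content is raw text, the `</head>` inside the string literal does not end the head element early, which is exactly the kind of case a plain regex gets wrong.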


Please see the accepted answer at "RegEx match open tags except XHTML self-contained tags", or any version of this exact same question posted each day since the beginning of Stack Overflow.

In short, you cannot reliably parse HTML with Regular Expressions. RegEx is simply not advanced enough because of the complexities of HTML.
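
Two concrete ways a simple head-removal regex can go wrong are sketched below in Python. The pattern and the sample inputs are illustrative assumptions, not taken from any real page.

```python
import re

# A typical "remove the head" pattern: lazy match, dot matches line breaks.
HEAD_RE = re.compile(r"<head>.+?</head>", re.DOTALL | re.IGNORECASE)

# Case 1: the opening tag carries an attribute, so "<head>" never matches.
with_attr = ('<html><head profile="p"><title>T</title></head>'
             '<body>B</body></html>')

# Case 2: the text "</head>" appears inside a script string, so the
# lazy match stops there and leaves the rest of the head behind.
with_script = ('<html><head><script>var s = "</head>";</script>'
               '<title>T</title></head><body>B</body></html>')

result_attr = HEAD_RE.sub("", with_attr)
result_script = HEAD_RE.sub("", with_script)

print(result_attr)
print(result_script)
```

In the first case the document comes back unchanged. In the second the output still contains the title and a stray closing head tag. Each case can be patched with a more elaborate pattern, but every patch tends to expose another one.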


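Use something like this (C#):

// Normalise the opening tag. Requiring whitespace before any attributes keeps <header> from matching.
result = System.Text.RegularExpressions.Regex.Replace(result,
         @"<\s*head(\s[^>]*)?>", "<head>",
         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Normalise the closing tag.
result = System.Text.RegularExpressions.Regex.Replace(result,
         @"<\s*/\s*head\s*>", "</head>",
         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove the element. Singleline lets "." match line breaks; ".*?" stops at the first </head>.
result = System.Text.RegularExpressions.Regex.Replace(result,
         "<head>.*?</head>", " ",
         System.Text.RegularExpressions.RegexOptions.IgnoreCase |
         System.Text.RegularExpressions.RegexOptions.Singleline);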