How can I check if an HTML document includes script tags that are not empty using a regular expression?

Developer https://www.devze.com 2023-01-02 10:10 Source: Internet
I am trying to check if an HTML document includes script tags that are not empty using regular expressions. The regular expression should match any script tag with content other than whitespace or line breaks.

I have tried

<script\b[^>]*>[^.+$]</script>

but this regex only finds script tags with one space.
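For what it's worth, the reason is that `[^.+$]` is a negated character class: inside the brackets `.`, `+` and `$` are plain literals, so the class matches exactly one character that is none of those three. A minimal sketch with `java.util.regex` that shows this (the class name `CharClassDemo` and the helper `found` are made up for illustration):

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // The pattern from the question, unchanged.
    static final Pattern ATTEMPT =
            Pattern.compile("<script\\b[^>]*>[^.+$]</script>");

    // True if the pattern is found anywhere in the given HTML.
    static boolean found(String html) {
        return ATTEMPT.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(found("<script> </script>"));      // true: one space
        System.out.println(found("<script>x</script>"));      // true: one ordinary character
        System.out.println(found("<script>.</script>"));      // false: '.' is excluded by the class
        System.out.println(found("<script>var a;</script>")); // false: more than one character
    }
}
```

So the pattern accepts any script tag whose body is a single character other than `.`, `+` or `$`, which is why a one-space script happens to match.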


Don't parse HTML with regexen! Seriously, it's literally impossible in the general case. Why do you want to use a regex here? It would make much more sense to use an HTML parser, though I can't give you any particular suggestions because I don't know what language you're using. If you're using the JavaScript DOM, for instance, you would want something like the following:

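// Collect every <script> element whose content is more than whitespace.
var scripts     = document.getElementsByTagName('script')
var numScripts  = scripts.length
var textScripts = []
for (var i = 0; i < numScripts; ++i)
  // trim() so that scripts holding only spaces or line breaks count as empty
  if (scripts[i].text.trim() !== '') textScripts.push(scripts[i])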

This looks at the structure of the HTML to determine the properties of the script tags, rather than at the messy text.


Edit 1: Apparently, you're using Java. Unfortunately, I don't know anything about parsing HTML in Java, so I can't give you any recommendations; however, look into that, because it's the way to go.


Regex is not the right tool for this. Use an HTML parser. I can recommend Jsoup for this.

Here's a kickoff example:

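import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch and parse the page, giving up after 3000 ms.
URL url = new URL("http://stackoverflow.com/questions/2993515");
Document document = Jsoup.parse(url, 3000);

// data() returns the raw script content; trim() skips whitespace-only scripts.
Elements scripts = document.select("script");
for (Element script : scripts) {
    String data = script.data();
    if (!data.trim().isEmpty()) {
        System.out.println(data);
    }
}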

Jsoup is the least verbose of all HTML parsers; it offers a nice API with support for jQuery-like selectors.


Although you can match script tags containing only whitespace or line breaks, you cannot reliably match script tags containing more than that, because the content of the tag may itself contain script tags, and any regex you come up with would sometimes match the closing tag too early and sometimes too late.

You would need to recognise a variant of the language of properly nested brackets, which is impossible with regular expressions, because the language is not a regular language.

The problem is further complicated by the possibility of comments containing script tags.
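To make these failure modes concrete, here is a small sketch in Java. The pattern is a hypothetical "best effort" attempt of my own, not something from the question: it asks for a script tag whose body holds at least one non-whitespace character. The class name `NaiveScriptRegex` and the helper `firstMatch` are likewise made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveScriptRegex {
    // Hypothetical attempt: opening tag, optional whitespace, one non-whitespace
    // character, then lazily anything up to the next closing tag.
    static final Pattern NON_EMPTY = Pattern.compile(
            "<script\\b[^>]*>\\s*\\S[\\s\\S]*?</script>",
            Pattern.CASE_INSENSITIVE);

    // Returns the first matched substring, or null if nothing matches.
    static String firstMatch(String html) {
        Matcher m = NON_EMPTY.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A single whitespace-only script is correctly rejected.
        System.out.println(firstMatch("<script> </script>"));

        // Two empty scripts: the match runs from the first opening tag to the
        // second closing tag, so the closing tag is found too late.
        System.out.println(firstMatch("<script> </script><p>x</p><script></script>"));

        // A script tag inside an HTML comment is reported as if it were real.
        System.out.println(firstMatch("<!-- <script>alert(1)</script> -->"));
    }
}
```

The second and third calls both return a match even though neither document contains a live, non-empty script element. Patching the pattern for one of these cases tends to open another, which is what the argument above predicts.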


You should not use a regular expression to parse HTML.


Use TagSoup or another Java DOM parser to find this out.

Under no circumstances use regular expressions to parse HTML.
