I'm using this regex to find <script> tags:
<script (.|\n)*>(.|\n)*?</script>
The problem is, it matches the ENTIRE string below, not just each tag separately:
<script src="crap2.js"></script><script src="crap2.js"></script>
You really would be better off using the DOM to process HTML for this reason and all sorts of others.
Change your first * to *?
This is the non-greedy 'match all', so it will match the smallest set of characters before the next '>'.
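To see the difference on the string from the question, here is a minimal sketch in Python (the question doesn't say which regex engine is in use, so Python's `re` module is an assumption; the variable names are made up for illustration):

```python
import re

# The sample string from the question.
html = '<script src="crap2.js"></script><script src="crap2.js"></script>'

# Original pattern: the first quantifier is greedy, so it runs on to a later '>'.
greedy = re.compile(r'<script (.|\n)*>(.|\n)*?</script>')

# Same pattern with only the first * changed to *?.
lazy = re.compile(r'<script (.|\n)*?>(.|\n)*?</script>')

greedy_matches = [m.group(0) for m in greedy.finditer(html)]
lazy_matches = [m.group(0) for m in lazy.finditer(html)]

print(len(greedy_matches))  # one match covering the whole string
print(len(lazy_matches))    # one match per tag
```

With the greedy version the opening part swallows the first closing tag and the second opening tag before it settles on a '>', which is why everything comes back as a single match.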
I don't think anything else needs to be said other than RegEx match open tags except XHTML self-contained tags.
Also see this week's Coding Horror: Parsing Html The Cthulhu Way, inspired by the epic answer by @bobince that @JS Bangs links to.
I'll keep posting links to my previous answers until this question type has been wiped from this planet's surface (hopefully in 10 years or so): Don't use regular expressions for irregular languages like HTML or XML. Use a parser instead.
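For what "use a parser" could look like, here is a rough sketch with Python's standard-library `html.parser` (the language is an assumption, since the question doesn't name one; `ScriptCollector` and `find_scripts` are hypothetical names invented for this example):

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the attributes and inline body of each <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []     # one dict per completed <script> element
        self._current = None  # the script currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._current = {"attrs": dict(attrs), "body": ""}

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._current is not None:
            self._current["body"] += data

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.scripts.append(self._current)
            self._current = None


def find_scripts(html):
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


sample = '<script src="crap2.js"></script><script>if (a < b) { go(); }</script>'
scripts = find_scripts(sample)
for script in scripts:
    print(script["attrs"], repr(script["body"]))
```

The parser takes care of tag boundaries and attribute quoting, so there is no greedy/non-greedy question to get wrong in the first place.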
<script[\s\S]*?>[\s\S]*?</script>
This matches most common situations, but it's very important to consider JS Bangs's answer.
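A quick check of that pattern, sketched in Python (an assumption, the thread doesn't name a language), on a common case and on one case where it goes wrong. Both test strings are made up for illustration:

```python
import re

# The pattern from this answer, unchanged.
SCRIPT_RE = re.compile(r'<script[\s\S]*?>[\s\S]*?</script>')

# Common case: an external script followed by a multi-line inline script.
common = '<script src="a.js"></script>\n<script>\nrun();\n</script>'
common_matches = SCRIPT_RE.findall(common)

# Problem case: the tag sits inside an HTML comment, so it is not a live
# element, but the regex has no notion of comments and matches it anyway.
commented = '<!-- <script>old();</script> -->'
commented_matches = SCRIPT_RE.findall(commented)

print(len(common_matches))     # both tags found separately
print(len(commented_matches))  # the commented-out tag is found too
```

An HTML parser would treat the second string as a comment and report no script element, which is the kind of case the DOM advice is about.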
Try to exclude any '<' from the content (and any '>' from the opening tag):
<script [^>]*>[^<]*</script>