What's the appropriate Perl or Java regex to extract only the second line below? It should find the div tag containing the class="matchthis" attribute.
<div>Do not match this</div>
<div class="matchthis">MATCH THIS</div>
<div class="unimportant">Do not match this</div>
Please do not tell me to use DOM/Soup/etc. I wonder if raw regex can solve the simple problem above (you'll be awarded for the answer!). Yes I'm aware of this post so don't even mention it.
As you already seem to know, using regular expressions to parse HTML is a bad idea.
In this specific case, I'm pretty sure all you really want is this:
<div class="lulz">(.*)<\/div>
Now, the more flexible you want to get, the more unreadable your regular expression will become. And this is the danger of trying to use regular expressions instead of a proper parser. For instance, say you want to allow for additional attributes besides class. A kind of functional regular expression for this might look like:
<div[^>]*class="[^\"]*lulz[^\"]*".*>(.*)<\/div>
Totally readable, right? (Also, almost certainly very wrong.)
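To make the "very wrong" part concrete, here is a small Java sketch (the question allowed Perl or Java). The class and method names are made up for illustration, and it assumes the whole input sits on one line. It shows two ways the pattern can go astray: the greedy .* runs past the first closing tag, and lulz also matches as a substring of another class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
class GreedyDivDemo {
    // The flexible pattern from above as a Java string literal.
    // "/" needs no escaping in Java, so <\/div> becomes </div>.
    static final Pattern GREEDY =
        Pattern.compile("<div[^>]*class=\"[^\"]*lulz[^\"]*\".*>(.*)</div>");

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String html) {
        Matcher m = GREEDY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A single div behaves as hoped.
        System.out.println(firstGroup("<div class=\"lulz\">MATCH THIS</div>"));
        // prints MATCH THIS

        // Two divs on one line: the greedy .* skips ahead to the last '>'
        // that still allows a match, so the text of the wrong div is captured.
        System.out.println(firstGroup(
            "<div class=\"lulz\">one</div><div class=\"other\">two</div>"));
        // prints two

        // "lulz" is matched as a substring, so an unrelated class gets through.
        System.out.println(firstGroup("<div class=\"notlulz\">x</div>"));
        // prints x
    }
}
```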
If there are no nested tags inside your <div>, you can use this:
/<div[^>]+class="matchthis"[^>]*>[^>]*<\/div>/
Otherwise, you need to know what is inside the <div>, or you need a different solution (as you know).
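For the Java side of the question, the same pattern could be tried roughly like this. It is only a sketch: the class name and helper method are invented for the example, and the input is the three sample lines from the question joined with newlines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
class MatchThisLine {
    // The pattern from above without the Perl delimiters.
    static final Pattern DIV =
        Pattern.compile("<div[^>]+class=\"matchthis\"[^>]*>[^>]*</div>");

    // Collects every complete match found in the input.
    static List<String> findAll(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = DIV.matcher(html);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<div>Do not match this</div>",
            "<div class=\"matchthis\">MATCH THIS</div>",
            "<div class=\"unimportant\">Do not match this</div>");
        System.out.println(findAll(html));
        // prints [<div class="matchthis">MATCH THIS</div>]

        // A nested tag inside the div defeats the pattern, as warned above.
        System.out.println(findAll("<div class=\"matchthis\"><b>x</b></div>"));
        // prints []
    }
}
```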
If you are interested only in the text between the tags, instead of the whole line, you could use lookarounds.
With this regex,
m{(?<=<div class="matchthis">)([^<]+)(?=</div>)}
you get the text between the tags in the $1 variable. Note that the second pair of parentheses is the only capturing group.
The first and last pairs of parentheses are positive lookarounds (a lookbehind and a lookahead); they do not capture text.
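A rough Java equivalent, in case Perl is not an option: Matcher.group(1) plays the role of $1. The class and method names below are placeholders for the example. Because the lookbehind spells out the opening tag literally, any extra attribute makes the match fail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
class LookaroundText {
    // Lookbehind for the exact opening tag, lookahead for the closing tag.
    // Only the middle group captures anything.
    static final Pattern TEXT =
        Pattern.compile("(?<=<div class=\"matchthis\">)([^<]+)(?=</div>)");

    // Returns the text between the tags, or null if there is no match.
    static String textInside(String html) {
        Matcher m = TEXT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
            "<div>Do not match this</div>",
            "<div class=\"matchthis\">MATCH THIS</div>",
            "<div class=\"unimportant\">Do not match this</div>");
        System.out.println(textInside(html));
        // prints MATCH THIS

        // An extra attribute breaks the literal lookbehind.
        System.out.println(textInside("<div id=\"a\" class=\"matchthis\">x</div>"));
        // prints null
    }
}
```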
Anyway, others have already given advice: don't (ab)use regexes on HTML.