Extracting everything but tags from a web page without a parser - using scanner and regex?

Developer https://www.devze.com 2023-01-14 22:47 Source: web

I'm working with the Android SDK, which is Java minus some things.

I have a solution that pulls matches for two regex patterns out of web pages. The problem I'm having is that it also finds matches inside HTML tags. I tried jTidy, but it was just too slow on Android. I'm not sure why, but my Scanner regex match solution beats it many times over.

Currently, I grab the page source into an InputStream:

is = uconn.getInputStream();

and then match and extract like this:

Scanner scanner = new Scanner(is, "UTF-8");
String match = "";
while (match != null) {
    // A horizon of 0 means the search is not bounded
    match = scanner.findWithinHorizon(extractPattern, 0);
    if (match != null) {
        String matchit = scanner.match().group(grp);
        // ... use matchit ...
    }
}

It works very nicely and is fast.

My regex pattern is already kind of crazy: it's actually two patterns joined in an alternation, like this: (p1|p2)

Any ideas on how to add a "but not inside HTML tags" condition, or how to exclude HTML tags up front? If I can strip the HTML tags from my source, that will likely speed up my interface significantly, as I have a few other things I need to do with the raw data.

One thing you can do is add a lookahead for the closing angle bracket:

(p1|p2)(?![^<>]*+>)

The idea is, after you find a match you scan forward a bit; if you find a closing bracket without first seeing an opening bracket, the match must have occurred inside a tag, so reject it. But be aware that even in well-formed HTML there are many things that can mess you up, like SGML comments, CDATA sections, or even angle brackets in attribute values.
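To make the lookahead concrete, here is a minimal sketch. The alternation (foo|bar) is only a hypothetical stand-in for (p1|p2), and the class name, method name and sample markup are made up for illustration. It uses a plain Matcher on a String instead of a Scanner just to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical stand-in for (p1|p2), followed by the lookahead from the answer.
    static final Pattern OUTSIDE_TAGS =
            Pattern.compile("(foo|bar)(?![^<>]*+>)");

    // Returns every match that is not followed by a '>' before the next '<'.
    static List<String> findOutsideTags(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = OUTSIDE_TAGS.matcher(html);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // The "foo" inside the class attribute sits inside a tag and is rejected.
        System.out.println(findOutsideTags("<a class=\"foo\">bar</a> foo"));
        // prints [bar, foo]
    }
}
```

The foo in the attribute is dropped because the lookahead reaches a > before it sees any <; the other two matches have no such > ahead of them, so they are kept.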

Another approach would be to match the tags and ignore those matches:

((?:<[^<>]++>)++)|(p1|p2)

Then you test whether it was group #1 that matched:

MatchResult match = scanner.match();
if (match.start(1) != -1) {
    // keep searching
}
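Put together with the Scanner loop from the question, the second approach might look something like the sketch below. It assumes the tag group and the target group are joined with |, since that is what makes the group #1 test meaningful. Again (foo|bar) is a hypothetical stand-in for (p1|p2), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class SkipTagsDemo {
    // Group 1 swallows runs of tags; group 2 is a hypothetical stand-in for (p1|p2).
    static final Pattern TAGS_OR_TARGET =
            Pattern.compile("((?:<[^<>]++>)++)|(foo|bar)");

    static List<String> extract(String html) {
        List<String> hits = new ArrayList<>();
        Scanner scanner = new Scanner(html);
        // Horizon 0: search the rest of the input without a bound.
        while (scanner.findWithinHorizon(TAGS_OR_TARGET, 0) != null) {
            MatchResult match = scanner.match();
            if (match.start(1) != -1) {
                continue; // group 1 matched, so this was a tag: keep searching
            }
            hits.add(match.group(2));
        }
        scanner.close();
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(extract("<a class=\"foo\">bar</a> foo"));
        // prints [bar, foo]
    }
}
```

Because each tag is consumed whole by group 1, the search never gets a chance to start inside one, so the foo in the attribute is never reported.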

But again, as a general solution this is way too fragile, for the reasons I cited above. You should only use one of these solutions (or any regex solution) if you're sure it's compatible with the particular pages you're working on.

Why don't you use javax.xml.parsers to parse the HTML (that is, as XML)?
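If the page happens to be well-formed XML (XHTML, say), that suggestion could look roughly like this. An XML parser will reject ordinary tag-soup HTML, so treat it as a sketch under that assumption only; the class name, method name and sample markup are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlTextDemo {
    // Parses well-formed markup and returns only the text between the tags.
    static String textOf(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates all descendant text nodes,
            // so tag names and attribute values are left out.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                textOf("<html><body><p class=\"foo\">bar</p> foo</body></html>"));
        // prints bar foo
    }
}
```

The regex can then be run over the returned text, where attribute values such as class="foo" no longer appear.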