
How can I get the file extensions from relative links in HTML text using Perl?


For example, scanning the contents of an HTML page with a Perl regular expression, I want to match all file extensions but not TLDs in domain names. To do this I am making the assumption that all file extensions must be within double quotes.

I came up with the following and it is working; however, I am failing to figure out a way to exclude the TLDs in the domains, so it still returns "com", "net", etc.

m/"[^<>]+\.([0-9A-Za-z]*)"/g

Is it possible to negate the match when there is more than one period between the quotes with text between them (i.e., treat foo.bar.com as a domain name, but not ./ or ../)?

Edit: I am using $1 to get the value captured by the parentheses.
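
Here is roughly how I am calling it, trimmed down to a sketch ($html just stands in for the page source I already have in memory):

my $html = '<a href="../logo.png">x</a> <a href="http://example.com">y</a>';
while ( $html =~ m/"[^<>]+\.([0-9A-Za-z]*)"/g ) {
    print "$1\n";   # prints "png", but also "com" for the bare domain link
}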


#!/usr/bin/perl

use strict; use warnings;
use File::Basename;
use HTML::TokeParser::Simple;
use URI;

my $parser = HTML::TokeParser::Simple->new( \*DATA );

# walk every <a> tag and take the extension from the path part of its href
while ( my $tag = $parser->get_tag('a') ) {
    my $uri = URI->new( $tag->get_attr('href') );
    my $ext = ( fileparse $uri->path, qr/\.\w+\z/ )[2];
    print "$ext\n";
}

__DATA__
<p><a href="../test.png">link</a> <a
href="http://www.example.com/test.jpg">link on example.com</a>
</p>
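
With the two sample links in the DATA section this should print ".png" and ".jpg"; the "com" in example.com never shows up because only the path part of each URI is handed to fileparse.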


First of all, extract the names with an HTML parser of your choice. You should then have something like an array containing the names, as if produced like this:

my @names = ("http://foo.bar.net/quux",
             "boink.bak",
             "mms://three.two.one",
             "hello.jpeg");

The only way to distinguish domain names from file extensions seems to be that in "file names", there is at least one more slash between the :// part and the extension. Also, a file extension can only be the last thing in the string.

So, your regular expression would be something like this:

^(?:(?:\w+://)?(?:\w+\.)+\w+/)?[^:]*\.(\w+)$
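
As a quick sanity check, here is a small sketch run against the sample list from above. The [^:]* part keeps the fallback from reaching across a scheme, so a bare mms://three.two.one produces no match while the real file names do:

#!/usr/bin/perl
use strict;
use warnings;

# same sample values as above
my @names = ("http://foo.bar.net/quux",
             "boink.bak",
             "mms://three.two.one",
             "hello.jpeg");

for my $name (@names) {
    if ( $name =~ m{^(?:(?:\w+://)?(?:\w+\.)+\w+/)?[^:]*\.(\w+)$} ) {
        print "$name -> $1\n";   # boink.bak -> bak, hello.jpeg -> jpeg
    }
}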


#!/usr/bin/perl -w

use strict;

while (<>) {
    # grab every quoted value that follows one of the link attributes on this line
    # (the lookbehind's "ref=" also covers the tail end of "href=")
    while (m/(?<=(?:ref=|src=|rel=))"([^<>"]+?\.([0-9A-Za-z]+?))"/g) {
        # skip absolute URLs; only relative links should yield an extension
        if ($1 !~ /:\/\//) {
            print $2 . "\n";
        }
    }
}

Used a positive lookbehind to get only the quoted value behind one of the 'link' attributes (src=, rel=, href=; the lookbehind's ref= also matches the tail of href=). Fixed to look for "://" when recognizing URLs, and to allow files with absolute paths.
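
For example, on a line like <a href="../test.png">link</a> the lookbehind fires after href=, $1 is ../test.png (no "://", so it passes the check) and $2, png, is what gets printed; the absolute http://www.example.com/test.jpg link from the earlier sample is dropped by the "://" test.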

@Structure: There is no proper way to protect against someone leaving off the protocol part, since it then just turns into a legitimate path name: http://www.noo.com/afile.cfg -> www.noo.com/afile.cfg. You would need to wget (or something) all of the links to make sure they actually exist. And that's an entirely different question...

Yes, I know I should use a proper parser, but I'm just not feeling like it right now :P
