Parse HTML Page For Links With Regex Using Perl [duplicate]

Developer (https://www.devze.com), 2022-12-11 04:03. Source: web

This question already has answers here: Closed 13 years ago.

Possible Duplicate:
How can I remove external links from HTML using Perl?

Alright, I'm working on a job for a client who just switched his language choice to Perl. I'm not the best at Perl, but I've done things like this with it before, albeit a while ago.

There are lots of links like this:

<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
        (1992)</a>

I want to match the path "/en/subtitles/3586224/death-becomes-her-en" in each link and put the matches into an array or list (I'm not sure which one is better in Perl). I've been searching the Perl docs and looking at regex tutorials, and most if not all of them seem geared towards using =~ to test for a match rather than to capture the matched text.

Thanks,

Cody

Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.

Or, consider the following simple example:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

my @hrefs;

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}

print "$_\n" for @hrefs;

__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath 
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" 
class="bnone">Death Becomes Her
                (1992)</a>

Output:

/en/subtitles/3586224/death-becomes-her-en

Don't use regexes. Use an HTML parser like HTML::TreeBuilder.

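use HTML::TreeBuilder;

# $file_name is assumed to hold the path to the HTML file
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;

# Collect the href of every <a> element, skipping anchors without one
my @links = grep { defined } map { $_->attr('href') } $tree->look_down( _tag => 'a' );

# Free the tree once the links have been extracted
$tree = $tree->delete;

# Do stuff with links array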

URLs like the one in your example can be matched with a regular expression like

($url) = /href=\"([^\"]+)\"/i

If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
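If you do go the regex route, note that the one-liner above only captures the first match in $_. To collect every matching path into an array, you can loop with the /g modifier. The following is a minimal sketch rather than a robust solution: extract_hrefs is a hypothetical helper name, the second anchor in the sample input is made up for illustration, and the pattern assumes the href value is wrapped in double or single quotes (unquoted values are still missed).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every quoted href value in $html that
# starts with $prefix. Assumes well-behaved input; this is not a parser.
sub extract_hrefs {
    my ($html, $prefix) = @_;
    my @hrefs;

    # (["']) remembers the opening quote and \1 requires the same one to
    # close the value. /g resumes after the previous match on each pass,
    # /i ignores case, /s lets a value span a line break.
    while ( $html =~ /\shref\s*=\s*(["'])(.*?)\1/gis ) {
        my $href = $2;
        push @hrefs, $href if index($href, $prefix) == 0;
    }
    return @hrefs;
}

# First anchor is from the question; the second is invented sample data.
my $html = <<'HTML';
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
        (1992)</a>
<a href='/en/search'>Search</a>
HTML

my @hrefs = extract_hrefs($html, '/en/subtitles/');
print "$_\n" for @hrefs;
```

The path inside the onclick attribute is not picked up because the pattern requires whitespace directly before href. If the markup gets any messier than this, the parser-based answers above are the safer choice.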