Perl Regex to extract URLs from HTML

Developer https://www.devze.com 2023-01-30 18:46 Source: Internet

This should be a simple regex but I can't seem to figure it out.

Can someone please provide a 1-liner to take any string of arbitrary HTML input and populate an array with all the Facebook URLs (matching http://www.facebook.com) that were in the HTML code?

I don't want to use any CPAN modules and would much prefer a simple regex 1-liner.

Thanks in advance for your help!


Obligatory link explaining why you shouldn't parse HTML using a regular expression.

That being said, try this for a quick and dirty solution:

my $html = '<a href="http://www.facebook.com/">A link!</a>';
my @links = $html =~ /<a[^>]*\shref=['"](https?:\/\/www\.facebook\.com[^"']*)["']/gis;
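
To show how that pattern behaves when the input contains several links, here is a minimal sketch using only core Perl. The sample HTML is made up for illustration, and the pattern is the same one as above, written with m{} so the slashes need no escaping. In list context, a /g match returns every capture from every match, which is what fills the array.

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML string would do.
my $html = <<'HTML';
<a href="http://www.facebook.com/perl">Perl</a>
<a class="x" href='https://www.facebook.com/groups/42'>Group</a>
<a href="http://example.com/">Not Facebook</a>
<A HREF="http://www.facebook.com/">Home</A>
HTML

# List context + /g: one captured URL per matching <a> tag.
# /i makes the tag and attribute names case-insensitive.
my @links = $html =~ m{<a[^>]*\shref=['"](https?://www\.facebook\.com[^"']*)["']}gi;

print "$_\n" for @links;
```

This prints the three Facebook URLs and skips the example.com link. It is still quick and dirty: unquoted href values are missed, and a host such as www.facebook.com.example.org would also match.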


See HTML::LinkExtor. There is no point wasting your life energy (nor ours) trying to use regular expressions for these types of tasks.

You can read the documentation for a Perl module installed on your computer by using the perldoc utility. For example, perldoc HTML::LinkExtor. Usually, module documentation begins with an example of how to use the module.

Here is a slightly more modern adaptation of one of the examples in the documentation:

#!/usr/bin/env perl

use v5.20;
use warnings;

use feature 'signatures';
no warnings 'experimental::signatures';

use autouse Carp => qw( croak );

use HTML::LinkExtor qw();
use HTTP::Tiny qw();
use URI qw();

run( $ARGV[0] );

sub run ( $url ) {
    my @images;

    my $parser = HTML::LinkExtor->new(
        sub ( $tag, %attr ) {
            return unless $tag eq 'img';
            push @images, { %attr };
            return;
        }
    );

    my $response = HTTP::Tiny->new->get( $url, {
            data_callback => sub { $parser->parse($_[0]) }
        }
    );

    unless ( $response->{success} ) {
        croak sprintf('%d: %s', $response->{status}, $response->{reason});
    }

    my $base = $response->{url};

    for my $image ( @images ) {
        say URI->new_abs( $image->{src}, $base )->as_string;
    }
}

Output:

$ perl t.pl https://www.perl.com/
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://www.perl.com/images/site/perl-camel.png
https://www.perl.com/images/site/perl-onion_20.png
https://www.perl.com/images/site/twitter_20.png
https://www.perl.com/images/site/rss_20.png
https://www.perl.com/images/site/github_light_20.png
https://i.creativecommons.org/l/by-nc/3.0/88x31.png
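
The script above collects image URLs from a fetched page. For the original question (Facebook links from an HTML string already in memory), a minimal sketch along the same lines might look like the following. The sample HTML and the host pattern are assumptions for illustration, not part of the module's documentation.

```perl
use strict;
use warnings;

use HTML::LinkExtor qw();

# Hypothetical sample input standing in for the arbitrary HTML string.
my $html = <<'HTML';
<a href="http://www.facebook.com/perl">Perl</a>
<a href="http://example.com/">Other</a>
<a href='https://www.facebook.com'>Home</a>
<img src="http://www.facebook.com/logo.png">
HTML

my @facebook;

# The callback receives the lowercased tag name followed by
# attribute => URL pairs for each link-bearing tag.
my $parser = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attr ) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        # Assumed filter: host must be exactly www.facebook.com.
        push @facebook, $attr{href}
            if $attr{href} =~ m{\Ahttps?://www\.facebook\.com(?:[/?#]|\z)}i;
        return;
    }
);

$parser->parse($html);
$parser->eof;    # flush anything the parser is still buffering

print "$_\n" for @facebook;
```

With this input, the two anchor links to www.facebook.com end up in @facebook; the example.com link and the img tag are skipped.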


Russell C, have you seen the beginning of the Facebook movie, where Mark Zuckerberg uses Perl to automatically extract all the photos from a college facebook (and then posts them online)? I was like, "That's how I'd do it! I'd use Perl too!" (except it would probably take me a few days to work out, not 2 minutes). Anyway, I'd use the module WWW::Mechanize to extract links (or photos):

use strict;
use warnings;
use WWW::Mechanize;

my $url  = "http://www.facebook.com";
my $mech = WWW::Mechanize->new();
$mech->get($url);

# Write one link URL per line to out.txt
open( my $out, '>', 'out.txt' ) or die "Cannot open out.txt: $!";
print {$out} $_->url, "\n" for $mech->links;
close($out);

However, this won't log you in to your Facebook page; it will just take you to the login screen. I'd use HTTP::Cookies to log in. For that, see the documentation. Only joking, just ask. Oh god, the apple strudel is burning!


Maybe this can help you:

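# A list-context match with /g collects every URL, not just the first;
# the character class stops at whitespace, quotes, and angle brackets.
my @urls = $input =~ m{(http://www\.facebook\.com/[^\s"'<>]+)}g;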