I have a web scraping application written in OO Perl. The app uses a single WWW::Mechanize object. How can I make it not fetch the same URL twice, i.e. make the second get() with the same URL a no-op:
my $mech = WWW::Mechanize->new();
my $url = 'http://google.com';
$mech->get( $url ); # first time, fetch
$mech->get( $url ); # same url, do nothing
See WWW::Mechanize::Cached:
Synopsis
use WWW::Mechanize::Cached;
my $cacher = WWW::Mechanize::Cached->new;
$cacher->get( $url );
Description
Uses the Cache::Cache hierarchy to implement a caching Mech. This lets one perform repeated requests without hammering a server impolitely.
You can store the URLs and their content in a hash.
my $mech = WWW::Mechanize->new();
my $url = 'http://google.com';
my %response;
$response{$url} = $mech->get($url) unless $response{$url};
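The same hash idea can be wrapped in a closure so the bookkeeping lives in one place. This is only a sketch: make_cached_fetcher and $get_once are hypothetical names, and the fetch routine here is a stand-in that touches no network; in the scraper it would presumably be something like sub { $mech->get( $_[0] ) }.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize): wraps any fetch
# routine so that each URL is passed to it at most once.
sub make_cached_fetcher {
    my ($fetch) = @_;
    my %response;    # URL => whatever $fetch returned for that URL
    return sub {
        my ($url) = @_;
        # exists() instead of a truth test, so a false result is not refetched
        $response{$url} = $fetch->($url) unless exists $response{$url};
        return $response{$url};
    };
}

# Stand-in for the real network call, counting how often it runs.
my $fetch_count = 0;
my $get_once = make_cached_fetcher(
    sub { $fetch_count++; return "content of $_[0]" }
);

$get_once->('http://google.com');    # first time, fetch
$get_once->('http://google.com');    # same URL, served from the hash
print "fetches performed: $fetch_count\n";
```

Note that with a real WWW::Mechanize object the cached value is the response returned by get(); the object's own current page still reflects whichever URL was fetched last.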
You can subclass WWW::Mechanize and redefine the get() method to do what you want:
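package MyMech;
use base 'WWW::Mechanize';
sub get {
    my $self = shift;
    my ($url) = @_;
    # Fetch if nothing has been fetched yet, or if the URL differs from the last request
    if (!defined $self->res || $self->res->request->uri ne $url) {
        return $self->SUPER::get(@_);
    }
    return $self->res;    # same URL as the last request: do nothing
}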