Perl Question with UserAgent Get Website on Loop

Developer https://www.devze.com 2023-02-03 17:48 Source: Internet
I'm able to grab the first image fine, but then the content seems to be looping inside itself. Not sure what I'm doing wrong.

#!/usr/bin/perl
use LWP::Simple;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
for(my $id=1;$id<55;$id++)
{
    my $response = $ua->get("http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID=" . $id );
    my $content = $response->content;    
    for(my $id2=1;$id2<10;$id2++)
    {
        $content =~ /<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/(.*)$id2\.gif" alt=""\/>/;
        $url = "http://www.gamereplays.org/community/style_medals/" . $1 . $id2 . ".gif";
        print "--\n\r";
        print "ID: ".$id."\n\r";
        print "ID2: ".$id2."\n\r";
        print "URL: ".$url."\n\r";
        print "1: ".$1."\n\r";
        print "--\n\r";
        getstore($url, $1 . $id2 . ".gif");
    }
}


As others have stated, this is really a job for an HTML parser such as HTML::Parser. Also, you should 'use strict;'. Keep use LWP::Simple, though: getstore() is exported by it.

You could change your regex to the following:

$content =~ m{http://www\.gamereplays\.org/community/style_medals/([\w\_]+)$id2\.gif}s;

But that won't match style_medals/comp_graphics_10.gif, because $id2 never goes past 9 - which may be what you want. I think something like the following would work better. My apologies for the style changes but I can't resist modifying for PBP (Perl Best Practices).

#!/usr/bin/perl

use strict;
use Carp;
use LWP::Simple qw(getstore);   # getstore() comes from LWP::Simple
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

# Fetch pages 1 to 54 ($id < 55).  Are we sure there won't be more pages?
# Perhaps consider running until a 404 is found.
for (my $id = 1; $id < 55; $id++) {

    # Get the page data
    my $response = $ua->get( 'http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID='.$id );

    # Check for failure and abort.  get() always returns a response object,
    # so is_success() is the only check needed.
    if (!$response->is_success) {
        croak 'Request failed! '.$response->status_line();
    }

    my $content = $response->content();

    # Each pass removes the first matching <img> tag from $content, so the
    # loop ends once no medal images are left
  CONTENT_LOOP:
    while ($content =~ s{<img src="(http://www\.gamereplays\.org/community/style_medals/([^\"]+))" }{}ms) {

        my $url   = $1;  # The entire url, no need to recreate the domain       
        my $file  = $2;  # Just the file name portion                           
        my ($id2) = $file =~ m{ _(\d+)\.gif \Z}xms; # extract id2 for debug     

        next CONTENT_LOOP if !defined $id2;         # Handle SOTW.gif file(s)   

        # Display stats about each id found                                     
        print "--\n";
        print "ID:  $id\n";
        print "ID2: $id2\n";
        print "URL: $url\n";
        print "1:   $file\n";
        print "--\n";

        # You might want to consider involving the $id in the filename as       
        # you could have the same filename on multiple pages                    
        getstore( $url, $file);
    }
}


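The problem is in your regular expression. (.*) is greedy, meaning it matches as many characters as it can between style_medals/ and $id2.gif. When $id2 is 1, this is fine, but when $id2 is 2, it'll match everything up until 2.gif, which includes the full string from 1.gif.

Try making (.*) non-greedy by adding the ? modifier: (.*?). This stops the capture at the first $id2.gif it reaches instead of the last. Note that the match still begins at the first <img> tag in the content, so it can still run across earlier tags; restricting the capture to characters that cannot appear inside the attribute, such as [^"]*, avoids that.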

Edit: Ideally you wouldn't be using a regular expression to parse HTML, instead using something like, say, HTML::Parser.
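
As a rough illustration of the parser route, here is a minimal sketch using the HTML::Parser module from CPAN (it is not part of core Perl, so it has to be installed). The markup in $html is made up for the example, including the style_images/logo.png tag, and the filter on /style_medals/ is an assumption about which images are wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample markup standing in for one medal page.
my $html = <<'HTML';
<div><img src="http://www.gamereplays.org/community/style_images/logo.png" alt="logo"/>
<img src="http://www.gamereplays.org/community/style_medals/foo_1.gif" alt=""/>
<img src="http://www.gamereplays.org/community/style_medals/foo_2.gif" alt=""/></div>
HTML

my @medal_urls;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        # Called for every start tag with the lowercased tag name
        # and a hash reference of its attributes.
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'img';
            my $src = $attr->{src};
            return unless defined $src;
            # Keep only .gif files directly under style_medals/.
            push @medal_urls, $src if $src =~ m{/style_medals/[^/]+\.gif\z};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @medal_urls;
```

In the real script, $html would be replaced by $response->content and each collected URL passed to getstore().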


I won't push on the HTML parsing module (though HTML::LinkExtor can be your friend here...) as I understand the problems that can come with HTML parsers: if the HTML isn't properly valid, they often choke, whereas a simple regex can do the trick on anything, no matter how broken, as long as you're looking for the right thing.

As has been stated above by CanSpice, (.*) is greedy. The non-greedy modifier will usually do what you want. However, another option is to let it be greedy, but make sure it doesn't grab anything past the quoted src attribute of the image tag:

/<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/([^"]*)$id2\.gif"[^>]*>/

Note: I also modified it to not care if there's an alt attribute. However, I'm not familiar with the site you're grabbing things from.
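
To compare the capture styles discussed in this thread, the sketch below runs a greedy, a non-greedy, and a negated-character-class pattern against made-up markup with two image tags on one line. The file names foo_1.gif and foo_2.gif and the helper capture() are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample markup: two medal images on a single line.
my $base = 'http://www.gamereplays.org/community/style_medals/';
my $html = qq{<img src="${base}foo_1.gif" alt=""/>}
         . qq{<img src="${base}foo_2.gif" alt=""/>};

my $id2 = 2;

# Hypothetical helper: return the first capture group, or undef on no match.
sub capture {
    my ($re) = @_;
    return $html =~ $re ? $1 : undef;
}

my $greedy  = capture(qr{<img src="\Q$base\E(.*)$id2\.gif"[^>]*>});
my $lazy    = capture(qr{<img src="\Q$base\E(.*?)$id2\.gif"[^>]*>});
my $bounded = capture(qr{<img src="\Q$base\E([^"]*)$id2\.gif"[^>]*>});

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "bounded: $bounded\n";
```

On this sample the greedy and non-greedy captures are identical and both include the whole first tag, because each match starts at the first <img> and has to reach 2.gif. The [^"]* version cannot cross the closing quote, so it fails on the first tag and captures only foo_ from the second.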

If it's generated code it should be fine unless they change something on a grand scale. But to guard against that, even without a proper HTML parser, you may want to write a mini-parser just for the image tags yourself. Extract the image tags into the keys of a hash (grab them with a regex like /<\s*(img\s+[^>]*)\s*>/); using a hash avoids dupes. Then, for each key in the hash, read everything inside quotes into separate storage and replace the quoted values with something that contains no whitespace (a placeholder, say). Next, split the tag on whitespace: element 0 is the tag name and the rest are attributes, which you split into name and value on the =, getting back the values you stored a moment ago (or treating the value as something like '0E0' when an attribute has none, which keeps it true but effectively valueless).

If it's handwritten code, however, you may be up against some nightmares because many people aren't consistent with their use of quotes on attributes, if they use them at all.
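
The mini-parser idea described above could be sketched roughly as follows. This is one possible reading of those steps, not a tested tool: the sample markup, the helper parse_tag(), and the NUL-delimited placeholder format are all invented for the example, and it assumes no whitespace around = and no > inside attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample markup with a duplicate tag, odd spacing, and a bare attribute.
my $html = <<'HTML';
<p>Medals</p>
<img src="http://www.gamereplays.org/community/style_medals/foo_1.gif" alt="first medal"/>
<IMG  src='http://www.gamereplays.org/community/style_medals/foo_2.gif' ismap >
<img src="http://www.gamereplays.org/community/style_medals/foo_1.gif" alt="first medal"/>
HTML

# Hypothetical helper: turn the inside of one tag into (name, \%attributes).
sub parse_tag {
    my ($tag) = @_;
    $tag =~ s{\s*/\s*\z}{};    # drop the trailing "/" of a self-closing tag

    # Stash quoted values and leave a whitespace-free placeholder behind.
    my @stash;
    $tag =~ s{(["'])(.*?)\1}{ push @stash, $2; "\x00" . $#stash . "\x00" }gse;

    # Element 0 is the tag name, the rest are name=value pairs.
    my ($name, @pairs) = split ' ', $tag;
    my %attr;
    for my $pair (@pairs) {
        my ($key, $value) = split /=/, $pair, 2;
        $value = '0E0' unless defined $value;       # bare attribute: true but valueless
        $value =~ s{\x00(\d+)\x00}{$stash[$1]}g;    # restore the stashed value
        $attr{ lc $key } = $value;
    }
    return (lc($name), \%attr);
}

# Collect the inside of every <img ...> tag as hash keys to drop duplicates.
my %tags;
$tags{$1} = 1 while $html =~ /<\s*(img\s+[^>]*)\s*>/gi;

# Index the parsed attributes by their src value.
my %attrs_by_src;
for my $tag (keys %tags) {
    my ($name, $attr) = parse_tag($tag);
    next unless defined $attr->{src};
    $attrs_by_src{ $attr->{src} } = $attr;
}

print "$_\n" for sort keys %attrs_by_src;
```

Stashing the quoted values first is what lets a plain split on whitespace work even when a value such as alt="first medal" contains a space.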
