
Perl alarm working intermittently


I am currently working on a project that involves crawling certain websites. However, sometimes my Perl program gets "stuck" on a website for a reason I can't figure out, and the program freezes for hours. To get around this I added code to time out the subroutine that crawls the webpage. The problem is that, say I set the alarm to 60 seconds, most of the time the page times out correctly, but occasionally the program does not time out and just sits for hours on end (maybe forever, since I usually kill the program).

On the really bad websites the Perl program just eats through my memory, taking 2.3 GB of RAM and 13 GB of swap. CPU usage is also high and my computer becomes sluggish. Luckily, if the alarm does time out, all the resources are released quickly.

Is this a problem with my code or with Perl? What should I correct, and why is this happening?

Thanks

Here is my code:

eval {
    # On SIGALRM, die() out of the eval so the crawl is abandoned
    local $SIG{ALRM} = sub { die("alarm\n") };

    alarm 60;            # start a 60-second timer
    parsePageFunction();
    alarm 0;             # cancel the timer if the parse finished in time
};

if ($@) {
    if ($@ eq "alarm\n") { print("Webpage Timed Out.\n\n"); }
    else                 { die($@ . "\n"); }
}


Depending on where exactly in the code it is getting stuck, you might be running into an issue with Perl's safe signals. With safe signals (the default since Perl 5.8), a signal handler only runs between Perl opcodes, so SIGALRM cannot interrupt a single long-running operation inside XS/C code, such as a blocking network read or a runaway regex match. See the perlipc documentation for workarounds (e.g. Perl::Unsafe::Signals).
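
As a minimal sketch of that workaround, assuming the same parsePageFunction() from the question, the alarm/eval block can be wrapped in UNSAFE_SIGNALS so the timer can fire even while control is inside C-level code:

use Perl::Unsafe::Signals;

# Deliver signals immediately ("unsafe", pre-5.8 style) inside this block,
# so SIGALRM can interrupt long-running XS/C code that safe signals cannot.
UNSAFE_SIGNALS {
    eval {
        local $SIG{ALRM} = sub { die("alarm\n") };
        alarm 60;
        parsePageFunction();   # the crawler routine from the question
        alarm 0;
    };
};

print("Webpage Timed Out.\n\n") if $@ && $@ eq "alarm\n";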


You may want to elaborate on the crawling process.

I'm guessing it's a recursive crawl, where for each crawled page, you crawl all links on it, and repeat crawling all links on all those pages too.

If that's the case, you may want to do two things:

  1. Create some sort of depth limit: on each recursion, increment a counter and stop crawling once the limit is reached (see the sketch after this list).

  2. Detect circular linking: if PAGE_A links to PAGE_B and PAGE_B links back to PAGE_A, you will keep crawling until you run out of memory.
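
As a minimal sketch of both ideas combined (a depth limit plus a %seen hash to break cycles), assuming a hypothetical crawl_page() that fetches a URL and returns the links found on it:

use strict;
use warnings;

my %seen;                  # URLs already visited, to break circular links
my $MAX_DEPTH = 5;         # hypothetical limit; tune for your site

sub crawl {
    my ($url, $depth) = @_;

    return if $depth > $MAX_DEPTH;    # 1. stop once the depth limit is reached
    return if $seen{$url}++;          # 2. skip URLs we have already crawled

    my @links = crawl_page($url);     # hypothetical: fetch page, return its links
    crawl($_, $depth + 1) for @links;
}

crawl('http://example.com/', 0);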

Other than that, you should look into the standard timeout facility of the module you're using; if that's LWP::UserAgent, you can pass LWP::UserAgent->new(timeout => 60).
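
For example, a minimal sketch with LWP::UserAgent; note that its timeout applies to each individual network read rather than to the whole request, so a very slow server can still take longer than 60 seconds overall, which is why keeping the alarm as a backstop is still reasonable:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 60);     # give up on a stalled read after 60 s

my $response = $ua->get('http://example.com/');  # hypothetical target URL
if ($response->is_success) {
    my $html = $response->decoded_content;
    # ... hand $html to the parsing code ...
}
else {
    print "Fetch failed: ", $response->status_line, "\n";
}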

