
Perl::Mechanize: running a simple crawler with a loop [multiple queries]


I am currently working out a way to parse the data of a page: http://www.foundationfinder.ch/

I would love to do it in Perl. Well, I am just musing about which is the best way to do the job. I guess I am in front of a nice learning curve ;) This task will give me some nice Perl lessons. At the moment it goes a bit over my head... ;-)

So here is what the result pages look like. Since I thought I could find all 790 result pages within a certain range between Id=0 and Id=100000, I figured that I could do it with a loop over URLs like these:

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html

I thought I could go the Perl way, but I am not very sure: I was trying to use LWP::UserAgent on the same URL [see below] with different query arguments, and I am wondering whether LWP::UserAgent provides a way to loop through the query arguments. I am not sure that LWP::UserAgent has a method for that. Well, I have sometimes heard that it is easier to use Mechanize. But is it really easier!?
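From what I have read so far, I picture the Mechanize variant roughly like this. This is just my untested sketch; the InterfaceLanguage value is a guess:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );   # do not die on missing Ids

for my $id ( 0 .. 100_000 ) {
    $mech->get("http://www.foundationfinder.ch/ShowDetails.php?Id=$id&InterfaceLanguage=1&Type=Html");
    next unless $mech->success;    # skip Ids that return nothing useful
    my $html = $mech->content;     # parse $html here
}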

BTW: if I were going the PHP way, I could do it with cURL, couldn't I!?

Here is my approach so far: I tried to figure it out and dug deeper into the man pages and HOWTOs. We can have a loop that constructs the URLs and calls cURL repeatedly.

As noted above, here we have some result pages:

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html

Alternatively, we can add a request_prepare handler that computes and adds the query arguments before we send out the request.
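For the handler idea I imagine something like the following. It is only a sketch pieced together from the LWP::UserAgent documentation, not tested against the site:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $current_id;

# request_prepare runs right before each request goes out;
# here it fills in the query arguments for the current Id
$ua->add_handler( request_prepare => sub {
    my ( $request, $ua, $handler ) = @_;
    $request->uri->query_form( Id => $current_id, InterfaceLanguage => 1, Type => 'Html' );
} );

for my $id ( 0 .. 100_000 ) {
    $current_id = $id;
    my $response = $ua->get('http://www.foundationfinder.ch/ShowDetails.php');
    # check $response->is_success and parse $response->decoded_content here
}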

Again, what I am aiming for: I want to parse the data and afterwards store it in a local MySQL database.
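Just to show what I mean by storing: something like this is what I picture for the MySQL part (the database, table and column names here are made up):

use strict;
use warnings;
use DBI;

# connection details and the table layout are only placeholders
my $dbh = DBI->connect( 'DBI:mysql:database=foundations;host=localhost',
                        'user', 'password', { RaiseError => 1 } );

my $insert = $dbh->prepare(
    'INSERT INTO foundation (id, name, address) VALUES (?, ?, ?)'
);

# these values would come from the parsed page; dummies for now
my ( $id, $name, $address ) = ( 11233, 'Some Foundation', 'Somewhere 1, Zurich' );
$insert->execute( $id, $name, $address );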

Should I define an extern_uid!?

and go like this:

use LWP::UserAgent;
my $ua = LWP::UserAgent->new;

for my $i (0 .. 10000) {
  my $reply = $ua->get("http://www.foundationfinder.ch/ShowDetails.php?Id=$i&InterfaceLanguage=1&Type=Html");
  # process $reply->decoded_content here
}

Well, but now I am stuck. I need help: can I do the job like this!?

regards

zero


Don't do it like this. Use Live HTTP Headers (a Firefox plugin) or an equivalent tool to see what the JavaScript does behind the scenes while you select what you need on the site, so you can get to the page with the table directly.

To get the data from the table, use HTML::TableExtract, or HTML::TreeBuilder::XPath if you want to use XPath.
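Roughly along these lines (untested; you will have to adapt it to the actual markup of the page):

use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TableExtract;

my $html = get('http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=1&Type=Html');

my $te = HTML::TableExtract->new;   # no headers given, so it keeps every table on the page
$te->parse($html);

for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # each row is an array ref of cell texts
        print join( ' | ', map { $_ // '' } @$row ), "\n";
    }
}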

If you do want to iterate over the queries, just create another var:

my $url = 'http://www.foundationfinder.ch/ShowDetails.php?Id=' . $q . '&InterfaceLanguage=&Type=Html';

and increment $q as you go, and make sure the page is valid before you try to process what get returns.
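Put together, something like this (only a sketch; the is_success check is the "valid" part):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

for my $q ( 0 .. 100_000 ) {
    my $url = 'http://www.foundationfinder.ch/ShowDetails.php?Id=' . $q . '&InterfaceLanguage=&Type=Html';
    my $response = $ua->get($url);
    next unless $response->is_success;    # only parse pages that actually exist
    # feed $response->decoded_content to HTML::TableExtract here
}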

