开发者

Why am I getting a new session ID on every page fetch in my Perl WWW::Mechanize script?

开发者 https://www.devze.com 2022-12-28 13:54 出处:网络
So I\'m scraping a site that I have access to via HTTPS, I can login and start the process but each time I hit a new page (URL) the cookie Session Id changes. How do I keep the logged in Cookie Sessio

So I'm scraping a site that I have access to via HTTPS, I can login and start the process but each time I hit a new page (URL) the cookie Session Id changes. How do I keep the logged in Cookie Session Id?

#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP::Debug qw(+);
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Request::Common;

my $un = 'username';
my $pw = 'password';

my $url = 'https://subdomain.url.com/index.do';

my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$agent->{onerror}=\&WWW::Mechanize::_warn;
$agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3');
$agent->get($url);

$agent->form_name('form');
$agent->field(username => $un);
$agent->field(password => $pw);
$agent->click("Log In");

print "After Login Cookie: ";
print $agent->cookie_jar->as_string();
print "\n\n";

my $searchURL='https://subdomain.url开发者_Python百科.com/search.do';
$agent->get($searchURL);    

print "After Search Cookie: ";
print $agent->cookie_jar->as_string();
print "\n";

The output:

After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomina.url.com; path_spec; secure; discard; version=0

After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.com.com; path_spec; secure; discard; version=0

Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it?

$ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request...

Also for the CERT In using the first option in this list, is this correct?

X.509 Certificate (PEM)
X.509 Certificate with chain (PEM)
X.509 Certificate (DER)
X.509 Certificate (PKCS#7)
X.509 Certificate with chain (PKCS#7)


When your user-agent isn't doing something you think it should be doing, compare it's requests with that of an interactive browser. A Firefox plugin are handy for this sort of thing.

You're probably missing part of the process that the server expects. You probably aren't logging in or interacting correctly, and that could be for all sorts of reasons. For instance, there might be JavaScript on the page that WWW::Mechanize isn't handling.

When you can pinpoint what an interactive browser is doing that you are not, you'll know where you need to improve your script.

In your script, you can also watch what is happening by turning on debugging in LWP, which Mech is built on:

 use LWP::Debug qw(+); 

rjh already answered the certificate part of your question.


If your session cookie changes every page load, then likely you are not logging in correctly. But you could try forcing the JSESSIONID to be the same for each request. Construct your own cookie jar and tell WWW::Mechanize to use it:

my $cookie_jar = HTTP::Cookies->new(file => 'cookies', autosave => 1, ignore_discard => 1);
my $agent = WWW::Mechanize->new(cookie_jar => $cookie_jar, autocheck => 0);

The ignore_discard => 1 means that even session cookies are saved to disk (normally they are discarded for security reasons).

Then, after logging in, call:

$cookie_jar->save;

Then, after each request:

$cookie_jar->revert;  # re-loads the save

Alternately, you could sub-class HTTP::Cookies and override the set_cookie method to reject re-setting the session cookie if it already exists.


Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it?

Some browsers (Internet Explorer for example) prompt for a security certificate even if one is not needed. If you are not getting any errors and the response content looks good, you probably don't need to set one.

If you do have a certificate file, check the POD for Crypt::SSLeay. Your certificate is PEM0-encoded so yes, you want to set $ENV{HTTPS_CERT_FILE} to the path of your cert. You might want to set $ENV{HTTPS_DEBUG} = 1 to see what's happening.


Setup the cookie jar, something akin to this:

my $cookie = HTTP::Cookies->new(file => 'cookie',autosave => 1,);
my $mech = WWW::Mechanize->new(cookie_jar => $cookie, ....);
0

精彩评论

暂无评论...
验证码 换一张
取 消