I am building a spider in Perl and have run into a problem:
The site I want to spider uses JavaScript for age verification, and I don't know how to get past this in Perl.
The script looks like this:
<script type = "text/javascript">
function set_age_verified(){
new Request({
method: "post",
url: "/user/set_age_verified"
}).send();
$('age_verification').setStyles({visibility: 'hidden', display: 'none'});
$('page_after_verification').setStyles({visibility: 'visible', display: 'block'});
return false;
}
</script>
And here is the onclick event:
<a href="#" onclick="return set_age_verified();"><img src="http://example.com/age-verification-enter.gif" alt="ENTER"></a>
The function has two effects. One is to POST a request to the URL "/user/set_age_verified" and the other is to alter the display visibility of some HTML.
Your spider can easily ignore the second effect, but presumably the first effect, by going to the server, sets some cookie or server variable which the server will require.
You do not have to actually run the JavaScript, so long as the server sees the same POST request.
The answer is for your Perl script to detect pages that contain this JavaScript, and to call a Perl function that sends the POST to the age verification URL.
Any cookie or similar that is returned will have to be recorded by you, although your HTTP library may take care of this for you.
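As a rough illustration of the approach above, here is a minimal sketch using LWP::UserAgent with an in-memory cookie jar. The helper names find_age_verification_url and fetch_verified are made up for this example, and the sketch assumes the server records the verification in a cookie and accepts an empty POST body, as the script in the question suggests. Neither assumption has been checked against the real site.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Hypothetical helper: return the URL that set_age_verified() POSTs to,
# or nothing if the page does not contain that function.
sub find_age_verification_url {
    my ($html) = @_;
    if ($html =~ /function\s+set_age_verified\s*\(\s*\)\s*\{.*?\burl\s*:\s*["']([^"']+)["']/s) {
        return $1;
    }
    return;
}

# Hypothetical helper: fetch a page, and if it carries the age check,
# send the same POST the JavaScript would send, then fetch the page again.
sub fetch_verified {
    my ($ua, $page_url) = @_;

    my $response = $ua->get($page_url);
    return $response unless $response->is_success;

    my $path = find_age_verification_url($response->decoded_content);
    if (defined $path) {
        # Resolve the relative path against the URL of the page.
        my $verify_url = URI->new_abs($path, $response->base);

        # Empty POST body, mirroring the Request({...}).send() call.
        my $verify = $ua->post($verify_url);
        warn "Age verification POST failed: ", $verify->status_line, "\n"
            unless $verify->is_success;

        # Assumption: the cookie now in the jar unlocks the page.
        $response = $ua->get($page_url);
    }
    return $response;
}

# Usage (performs real HTTP requests, so it is left commented out):
# my $ua = LWP::UserAgent->new(cookie_jar => {});   # in-memory cookie jar
# my $response = fetch_verified($ua, 'http://example.com/');
# print $response->decoded_content if $response->is_success;
```

Because the cookie jar lives on the user agent object, reusing the same $ua for the rest of the crawl should keep sending whatever cookie the server set, provided the server does track verification that way.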
What Perl modules are you using? WWW::Mechanize has an AJAX plugin, although it hasn't been updated in a while. I guess you could also look at something like WWW::Selenium.
But I bet that AJAX request is going to inject some HTML that requires the user to input some data, then submit a form. Pretty tricky to cover all bases for that general case...
Take a look at the WWW::Mechanize::Firefox module. It allows you to handle some JavaScript.
Also, in Firefox, HTTPHeaders is your best friend.
Turn it on, manually click whatever you need to in order for the JavaScript to run and submit to the server, then go back to the HTTPHeaders window. It will show you exactly what that JavaScript event sent to the server (GET or POST plus the data, even over HTTPS), as well as the server's response.