I'm having trouble with Google blocking my IPs when I query Google for content matches. I've got 300 private IPs, and a desktop app that performs a similar function has no trouble connecting to Google with the same IPs. Yet when I crank it up on my server using cURL, my IPs get temporarily blocked ("your machine may be sending automated queries, please try again in 30 secs"), so there must be a footprint somewhere.
Anyhow, here's my code:
function file_get_contents_curl($url, $use_proxy = true) {
    global $proxies;
    App::import('Vendor', 'proxies');
    $ch = curl_init();
    if ($use_proxy) {
        // Pick a random proxy from the pool for each request.
        $proxies = $this->shuffle_assoc($proxies);
        $proxy_ip = $proxies[array_rand($proxies, 1)]; // proxy IP here
        $proxy = $proxy_ip . ':60099';
        $loginpassw = 'myusername:mypassword'; // proxy login and password here
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP); // constant, not the string 'HTTP'
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)');
    }
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // Return the data instead of printing it to the browser.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
(Note: the original `$proxy = true` parameter was immediately overwritten by the proxy address string, so the `if ($proxy)` branch always ran; the parameter is renamed `$use_proxy` to fix that.)
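The shuffle_assoc() helper isn't shown above; a minimal sketch of what it presumably does (shuffling an associative array while preserving key/value pairs, since PHP's shuffle() discards keys):

```php
<?php
// Hypothetical reconstruction of the shuffle_assoc() helper:
// returns the same key => value pairs in a randomized order.
function shuffle_assoc($array) {
    $keys = array_keys($array);
    shuffle($keys); // randomize the key order
    $shuffled = array();
    foreach ($keys as $key) {
        $shuffled[$key] = $array[$key]; // rebuild preserving pairs
    }
    return $shuffled;
}
```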
And I've verified that the IP is being set and that I'm connecting through the proxy.
Anyone got any ideas?
I tried with SOCKS5, but it made no difference. The trouble with the Google API is that you only get 100 queries per day.
HTTP proxies as well as SOCKS proxies can be used; there is no difference when scraping Google results.
There are multiple possible reasons why you get detected:
1) Your proxies are of bad quality or shared (maybe without your knowledge)
2) Your proxies are all in one or two subnets / too sequential
3) You query Google too fast or too often
You should not query Google more than 20 times per hour from a single IP; that's just a rough value that works in practice without getting punished by the search engine.
So you should implement a delay based on your proxy count.
But if reason 1) or 2) applies, then even that won't help; you'll need another IP solution.
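The delay rule above can be sketched in a few lines. The 20-queries-per-IP-per-hour budget and the 300-proxy pool size come from this discussion; the function name is illustrative:

```php
<?php
// Rough per-request delay for a rotating proxy pool.
// Assumption: each IP can safely make ~20 queries per hour.
function delay_between_requests($proxy_count, $queries_per_ip_per_hour = 20) {
    // Whole-pool budget per hour, converted to seconds per request.
    return 3600.0 / ($queries_per_ip_per_hour * $proxy_count);
}

// With 300 proxies: 3600 / (20 * 300) = 0.6 seconds between requests.
$delay = delay_between_requests(300);
usleep((int)($delay * 1000000)); // wait before issuing the next query
```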
Check out the Google rank scraper (http://google-rank-checker.squabbel.com/); it's a free PHP project for scraping Google and includes proper delay routines you could use in your own code.
The caching functions might also prove useful, as you don't want to query Google more often than required.
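Caching can be as simple as a file per URL. A hypothetical sketch (the fetcher is passed in as a callable, e.g. your file_get_contents_curl(), so repeated lookups within the cache lifetime never hit Google at all):

```php
<?php
// Hypothetical file-based cache: reuse a stored result instead of
// re-querying Google when the same URL was fetched recently.
function cached_fetch($url, $fetcher, $max_age = 86400) {
    $cache_file = sys_get_temp_dir() . '/gcache_' . md5($url);
    if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $max_age) {
        return file_get_contents($cache_file); // cache hit: no Google query
    }
    $data = call_user_func($fetcher, $url);   // cache miss: fetch and store
    file_put_contents($cache_file, $data);
    return $data;
}
```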
And not to forget:
If you get detected, make your script STOP automatically!
Carrying on just causes more trouble; detection means you did something wrong.
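A hypothetical detection check: Google's block page contains phrases like the one quoted in the question, so the response body can be tested before continuing (the phrases below are assumptions based on that quote):

```php
<?php
// Heuristic check for Google's "blocked" interstitial page so the
// scraper can halt instead of hammering on. Phrases are assumptions.
function looks_blocked($html) {
    return stripos($html, 'sending automated queries') !== false
        || stripos($html, 'unusual traffic') !== false;
}
```

Usage would be along the lines of: `if (looks_blocked($data)) { exit("Detected by Google - stopping.\n"); }` right after curl_exec() returns.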
HTTP proxies don't guarantee your privacy. You may try using SOCKS proxies instead.
But you'd be better off using the Google API.