I'm getting a too many redirects redirect error from URLConnection when trying to fetch www.palringo.com
URL url = new URL("http://www.palringo.com/");
HttpURLConnection.setFollowRedirects(true);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
System.out.println("Response code = " + connection.getResponseCode());
outputs the dreaded:
Exception in thread "main" java.net.ProtocolException: Server redirected too many times (20)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
According to wget there is just one redirect, from www.palringo.com
to www.palringo.com/en/gb/
Any ideas why my request using URLConnection for /en/gb
results in another 302 response for the same resource ?
The problem is exemplified by:
URL url = new URL("http://www.palringo.com/en/gb/");
HttpURLConnection.setFollowRedirects(false);
HttpURLConnection开发者_开发百科 connection = (HttpURLConnection) url.openConnection();
// Just for testing, use Chrome header, to eliminate "anti-crawler" response!
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30");
System.out.println("Response code = " + connection.getResponseCode());
This outputs:
Response code = 302
Redirected to /en/gb/
hence an infinite redirect loop.
Interestingly although browsers and wget handle it, curl does not:
joel@bohr:/tmp$ curl http://www.palringo.com/en/gb/
curl: (7) couldn't connect to host
A request for /en/gb/
is redirected to /en/gb/
precisely once.
The problem is that your HttpURLConnection
(or whatever code you use -- sorry, I'm NOT familiar with Java) does not use cookies.
Disable cookies in browser and observe exactly the same behaviour -- infinite redirect.
The reason: Server checks if cookie is set. If not set -- it sets it and redirects. Because cookies are not supported/disabled, script on server side redirects over and over again.
Solution: Enable/add cookie support to your code and try again.
I think that redirect is defined with pattern like /* -> /en/gb So, when you arrive to /en/gb the redirect rule works again.
Check your redirect rules. Where are they defined? In apache web server or in other place? Check all. Verify that this is (or is not) a case and fix the rules accordingly.
The problem is on the server side. It might be a broken Apache httpd rewrite rule that is sending redirects that loop back to the same place. It might be something else. Whatever it is, you are unlikely to be able to fix it on the client side.
I'm basically running a crawler and just noticed this issue.
Ah.
It is possible that it is an anti-crawler defence measure. "Hmmm ... looks like one of those pesky crawlers who ignore my robots.txt file, waste all of my bandwidth and steal my precious content. Lets cause him some pain with a redirect loop!!".
Check that your crawler is obeying the "robots.txt" protocol. Check the ToS for the site you are crawling to see if what you are doing is allowed.
You could be right, but if so how come wget and browsers handle this with just the one redirect?
Maybe because the server is looking at the request headers, or at your pattern of requests.
The Terms of Service (that I see) say this:
"You agree to not use the Service to: ... xiii - Run any automated systems, processes, scripts or bots for any purpose without the express written permission of Palringo."
Arguably, crawling their site is in violation of that.
You will also get this error if you're trying to connect to a service that requires authentication and you provide wrong username and password.
精彩评论