We have a tool that checks whether a given URL is live. If it is, another part of our software can screen-scrape the content from it.
This is my code for checking whether a URL is live:
public static bool IsLiveUrl(string url)
{
    HttpWebRequest webRequest = WebRequest.Create(url) as HttpWebRequest;
    webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
    webRequest.CookieContainer = new CookieContainer();
    WebResponse webResponse;
    try
    {
        webResponse = webRequest.GetResponse();
        webResponse.Close(); // release the connection; only reachability matters here
    }
    catch (WebException)
    {
        // Covers HTTP error statuses (e.g. 403, 404) as well as timeouts and DNS failures.
        return false;
    }
    catch (Exception)
    {
        return false;
    }
    return true;
}
This code works perfectly, but for a particular site hosted on Apache I am getting a WebException with the following message: "The remote server returned an error: (403) Forbidden". On further inspection I found the following details in the WebException object:
Status="ProtocolError" StatusDescription="Bad Behaviour"
This is the request header:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5
Host: scenicspares.co.uk
Connection: Keep-Alive
This is the response header:
Keep-Alive: timeout=4, max=512
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
Date: Thu, 13 Jan 2011 10:29:36 GMT
Server: Apache
I extracted these headers using a watch in VS2008. The framework in use is .NET 3.5.
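For anyone who wants to log the same details without a debugger watch, here is a minimal sketch of reading them from the exception. The class and method names (WebExceptionInspector, Describe) are made up for illustration; only the WebException and HttpWebResponse members are real framework API.

```csharp
using System;
using System.Net;

public static class WebExceptionInspector
{
    // Hypothetical helper: summarises why a request failed.
    public static string Describe(WebException e)
    {
        // Response is only populated when the server actually answered,
        // which is the ProtocolError case (e.g. 403 Forbidden).
        HttpWebResponse response = e.Response as HttpWebResponse;
        if (e.Status == WebExceptionStatus.ProtocolError && response != null)
        {
            return string.Format("{0} {1} ({2})",
                (int)response.StatusCode,      // numeric code, e.g. 403
                response.StatusCode,           // enum name, e.g. Forbidden
                response.StatusDescription);   // server-supplied text
        }

        // Timeouts, DNS failures, etc. carry no HTTP response.
        return e.Status.ToString();
    }
}
```

Calling Describe(e) inside the catch (WebException e) block would give a one-line summary that can be written to a log instead of being silently turned into false.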
It turned out that all I needed to do was the following:
webRequest.Accept = "*/*";
webResponse = webRequest.GetResponse();
and it was fixed.
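Put together with the original method, the fix might look like the sketch below. The split into CreateProbeRequest and the class name LiveUrlChecker are my own choices for illustration, and I am assuming the server rejected the request only because the Accept header was missing.

```csharp
using System;
using System.Net;

public static class LiveUrlChecker
{
    // Builds the request without sending it, so the headers can be inspected.
    public static HttpWebRequest CreateProbeRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
        request.Accept = "*/*"; // the header that was missing in the rejected request
        request.CookieContainer = new CookieContainer();
        return request;
    }

    public static bool IsLiveUrl(string url)
    {
        try
        {
            // Dispose the response so the underlying connection is released.
            using (WebResponse response = CreateProbeRequest(url).GetResponse())
            {
                return true;
            }
        }
        catch (WebException)
        {
            // HTTP error statuses, timeouts, DNS failures.
            return false;
        }
        catch (UriFormatException)
        {
            // The string is not a well-formed URL at all.
            return false;
        }
    }
}
```

Unlike the original, this version does not swallow every exception type, so unexpected failures surface instead of being reported as a dead URL.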
I believe there are quite a lot of similar problems whose cause depends on the server application. For my particular case, see: "The remote server returned an error: (403) Forbidden"
I fixed it in my web scraping app after struggling with this issue for a whole day. I hope it helps others:
public static string GetPageContent(string url)
{
    CookieContainer cookieContainer = new CookieContainer();
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = cookieContainer; // after Create() method
    request.AllowAutoRedirect = true; // should be true
    request.UserAgent = ".NET Framework Test Client"; // should not be null
    // The using blocks close the reader, stream and response in reverse order.
    using (WebResponse response = request.GetResponse())
    using (Stream dataStream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(dataStream))
    {
        return reader.ReadToEnd();
    }
}