
Getting Error "The remote server returned an error: (403) Forbidden" when screen scraping using HttpWebRequest.GetResponse()

Source: https://www.devze.com, 2023-02-03 20:05
We have a tool that checks whether a given URL is live. If a given URL is live, another part of our software can screen-scrape the content from it.

This is my code for checking whether a URL is live:

    public static bool IsLiveUrl(string url)
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
        webRequest.CookieContainer = new CookieContainer();
        try
        {
            // Dispose the response so the connection is returned to the pool.
            using (WebResponse webResponse = webRequest.GetResponse())
            {
                return true;
            }
        }
        catch (WebException)
        {
            // Non-success status codes (e.g. 403) surface as WebException.
            return false;
        }
        catch (Exception)
        {
            return false;
        }
    }

This code works perfectly, but for one particular site hosted on Apache I am getting a WebException with the following message: "The remote server returned an error: (403) Forbidden". On further inspection I found the following details in the WebException object:

Status="ProtocolError" StatusDescription="Bad Behaviour"

This is the request header:

    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5
    Host: scenicspares.co.uk
    Connection: Keep-Alive

This is the response header:

    Keep-Alive: timeout=4, max=512
    Connection: Keep-Alive
    Transfer-Encoding: chunked
    Content-Type: text/html
    Date: Thu, 13 Jan 2011 10:29:36 GMT
    Server: Apache

I extracted these headers using a watch in Visual Studio 2008. The framework in use is .NET 3.5.
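When a request fails with `Status = ProtocolError`, the server's full response (status line and error page) travels with the WebException and can be dumped programmatically rather than through a debugger watch. A minimal sketch, assuming only that the server returns an error status (the URL is just a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class Diagnose403
{
    static void Main()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        try
        {
            using (request.GetResponse()) { }
            Console.WriteLine("OK");
        }
        catch (WebException ex)
        {
            Console.WriteLine(ex.Status); // e.g. ProtocolError
            HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null)
            {
                // The server's status code, description, and body are attached to the exception.
                Console.WriteLine((int)errorResponse.StatusCode + " " + errorResponse.StatusDescription);
                using (StreamReader reader = new StreamReader(errorResponse.GetResponseStream()))
                {
                    Console.WriteLine(reader.ReadToEnd()); // the server's error page
                }
            }
        }
    }
}
```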


It turned out that all I needed to do was the following:

            webRequest.Accept = "*/*";
            webResponse = webRequest.GetResponse();

and it was fixed.
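Folding that fix into the checker from the question, the whole method might look like this (a sketch; only the Accept line is new relative to the original):

```csharp
public static bool IsLiveUrl(string url)
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
    webRequest.Accept = "*/*"; // the fix: some servers reject requests that lack an Accept header
    webRequest.CookieContainer = new CookieContainer();
    try
    {
        using (webRequest.GetResponse())
        {
            return true;
        }
    }
    catch (WebException)
    {
        return false;
    }
}
```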


I believe there are quite a lot of similar problems that depend on the server application. For my particular case, see: The remote server returned an error: (403) Forbidden


I fixed this in my web-scraping app after facing the issue for a whole day; hope it helps others:

    public static string GetPageContent(string url)
    {
        CookieContainer cookieContainer = new CookieContainer();
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookieContainer;        // set after Create()
        request.AllowAutoRedirect = true;                 // should be true
        request.UserAgent = ".NET Framework Test Client"; // should not be null

        // using blocks ensure the response and streams are closed even on error.
        using (WebResponse response = request.GetResponse())
        using (Stream dataStream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(dataStream))
        {
            return reader.ReadToEnd();
        }
    }
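A quick usage sketch (assuming the GetPageContent helper above is in scope; the URL is a placeholder):

```csharp
string html = GetPageContent("http://example.com/");
Console.WriteLine(html.Length); // size of the fetched page in characters
```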
