I'm working on a link checker/broken link finder and I am getting many false positives. After double-checking, I noticed that many requests threw a WebException even though the pages were actually downloadable, and in some other cases the status code is 404 even though I can access the page from the browser.
So here is the code. It's pretty ugly, and I'd like to have something more practical. All the status codes in that big if are used to filter out the ones I don't want to add to _brokenlinks because they are valid links (I tested them all). What I need to fix is the structure (if possible) and how to avoid getting false 404s.
Thank you!
try
{
    HttpWebRequest request = ( HttpWebRequest ) WebRequest.Create ( uri );
    request.Method = "Head";
    request.MaximumResponseHeadersLength = 32; // FOR IE SLOW SPEED
    request.AllowAutoRedirect = true;
    using ( HttpWebResponse response = ( HttpWebResponse ) request.GetResponse() )
    {
        request.Abort();
    }
    /* WebClient wc = new WebClient();
    wc.DownloadString( uri ); */
    _validlinks.Add ( strUri );
}
catch ( WebException wex )
{
    if ( !wex.Message.Contains ( "The remote name could not be resolved:" ) &&
         wex.Status != WebExceptionStatus.ServerProtocolViolation )
    {
        if ( wex.Status != WebExceptionStatus.Timeout )
        {
            HttpStatusCode code = ( ( HttpWebResponse ) wex.Response ).StatusCode;
            if (
                code != HttpStatusCode.OK &&
                code != HttpStatusCode.BadRequest &&
                code != HttpStatusCode.Accepted &&
                code != HttpStatusCode.InternalServerError &&
                code != HttpStatusCode.Forbidden &&
                code != HttpStatusCode.Redirect &&
                code != HttpStatusCode.Found
            )
            {
                _brokenlinks.Add ( new Href ( new Uri ( strUri , UriKind.RelativeOrAbsolute ) , UrlType.External ) );
            }
            else _validlinks.Add ( strUri );
        }
        else _brokenlinks.Add ( new Href ( new Uri ( strUri , UriKind.RelativeOrAbsolute ) , UrlType.External ) );
    }
    else _validlinks.Add ( strUri );
}
You should set a UserAgent header on the request, since many websites require one.
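Something along these lines might work as a starting point. It is only a sketch and I have not run it against your sites. It sets request.UserAgent, and because some servers answer HEAD differently from GET, it retries with GET when HEAD comes back with a 4xx/5xx status. The names LinkChecker, Probe, NeedsGetRetry and Check, the user agent string and the 15 second timeout are all placeholders of mine, not anything from your code. Note also that wex.Response is null when no HTTP response arrived at all (DNS failure, timeout, refused connection), so the sketch checks for that instead of casting blindly.

```csharp
using System;
using System.Net;

public static class LinkChecker
{
    // Placeholder user agent; substitute whatever identifies your tool.
    private const string AssumedUserAgent = "Mozilla/5.0 (compatible; LinkChecker/1.0)";

    // Sends one request and returns the final HTTP status code,
    // or null when no HTTP response arrived at all.
    // Assumes uri is http/https; other schemes would make the cast below fail.
    public static HttpStatusCode? Probe ( Uri uri , string method )
    {
        HttpWebRequest request = ( HttpWebRequest ) WebRequest.Create ( uri );
        request.Method = method;
        request.AllowAutoRedirect = true;
        request.Timeout = 15000; // milliseconds, an arbitrary choice
        request.UserAgent = AssumedUserAgent;
        try
        {
            using ( HttpWebResponse response = ( HttpWebResponse ) request.GetResponse() )
            {
                return response.StatusCode;
            }
        }
        catch ( WebException wex )
        {
            // Response is only set when the server actually sent an HTTP reply.
            using ( HttpWebResponse response = wex.Response as HttpWebResponse )
            {
                if ( response == null ) return null;
                return response.StatusCode;
            }
        }
    }

    // A HEAD result of 400 or above is worth double-checking with GET.
    // A null result (no response at all) is not retried here.
    public static bool NeedsGetRetry ( HttpStatusCode? headResult )
    {
        return headResult.HasValue && ( int ) headResult.Value >= 400;
    }

    // Tries HEAD first, then falls back to GET when HEAD reports an error status.
    // The response is disposed without reading its body stream.
    public static HttpStatusCode? Check ( Uri uri )
    {
        HttpStatusCode? result = Probe ( uri , "HEAD" );
        if ( NeedsGetRetry ( result ) )
        {
            result = Probe ( uri , "GET" );
        }
        return result;
    }
}
```

You would then call LinkChecker.Check ( new Uri ( strUri ) ) once per link and decide in a single place which results go into _brokenlinks and which into _validlinks, for example treating null and 404 as broken. Which status codes count as valid is still your call; the sketch leaves that decision to the caller.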