
prevent crawler from following POST form action

https://www.devze.com 2023-03-14 07:52 Source: web

I have simple form on my site:

<form method="POST" action="Home/Import"> ... </form>

I get tons of error reports because of crawlers sending HEAD requests to Home/Import.

Notice the form is POST.

Questions

  1. Why do crawlers try to crawl these actions?
  2. Is there anything I can do to prevent it? (I already have Home in robots.txt.)
  3. What is a good way to deal with these invalid (but well-formed) HEAD requests?

Details:

I use the Post/Redirect/Get pattern, if that matters. Platform: ASP.NET MVC 3.0 (C#) on IIS 7.5.


1) A crawler typically makes HEAD requests to learn the MIME type and size of a response without downloading the body.

2) A HEAD request shouldn't invoke the action handler for a POST. If I saw that I was getting a lot of HEAD requests to a resource I didn't want crawled, I would give the crawler a link I do want it to crawl. Most crawlers read robots.txt.
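On the asker's platform (ASP.NET MVC) this restriction can be made explicit with the [HttpPost] attribute, so non-POST probes never reach the handler. A minimal sketch; the controller shape and parameter name are illustrative, not taken from the question:

```csharp
using System.Web.Mvc;

public class HomeController : Controller
{
    // [HttpPost] limits action selection to POST; a crawler's GET or
    // HEAD to /Home/Import gets a 404 instead of running import logic
    // and raising error reports.
    [HttpPost]
    public ActionResult Import(string payload /* hypothetical */)
    {
        // ... import logic ...
        return RedirectToAction("Index"); // Post/Redirect/Get
    }
}
```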


You can disable HEAD requests at the web server level. For Apache:

<LimitExcept GET POST>
deny from all
</LimitExcept>
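Since the asker is on IIS 7.5 rather than Apache, a rough equivalent is request filtering in web.config. A sketch, not verified against the asker's setup:

```xml
<system.webServer>
  <security>
    <requestFiltering>
      <verbs allowUnlisted="true">
        <!-- Reject HEAD at the server level before ASP.NET runs -->
        <add verb="HEAD" allowed="false" />
      </verbs>
    </requestFiltering>
  </security>
</system.webServer>
```

Note this blocks HEAD site-wide, which may be heavier than needed if only one action is affected.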

You can also handle this at the robots.txt level by adding (under a User-agent: * record):

Disallow: /Home/Import

HEAD requests are used to get information about a page (last-modified time, size, etc.) without downloading the whole page; it is an efficiency measure. Your script should not be producing errors because of HEAD requests, and those errors are probably due to missing validation in your code. Your code could check whether the request's HTTP method is HEAD and do something different.
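One way to "do something different" on the asker's ASP.NET stack is to short-circuit HEAD probes to the import URL before routing runs. A sketch only; the path check and the choice of 405 are assumptions, not part of the answer:

```csharp
// In Global.asax.cs: intercept HEAD requests to the import action.
protected void Application_BeginRequest()
{
    if (string.Equals(Request.HttpMethod, "HEAD", System.StringComparison.OrdinalIgnoreCase) &&
        Request.Path.StartsWith("/Home/Import", System.StringComparison.OrdinalIgnoreCase))
    {
        Response.StatusCode = 405;                   // Method Not Allowed
        Response.AppendHeader("Allow", "POST");      // advertise the supported verb
        Response.End();                              // no body, no error report
    }
}
```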


Four years ago, but still answering question #1: Google does indeed try to crawl POST forms, both by sending a plain GET to the URL and by making actual POST requests (see their blog post on this). The why lies in the nature of the web: bad web developers hide their content links behind POST search forms, so to reach that content, search engines have to improvise.

About #2: The reliability of robots.txt varies.

And about #3: the cleanest way would probably be to return HTTP status 405 Method Not Allowed if HEAD requests in particular are your problem. I'm not sure browsers will like this, though.
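In MVC 3 the 405 response can be returned from the action itself with HttpStatusCodeResult. A sketch, assuming the action should answer every non-POST verb this way:

```csharp
using System;
using System.Web.Mvc;

public class HomeController : Controller
{
    public ActionResult Import()
    {
        // Answer HEAD (and any other non-POST verb) with 405 Method Not Allowed.
        if (!string.Equals(Request.HttpMethod, "POST", StringComparison.OrdinalIgnoreCase))
            return new HttpStatusCodeResult(405, "Method Not Allowed");

        // ... import logic ...
        return RedirectToAction("Index");
    }
}
```

Strictly, a 405 response should also carry an Allow header listing the permitted verbs; HttpStatusCodeResult does not set one, so add it via Response if you care about full HTTP conformance.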

