This is a general question about writing web apps.
I have an application that counts page views of articles, as well as a URL shortener script that I've installed for a client of mine. The problem is that whenever bots hit the site, they inflate the page view counts.
Does anyone have an idea on how to go about eliminating bot views from the view counts of these applications?
There are a few ways you could determine whether your articles are being viewed by an actual user or by a search engine bot. Probably the best is to check the User-Agent header sent by the browser (or bot). The User-Agent header is essentially a field identifying the client application used to access the resource. For example, Internet Explorer might send something like `Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)`, while Google's bot might send something like `Googlebot/2.1 (+http://www.google.com/bot.html)`. It is possible to send a fake User-Agent header, but I can't see the average site user or a major company like Google doing that. If the header is blank, or matches a User-Agent string associated with a known bot, the request is most likely from a bot.
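As a minimal sketch of this idea (the substring list below is illustrative, not exhaustive, and a blank User-Agent is treated as a bot, per the advice above):

```python
# Flag a request as a bot by inspecting its User-Agent string.
# The signature list is a small, illustrative sample of common crawlers.
BOT_SIGNATURES = ("googlebot", "bingbot", "slurp", "baiduspider",
                  "yandexbot", "crawler", "spider", "bot")

def is_bot(user_agent):
    """Return True if the User-Agent is blank or matches a known bot."""
    if not user_agent:                      # blank UA: assume a bot
        return True
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)
```

You would then skip incrementing the view counter whenever `is_bot()` returns True.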
While you're at it, you may want to make sure you have an up-to-date robots.txt file. It's a simple text file that provides rules automated bots should respect in terms of which content they are not allowed to retrieve for indexing.
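For instance, a robots.txt at the site root might look like this (the paths are hypothetical; note that only well-behaved bots respect these rules, so this reduces bot traffic rather than filtering the counts directly):

```
# Ask compliant crawlers not to fetch the shortener's redirect paths
User-agent: *
Disallow: /go/
Disallow: /stats/
```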
Here are a few resources that may be helpful:
- List of User-Agents
- How to Verify Googlebot
- Web Robots Page
- How do I stop bots from incrementing my file download counter in PHP?
Check the User-Agent header, and use its value to distinguish bots from regular browsers/users.
For example,
Google bot:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Safari:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; lv-lv) AppleWebKit/531.22.7 (KHTML, like Gecko) Version/4.0.5 Safari/531.22.7
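Putting the two examples above to use, a simple guard before incrementing the counter might look like this (a sketch; the `db` dict stands in for whatever storage the counter actually uses, and the token list only covers the most common crawler names):

```python
# Only bump the view counter when the User-Agent does not look like a
# crawler. "bot", "crawler", and "spider" catch most major engines.
def record_view(article_id, user_agent, db):
    ua = (user_agent or "").lower()
    if any(token in ua for token in ("bot", "crawler", "spider")):
        return False                    # skip bots such as Googlebot
    db[article_id] = db.get(article_id, 0) + 1
    return True
```

Called with the Googlebot string above, `record_view` returns False and leaves the count untouched; called with the Safari string, it increments the count.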