I need to analyze the search engine crawling going on in my site. Is there a good tool for this? I've tried AWStats and Sawmill, but both give me very limited insight into the crawling. I need to know things like how many unique/distinct webpages in a section of my site were crawled by a specific crawler within a given time period.
Google Analytics doesn't track crawling at all because of its JavaScript-based tracking mechanism (crawlers generally don't execute the tracking script).
Upon following a link to the first page of your site, the major search engine crawlers will first request a file called robots.txt, which tells the crawler which pages the site owner permits it to visit and which files or directories are off limits.
What if you don't have a robots.txt? Nearly always, the crawler interprets this to mean that no pages/directories are off limits, and it will proceed to crawl your entire site. So why include a robots.txt file if that's what you want, i.e., for the crawler to index your entire site? Because if it's there, the crawler will nearly always request it so it can read it, and that request shows up as a line in your server access log file, which is a pretty strong signature for a crawler.
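As a rough illustration, here's a minimal Python sketch that scans a combined-format access log and tallies the user agents that requested /robots.txt. The log path is a placeholder, and the regex assumes Apache/Nginx "combined" format; adjust both to your setup:

    import re
    from collections import Counter

    # Placeholder path -- point this at your own access log.
    LOG_PATH = "/var/log/apache2/access.log"

    # Apache/Nginx combined format:
    # host ident user [time] "request" status size "referer" "user-agent"
    LINE_RE = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
    )

    def robots_txt_requesters(log_path):
        """Tally user agents that requested /robots.txt -- a strong crawler signature."""
        agents = Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                m = LINE_RE.match(line)
                if m and m.group("path") == "/robots.txt":
                    agents[m.group("agent")] += 1
        return agents

    if __name__ == "__main__":
        for agent, hits in robots_txt_requesters(LOG_PATH).most_common(20):
            print(f"{hits:6d}  {agent}")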
Second, use a good server access log parser such as Webalizer or AWStats, and compare user agents and IP addresses against published, authoritative lists: the IAB (http://www.iab.net/sites/spiders/login.php) and user-agents.org publish the two lists that seem to be the most widely used for this purpose. The former costs a few thousand dollars per year and up; the latter is free.
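Once you've downloaded such a list, matching against it is straightforward. A minimal sketch; the substrings below are a hard-coded stand-in for the real user-agents.org data, illustrative rather than exhaustive:

    # Stand-in for the full user-agents.org (or IAB) list -- in practice,
    # load the downloaded list instead of hard-coding entries like these.
    KNOWN_BOT_SUBSTRINGS = [
        "Googlebot",
        "Bingbot",
        "Slurp",        # Yahoo's crawler
        "Baiduspider",
    ]

    def is_known_crawler(user_agent: str) -> bool:
        """Case-insensitive substring match against the known-bot list."""
        ua = user_agent.lower()
        return any(bot.lower() in ua for bot in KNOWN_BOT_SUBSTRINGS)

    print(is_known_crawler("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True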
Both Webalizer and AWStats can do what you want, though I recommend AWStats for the following reasons: it was updated fairly recently (approx. one year ago), while Webalizer was last updated over eight years ago, and AWStats has much nicer report templates. The advantage of Webalizer is that it is much faster.
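For reference, the settings that matter here live in awstats.conf; a minimal sketch with placeholder paths and domain (my reading of the directives is noted in the comments -- check your awstats.model.conf, as defaults vary by version):

    # awstats.mysite.conf -- placeholder values
    LogFile="/var/log/apache2/access.log"
    LogFormat=1                      # 1 = Apache combined log format
    SiteDomain="www.example.com"
    LevelForRobotsDetection=2        # 2 = match against the full robots database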
The robots/spiders report that AWStats produces with its out-of-the-box config is probably what you are looking for.
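And if the stock report doesn't slice the data exactly the way the question asks (distinct pages under one section of the site, per crawler, per time window), the access log can be queried directly. A minimal Python sketch, again assuming the combined log format; the path, bot name, section prefix, and dates are all placeholders:

    import re
    from datetime import datetime, timezone

    LOG_PATH = "/var/log/apache2/access.log"   # placeholder path

    LINE_RE = re.compile(
        r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
        r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
    )

    def distinct_pages_crawled(log_path, agent_substring, section_prefix, start, end):
        """Count distinct URLs under `section_prefix` fetched by a crawler whose
        user agent contains `agent_substring`, between `start` and `end`."""
        pages = set()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                m = LINE_RE.match(line)
                if not m:
                    continue
                if agent_substring.lower() not in m.group("agent").lower():
                    continue
                if not m.group("path").startswith(section_prefix):
                    continue
                # Combined-log timestamp, e.g. 10/Oct/2011:13:55:36 -0700
                ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
                if start <= ts <= end:
                    pages.add(m.group("path"))
        return len(pages)

    if __name__ == "__main__":
        start = datetime(2011, 1, 1, tzinfo=timezone.utc)
        end = datetime(2011, 2, 1, tzinfo=timezone.utc)
        print(distinct_pages_crawled(LOG_PATH, "Googlebot", "/blog/", start, end))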