I'd like to find out how current Google's cached copy of a large set of pages is. I think I need to:
- look in the server access logs for IPs,
- check for the user-agent "Googlebot", then
- export a list showing each page and when it was last visited.
I imagine this could be a cron job that runs weekly. If this is right, how would I write the script? If this is wrong, what would be a better way?
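For reference, a rough sketch of what such a script could look like, assuming a combined-format (Apache/Nginx) access log; the log path and output format are placeholders you would adjust:

```python
import re

# Hypothetical path -- point this at your real access log.
LOG_PATH = "/var/log/apache2/access.log"

# Combined log format:
# IP ident user [time] "METHOD /path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d+ \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

last_seen = {}  # path -> most recent Googlebot visit timestamp

with open(LOG_PATH) as f:
    for line in f:
        m = LINE_RE.match(line)
        if not m:
            continue
        if "googlebot" not in m.group("agent").lower():
            continue
        # Lines in an access log are chronological, so the last match wins.
        last_seen[m.group("path")] = m.group("time")

# Tab-separated report: page, last Googlebot visit
for path, ts in sorted(last_seen.items()):
    print(f"{path}\t{ts}")
```

You could run it weekly from cron, e.g. `0 3 * * 0 python3 /path/to/googlebot_report.py > /var/www/reports/googlebot.tsv` (paths are placeholders). Note that the user-agent string can be spoofed; to be sure a hit really came from Google, you can reverse-DNS the IP and check that it resolves to a googlebot.com or google.com host.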
Google already provides this information via Google SiteMaps. I have used it for the past three years and it works great.
Add your site to SiteMaps, put a generated Sitemap XML file for your site on your web server (search for websites that will generate one for free), and let Google do the rest; a minimal example Sitemap is shown after the feature list below. There is a section in SiteMaps called Crawl Stats that gives you exactly what you want.
Google's own description of the features:

- Get Google's view of your site and diagnose problems: "See how Google crawls and indexes your site and learn about specific problems we're having accessing it."
- Discover your link and query traffic: "View, classify, and download comprehensive data about internal and external links to your site with new link reporting tools. Find out which Google search queries drive traffic to your site, and see exactly how users arrive there."
- Share information about your site: "Tell us about your pages with Sitemaps: which ones are the most important to you and how often they change. You can also let us know how you would like the URLs we index to appear."
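If you need a starting point for the Sitemap file itself, a minimal example in the sitemaps.org 0.9 format that Google accepts looks like this (the URL and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2010-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```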
That isn't necessary; you can query Google directly for the cached page, e.g. by searching for cache:stackoverflow.com, which includes the time and date of the snapshot. I wouldn't be surprised if there's an API call to do this more directly (update: the Google Search API).
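A rough sketch of automating this, assuming Google still serves the cached copy at the webcache.googleusercontent.com URL and that the cache banner still contains the phrase "as it appeared on ..." (both are assumptions, and Google may throttle or block scripted requests):

```python
import re
import urllib.request

def google_cache_date(url):
    """Fetch Google's cached copy of `url` and pull the snapshot date out of
    the banner text. Fragile by design: the page may not be cached, the banner
    wording may change, or the request may be blocked."""
    cache_url = "https://webcache.googleusercontent.com/search?q=cache:" + url
    req = urllib.request.Request(cache_url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
    m = re.search(r"as it appeared on (.+?)\.", html)
    return m.group(1) if m else None

print(google_cache_date("stackoverflow.com"))
```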
The last Googlebot access can also be found for free via websites like mypagerank.net, or via the Google Toolbar.