I'm looking for the best way to handle this, and I want to make sure I'm doing it correctly:
I have a calendar on my website, and users can take the calendar's iCal feed and import it into external calendars of their preference (Outlook, iCal, Google Calendar, etc.).
To deter bad people from crawling/searching my website for the *.ics files, I've set up robots.txt to disallow the folders in which the feeds are stored.
So, essentially, an iCal feed might look like: webcal://www.mysite.com/feeds/cal/a9d90309dafda390d09/feed.ics
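For reference, the rule looks something like this (a sketch, assuming the feeds all live under /feeds/ as in the URL above):

User-agent: *
Disallow: /feeds/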
I understand the above is still a public URL. However, I have a function that lets the user change the address of their feed if they want.
My problem is this: all external calendars have no problem importing/subscribing to the calendar feed except Google Calendar, which throws the message: "Google was unable to crawl the URL due to a robots.txt restriction." (See Google's answer to this.)
Consequently, after searching around, I've found that the following works:
1) Set up a PHP file (which I am using) that essentially forces a download of the file. It basically looks like this:
<?php
// Reject path traversal attempts before touching the filesystem.
if (strpos($_GET['url'], '..') !== false) {
    echo "<p>Invalid feed path.</p>\n";
    exit;
}

$path = "/home/path/to/local/feed/" . $_GET['url'];
$file = fopen($path, "r");
if (!$file) {
    echo "<p>Unable to open feed file.</p>\n";
    exit;
}

// Tell calendar clients this is an iCalendar feed.
header("Content-Type: text/calendar; charset=utf-8");

while (!feof($file)) {
    print fgets($file, 1024);
}
fclose($file);
?>
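With a proxy script like this, the URL handed to Google Calendar looks something like http://www.mysite.com/feed.php?url=a9d90309dafda390d09/feed.ics (feed.php is an assumed name here; the script just has to live at a path robots.txt does not disallow), so Googlebot never fetches anything under the blocked feeds directory.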
I tried this script, and it appeared to work with Google Calendar with no issues. (Although I'm not sure whether it updates/refreshes yet; I'm still waiting to see if this works.)
My question is this: is there a better way to approach this issue? I'd like to keep the current robots.txt in place to disallow crawling of my directories for *.ics files and keep the files hidden.
I recently had this problem, and this robots.txt works for me:
User-agent: Googlebot
Allow: /*.ics$
Disallow: /
User-agent: *
Disallow: /
This allows access to any .ics file if they know the address, and it prevents the bots from searching the site (it's a private server). You will want to change the Disallow rule for your server.
I don't think the Allow directive is part of the original spec, but some bots seem to support it. Here is Google's Webmaster Tools help page on robots.txt:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
It looks to me like you have two problems:
- Preventing badly behaved bots from accessing the website.
- Allowing Googlebot to access your site after installing robots.txt.
The first problem cannot be solved by robots.txt. As Marc B points out in the comments, robots.txt is a purely voluntary mechanism. To block bad bots once and for all, I suggest using some kind of behavior-analysis program/firewall to detect bad bots and deny access from their IPs.
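To sketch the idea in PHP (the thresholds, paths, and flat-file storage here are assumptions for illustration, not a production firewall), something like this at the top of each page would throttle any single IP that hammers the site:

<?php
// Minimal per-IP rate limit: refuse an IP that makes more than
// $max_hits requests within $window seconds.
$max_hits = 30;  // requests allowed per window (assumed value)
$window   = 60;  // window length in seconds (assumed value)
$dir      = sys_get_temp_dir() . "/hits";
if (!is_dir($dir)) {
    mkdir($dir, 0700, true);
}

// One small file of recent request timestamps per client IP.
$file = $dir . "/" . md5($_SERVER['REMOTE_ADDR']);
$hits = array();
if (is_file($file)) {
    $cutoff = time() - $window;
    foreach (explode("\n", trim(file_get_contents($file))) as $t) {
        if ((int)$t > $cutoff) {
            $hits[] = (int)$t;  // keep only hits inside the window
        }
    }
}
$hits[] = time();
file_put_contents($file, implode("\n", $hits));

if (count($hits) > $max_hits) {
    header("HTTP/1.1 429 Too Many Requests");
    exit("Too many requests.\n");
}
?>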
For the second problem, robots.txt does allow you to whitelist a particular bot; check http://facebook.com/robots.txt as an example. Note that Google identifies its bots under different names (for AdSense, search, image search, mobile search), so I am not sure whether the Google Calendar bot uses the generic Googlebot name or not.
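If a differently named fetcher turns out to be involved, the whitelist can name each bot in its own group, roughly like this (Mediapartners-Google is the AdSense crawler; whether the Calendar fetcher needs its own entry is exactly the open question):

User-agent: Googlebot
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: *
Disallow: /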