I'm looking for the best way to handle this, and I want to make sure I'm doing it correctly:
I have a calendar on my website, and users can take the calendar's iCal feed and import it into external calendars of their preference (Outlook, iCal, Google Calendar, etc.).
To deter bad people from crawling/searching my website for the *.ics files, I've set up robots.txt to disallow the folders in which the feeds are stored.
So, essentially, an iCal feed might look like: webcal://www.mysite.com/feeds/cal/a9d90309dafda390d09/feed.ics
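For reference, the rule looks something like this (a sketch, assuming the feeds all live under /feeds/ as in the URL above):

User-agent: *
Disallow: /feeds/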
I understand the above is still a public URL. However, I have a function that lets the user change the address of their feed if they want.
My problem is this: all external calendars have no problem importing/subscribing to the calendar feed except Google Calendar, which throws the message: "Google was unable to crawl the URL due to a robots.txt restriction." (See Google's answer to this.)
Consequently, after searching around, I've found that the following works:
1) Set up a PHP file (which I am using) that essentially forces a download of the file. It basically looks like this:
<?php
// Reject path traversal attempts before touching the filesystem.
if (strpos($_GET['url'], '..') !== false) {
    echo "<p>Invalid feed path.</p>\n";
    exit;
}

$path = "/home/path/to/local/feed/" . $_GET['url'];
$file = fopen($path, "r");
if (!$file) {
    echo "<p>Unable to open feed file.</p>\n";
    exit;
}

// Tell calendar clients this is an iCalendar feed.
header("Content-Type: text/calendar; charset=utf-8");

while (!feof($file)) {
    print fgets($file, 1024);
}
fclose($file);
?>
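With a proxy script like this, the URL handed to Google Calendar looks something like http://www.mysite.com/feed.php?url=a9d90309dafda390d09/feed.ics (feed.php is an assumed name here; the script just has to live at a path robots.txt does not disallow), so Googlebot never fetches anything under the blocked feeds directory.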
I tried this script, and it appeared to work with Google Calendar with no issues. (Although I'm not sure whether it updates/refreshes yet; I'm still waiting to see if this works.)
My question is this: is there a better way to approach this issue? I'd like to keep the current robots.txt in place to disallow crawling of my directories for *.ics files and keep the files hidden.
I recently had this problem, and this robots.txt works for me:
User-agent: Googlebot
Allow: /*.ics$
Disallow: /
User-agent: *
Disallow: /
This allows access to any .ics file if they know the address, and it prevents the bots from searching the site (it's a private server). You will want to change the Disallow rule for your server.
I don't think the Allow directive is part of the original spec, but some bots seem to support it. Here is Google's Webmaster Tools help page on robots.txt:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
It looks to me like you have two problems:
- Preventing badly behaved bots from accessing the website.
- Allowing Googlebot to access your site after installing robots.txt.
The first problem cannot be solved by robots.txt. As Marc B points out in the comments, robots.txt is a purely voluntary mechanism. To block bad bots once and for all, I suggest using some kind of behavior-analysis program/firewall to detect bad bots and deny access from their IPs.
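To sketch the idea in PHP (the thresholds, paths, and flat-file storage here are assumptions for illustration, not a production firewall), something like this at the top of each page would throttle any single IP that hammers the site:

<?php
// Minimal per-IP rate limit: refuse an IP that makes more than
// $max_hits requests within $window seconds.
$max_hits = 30;  // requests allowed per window (assumed value)
$window   = 60;  // window length in seconds (assumed value)
$dir      = sys_get_temp_dir() . "/hits";
if (!is_dir($dir)) {
    mkdir($dir, 0700, true);
}

// One small file of recent request timestamps per client IP.
$file = $dir . "/" . md5($_SERVER['REMOTE_ADDR']);
$hits = array();
if (is_file($file)) {
    $cutoff = time() - $window;
    foreach (explode("\n", trim(file_get_contents($file))) as $t) {
        if ((int)$t > $cutoff) {
            $hits[] = (int)$t;  // keep only hits inside the window
        }
    }
}
$hits[] = time();
file_put_contents($file, implode("\n", $hits));

if (count($hits) > $max_hits) {
    header("HTTP/1.1 429 Too Many Requests");
    exit("Too many requests.\n");
}
?>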
For the second problem, robots.txt does allow you to whitelist a particular bot; check http://facebook.com/robots.txt as an example. Note that Google identifies its bots under different names (for AdSense, search, image search, mobile search), so I am not sure whether the Google Calendar bot uses the generic Googlebot name or not.
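If a differently named fetcher turns out to be involved, the whitelist can name each bot in its own group, roughly like this (Mediapartners-Google is the AdSense crawler; whether the Calendar fetcher needs its own entry is exactly the open question):

User-agent: Googlebot
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: *
Disallow: /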