BOT/Spider Trap Ideas_问答_开发者_运维开发者技术经验分享

I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal looking user agents with random IPs but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't seem to find any pattern and my suspicion is it's a fleet of Windows Zombies.

The clients had issues in the past with SPAM attacks--even had to point MX at Postini to get the 6.7 GB/day of junk to stop server-side.

I want to setup a BOT trap in a directory disallowed by robots.txt... just never attempted anything like this before, hoping someone out there has a creative ideas for trapping BOTs!

EDIT: I already have plenty of ideas for catching one.. it's 开发者_开发百科what to do to it when lands in the trap.

You can set up a PHP script whose URL is explicitly forbidden by robots.txt. In that script, you can pull the source IP of the suspected bot hitting you (via $_SERVER['REMOTE_ADDR']), and then add that IP to a database blacklist table.

Then, in your main app, you can check the source IP, do a lookup for that IP in your blacklist table, and if you find it, throw a 403 page instead. (Perhaps with a message like, "We've detected abuse coming from your IP, if you feel this is in error, contact us at ...")

On the upside, you get automatic blacklisting of bad bots. On the downside, it's not terribly efficient, and it can be dangerous. (One person innocently checking that page out of curiosity can result in the ban of a large swath of users.)

Edit: Alternatively (or additionally, I suppose) you can fairly simply add a GeoIP check to your app, and reject hits based on country of origin.

What you can do is get another box (a kind of sacrificial lamb) not on the same pipe as your main host then have that host a page which redirects to itself (but with a randomized page name in the url). this could get the bot stuck in a infinite loop tieing up the cpu and bandwith on your sacrificial lamb but not on your main box.

I tend to think this is a problem better solved with network security more so than coding, but I see the logic in your approach/question.

There are a number of questions and discussions about this on server fault which may be worthy of investigating.

https://serverfault.com/search?q=block+bots

Well I must say, kinda disappointed--I was hoping for some creative ideas. I did find the ideal solutions here.. http://www.kloth.net/internet/bottrap.php

<html>
    <head><title> </title></head>
    <body>
    <p>There is nothing here to see. So what are you doing here ?</p>
    <p><a href="http://your.domain.tld/">Go home.</a></p>
    <?php
      /* whitelist: end processing end exit */
      if (preg_match("/10\.22\.33\.44/",$_SERVER['REMOTE_ADDR'])) { exit; }
      if (preg_match("Super Tool",$_SERVER['HTTP_USER_AGENT'])) { exit; }
      /* end of whitelist */
      $badbot = 0;
      /* scan the blacklist.dat file for addresses of SPAM robots
         to prevent filling it up with duplicates */
      $filename = "../blacklist.dat";
      $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
      while ($line = fgets($fp,255)) {
        $u = explode(" ",$line);
        $u0 = $u[0];
        if (preg_match("/$u0/",$_SERVER['REMOTE_ADDR'])) {$badbot++;}
      }
      fclose($fp);
      if ($badbot == 0) { /* we just see a new bad bot not yet listed ! */
      /* send a mail to hostmaster */
        $tmestamp = time();
        $datum = date("Y-m-d (D) H:i:s",$tmestamp);
        $from = "badbot-watch@domain.tld";
        $to = "hostmaster@domain.tld";
        $subject = "domain-tld alert: bad robot";
        $msg = "A bad robot hit $_SERVER['REQUEST_URI'] $datum \n";
        $msg .= "address is $_SERVER['REMOTE_ADDR'], agent is $_SERVER['HTTP_USER_AGENT']\n";
        mail($to, $subject, $msg, "From: $from");
      /* append bad bot address data to blacklist log file: */
        $fp = fopen($filename,'a+');
        fwrite($fp,"$_SERVER['REMOTE_ADDR'] - - [$datum] \"$_SERVER['REQUEST_METHOD'] $_SERVER['REQUEST_URI'] $_SERVER['SERVER_PROTOCOL']\" $_SERVER['HTTP_REFERER'] $_SERVER['HTTP_USER_AGENT']\n");
        fclose($fp);
      }
    ?>
    </body>
</html>

Then to protect pages throw <?php include($DOCUMENT_ROOT . "/blacklist.php"); ?> on the first line of every page.. blacklist.php contains:

<?php
    $badbot = 0;
    /* look for the IP address in the blacklist file */
    $filename = "../blacklist.dat";
    $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
    while ($line = fgets($fp,255))  {
      $u = explode(" ",$line);
      $u0 = $u[0];
      if (preg_match("/$u0/",$_SERVER['REMOTE_ADDR'])) {$badbot++;}
    }
    fclose($fp);
    if ($badbot > 0) { /* this is a bad bot, reject it */
      sleep(12);
      print ("<html><head>\n");
      print ("<title>Site unavailable, sorry</title>\n");
      print ("</head><body>\n");
      print ("<center><h1>Welcome ...</h1></center>\n");
      print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
      print ("<p><center>If you feel this in error, send a mail to the hostmaster at this site,<br>
             if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
      print ("</body></html>\n");
      exit;
    }
?>

I plan to take Scott Chamberlain's advice and to be safe I plan to implement Captcha on the script. If user answers correctly then it'll just die or redirect back to site root. Just for fun I'm throwing the trap in a directory named /admin/ and of coursed adding Disallow: /admin/ to robots.txt.

EDIT: In addition I am redirecting the bot ignoring the rules to this page: http://www.seastory.us/bot_this.htm

You could first take a look at where the ip's are coming from. My guess is that they are all coming from one country like china or Nigeria, in which case you could set up something in htaccess to disallow all ip's from those two countries, as for creating a trap for bots, i havent the slightest idea