开发者

How do I make a simple crawler in PHP? [closed]

开发者 https://www.devze.com 2022-12-21 02:17 出处:网络
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago.
Closed. This question needs to be more focused. It is not currently accepting answers. Closed 3 years ago. 开发者_StackOverflow社区 Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.


Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= dirname($parts['path'], 1).$path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).

Edit: I added a new bit of functionality that prevents it from following the same URL twice.

Edit: echoing output to STDOUT now so you can redirect it to whatever file you want

Edit: Fixed a bug pointed out by George in his answer. Relative urls will no longer append to the end of the url path, but overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded this is quite simply done using http_build_url. Otherwise, I have to manually glue together using parse_url. Thanks again George.


Here my implementation based on the above example/answer.

  1. It is class based
  2. uses Curl
  3. support HTTP Auth
  4. Skip Url not belonging to the base domain
  5. Return Http header Response Code for each page
  6. Return time for each page

CRAWL CLASS:

class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;
    protected $_useHttpAuth = false;
    protected $_user;
    protected $_pass;
    protected $_seen = array();
    protected $_filter = array();

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
    }

    protected function _processAnchors($content, $url, $depth)
    {
        $dom = new DOMDocument('1.0');
        @$dom->loadHTML($content);
        $anchors = $dom->getElementsByTagName('a');

        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Crawl only link that belongs to the start domain
            $this->crawl_page($href, $depth - 1);
        }
    }

    protected function _getContent($url)
    {
        $handle = curl_init($url);
        if ($this->_useHttpAuth) {
            curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
            curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
        }
        // follows 302 redirect, creates problem wiht authentication
//        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
        // return the content
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);
        // response total time
        $time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);

        curl_close($handle);
        return array($response, $httpCode, $time);
    }

    protected function _printResult($url, $depth, $httpcode, $time)
    {
        ob_end_flush();
        $currentDepth = $this->_depth - $depth;
        $count = count($this->_seen);
        echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
        ob_start();
        flush();
    }

    protected function isValid($url, $depth)
    {
        if (strpos($url, $this->_host) === false
            || $depth === 0
            || isset($this->_seen[$url])
        ) {
            return false;
        }
        foreach ($this->_filter as $excludePath) {
            if (strpos($url, $excludePath) !== false) {
                return false;
            }
        }
        return true;
    }

    public function crawl_page($url, $depth)
    {
        if (!$this->isValid($url, $depth)) {
            return;
        }
        // add to the seen URL
        $this->_seen[$url] = true;
        // get Content and Return Code
        list($content, $httpcode, $time) = $this->_getContent($url);
        // print Result for current Page
        $this->_printResult($url, $depth, $httpcode, $time);
        // process subPages
        $this->_processAnchors($content, $url, $depth);
    }

    public function setHttpAuth($user, $pass)
    {
        $this->_useHttpAuth = true;
        $this->_user = $user;
        $this->_pass = $pass;
    }

    public function addFilterPath($path)
    {
        $this->_filter[] = $path;
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }
}

USAGE:

// USAGE
$startURL = 'http://YOUR_URL/';
$depth = 6;
$username = 'YOURUSER';
$password = 'YOURPASS';
$crawler = new crawler($startURL, $depth);
$crawler->setHttpAuth($username, $password);
// Exclude path with the following structure to be processed 
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();


Check out PHP Crawler

http://sourceforge.net/projects/php-crawler/

See if it helps.


In it's simplest form:

function crawl_page($url, $depth = 5) {
    if($depth > 0) {
        $html = file_get_contents($url);

        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        foreach($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        file_put_contents('results.txt', $newurl."\n\n".$html."\n\n", FILE_APPEND);
    }
}

crawl_page('http://www.domain.com/index.php', 5);

That function will get contents from a page, then crawl all found links and save the contents to 'results.txt'. The functions accepts an second parameter, depth, which defines how long the links should be followed. Pass 1 there if you want to parse only links from the given page.


Why use PHP for this, when you can use wget, e.g.

wget -r -l 1 http://www.example.com

For how to parse the contents, see Best Methods to parse HTML and use the search function for examples. How to parse HTML has been answered multiple times before.


With some little changes to hobodave's code, here is a codesnippet you can use to crawl pages. This needs the curl extension to be enabled in your server.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
$seen = array();
if(($depth == 0) or (in_array($url, $seen))){
    return;
}   
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$result = curl_exec ($ch);
curl_close ($ch);
if( $result ){
    $stripped_file = strip_tags($result, "<a>");
    preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER ); 
    foreach($matches as $match){
        $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($href , array('path' => $path));
                } else {
                    $parts = parse_url($href);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
}   
echo "Crawled {$href}";
}   
crawl_page("http://www.sitename.com/",3);
?>

I have explained this tutorial in this crawler script tutorial


Hobodave you were very close. The only thing I have changed is within the if statement that checks to see if the href attribute of the found anchor tag begins with 'http'. Instead of simply adding the $url variable which would contain the page that was passed in you must first strip it down to the host which can be done using the parse_url php function.

<?php
function crawl_page($url, $depth = 5)
{
  static $seen = array();
  if (isset($seen[$url]) || $depth === 0) {
    return;
  }

  $seen[$url] = true;

  $dom = new DOMDocument('1.0');
  @$dom->loadHTMLFile($url);

  $anchors = $dom->getElementsByTagName('a');
  foreach ($anchors as $element) {
    $href = $element->getAttribute('href');
    if (0 !== strpos($href, 'http')) {
       /* this is where I changed hobodave's code */
        $host = "http://".parse_url($url,PHP_URL_HOST);
        $href = $host. '/' . ltrim($href, '/');
    }
    crawl_page($href, $depth - 1);
  }

  echo "New Page:<br /> ";
  echo "URL:",$url,PHP_EOL,"<br />","CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL,"  <br /><br />";
}

crawl_page("http://hobodave.com/", 5);
?>


As mentioned, there are crawler frameworks all ready for customizing out there, but if what you're doing is as simple as you mentioned, you could make it from scratch pretty easily.

Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html

Dumping results to a file: http://www.tizag.com/phpT/filewrite.php


I used @hobodave's code, with this little tweak to prevent re-crawling all fragment variants of the same URL:

<?php
function crawl_page($url, $depth = 5)
{
  $parts = parse_url($url);
  if(array_key_exists('fragment', $parts)){
    unset($parts['fragment']);
    $url = http_build_url($parts);
  }

  static $seen = array();
  ...

Then you can also omit the $parts = parse_url($url); line within the for loop.


You can try this it may be help to you

$search_string = 'american golf News: Fowler beats stellar field in Abu Dhabi';
$html = file_get_contents(url of the site);
$dom = new DOMDocument;
$titalDom = new DOMDocument;
$tmpTitalDom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$videos = $xpath->query('//div[@class="primary-content"]');
foreach ($videos as $key => $video) {
$newdomaindom = new DOMDocument;    
$newnode = $newdomaindom->importNode($video, true);
$newdomaindom->appendChild($newnode);
@$titalDom->loadHTML($newdomaindom->saveHTML());
$xpath1 = new DOMXPath($titalDom);
$titles = $xpath1->query('//div[@class="listingcontainer"]/div[@class="list"]');
if(strcmp(preg_replace('!\s+!',' ',  $titles->item(0)->nodeValue),$search_string)){     
    $tmpNode = $tmpTitalDom->importNode($video, true);
    $tmpTitalDom->appendChild($tmpNode);
    break;
}
}
echo $tmpTitalDom->saveHTML();


Thank you @hobodave.

However I found two weaknesses in your code. Your parsing of the original url to get the "host" segment stops at the first single slash. This presumes that all relative links start in the root directory. This only true sometimes.

original url   :  http://example.com/game/index.html
href in <a> tag:  highscore.html
author's intent:  http://example.com/game/highscore.html  <-200->
crawler result :  http://example.com/highscore.html       <-404->

fix this by breaking at the last single slash not the first

a second unrelated bug, is that $depth does not really track recursion depth, it tracks breadth of the first level of recursion.

If I believed this page were in active use I might debug this second issue, but I suspect the text I am writing now will never be read by anyone, human or robot, since this issue is six years old and I do not even have enough reputation to notify +hobodave directly about these defects by commmenting on his code. Thanks anyway hobodave.


I came up with the following spider code. I adapted it a bit from the following: PHP - Is the there a safe way to perform deep recursion? it seems fairly rapid....

    <?php
function  spider( $base_url , $search_urls=array() ) {
    $queue[] = $base_url;
    $done           =   array();
    $found_urls     =   array();
    while($queue) {
            $link = array_shift($queue);
            if(!is_array($link)) {
                $done[] = $link;
                foreach( $search_urls as $s) { if (strstr( $link , $s )) { $found_urls[] = $link; } }
                if( empty($search_urls)) { $found_urls[] = $link; }
                if(!empty($link )) {
echo 'LINK:::'.$link;
                      $content =    file_get_contents( $link );
//echo 'P:::'.$content;
                    preg_match_all('~<a.*?href="(.*?)".*?>~', $content, $sublink);
                    if (!in_array($sublink , $done) && !in_array($sublink , $queue)  ) {
                           $queue[] = $sublink;
                    }
                }
            } else {
                    $result=array();
                    $return = array();
                    // flatten multi dimensional array of URLs to one dimensional.
                    while(count($link)) {
                         $value = array_shift($link);
                         if(is_array($value))
                             foreach($value as $sub)
                                $link[] = $sub;
                         else
                               $return[] = $value;
                     }
                     // now loop over one dimensional array.
                     foreach($return as $link) {
                                // echo 'L::'.$link;
                                // url may be in form <a href.. so extract what's in the href bit.
                                preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $link, $result);
                                if ( isset( $result['href'][0] )) { $link = $result['href'][0]; }
                                // add the new URL to the queue.
                                if( (!strstr( $link , "http")) && (!in_array($base_url.$link , $done)) && (!in_array($base_url.$link , $queue)) ) {
                                     $queue[]=$base_url.$link;
                                } else {
                                    if ( (strstr( $link , $base_url  ))  && (!in_array($base_url.$link , $done)) && (!in_array($base_url.$link , $queue)) ) {
                                         $queue[] = $link;
                                    }
                                }
                      }
            }
    }


    return $found_urls;
}    


    $base_url       =   'https://www.houseofcheese.co.uk/';
    $search_urls    =   array(  $base_url.'acatalog/' );
    $done = spider( $base_url  , $search_urls  );

    //
    // RESULT
    //
    //
    echo '<br /><br />';
    echo 'RESULT:::';
    foreach(  $done as $r )  {
        echo 'URL:::'.$r.'<br />';
    }


Its worth remembering that when crawling external links (I do appreciate the OP relates to a users own page) you should be aware of robots.txt. I have found the following which will hopefully help http://www.the-art-of-web.com/php/parse-robots/.


I created a small class to grab data from the provided url, then extract html elements of your choice. The class makes use of CURL and DOMDocument.

php class:

class crawler {


   public static $timeout = 2;
   public static $agent   = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';


   public static function http_request($url) {
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL,            $url);
      curl_setopt($ch, CURLOPT_USERAGENT,      self::$agent);
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, self::$timeout);
      curl_setopt($ch, CURLOPT_TIMEOUT,        self::$timeout);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      $response = curl_exec($ch);
      curl_close($ch);
      return $response;
   }


   public static function strip_whitespace($data) {
      $data = preg_replace('/\s+/', ' ', $data);
      return trim($data);
   }


   public static function extract_elements($tag, $data) {
      $response = array();
      $dom      = new DOMDocument;
      @$dom->loadHTML($data);
      foreach ( $dom->getElementsByTagName($tag) as $index => $element ) {
         $response[$index]['text'] = self::strip_whitespace($element->nodeValue);
         foreach ( $element->attributes as $attribute ) {
            $response[$index]['attributes'][strtolower($attribute->nodeName)] = self::strip_whitespace($attribute->nodeValue);
         }
      }
      return $response;
   }


}

example usage:

$data  = crawler::http_request('https://stackoverflow.com/questions/2313107/how-do-i-make-a-simple-crawler-in-php');
$links = crawler::extract_elements('a', $data);
if ( count($links) > 0 ) {
   file_put_contents('links.json', json_encode($links, JSON_PRETTY_PRINT));
}

example response:

[
    {
        "text": "Stack Overflow",
        "attributes": {
            "href": "https:\/\/stackoverflow.com",
            "class": "-logo js-gps-track",
            "data-gps-track": "top_nav.click({is_current:false, location:2, destination:8})"
        }
    },
    {
        "text": "Questions",
        "attributes": {
            "id": "nav-questions",
            "href": "\/questions",
            "class": "-link js-gps-track",
            "data-gps-track": "top_nav.click({is_current:true, location:2, destination:1})"
        }
    },
    {
        "text": "Developer Jobs",
        "attributes": {
            "id": "nav-jobs",
            "href": "\/jobs?med=site-ui&ref=jobs-tab",
            "class": "-link js-gps-track",
            "data-gps-track": "top_nav.click({is_current:false, location:2, destination:6})"
        }
    }
]


It's an old question. A lot of good things happened since then. Here are my two cents on this topic:

  1. To accurately track the visited pages you have to normalize URI first. The normalization algorithm includes multiple steps:

    • Sort query parameters. For example, the following URIs are equivalent after normalization: GET http://www.example.com/query?id=111&cat=222 GET http://www.example.com/query?cat=222&id=111
    • Convert the empty path. Example: http://example.org → http://example.org/

    • Capitalize percent encoding. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive. Example: http://example.org/a%c2%B1b → http://example.org/a%C2%B1b

    • Remove unnecessary dot-segments. Example: http://example.org/../a/b/../c/./d.html → http://example.org/a/c/d.html

    • Possibly some other normalization rules

  2. Not only <a> tag has href attribute, <area> tag has it too https://html.com/tags/area/. If you don't want to miss anything, you have to scrape <area> tag too.

  3. Track crawling progress. If the website is small, it is not a problem. Contrarily it might be very frustrating if you crawl half of the site and it failed. Consider using a database or a filesystem to store the progress.

  4. Be kind to the site owners. If you are ever going to use your crawler outside of your website, you have to use delays. Without delays, the script is too fast and might significantly slow down some small sites. From sysadmins perspective, it looks like a DoS attack. A static delay between the requests will do the trick.

If you don't want to deal with that, try Crawlzone and let me know your feedback. Also, check out the article I wrote a while back https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm

0

精彩评论

暂无评论...
验证码 换一张
取 消