
Is there any JavaScript web crawler framework? [closed]

Developer https://www.devze.com 2023-02-22 01:11 Source: Internet
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 9 years ago.

Is there any JavaScript web crawler framework?


There's a new framework for Node.js called spider that was just released. It uses jQuery under the hood to crawl and index a website's HTML pages. The API and configuration are really nice, especially if you already know jQuery.

From the test suite, here's an example of crawling the New York Times website:

var spider = require('../main');

spider()
  // Start page: follow every link found on it.
  .route('www.nytimes.com', '/pages/dining/index.html', function (window, $) {
    $('a').spider();
  })
  // Travel pages: follow links, then extract the article contents.
  .route('travel.nytimes.com', '*', function (window, $) {
    $('a').spider();
    if (this.fromCache) return; // don't re-extract pages served from the cache

    var article = { title: $('nyt_headline').text(), body: '', photos: [] };
    $('div.articleBody').each(function () {
      article.body += this.outerHTML;
    });
    $('div#abColumn img').each(function () {
      var p = $(this).attr('src');
      // Skip images without a src and anything whose path contains 'ADS'.
      if (p && p.indexOf('ADS') === -1) {
        article.photos.push(p);
      }
    });
    console.log(article);
  })
  // Blog posts: print the entry's HTML content.
  .route('dinersjournal.blogs.nytimes.com', '*', function (window, $) {
    var article = { title: $('h1.entry-title').text() };
    console.log($('div.entry-content').html());
  })
  .get('http://www.nytimes.com/pages/dining/index.html')
  .log('info');


Try PhantomJS. It is not exactly a crawler, but it can easily be used as one. It has a fully functional WebKit engine built in, can save screenshots, and works as a simple command-line JavaScript interpreter.
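As a rough illustration of using PhantomJS for one crawl step, the sketch below loads a page, collects its links, and keeps those on the same host. The start URL is borrowed from the spider example above, and `sameHostLinks` is a hypothetical helper written for this sketch, not part of PhantomJS. A real crawler would also need a queue of pending URLs and a record of visited ones.

```javascript
// Hypothetical helper (not part of PhantomJS): keep only absolute http(s)
// links whose host equals `host` (expected in lowercase), dropping duplicates.
function sameHostLinks(links, host) {
  var seen = {};
  var result = [];
  for (var i = 0; i < links.length; i++) {
    var match = /^https?:\/\/([^\/:?#]+)/i.exec(links[i]);
    if (match && match[1].toLowerCase() === host && !seen[links[i]]) {
      seen[links[i]] = true;
      result.push(links[i]);
    }
  }
  return result;
}

// This part only runs inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var startUrl = 'http://www.nytimes.com/pages/dining/index.html'; // assumed start page

  page.open(startUrl, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + startUrl);
      phantom.exit(1);
      return;
    }
    // evaluate() runs inside the loaded page, so the DOM is available there;
    // a.href yields the absolute URL of each anchor.
    var links = page.evaluate(function () {
      return Array.prototype.map.call(
        document.querySelectorAll('a[href]'),
        function (a) { return a.href; }
      );
    });
    console.log(sameHostLinks(links, 'www.nytimes.com').join('\n'));
    phantom.exit();
  });
}
```

Saved as, say, `links.js`, this would be run with `phantomjs links.js`. The code sticks to ES5 syntax because PhantomJS ships an older JavaScript engine.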


Server-side?

Try node-crawler: https://github.com/joshfire/node-crawler
