anemone Ruby with focus_crawl

I'm working to do a crawl, but before I crawl an entire website, I would like to shoot off a test of ten or so pages. So I was thinking something like below would work, but I keep getting a NoMethodError....

Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    crawled_page.links.slice(0..10)
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
  end
end

NoMethodError (private method `select' called for true:TrueClass):
    app/models/site.rb:14:in `crawl'
    app/controllers/sites_controller.rb:96:in `block in crawl'
    app/controllers/sites_controller.rb:95:in `crawl'

Basically I want to have a way to first crawl only 10 pages, but I don't seem to be understanding the basics here. Can someone help me out? Thanks!!


Add this monkeypatch to your crawling file.

module Anemone
  class Core
    # Kill all of Anemone's worker threads ("tentacles") so the
    # crawl stops fetching new pages.
    def kill_threads
      @tentacles.each do |thread|
        Thread.kill(thread) if thread.alive?
      end
    end
  end
end

Here is an example of how to use it after you've added it to your crawling file. In the file where you run your crawl, add this to your anemone.on_every_page block:

@counter = 0
Anemone.crawl("http://stackoverflow.com", :obey_robots => true) do |anemone|
  anemone.on_every_page do |page|
    @counter += 1
    if @counter > 10
      anemone.kill_threads
    end
  end
end
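This is blunt but effective: as the monkeypatch above shows, @tentacles is Anemone's internal array of worker threads, so killing them stops the crawler from fetching further pages once the counter passes the limit.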

Source: https://github.com/chriskite/anemone/issues/24


So I found the :depth_limit option and that will be OK, but I would rather limit it by the number of links crawled.
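For reference, :depth_limit caps how many links deep the crawl goes from the start URL rather than the total page count. A minimal sketch (the URL and the limit of 1 are just for illustration):

require 'anemone'

# Follow links at most one hop away from the start page.
Anemone.crawl("http://example.com/", :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end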


I found your question while I was googling for Anemone.

I had the same problem, and with Anemone, what I did was:

As soon as I reach the URL limit that I want, I raise an exception. The whole Anemone block is inside a begin/rescue block.
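A minimal sketch of that pattern (the EnoughPages class, the URL, and the limit of 10 are my own choices for illustration, not from the original answer):

require 'anemone'

class EnoughPages < StandardError; end

visited = 0
begin
  Anemone.crawl("http://example.com/") do |anemone|
    anemone.on_every_page do |page|
      visited += 1
      # Bail out of the crawl once we've seen enough pages.
      raise EnoughPages if visited >= 10
    end
  end
rescue EnoughPages
  # The crawl was cut short on purpose; continue with what was collected.
end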

In your specific case I would take another approach: I would download the page that you want to parse and bind it to FakeWeb. I wrote a blog entry about it a long time ago; maybe it will be useful: http://blog.bigrails.com/scraper-guide.html
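If you go that route, the idea looks roughly like this, assuming the fakeweb gem is installed (the file name and URL below are placeholders):

require 'fakeweb'
require 'anemone'

# Serve a locally saved copy of the page instead of hitting the live site.
FakeWeb.register_uri(:get, "http://example.com/",
                     :body         => File.read("saved_page.html"),
                     :content_type => "text/html")

Anemone.crawl("http://example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url  # parsing runs against the canned response
  end
end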
