I'm working on a crawl, but before I crawl an entire website, I would like to shoot off a test of ten or so pages. So I was thinking something like the below would work, but I keep getting a NoMethodError....
Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    crawled_page.links.slice(0..10)
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
  end
end
NoMethodError (private method `select' called for true:TrueClass):
app/models/site.rb:14:in `crawl'
app/controllers/sites_controller.rb:96:in `block in crawl'
app/controllers/sites_controller.rb:95:in `crawl'
Basically I want a way to first crawl only 10 pages, but I don't seem to be understanding the basics here. Can someone help me out? Thanks!!
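For what it's worth, the error message hints at a likely cause (this is an assumption about Anemone's internals, not something confirmed above): focus_crawl expects its block to return the array of links to follow, and the crawler then calls select on that return value. In the code above, the block's last expression is page.check_for_term(...), which presumably returns true, so select ends up being called on true. A pure-Ruby sketch of that mismatch, using hypothetical stand-in blocks rather than a real crawl:

```ruby
# Minimal simulation of the failure (assumption: Anemone calls `select`
# on whatever the focus_crawl block returns, expecting an Array of links).
links = %w[/a /b /c /d /e]

# Like the question's block: the last expression (check_for_term) is true.
bad_block = lambda do |page_links|
  page_links.slice(0..10)
  true
end

# Ending the block with the sliced links hands them back to the crawler.
good_block = lambda do |page_links|
  page_links.slice(0..10)
end

begin
  bad_block.call(links).select { |l| l.start_with?("/") }
rescue NoMethodError => e
  puts "reproduced: #{e.class}"  # `select` called on true, as in the trace
end

puts good_block.call(links).size  # all five links survive the slice
```

If this assumption holds, making the sliced links the last expression of the focus_crawl block (and doing the page bookkeeping in on_every_page instead) would avoid the error.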
Add this monkeypatch to your crawling file.
module Anemone
  class Core
    def kill_threads
      @tentacles.each { |thread|
        Thread.kill(thread) if thread.alive?
      }
    end
  end
end
Here is an example of how to use it after you've added it to your crawling file. In the file you are running, add this counter logic to your anemone.on_every_page method:
@counter = 0
Anemone.crawl("http://stackoverflow.com", :obey_robots => true) do |anemone|
  anemone.on_every_page do |page|
    @counter += 1
    if @counter > 10
      anemone.kill_threads
    end
  end
end
Source: https://github.com/chriskite/anemone/issues/24
So I found the :depth_limit param, and that will be OK, but I would rather limit the crawl by number of links.
I found your question while I was googling for Anemone.
I had the same problem, and with Anemone what I did was: as soon as I reach the URL limit that I want, I raise an exception. The whole Anemone block is inside a begin/rescue block.
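A minimal sketch of that raise/rescue pattern (the exception class, limit constant, and variable names are made up for illustration, and a plain loop stands in for Anemone's on_every_page callback here so the sketch runs on its own):

```ruby
# Custom exception used purely as an early-exit signal for the crawl.
class CrawlLimitReached < StandardError; end

PAGE_LIMIT = 10
visited = []

begin
  # In the real code this loop body lives inside
  # `anemone.on_every_page do |page| ... end` within an Anemone.crawl block;
  # a numeric range stands in for the stream of crawled pages.
  (1..100).each do |n|
    visited << "page-#{n}"
    raise CrawlLimitReached if visited.size >= PAGE_LIMIT
  end
rescue CrawlLimitReached
  # Swallow the exception: the crawl simply stops early.
end

puts visited.size  # => 10
```

The rescue clause deliberately does nothing, so reaching the limit just unwinds out of the crawl block and execution continues normally after it.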
In your specific case I would take another approach. I would download the page that you want to parse and bind it to FakeWeb. I wrote a blog entry about it a long time ago; maybe it will be useful: http://blog.bigrails.com/scraper-guide.html