How can I add URLs to a crawler4j crawl at arbitrary times while it is running?

Developer https://www.devze.com 2023-04-13 09:10 Source: web
I'm working with crawler4j (http://code.google.com/p/crawler4j/).

A simple test crawl of a single site succeeded, but I want to add URLs at arbitrary times while the crawl is in progress.

The code below throws the following exception the second time it constructs a CrawlController. How can I add URLs while a crawl is running, or reuse a CrawlController? (Reusing a single CrawlController without re-constructing it also failed.)

Any ideas? Or is there another good crawler for Java?

Edit: since this might be a bug, I also posted it to the crawler4j issue tracker: http://code.google.com/p/crawler4j/issues/detail?id=87&thanks=87&ts=1318661893

private static final ConcurrentLinkedQueue<URI> urls = new ConcurrentLinkedQueue<URI>();
...
URI uri = null;
while (true) {
    uri = urls.poll();
    if (uri != null) {
        CrawlController ctrl = null;
        try {
            ctrl = new CrawlController("crawler");
            ctrl.setMaximumCrawlDepth(3);
            ctrl.setMaximumPagesToFetch(100);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        ctrl.addSeed(uri.toString());
        ctrl.start(MyCrawler.class, depth);
    }else{
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

java.lang.IllegalThreadStateException
    at java.lang.Thread.start(Thread.java:638)
    at edu.uci.ics.crawler4j.crawler.PageFetcher.startConnectionMonitorThread(PageFetcher.java:124)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:77)


As of version 3.0, crawler4j supports this. See http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/ for a usage example.

Basically, you need to start the controller in non-blocking mode:

controller.startNonBlocking(MyCrawler.class, numberOfThreads);

Then you can add your seeds in a loop. Note that the controller only needs to be started once, not on every iteration of the loop.
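
The seed-adding loop might look roughly like the sketch below. It is only an illustration of the pattern, not crawler4j code: SeedSink is a hypothetical one-method interface standing in for the controller's addSeed call, and SeedFeeder is a hypothetical helper that holds the queue from the question. Neither name exists in crawler4j.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical stand-in for the single controller call this sketch needs. */
interface SeedSink {
    void addSeed(String url);
}

/** Hypothetical helper: forwards queued URIs to a controller that is already running. */
final class SeedFeeder {
    // Thread-safe queue, so other threads can offer URLs at any time.
    private final Queue<URI> urls = new ConcurrentLinkedQueue<>();

    /** Called from any thread whenever a new URL turns up. */
    void offer(URI uri) {
        urls.add(uri);
    }

    /** Hands every queued URI to the sink and returns how many were forwarded. */
    int drainTo(SeedSink sink) {
        int forwarded = 0;
        URI uri;
        while ((uri = urls.poll()) != null) {
            sink.addSeed(uri.toString());
            forwarded++;
        }
        return forwarded;
    }
}

class SeedFeederDemo {
    public static void main(String[] args) {
        SeedFeeder feeder = new SeedFeeder();
        List<String> received = new ArrayList<>(); // plays the role of the controller

        feeder.offer(URI.create("http://example.com/a"));
        feeder.offer(URI.create("http://example.com/b"));

        // No controller is constructed or restarted here; seeds are only forwarded.
        int forwarded = feeder.drainTo(received::add);
        System.out.println("forwarded " + forwarded + " seed(s): " + received);
    }
}
```

In a real program the sink would be a thin wrapper around controller.addSeed(...) on a controller that has already been started once with startNonBlocking, and drainTo would be called periodically, for instance every few seconds as in the question's loop.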
