Does anyone have experience with the Niocchi library? I start crawling with a domain URL. In the Worker method processResource(), I parse the resource I get and extract all internal links on the page, and I need to add them to the crawl. But I can't find how. Should I add them to the URLPool, the ResourcePool, or somewhere else?
Thanks!
You can add them to an existing URLPool. The existing URLPool implementations are not expandable, so you have to create your own URLPool class that is expandable. I called my class ExpandableURLPool.
The URLPool.setProcessed method is called by the framework upon completion of processing, and it is there that you can add additional URLs to the URL list. I will follow with an example, but first, the URLPool documentation states:
setProcessed(Query) is called by the crawler to inform the URLPool when a Query has been crawled and its resource processed. This is typically used by the URLPool to check the crawl status and log the error in case of a failure or to get more URL to crawl in case of success. A typical example where getNextQuery() returns null but hasNextQuery() returns true is when the URLPool is waiting for some processed resources from which more URL to crawl have been extracted to come back. Check the urlpools package for examples of implementation.
This implies that the tricky part in your implementation of ExpandableURLPool is that the hasNextQuery method should return true if there is an outstanding query being processed that MAY result in new URLs being added to the pool. Similarly, getNextQuery must return null in cases where there is an outstanding query that has not finished yet and MAY result in new URLs being added to the pool. [I dislike the way niocchi is put together in this regard]
Here is my very preliminary version of ExpandableURLPool:
class ExpandableURLPool implements URLPool {

    // URLs to crawl; new URLs discovered by workers are appended via addURL().
    List<String> urlList = new ArrayList<String>();
    int cursor = 0;
    // Queries handed out but not yet reported back through setProcessed().
    int outstandingQueries = 0;

    public ExpandableURLPool(Collection<String> seedURLs) {
        urlList.addAll(seedURLs);
    }

    @Override
    public boolean hasNextQuery() {
        // There may still be work either because unvisited URLs remain or
        // because an outstanding query could add more URLs when it completes.
        return cursor < urlList.size() || outstandingQueries > 0;
    }

    @Override
    public Query getNextQuery() throws URLPoolException {
        try {
            if (cursor >= urlList.size()) {
                return null;
            } else {
                outstandingQueries++;
                return new Query(urlList.get(cursor++));
            }
        } catch (MalformedURLException e) {
            throw new URLPoolException("invalid url", e);
        }
    }

    @Override
    public void setProcessed(Query query) {
        outstandingQueries--;
    }

    public void addURL(String url) {
        urlList.add(url);
    }
}
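A minimal sketch of how the pool might be seeded (the seed URL is a placeholder; the surrounding Crawler setup is omitted because it depends on your Niocchi configuration):

    // Hypothetical seeding of the pool before starting the crawl.
    List<String> seeds = Arrays.asList("http://www.example.com/");
    ExpandableURLPool pool = new ExpandableURLPool(seeds);

Note that urlList is a plain ArrayList; if your workers run on separate threads, access through addURL, getNextQuery and setProcessed would need to be synchronized.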
I also created a Worker class, derived from DiskSaveWorker to test the above implementation:
class MyWorker extends org.niocchi.gc.DiskSaveWorker {

    Crawler mCrawler = null;
    ExpandableURLPool pool = null;
    // Only used by this test: limits how many extra URLs get injected.
    int maxExpansion = 10;

    public MyWorker(Crawler crawler, String savePath, ExpandableURLPool aPool) {
        super(crawler, savePath);
        mCrawler = crawler;
        pool = aPool;
    }

    @Override
    public void processResource(Query query) {
        super.processResource(query);
        // The following is a test: inject a fixed URL back into the pool.
        if (--maxExpansion >= 0) {
            pool.addURL("http://www.somewhere.com");
        }
    }
}
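To answer the original question more directly, processResource() is also where you would extract the links from the fetched page and feed them back through addURL(). How you obtain the page body from the Query depends on the Resource implementation you have configured, so that part is not shown here; the regex-based extraction below is only illustrative (an HTML parser such as jsoup would be more robust):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative helper: pulls absolute href values out of an HTML string.
    public class LinkExtractor {

        private static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

        public static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<String>();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }

        public static void main(String[] args) {
            String html = "<a href=\"http://example.com/a\">a</a> <a href=\"http://example.com/b\">b</a>";
            // Prints: [http://example.com/a, http://example.com/b]
            System.out.println(extractLinks(html));
        }
    }

In MyWorker.processResource() you would then call pool.addURL(url) for each extracted link, after filtering for internal links and for URLs you have already queued.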