I am using Apache Nutch for the first time. How can I store data in a MySQL database after crawling? I want to be able to easily use the data in other web applications.
I found a related question, but I don't clearly understand which part of the code is going to be replaced by the MySQL connector. Please help with a short code example.
Get the source from http://mirror.nyi.net/apache//nutch/apache-nutch-1.2-src.zip
Open the org.apache.nutch.crawl.Crawl class in your editor.
Look up the variable Path crawlDb = new Path(dir + "/crawldb");
This variable gives a hint on where to replace the code in order to get your own CustomMySQLCrawl class.
The persistence happens during this call: crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
So that is where you should save the data to the database. You might want to consider integrating Hibernate at this point.
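As a rough illustration of the "save it to the database" step, here is a minimal plain-JDBC sketch (Hibernate would be an alternative). The class name PageStore, the table pages and its columns url and content are all hypothetical; only the java.sql calls are standard. How you obtain the URL and page text at that point in Crawl is not shown and depends on your own code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PageStore {
    // Hypothetical table, e.g.: CREATE TABLE pages (url TEXT, content LONGTEXT)
    public static final String INSERT_SQL =
        "INSERT INTO pages (url, content) VALUES (?, ?)";

    // Inserts one crawled page and returns the number of affected rows.
    // The caller owns the Connection (opened via DriverManager or a pool).
    public static int save(Connection conn, String url, String html) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, url);   // bound parameters avoid SQL injection
            ps.setString(2, html);
            return ps.executeUpdate();
        }
    }
}
```

With the MySQL Connector/J jar on the classpath, the connection would typically come from DriverManager.getConnection with a jdbc:mysql:// URL. Note that a plain INSERT adds a new row every time a page is re-crawled, so you may prefer a unique key on url and an update strategy.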
I see two possibilities: either take the content from the Lucene index created by Nutch at the end of the crawl job (I think it is removed in Nutch 2.0), or take the data from the segment at each iteration.
If what is put in the Lucene index is enough for you, that way may be easier. But if you need more, each segment contains everything that was fetched by Nutch.
If you use Nutch's binary executable, run the readseg command (bin/nutch readseg -dump) after crawling. It will give you a huge file which contains all the raw HTML and other info. You can then parse it and save the data you need to the database.
If you are willing to run Nutch in Eclipse, you should add some code to the Fetcher class.
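To give an idea of the "parse the dump" step, here is a small sketch that pulls the URLs out of a readseg dump. It assumes each record in the dump has a line starting with "URL:: ", which is what I remember from Nutch 1.x dumps; check your own dump file before relying on it. The class name SegmentDumpUrls is made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class SegmentDumpUrls {
    // Assumed record marker in the dump text; verify against your own output.
    private static final String URL_PREFIX = "URL:: ";

    // Reads the dump line by line and collects the value of every URL line.
    public static List<String> extractUrls(Reader dump) throws IOException {
        List<String> urls = new ArrayList<>();
        BufferedReader in = new BufferedReader(dump);
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(URL_PREFIX)) {
                urls.add(line.substring(URL_PREFIX.length()).trim());
            }
        }
        return urls;
    }
}
```

Pass it a FileReader over the dump file. Since the dump is large, reading it line by line like this keeps memory use low. A fetched page could itself contain a line starting with the same prefix, so treat this as a starting point only.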
pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);
Add a simple call to your own database-writing code after these lines in the Fetcher class. You can get the raw HTML with:
content.getContent();
This returns a byte array representation of the HTML file; convert it to a String and save it to your database. You might run into character encoding problems; see "Nutch with UTF-8" to configure Nutch. However, the problem is generally caused by Eclipse's encoding. To overcome that, take the substring of the content which includes the "charset" value and:
String yourContent = new String(content.getContent(), encodingYouFound);
The encoding here is a String, so it is enough to retrieve it from the content. If you can't, because some sites do not declare a charset attribute, fall back to a general encoding such as UTF-8.
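Here is a minimal sketch of that charset step. It looks for a "charset=..." declaration near the start of the raw bytes and falls back to UTF-8 when none is found or the name is not supported. The class name CharsetSniffer, the 4096-byte window and the regular expression are my own choices, not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8'.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 if none is declared or usable.
    public static Charset detect(byte[] raw) {
        int len = Math.min(raw.length, 4096);
        // ISO-8859-1 maps every byte to a char, so ASCII markup stays readable.
        String head = new String(raw, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw page bytes with the detected charset.
    public static String decode(byte[] raw) {
        return new String(raw, detect(raw));
    }
}
```

In the Fetcher you would then call something like CharsetSniffer.decode(content.getContent()) and pass the resulting String to your database code.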