nutch
Different pages to different Nutch cores (within the same domain)
How can I instruct Nutch to treat page #1 as belonging to one core and page #2 as belonging to a different core (both pages are from the same domain)? [Details]
2023-04-12 10:35 Category: Q&A
Exploring Nutch over Hadoop
What can I do with Hadoop and Nutch used as a search engine? I know that Nutch is used to build a web crawler, but I'm not getting the full picture. Can I use MapReduce with Nutch and [Details]
2023-04-08 02:50 Category: Q&A
Do the methods cancel() and interrupt() do the same job?
I read the source of org.apache.nutch.parse.ParseUtil.runParser(Parser p, Content content). Do these two method calls do the same thing: [Details]
2023-04-05 05:35 Category: Q&A
Simple Nutch 1.3/Solr index explanation
After much searching, it doesn't seem like there's any straightforward explanation of how to use Nutch 1.3 with Solr. [Details]
2023-04-05 03:00 Category: Q&A
Exclude duplicate results from Solr query based on highlight snippets?
The scene: I have indexed many websites using Nutch and Solr. I've implemented result grouping by site. My results output includes the page title, highlight snippets and URL. My issue is with the page [Details]
2023-04-04 16:34 Category: Q&A
Set up Nutch 1.3 and Hadoop
I am a newbie to Nutch and Hadoop and am trying to follow the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial. [Details]
2023-04-01 20:39 Category: Q&A
Nutch on EMR: problem reading from S3
Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR. To do this I specify an input directory from S3. I get the following error: [Details]
2023-03-31 20:16 Category: Q&A
Nutch crawl path
I would like to know how to make Nutch crawl not only the domain that I specified, but also the directory path within the domain that I specified. I know that you can configure this infor [Details]
2023-03-29 17:27 Category: Q&A
Use Nutch to index my local HTML files
I have a lot of HTML files on my hard disk and want to index them with Nutch, but as far as I know Nutch only takes URLs and indexes them along with the pages linked from those URLs. [Details]
2023-03-29 16:46 Category: Q&A
Nutch 1.2 - Why won't Nutch crawl URLs with query strings?
I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in the crawl-urlf [Details]
2023-03-28 07:06 Category: Q&A