Custom Parser for Nutch (or open source .NET Crawler)

Developer https://www.devze.com 2023-03-07 19:10 Source: web

I have been using Nutch/Solr/SolrNet for my search solutions, and I must say it works a treat. On a new site I'm working on, I am using Master pages; as a result, content in the header and footer is getting indexed and distorts the results. For example, I have a link to the Contact Us page in the header. Now, when I search for 'Contact', the results include every page on the site.

Is there a customizable Nutch parser to which I can pass a div id, so that it only indexes the content inside that div?

Alternatively, are there any .NET-based crawlers that I can customize?

See https://issues.apache.org/jira/browse/NUTCH-585 and https://issues.apache.org/jira/browse/NUTCH-961

BTW, you'd get a more relevant audience by posting to the Nutch user list.

You can implement a Nutch filter (I like the Jericho HTML Parser for this) that uses DOM manipulation to extract only the parts of the page you need to index. Jericho's TextExtractor class will give you clean text (sans HTML tags) to use in your index. I usually save that data in custom fields.
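To make the idea concrete, here is a minimal sketch of the extraction step only: keep the text inside a div with a given id and drop everything else (header, footer). It is not a Nutch plugin and it does not use Jericho. To stay self-contained it swaps in the HTML parser that ships with the JDK (javax.swing.text.html), which is lenient and dated, so treat it as an illustration rather than production code. The class name DivTextExtractor, the method extract, and the sample page are all made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of Nutch or Jericho.
final class DivTextExtractor {

    /** Returns the tag-free text inside every div whose id matches divId (nested divs included). */
    static String extract(String html, String divId) {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // 0 means we are outside the target div; >0 is the div nesting depth inside it.
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                if (depth > 0) {
                    depth++; // a nested div inside the target div
                } else {
                    Object id = attrs.getAttribute(HTML.Attribute.ID);
                    // Compared case-insensitively as a precaution, in case this
                    // parser normalizes the case of attribute values.
                    if (id != null && divId.equalsIgnoreCase(id.toString())) {
                        depth = 1;
                    }
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && depth > 0) {
                    depth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    if (text.length() > 0) {
                        text.append(' '); // separate text runs from different elements
                    }
                    text.append(data);
                }
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        // Made-up page: only the "content" div should reach the index.
        String page = "<html><body>"
                + "<div id=\"header\"><a href=\"/contact\">Contact Us</a></div>"
                + "<div id=\"content\"><h1>Products</h1><p>We sell widgets.</p></div>"
                + "<div id=\"footer\">Copyright</div>"
                + "</body></html>";
        System.out.println(extract(page, "content"));
    }
}
```

In a real setup the same depth-tracking logic would sit inside the Nutch filter, with Jericho (or whichever parser you prefer) doing the parsing, and the returned string would be written to the custom field that Solr indexes.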