Exclude duplicate results from Solr query based on highlight snippets?

The scene:

I have indexed many websites using Nutch and Solr. I've implemented result grouping by site. My results output includes the page title, highlight snippets and URL. My issue is with the page navigation/copyright/company info bits that appear on many company sites.

A query for "solder", for example, may return 200+ results for a particular site -- but only a handful of the results are actually appropriate; perhaps the company's site structure includes "solder" on every page as part of their core business description, site navigation, etc. There are relevant results to see, but they're flooded by the irrelevant, repetitive matches from the other pages on the site.

The problem:

I've seen other postings asking how to prevent Nutch and Solr from indexing site headers, footers, navigation and the like, but with such a diverse group of sites, this approach just isn't feasible. What I'm observing, however, is that although the content for each result is significantly different, the highlighted snippets returned are 90-100% identical for the results I don't want. Observe:

Products | Alloy Information || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Support Site Map Lead-Free Solutions Halogen-Free Products Sales   Contacts Technical Articles Industry
http://www.--------.com/Products/AlloyInformation.aspx

Products | Chemicals & Cleaners || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Industrial Division   Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales
http://www.--------.com/Products/ChemicalsCleaners.aspx

Products | Rosin Based || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products   Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Products/RosinBased.aspx

Support | Engineering Guide || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Support   Products Services News Support Site Map Lead-Free Solutions   Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Support/EngineeringGuide.aspx

The Big Idea:

This leads me to ask whether I can filter or group results based on the highlighted snippets that are returned. I can't just group on the content because 1) the field is huge; and 2) the content is very different from page to page. If I could group, exclude or deduplicate results whose snippets were >85% identical, that would probably solve the problem. Perhaps some sort of post-processing step or some kind of tokenizer factory? Or a sort of idf computed over the search results rather than the entire document set?

This seems like it would be a fairly common problem, and perhaps I've just missed how to do it. Essentially this is Google's "To blah blah your search, we have hidden xxx similar results. Click here to show them" feature.

Thoughts?

I don't think there is any way of doing exactly what you are asking, other than post-processing the results yourself, which would not be very efficient for larger result sets.
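For what it's worth, such a post-processing step could look roughly like the sketch below. It assumes you have already pulled the results for one site group into a list of dicts with a "snippet" key (that shape is made up for illustration, it is not what Solr returns directly), and it uses Python's standard difflib to compare snippets word by word. The 0.85 threshold is just the figure from the question, and the pairwise comparison is quadratic in the worst case, so it only makes sense per page or per group of results.

```python
from difflib import SequenceMatcher


def snippet_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two snippets, compared word by word."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def collapse_similar(results, threshold=0.85):
    """Keep the first result of each cluster of near-identical snippets.

    `results` is a list of dicts with a "snippet" key (hypothetical shape).
    Returns (kept_results, hidden_count).
    """
    kept = []
    hidden = 0
    for result in results:
        is_duplicate = any(
            snippet_similarity(result["snippet"], seen["snippet"]) >= threshold
            for seen in kept
        )
        if is_duplicate:
            hidden += 1  # could also be collected for a "show similar results" link
        else:
            kept.append(result)
    return kept, hidden
```

The hidden count is what you would use for a "we have hidden N similar results" message. Note that paging totals reported by Solr will no longer match what you display.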

Maybe you should ask a different question if the documents being returned are actually quite different, even though the snippets are identical. If the documents are different, presumably there is value in showing them all, rather than de-duplicating.

You could try enhancing the search result display to show more information about the documents so that the user can discriminate amongst them - maybe not relying on highlighting, but showing some other parts of the document as well?

I really do think, though, that at the heart of the problem is the need to make matches found in site boilerplate less relevant than matches found elsewhere. Usually relevance ranking does a good job of this, because terms that are common across the index carry much less weight. If you are mixing documents from a wide range of different sites, though, you might find the effect less pronounced, since terms repeated on every page of one site could be quite rare on another. If your results are truly segmented by site, you might consider creating separate indexes (cores) for each site. This would have the effect of performing the relevance calculations in a site-specific way, and might help with this problem.

The stock Nutch distribution (not Solr) ships with a clustering mechanism. I don't really know how it works, but it did something that I had to remove. Have you looked at that?

Another idea that comes to mind: index the real content separately from the navigational snippets, and at search time apply a higher query weight to the 'real content' field.

That would pull forward pages that have 'solder' in their content, as opposed to pages that have 'solder' only in the navigation, while still keeping all pages just in case.
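As a rough sketch of the query side, assuming the two parts have already been split at index time into two fields (the names main_content and nav_content are made up here) and that the dismax/edismax query parser is used, the per-field weights can be passed through its qf parameter. The boost values are arbitrary starting points to tune, not recommendations.

```python
from urllib.parse import urlencode


def build_boosted_query(term, content_field="main_content", nav_field="nav_content",
                        content_boost=5.0, nav_boost=0.5):
    """Build a Solr query string that weights the content field above the navigation field.

    The field names and boost values are hypothetical; adjust them to your schema.
    """
    params = {
        "q": term,
        "defType": "edismax",
        # qf lists the fields to search, each with an optional ^boost
        "qf": f"{content_field}^{content_boost} {nav_field}^{nav_boost}",
    }
    return urlencode(params)


# Append the result to your core's select handler URL, e.g. .../select?<query string>
query_string = build_boosted_query("solder")
```

The hard part remains splitting navigation from content during the crawl, which this does not address.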

Hope I understood your problem correctly.