First post on StackOverflow, but I've always looked to this site as a great source of shared knowledge, and I'm excited to see what comes up from this question.
As I feel I have now reached the limits of what I can do with SQL indexes, statistics and full-text search, I'm currently looking for a search library that can provide us with the functionality we need. I'm not averse to writing it myself (and open-sourcing it if I can get the boss's approval), but I would prefer to find something open-source that already exists, natch.
What we're after is a search engine that can provide statistics about the results that are matched when a user searches for a specific keyword. Let's say, for example, that we were talking about a database of products in an online shop. We need to be able to return statistics about how many products there are that match a given set of keywords (and also be able to filter this result set by price, category, etc.), as well as the total number of products in stock (assuming that this is stored in a field in the product table). All the search engines that I have found return the top n results, and if you want statistics about the size of the result set, you need to enumerate the whole set. Even if you didn't, you would still need to enumerate it to retrieve the total number of products in stock.
Is there anything anyone knows of that is capable of this functionality? As I say, I'm happy to get my hands dirty and either build it myself, or modify the functionality of something like Lucene, but I have not been able to find anything appropriate on Google.
Thanks in advance guys!
You might take a look at Solr, which is a faceted search engine built on top of Lucene. Solr will count lots of different things for you, in addition to doing full-text search. It is good at handling combinations of structured and full-text data.
Something to keep in mind here is that "enumerating all results" can mean very different things: select count(*) is very different from doing all the joins etc. required to actually get each object. This is true in Lucene as well as relational databases. So I wouldn't worry about the mere fact that the documentation says "we enumerate all results."
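To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The product table, its columns, and the match_stats helper are all made up for illustration, and a plain LIKE stands in for real full-text matching. The point is only that the count and the stock total come back as one aggregate row, without fetching each matching product.

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL, stock INTEGER)"
)
conn.executemany(
    "INSERT INTO product (name, category, price, stock) VALUES (?, ?, ?, ?)",
    [
        ("red widget", "widgets", 9.99, 12),
        ("blue widget", "widgets", 14.50, 0),
        ("widget polish", "care", 4.25, 40),
        ("green gadget", "gadgets", 19.00, 7),
    ],
)


def match_stats(conn, keyword, max_price):
    """Return the match count and total stock in a single aggregate query.

    LIKE is only a stand-in for full-text matching (an assumption of this
    sketch); the database still scans the matches, but no rows are returned.
    """
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(stock), 0) "
        "FROM product WHERE name LIKE ? AND price <= ?",
        (f"%{keyword}%", max_price),
    ).fetchone()
    return {"matches": row[0], "in_stock": row[1]}


print(match_stats(conn, "widget", 10.00))
```

The engine still has to visit every matching row internally, but that is far cheaper than joining and materializing each product just to count it.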
It's been my experience that the standard faceting of Solr scales to what 99% of people need. If you are in that 1% (i.e. you have a huge database) then I can suggest some ways of guessing the results which can be quicker. But Solr will probably work for you.
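As a rough sketch of what such a request could look like, the snippet below only builds a Solr select URL and does not contact a server. The core name (products), the field names (price, category, stock), and the helper build_solr_stats_url are assumptions for illustration. The query parameters themselves (q, fq, rows, facet, facet.field, stats, stats.field, wt) are standard Solr ones.

```python
from urllib.parse import urlencode


def build_solr_stats_url(base_url, keywords, min_price, max_price):
    """Build a Solr query that asks for counts and stats but no documents.

    Field names are hypothetical; adjust them to the actual schema.
    """
    params = [
        ("q", keywords),                                # full-text query
        ("fq", f"price:[{min_price} TO {max_price}]"),  # filter by price range
        ("rows", 0),                                    # return no documents, only totals
        ("facet", "true"),
        ("facet.field", "category"),                    # per-category match counts
        ("stats", "true"),
        ("stats.field", "stock"),                       # sum/min/max of stock over matches
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr instance with a core named "products".
url = build_solr_stats_url("http://localhost:8983/solr/products", "widget", 0, 50)
print(url)
```

With a request along these lines, the total number of matches should come back as numFound in the response, and the stock total as the sum reported by the stats component, so the client never has to page through the full result set.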
"As I feel I have now reached the limits of what I can do with SQL indexes"
Are you sure? I ask because if you are using MySQL, you might want to look into the full-text search functionality of PostgreSQL, especially when you combine it with the btree_gin and trigram (pg_trgm) modules. Its EXPLAIN output is also very good, and lets you extract reasonable row estimates even from highly complex queries.