We are building an application which will require us to index data for each of our users so that we can provide full text search on their data. Here are some notable things about the application:
A) The data for every user is totally unrelated to every other user's data. This gives us a few advantages:
- we can keep our indexes small in size.
- merging/compacting a fragmented index will take less time.
- if some indexes become inaccessible for whatever reason (corruption?), only those users are affected. Other users are unaffected and the service remains available to them.
B) Each user can have a few different types of data. We want to keep each type in a separate folder, for the same reasons as above.
So, our index hierarchy will look something like:
/user1/type1/<index files>
/user1/type2/<index files>
/user2/type1/<index files>
/user3/type3/<index files>
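A minimal sketch of how the layout above could be derived in code. The function name index_dir and the root directory "/indexes" are my own assumptions for illustration, not something Lucene or Solr provides:

```python
from pathlib import PurePosixPath


def index_dir(root, user_id, data_type):
    """Return the index directory for one user/type pair (hypothetical helper)."""
    for part in (user_id, data_type):
        # Reject empty names and anything that could escape the root directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    return PurePosixPath(root) / user_id / data_type


print(index_dir("/indexes", "user1", "type1"))  # /indexes/user1/type1
```

Because every user/type pair maps to its own directory, a corrupted index can be rebuilt or taken offline without touching any other directory.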
C) Often, probably with every iteration, we'll add new "types" of data that can be indexed.
So we want an efficient, programmatic way to add schemas for different "types", and we would like to avoid having a fixed schema for indexing. I like Lucene's schema-less way of indexing.
D) Users can fire search queries which will search either:
- within a specific "type" for that user, or
- across all types for that user. In this case we want to fire a parallel query, as Lucene does with ParallelMultiSearcher.
E) We require real time update for the index. This is a must.
F) We are planning to shard our index across multiple machines. Here too, we want fault isolation: if a shard becomes inaccessible, only the users whose data resides on that shard are affected. Other users get uninterrupted service.
We were considering Lucene, Sphinx and Solr to do this. This is what we found:
- Sphinx: No efficient way to do A, B, C, F. Or is there?
- Lucene: Everything looks possible, as it is very low level. But we would have to write wrappers to do F and build a communication layer between the web server and the search server.
- Solr: Not sure if we can do A, B, C easily. Can we?
So, my question is: what is the best software for the above requirements? I am inclined towards Solr first and Lucene second, provided they can meet all the requirements.
I can't see Solr being able to handle A or B, as Solr's model is to have everything in one index (per shard core). Solr can handle C if you use the dynamic field types. Although Solr can do real time indexing, it is not as fast as Lucene (even with Embedded Solr, in my experience). This all points to Lucene being your only choice.
I think Solr might work really well for you here.
The key feature of Solr that will work well in your situation is the notion of cores. See http://wiki.apache.org/solr/CoreAdmin
One way you can implement this is to make each user/type combination a separate Solr core. This satisfies (A) and (B). The client can either direct the search at a single core, or it can direct the search at multiple cores at once (optionally across different Solr servers), which is what you want when you search across all types for a single user. This satisfies (D) and (F). Alternatively, you can have one core for each user, with a "type" field that you can filter on.
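To make the single-core versus multi-core search concrete, here is a rough sketch that only builds the request URL. It relies on Solr's "shards" request parameter for distributed search. The core naming scheme (user_type), the host names and the helper names are assumptions of mine, so adapt them to your setup:

```python
from urllib.parse import urlencode


def core_name(user_id, data_type):
    # Hypothetical naming convention: one core per user/type pair.
    return f"{user_id}_{data_type}"


def search_url(hosts_by_core, user_id, data_types, query):
    """Build a select URL for one type, or a distributed one for several types.

    hosts_by_core maps a core name to the "host:port" serving it (assumed layout).
    """
    cores = [core_name(user_id, t) for t in data_types]
    first = cores[0]
    params = {"q": query}
    if len(cores) > 1:
        # The receiving core fans the query out to every listed shard.
        params["shards"] = ",".join(f"{hosts_by_core[c]}/solr/{c}" for c in cores)
    return f"http://{hosts_by_core[first]}/solr/{first}/select?{urlencode(params)}"


hosts = {"user1_type1": "search1:8983", "user1_type2": "search2:8983"}
print(search_url(hosts, "user1", ["type1"], "hello"))
print(search_url(hosts, "user1", ["type1", "type2"], "hello"))
```

The first call targets a single core. The second lists both of the user's cores in "shards", so the two cores may live on different machines, which is how this covers (D) and (F) together.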
As for (C), Solr has the notion of dynamic fields. See http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
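As an illustration of how dynamic fields let you add new "types" without touching the schema, here is a small sketch that picks a field-name suffix from the Python value type. It assumes your schema.xml declares dynamicField patterns such as *_t, *_i, *_f, *_b and *_dt (similar to the ones in Solr's example schema) plus fixed "id" and "type" fields. The helper name and the suffix table are my own:

```python
from datetime import datetime

# bool must come before int, because bool is a subclass of int in Python.
SUFFIXES = [(bool, "_b"), (int, "_i"), (float, "_f"), (datetime, "_dt"), (str, "_t")]


def to_solr_doc(doc_id, data_type, fields):
    """Map arbitrary field names onto dynamic-field names (hypothetical helper)."""
    doc = {"id": doc_id, "type": data_type}
    for name, value in fields.items():
        for py_type, suffix in SUFFIXES:
            if isinstance(value, py_type):
                break
        else:
            raise TypeError(f"no dynamic field suffix for {name!r}")
        if isinstance(value, datetime):
            # Assumes the datetime is already in UTC.
            value = value.strftime("%Y-%m-%dT%H:%M:%SZ")
        doc[name + suffix] = value
    return doc


print(to_solr_doc("1", "note", {"title": "hi", "pages": 3, "draft": True}))
```

A new data type then only needs a new set of field names on the client side, as long as each field fits one of the declared patterns.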
As far as (E) goes, Solr doesn't have "true" real-time indexing yet. But if a lag of a few seconds is acceptable, then Solr can handle that.
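If a bounded lag is acceptable, one way to express it is the commitWithin attribute on Solr's XML add message, which asks Solr to commit the document within the given number of milliseconds. The sketch below only builds the message body. The 2000 ms value and the helper name are assumptions, and you would still need to POST the result to your core's update handler:

```python
import xml.etree.ElementTree as ET


def add_message(doc, commit_within_ms):
    """Build an <add> message asking Solr to commit within the given delay."""
    add = ET.Element("add", {"commitWithin": str(commit_within_ms)})
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


print(add_message({"id": "42", "title_t": "hello"}, 2000))
```

Whether a couple of seconds counts as "real time" for your users is something only you can decide. This does not make Solr's indexing truly real time.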