Is anyone aware of any links, papers, presentations, or blog posts that describe a large-scale full-text search engine built upon a distributed key/value store?
I'm particularly interested in the organization of the index. What, exactly, is the data structure? Where and how are dictionaries and postings stored? What is the workflow for query processing? How are queries handled in such a way that it's not necessary to haul massive amounts of data across the network?
I gather that Blekko is built this way. I'd like to know what they, or their competitors, actually did.
I'm not aware of a blog post or article that answers your question exactly. However, here are some resources that I think are relevant, and I hope they help you distill an answer.
First, there are Jeff Dean's keynotes on the evolution of Google's architecture:
- http://research.google.com/people/jeff/WSDM09-keynote.pdf
- http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Next, there's an open source search engine on top of a K-V store called Lucandra - as the name suggests, Lucene on top of Cassandra, both being Apache projects.
- http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
To understand how Lucandra works, check out the implementation and the presentations that describe how the Lucene index is stored in Cassandra.
Similarly, you can look at how Lucene and HBase coexist. Here's a link to the Apache issue/patch that integrates a search layer combining the two:
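To make the general idea concrete, here is a minimal sketch of an inverted index laid out on a key/value store. It is not Lucandra's actual schema: the `KVStore` class is a hypothetical in-memory stand-in for a distributed store, and the `term:` key prefix, `index_document`, and `search_and` are names I made up for illustration. Each term is a key and its value is the sorted postings list, so a query only fetches the lists for its own terms.

```python
class KVStore:
    """Hypothetical in-memory stand-in for a distributed key/value store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def tokenize(text):
    # Naive whitespace tokenizer; a real engine would use a proper analyzer.
    return text.lower().split()


def index_document(store, doc_id, text):
    # One key per term; the value is that term's sorted list of doc ids.
    for term in set(tokenize(text)):
        key = "term:" + term
        postings = store.get(key) or []
        if doc_id not in postings:
            postings.append(doc_id)
            postings.sort()
        store.put(key, postings)


def search_and(store, query):
    # Fetch only the postings for the query terms, then intersect them,
    # starting from the shortest list to keep intermediate results small.
    terms = tokenize(query)
    if not terms:
        return []
    lists = [store.get("term:" + t) or [] for t in terms]
    lists.sort(key=len)
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)
        if not result:
            break
    return sorted(result)
```

For example, after indexing a few documents, `search_and(store, "key value")` returns the ids of the documents that contain both terms. In a real deployment the postings for a frequent term would be far too large for a single value, which is where the partitioning and compression questions come in.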
- http://mail-archives.apache.org/mod_mbox/hbase-issues/201104.mbox/%3C1865485299.35732.1302031145872.JavaMail.tomcat@hel.zones.apache.org%3E
Here is a similar article for Redis:
- http://playnice.ly/blog/2010/05/05/a-fast-fuzzy-full-text-index-using-redis/
Next, check out the paper "Operational Requirements for Scalable Search Systems":
- http://www.ir.iit.edu/~abdur/publications/p435-chowdhury.pdf
The CIS lab has some excellent research papers on the subject that you should check out:
- http://cis.poly.edu/westlab/publications.html
For the general search engine background that the resources above assume, here are links to books that will help:
- http://ir.iit.edu/~ophir/pub.html
- http://www.search-engines-book.com/
- http://www.ir.uwaterloo.ca/book/
- http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Google MapReduce will probably interest you a great deal.
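Building an inverted index is one of the example applications in the MapReduce paper. As a rough single-process sketch of that idea (the function names `map_phase`, `shuffle`, `reduce_phase`, and `build_index` are my own, and a real job would run the phases across many machines):

```python
from collections import defaultdict


def map_phase(doc_id, text):
    # Emit one (term, doc_id) pair per distinct term in the document.
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    # Produce the sorted, de-duplicated postings list for one term.
    return term, sorted(set(doc_ids))


def build_index(docs):
    # docs maps doc_id -> text; the result maps term -> postings list.
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
```

Calling `build_index` on a dict of documents yields a term-to-postings mapping that could then be written out to whatever store serves the queries.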