NoSQL or YesSQL

Developers https://www.devze.com 2023-02-07 14:50 Source: the web

I have a huge dictionary of words:

"word1" => [value1]
"word2" => [value2]
"word3" => [value3, value2]
...
"word400000000" => [value455, value3435, ..., value3423]

The number of words is really big.

Now I want to be able to retrieve, really fast, all the values pointed to by a given word. The word is a string value.

What are the best tools to use? I thought of a simple DB solution, but the DBA guys said that it would not be really fast.

So, before I open Cormen's book, are there any ready-made solutions for this problem?

Look at key/value storage engines such as Berkeley DB. They are very fast at that sort of thing.
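To give a rough idea of what the key/value approach looks like, here is a minimal sketch using Python's standard-library `dbm` module as a stand-in for an on-disk store. It is not Berkeley DB itself. The file name, the helper names `put_values` and `get_values`, and the JSON encoding of the value lists are all assumptions made for the example.

```python
import dbm
import json
import os
import tempfile


def put_values(db, word, values):
    """Store the list of values for a word as JSON-encoded bytes."""
    db[word.encode("utf-8")] = json.dumps(values).encode("utf-8")


def get_values(db, word):
    """Return the list of values for a word, or an empty list if absent."""
    raw = db.get(word.encode("utf-8"))
    return json.loads(raw) if raw is not None else []


# Hypothetical database location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "words.db")

with dbm.open(path, "c") as db:  # "c": create the database if it is missing
    put_values(db, "word1", ["value1"])
    put_values(db, "word3", ["value3", "value2"])

with dbm.open(path, "r") as db:  # reopen read-only and look words up by key
    word3_values = get_values(db, "word3")
    missing_values = get_values(db, "no-such-word")
```

Each lookup is a single keyed read, which is the access pattern the question asks for. Which on-disk format `dbm` picks depends on the Python build.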

In RDBMSs (YesSQL) you will most probably search values with the LIKE or = operators over all records, i.e. without an index on that column the search will take O(n). What you actually need is a data structure called an inverted index, which allows you to find the list of needed values in O(1). For a description of the structure and algorithms see the Wikipedia article; for ready-to-use tools keep reading.

There are plenty of implementations of the inverted index in search engines like Lucene/Solr and Sphinx (which, by the way, supports several databases as data sources), and also in some key-value stores like Berkeley DB or Apache Cassandra. The distinction between search engines and key-value stores is this:

Search engines implement the inverted index more directly (AFAIK, key-value DBs use BigTable-like structures that are much more complex than the inverted index itself).
Search engines have plenty of tools for text analysis (parsing, stemming). I don't know if you actually need that, but if you do, use a search engine.
Key-value DBs are real databases, i.e. unlike search engines they have real data types, not only strings. Moreover, some of these DBs (e.g. Berkeley DB) can store a programming language's native data types without converting them to any internal format. So, if you need a real database with all its features, use a key-value store.

Also note that an inverted index is a really simple structure, so you can easily implement it yourself if none of the previous options suits you.
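To show how little is involved, here is a minimal in-memory sketch of an inverted index in Python. The `documents` mapping and the function name `build_inverted_index` are hypothetical, and plain whitespace splitting stands in for the real text analysis a search engine would do.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it.

    `documents` is a hypothetical {doc_id: text} mapping; splitting on
    whitespace stands in for real text analysis (parsing, stemming).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


documents = {
    1: "fast key value store",
    2: "fast search engine",
    3: "search engine with inverted index",
}
index = build_inverted_index(documents)

fast_docs = index.get("fast", [])      # [1, 2]
search_docs = index.get("search", [])  # [2, 3]
```

The index is an ordinary hash map, so a lookup by word takes expected constant time. For 400 million words it would have to live on disk or be sharded, which is where the ready-made tools above come in.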

It really depends on what behavior you want. If you just want to be able to do an exact text search, then a hash table is probably a really great idea. It has expected O(1) lookup, which is about as fast as you're going to get.

If you need the elements in sorted order (for example, so you can iterate across them in a reasonable order), then one of the myriad balanced search trees might be a good candidate; for example, a red-black tree or an AVL tree.
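Python's standard library has no red-black or AVL tree, so the sketch below swaps in a sorted list searched with `bisect`. It gives the same O(log n) lookup and in-order traversal, but inserting into it is O(n), unlike a balanced tree. The sample data and the function names are made up for illustration.

```python
import bisect

# Keys kept in sorted order; values[i] belongs to words[i].
words = ["apple", "apricot", "banana", "cherry"]
values = [["v1"], ["v2", "v7"], ["v3"], ["v4"]]


def lookup(word):
    """Binary-search for an exact word; return its values or an empty list."""
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return values[i]
    return []


def words_with_prefix(prefix):
    """Return all stored words starting with `prefix`, in sorted order."""
    i = bisect.bisect_left(words, prefix)
    result = []
    # Words sharing a prefix are contiguous in a sorted sequence.
    while i < len(words) and words[i].startswith(prefix):
        result.append(words[i])
        i += 1
    return result
```

The prefix scan is the kind of query that sorted order makes possible and a plain hash table does not.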

If you're working with a huge data set that can't all fit into main memory, then a very good choice might be a B-tree, which is a type of balanced multiway (not binary) search tree that minimizes the number of disk reads required to find a given element. Most database systems use some flavor of B-trees for their lookups.
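One easy way to try a B-tree-backed lookup without writing the tree yourself is SQLite, which stores its tables and indexes as B-trees. The sketch below uses Python's standard `sqlite3` module. The table name `postings`, its columns, and the helper `values_for` are assumptions for the example, and the in-memory database would be replaced by a file path for real data.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; pass a file path for on-disk data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (word TEXT NOT NULL, value TEXT NOT NULL)")
# The index on `word` is what turns a lookup into a B-tree search
# instead of a scan over all rows.
conn.execute("CREATE INDEX idx_postings_word ON postings (word)")
conn.executemany(
    "INSERT INTO postings (word, value) VALUES (?, ?)",
    [
        ("word1", "value1"),
        ("word2", "value2"),
        ("word3", "value3"),
        ("word3", "value2"),
    ],
)
conn.commit()


def values_for(word):
    """Return every value stored for `word`, in insertion order."""
    rows = conn.execute(
        "SELECT value FROM postings WHERE word = ? ORDER BY rowid", (word,)
    )
    return [row[0] for row in rows]
```

With the index in place, an exact-match query costs O(log n) page reads rather than a full scan.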

You can use Cassandra (http://cassandra.apache.org/). It is easy to get started with, has plenty of documentation, and is a really fast solution for your problem.

Hope this helps.

If you know that you will only want to search for values based on words and not the other way around, use a simple Key-Value store. Maybe Redis would be best.

If you think you will ever need to search based on the values, then you'll likely need Secondary Indices or off-line MapReduce jobs. Maybe Cassandra would be best.
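To make the "search based on the values" case concrete, here is a small in-memory sketch of what a secondary index amounts to: a second mapping from each value back to the words that point to it. This is plain Python, not Cassandra's API, and the name `build_reverse_index` and the sample data are invented for the example.

```python
from collections import defaultdict


def build_reverse_index(word_to_values):
    """Build a value -> [words] mapping from a word -> [values] mapping."""
    reverse = defaultdict(list)
    for word, values in word_to_values.items():
        for value in values:
            reverse[value].append(word)
    return dict(reverse)


word_to_values = {
    "word1": ["value1"],
    "word2": ["value2"],
    "word3": ["value3", "value2"],
}
value_to_words = build_reverse_index(word_to_values)

words_for_value2 = value_to_words.get("value2", [])  # ["word2", "word3"]
```

The cost is that every write now has to update both mappings. A database's secondary indices or an offline batch job handle that for you.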
