I am looking for a schema-less database to store roughly 10 TB of data on disk, ideally with a Python client. The suggested solution should be free for commercial use and have good read and write performance.
The main goal here is to store time-series data, comprising more than a billion records, accessed by timestamp.
Data would be stored in the following scheme:
KEY --> "FIELD_NAME.YYYYMMDD.HHMMSS"
VALUE --> [v1, v2, v3, v4, v5, v6] (v1..v6 are just floats)
For instance, suppose that:
FIELD_NAME = "TOMATO"
TIME_STAMP = "20060316.184356"
VALUES = [72.34, -22.83, -0.938, 0.265, -2047.23]
I need to be able to retrieve VALUE (the entire array) given the combination of FIELD_NAME & TIME_STAMP.
The query VALUES["TOMATO.20060316.184356"] would return the vector [72.34, -22.83, -0.938, 0.265, -2047.23]. Reads of arrays should be as fast as possible.
I also need a way to store (in-place) a scalar value within an array. Suppose that I want to set the 1st element of TOMATO at timestamp 2006/03/16 18:43:56 to 500.867. In such a case, I need a fast mechanism to do so -- something like:
VALUES["TOMATO.20060316.184356"][0] = 500.867 (this would update on disk)
Can something like MongoDB work? I will be using just one machine (no need for replication etc.), running Linux.
CLARIFICATION: only one machine will be used to store the database. Yet, I need a solution that will allow multiple machines to connect to the same database and update/insert/read/write data to/from it.
MongoDB is likely a good choice in terms of performance, flexibility and usability (it is easily approachable). However, large databases require careful planning, especially when it comes to backup and high availability. Without further insight into the project requirements, there is little to say about whether one machine is enough (look at replica sets and sharding if you need options to scale).
Update: based on your new information, this should be doable with MongoDB (test and evaluate it). Simply put: MongoDB can be the "MySQL" of the NoSQL databases. If you know SQL databases, you should be able to work with MongoDB easily, since it borrows a lot of ideas and concepts from the SQL world. Looking at your data model, it is trivial, and the data can be easily stored and retrieved. I suggest downloading MongoDB and walking through the tutorial.
A MongoDB instance can allow multiple machines to access it. You will, however, have to give the server special command-line arguments in order to allow it to do so. You should search through the MongoDB documentation; it is pretty comprehensive. The documentation for MongoDB's authentication model describes how to run Mongo in secure mode and how to restrict the IP ranges that can bind to it.
MongoDB will work. However, looking at your requirements, I would strongly recommend Redis.
Redis is a data structure store, where you can store your arrays as values and access them by key. It is easy to set up and use, and it is ridiculously fast. It works well as a single-machine server, and in a networked setup too.
There are excellent Python clients available for Redis, such as Redisco, redis-natives-py and redis-wrap, or the simplest, redis-py.
Another option to consider is Berkeley DB or Berkeley DB Java Edition. BDB is a C library, whereas BDB JE is a Java library. Both provide multiple APIs for storing data, including a key-value pair API (NoSQL), a Java Collections API and a Java Direct Persistence Layer (POJO-like) API.
Either library can certainly manage a 10 TB repository on a single system. They both provide HA capabilities that allow you to replicate the database (and any changes) to multiple systems. Reads can be sent to the master or any of the replicas (providing load balancing), while updates have to be sent to the master. We have customers who use Berkeley DB in this kind of setup today. Berkeley DB has been around for many years, and this is exactly the kind of application it handles well.
Disclaimer: I'm the product manager for Berkeley DB, so I'm a little biased. :-)