I heard about the Cassandra database engine a few days ago and have been looking for good documentation on it. From what I have read, Cassandra scales better than other data engines. I also looked at Amazon SimpleDB and Google Datastore, but SimpleDB has a limit of 10GB/table and Google Datastore is slower than Amazon SimpleDB, so I prefer not to use either of them. To scale our site, especially for high write rates with massive data, I would like to use Cassandra as our data engine.
But before I start using Cassandra, I am confused about how to handle complex data with Cassandra. I am giving our MySQL database structure below. Please read it and give me a good suggestion.
Users Table
  hasColumn ID Primary
  hasColumn email Unique
  hasColumn FirstName
  hasColumn LastName

Category Table
  hasColumn ID Primary
  hasColumn Parent
  hasColumn Category

Posts Table
  hasColumn ID Primary
  hasColumn UID Index, foreign key linked to Users->ID
  hasColumn CID Index, foreign key linked to Category->ID
  hasColumn Title
  hasColumn Post Index
  hasColumn PunDate

Comments
  hasColumn ID Primary
  hasColumn UID Index, foreign key linked to Users->ID
  hasColumn PID Index, foreign key linked to Posts->ID
  hasColumn Comment

User Group
  hasColumn ID Primary
  hasColumn Name

UserToGroup Table (for many-to-many relation only)
  hasColumn UID foreign key linked to Users->ID
  hasColumn GID foreign key linked to Group->ID

Finally, for your information, I would like to use the SimpleCassie PHP class (http://code.google.com/p/simpletools-php/), so it would be very helpful if you could give me an example using SimpleCassie.
From the data model reference on the Cassandra wiki:
Unlike with relational systems, where you model entities and relationships and then just add indexes to support whatever queries become necessary, with Cassandra you need to think about what queries you want to support efficiently ahead of time, and model appropriately. Since there are no automatically-provided indexes, you will be much closer to one ColumnFamily per query than you would have been with tables:queries relationally. Don't be afraid to denormalize accordingly;
A good article here.
I hope it helps you.
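As a rough illustration of "one ColumnFamily per query", here is a minimal sketch of how the UserToGroup many-to-many table from the question could be denormalized. It uses plain Python dictionaries as in-memory stand-ins rather than SimpleCassie or a real Cassandra client, and the names user_groups, group_users and add_user_to_group are hypothetical. The idea is simply that each membership is written twice, once for each query it has to answer.

```python
from collections import defaultdict

# In-memory stand-ins for two column families (hypothetical names).
# Each maps a row key to a dict of {column name: value}.
user_groups = defaultdict(dict)  # row key: user ID, columns: group IDs
group_users = defaultdict(dict)  # row key: group ID, columns: user IDs


def add_user_to_group(uid, gid, group_name, user_email):
    """Write the membership twice, once per query it must serve."""
    user_groups[uid][gid] = group_name  # "which groups is this user in?"
    group_users[gid][uid] = user_email  # "which users are in this group?"


def groups_of(uid):
    """Read one row; no join is needed."""
    return dict(user_groups.get(uid, {}))


def users_in(gid):
    """Read one row from the mirrored column family."""
    return dict(group_users.get(gid, {}))


add_user_to_group("u1", "g1", "admins", "a@example.com")
add_user_to_group("u1", "g2", "editors", "a@example.com")
add_user_to_group("u2", "g1", "admins", "b@example.com")
```

The cost of this layout is that every write touches both structures, and a rename of a group or a change of email has to be propagated to every copy. That is the trade-off the quote means by denormalizing.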
I will assume that you have a heavy load and lots of data coming through your system, and that you have already tried a relational database and seen it crushed under that load: millions of rows, 10k+ requests per second, and so on.
After these assumptions, I would tell you that you need to change the way you think. For example, in your question you wrote down the table structure, which is really important when you are thinking about relational databases. But in column stores (like Cassandra, HBase, etc.) it is not that important; it is the request types that count. In a column store you can always add new metadata (an extra column that you won't use in your requests but will return in responses) as a new column, without altering your design. In a relational database you would need to alter the table or even add another table with a PK-FK relation.
When using Cassandra (or any other column database) you should have your whole API in front of you.
Example:
if you have getAllUserPosts($userId)
in your API, you should either have a UserPosts ColumnFamily or a secondary index on the Posts ColumnFamily (which does a similar thing in the background). Furthermore, how do you need the result sorted? Yes, that is a key point in the design as well. If you want it sorted by creation date, you had better use a TimeUUID in the key, or a third-party mechanism that generates increasing UIDs for you. If you would rather sort by "last update", you had better put a secondary index on it.
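To make the UserPosts idea concrete, here is a minimal sketch in plain Python, not SimpleCassie or a real Cassandra client. It assumes one row per user whose columns are kept sorted by name, and it uses a (created_at, post_id) pair as a simplified substitute for a TimeUUID. The names user_posts, add_post and get_all_user_posts are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical in-memory stand-in for a UserPosts column family.
# Row key: user ID. Each row is a list of (column name, value) pairs
# kept sorted by column name, mimicking sorted columns within a row.
# The column name is (created_at, post_id), a simplified substitute
# for a TimeUUID.
user_posts = defaultdict(list)


def add_post(user_id, created_at, post_id, title):
    """Insert the column at its sorted position within the user's row."""
    bisect.insort(user_posts[user_id], ((created_at, post_id), title))


def get_all_user_posts(user_id, newest_first=True):
    """Serve getAllUserPosts($userId) from a single, already sorted row."""
    columns = user_posts.get(user_id, [])
    ordered = reversed(columns) if newest_first else columns
    return [(name[1], title) for name, title in ordered]


# Posts arrive out of order but are stored in creation order.
add_post("u1", 1300000200, "p2", "Second post")
add_post("u1", 1300000100, "p1", "First post")
add_post("u1", 1300000300, "p3", "Third post")
```

Because the sort order is fixed when the data is written, the read is a single row lookup. Sorting by a different field, such as "last update", would need a second structure or an index, which is the point made above.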
From my experience, it is really cool to develop something with Cassandra when your API, or what you need from the data, is crystal clear. But when you want to change a big feature, you will have some really big challenges ahead of you, so beware. Also be sure that you understand the underlying "eventual consistency" that makes Cassandra fast, since you will have to bang your head on the keyboard many times to get a transaction to work (at least I did). And of course at some point you will want to run a mass operation over the huge data you have in Cassandra: be ready to meet cloud computing, a.k.a. Hadoop.
PS: I believe there are many people here with much more experience and knowledge of Cassandra than me who could help you design your system much better than I could. I just wanted to share what I experienced and understood while using Cassandra in production.
Denormalize. See twissandra.com and the documentation at http://github.com/ericflo/twissandra
More examples at http://wiki.apache.org/cassandra/ArticlesAndPresentations
Here's a good article on Twissandra (a Twitter clone on Cassandra) that discusses schema design based on data access requirements. You might find it useful: http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example/
Are you really competing with Google and Amazon in terms of traffic volumes? I'd recommend starting by looking at upgrading your current MySQL infrastructure - how many database servers do you currently run in your cluster(s)? Do you partition data?
C.