Complex Query on Cassandra

I heard about the Cassandra database engine a few days ago and have been searching for good documentation on it. From what I have read, Cassandra is more scalable than other data engines. I also read about Amazon SimpleDB, but since SimpleDB has a limit of 10 GB per table and Google Datastore is slower than Amazon SimpleDB, I prefer not to use either of them. So, to make our site scale, especially for high write rates with massive data, I would like to use Cassandra as our data engine.

But before I start using Cassandra, I am confused about how to handle complex data with it. I am giving you the MySQL database structure below. Please read it and give me a good suggestion.

Users Table
hasColumn ID Primary
hasColumn email Unique
hasColumn FirstName
hasColumn LastName

Category Table
hasColumn ID Primary
hasColumn Parent
hasColumn Category

Posts Table
hasColumn ID Primary
hasColumn UID Index foreign key linked to Users->ID
hasColumn CID Index foreign key linked to Category->ID
hasColumn Title
hasColumn Post Index
hasColumn PunDate

Comments Table
hasColumn ID Primary
hasColumn UID Index foreign key linked to Users->ID
hasColumn PID Index foreign key linked to Posts->ID
hasColumn Comment

User Group Table
hasColumn ID Primary
hasColumn Name

UserToGroup Table (for many-to-many relation only)
hasColumn UID foreign key linked to Users->ID
hasColumn GID foreign key linked to Group->ID

Finally, for your information, I would like to use the SimpleCassie PHP class (http://code.google.com/p/simpletools-php/), so it would be very helpful if you could give me an example using SimpleCassie.

From the Cassandra wiki's data model reference:

Unlike with relational systems, where you model entities and relationships and then just add indexes to support whatever queries become necessary, with Cassandra you need to think about what queries you want to support efficiently ahead of time, and model appropriately. Since there are no automatically-provided indexes, you will be much closer to one ColumnFamily per query than you would have been with tables:queries relationally. Don't be afraid to denormalize accordingly;

A good article here.

I hope it helps you.

I will assume that you have a heavy load and a lot of data coming through your system, and I will further assume that you have already tried a relational database and that it was crushed under the load: millions of rows, 10k+ requests per second, and so on.

After these assumptions, I would tell you that you need to change the way you think. For example, in your question you wrote down the table structure, which is really important when you are thinking about relational databases. But in column stores (like Cassandra, HBase, etc.) it's not that important; it's the request types that count. Since in column stores you can always put new metadata (an extra column that you won't use in your requests but will return in responses) in a new column, you don't have to alter your design. In relational databases, however, you need to alter the table or even add another table with a PK-FK relation.
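To make the "extra column" point concrete, here is a minimal sketch in plain Python. It only models a column family as nested dictionaries (row key -> column name -> value); it does not talk to Cassandra, and the `insert` helper and the `avatar_url` column are made up for illustration.

```python
# A column family modeled as: row key -> {column name -> value}.
# This is an in-memory stand-in, not a Cassandra client.
users = {
    "u1": {"email": "a@example.com", "first_name": "Ann", "last_name": "Lee"},
}


def insert(column_family, row_key, columns):
    """Upsert columns into a row; unseen column names need no schema change."""
    column_family.setdefault(row_key, {}).update(columns)


# New metadata (a hypothetical avatar_url) is just written as a new column
# on the rows that have it. Nothing like ALTER TABLE is involved.
insert(users, "u1", {"avatar_url": "http://example.com/a.png"})
insert(users, "u2", {"email": "b@example.com", "first_name": "Bob"})
```

Rows in the same column family can carry different sets of columns: here "u1" has `avatar_url` and `last_name`, while "u2" has neither.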

When using Cassandra (or any other column database), you should have your whole API in front of you.

Example:

If you have getAllUserPosts($userId) in your API, you should either have a UserPosts ColumnFamily or a secondary index on the Posts ColumnFamily (which does a similar thing in the background). Furthermore, how do you need the result sorted? Yes, it's a key point in the design as well: if you want it sorted by creation date, then you had better use a TimeUUID in the key, or a third-party mechanism that generates increasing UIDs for you. Maybe you would like to sort them by their "last update"; then you had better put a secondary index on it.
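Here is a rough sketch of the UserPosts idea in plain Python, again using dictionaries as a stand-in for column families rather than a real Cassandra client. The function names and the `(created_at, post_id)` column name are my own assumptions; the tuple plays the role a TimeUUID would play in a real column family, so that columns within a row sort by creation time.

```python
from collections import defaultdict

# Each "column family" is modeled as: row key -> {column name -> value}.
posts = {}                      # Posts: post_id -> post columns
user_posts = defaultdict(dict)  # UserPosts: user_id -> {(created_at, post_id): title}


def add_post(post_id, user_id, title, body, created_at):
    """Write the post once, plus a denormalized entry for the per-user query."""
    posts[post_id] = {
        "uid": user_id,
        "title": title,
        "post": body,
        "created_at": created_at,
    }
    # The column name starts with the timestamp, so sorting column names
    # sorts by creation time (standing in for a TimeUUID comparator).
    user_posts[user_id][(created_at, post_id)] = title


def get_all_user_posts(user_id, newest_first=True):
    """Answer getAllUserPosts($userId) by reading a single UserPosts row."""
    row = user_posts.get(user_id, {})
    names = sorted(row, reverse=newest_first)
    return [(post_id, row[(ts, post_id)]) for ts, post_id in names]


add_post("p1", "u1", "First post", "hello", created_at=100)
add_post("p2", "u2", "Other user", "hi", created_at=150)
add_post("p3", "u1", "Second post", "again", created_at=200)
```

The price of the cheap read is the second write in `add_post`: every new query you want to serve this way means another denormalized structure to keep up to date.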

From my experience, I would tell you that it's really cool to develop something with Cassandra when your API, or what you need from the data, is crystal clear. But when you want to change a big feature, you will have some really big challenges ahead of you, so beware. Also be sure that you understand the underlying "eventual consistency" that makes Cassandra fast, since you will have to bang your head on the keyboard a lot of times to get a transaction to work (at least I did). And of course, at some point you will want to do a mass operation over the huge data you have in Cassandra: be ready to meet cloud computing, a.k.a. Hadoop.

PS: I believe there are many people here with much more experience with and knowledge of Cassandra than me who could help you design your system much better than I could. I just wanted to share what I experienced and understood while using Cassandra in production.

Denormalize. See twissandra.com and the documentation at http://github.com/ericflo/twissandra

More examples at http://wiki.apache.org/cassandra/ArticlesAndPresentations

Here's a good article on Twissandra (a Twitter clone on Cassandra) that discusses schema design based on data access requirements. You might find it useful: http://www.rackspacecloud.com/blog/2010/05/12/cassandra-by-example/

Are you really competing with Google and Amazon in terms of traffic volumes? I'd recommend starting by looking at upgrading your current MySQL infrastructure - how many database servers do you currently run in your cluster(s)? Do you partition data?