Cassandra data model for web logging

Developer https://www.devze.com 2023-04-10 06:34 Source: Internet

I've been playing around with Cassandra and am trying to work out the best data model for storing things like views or hits for unique page IDs. Would it be best to have a single column family per page ID, or one super column (logs) with a column per page ID? Each page has a unique ID, and I would like to store the date and some other metrics for each view.

I am just not sure which solution scales better: lots of column families, or one giant super column?

page-92838 { date:sept 2, browser:IE }
page-22939 { date:sept 2, browser:IE5 }

logs {
    page-92838 { date:sept 2, browser:IE }
    page-22939 { date:sept 2, browser:IE5 }
}

And secondly, how should I handle lots of different date: entries for page-92838?

You don't need a column-family per pageid.

One solution is to have a row for each page, keyed on the pageid.

You could then have a column for each page-view or hit, keyed and sorted on time-UUID (assuming having the views in time-sorted order would be useful) or other unique, always-increasing counter. Note that all Cassandra columns are time-stamped anyway, so you would have a precise timestamp 'for free' regardless of what other time- or date- stamps you use. Using a precise time-UUID as the key also solves the problem of storing many hits on the same date.

The value of each column could then be a textual value or JSON document containing any other metadata you want to store (such as browser).

page-12345 -> {timeuuid1:metadata1}{timeuuid2:metadata2}{timeuuid3:metadata3}...
page-12346 -> ...
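
The row-per-page layout above can be mimicked in memory to see how it behaves. The following Python sketch is not Cassandra client code: it uses a plain dict as a stand-in for the column family, and the names `record_hit`, `hits_in_time_order`, `NODE` and `CLOCK_SEQ` are hypothetical. It assumes version-1 (time-based) UUIDs as column keys and JSON strings as column values.

```python
import json
import uuid
from collections import defaultdict

# Hypothetical fixed node id and clock sequence, so the example does not
# depend on the machine's network address.
NODE = 0x123456789ABC
CLOCK_SEQ = 0

# Stand-in for a column family: row key (page id) -> {time-UUID: JSON value}.
page_views = defaultdict(dict)


def record_hit(page_id, **metadata):
    """Add one column to the page's row, keyed by a fresh time-UUID."""
    column_key = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)
    page_views[page_id][column_key] = json.dumps(metadata)
    return column_key


def hits_in_time_order(page_id):
    """Return (time-UUID, metadata) pairs sorted by the UUID's timestamp."""
    row = page_views[page_id]
    return [(key, json.loads(row[key])) for key in sorted(row, key=lambda k: k.time)]


# Several hits on the same page and the same date get distinct column keys.
record_hit("page-92838", date="sept 2", browser="IE")
record_hit("page-92838", date="sept 2", browser="IE5")
record_hit("page-22939", date="sept 2", browser="IE")

for key, meta in hits_in_time_order("page-92838"):
    print(key, meta)
```

In a real cluster the sorting would be done by the column comparator rather than in application code; the sketch only shows why a time-UUID key keeps same-day hits distinct and ordered.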

With Cassandra, it is best to start with the queries you need to run, and model your schema to support those queries.

Assuming you want to query hits per page and hits by browser, you can keep counter columns in a row for each page, like this:

stats { # column family
    page-id { # row key
        hits : # counter column for total hits
        browser-ie : # count of views with IE
        browser-firefox : ....
    }
}
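
As a rough illustration of the counter layout above, the sketch below keeps the same shape in memory. It is plain Python rather than Cassandra code: `stats` is a dict standing in for the column family, and `count_view` is a hypothetical helper that does what a pair of counter increments would do on each page view.

```python
from collections import Counter, defaultdict

# Stand-in for the "stats" column family: row key (page id) -> counter columns.
stats = defaultdict(Counter)


def count_view(page_id, browser):
    """Increment the total-hits counter and the per-browser counter for a page."""
    row = stats[page_id]
    row["hits"] += 1
    row["browser-" + browser.lower()] += 1


count_view("page-92838", "IE")
count_view("page-92838", "Firefox")
count_view("page-92838", "IE")

print(stats["page-92838"]["hits"])        # total views of the page
print(stats["page-92838"]["browser-ie"])  # views of the page from IE
```

The trade-off compared with one column per hit is that reads become a single row lookup, but the individual hit records are no longer available.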

If you need to do time-based queries, look at how Twitter's Rainbird denormalizes data as it writes to Cassandra.
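
One way to read "denormalizes as it writes" is that every hit updates several counters at once, one per time granularity, so that a time-range query becomes a lookup of precomputed buckets. The sketch below illustrates that general idea only; it is not Rainbird's actual schema or code, and the bucket formats, the sample timestamps and the names `bucket_keys` and `count_hit` are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Stand-in for a counter column family keyed by (page id, time bucket).
bucket_counts = Counter()


def bucket_keys(when):
    """Return the hour, day and month bucket labels that contain `when`."""
    return [
        when.strftime("hour:%Y-%m-%dT%H"),
        when.strftime("day:%Y-%m-%d"),
        when.strftime("month:%Y-%m"),
    ]


def count_hit(page_id, when):
    """Write-time denormalization: bump one counter per time granularity."""
    for bucket in bucket_keys(when):
        bucket_counts[(page_id, bucket)] += 1


# Hypothetical timestamps for three views of the same page.
count_hit("page-92838", datetime(2010, 9, 2, 10, 15))
count_hit("page-92838", datetime(2010, 9, 2, 10, 45))
count_hit("page-92838", datetime(2010, 9, 2, 18, 5))

# Reading hits for a given hour or day is now a single lookup.
print(bucket_counts[("page-92838", "hour:2010-09-02T10")])
print(bucket_counts[("page-92838", "day:2010-09-02")])
```

Each write costs a few extra increments, which is usually an acceptable price in Cassandra, where writes are cheap relative to scanning many columns at read time.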