开发者

Developing a Cassandra "schema" for unique-id-less data

开发者 https://www.devze.com 2023-02-19 06:42 出处:网络
I\'m attempting to create a Cassandra schema for stock price data. Each records has the following fields:

I'm attempting to create a Cassandra schema for stock price data.

Each records has the following fields:

  • stock symbol
  • date
  • open price
  • close price

As you can see there is no single field that can act as a unique identifier.

I'd like to create something like:

Stocks { // column family
    ????? { //row key - unique id
        symbol: 'GOOG'        // column  -- secondary index
        date: '2005/01/01'    //column   -- secondary index
        open: '500'           //column
        close: '501'          //column
    }开发者_如何学编程
}

As you can see I want to take advantage of the secondary index feature in cassandra 0.7.

When inserting the data into the database, should I just create a TimeUUID for each record? That seems like the most straightforward route.

Note: I do not work in the financial industry, I'm just using this dataset to wrap my mind around Cassandra.


The answer depends on how you will use the data. I assume, based on the indexes you're indicating, that you want to search by symbol and data. I'm going to further assume that you wish to be able to get a list of the data for a given date or the dates/stats for a given symbol. Stock data is static, so we don't need to worry about our schema being updatable.

We can work with a single column family here, which we will call 'StockData'. Each symbol will have a row, as will each date. So for the above record, you would have keys of 'GOOG' and '2005/01/01'.

For the first type of key (symbols), your column names would be something like '2005/01/01-start' and '2005/01/01-end' with the column value being the start and end values.

For the second type of key (dates), your column names would be something like 'GOOG-start' and 'GOOG-end'. Again, the start and end values would be stored as the column values.

To illustrate:

Column Family: StockData
------------------------------------------------------------------------------------
GOOG       | 2005/01/01-start | 2005/01/01-end |  2005/01/02-start | 2005/02/01-end | 
           |      500         |      501       |         501       |     600        |

APPL       | 2005/01/01-start | 2005/01/01-end |  2005/01/02-start | 2005/02/01-end | 
           |      354         |      360       |         360       |     100        |

2005/01/01 |    GOOG-start    |     GOOG-end   |      APPL-start   |     APPL-end   |
           |        500       |       501      |         354       |       360      |

2005/01/02 |    GOOG-start    |     GOOG-end   |      APPL-start   |     APPL-end   |
           |        501       |       600      |         360       |       100      |

Now, you can select all or some stats for a symbol (ordered by date), using the get_slice function on the symbol's row. Similarly, you can get some or all stats for a day (ordered by symbol).

With NOSQL systems there are nearly as many ways to structure your data as there are developers. This is by no means the only way. Just something to get you started.


You need to explain precisely what queries you wish to run - because Cassandra does not have a flexible query language, the schema must be structured to suit specific queries.

You could use the stock symbol as a unique row key, then add columns with the date as the column name, and the start and end price packed into a compound value.

GOOG -> {'2005/01/01': '354-360'} {'2005/01/02': 360-100}

But this wouldn't give you useful secondary indexing.

As string-literal suggests, you could use a compound column name instead:

GOOG -> {'2005/01/01-start':354} {'2005/01/01-end':360} ...

And this would enable secondary indexing on the prices (but that isn't actually very useful in your example dataset).

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号