Cassandra and Tombstones: Creating a Row , Deleting the Row, Recreating the Row = Performance?_问答_开发者

Could someone please explain, what effect the followi开发者_如何学Gong process has on tombstones:

1.)Creating a "Row" with Key "1" ("Fields": user, password, date)

2.)Deleting the "Row" with Key "1"

3.)Creating a "Row" with Key "1" ("Fields": user, password,logincount)

The sequence is executed in one thread sequentially (so this happens with a relatively high "speed" = no long pauses between the actions).

My Questions:

1.) What effect does this have on the creation of a tombstone. After 2.) a tombstone is created/exists. But what happens to the existing tombstone, if the new (slightly changed row) is created again under the same key (in process Step 3.)). Can cassandra "reanimate" the tombstones very efficiently?)

2.) How much worse is the process described above in comparison to only very targetly deleting the date "field" and then creating the "logincount" field instead? (It will most likely be more performant. But on the contrary it is much more complex to find out which fields have been deleted in comparison to just simply delete the whole row and recreate it from scratch with the correct data...)

Remark/Update:

What I actually want to do is, setting the "date" field to null. But this does not work in cassandra. Nulls are not allowed for values. So in case I want to set it to null I have to delete it. But I am afraid that this explicit second delete request will have a negative performance impact (compared to just setting it to null)...And as described I have to first find out which fields are nulliefied and foremost had a value (I have to compare all atributes for this state...)

Thank you very much! Markus

I would like to belatedly clarify some things here.

First, with respect to Theodore's answer:

1) All rows have a tombstone field internally for simplicity, so when the new row is merged with the tombstone, it just becomes "row with new data, that also remembers that it was once deleted at time X." So there is no real penalty in that respect.

2) It is incorrect to say that "If you create and delete a column value rapidly enough that no flush takes place in the middle... the tombstone [is] simply discarded"; tombstones are always persisted, for correctness. Perhaps the situation Theodore was thinking was the other way around: if you delete, then insert a new column value, then the new column replaces the tombstone (just as it would any obsolete value). This is different from the row case since the Column is the "atom" of storage.

3) Given (2), the delete-row-and-insert-new-one is likely to be more performant if there are many columns to be deleted over time. But for a single column the difference is negligible.

Finally, regarding Tyler's answer, in my opinion it is more idiomatic to simply delete the column in question than to change its value to an empty [byte]string.

1). If you delete the whole row, then the tombstone is still kept and not reanimated by the subsequent insertion in step 3. This is because there may have been an insertion for the row a long time ago (e.g. step 0: key "1", field "name"). Row "1" key "name" needs to stay deleted, while row "1" key "user" is reanimated.

2). If you create and delete a column value rapidly enough that no flush takes place in the middle, there is no performance impact. The column will be updated in-place in the Memtable, and the tombstone simply discarded. Only a single value will end up being written persistently to an SSTable.

However, if the Memtable is flushed to disk between steps 2 and 3, then the tombstone will be written to the resulting SSTable. A subsequent flush will write the new value to the next SSTable. This will make subsequent reads slower, since the column now needs to be read from both SSTables and reconciled. (Similarly if a flush occurs between steps 1 and 2.)

Just set the "date" column to hold an empty string. That's what's typically used instead of null.

If you want to delete the column, just delete the column explicitly instead of deleting the entire row. The performance effect of this is similar to writing an empty string for the column value.