I am considering two options for table design and I'm not sure what the pros and cons are for each.
Here is a somewhat abstracted description of my situation:
I am keeping track of a number of data points (category_id, point_id, value)
.
Most of the time, I am only interested in the current value of the data-point. But I need to log all historical values whenever there is a change.
Occasionally I might want to look at the historical values of a particular point, but it is ok if these queries are a little bit slow. What's most important is that I can get the current values of all points, or the current values of all points in a particular category as quickly as possible.
The two (and possibly three) approaches I am consider开发者_运维百科ing:
- Use two separate tables, a
current_vales
and ahistory
table, with a trigger that will insert a row into the history table every time something incurrent_values
changes. - Use only one table with a boolean flag
isCurrent
on each row. Whenever a value changes mark that row as no longer current and insert a new current row with the updated value. - (Use only one table with timestamps on each row -- then the current value for a particular id is the row with the most recent timestamp. But this seems complicated to express as a query, especially if i want to get all the current values for a particular category, not even sure how I would express this without subqueries or of the performance)
There will only be about 3,000-5,000 current points at a time, but the values change frequently enough that up to half of these can change every day, so there will be hundreds of thousands of rows of history eventually.
What are the pros and cons of each approach above (or is there another better approach that I haven't mentioned)? Given my goal of getting the current set of points as quickly as possible, and being ok with slower queries on the history, which is best?
Option 1 and 2 will have similar performance - your manual "partition" of the data in Option 1 can also be managed with a clustered index with IsCurrent as the first column in Option 2. You can always have a view which only gives current and in some ways, this will be very similar in performance, since changing the IsCurrent will move the old row physically (due to the clustering) and add the new row just like your trigger would delete and insert in two tables.
You could also use partitions feature of MySQL.
A big benefit of having separate tables or partitions of a single table is controlling the backup (and potentially purging) of the data in a more fine-grained way.
A real benefit of Option 1 is you do save that small column, which could be beneficial when you get to billions of rows.
A maintenance benefit of Option 2 is that the schema is always the same (don't have to keep changes in sync), since there is only one table.
Option 3 is not going to perform as well because the leading edge of current values is more difficult to find - i.e. varying (although not impossible to improve performance with an index on identifier and timestamp DESC)
Selecting current values
Option 1 will be about the same as Option 2 with the right indexes.
An example index for option 1 (category_id, point_id)
An example index for option 2 (isCurrent, category_id, point_id)
When you select with the option 2 table you will need to include IsCurrent in your where filter. The db engine will know that the index has the current records grouped so to speak and will not even need to look at the non current records. Think of the table as a phone book, with out an index you will need to check every page to find the the business you are after. But with the index, you look up the index and it says businesses starting with "H" are on page 348. So you go directly to page 348 and ignore the others. Its the same concept with databases only you have the ability to make nested levels of indexes. Sorry about the rant, but a few have stated that option 2 is far slower which is not true, they will be about the same.
Inserting/Updating.
This is where I see the major difference between option 1 and 2.
Option 1, when you make an update to an existing record, theses are the steps I see happening
- you hit the disk to read the existing current record
- you then hit the disk to add that record to the history table
- you then hit the disk again to delete the existing current record
- and then you will hit the disk to add the new current record
Option 2, when you make an update to an existing record, theses are the steps I see happening
- you hit the disk to update the existing current record to set it to non current
- and then you will hit the disk to add the new current record
The more times you hit the disk, the slower things get. Its probably not a biggy until you are doing a lot of updates per minute, but the less work the DB needs to do, the better.
I have left option 3 out as its not easy do get current records without smashing the disk. I think you will need timestamps regardless of which option you choose but only when querying history data, say a value has 20 history values, which one of the 20 do you display?
Thats my two bits worth.
I would go with option #1. It allows the easy retrieval of the current_values, the history table will hold a significant number of transactions but could potentially be cleared of older history if necessary.
Option #2 would make you more regular query of current_values much slower to return.
Option 1 seems to be the right approach given that the changes moved out to history are more of an audit. Its also much better separation of data.
Option 2: Multiple rows in the same table is used when the row represented is unique. So given that the change is going to be on one unique entity its better to not have it from a design point of view.
Option 3: Again comments in Option 2 are applicable & also the queries can get into a tangle.
精彩评论