I have made a few projects (a CMS and an EC system) that required some data to be versioned.
Usually I come up with this kind of schema:
+--------------+
+ foobar       +
+--------------+
+ foobar_id    +
+ version      +
+--------------+
It worked great, but I am wondering if there is a better way to do it. The main problem with that solution is that you always have to use a subquery to get the latest version.
i.e.:
SELECT *
FROM foobar
WHERE foobar_id = 2
  AND version = (SELECT MAX(version) FROM foobar f2 WHERE f2.foobar_id = 2)
This renders most of the queries more complicated and also has some performance drawbacks.
So it would be nice if you could share your experience creating versioned tables, and the pros and cons of each method.
Thanks
I prefer to have historical data in another table. I would make foobar_history or something similar and give it a FK to foobar_id. This will stop you from having to use a subquery altogether. It has the added advantage of not polluting your primary data table with the tons of historical data you probably don't want to see 99% of the time you're accessing it.
You will likely want to make a trigger for updating this data though, as it would require you to copy the current data into _history and then do the update.
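A minimal sketch of that idea, assuming PostgreSQL and a hypothetical data column on foobar (table and column names are only illustrative):

CREATE TABLE foobar_history (
    history_id  SERIAL PRIMARY KEY,
    foobar_id   INT NOT NULL REFERENCES foobar (foobar_id),
    version     INT NOT NULL,
    data        TEXT,
    archived_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE FUNCTION foobar_archive() RETURNS trigger AS $$
BEGIN
    -- copy the row being replaced into the history table
    INSERT INTO foobar_history (foobar_id, version, data)
    VALUES (OLD.foobar_id, OLD.version, OLD.data);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foobar_before_update
BEFORE UPDATE ON foobar
FOR EACH ROW EXECUTE FUNCTION foobar_archive();

With this in place the live foobar table only ever holds the current row per foobar_id, and every UPDATE files the previous state away automatically.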
The cleanest solution in my opinion would be to have a History table for each table that requires versioning. In other words, have a foobar table, and then a foobar_History table, with a trigger on foobar that writes the existing data to the History table with a timestamp and the user that changed the data. Older data is easily queryable, sorted by timestamp descending, and you know that the data in the main table is always the latest version.
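For example, assuming the history table carries a changed_at timestamp (a hypothetical column name), reading older versions is just an ordered query:

SELECT *
FROM foobar_History
WHERE foobar_id = 2
ORDER BY changed_at DESC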
I used to work on a system with historical data, and we had a boolean to indicate which row was the latest version of the data. Of course you need to maintain the consistency of the flag at the application level. Then you can create indexes that use the flag, and if you provide it in the WHERE clause queries are fast (see the sketch after the pros and cons below).
Pro:
- easy to understand
- does not require major changes to your (existing) database schema
- no need to copy old data into another table; only the flag is updated
Cons:
- the flag needs to be maintained at the application level
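A sketch of the flag approach, assuming a hypothetical is_latest flag and data column (the two write statements would normally run in a single transaction):

ALTER TABLE foobar ADD COLUMN is_latest BOOLEAN NOT NULL DEFAULT TRUE;
CREATE INDEX idx_foobar_latest ON foobar (foobar_id, is_latest);

-- fetch the current version without a subquery
SELECT * FROM foobar WHERE foobar_id = 2 AND is_latest = TRUE;

-- on change: the application marks the old row as superseded, then inserts the new row
UPDATE foobar SET is_latest = FALSE WHERE foobar_id = 2 AND is_latest = TRUE;
INSERT INTO foobar (foobar_id, version, data, is_latest) VALUES (2, 5, 'new value', TRUE);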
Otherwise, you can rely on a separate history table, as suggested in several answers.
Pro:
- clean separation of history from actual data
- possible to have a db-level cascade delete between actual data and its history, in case the entity is removed
Cons:
- need 2 queries (or a union, sketched after this list) if you want the complete history (that is, old data + current data)
- the row that corresponds to the latest version of the data will be updated. I have heard that updates are slower than inserts, depending on the "size" of the data that changed.
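A hypothetical union for that complete-history case, assuming foobar and foobar_history share the listed columns:

SELECT foobar_id, version, data FROM foobar WHERE foobar_id = 2
UNION ALL
SELECT foobar_id, version, data FROM foobar_history WHERE foobar_id = 2
ORDER BY version DESC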
What is best will depend on your use case. I had to deal with a document management system where we wanted to be able to version documents, but we also had features like reverting to an old version. It was easier to use a boolean to speed up just the operations that required the latest one. If you have real historical data (which never changes), a dedicated history table is probably better.
Does the concept of history fit in your domain model? If not, then you have a db schema that differs from your conceptual domain model. If, at the domain level, the actual data and the old data need to be handled the same way, having two tables complicates the design. Just consider the case where you need to return the complete history (old + new). The easiest solution would be to have one class for each table, but then you can't return a list as easily as if you had only one table. But if these are two distinct concepts, then it's fine to have history be first-class in your design.
I would also recommend this article by M. Fowler, which is interesting when it comes to dealing with temporal data: Patterns for things that change with time.
You can simplify the query by using a view over your table which filters to the latest version. This only makes the queries look nicer; you still have the performance overhead.
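A minimal sketch of such a view, reusing the single-table schema from the question (the name foobar_latest is just illustrative):

CREATE VIEW foobar_latest AS
SELECT f.*
FROM foobar f
WHERE f.version = (SELECT MAX(version)
                   FROM foobar f2
                   WHERE f2.foobar_id = f.foobar_id);

SELECT * FROM foobar_latest WHERE foobar_id = 2;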
A common technique is to add a version_status column with values like current/expired. Also note that if you keep new and old records in the same table, you should have a business (natural) key for your entity, something like name + pin, because the primary key will change (increment) with each row.
TABLE foobar(foobar_id PK, business_key, version, version_status, .....)
SELECT *
FROM foobar
WHERE business_key = 'myFoobar3' AND version_status = 'current'
When deciding whether to keep the record history in the same table or move it to a separate one, consider other tables which have foobar_id as a foreign key. When issuing a new version, should existing foreign keys point to the new PK or to the old PK? If you want to keep a history of relationships, you would probably want to keep everything in the same table. If only the new version is important, you may consider moving expired rows to another table, though it is not necessary.
If you were using Oracle, you could use analytic functions:
SELECT *
FROM ( SELECT a.*,
              ROW_NUMBER() OVER (PARTITION BY foobar_id ORDER BY version DESC) rn
       FROM foobar a
       WHERE foobar_id = 2 )
WHERE rn = 1
It depends on how many of your tables require versioning, and on whether you've got a transactional or reporting system.
If it's just a few transactional tables, the way that you're doing it is fine as long as the performance issues aren't too significant. You can make the querying easier by adding a current_row column and a trigger that updates the prior row to make it non-current.
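A hypothetical sketch of that trigger, assuming PostgreSQL and a current_row flag (some engines restrict a trigger from updating its own table, so the details vary by database):

ALTER TABLE foobar ADD COLUMN current_row BOOLEAN NOT NULL DEFAULT TRUE;

CREATE FUNCTION foobar_expire_prior() RETURNS trigger AS $$
BEGIN
    -- mark every older version of the same entity as non-current
    UPDATE foobar
    SET current_row = FALSE
    WHERE foobar_id = NEW.foobar_id
      AND version < NEW.version
      AND current_row;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foobar_after_insert
AFTER INSERT ON foobar
FOR EACH ROW EXECUTE FUNCTION foobar_expire_prior();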
But if you've got a lot of tables, or the extra rows are slowing down some of your queries, then I'd do as others suggest and use history tables as well as history triggers. Note that you can generate that code to make it easier to develop and maintain.
If you're in the reporting world then there are a lot of other options I won't address here. You can find those options covered in detail in data warehousing data modeling books.