Database: Insert new rows or update existing ones?_问答_开发者

Object-oriented design encourages the use of immutable objects to improve thread-safety and performance. I'm wondering whether this carries over to relational databases.

Am I better off updating existing rows, or inserting new ones that act as overrides?

Use-case
- Each employee is associated with exactly one company
- Employees change their company over time.
- The employee name should be开发者_StackOverflow社区 unique.
Schema
- Employee[name, company]

Option 1: Every time an employee changes companies, insert a new Employee[name, company] row. The application is instructed to skip older rows (which are pruned up in a background thread over time). Option 2: Every time an employee changes companies, update the existing row.

Option 1 reminds me of immutable objects in that it's thread-safe (no need for locks). On the other hand, every time the employee changes companies, I have to clone all associated objects and point them at the new record. Furthermore, it's not clear how to prevent duplicate Employees from getting created by mistake.

Option 2 makes it easy to prevent duplicate employees but has the downside of returning potentially inconsistent associations in READ_COMMITTED transaction isolation.

I am posting this in the hopes that this helps others in the future. I have personally wasted countless days going down this (wrong) path.

Immutable Objects are meant for value-types (think Integer, a timestamp, a temperature reading, etc). They are types that will never change. The moment you start talking about modifying the values of Immutable Objects it is a pretty strong indication that you are going down the wrong path. When you use genuine Immutable Objects you should never have to update references of associated objects.

So the correct answer, whether for object-oriented programming or database design, is to update mutable objects in-place.

UPDATE: marc_s mentions the fact that some systems require an immutable audit trail. I suggest splitting the system into two. The main table updates data in-place while inserting copies into a separate audit table. This has two advantages:

The main table can take advantage of integrity checks (i.e. "employee names must be unique").
The main table remains very fast for reads and the bigger/slower audit table can be trimmed over time.

This allows you to enjoy the best of both worlds.

Those aren't options. They're completely different things, and they require completely different tables. The painful part is that the data from the tables might look exactly the same. Here's how to tell them apart.

Each table in a relational database has one and only one predicate. The predicate determines what the rows in the table mean. So a table whose data looks like this

Name    Company
--
Gili    Microsoft
Marc    Oracle

might mean

Person named "Gili" is currently an employee of company "Microsoft".
Person named "Marc" is currently an employee of company "Oracle".

Such a table would exclude consultants, because they're not employees. (In the USA they're not, anyway.)

But it might mean

Person named "Gili" once was an employee of company "Microsoft".
Person named "Marc" once was an employee of company "Oracle".

and that table would also allow

Person named "Gili" once was an employee of company "Oracle".

Different predicates, different tables. (That means you have to construct the tables differently to capture the meaning of the predicate.)

What you can't have is a table that means

Person named "Gili" is currently an employee of company "Microsoft".
Person named "Marc" once was an employee of company "Oracle".

Two different predicates in one table. Can't do it in a relational system.

So, if your predicate boils down to this

Person named NAME is currently an employee of company COMPANY.

then you must update COMPANY when the person changes employers. And if your predicate boils down to this

Person named NAME once was an employee of company COMPANY.

then you must insert a new row when the person changes employers.

In general, Data Warehousing tends to follow the "insert only" pattern. The reason is that obsolete rows in diomension tables are still needed to place old facts in the context that existed when they were new facts.

Example: Pennsylvania was part of the Northeast sales region until January 1, when it became part of the Middle Atlantic Sales Region. A sale that was made last December needs to refer back to a row in the geographic dimension table that places it in the northeast region. An update in place of the "State" dimension table would invalidate this old fact.

The tendency in OLTP databases is to perform updates in place, and only keep track of what currently is the case. However, this may result in copying some data into transaction rows. For example, The invoice detail row in a purchase order system may contain the price of the item ordered, copied from the row in the products table. That way, if the price gets updated in the products table, the price affecting this invoice doesn't get clobbered.

The basic difference is this:

if you insert a new row for each change, and e.g. "deactivate" the old row by setting a "ValidTo" date, then you have a history of what the changes where, over time - you're going into the area of a "temporal" database
if you keep updating the same row over and over again, then you always have the current state - but not history.

So I guess, this really is the big question: do you need temporal information, e.g. the ability to go "back in time" and now what the state of your data was three months ago?? If so, option 1 (including "soft deletes" - only mark rows as deleted, don't actually delete them) is your only option. The downside is increased complexity and more storage requirements, obviously.

tblPerson
    PersonID 
    LastName
    FirstName

tblCompany
    CompanyID
    CompanyName

tblCompany_Employee
    PersonID
    CompanyID
    StartDate
    EndDate

Rows are never DELETED from tblCompany_Employee - Records are inserted when a person is hired, with a start date. At the time employment ends, the EndDate is UPDATEd from NULL to the date empoyments ended.

To find current employees for a particular company, SELECT PersonID FROM tblCompanyEmployee WHERE EndDAte IS Null.

To find the Employment Status for a particular person at a specific company:

SELECT PersonID
FROM tblCompany_Employee
WHERE PersonID = @PersonID
AND CompanyID = @CompanyID
AND EndDate IS Null

NOTE: THe above should probably be wrapped in a function which only returns true of a valid EmployeeID is returned - If the person was never employed at the company in question the statement would return no records, hence false.

In this case, an audit trail is maintained, and it is possible (with some additiona refinement, obviously - I have put this pretty crudely here) to determine:

A. The Employment history for a person at all companies B. All Persons who HAVE BEEN employed by a particular company C. Who is CURRENTLY employed by a certain company D. Who is NOT currently employed by a certain company D. Etc.

Without ever losing data due to UPDATES which overwrite the historical record.