For a large table of transactions (100 million rows, 20 GB) that already has a primary key (a natural composite key of 4 columns), will it help performance to add an identity column and make that the primary key?
The current primary key (the natural composite primary key of 4 columns) does the job, but I have bee开发者_如何学Cn told that you should always have a surrogate key. So, could improve performance by creating an identity column and making that the primary key?
I'm using SQL Server 2008 R2 database.
EDIT: This transaction table is mainly joined to definition tables and used to populate reports.
EDIT: If I did add a surrogate key, it wouldn't be used in any joins. The existing key fields would be used.
EDIT: There would be no child tables to this table
Just adding an IDENTITY column and adding a new constraint and index for it is very unlikely to improve performance. The table will be larger and therefore scans and seeks could take longer. There will also be more indexes to update. Of course it all depends what you are measuring the performance of... and whether you intend to make other changes to code or database when you add the new column. Adding an IDENTITY column and doing nothing else would probably be unwise.
Only if:
- you have child tables that are larger
- you have nonclustered indexes
In each of these cases, the PK (assumed clustered) of your table will be in each child entry/NC entry. So making the clustered key narrower will benefit.
If you have just non NC indexes (maybe one) and no child tables all you'll achieve is
- a wider row (more data pages used)
- a slightly smaller B-tree (which is a fraction of total space)
...but you'll still need an index/constraint on the current 4 columns anyway = an increase in space.
If your 4 way key capture parent table keys too (sounds likely) then you'd lose the advantage of overlap. This would be covered by the new index/constraint though.
So no, you probably don't want to do it.
We threw away a surrogate key (bigint) on a billion+ row table and moved to the actual 11-way key and reduced space on disk by 65%+ because of a simpler structure (one less index, slighty more rows per page etc)
Given your edits, and all the conversation the question has sparked notwithstanding, I would suggest that adding an IDENTITY column to this table will create a lot more harm than benefit.
One place where performance is hurt is on the change of the data in the natural key. The change woudl then have to promulgate to all the child records. For instance, suppose one of those fields was company name and the company changed their name, then all the related records, and there could be millions of them, would have to change but if you used a surrogate key, only one record would have to change. Integer joins tend to be faster (generally much faster than 4 column joins) and wrting the code to join is generally faster as well. However, on the other hand, having the vital four fields may mean the join isn't needed as often. Insert performanc ewilltake a slight hit as well as the surrogate key has to be generated and indexed. Usually this is so small a hit as to be unnoticalbe but the possibility is there.
A four column natural key is often not a unique as you think it will be because that number of columns the data tends to change over time. While it is unique now, will it be unique over time? If you have used a surrogate key and a unique index onteh natural key and it turns out later not to unique, then all you have to do is drop the unique index. If it is the PK and there are child tables, you have to totally redesign your database.
Only you can decide which if any of these considerations affects your specific data needs, surrogate keys are better for some applications and worse for others.
---EDIT: Based on the edits to the question, adding an identity/surrogate key might not be the solution to this problem.
--Original Answer.
One case of performance improvement would be when you use joins and when you have child tables.
In the absence of surrogate keys, you would have to replicate all th4 4 keys to the child table and join on the 4 columns.
t_parent
-------------
col1,
col2,
col3,
col4,
col5,
constraint pk_t_parent primary key (col1,col2,col3,col4)
t_child
----------
col1,
col2,
col3,
col4,
col7,
col8,
constraint pk_t_child primary key (col1,col2,col3,col4, col5),
constraint fk_parent_child foreign key (col1, col2, col3, col4) references
t_parent ((col1, col2, col3, col4))
The joins will include all the 4 columns..
select t2.*
from t_parent t1, t_child t2
where (t1.col1 = t2.col1 and
t1.col2 = t2.col2 and
t1.col3 = t2.col3 and
t1.col4 = t2.col4
)
If you use a surrogate key and create a unique constraint on the 4 columns (which are now part of the primary key), it will be both efficient and the data would still be validated as before.
精彩评论