I have a table storing millions of rows. It looks something like this:
Table_Docs
ID, Bigint (Identity col)
OutputFileID, int
Sequence, int
…(many other fields)
We find ourselves in a situation where the developer who designed it made the OutputFileID the clustered index. It is not unique. There can be thousands of records with this ID. It has no benefit to any processes using this table, so we plan to remove it.
The question is what to change it to… I have two candidates. The ID identity column is a natural choice. However, we have a process which does a lot of UPDATE commands on this table, and it uses the Sequence to do so. The Sequence is non-unique: most values appear on only one record, but about 20% are shared by two or more records.
The INSERT app is a VB6 piece of crud throwing thousands of INSERT commands at the table. The inserted values are never in any particular order, so the Sequence of one insert may be 12345 and the next could be 12245. I know that this could cause SQL Server to move a lot of data to keep the clustered index in order. However, the Sequences of the inserts are generally close to being in order, and all inserts would take place at the end of the clustered table. E.g.: I have 5 million records with Sequences spanning 1 to 5 million; at any given time the INSERT app will be inserting Sequences at the end of that range. Reordering of the data should be minimal (tens of thousands of records at most).
Now, the UPDATE app is our .NET star. It does all UPDATEs against this table, keyed on the Sequence column: "UPDATE Table_Docs SET Field1=This, Field2=That…WHERE Sequence = 12345" – hundreds of thousands of these a day. The UPDATEs are completely and totally random, touching all points of the table.
All other processes simply do SELECTs against this table (web pages). Regular indexes cover those.
So my question is, what's better… a unique clustered index on the ID column, benefiting the INSERT app, or a non-unique clustered index on the Sequence, benefiting the UPDATE app?
First off, I would definitely recommend having a clustered index!
Secondly, your clustered index should be:
- narrow
- static (never or hardly ever changes)
- unique
- ever-increasing
so an INT IDENTITY is a very well thought out choice.
When your clustering key is not unique, SQL Server will add a hidden 4-byte uniquifier to the duplicate key values, making your clustering key, and with it all non-clustered indices on that table, larger and less optimal.
So in your case, I would pick the ID - it's narrow, static, unique and ever-increasing; can't be more optimal than that! Since the Sequence is used heavily in UPDATE statements, definitely put a non-clustered index on it, too!
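A minimal sketch of that target layout, assuming the column names from the question (the many other fields are omitted, and the constraint/index names are made up):

-- Assumed target design: ID as the clustered primary key, Sequence covered
-- by a separate non-clustered index for the UPDATE app's WHERE clause.
CREATE TABLE dbo.Table_Docs
(
    ID           BIGINT IDENTITY(1,1) NOT NULL,
    OutputFileID INT NOT NULL,
    Sequence     INT NOT NULL,
    -- ...many other fields...
    CONSTRAINT PK_Table_Docs PRIMARY KEY CLUSTERED (ID)  -- narrow, static, unique, ever-increasing
);

CREATE NONCLUSTERED INDEX IX_Table_Docs_Sequence
    ON dbo.Table_Docs (Sequence);                         -- supports "WHERE Sequence = ..." updates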
See Kimberly Tripp's excellent blog posts on choosing the right clustering key for great background info on the topic.
As a general rule, you want your clustered index to be unique. If it is not, SQL Server will in fact add a hidden "uniquifier" to it to force it to be unique, and this adds overhead.
So, you are probably best using the ID column as your index.
Just as a side note, using an identity column as your primary key is normally referred to as a surrogate key, since it is not inherent in your data. When you have a unique natural key available, that is probably a better choice. In this case it looks like you do not, so using the unique surrogate key makes sense.
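If you want to see the uniquifier overhead for yourself, a small experiment along these lines (table and index names are invented) should show the duplicate rows in the non-unique clustered index carrying a few extra bytes per record:

CREATE TABLE dbo.clustered_unique    (k INT NOT NULL, payload CHAR(100) NOT NULL);
CREATE TABLE dbo.clustered_nonunique (k INT NOT NULL, payload CHAR(100) NOT NULL);

CREATE UNIQUE CLUSTERED INDEX cx_u  ON dbo.clustered_unique (k);
CREATE        CLUSTERED INDEX cx_nu ON dbo.clustered_nonunique (k);

-- duplicate key values force SQL Server to materialize the hidden uniquifier
INSERT INTO dbo.clustered_unique    VALUES (1, 'a'), (2, 'a'), (3, 'a');
INSERT INTO dbo.clustered_nonunique VALUES (1, 'a'), (1, 'a'), (1, 'a');

-- max_record_size_in_bytes should come back larger for the non-unique table
SELECT OBJECT_NAME(object_id) AS table_name, max_record_size_in_bytes
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')
WHERE  object_id IN (OBJECT_ID('dbo.clustered_unique'), OBJECT_ID('dbo.clustered_nonunique'));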
The worst thing about out-of-order inserts is page splits. When SQL Server needs to insert a new record into an existing index page and finds no room there, it takes half the records from that page and moves them into a new one.
Say you have these records filling a whole page:
1 2 3 4 5 6 7 8 9
and need to insert a 10. In this case, SQL Server will just start a new page.
However, if you have this:
1 2 3 4 5 6 7 8 11
then 10 should go before 11. In this case, SQL Server will split the page, moving the records from 6 to 11 into a new page:
6 7 8 10 11
The old page, as can easily be seen, will remain half filled (only the records from 1 to 5 stay there).
This will increase the index size.
Let's create two sample tables:
CREATE TABLE perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
CREATE TABLE almost_perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
;
WITH q(num) AS
(
SELECT 1
UNION ALL
SELECT num + 1
FROM q
WHERE num < 200000
)
INSERT
INTO perfect
SELECT num, REPLICATE('*', 300)
FROM q
OPTION (MAXRECURSION 0)
;
WITH q(num) AS
(
SELECT 1
UNION ALL
SELECT num + 1
FROM q
WHERE num < 200000
)
INSERT
INTO almost_perfect
SELECT num + CASE num % 5 WHEN 0 THEN 2 WHEN 1 THEN 0 ELSE 1 END, REPLICATE('*', 300)
FROM q
OPTION (MAXRECURSION 0)
EXEC sp_spaceused N'perfect'
EXEC sp_spaceused N'almost_perfect'
name            rows    reserved   data       index_size  unused
perfect         200000  66960 KB   66672 KB   264 KB      24 KB
almost_perfect  200000  128528 KB  128000 KB  496 KB      32 KB
Even with only a 20% probability of the records being out of order, the table becomes twice as large.
On the other hand, having the clustered key on Sequence will cut the I/O for the updates in half, since each update can then be done with a single clustered index seek rather than a non-clustered index seek followed by a key lookup.
So I'd take a sample subset of your data, insert it into a test table with a clustered index on Sequence, and measure the resulting table size. If it is less than twice the size of the same table clustered on ID, I'd go for the clustered index on Sequence (since the total resulting I/O will be less).
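A sketch of that size test, assuming the production table is Table_Docs (the test table names, the trimmed column list, and the 500,000-row sample size are placeholders):

-- Load the same sample of rows under each candidate clustered index
CREATE TABLE dbo.test_seq (ID BIGINT NOT NULL, OutputFileID INT, Sequence INT NOT NULL /* ...other fields... */);
CREATE UNIQUE CLUSTERED INDEX cx_test_seq ON dbo.test_seq (Sequence, ID);

CREATE TABLE dbo.test_id (ID BIGINT NOT NULL, OutputFileID INT, Sequence INT NOT NULL /* ...other fields... */);
CREATE UNIQUE CLUSTERED INDEX cx_test_id ON dbo.test_id (ID);

-- ORDER BY ID approximates the original arrival order, so the Sequence-clustered
-- copy suffers roughly the same out-of-order inserts as production would
INSERT INTO dbo.test_seq (ID, OutputFileID, Sequence)
SELECT TOP (500000) ID, OutputFileID, Sequence FROM dbo.Table_Docs ORDER BY ID;

INSERT INTO dbo.test_id (ID, OutputFileID, Sequence)
SELECT TOP (500000) ID, OutputFileID, Sequence FROM dbo.Table_Docs ORDER BY ID;

EXEC sp_spaceused N'dbo.test_seq';
EXEC sp_spaceused N'dbo.test_id';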
If you decide to create a clustered index on Sequence, make ID a non-clustered PRIMARY KEY and make the clustered index UNIQUE on (Sequence, ID). This will use the meaningful ID instead of an opaque uniquifier.
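In DDL terms that would look roughly like this (constraint and index names are invented, and the existing clustered index on OutputFileID is assumed to have been dropped already):

-- Build the clustered index first so the non-clustered primary key
-- is created on top of it and doesn't need rebuilding afterwards
CREATE UNIQUE CLUSTERED INDEX cx_Table_Docs_Sequence_ID
    ON dbo.Table_Docs (Sequence, ID);

ALTER TABLE dbo.Table_Docs
    ADD CONSTRAINT PK_Table_Docs PRIMARY KEY NONCLUSTERED (ID);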