Does clustered index on foreign key column increase join performance vs non-clustered?_问答_开发者

In many places it's recommended that clustered indexes are better utilized when used to select range of rows using BETWEEN statement. When I select joining by foreign key field in such a way that this clustered index is used, I guess, that clusterization should help too because range of rows is being selected even though they all have same clustered key value and BETWEEN is not used.

Considering that I care only about that one select开发者_开发知识库 with join and nothing else, am I wrong with my guess ?

Discussing this type of issue in the absolute isn't very useful.

It is always a case-by-case situation !

Essentially, access by way of a clustered index saves one indirection, period.

Assuming the key used in the JOIN, is that of the clustered index, in a single read [whether from an index seek or from a scan or partial scan, doesn't matter], you get the whole row (record).

One problem with clustered indexes, is that you only get one per table. Therefore you need to use it wisely. Indeed in some cases, it is even wiser not to use any clustered index at all because of INSERT overhead and fragmentation (depending on the key and the order of new keys etc.)

Sometimes one gets the equivalent benefits of a clustered index, with a covering index, i.e. a index with the desired key(s) sequence, followed by the column values we are interested in. Just like a clustered index, a covering index doesn't require the indirection to the underlying table. Indeed the covering index may be slightly more efficient than the clustered index, because it is smaller.
However, and also, just like clustered indexes, and aside from the storage overhead, there is a performance cost associated with any extra index, during INSERT (and DELETE or UPDATE) queries.

And, yes, as indicated in other answers, the "foreign-key-ness" of the key used for the clustered index, has absolutely no bearing on the the performance of the index. FKs are constraints aimed at easing the maintenance of the integrity of the database but the underlying fields (columns) are otherwise just like any other field in the table.

To make wise decisions about index structure, one needs

to understands the way the various index types (and the heap) work
(and, BTW, this varies somewhat between SQL implementations)
to have a good image of the statistical profile of the database(s) at hand:
which are the big tables, which are the relations, what's the average/maximum cardinality of relation, what's the typical growth rate of the database etc.
to have good insight regarding the way the database(s) is (are) going to be be used/queried

Then and only then, can one can make educated guesses about the interest [or lack thereof] to have a given clustered index.

I would ask something else: would it be wise to put my clustered index on a foreign key column just to speed a single JOIN up? It probably helps, but..... at a price!

A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:

narrow
static (never changes)
unique
if ever possible: ever increasing

INT IDENTITY fulfills this perfectly - GUID's do not. See GUID's as Primary Key for extensive background info.

Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....

Why unique?? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!

So those are my criteria - put your clustering key on a narrow, stable, unique, hopefully ever-increasing column. If your foreign key column matches those - perfect!

However, I would not under any circumstances put my clustering key on a wide or even compound foreign key. Remember: the value(s) of the clustering key are being added to each and every non-clustered index entry on that table! If you have 10 non-clustered indices, 100'000 rows in your table - that's one million entries. It makes a huge difference whether that's a 4-byte integer, or a 200-byte VARCHAR - HUGE. And not just on disk - in server memory as well. Think very very carefully about what to make your clustered index!

SQL Server might need to add a uniquifier - making things even worse. If the values will ever change, SQL Server would have to do a lot of bookkeeping and updating all over the place.

So in short:

putting an index on your foreign keys is definitely a great idea - do it all the time!
I would be very very careful about making that a clustered index. First of all, you only get one clustered index, so which FK relationship are you going to pick? And don't put the clustering key on a wide and constantly changing column

An index on the FK column will help the JOIN because the index itself is ordered: clustered just means that the data on disk (leaf) is ordered rather then the B-tree.

If you change it to a covering index, then clustered vs non-clustered is irrelevant. What's important is to have a useful index.

It depends on the database implementation.

For SQL Server, a clustered index is a data structure where the data is stored as pages and there are B-Trees and are stored as a separate data structure. The reason you get fast performance, is that you can get to the start of the chain quickly and ranges are an easy linked list to follow.

Non-Clustered indexes is a data structure that contains pointers to the actual records and as such different concerns.

Refer to the documentation regarding Clustered Index Structures.

An index will not help in relation to a Foreign Key relationship, but it will help due to the concept of "covered" index. If your WHERE clause contains a constraint based upon the index. it will be able to generate the returned data set faster. That is where the performance comes from.

The performance gains usually come if you are selecting data sequentially within the cluster. Also, it depends entirely on the size of the table (data) and the conditions in your between statement.