开发者

Index on multiple bit fields in SQL Server

开发者 https://www.devze.com 2023-03-29 00:08 出处:网络
We currently have a scenario where one table effectively has several (10 to 15) boolean flags (not nullable bit fields).Unfortunately, it is not really possible to simplify this too much on a logical

We currently have a scenario where one table effectively has several (10 to 15) boolean flags (not nullable bit fields). Unfortunately, it is not really possible to simplify this too much on a logical level, because any combination of the boolean values is permissi开发者_JAVA百科ble.

The table in question is a transactional table which may end up having tens of millions of rows, and both insert and select performance is fairly critical. Although we are not quite sure of the distribution of the data at this time, the combination of all flags should provide relative good cardinality, i.e. make it a "worthwhile" index for SQL Server to make use of.

Typical select query scenarios might be to select records based on 3 or 4 of the flags only, e.g. WHERE FLAG3=1 AND FLAG7=0 AND FLAG9=1. It would not be practical to create separate indexes for all combinations of the flags used by these select queries, as there will be many of them.

Given this situation, what would be the recommended approach to effectively index these fields? The table is new, so there is no existing data to worry about yet, and we have a fair amount of flexibility in the actual implementation of the table.

There are two main options that we are considering at the moment:

  • Create a single index which includes all the bit fields (this would probably include 1 or 2 other int fields which would always used). My concern is that given the typical usage of only including a few of the fields, this approach would skip the index and resort to a table scan. Let's call this Option A (Having read some of the replies, it seems that this approach would not work well, since the order of the fields in the index would make a difference making it impossible to index effectively on ALL the fields).
  • Effectively do what I believe SQL Server is doing internally, and encode the bit fields into a single int field using binary operators (AND-ing and OR-ing numbers together: 1, 2, 4, 8, etc). My concern here is that we'd need to do some kind of calculation to query on this encoded field, which would skip the index again. Maintenance and the complexity of this solution is also a concern. Let's call this Option B. Additional info: The argument for this approach is that we could have a relatively simple and short index which includes one or two other fields from the table and this field. The other fields would narrow down the number of records needing to be evaluated, and since the encoded field would contain all of our bit fields, SQL Server would be able to perform the calculation using the data retrieved from the index directly (i.e. an index scan) as opposed to the table (i.e. a table scan).

At the moment, we are heavily leaning towards Option B. For completeness, this would be running on SQL Server 2008.

Any advice would be greatly appreciated.

Edit: Spelling, clarity, query example, additional info on Option B.


A single BIT column typically is not selective enough to be even considered for use in an index. So an index on a single BIT column really doesn't make sense - on average, you'd always have to search about half the entries in the table (50% selectiveness) and so the SQL Server query optimizer will instead use a table scan.

If you create a single index on all 15 bit columns, then you don't have that problem - since you have 15 yes/no options, your index will become quite selective.

Trouble is: the sequence of the bit columns is important. Your index will only ever be considered if your SQL statement uses at least 1-n of the left-most BIT columns.

So if your index is on

Col1,Col2,Col3,....,Col14,Col15

then it might be used for a query that uses

  • Col1
  • Col1 and Col2
  • Col1 and Col2 and Col3 ....

and so on. But it cannot be used for a query that specifies Col6,Col9 and Col14.

Because of that, I don't really think an index on your collection of BIT columns really makes a lot of sense.

Are those 15 BIT columns the only columns you use for querying? If not, I would try to combine those BIT columns that you use most for selection with other columns, e.g. have an index on Name and Col7 or something (then your BIT columns can add some additional selectivity to another index)


Whilst there are probably ways to solve your indexing problem against your existing table schema, I would reduce this to a normalisation problem:

e.g I would highly recommend creating a series of new tables:

  1. Lookup table for the names of this bit flags. e.g. CREATE TABLE Flags (id int IDENTITY(1,1), Name varchar(256)) (you don't have to make id an identity-seed column if you want to manually control the id's - e.g. 2,4,8,16,32,64,128 as binary flags.)
  2. Create a new link-table that contains the id's of the original data table and the new link table e.g. CREATE TABLE DataFlags_Link (id int IDENTITY(1,1), MyFlagId int, DataId int)

You could then create an index on the DataFlags_Link table and write queries like:

SELECT Data.*
FROM Data
INNER JOIN DataFlags_Link ON Data.id = DataFlags_Link.DataId
WHERE DataFlags_Link.MyFlagId IN (4,7,2,8)

As for performance, that's where good DBA maintenance comes in. You'll want to set the INDEX fill-factor and padding on your tables appropriately and run regular index defragmentation or rebuild your indexes on a schedule.

Performance and maintenance go hand-in-hand with databases. You can't have one without the other.


Whilst I think Neil Fenwick's answer is probably right, I think the real answer is to try out the different options and see which one is fast enough.

Option 1 is probably the most straightforward solution, and therefore likely the most maintainable - and it may well be fast enough.

I would build a prototype database, with the "option 1" schema, and use something like http://www.red-gate.com/products/sql-development/sql-data-generator/ or http://sourceforge.net/projects/dbmonster/ to create twice as much data as you expect to need, and then build the queries you expect to need. Agree an acceptable response time, and only consider a "faster" schema if you exceed those response times (and you can't throw hardware at the problem).

Neil's solution is probably just as obvious and maintainable as "option 1" - and it should be easy to index. However, I'd still test it by creating a prototype schema and generating a lot of test data...

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号