My programming environment is Rails 2.3 and PostgreSQL 8 (shared database on Heroku). I read this http://devcenter.heroku.com/articles/postgresql-indexes#multicolumn_indexes and other related resources on the Internet before I started building my app in the generic way:
My table has two columns, A and B, and both are indexed. (The rows are unique in terms of the (A,B) pair.) But after I built my app, I found that I only query the table with two types of calls: myTable.find_by_A_and_B(a,b) and myTable.find_by_A(a).
We are expecting 10,000+ entries in the table; the ratio of distinct A values to distinct B values is around 3:1. We expect that for each unique value of A there will be 1000+ rows with different values of B, and for each unique value of B there will be no more than 300 rows with different values of A.
My question is: can the current database setup (with two separate indexes) be considered "efficient" with respect to the myTable.find_by_A_and_B(a,b) call (as I have no idea about the inner workings of PostgreSQL)? And would replacing the two indexes with a single multi-column index on (A,B) provide a significant speed-up?
Thank you.
P.S. In response to the comment, here is a bit more information. According to this page, http://devcenter.heroku.com/articles/database, it is running PostgreSQL 8.3.
And the following is the migration schema for myTable:
create_table :myTable do |t|
  t.string  :b
  t.integer :a
  t.boolean :c, :default => false
end

add_index :myTable, :b
add_index :myTable, :a
In recent versions of PostgreSQL multi-column indexes can be used efficiently to filter on just one of the columns. This works best on the first column, but reasonably well for the others, too.
Also, 10,000 rows is a piece of cake for PostgreSQL. Tables with millions of rows are not uncommon.
Assuming we are talking about btree indexes (the default) on integer (int4) columns, the answer is: just use one multi-column index on (a,b).

Due to the page layout on disk (similar for tables and indexes), there is quite a bit of overhead per index row. Also, due to data alignment restrictions, an index on (a,b) will use the exact same amount of disk space as an index on just (a) on machines with MAXALIGN = 8 bytes (most 64-bit OSes).

So, especially if you have lots of writes or limited disk space and/or RAM, your best bet is to use just one multi-column index on (a,b). Maintaining indexes on heavily written tables carries quite a cost, too.
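For illustration, here is a minimal Rails 2.3 migration sketching that suggestion (the class name is made up; adjust table and index names to whatever your schema actually uses). Putting a first means the same index can also serve lookups on a alone:

class ReplaceSingleIndexesWithMulticolumnIndex < ActiveRecord::Migration
  def self.up
    # Drop the two single-column indexes ...
    remove_index :myTable, :a
    remove_index :myTable, :b
    # ... and replace them with one multi-column index.
    # Column order matters: (a, b) also serves queries filtering on a alone.
    add_index :myTable, [:a, :b]
  end

  def self.down
    remove_index :myTable, :column => [:a, :b]
    add_index :myTable, :a
    add_index :myTable, :b
  end
end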
Edit in response to the update on the question:

With a being integer, my answer is mostly valid. The index on (a,b) will be all or most of what you need.

Get rid of the separate index on b, as you obviously don't have queries on just b.

As b is text, the multi-column index on (a,b) cannot profit from data alignment as much as described above, but still. The greater the average length of b, the more likely you are to profit from an additional index on just a. With short b it probably doesn't pay. Otherwise I would expect it to speed up myTable.find_by_A(a) by just a bit.

This will likely be faster than two separate indexes on a and b, but not by a huge margin, as Postgres can combine two indexes in a bitmap index scan. This has improved since v8.3.

Be aware that btree indexes on text only help queries with '=' (more if you run with the C locale). Read the manual about operator classes.
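To illustrate the operator-class point: if you ever needed left-anchored LIKE lookups on b under a non-C locale, a plain btree index would not be used; you would need an extra index declared with a pattern operator class. A rough sketch via raw SQL in a migration, with the index name made up and varchar_pattern_ops chosen because t.string creates a varchar column; this is only relevant if such queries actually exist:

class AddPatternOpsIndexOnB < ActiveRecord::Migration
  def self.up
    # varchar_pattern_ops lets the planner use this btree index for
    # left-anchored LIKE patterns (e.g. b LIKE 'foo%') when the database
    # locale is not C. Plain '=' lookups are already covered by (a, b).
    execute %q{CREATE INDEX index_myTable_on_b_pattern ON "myTable" (b varchar_pattern_ops)}
  end

  def self.down
    execute %q{DROP INDEX index_myTable_on_b_pattern}
  end
end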
You don't have to take my word for it: run some tests with EXPLAIN ANALYZE. It is very simple and informative, and index creation for 10,000 rows is a matter of a second or so. Repeat each query a couple of times to populate the cache and get comparable results.
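For instance, from script/console (a rough sketch; the sample values 42 and 'foo' are placeholders, and the quoted table name should match whatever your schema actually uses):

# Run the same queries the finders generate through EXPLAIN ANALYZE
# and print the resulting plans.
conn = ActiveRecord::Base.connection
[
  %q{EXPLAIN ANALYZE SELECT * FROM "myTable" WHERE a = 42 AND b = 'foo'},
  %q{EXPLAIN ANALYZE SELECT * FROM "myTable" WHERE a = 42}
].each do |sql|
  conn.select_all(sql).each { |row| puts row["QUERY PLAN"] }
end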