The Problem
We have a rapidly growing database with several large join tables (currently in the billions of rows), but as these tables have grown, query times have suffered. The concern is that as more data is added to the tables linked by these join tables, the join tables will continue to grow at an even faster pace and further degrade query speed.
The Background
I am dealing with a database that stores genomic information. A number of markers (~3 million), corresponding to loci where there are DNA variations, are linked to individuals who have had their genotype determined at these loci. Every marker has a number of possible genotypes, of which every individual must have exactly one.
The Current Implementation
When the database (PostgreSQL) was still small, there were no problems in linking the genotypes to the markers using foreign keys and then linking the individuals to their genotypes through a join table. That way, it would be easy to look up all of an individual's genotypes or to look up all of the individuals having a specific genotype.
A slimmed-down version of these tables is listed below:
Table "public.genotypes"
Column | Type | Modifiers
------------------+-----------------------------+--------------------------------------------------------
id | integer | not null default nextval('genotypes_id_seq'::regclass)
ref_variation_id | integer |
value | character varying(255) |
Indexes:
"genotypes_pkey" PRIMARY KEY, btree (id)
"index_genotypes_on_ref_variation_id" btree (ref_variation_id)
Table "public.genotypes_individuals"
Column | Type | Modifiers
---------------+---------+-----------
genotype_id | integer |
individual_id | integer |
Indexes:
"index_genotypes_individuals_on_genotype_id_and_individual_id" UNIQUE, btree (genotype_id, individual_id)
"index_genotypes_individuals_on_genotype_id" btree (genotype_id)
Table "public.individuals"
Column | Type | Modifiers
---------------+-----------------------------+----------------------------------------------------------
id | integer | not null default nextval('individuals_id_seq'::regclass)
hap_id 开发者_高级运维 | character varying(255) |
population_id | integer |
sex | character varying(255) |
Indexes:
"individuals_pkey" PRIMARY KEY, btree (id)
"index_individuals_on_hap_id" UNIQUE, btree (hap_id)
The bottleneck right now is looking up all of the genotypes for an individual and having them sorted by their positions. This is used frequently and is much more important than looking up the individuals for a genotype. Examples of some of these queries are:
A simple lookup of all of an individual's genotypes
SELECT * FROM "genotypes" INNER JOIN "genotypes_individuals" ON "genotypes".id = "genotypes_individuals".genotype_id WHERE ("genotypes_individuals".individual_id = 2946 )
Normally, though, this gets limited, because there are a lot of genotypes; we're often only interested in those on a specific chromosome.
SELECT * FROM "genotypes" INNER JOIN "genotypes_individuals" ON "genotypes".id = "genotypes_individuals".genotype_id WHERE ("genotypes_individuals".individual_id = 2946 ) AND ("genotypes".ref_variation_id IN (37142, 37143, ...))
We also still need to occasionally go the other way.
SELECT * FROM "individuals" INNER JOIN "genotypes_individuals" ON "individuals".id = "genotypes_individuals".individual_id WHERE ("genotypes_individuals".genotype_id = 53430)
Every time a new individual is added to the db, the join table grows by about 3 million rows. Intuitively, from a design perspective, this seems bad because adding new individuals will slow down any process that uses the existing data.
I understand that databases are designed to handle large tables efficiently, but we are already hitting bottlenecks due to disk I/O. An individual query is still inconsequential, but thousands of them add up quickly. We can alleviate this problem somewhat by spreading the db across multiple drives, but I wanted to see if there are any other alternatives out there. I have been wondering if it is somehow possible to segregate the join table entries by individual_id, which might leave lookups from individuals to genotypes unaffected by adding additional individual-genotype rows to the join table. Or do indices already do that?
Have you looked at table partitioning?
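For a feel of what that could look like here, this is a minimal sketch that hash-partitions the join table on individual_id, assuming PostgreSQL 11 or later for hash partitioning and declarative partition syntax (the table and column names are reused from the schema above; the partition count of 8 is an arbitrary choice):

-- Partition the join table on the column the hot queries filter by.
CREATE TABLE genotypes_individuals (
    genotype_id   integer NOT NULL,
    individual_id integer NOT NULL
) PARTITION BY HASH (individual_id);

-- One physical partition per hash bucket; repeat for remainders 1 through 7.
CREATE TABLE genotypes_individuals_p0 PARTITION OF genotypes_individuals
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);

-- A unique index on a partitioned table must include the partition key;
-- it cascades to every partition automatically.
CREATE UNIQUE INDEX index_gi_on_individual_id_and_genotype_id
    ON genotypes_individuals (individual_id, genotype_id);

With this layout the planner prunes a query like WHERE individual_id = 2946 down to a single partition, so a per-individual lookup scans one bucket rather than the whole table.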
I would consider testing a schema that used natural keys instead of id numbers.
Your lookup of all of an individual's genotypes
SELECT *
FROM "genotypes"
INNER JOIN "genotypes_individuals"
ON "genotypes".id = "genotypes_individuals".genotype_id
WHERE ("genotypes_individuals".individual_id = 2946 )
becomes
SELECT *
FROM genotypes_individuals
WHERE (individual_id = 2946)
Sometimes that's faster. Sometimes it's not.
Switching to natural keys on our production system increased median performance by a factor of 10. Some queries ran 100 times faster with natural keys, because natural keys eliminated a lot of joins. Some queries ran slower, too. But the median speed-up was impressive anyway.
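To make the idea concrete, here is a hypothetical natural-key version of the two tables, assuming a genotype is naturally identified by (ref_variation_id, value); this is my reading of the approach, not a schema from the original system:

-- Genotypes identified by their natural key instead of a surrogate id.
CREATE TABLE genotypes (
    ref_variation_id integer NOT NULL,
    value            character varying(255) NOT NULL,
    PRIMARY KEY (ref_variation_id, value)
);

-- The join table now carries the genotype's identifying data itself.
CREATE TABLE genotypes_individuals (
    individual_id    integer NOT NULL REFERENCES individuals (id),
    ref_variation_id integer NOT NULL,
    value            character varying(255) NOT NULL,
    PRIMARY KEY (individual_id, ref_variation_id, value),
    FOREIGN KEY (ref_variation_id, value)
        REFERENCES genotypes (ref_variation_id, value)
);

-- All of one individual's genotypes, with no join at all:
SELECT * FROM genotypes_individuals WHERE individual_id = 2946;

The join disappears because the data you wanted from genotypes now lives in the join table itself; the trade-off is wider rows and wider indexes, which is presumably why some queries got slower.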