Optimal database table optimization method_问答_开发者

I have a database table that is growing too big (few hundred million rows) that needs to be optimized, but before I get into partitioning it, I thought I'd ask about suggestions.

Here is the usage:

0 . Table contains about 10 columns of length about 20 bytes each.

INSERTS are performed at a rate of hundreds of times per second.
SELECT statements are performed based on column 'a' (where a='xxxx' ) a few times per hour.
DELETE statements are performed based on a DATE column. (delete where date older than 1 year) usually once per day.

The key requirement is to speed up INSERT and SELECT statements, and be able to keep history data of 1 year back without locking the whole table down while deleting.

I would guess that I must have two indexes, one for column 'a', and the other for the date field. or is it possible to optimize both?

Will there be a necessary trade-off between speed on select and speed of delete?

Is partitioning the only solution? What are good strategies for partitioning开发者_开发问答 such table?

I'm using a PostgreSQL 8.4 database.

Rather than keeping it a single physical table, have you looked into PostgreSQL partitioning? It's supported as of version 8.1.

Partitioning can help you avoid the problem of choosing between fast INSERT vs fast DELETE performance. You can always partition the table by Year/Month, and just drop the partitions that you no longer need. Dropping partitions is extremely fast, and inserting into small partitions is also extremely fast.

From the manual:

Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:

Query performance can be improved dramatically for certain kinds of queries.

Update performance can be improved too, since each piece of the table has indexes smaller than an index on the entire data set would be. When an index no longer fits easily in memory, both read and write operations on the index take progressively more disk accesses.

Bulk deletes may be accomplished by simply removing one of the partitions, if that requirement is planned into the partitioning design. DROP TABLE is far faster than a bulk DELETE, to say nothing of the ensuing VACUUM overhead.

Seldom-used data can be migrated to cheaper and slower storage media.

The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.

Currently, PostgreSQL supports partitioning via table inheritance. Each partition must be created as a child table of a single parent table. The parent table itself is normally empty; it exists just to represent the entire data set. You should be familiar with inheritance (see Section 5.8) before attempting to implement partitioning.

Partitioning is your answer, as others stated, but:

I'd partition on some hash(a). If a is an integer then a%256 would be good. If it is a text then something like substring(md5(a) for 2).

It will speed up inserts and selects.

For deletes I'd make them run more often but smaller and also partitioned. I'd run them every hour (at XX:30) and like this:

delete from table_name
where date<(current_date - interval '1 year')
and
  hash(a)
  =
  (extract(doy from current_timestamp) * 24
    + extract(hour from current_timestamp))::int % 256;

EDIT: I've just tested this:

create function hash(a text) returns text as $$ select substring(md5($1) for 1) $$ language sql immutable strict;
CREATE TABLE tablename (id text, mdate date);
CREATE TABLE tablename_partition_0 ( CHECK ( hash(id) = '0' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_1 ( CHECK ( hash(id) = '1' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_2 ( CHECK ( hash(id) = '2' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_3 ( CHECK ( hash(id) = '3' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_4 ( CHECK ( hash(id) = '4' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_5 ( CHECK ( hash(id) = '5' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_6 ( CHECK ( hash(id) = '6' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_7 ( CHECK ( hash(id) = '7' ) ) INHERITS (tablename); 
CREATE TABLE tablename_partition_8 ( CHECK ( hash(id) = '8' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_9 ( CHECK ( hash(id) = '9' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_a ( CHECK ( hash(id) = 'a' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_b ( CHECK ( hash(id) = 'b' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_c ( CHECK ( hash(id) = 'c' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_d ( CHECK ( hash(id) = 'd' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_e ( CHECK ( hash(id) = 'e' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_f ( CHECK ( hash(id) = 'f' ) ) INHERITS (tablename);
analyze;
explain select * from tablename where id='bar' and hash(id)=hash('bar');

                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Result  (cost=0.00..69.20 rows=2 width=36)
   ->  Append  (cost=0.00..69.20 rows=2 width=36)
         ->  Seq Scan on tablename  (cost=0.00..34.60 rows=1 width=36)
               Filter: ((id = 'bar'::text) AND ("substring"(md5(id), 1, 1) = '3'::text))
         ->  Seq Scan on tablename_partition_3 tablename  (cost=0.00..34.60 rows=1 width=36)
               Filter: ((id = 'bar'::text) AND ("substring"(md5(id), 1, 1) = '3'::text))
(6 rows)

You'd need to add hash(id)=hash('searched_value') to your queries or Postgres will search all tables.

EDIT: You can also use rule system for automatic insertions to correct tables:

create rule tablename_rule_0 as
  on insert to tablename where hash(NEW.id)='0'
  do instead insert into tablename_partition_0 values (NEW.*);
create rule tablename_rule_1 as
  on insert to tablename where hash(NEW.id)='1'
  do instead insert into tablename_partition_1 values (NEW.*);
-- and so on
insert into tablename (id) values ('a');
select * from tablename_partition_0;
 id | mdate 
----+-------
 a  | 
(1 row)

One solution is to partition based on insert date.

Namely, your application (or DAO) decides which table to insert into based on some logic which combines current date (or rather time since last partition slice was started) and/or te size of "last" partition. Or offload such logic into a daily script, and have the script populate some "this is the partition to use" for DAO's use.

This immediately removes your need to delete "old" rows (just drop the old partition); it also ensures that your inserts periodically start populating small table, which, among other things, speeds up the "average" INSERT/SELECT speed (worst case scenarios are still just as slow of course)

If you were to break this table up into proper partitioning, you would be able to use truncate instead of delete which would decrease your maintenance costs as it does not create dead space.

I am no expert but it seems that partitioning on the column "a" would speed up your selects but partitioning on the date (as all the other answers are suggesting) would speed up the deleting (drop the table) but would be useless for your select.

It seems, both cases would enhance the insert performance.

Any expert care to weight in on the issue ? Is it possible / useful to partition on both fields ?