Hi,
I have a table which holds files and their types, such as:
CREATE TABLE files (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    filetype VARCHAR(255),
    ...
);
and another table holding file properties, such as:
CREATE TABLE properties (
    id SERIAL PRIMARY KEY,
    file_id INTEGER CONSTRAINT fk_files REFERENCES files(id),
    size INTEGER,
    ... -- other property fields
);
The file_id field has an index.
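For reference, a minimal sketch of what that index could look like (the index name here is just illustrative):

CREATE INDEX idx_properties_file_id ON properties (file_id);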
The files table has around 800k rows, and the properties table around 200k (not all files necessarily have/need properties).
I want to run aggregating queries, for example finding the average size and standard deviation for each file type. But it's very slow - around 70 seconds for that query. I understand it needs a sequential scan, but that still seems like too much. Here's the query
SELECT f.filetype, avg(size), stddev(size) FROM files as f, properties as pr
WHERE f.id = pr.file_id GROUP BY f.filetype;
and the EXPLAIN ANALYZE output
HashAggregate (cost=140292.20..140293.94 rows=116 width=13) (actual time=74013.621..74013.954 rows=110 loops=1)
-> Hash Join (cost=6780.19..138945.47 rows=179564 width=13) (actual time=1520.104..73156.531 rows=179499 loops=1)
Hash Cond: (f.id = pr.file_id)
-> Seq Scan on files f (cost=0.00..108365.41 rows=1140941 width=9) (actual time=0.998..62569.628 rows=805270 loops=1)
-> Hash (cost=3658.64..3658.64 rows=179564 width=12) (actual time=1131.053..1131.053 rows=179499 loops=1)
-> Seq Scan on properties pr (cost=0.00..3658.64 rows=179564 width=12) (actual time=0.753..557.171 rows=179574 loops=1)
Total runtime: 74014.520 ms
Any ideas why it is so slow/how to make it faster?
Have you defined reasonable settings for server parameters like shared_buffers, work_mem and effective_cache_size? http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
In particular, I think work_mem will affect how much of the hash table for the join can be kept in memory as opposed to on disk. Also, a reduced random_page_cost might nudge the planner towards using a merge join instead. Try temporarily setting enable_hashjoin to off and see if that produces a plan that works better.
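A minimal sketch of how you could experiment with these per session (the values are placeholders to tune for your own hardware, not recommendations):

SHOW work_mem;
SET work_mem = '64MB';       -- session-only, for testing
SET enable_hashjoin = off;   -- nudge the planner away from the hash join
EXPLAIN ANALYZE
SELECT f.filetype, avg(size), stddev(size)
FROM files AS f
JOIN properties AS pr ON f.id = pr.file_id
GROUP BY f.filetype;
RESET enable_hashjoin;
RESET work_mem;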
I don't know much about PostgreSQL, but I'd:
- make sure filetype has an index, perhaps a covering index on (filetype, id)
- try rewriting the query like this:
SELECT f.filetype
, avg_size
, stddev_size
FROM files as f
INNER JOIN (
SELECT file_id
, avg(size) as avg_size
, stddev(size) as stddev_size
FROM properties
GROUP BY
file_id
) p ON p.file_id = f.id
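For the index suggested above, a minimal sketch (the index name is made up, adjust to taste):

CREATE INDEX idx_files_filetype_id ON files (filetype, id);

Whether the pre-aggregated subquery actually helps will depend on how many properties rows each file has; measure both versions with EXPLAIN ANALYZE either way.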