Postgres - Slow simple join with where-clause

I am having some trouble optimizing a query, and was hoping that someone here might be able to provide a few pointers.

I have two tables:

CREATE TABLE "blog_cached_posts" (
    "id" int4 NOT NULL DEFAULT nextval('blog_cached_posts_id_seq'::regclass),
    "title" varchar(255),
    "content" text,
    "content_encoded" text,
    "published_at" timestamp(6) NULL,
    "written_by" varchar(255),
    "link" varchar(255),
    "blog_id" int4,
    "guid" varchar(255),
    "created_at" timestamp(6) NULL,
    "updated_at" timestamp(6) NULL,
    "is_highlighted_post" bool DEFAULT false
)

With an index on blog_cached_posts.blog_id

CREATE TABLE "blogs" (
    "id" int4 NOT NULL DEFAULT nextval('blogs_id_seq'::regclass),
    "site_id" int4,
    "image_id" int4,
    "name" varchar(255),
    "description" text,
    "url" varchar(255),
    "rss_feed_url" varchar(255),
    "active" bool DEFAULT true,
    "created_at" timestamp(6) NULL,
    "updated_at" timestamp(6) NULL,
    "date_highlighted" date,
    "highlighted_category_feed_url" varchar(255),
    "position" int4
)

With an index on blogs.site_id
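
Those two indexes were created with statements along these lines (the index names here are illustrative):

CREATE INDEX index_blog_cached_posts_on_blog_id ON blog_cached_posts (blog_id);
CREATE INDEX index_blogs_on_site_id ON blogs (site_id);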

This is the query:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
join blogs on blogs.id = blog_cached_posts.blog_id
WHERE ((published_at IS NOT NULL) AND (blogs.site_id = 80))
ORDER BY published_at desc
LIMIT 5

Here is an EXPLAIN ANALYZE:

Limit  (cost=9499.16..9499.17 rows=5 width=1853) (actual time=118.538..118.539 rows=5 loops=1)
  ->  Sort  (cost=9499.16..9626.31 rows=50861 width=1853) (actual time=118.536..118.537 rows=5 loops=1)
        Sort Key: blog_cached_posts.published_at
        Sort Method:  top-N heapsort  Memory: 33kB
        ->  Hash Join  (cost=16.25..8654.38 rows=50861 width=1853) (actual time=0.186..82.910 rows=48462 loops=1)
              Hash Cond: (blog_cached_posts.blog_id = blogs.id)
              ->  Seq Scan on blog_cached_posts  (cost=0.00..7930.94 rows=52954 width=1853) (actual time=0.042..56.635 rows=52950 loops=1)
                    Filter: (published_at IS NOT NULL)
              ->  Hash  (cost=13.21..13.21 rows=243 width=4) (actual time=0.135..0.135 rows=243 loops=1)
                    ->  Seq Scan on blogs  (cost=0.00..13.21 rows=243 width=4) (actual time=0.007..0.089 rows=243 loops=1)
                          Filter: (site_id = 80)
Total runtime: 118.591 ms

Is there any way to optimize this beyond the ~120ms it is currently taking?

EDIT

Here is what I ended up doing. (After reading the comment by @ypercube)

I added an index to blog_cached_posts:

CREATE INDEX \"blog_cached_posts_published_at\" ON \"public\".\"blog_cached_posts\" USING bt开发者_如何学运维ree(published_at DESC NULLS LAST);
COMMENT ON INDEX \"public\".\"blog_cached_posts_published_at\" IS NULL;

And I changed the select to the following:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
join blogs on blogs.id = blog_cached_posts.blog_id
WHERE published_at is not null and blogs.site_id = 80
ORDER BY published_at desc nulls last
LIMIT 5

This brought the execution time down to ~3ms.

Here is the new execution plan:

Limit  (cost=0.00..3.85 rows=5 width=1849) (actual time=0.027..0.047 rows=5 loops=1)
  ->  Nested Loop  (cost=0.00..39190.01 rows=50872 width=1849) (actual time=0.026..0.046 rows=5 loops=1)
        ->  Index Scan using blog_cached_posts_published_at on blog_cached_posts  (cost=0.00..24175.16 rows=52965 width=1849) (actual time=0.017..0.023 rows=5 loops=1)
              Filter: (published_at IS NOT NULL)
        ->  Index Scan using blogs_pkey on blogs  (cost=0.00..0.27 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=5)
              Index Cond: (blogs.id = blog_cached_posts.blog_id)
              Filter: (blogs.site_id = 80)
Total runtime: 0.086 ms


Your issue is that you cannot realistically use an index to pull the needed 5 posts outright. Hop over to index dos and don'ts for a moment.

(blog_id, published_at) (suggested in comments) would arguably help if you were querying for a specific blog, but the most selective constraint in your query is the one on site_id -- i.e. on a separate table altogether.

Seq Scan on blogs  (cost=0.00..13.21 rows=243 width=4) (actual time=0.007..0.089 rows=243 loops=1)
  Filter: (site_id = 80)

The above means that either you have no index on site_id, or that this particular site_id is frequent enough that Postgres walks the whole table, since it would have to read most of it regardless.

This then leads to multiple blog ids, and these are used to retrieve all of the relevant posts using a hash join. But since multiple blogs are involved, the best PG can do is to grab all applicable posts and subsequently top-n sort them.

Even if you were to change it so that you pass the blog ids directly in an IN() clause, an index on (blog_id, published_at) would not yield the needed rows in order. So it would still grab all posts for all of the applicable blogs, and top-n sort the mess.
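
To illustrate, the IN() variant described above would look something like this (the blog ids are made up):

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
WHERE blog_id IN (12, 34, 56) -- hypothetical ids of the blogs with site_id = 80
  AND published_at IS NOT NULL
ORDER BY published_at desc
LIMIT 5

Even here, the index can return the rows for each individual blog id in order, but not across all of them at once, so a sort remains necessary.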

One way to work around the problem would be to change your schema slightly:

CREATE TABLE "blog_cached_posts" (
    "id" int4 NOT NULL DEFAULT nextval('blog_cached_posts_id_seq'::regclass),
    ...
    "published_at" timestamp(6) NULL,
    "site_id" int4,
    "blog_id" int4,
    ...
)

Note the extra site_id. This allows you to subsequently create an index on (site_id, published_at desc nulls last) and rewrite your query like:

SELECT "blog_cached_posts".*
FROM "blog_cached_posts"
WHERE site_id = 80
ORDER BY published_at desc nulls last
LIMIT 5
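
For completeness, the index to pair with that query would be along these lines (the name is illustrative; the DESC NULLS LAST matches the ORDER BY, so the index can feed the sort directly):

CREATE INDEX blog_cached_posts_site_id_published_at ON blog_cached_posts (site_id, published_at DESC NULLS LAST);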

An alternative approach, per comments, is to maintain a latest_site_posts table using triggers. You'll end up with something similar to the above suggestion, but with a smaller table (i.e. less duplication of the site_id).
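
Here is a minimal sketch of that trigger approach, assuming an insert-only feed (all names are made up, and in practice you would want matching UPDATE and DELETE triggers to keep the copy honest):

CREATE TABLE latest_site_posts (
    post_id      int4 PRIMARY KEY REFERENCES blog_cached_posts (id),
    site_id      int4 NOT NULL,
    published_at timestamp(6) NULL
);

CREATE INDEX latest_site_posts_site_id_published_at ON latest_site_posts (site_id, published_at DESC NULLS LAST);

CREATE FUNCTION sync_latest_site_posts() RETURNS trigger AS $$
BEGIN
    -- Denormalize site_id from blogs so the hot query needs no join.
    INSERT INTO latest_site_posts (post_id, site_id, published_at)
    SELECT NEW.id, b.site_id, NEW.published_at
    FROM blogs b
    WHERE b.id = NEW.blog_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blog_cached_posts_sync_latest
AFTER INSERT ON blog_cached_posts
FOR EACH ROW EXECUTE PROCEDURE sync_latest_site_posts();

The hot query then filters and sorts on latest_site_posts alone, joining back to blog_cached_posts only for the five winning rows.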


As I mentioned in a comment, I would first try adding a simple index on published_at. It seems that without the ORDER BY and LIMIT 5 clauses the query would be quite efficient, and all the other needed indexes already exist.

Therefore, adding an index on the field used for the final sort is usually quite effective.

As Dems explained in his answer:

Because the index (blog_id, published_at) is ordered in a way that is good for the join, it turns out to be less good for the sort. On those grounds, you might see value in two indexes instead of one (on blog_id and published_at separately).


As per my comment on Denis' answer, and Thomas Mueller's comment...

An index on blog_cached_posts is needed to avoid a sequential scan of that table. By making the index cover (blog_id, published_at), the query can follow logic similar to the following...
1. Filter the blogs table for the site_id (80)
2. For each record found, join to blog_cached_posts
3. Use the (blog_id, published_at) index to identify the records needed
4. Possibly merge that index on itself to allow a fast sort

Because the index is in a state that is good for the join, it turns out to be less good for the sort. On those grounds, you might see value in two indexes instead of one (on blog_id and published_at separately.)


EDIT

In a comment on Denis' answer you say that adding a column could prove problematic due to legacy apps. I can think of a couple of tips that may help there...

Try creating a new table with the fields you need, and replacing the original table with a view on to the new table. The new table can have, for example, default values in the additional columns.
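
One way that could look (names hypothetical; note that on older Postgres versions, writes through the view need rules or INSTEAD OF triggers):

ALTER TABLE blog_cached_posts RENAME TO blog_cached_posts_data;
ALTER TABLE blog_cached_posts_data ADD COLUMN site_id int4;

-- Legacy apps keep reading "blog_cached_posts" unchanged:
CREATE VIEW blog_cached_posts AS
SELECT id, title, content, content_encoded, published_at, written_by,
       link, blog_id, guid, created_at, updated_at, is_highlighted_post
FROM blog_cached_posts_data;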

Or create a mapping table with the fields you need to relate together for your query. This mapping table can then be maintained via a trigger.


Because your best search predicate is site_id, switch the select order around so you select from blogs first and jump straight in via its index. Also, put the published_at IS NOT NULL check into the join so rows are cut off as early as possible, like this:

SELECT blog_cached_posts.*
FROM blogs
join blog_cached_posts on blogs.id = blog_cached_posts.blog_id AND published_at IS NOT NULL
WHERE blogs.site_id = 80
ORDER BY published_at desc
LIMIT 5

Note that this solution does not require any new indexes, nor would it benefit from them.

Please let us know how this query performs in comparison.


  • Use LIMIT (it fetches only a limited number of rows from the table).
  • Instead of blog_cached_posts.*, select only the required columns, e.g. blog_cached_posts.title, blog_cached_posts.link.
  • Avoid unneeded "" quotes in the select statement: SELECT blog_cached_posts.* FROM blog_cached_posts