Optimizing a category filter

Source: https://www.devze.com, 2023-04-01 16:42 (from the web)
This recent question had me thinking about optimizing a category filter.

Suppose we wish to create a database referencing a huge number of audio tracks, with their release date and a list of world locations from which the audio track is downloadable.

The requests we wish to optimize for are:

  • Give me the 10 most recent tracks downloadable from location A.
  • Give me the 10 most recent tracks downloadable from locations A or B.
  • Give me the 10 most recent tracks downloadable from locations A and B.

How would one go about structuring that database? I have a hard time coming up with a simple solution that doesn't require reading through all the tracks for at least one location...


To optimise these queries, you need to slightly de-normalise the data.

For example, you may have a track table that contains the track's id, name and release date, and a map_location_to_track table that describes where those tracks can be downloaded from. To answer "10 most recent tracks for location A" you need to get ALL of the tracks for location A from map_location_to_track, then join them to the track table to order them by release date, and pick the top 10.

If instead all the data is in a single table, the ordering step can be avoided. For example...

CREATE TABLE map_location_to_track (
  location_id   INT,
  track_id      INT,
  release_date  DATETIME,
  PRIMARY KEY (location_id, release_date, track_id)
)

SELECT * FROM map_location_to_track
WHERE location_id = A
ORDER BY release_date DESC LIMIT 10

Having location_id as the first entry in the primary key ensures that the WHERE clause is simply an index seek. Then there is no requirement to re-order the data; it's already ordered for us by the primary key, so we just pick the 10 records at the end.

You may indeed still join on to the track table to get the name, price, etc, but you now only have to do that for 10 records, not everything at that location.
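That final join might look something like the following (a sketch, assuming the track table from earlier has an id column as its primary key):

SELECT t.*
FROM (
  SELECT track_id FROM map_location_to_track
  WHERE location_id = A
  ORDER BY release_date DESC LIMIT 10
) AS recent
INNER JOIN track AS t
  ON t.id = recent.track_id

The inner query touches only the index, and the join then does at most 10 primary-key lookups on track.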


To solve the same query for "locations A OR B", there are a couple of options that can perform differently depending on the RDBMS you are using.

The first is simple, though some RDBMS don't play nice with IN...

SELECT track_id, release_date FROM map_location_to_track
WHERE location_id IN (A, B)
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10

The next option is nearly identical, but still some RDBMS don't play nice with OR logic being applied to INDEXes.

SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A or location_id = B
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10

In either case, the algorithm being used to rationalise the list of records down to 10 is hidden from you. It's a matter of try it and see; the index is still available such that this CAN be performant.
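"Try it and see" usually means asking the RDBMS for its plan. In MySQL, for example, you can prefix the query with EXPLAIN (other engines have an equivalent, e.g. EXPLAIN ANALYZE in PostgreSQL):

EXPLAIN
SELECT track_id, release_date FROM map_location_to_track
WHERE location_id IN (A, B)
GROUP BY track_id, release_date
ORDER BY release_date DESC LIMIT 10

What you want to see is a range scan on the primary key; a full table scan or a filesort step suggests the optimiser isn't using the index the way we hoped.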

An alternative is to explicitly determine part of the approach in your SQL statement...

SELECT
  *
FROM
(
  SELECT track_id, release_date FROM map_location_to_track
  WHERE location_id = A
  ORDER BY release_date DESC LIMIT 10

  UNION

  SELECT track_id, release_date FROM map_location_to_track
  WHERE location_id = B
  ORDER BY release_date DESC LIMIT 10
)
  AS data
ORDER BY
  release_date DESC
LIMIT 10

-- NOTE: This is a UNION and not a UNION ALL
--       The same track can be available in both locations, but should only count once
--       It's in place of the GROUP BY in the previous 2 examples

It is still possible for an optimiser to realise that these two unioned data sets are ordered, and so make the external order by very quick. Even if not, however, ordering 20 items is pretty quick. More importantly, it's a fixed overhead: it doesn't matter if you have a billion tracks in each location, we're just merging two lists of 10.


The hardest to optimise is the AND condition, but even then the existence of the "TOP 10" constraint can help work wonders.

Adding a HAVING clause to the IN or OR based approaches can solve this, but, again, depending on your RDBMS, may run less than optimally.

SELECT track_id, release_date FROM map_location_to_track
WHERE location_id = A or location_id = B
GROUP BY track_id, release_date
HAVING COUNT(*) = 2
ORDER BY release_date DESC LIMIT 10


The alternative is to try the "two queries" approach...

SELECT
  location_a.*
FROM
(
  SELECT track_id, release_date FROM map_location_to_track
  WHERE location_id = A
)
  AS location_a
INNER JOIN  
(
  SELECT track_id, release_date FROM map_location_to_track
  WHERE location_id = B
)
  AS location_b
    ON  location_a.release_date = location_b.release_date
    AND location_a.track_id     = location_b.track_id
ORDER BY
  location_a.release_date DESC
LIMIT 10

This time we can't restrict the two sub-queries to just 10 records; for all we know the most recent 10 in location A don't appear in location B at all. The primary key rescues us again though. The two data sets are organised by release date, so the RDBMS can just start at the top record of each set and merge the two until it has 10 records, then stop.

NOTE: Because the release_date is in the primary key, and before the track_id, one should ensure that it is used in the join.

Depending on the RDBMS, you don't even need the sub-queries. You may be able to just self-join the table without altering the RDBMS' plan...

SELECT
  location_a.*
FROM
  map_location_to_track AS location_a
INNER JOIN  
  map_location_to_track AS location_b
    ON  location_a.release_date = location_b.release_date
    AND location_a.track_id     = location_b.track_id
WHERE
      location_a.location_id = A
  AND location_b.location_id = B
ORDER BY
  location_a.release_date DESC
LIMIT 10


All in all, the combination of three things makes this pretty efficient:
- Partially De-Normalising the data to ensure it's in a friendly order for our needs
- Knowing we only ever need the first 10 results
- Knowing we're only ever dealing with 2 locations at the most


There are variations that can optimise to any number of records and any number of locations, but these are significantly less performant than the problem stated in this question.
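For instance, the HAVING approach from earlier generalises to any number of locations (a sketch; like the two-location version, it must scan every row for each listed location before it can filter):

SELECT track_id, release_date FROM map_location_to_track
WHERE location_id IN (A, B, C)
GROUP BY track_id, release_date
HAVING COUNT(*) = 3   -- must be present in all 3 listed locations
ORDER BY release_date DESC LIMIT 10

Dropping the HAVING clause gives the "any of N locations" variant instead.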


In a classic relational schema you would have a many-to-many relationship between tracks and locations in order to avoid redundancy:

CREATE TABLE tracks (
  id   INT,
  ...
  release_date  DATETIME,
  PRIMARY KEY (id)
)

CREATE TABLE locations (
  id   INT,
  ...
  PRIMARY KEY (id)
)

CREATE TABLE tracks_locations (
  location_id   INT,
  track_id      INT,
  ...
  PRIMARY KEY (location_id, track_id)
)

SELECT tracks.* FROM tracks_locations LEFT JOIN tracks ON tracks.id = tracks_locations.track_id
WHERE tracks_locations.location_id = A
ORDER BY tracks.release_date DESC LIMIT 10

You could modify that schema using table partitions by location. Problem is that it depends on implementation issues or usage constraints. For example, AFAIK in MySQL you cannot have foreign keys in partitioned tables. To solve this you could also have a collection of tables (call it "partitioning by hand") like tracks_by_location_#, where # is the ID of a known location. These tables could store filtered results and be created/updated/deleted using triggers.
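A "partitioning by hand" trigger could be sketched like this (assuming MySQL syntax; tracks_by_location_1 is a hypothetical per-location table for the location with ID 1):

CREATE TABLE tracks_by_location_1 (
  track_id      INT,
  release_date  DATETIME,
  PRIMARY KEY (release_date, track_id)
);

CREATE TRIGGER trg_tracks_locations_ai
AFTER INSERT ON tracks_locations
FOR EACH ROW
  INSERT INTO tracks_by_location_1 (track_id, release_date)
  SELECT NEW.track_id, t.release_date
  FROM tracks AS t
  WHERE t.id = NEW.track_id
    AND NEW.location_id = 1;  -- only mirror rows belonging to location 1

You would need one such table and matching UPDATE/DELETE triggers per location, which is the maintenance cost you trade for the fast per-location reads.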