MySQL tagging question: how to select an item that has been tagged as X, Y, and Z?_问答_开发者

I'm dealing with a database where items are "tagged" a certain number of times.

item (100k rows)

id
name
other stuff

tag (10k rows)

id
name

item2tag (1,000,000 rows)

item_id
tag_id
count

I'm looking for the fastest solution to:

Select items that have been tagged as X, Y, and Z (where X, Y, and Z correspond to (possibly) tag names) ?

Here's what I have so far... I'd just like to make sure I'm doing it in the best way possible:

First get the tag_ids from the names:

SELECT tag.id WHERE name IN ("X","Y","Z");

Then I group by those tag_ids and use Having to filter the result:

SELECT ite开发者_高级运维m2tag.*, count(tag_id)
  FROM item2tag
  WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;

Then I can just select from item with those ids.

SELECT * FROM item WHERE id IN ([results from prior query])

I have millions of rows in item2tag, with an index on (item_id, tag_id). Is this going to be the fastest solution?

The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3

You should ensure that you have the following indexes:

Primary key on (item_id, tag_id)
Index on (tag_id).

I performance tested this query against the original in a few different scenarios.

For the case where nearly every item in the table is tagged with at least one of the tags being searched for, the original query takes about 5 seconds and the JOIN version takes about 10 seconds - slightly slower.
For the case where two of the tags occur very frequently and one of the tags occurs only very rarely the original query takes about 0.9 seconds, whereas the JOIN query takes just 0.003 seconds - a considerable performance improvement.

The SQL I used to make performance test is pasted below. You can run this test yourself or modify it slightly and test other queries, or different scenarios.

Warning: Don't run this script on your production database as it modifies the contents of the item2tag table. Running the script can take a few minutes as it creates a lot of data.

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$
CALL prc_filler(1000000);

CREATE TABLE item2tag (
    item_id INT NOT NULL,
    tag_id INT NOT NULL,
    count INT NOT NULL
);

INSERT INTO item2tag (item_id, tag_id, count)
SELECT  id % 150001, id % 10, 1
FROM    filler;
ALTER TABLE item2tag ADD PRIMARY KEY (item_id, tag_id);
ALTER TABLE item2tag ADD KEY (tag_id);

-- Make tag 3 occur rarely.    
UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0;

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

SELECT item_id
FROM item2tag
WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;

You'll be better placed having an index that has tag_id as the first column - otherwise finding all items with tag_id 1 will require a full table scan (same for any tag_id, of course).

Depending on how many items are tagged with individual tags, you might do it by getting list of items tagged with one tag, and then filtering it for occurences of other tags, like this:

select item_id from item2tag
where item_id in (
    select item_id from item2tag
    where item_id in (
        select item_id from item2tag where tag_id = TID1
    ) and tag_id = TID2
) and tag_id = TID3