Finding duplicate records in mysql based on a bitmask_问答_开发者

I have a mysql table which stores maintenance logs for sensors. I'd like to design a query that finds instances where a given sensor was repaired/maintained for the same reason. (Recurring problem finder.)

My table (simplified) looks like this:

id    name     mask
==    ====     ====
11    alpha    0011
12    alpha    0010
13    alpha    0100
14    beta     0001

The mask field is a bitmask where each position represents a particular type of repair. I was able to successfully figure out how to compare the bitmask (per this question) but trying to incorporate it into a query is proving more difficult than I thought.

Given the above sample records, only id's 11 and 12 apply, since they both have a 1 in the third mask position.

Here's what I've tried and why it didn't work:

1. Never finishes...

This query seems to run forever, I don't think it is working the way I want.

SELECT t1.id, t1.name
FROM data t1
LEFT OUTER JOIN dat开发者_运维知识库a t2
ON (CONV(t1.mask,2,10) & CONV(t2.mask,2,10) > 0)
GROUP BY t1.name
HAVING COUNT(*) >1;

2. Incomplete query...

I was thinking of creating a view, to only parse the sensors that actually have more than one entry in the table. I wasn't sure where to go from here.

SELECT COUNT(t1.name) AS times, t1.name, t1.id, t1.mask
FROM data AS t1
GROUP BY t1.name ASC
HAVING times > 1;

Any suggestions on this?

Since the database structure was not designed with the realities of RDBMs in mind (probably not your doing, I just have to make the point anyway…), the performance will always be poor, though it is possible to write a query that will finish.

Jim is correct in that the query results in a cartesian product. If that query were to be returned ungrouped and unfiltered, you could expect (SELECT POW(COUNT(*), 2) FROM data) results. Also, any form of outer join is unnecessary, so a standard inner join is what you want here (not that it ought to make a difference in terms of performance, it's just more appropriate).

Also another condition of the join, t1.id != t2.id is necessary, lest each record match itself.

SELECT t1.id, t1.name
FROM data t1
JOIN data t2
     ON t1.name = t2.name
     AND t1.id != t2.id     //
WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0
GROUP BY t1.name
HAVING COUNT(*) > 1;

Your incomplete query:

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE t1.name IN (SELECT t2.name FROM data t2 GROUP BY t2.name HAVING COUNT(*) > 1);

SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE EXISTS (SELECT 1 FROM data t2 WHERE t2.name = t1.name GROUP BY t2.name HAVING COUNT(*) > 1);

Off the top of my head I can't tell you which of those would perform best. If data.name is indexed (and I would hope it is), the cost for either query ought to be rather low. The former will cache a copy of the subselect, whereas the latter will perform multiple queries against the index.

One very basic optimization (while leaving the table structure as a whole untouched) would be to convert the mask field to an unsigned integer data type, thereby saving many calls to CONV().

WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0

becomes

WHERE t1.mask & t2.mask > 0

Of course, breaking the data down further does make more sense. Instead of storing a bitmask in one record, break out all the ones bits into separate records

id    name     mask
==    ====     ====
11    alpha    1101

would become

id    name    value
==    ====    =====
11    alpha       1
12    alpha       4
13    alpha       8

Now, a strategically placed index on name and value makes the query a piece of cake

SELECT name, value
FROM data
GROUP BY name, value
HAVING COUNT(*) > 1;

I hope that this helps.

Break the mask bits out in real columns. RDMBs don't like bit fields.

Your join results in a cartesian product of the table with itself. Add `t1.name=t2.name' to the join, giving a bunch of (much) smaller cartesian products, one per unique name, which will speed things up considerably.