I have a table "test" containing millions of entries. Each row contains a floating point "feature" and a "count" of how often this feature is present in item "id". The primary key for this table is the combination of "id" and "feature", i.e. every item may have multiple features. There are usually a couple of hundred to a couple of thousand feature entries per item id.
create table test
(
id int not null,
feature double not null,
count int not null
);
The task is to find the 500 most similar items to a given reference item. Similarity is measured in number of identical feature values in both items. The query I have come up with is quoted below, but despite properly using indices its execution plan still contains "using temporary" and "using filesort", giving unacceptable performance for my use case.
select
t1.id,
t2.id,
sum( least( t1.count, t2.count )) as priority
from test as t1
inner join test as t2
on t2.feature = t1.feature
where t1.id = {some user supplied id value}
group by t1.id, t2.id
order by priority desc
limit 500;
Any ideas on how to improve on this? The schema can be modified and indices added as needed.
With the current schema, this query can hardly be improved. You already have an index on feature, and that is the best you can do with the current schema design.
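For reference, the index assumed here would look something like this (the index name is illustrative; the primary key on (id, feature) already covers lookups by id):

CREATE INDEX ix_test_feature ON test (feature);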
The problem is that "more similar than" is not an ordering relation. If a is more similar to b than it is to c, it does not imply that c is less similar to a than it is to b. Hence, you cannot build a single index describing this relationship; you would need to do it for each item separately, which would make your index N^2 entries long, where N is the number of items.
If you always need only the top 500 items, you can limit your index to that figure (in which case it will hold 500 * N entries).
MySQL does not support indexed or materialized views, so you will have to do it yourself:
Create a table like this:
CREATE TABLE similarity (
    id1 INT NOT NULL,
    id2 INT NOT NULL,
    similarity DOUBLE NOT NULL,
    PRIMARY KEY (id1, id2),
    KEY (id1, similarity)
);
Whenever you insert a new feature into the table, reflect the changes in the similarity table, maintaining both directions of the symmetric relation:

INSERT INTO similarity (id1, id2, similarity)
SELECT @newid, id, LEAST(@newcount, count)
FROM test
WHERE feature = @newfeature
    AND id <> @newid
ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);

INSERT INTO similarity (id1, id2, similarity)
SELECT id, @newid, LEAST(@newcount, count)
FROM test
WHERE feature = @newfeature
    AND id <> @newid
ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);
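To keep the base insert and the similarity maintenance together, the three statements could be wrapped in a stored procedure. This is only a sketch; the procedure name and parameter names are made up:

DELIMITER $$
CREATE PROCEDURE add_feature(IN p_id INT, IN p_feature DOUBLE, IN p_count INT)
BEGIN
    -- Insert the new feature row itself.
    INSERT INTO test (id, feature, count)
    VALUES (p_id, p_feature, p_count);

    -- Maintain both directions of the symmetric similarity relation.
    INSERT INTO similarity (id1, id2, similarity)
    SELECT p_id, id, LEAST(p_count, count)
    FROM test
    WHERE feature = p_feature AND id <> p_id
    ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);

    INSERT INTO similarity (id1, id2, similarity)
    SELECT id, p_id, LEAST(p_count, count)
    FROM test
    WHERE feature = p_feature AND id <> p_id
    ON DUPLICATE KEY UPDATE similarity = similarity + VALUES(similarity);
END$$
DELIMITER ;

Called as, e.g., CALL add_feature(123, 0.75, 4); (values hypothetical).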
On a timely basis, remove the excess similarities:
DELETE s
FROM (
    SELECT id1,
        (
            SELECT similarity
            FROM similarity si
            WHERE si.id1 = s.id1
            ORDER BY si.id1 DESC, si.similarity DESC
            LIMIT 499, 1
        ) AS cs
    FROM (
        SELECT DISTINCT id1
        FROM similarity
    ) s
) q
JOIN similarity s ON s.id1 = q.id1 AND s.similarity < q.cs;
Query your data:
SELECT id2
FROM similarity
WHERE id1 = @myid
ORDER BY similarity DESC
LIMIT 500;
Having a floating point number as part of the primary key (PK) is a killer: floating point values cannot be compared reliably for equality. For that matter, it should not be part of any constraint, whether a unique key (UK) or a foreign key (FK).
To improve the performance of your SQL query many fold, try changing your schema as below:
CREATE TABLE test (
    item_id INTEGER NOT NULL,
    feature_id INTEGER NOT NULL,
    count INTEGER NOT NULL,
    PRIMARY KEY (item_id, feature_id)
);

CREATE TABLE features (
    id INTEGER PRIMARY KEY,
    feature_value DOUBLE NOT NULL
);

CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    item_description VARCHAR(100) NOT NULL
);

ALTER TABLE test ADD CONSTRAINT fk_test_item_id
    FOREIGN KEY (item_id) REFERENCES items (id);
ALTER TABLE test ADD CONSTRAINT fk_test_feature_id
    FOREIGN KEY (feature_id) REFERENCES features (id);
With your test table normalized as above, items and features each live in their own table, and test becomes a pure mapping table carrying the count of each mapping. A sketch of how existing data could be migrated into this layout follows.
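Assuming the original single table has been renamed to test_old, and that features.id is declared AUTO_INCREMENT (otherwise assign the ids explicitly), the migration could look like this; the item descriptions are placeholders:

-- Give every distinct feature value a surrogate integer id.
INSERT INTO features (feature_value)
SELECT DISTINCT feature FROM test_old;

-- Placeholder descriptions; replace with real item data if available.
INSERT INTO items (id, item_description)
SELECT DISTINCT id, CONCAT('item ', id) FROM test_old;

-- Rebuild the mapping table against the surrogate feature ids.
-- Joining on the double is safe here, since both sides hold
-- the exact same stored values.
INSERT INTO test (item_id, feature_id, count)
SELECT o.id, f.id, o.count
FROM test_old AS o
JOIN features AS f ON f.feature_value = o.feature;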
If you now run the query you had before, with the small modifications shown below, you should see a significant improvement in performance.
select
    t1.item_id,
    t2.item_id,
    sum( least( t1.count, t2.count )) as priority
from test as t1
inner join test as t2 on t2.feature_id = t1.feature_id
where t1.item_id = {some user supplied id value}
group by t1.item_id, t2.item_id
order by priority desc
limit 500;
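Since the join now compares integers, a covering index on the mapping table lets the whole self-join run from the index; the index name below is arbitrary:

-- Lookup by feature_id, with item_id and count served
-- from the index itself (no table reads needed).
CREATE INDEX ix_test_feature_item ON test (feature_id, item_id, count);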
Cheers!
One optimization would be to exclude the item itself from the self-join:
inner join test as t2
on t2.feature = t1.feature and t2.id <> t1.id
^^^^^^^^^^^^^^
For further speedup, create a covering index on (feature, id, count).
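For example (the index name is arbitrary):

ALTER TABLE test ADD INDEX ix_test_feature_id_count (feature, id, count);

With this index in place, the self-join can be resolved entirely from the index without touching the table rows.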
I would start with this... I'd love to hear back on the performance you see. I don't think you need the LEAST() of the t1 vs t2 counts. If you first qualify the WHERE on id = {some value}, you obviously get all of that item's "features". Then, via a self-join restricted to the matching "features", you get a count. Since you are breaking it down by id1 and id2, each respective "feature" is counted once. At the end of this query, since I'm not explicitly excluding t2.id equal to the {some user value}, its count should be the EXACT SAME as the count of features in t1, and anything below that would be your other closest matches.
I would ensure I had an index on ID and FEATURE.
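Given that the question's primary key is already (id, feature), that lookup is covered; if it were missing, a minimal version would be:

-- Redundant if (id, feature) is already the primary key.
CREATE INDEX ix_test_id_feature ON test (id, feature);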
select STRAIGHT_JOIN
t1.id,
t2.id,
count(*) as MatchedInBoth
from
test as t1,
test as t2
where
t1.id = {some user value}
and t1.feature = t2.feature
group by
t1.id,
t2.id
order by
MatchedInBoth desc
limit
500;
The result might give something like:

t1             t2             MatchedInBoth
{user value}   {user value}   275
{user value}   Other ID 1     270
{user value}   Other ID 2     241
{user value}   Other ID 3     218
{user value}   Other ID 4     197
{user value}   Other ID 5     163, etc.
Can you knock it down to just one table? Using subqueries you might be able to avoid the join, and it will be a win if the subqueries are fast, indexed, and executed exactly once. Something like this (untested):
select
    t2.id,
    SUM( t2.count ) as priority
from test as t2
where t2.id <> {some user supplied id value} AND
    t2.count > (SELECT MIN(count) FROM test t1 WHERE id = {some user supplied id value} ) AND
    t2.feature IN (SELECT feature FROM test t1 WHERE id = {some user supplied id value} )
group by t2.id
order by priority desc
limit 500;
If that doesn't work: MySQL is terrible at realizing that the inner selects are constant tables and will re-execute them for each row. Wrapping them in another select forces a constant table lookup. Here's a hack:
select
    t2.id,
    SUM( t2.count ) as priority
from test as t2
where t2.id <> {some user supplied id value} AND
    t2.count > (
        SELECT * FROM (
            SELECT MIN(count) FROM test t1 WHERE id = {some user supplied id value}
        ) as const ) AND
    t2.feature IN ( SELECT * from (
        SELECT feature FROM test t1 WHERE id = {some user supplied id value}
    ) as const )
group by t2.id
order by priority desc
limit 500;