MySQL, need some performance suggestions on my match query_问答_开发者

I need some performance improvement guidance, my query takes several seconds to run and this is causin开发者_开发百科g problems on the server. This query runs on the most common page on my site. I think a radical rethink may be required.

~ EDIT ~ This query produces a list of records whose keywords match those of the program (record) being queried. My site is a software download directory. And this list is used on the program listing page to show other similar programs. PadID is the primary key of the program records in my database.

~ EDIT ~

Heres my query

 select match_keywords.PadID, count(match_keywords.Word) as matching_words 
 from keywords current_program_keywords 
 inner join keywords match_keywords on
       match_keywords.Word=current_program_keywords.Word 
 where match_keywords.Word IS NOT NULL 
 and current_program_keywords.PadID=44243 
 group by match_keywords.PadID 
 order by matching_words DESC 
 LIMIT 0,11;

Heres the query explained.

MySQL, need some performance suggestions on my match query

Heres some sample data, however I doubt you'd be able to see the effects of any performance tweaks without more data, which I can provide if you'd like.

 CREATE TABLE IF NOT EXISTS `keywords` (
   `Word` varchar(20) NOT NULL,
   `PadID` bigint(20) NOT NULL,
   `LetterIdx` varchar(1) NOT NULL,
   KEY `Word` (`Word`),
   KEY `LetterIdx` (`LetterIdx`),
   KEY `PadID_2` (`PadID`,`Word`)
 ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

 INSERT INTO `keywords` (`Word`, `PadID`, `LetterIdx`) VALUES
 ('tv', 44243, 'T'),
 ('satellite tv', 44243, 'S'),
 ('satellite tv to pc', 44243, 'S'),
 ('satellite', 44243, 'S'),
 ('your', 44243, 'X'),
 ('computer', 44243, 'C'),
 ('pc', 44243, 'P'),
 ('soccer on your pc', 44243, 'S'),
 ('sports on your pc', 44243, 'S'),
 ('television', 44243, 'T');

I've tried adding an index, but this doesn't make much difference.

 ALTER TABLE `keywords` ADD INDEX ( `PadID` )

You might find this helpful if I understood you correctly. The solution takes advantage of innodb's clustered primary key indexes (http://pastie.org/1195127)

EDIT: here's some links that may prove of interest:

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html

drop table if exists programmes;
create table programmes
(
prog_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;

insert into programmes (name) values 
('prog1'),('prog2'),('prog3'),('prog4'),('prog5'),('prog6');


drop table if exists keywords;
create table keywords
(
keyword_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;

insert into keywords (name) values 
('tv'),('satellite tv'),('satellite tv to pc'),('pc'),('computer');


drop table if exists programme_keywords;
create table programme_keywords
(
keyword_id mediumint unsigned not null,
prog_id mediumint unsigned not null,
primary key (keyword_id, prog_id), -- note clustered composite primary key
key (prog_id)
)
engine=innodb;

insert into programme_keywords values 

-- keyword 1
(1,1),(1,5),

-- keyword 2
(2,2),(2,4),

-- keyword 3
(3,1),(3,2),(3,5),(3,6),

-- keyword 4
(4,2),

-- keyword 5
(5,2),(5,3),(5,4);

/*
efficiently list all other programmes whose keywords match that of the 
programme currently being queried (for instance prog_id = 1) 
*/


drop procedure if exists list_matching_programmes;

delimiter #

create procedure list_matching_programmes
(
in p_prog_id mediumint unsigned
)
proc_main:begin

select
 p.*
from
 programmes p
inner join
(
 select distinct -- other programmes with same keywords as current
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from programme_keywords where prog_id = p_prog_id
 ) current_programme -- the current program keywords
 on pk.keyword_id = current_programme.keyword_id
 inner join programmes p on pk.prog_id = p.prog_id 

) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

end proc_main #


delimiter ;

call list_matching_programmes(1);
call list_matching_programmes(6); 


explain
select
 p.*
from
 programmes p
inner join
(
 select distinct
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from programme_keywords where prog_id = 1
 ) current_programme
 on pk.keyword_id = current_programme.keyword_id
 inner join programmes p on pk.prog_id = p.prog_id 

) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

EDIT: added char_idx functionality as requested

alter table keywords add column char_idx char(1) null after name;

update keywords set char_idx = upper(substring(name,1,1));

select * from keywords;

explain
select
 p.*
from
 programmes p
inner join
(
 select distinct
  pk.prog_id
 from
  programme_keywords pk
 inner join
 (
  select keyword_id from keywords where char_idx = 'P' -- just change the driver query
 ) keywords_starting_with
 on pk.keyword_id = keywords_starting_with.keyword_id
) matches 
on matches.prog_id = p.prog_id
order by
 p.prog_id;

Try this approach, not sure if it will help but at least is different:

select PadID, count(Word) as matching_words
from keywords k
where Word in (
  select Word 
  from keywords
  where PadID=44243 )
group by PadID 
order by matching_words DESC 
LIMIT 0,11

Anyway the job you want to get done is heavy, and full of string comparison, maybe exporting keywords and storing only numeric ids in the keyword table can reduce the times.

Ok after reviewing you database I think there is not a lot of room to improve in the query, in fact on my test server with index on Word it only takes about 0.15s to complete, without the index it is almost 4x times slower.

Anyway I think that implementing the change in database sctructure f00 and I have told you it will improve the response time.

Also drop the index PadID_2 as it is now it is futile and it will only slow your writes. What you should do but it requise to clean the database is to avoid duplicate keyword-prodId pair first removing al duplicate ones currently in DB (around 90k in my test with 3/4 of your DB) that will reduce query time and give meaningfull results. If you ask for a progId that has the keyword ABC that is duplicated for progdID2 then progID2 will be on top o other progIDs with the same ABC keyword but not duplicated, on my tests I have seen a progID that get several more matches that the same progID I am querying. After dropping duplicates from the DB you will need to change your application to avoid this problem again in the future and just for being safe you could add a primary key (or index with unique activated) to Word + ProgID.