select rows that all exist in other table - how to improve performance?_问答_开发者

I've three tables in SQL Server 2008 R2:

PRODUCTS (id int, title varchar(100), ....)  
WORDS (id int,word varchar(100) )  
WORDS_IN_TITLES (product_id int, word_id int)

and now I want to select all products that have certain words used in its title.

Now i do it that way:

declare  @words tp_intList  
insert into @words values(154)  
insert into @words values(172)  
declare @wordsCnt int = (select count(*) from @words)    

select * from products where id IN  
(
 select product_id from WORDS_IN_TITLES inner join 
 (select id from @words) wrds ON wrds.id=WORDS_IN_TITLES.word_id 
 group by product_id开发者_运维知识库 HAVING count(*)=@wordsCnt
)

It works but it's slow. Table contains 600k rows, and returning 3.5k rows takes around 4 seconds. I need it to go much below 1 second. How can I improve the performance?

select products.*
    from products
        inner join (select p.id
                        from products p
                            inner join words_in_titles wit
                                on p.id = wit.product_id
                        where wit.word_id in (154,172)
                        group by p.id
                        having count(distinct wit.word_id) = 2) q
           on products.id = q.id

It doesn't look like your query can be improved much.

Below is a sample table constructed with 600k rows for products and nearly 600k rows for words_in_titles. For every 2 word_ids picked randomly, there should be about 3 to 10 products matching the combination.

Create tables and populate with data. Create index also on words_in_titles (word_id)

create table products (id int identity primary key clustered, title varchar(100))
insert into products
select convert(varchar(max),NEWID())
from master..spt_values a
inner join master..spt_values b on b.type='p' and b.number between 0 and 999
where a.type='P' and a.number between 0 and 600

create table words_in_titles (product_id int, word_id int,
    primary key clustered(product_id, word_id))
insert words_in_titles
select distinct a,b
from
(
select floor(convert(bigint,convert(varbinary(max),newid())) % 60000) a, floor(convert(bigint,convert(varbinary(max),newid())) % 1000) b
from master..spt_values a
inner join master..spt_values b on b.type='p' and b.number between 0 and 999
where a.type='P' and a.number between 0 and 600
) x

create index ix_words_in_titles on words_in_titles(word_id)

Next time the different approaches. We use SET STATISTICS to see internal stats. You should also check the execution plan (but not while checking statistics - it pollutes the stats). DBCC commands are used to flush buffers and clear plans, set the @clean bit to 1 to clear between runs, and 0 to emulate during the day runs where data could already be in the buffer.

declare @clean bit set @clean = 1
if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

set statistics io off
set statistics time off

-- pick two random word_id's as generated (@word1 and @word2 used below)
declare @word1 int, @word2 int
select top 1 @word1 = word_id from words_in_titles order by NEWID()
select top 1 @word2 = word_id from words_in_titles where word_id <> @word1 order by NEWID()

declare  @words table (id int)  
insert into @words values(@word1)  
insert into @words values(@word2)  
declare @wordsCnt int = (select count(*) from @words)    

set statistics io on
set statistics time on

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select *
from
(
select w.product_id
from words_in_titles w
where w.word_id = @word1
  and exists (select * from words_in_titles t where t.word_id=@word2 and t.product_id=w.product_id)
  -- expand with more EXISTS clauses
) q inner join products p on p.id = q.product_id

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select *
from
(
select w1.product_id
from words_in_titles w1
where w1.word_id = @word1
intersect
select w2.product_id
from words_in_titles w2
where w2.word_id = @word2
) q inner join products p on p.id = q.product_id

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select * from products where id IN  
(
 select product_id from WORDS_IN_TITLES inner join 
 (select id from @words) wrds ON wrds.id=WORDS_IN_TITLES.word_id 
 group by product_id HAVING count(*)=@wordsCnt
)

if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache')

select products.*
    from products
        inner join (select p.id
                        from products p
                            inner join words_in_titles wit
                                on p.id = wit.product_id
                        where wit.word_id in (@word1,@word2)
                        group by p.id
                        having count(distinct wit.word_id) = 2) q
           on products.id = q.id

The Execution Plans are not shown here, but if you view it, both alternatives and the original query are all quite similar.
The first two queries collect from words_in_titles twice, but each for a separate word_id, then use different strategies for working out the intersection
The original query streams only once from words_in_titles and aggregates instead but requires an additional I/O operation to sort in tempdb.
Joe's query uses a more complex plan.

Statistics: Look for batches that start with Table and the very next SQL Server Execution Times after it. Four such fragments will represent the 4 queries timed.

Table 'products'. Scan count 0, logical reads 30, physical reads 0, read-ahead reads 51, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'words_in_titles'. Scan count 2, logical reads 8, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#4D5F7D71'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 47 ms.

(a shown above, the original query uses a temporary sort).
If you run it enough times, you will see that the last query always performs slower than the rest, 2nd one is normally faster than the first and the 3rd (original) sometimes comes before or after the 1st and 2nd.