Super Slow Query - sped up, but not perfect... Please help

I posted a query yesterday (see here) that was horrible (took over a minute to run, resulting in 18,215 records):

SELECT DISTINCT 
    dbo.contacts_link_emails.Email, dbo.contacts.ContactID, dbo.contacts.First AS ContactFirstName, dbo.contacts.Last AS ContactLastName, dbo.contacts.InstitutionID, 
    dbo.institutionswithzipcodesadditional.CountyID, dbo.institutionswithzipcodesadditional.StateID,  dbo.institutionswithzipcodesadditional.DistrictID
FROM         
    dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3 
INNER JOIN
    dbo.contacts 
INNER JOIN
    dbo.contacts_link_emails 
        ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID 
        ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle 
INNER JOIN
    dbo.institutionswithzipcodesadditional 
        ON dbo.contacts.InstitutionID = dbo.institutionswithzipcodesadditional.InstitutionID 
LEFT OUTER JOIN
    dbo.contacts_def_jobfunctions 
INNER JOIN
    dbo.contacts_link_jobfunctions 
        ON dbo.contacts_def_jobfunctions.JobID = dbo.contacts_link_jobfunctions.JobID 
        ON dbo.contacts.ContactID = dbo.contacts_link_jobfunctions.ContactID
WHERE     
        (dbo.contacts.JobTitle IN
        (SELECT     JobID
        FROM          dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_1
        WHERE      (ParentJobID <> '1841'))) 
    AND
        (dbo.contacts_link_emails.Email NOT IN
        (SELECT     EmailAddress
        FROM          dbo.newsletterremovelist)) 
OR
        (dbo.contacts_link_jobfunctions.JobID IN
        (SELECT     JobID
        FROM          dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_2
        WHERE      (ParentJobID <> '1841')))
    AND 
        (dbo.contacts_link_emails.Email NOT IN
        (SELECT     EmailAddress
        FROM          dbo.newsletterremovelist AS newsletterremovelist)) 
ORDER BY EMAIL

With a lot of coaching and research, I've tuned it up to the following:

SELECT  contacts.ContactID,
        contacts.InstitutionID,
        contacts.First,
        contacts.Last,
        institutionswithzipcodesadditional.CountyID, 
        institutionswithzipcodesadditional.StateID,
        institutionswithzipcodesadditional.DistrictID
FROM    contacts 
    INNER JOIN contacts_link_emails ON 
    contacts.ContactID = contacts_link_emails.ContactID
    INNER JOIN institutionswithzipcodesadditional ON
    contacts.InstitutionID = institutionswithzipcodesadditional.InstitutionID
WHERE
    (contacts.ContactID IN
        (SELECT contacts_2.ContactID
        FROM contacts AS contacts_2
        INNER JOIN contacts_link_emails AS contacts_link_emails_2 ON
            contacts_2.ContactID = contacts_link_emails_2.ContactID
        LEFT OUTER JOIN contacts_def_jobfunctions ON 
            contacts_2.JobTitle = contacts_def_jobfunctions.JobID
        RIGHT OUTER JOIN newsletterremovelist ON 
            contacts_link_emails_2.Email = newsletterremovelist.EmailAddress
        WHERE (contacts_def_jobfunctions.ParentJobID <> 1841)
        GROUP BY contacts_2.ContactID
    UNION
        SELECT contacts_1.ContactID
        FROM contacts_link_jobfunctions
        INNER JOIN contacts_def_jobfunctions AS contacts_def_jobfunctions_1 ON
            contacts_link_jobfunctions.JobID = contacts_def_jobfunctions_1.JobID 
            AND contacts_def_jobfunctions_1.ParentJobID <> 1841 
        INNER JOIN contacts AS contacts_1 ON 
            contacts_link_jobfunctions.ContactID = contacts_1.ContactID
        INNER JOIN contacts_link_emails AS contacts_link_emails_1 ON
            contacts_link_emails_1.ContactID = contacts_1.ContactID
        LEFT OUTER JOIN newsletterremovelist AS newsletterremovelist_1 ON
        contacts_link_emails_1.Email = newsletterremovelist_1.EmailAddress
        GROUP BY contacts_1.ContactID))

While this query is now super fast (about 3 seconds), I've blown part of the logic somewhere - it only returns 14,863 rows (instead of the 18,215 rows that I believe is accurate).

The results seem near correct. I'm working to discover what data might be missing in the result set.

Can you please coach me through whatever I've done wrong here?

Thanks,

Russell Schutte

The main problem with your original query was that you had two extra joins just to introduce duplicates and then a DISTINCT to get rid of them.

Use this:

SELECT  cle.Email,
        c.ContactID,
        c.First AS ContactFirstName,
        c.Last AS ContactLastName,
        c.InstitutionID, 
        izip.CountyID,
        izip.StateID, 
        izip.DistrictID
FROM    dbo.contacts c
INNER JOIN
        dbo.institutionswithzipcodesadditional izip
ON      izip.InstitutionID = c.InstitutionID
INNER JOIN
        dbo.contacts_link_emails cle
ON      cle.ContactID = c.ContactID 
WHERE   cle.Email NOT IN
        (
        SELECT  EmailAddress
        FROM    dbo.newsletterremovelist
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    dbo.contacts_def_jobfunctions cdj
        WHERE   cdj.JobId = c.JobTitle
                AND cdj.ParentJobId <> '1841'
        UNION ALL
        SELECT  NULL
        FROM    dbo.contacts_link_jobfunctions clj
        JOIN    dbo.contacts_def_jobfunctions cdj
        ON      cdj.JobID = clj.JobID
        WHERE   clj.ContactID = c.ContactID
                AND cdj.ParentJobId <> '1841'
        )
ORDER BY
        email

Create the following indexes:

newsletterremovelist (EmailAddress)
contacts_link_jobfunctions (ContactID, JobID)
contacts_def_jobfunctions (JobID)
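
As a rough sketch, those three indexes could be created like this. The index names are made up for illustration (nothing in the question specifies them), and this assumes the tables live in the dbo schema as in the query above and that equivalent indexes don't already exist. If JobID is already the primary key of contacts_def_jobfunctions, the last one is likely redundant.

```sql
-- Index names (IX_...) are hypothetical; use whatever fits your naming convention.

-- Supports the NOT IN lookup against the removal list.
CREATE INDEX IX_newsletterremovelist_EmailAddress
    ON dbo.newsletterremovelist (EmailAddress);

-- Supports the correlated lookup by ContactID, with JobID available for the join.
CREATE INDEX IX_contacts_link_jobfunctions_ContactID_JobID
    ON dbo.contacts_link_jobfunctions (ContactID, JobID);

-- Supports the lookups by JobID in both branches of the EXISTS.
CREATE INDEX IX_contacts_def_jobfunctions_JobID
    ON dbo.contacts_def_jobfunctions (JobID);
```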

Do you get the same results when you do:

SELECT count(*)
FROM          
    dbo.contacts_def_jobfunctions AS contacts_def_jobfunctions_3  
INNER JOIN 
    dbo.contacts  
INNER JOIN 
    dbo.contacts_link_emails  
        ON dbo.contacts.ContactID = dbo.contacts_link_emails.ContactID  
        ON contacts_def_jobfunctions_3.JobID = dbo.contacts.JobTitle;

SELECT COUNT(*)        
FROM        
    contacts 
INNER JOIN contacts_link_jobfunctions 
    ON contacts.ContactID = contacts_link_jobfunctions.ContactID 
INNER JOIN  contacts_link_emails 
    ON contacts.ContactID = contacts_link_emails.ContactID

If so, keep adding each join condition until you no longer get the same results, and you will see where your mistake was. If all the joins match, then look at the WHERE clauses. But I will be surprised if it isn't in the first join, because the syntax you had originally won't even work on SQL Server; it is pretty nonstandard SQL and may have been incorrect all along without anyone knowing.

Alternatively, pick a few of the records that are returned by the original query but not by the revised one. Track them through the tables one at a time to see if you can find why the second query filters them out.
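
One way to find those records, sketched under some assumptions: the ContactID column of each query's result has first been saved into a temp table (for example by adding an INTO clause after the select list), and the server supports EXCEPT (SQL Server 2005 or later). The temp table names #original_ids and #revised_ids are hypothetical.

```sql
-- Assumption: #original_ids and #revised_ids (hypothetical names) each hold
-- the ContactID values returned by the original and the revised query.

-- Contacts returned by the original query but missing from the revised one.
SELECT ContactID FROM #original_ids
EXCEPT
SELECT ContactID FROM #revised_ids;

-- The reverse direction, in case the revised query also returns extras.
SELECT ContactID FROM #revised_ids
EXCEPT
SELECT ContactID FROM #original_ids;
```

A handful of ContactID values from the first result is usually enough to trace through the tables by hand.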

I'm not sure exactly what is wrong, but when I run into this situation, the first thing I do is start removing variables.

So, comment out the where clause. How many rows are returned?

If you get back the 11,604 rows then you've isolated the problem to the joins. Work through the joins, commenting each one out (remove the associated columns too) and figure out how many rows are eliminated.
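
For the revised query, that first step might look something like this: the same joins, with the WHERE clause commented out and the column list swapped for a count. It is only a sketch of the idea; the alias RowsBeforeFilter is my own.

```sql
SELECT  COUNT(*) AS RowsBeforeFilter
FROM    contacts
    INNER JOIN contacts_link_emails ON
    contacts.ContactID = contacts_link_emails.ContactID
    INNER JOIN institutionswithzipcodesadditional ON
    contacts.InstitutionID = institutionswithzipcodesadditional.InstitutionID
-- WHERE
--     (contacts.ContactID IN ( ...the UNION subquery... ))
;
```

From there, comment out one join at a time and rerun the count to see which join changes the number.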

As you do this, aim to find what is causing the desired rows to be eliminated. Once isolated, consider the join differences between the first query and the second query.
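
One join difference that stands out from reading the two queries (not from running them against your data): the original excludes removed addresses with NOT IN, while the revised query only outer joins to newsletterremovelist and never tests for the unmatched side. An outer join used as an exclusion normally needs an IS NULL check on the joined table. A minimal sketch of that pattern, using the table and column names from the question, with the aliases c, cle and nrl as my own shorthand:

```sql
-- Keep only contact/email pairs whose address is NOT on the removal list.
SELECT  c.ContactID,
        cle.Email
FROM    dbo.contacts AS c
INNER JOIN dbo.contacts_link_emails AS cle
        ON cle.ContactID = c.ContactID
LEFT OUTER JOIN dbo.newsletterremovelist AS nrl
        ON nrl.EmailAddress = cle.Email
WHERE   nrl.EmailAddress IS NULL;   -- no match found, so the address was not removed
```

In the revised query, the RIGHT OUTER JOIN branch combined with the ParentJobID filter appears to keep contacts that are on the removal list, and the LEFT OUTER JOIN branch appears to filter nothing. That may account for part of the difference in row counts.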

In looking at the first query, you could probably just modify it to eliminate the INs and use EXISTS instead.
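
As an illustration of that change (a sketch, not a tested replacement), here is the job-title test and the removal-list test from the first query written with EXISTS and NOT EXISTS. It covers only the first half of the original OR condition, and the aliases c, cle, cdj and nrl are my own. One thing to be aware of: NOT IN returns no rows at all if the subquery produces a NULL, whereas NOT EXISTS does not have that problem, so the two forms can give different results if EmailAddress is nullable.

```sql
SELECT  c.ContactID,
        cle.Email
FROM    dbo.contacts AS c
INNER JOIN dbo.contacts_link_emails AS cle
        ON cle.ContactID = c.ContactID
WHERE   EXISTS
        (
        -- replaces: c.JobTitle IN (SELECT JobID ... WHERE ParentJobID <> '1841')
        SELECT  1
        FROM    dbo.contacts_def_jobfunctions AS cdj
        WHERE   cdj.JobID = c.JobTitle
                AND cdj.ParentJobID <> '1841'
        )
        AND NOT EXISTS
        (
        -- replaces: cle.Email NOT IN (SELECT EmailAddress FROM dbo.newsletterremovelist)
        SELECT  1
        FROM    dbo.newsletterremovelist AS nrl
        WHERE   nrl.EmailAddress = cle.Email
        );
```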

Consider your indexes as well. Anything in the WHERE or JOIN clauses should probably be indexed.