I've got three tables in SQL Server 2005 (relevant columns included):
Patient (560K rows, has many Appts)
    ID (PK, Clustered)
Appt (5.8M rows, has many ApptCPTs)
    ID (PK, Unique, Non-clustered)
    PatientID (Indexed, Non-unique, Non-clustered)
ApptCPT (13.4M rows)
    ID (PK, Clustered)
    ApptID (Indexed, Non-unique, Non-clustered)
    CPTID (Indexed, Non-unique, Non-clustered)
I want to get the latest ApptCPT for each CPT code, for each patient, but the query I've got so far takes about a minute to return ~7M rows.
My current query:
SELECT
    p.ID AS PatientID,
    MAX(ac.ID) AS ApptCPTID,
    ac.CPTID
FROM
    Patient p
    INNER JOIN Appt a ON a.PatientID = p.ID
    INNER JOIN ApptCPT ac ON ac.ApptID = a.ID
GROUP BY
    p.ID, ac.CPTID
ORDER BY
    p.ID, ac.CPTID
Example output:
PatientID ApptCPTID CPTID
123456789 18627724 3088
123456789 7647238 3388
123456789 18627723 3419
123456789 9989220 3419
123456789 12190141 3448
123456789 18627721 3551
123456789 17264224 71020
123456789 15933265 77052
123456789 10095897 77057
123456789 5258166 77080
123456789 18627813 80053
123456789 18627814 80061
If I take the MAX out and add ac.ID to the GROUP BY clause, I can see that my original query is returning the correct data (the latest ApptCPT containing that CPTID for that patient), but it's horribly slow. For reference, here's the same output without the MAX:
PatientID ApptCPTID CPTID
123456789 18126508 3088
123456789 4596004 3088
123456789 18627724 3088
123456789 7647238 3388
123456789 18627723 3419
123456789 9989220 3419
123456789 12190141 3448
123456789 4595928 3551
123456789 9989218 3551
123456789 18627721 3551
123456789 17264224 71020
123456789 15933265 77052
123456789 10095897 77057
123456789 5258050 77057
123456789 5258166 77080
123456789 4595932 80053
123456789 18126505 80053
123456789 9989223 80053
123456789 18627813 80053
123456789 18627814 80061
Anyone have any ideas for making it (hopefully) much, much faster?
You want 7 million records at 12 bytes each (assuming 32-bit ints for each ID column; your CPT codes are probably larger, since in my experience they can have text components in certain cases). That's 84 megabytes of raw data, before any protocol overhead. I think one minute for that is pretty good: it takes roughly that long just to move that much data over the network from the server to your machine, which means the query itself must be handing out results almost instantly. Pare down your result set and your times will improve.
A minute to get back 7m rows? I doubt very much that you'll be able to do better unless you can convince your DBAs to add some indices to these tables and/or do some refactoring.
And while I'm pretty sure you're not able to change things, I'd argue that your data model is incorrect. From an E-R modelling perspective:
- Appointments are entities dependent upon the Patient (e.g., the existence of an appointment is predicated upon the existence of the patient who made the appointment. If a particular patient is deleted from the patient table, then any appointments tied to that patient must also be deleted.)
- It appears to me that the ApptCPT table is actually an associative entity (as defined in the entity-relationship model), dependent on both Appt and CPT, that implements a many:many relationship between Appt and CPT (there is a CPT table, correct?).
The primary key of a dependent entity is a composition of the primary key of the entity/entities upon which it depends, plus whatever other attributes contribute to the uniqueness of the dependent entity.
This means that PatientID gets distributed down to Appt and becomes part of its primary key, and all the components of the new Appt primary key get distributed down to ApptCPT to become part of its primary key. Once that refactoring is done, your primary keys should be as follows:
- Patient: PatientID
- Appt: PatientID + ID
- ApptCPT: PatientID + ApptID + CPTID
This assumes that the combination of appt+cpt is unique. If a given CPT code can be associated multiple times with a single Appt, then, of course, you'll need to add in the ID column.
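For illustration only, here's a rough DDL sketch of those composite keys. The constraint names, column types, and the CPT master table are assumptions, not taken from the original schema:

CREATE TABLE Appt (
    PatientID INT NOT NULL REFERENCES Patient (ID),
    ID        INT NOT NULL,   -- appointment identifier within the patient
    -- ... other appointment columns ...
    CONSTRAINT PK_Appt PRIMARY KEY CLUSTERED (PatientID, ID)
);

CREATE TABLE ApptCPT (
    PatientID INT NOT NULL,
    ApptID    INT NOT NULL,
    CPTID     INT NOT NULL REFERENCES CPT (ID),   -- assumes a CPT master table
    CONSTRAINT PK_ApptCPT PRIMARY KEY CLUSTERED (PatientID, ApptID, CPTID),
    CONSTRAINT FK_ApptCPT_Appt FOREIGN KEY (PatientID, ApptID)
        REFERENCES Appt (PatientID, ID)
);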
Now your query is a lot simpler: you just have to look at a single table (ApptCPT), and your query will almost certainly have a covering index.
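As a sketch, assuming the composite keys above and that higher ApptIDs mean later appointments, the query would only need to touch ApptCPT:

SELECT
    PatientID,
    MAX(ApptID) AS LatestApptID,   -- latest appointment carrying that CPT, assuming IDs increase over time
    CPTID
FROM ApptCPT
GROUP BY PatientID, CPTID
ORDER BY PatientID, CPTID;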
Cheers!
Have you tried using ROW_NUMBER() instead of the GROUP BY to find the first row? Sometimes I find it gives me a faster result, especially if the table has the right indexes.
SELECT *
FROM
(
    SELECT
        p.ID AS PatientID,
        ac.ID AS ApptCPTID,
        ac.CPTID,
        RowId = ROW_NUMBER() OVER (PARTITION BY p.ID, ac.CPTID ORDER BY ac.ID DESC)
    FROM Patient p
    INNER JOIN Appt a
        ON a.PatientID = p.ID
    INNER JOIN ApptCPT ac
        ON ac.ApptID = a.ID
) qq
WHERE qq.RowId = 1
ORDER BY qq.PatientID, qq.CPTID
First of all, if you can possibly change the database, I would encourage removing the ID column from ApptCPT. It's useless, and even if it's kept as the PK, it's a bad idea to make it the clustered index. Instead, I think the table should be clustered on ApptID and CPTID, with whichever one is accessed most often listed first. Then put a nonclustered index on the one that comes second.
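A sketch of what that might look like, assuming ApptID is the more common access path. The existing constraint name is an assumption, the rebuild would be expensive on a 13.4M-row table, and it requires that (ApptID, CPTID) be unique:

-- Assumes the current PK constraint is named PK_ApptCPT.
ALTER TABLE ApptCPT DROP CONSTRAINT PK_ApptCPT;
ALTER TABLE ApptCPT DROP COLUMN ID;   -- only if nothing else references the surrogate ID

ALTER TABLE ApptCPT
    ADD CONSTRAINT PK_ApptCPT PRIMARY KEY CLUSTERED (ApptID, CPTID);

-- Nonclustered index on the column that ends up second in the clustering key.
CREATE NONCLUSTERED INDEX IX_ApptCPT_CPTID ON ApptCPT (CPTID);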
I do not think you should add PatientID to the ApptCPT table when ApptID already fully implies the patient. Doing so violates normalization rules, allowing an ApptCPT row to simultaneously point to a patient and an appointment that are not related. You would be mixing two different levels/granularities of data.
It also seems to me that a ton of pain can be avoided if you don't name columns "ID" because then you don't have to deal with aliasing and column confusion. It took some real staring to figure out the below query for you because there were so many ID columns. Name columns the same everywhere they are used!
Your query is probably already the fastest it can be. If you had an index on ApptCPT with both CPTID and ApptID in it, then it would be a covering index and the main table could be completely avoided, which could speed up your query (because as it is now the engine has to access indexes twice from that table). What does your execution plan look like? Are you getting seeks or scans or what?
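If you want to try the covering-index idea, something like this might do it (the index name is made up; since ID is the clustered key on ApptCPT, SQL Server carries it in every nonclustered index on that table, so MAX(ac.ID) is still covered without listing it):

-- Hypothetical covering index for the original query:
-- seeks on ApptID, groups by CPTID, and ID rides along as the clustering key.
CREATE NONCLUSTERED INDEX IX_ApptCPT_ApptID_CPTID
    ON ApptCPT (ApptID, CPTID);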
Also, why do you need to return 7 million rows? This can't be for a user sitting there ready to consume it, so I don't see why a 1 minute response time is all that bad.
If you were only going to be selecting certain patients or certain CPTs, then this query could be an improvement over the entire-table query. But if you do need all 7 million rows, this is probably going to perform abysmally:
SELECT
    P.ID PatientID,
    X.ID ApptCPTID,
    X.CPTID
FROM
    Patient P
    CROSS JOIN CPTs C -- your master CPT table listing all CPTs
    CROSS APPLY (
        SELECT TOP 1 AC.ID, AC.CPTID
        FROM
            Appt A
            INNER JOIN ApptCPT AC ON A.ID = AC.ApptID
        WHERE
            A.PatientID = P.ID
            AND C.CPTID = AC.CPTID
        ORDER BY AC.ID DESC
    ) X
WHERE
    P.Something = 'Z'
    AND C.Code IN ('A', 'B')
However, one benefit of this query is that the column you use to choose the latest item doesn't have to be the one in the SELECT clause. You could say ORDER BY AC.Date DESC if you wanted. I don't know enough about what you're doing to really know for sure, but that could be important to you.
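For example, if ApptCPT carried a date column (AC.Date here is purely hypothetical, not in the schema above), the inner apply could become:

    CROSS APPLY (
        SELECT TOP 1 AC.ID, AC.CPTID
        FROM Appt A
        INNER JOIN ApptCPT AC ON A.ID = AC.ApptID
        WHERE A.PatientID = P.ID
            AND C.CPTID = AC.CPTID
        ORDER BY AC.Date DESC   -- hypothetical date column; "latest" no longer tied to the surrogate ID
    ) X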
Note: I just realized that the CROSS JOIN with OUTER APPLY would have returned a bunch of spurious rows, so I switched it to CROSS APPLY and everything should be okay. There's probably something that could be done to include nonqualifying patients as well, but I don't have time to think about it right now.