I've got three tables in SQL Server 2005 (relevant columns included):
Patient (560K rows, has many Appts)
    ID (PK, Clustered)
Appt (5.8M rows, has many ApptCPTs)
    ID (PK, Unique, Non-clustered)
    PatientID (Indexed, Non-unique, Non-clustered)
ApptCPT (13.4M rows)
    ID (PK, Clustered)
    ApptID (Indexed, Non-unique, Non-clustered)
    CPTID (Indexed, Non-unique, Non-clustered)
I want to get the latest ApptCPT for each CPT code, for each patient, but the query I've got so far takes about a minute to return ~7M rows.
My current query:
SELECT
    p.ID AS PatientID,
    MAX(ac.ID) AS ApptCPTID,
    ac.CPTID
FROM
    Patient p
    INNER JOIN Appt a ON a.PatientID = p.ID
    INNER JOIN ApptCPT ac ON ac.ApptID = a.ID
GROUP BY
    p.ID, ac.CPTID
ORDER BY
    p.ID, ac.CPTID
Example output:
PatientID ApptCPTID CPTID
123456789 18627724 3088
123456789 7647238 3388
123456789 18627723 3419
123456789 9989220 3419
123456789 12190141 3448
123456789 18627721 3551
123456789 17264224 71020
123456789 15933265 77052
123456789 10095897 77057
123456789 5258166 77080
123456789 18627813 80053
123456789 18627814 80061
If I take the MAX out and add ac.ID to the GROUP BY clause, I can see that my original query is returning the correct data (the latest ApptCPT containing that CPTID for that patient), but it's horribly slow. For reference, here's the same output without the MAX:
PatientID ApptCPTID CPTID
123456789 18126508 3088
123456789 4596004 3088
123456789 18627724 3088
123456789 7647238 3388
123456789 18627723 3419
123456789 9989220 3419
123456789 12190141 3448
123456789 4595928 3551
123456789 9989218 3551
123456789 18627721 3551
123456789 17264224 71020
123456789 15933265 77052
123456789 10095897 77057
123456789 5258050 77057
123456789 5258166 77080
123456789 4595932 80053
123456789 18126505 80053
123456789 9989223 80053
123456789 18627813 80053
123456789 18627814 80061
Anyone have any ideas for making it (hopefully) much, much faster?
You want 7 million records at 12 bytes each (assuming 32-bit ints for each ID column; your CPT codes are probably larger, since in my experience they can have text components in certain cases). That's 84 megabytes of raw data, before any protocol overhead. I think one minute for that is pretty good: it takes roughly that long just to move that much data over the network from the server to your machine, which means the query itself must be handing out results almost instantly. Pare down your result set and your times will improve.
A minute to get back 7m rows? I doubt very much that you'll be able to do better unless you can convince your DBAs to add some indices to these tables and/or do some refactoring.
And while I'm pretty sure you're not able to change things, I'd argue that your data model is incorrect. From an E-R modelling perspective:
- Appointments are entities dependent upon the Patient (e.g., the existence of an appointment is predicated upon the existence of the patient who made the appointment. If a particular patient is deleted from the patient table, then any appointments tied to that patient must also be deleted.)
- It appears to me that the ApptCPT table is actually an associative entity (as defined in the entity-relationship model), dependent on both Appt and CPT, that implements a many:many relationship between Appt and CPT (there is a CPT table, correct?).
The primary key of a dependent entity is a composition of the primary key of the entity/entities upon which it depends, plus whatever other attributes contribute to the uniqueness of the dependent entity.
This means that PatientID gets distributed down to Appt and becomes part of its primary key, and all the components of the new Appt primary key get distributed down to ApptCPT to become part of its primary key. Once that refactoring is done, your primary keys should be as follows:
- Patient: PatientID
- Appt: PatientID + ID
- ApptCPT: PatientID + ApptID + CPTID
This assumes that the combination of appt+cpt is unique. If a given CPT code can be associated multiple times with a single Appt, then, of course, you'll need to add in the ID column.
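For illustration only, here's a rough DDL sketch of those composite keys. The constraint names, column types, and the CPT master table are assumptions, not taken from the original schema:

CREATE TABLE Appt (
    PatientID INT NOT NULL REFERENCES Patient (ID),
    ID        INT NOT NULL,   -- appointment identifier within the patient
    -- ... other appointment columns ...
    CONSTRAINT PK_Appt PRIMARY KEY CLUSTERED (PatientID, ID)
);

CREATE TABLE ApptCPT (
    PatientID INT NOT NULL,
    ApptID    INT NOT NULL,
    CPTID     INT NOT NULL REFERENCES CPT (ID),   -- assumes a CPT master table
    CONSTRAINT PK_ApptCPT PRIMARY KEY CLUSTERED (PatientID, ApptID, CPTID),
    CONSTRAINT FK_ApptCPT_Appt FOREIGN KEY (PatientID, ApptID)
        REFERENCES Appt (PatientID, ID)
);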
Now your query is a lot simpler: you just have to look at a single table (ApptCPT), and your query will almost certainly have a covering index.
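As a sketch, assuming the composite keys above and that higher ApptIDs mean later appointments, the query would only need to touch ApptCPT:

SELECT
    PatientID,
    MAX(ApptID) AS LatestApptID,   -- latest appointment carrying that CPT, assuming IDs increase over time
    CPTID
FROM ApptCPT
GROUP BY PatientID, CPTID
ORDER BY PatientID, CPTID;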
Cheers!
Have you tried using ROW_NUMBER() instead of the GROUP BY to find the first row? Sometimes I find it gives me a faster result, especially if the table has the right indexes.
SELECT *
FROM
(
    SELECT
        p.ID AS PatientID,
        ac.ID AS ApptCPTID,
        ac.CPTID,
        RowId = ROW_NUMBER() OVER (PARTITION BY p.ID, ac.CPTID ORDER BY ac.ID DESC)
    FROM Patient p
    INNER JOIN Appt a
        ON a.PatientID = p.ID
    INNER JOIN ApptCPT ac
        ON ac.ApptID = a.ID
) qq
WHERE qq.RowId = 1
ORDER BY qq.PatientID, qq.CPTID
First of all, if you can possibly change the database, I would encourage removing the ID column from ApptCPT. It's useless, and even if it's kept as the PK, it's a bad idea to make it the clustered index. Instead, I think the table should be clustered on ApptID and CPTID, with whichever one is accessed most often listed first. Then put a nonclustered index on the one that comes second.
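A sketch of what that might look like, assuming ApptID is the more common access path. The existing constraint name is an assumption, the rebuild would be expensive on a 13.4M-row table, and it requires that (ApptID, CPTID) be unique:

-- Assumes the current PK constraint is named PK_ApptCPT.
ALTER TABLE ApptCPT DROP CONSTRAINT PK_ApptCPT;
ALTER TABLE ApptCPT DROP COLUMN ID;   -- only if nothing else references the surrogate ID

ALTER TABLE ApptCPT
    ADD CONSTRAINT PK_ApptCPT PRIMARY KEY CLUSTERED (ApptID, CPTID);

-- Nonclustered index on the column that ends up second in the clustering key.
CREATE NONCLUSTERED INDEX IX_ApptCPT_CPTID ON ApptCPT (CPTID);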
I do not think you should add PatientID to the ApptCPT table when ApptID already fully implies the patient. Doing so violates normalization rules, allowing an ApptCPT row to simultaneously point to a patient and an appointment that are not related. You would be mixing two different levels/granularities of data.
It also seems to me that a ton of pain can be avoided if you don't name columns "ID" because then you don't have to deal with aliasing and column confusion. It took some real staring to figure out the below query for you because there were so many ID columns. Name columns the same everywhere they are used!
Your query is probably already the fastest it can be. If you had an index on ApptCPT with both CPTID and ApptID in it, then it would be a covering index and the main table could be completely avoided, which could speed up your query (because as it is now the engine has to access indexes twice from that table). What does your execution plan look like? Are you getting seeks or scans or what?
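If you want to try the covering-index idea, something like this might do it (the index name is made up; since ID is the clustered key on ApptCPT, SQL Server carries it in every nonclustered index on that table, so MAX(ac.ID) is still covered without listing it):

-- Hypothetical covering index for the original query:
-- seeks on ApptID, groups by CPTID, and ID rides along as the clustering key.
CREATE NONCLUSTERED INDEX IX_ApptCPT_ApptID_CPTID
    ON ApptCPT (ApptID, CPTID);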
Also, why do you need to return 7 million rows? This can't be for a user sitting there ready to consume it, so I don't see why a 1 minute response time is all that bad.
If you were only going to be selecting certain patients or certain CPTs, then this query could be an improvement over the entire-table query. But if you do need all 7 million rows, this is probably going to perform abysmally:
SELECT
    P.ID PatientID,
    X.ID ApptCPTID,
    X.CPTID
FROM
    Patient P
    CROSS JOIN CPTs C -- your master CPT table listing all CPTs
    CROSS APPLY (
        SELECT TOP 1 AC.ID, AC.CPTID
        FROM
            Appt A
            INNER JOIN ApptCPT AC ON A.ID = AC.ApptID
        WHERE
            A.PatientID = P.ID
            AND C.CPTID = AC.CPTID
        ORDER BY AC.ID DESC
    ) X
WHERE
    P.Something = 'Z'
    AND C.Code IN ('A', 'B')
However, one benefit of this query is that the column you use to choose the latest item doesn't have to be the one in the SELECT clause. You could say ORDER BY AC.Date DESC if you wanted. I don't know enough about what you're doing to really know for sure, but that could be important to you.
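For example, if ApptCPT carried a date column (AC.Date here is purely hypothetical, not in the schema above), the inner apply could become:

    CROSS APPLY (
        SELECT TOP 1 AC.ID, AC.CPTID
        FROM Appt A
        INNER JOIN ApptCPT AC ON A.ID = AC.ApptID
        WHERE A.PatientID = P.ID
            AND C.CPTID = AC.CPTID
        ORDER BY AC.Date DESC   -- hypothetical date column; "latest" no longer tied to the surrogate ID
    ) X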
Note: I just realized that the CROSS JOIN with OUTER APPLY would have returned a bunch of spurious rows, so I switched it to CROSS APPLY and everything should be okay. There's probably something that could be done to include nonqualifying patients as well, but I don't have time to think about it right now.