I tried to execute the following query:
update ms
set user_B_total_duration = amc.total_duration
from monthly_statistics ms
inner join aggregate_monthly_conversations amc
on ms.user_B = amc.user_B
but the query has been executing for over 10 hours. There are about 23M records in each table (monthly_statistics and aggregate_monthly_conversations). The database engine is SQL Server 2008, and the PC is a quad-core 2.66 GHz with 4 GB of RAM.
Does anyone know whether it is possible to optimize the query above or perform the same task by any workaround?
If I was troubleshooting this, these are the things I'd be on the lookout for:
1. If practical, make sure no one is monopolizing the table (i.e., locking it)
2. Make sure the join columns are indexed (i.e., ms.user_B and amc.user_B)
3. Update the rows in batches using UPDATE TOP (100) ms SET ...
Item 3 is actually pretty important when making large inserts/updates/deletes. SQL Server generates transaction log so it can undo the operation if it fails partway through, and this gets increasingly expensive as the transaction grows. If you have to update 1M wide rows, it could well be much faster to operate on 20 batches of 50K rows. I've seen advice claiming that this makes a dramatic difference (and as far as I can tell, it does). It also keeps other requests for the table from queuing up as much.
But there are two caveats: 1. You will be committing every batch separately, so you want to make sure your operation can tolerate being "partially done". (I'm guessing this one can simply be restarted.) 2. You need to be able to tell which rows have already been updated.
So, in your case, maybe:
declare @update_date datetime;
set @update_date = getdate();

while 1 = 1
begin
    update top (10000) ms set
        user_B_total_duration = amc.total_duration,
        last_updated = @update_date
    from
        monthly_statistics ms
        inner join aggregate_monthly_conversations amc
            on ms.user_B = amc.user_B
    where
        ms.last_updated < @update_date;

    if @@rowcount = 0 break;
end
You could also throw in a print to tell you how far along you are.
You could, for example, insert all the data from the tables other than ms into an ad-hoc table; the update would then be easier to process: no more joins, and less data to read.
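A minimal sketch of that idea, assuming the column names from the question (the temp table name and index name are made up here):

```sql
-- Stage only the two columns the update needs, so the join reads
-- a narrow table instead of the full aggregate table.
select user_B, total_duration
into #amc_totals
from aggregate_monthly_conversations;

-- A clustered index on the join key keeps the update's join cheap.
create clustered index IX_amc_totals_user_B on #amc_totals (user_B);

update ms
set user_B_total_duration = t.total_duration
from monthly_statistics ms
inner join #amc_totals t
    on ms.user_B = t.user_B;
```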
Indexes on monthly_statistics.User_B and aggregate_monthly_conversations.User_B would be an excellent start, perhaps with total_duration included in the aggregate_monthly_conversations.User_B index so it covers the join.
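For example (index names are just suggestions; column names assumed from the question):

```sql
create index IX_ms_user_B
    on monthly_statistics (user_B);

-- INCLUDE makes this a covering index: the join can read total_duration
-- straight from the index without touching the base table.
create index IX_amc_user_B
    on aggregate_monthly_conversations (user_B)
    include (total_duration);
```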
It appears that you have totally maxed out the memory on this machine and SQL Server is paging memory out to disk. 4 GB of RAM is not much for tables of this size.
How long does it take to run this query? How many rows are returned?
select
'update monthly_statistics set user_B_total_duration = ' +
CAST(amc.total_duration as varchar(20)) +
' where user_B = ' +
CAST(ms.user_B as varchar(20)) + CHAR(13) + CHAR(10) + 'GO'
from monthly_statistics ms
inner join aggregate_monthly_conversations amc
on ms.user_B = amc.user_B
You could use this output to update the table.
A more basic question: why are you engaging in this expensive month-end denormalizing batch process in the first place, when you can get any user's total_duration with an ad-hoc query? What is the specific rationale for taking this non-RDBMS approach? Normally people resort to batch processing of month-end data when an ad-hoc query is too expensive and slow for ad-hoc reporting purposes. Is that so in your case?
With indexes on the joined columns, ms.user_b and amc.user_b, you should be able to get any user's total_duration with a simple join of your two tables. How many distinct users are there to be found among the 23M records? If column ms.user_b has low cardinality, perhaps a composite index such as (ms.user_b, timeperiod) or something comparable (we don't know your schema) would yield the required ad hoc performance without an unacceptable performance degradation on inserts/updates?
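The ad-hoc query would look something like this (a sketch; the user_B type and the @user_id value are assumptions, since we don't know your schema):

```sql
declare @user_id int;
set @user_id = 42;  -- whichever user is being reported on

-- With indexes on both user_B columns this is a cheap indexed join,
-- no denormalized user_B_total_duration column required.
select ms.user_B, amc.total_duration
from monthly_statistics ms
inner join aggregate_monthly_conversations amc
    on ms.user_B = amc.user_B
where ms.user_B = @user_id;
```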
If you have to leave things as they are, you could try a stored procedure in which you select the distinct set of AMC.user_ids into a cursor, and process the update of table MS one id at a time:
...
from monthly_statistics ms
inner join aggregate_monthly_conversations amc
on ms.user_B = amc.user_B and ms.user_b = @currentuserid
This too would require at least one simple index: on the ms.user_b column, or a composite index on (ms.user_b, {some other column(s)}).
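The cursor-driven variant might be sketched as follows (assuming user_B is an int; the cursor name is arbitrary):

```sql
declare @currentuserid int;

-- One id at a time keeps each update's transaction small.
declare user_cursor cursor fast_forward for
    select distinct user_B from aggregate_monthly_conversations;

open user_cursor;
fetch next from user_cursor into @currentuserid;

while @@fetch_status = 0
begin
    update ms
    set user_B_total_duration = amc.total_duration
    from monthly_statistics ms
    inner join aggregate_monthly_conversations amc
        on ms.user_B = amc.user_B
    where ms.user_B = @currentuserid;

    fetch next from user_cursor into @currentuserid;
end

close user_cursor;
deallocate user_cursor;
```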