Fastest way to do mass update

Let's say you have a table with about 5 million records and an nvarchar(max) column populated with large text data. You want to set this column to NULL where SomeOtherColumn = 1 in the fastest possible way.

The brute-force UPDATE does not work well here because it creates one large implicit transaction and takes forever.

Doing the update in small batches of 50K records at a time works, but it is still taking 47 hours to complete on a beefy 32-core/64GB server.

Is there any way to do this update faster? Are there any magic query hints or table options that sacrifice something else (like concurrency) in exchange for speed?

NOTE: Creating a temp table or temp column is not an option because this nvarchar(max) column involves lots of data and so consumes lots of space!

PS: Yes, SomeOtherColumn is already indexed.


From everything I can see, it does not look like your problem is related to indexes.

The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.

Since the column you are updating likely holds more than 8,000 bytes per value, it is stored off-page, which implies additional effort in reading this column when it is not NULL.

When you run a batch of 50,000 updates, SQL Server has to place it in an implicit transaction so that it can roll everything back in case of any problems. In order to roll back, it has to store the original value of the column in the transaction log.

Assuming (for simplicity's sake) that each value contains on average 10,000 bytes of data, 50,000 rows amount to around 500MB, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).
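If you want a rough idea of how much data a given batch will have to log before picking a batch size, a query along these lines can help; dbo.YourTable and BigTextColumn are placeholder names standing in for your actual table and nvarchar(max) column:

/* Placeholder names: substitute your real table and columns */
SELECT  COUNT(*)                                   AS RowsToUpdate,
        SUM(DATALENGTH(BigTextColumn)) / 1048576   AS ApproxMBToLog
FROM    dbo.YourTable
WHERE   SomeOtherColumn = 1
  AND   BigTextColumn IS NOT NULL;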

There is no way to disable logging, as that would compromise database integrity.

I ran a quick test on my dog-slow desktop: batches of even 10,000 rows become prohibitively slow, but bringing the batch size down to 1,000 rows, which implies a temporary log size of around 10MB, worked nicely.

I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly you should be able to update your entire 5 million rows in around 6 hours even on my single-processor, 2GB desktop, so I would expect something much better on your beefy server backed by a SAN.
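As a sketch of what that batching can look like (the table, column, and key names here are placeholders, and 1,000 is simply the batch size that worked for me), each iteration commits on its own, so the log only ever has to hold one batch's worth of old values:

/* Sketch only: dbo.YourTable, BigTextColumn and SomeOtherColumn are placeholder names */
DECLARE @Rows int;
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    UPDATE TOP (1000) dbo.YourTable
    SET    BigTextColumn = NULL
    WHERE  SomeOtherColumn = 1
      AND  BigTextColumn IS NOT NULL;

    SET @Rows = @@ROWCOUNT;   /* stop once no qualifying rows remain */
END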

You may want to run your update statement as a select, selecting only the primary key and the large nvarchar column, and ensure this runs as fast as you expect.
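For example, something like this (again with placeholder names, Id standing in for your primary key) exercises the same reads the update would have to do:

/* Placeholder names; reads only the key and the LOB column for rows that would be updated */
SELECT  Id, BigTextColumn
FROM    dbo.YourTable
WHERE   SomeOtherColumn = 1;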

Of course the bottleneck may be other users locking things, or contention for storage or memory on the server, but since you did not mention other users I will assume you have the DB in single-user mode for this.

As an optimization, ensure that the transaction log is on a different physical disk/disk group than the data, to minimize seek times.
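If the log file does have to be relocated, the usual sequence is to repoint the metadata, take the database offline, move the physical file, and bring it back online; the database name, logical file name and path below are placeholders:

/* Placeholder database name, logical log file name and target path */
ALTER DATABASE YourDb
MODIFY FILE (NAME = YourDb_log, FILENAME = 'L:\SQLLogs\YourDb_log.ldf');

ALTER DATABASE YourDb SET OFFLINE;
/* ...copy the .ldf file to L:\SQLLogs\ at the operating system level... */
ALTER DATABASE YourDb SET ONLINE;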


Hopefully you have already dropped any indexes on the column you are setting to NULL, including full-text indexes. As said before, turning off logging entirely would do the trick, but SQL Server does not allow it; backing up the transaction log will usually truncate the log files too.


You could set the database recovery model to SIMPLE to reduce logging, BUT do not do this without considering the full implications for a production environment.
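For reference, the switch itself is a one-liner; YourDb is a placeholder, and you would switch back and take a full backup once the mass update is done:

/* Placeholder database name; switch back and take a full backup when finished */
ALTER DATABASE YourDb SET RECOVERY SIMPLE;
/* ...run the mass update here... */
ALTER DATABASE YourDb SET RECOVERY FULL;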

What indexes are in place on the table? Given that batch updates of approx. 50,000 rows take so long, I would say you require an index.


Have you tried placing an index or statistics on someOtherColumn?


This really helped me. I went from 2 hours to 20 minutes with this.

/* I set the database recovery model to SIMPLE */
/* Update the table statistics first */

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

/* Your 50K update goes here, just to have a measure of the time it will take */

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

In my experience with MSSQL 2005, automatically moving 4 million 46-byte records every day (no nvarchar(max), though) from a table in one database to a table in a different database takes around 20 minutes on a quad-core, 8GB, 2GHz server, and it doesn't hurt application performance. By moving I mean INSERT INTO SELECT and then DELETE. CPU usage never goes over 30%, even though the table being deleted from has 28M records and constantly receives around 4K inserts per minute (but no updates). Well, that's my case; it may vary depending on your server load.
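A minimal sketch of that move pattern, assuming hypothetical source and destination tables, column names, and a date cutoff:

/* All names here are placeholders: ArchiveDb.dbo.DestTable, dbo.SourceTable, Id, Payload, CreatedDate */
DECLARE @Cutoff datetime;
SET @Cutoff = DATEADD(DAY, -1, GETDATE());   /* e.g. move everything older than one day */

INSERT INTO ArchiveDb.dbo.DestTable (Id, Payload, CreatedDate)
SELECT Id, Payload, CreatedDate
FROM   dbo.SourceTable
WHERE  CreatedDate < @Cutoff;

DELETE FROM dbo.SourceTable
WHERE  CreatedDate < @Cutoff;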

READ UNCOMMITTED

"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.

I don't know what rg-tsql means but here you'll find info about transaction isolation levels in MSSQL.


Try indexing 'SomeOtherColumn'... 50K records should update in a snap. If there is already an index in place, check whether it needs to be reorganized and whether statistics have been collected for it.
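For example (the index and table names below are placeholders):

/* Placeholder names throughout */
CREATE INDEX IX_YourTable_SomeOtherColumn ON dbo.YourTable (SomeOtherColumn);

/* Or, if the index already exists, defragment it and refresh its statistics */
ALTER INDEX IX_YourTable_SomeOtherColumn ON dbo.YourTable REORGANIZE;
UPDATE STATISTICS dbo.YourTable IX_YourTable_SomeOtherColumn;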


If you are running a production environment without enough space to duplicate all your tables, I believe you are looking for trouble sooner or later.

If you provide some info about the number of rows with SomeOtherColumn = 1, perhaps we can think of another approach, but I suggest:

0) Back up your table
1) Index the flag column
2) Set the table option to "no log transactions" ... if possible
3) Write a stored procedure to run the updates (a sketch follows below)
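A minimal sketch of such a stored procedure, assuming placeholder table and column names and reusing the small-batch idea from earlier in the thread:

/* Sketch only; dbo.YourTable, BigTextColumn and SomeOtherColumn are placeholders */
CREATE PROCEDURE dbo.ClearBigTextColumn
    @BatchSize int = 1000
AS
BEGIN
    SET NOCOUNT ON;

    WHILE 1 = 1
    BEGIN
        UPDATE TOP (@BatchSize) dbo.YourTable
        SET    BigTextColumn = NULL
        WHERE  SomeOtherColumn = 1
          AND  BigTextColumn IS NOT NULL;

        IF @@ROWCOUNT = 0 BREAK;   /* done once no qualifying rows remain */
    END
END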
