Efficiently bulk import data in SSIS with occasional PK duplicates?

I'm regularly loading a flat file with 100k records into a table after some transformations. The table has a PK on two columns. On the whole the data does not contain duplicate PK values, but occasionally there are duplicates.

I naively didn't understand why SSIS was rejecting all my records when only some of them violated the PK constraint. I believe the problem is that during a bulk load, if even 1 of the rows violates the PK constraint, all rows in that batch get rejected.

If I set the FastLoadMaxInsertCommitSize property of the OLE DB Destination to 1, it fixes the problem, but it then runs like a dog as it's committing after every single row.

In MySQL, the bulk load facility allows you to ignore PK errors and skip those rows without sacrificing performance. Does anyone know of a way to achieve this in SQL Server?

Any help much appreciated.


It sounds like you may be looking for IGNORE_DUP_KEY?

Using the IGNORE_DUP_KEY Option to Handle Duplicate Values

When you create or modify a unique index or constraint, you can set the IGNORE_DUP_KEY option ON or OFF. This option specifies the error response to duplicate key values in a multiple-row INSERT statement after the index has been created. When IGNORE_DUP_KEY is set to OFF (the default), the SQL Server Database Engine rejects all rows in the statement when one or more rows contain duplicate key values. When set to ON, only the rows that contain duplicate key values are rejected; the nonduplicate key values are added.

For example, if a single statement inserts 20 rows into a table with a unique index, and 10 of those rows contain duplicate key values, by default all 20 rows are rejected. However, if the index option IGNORE_DUP_KEY is set to ON, only the 10 duplicate key values will be rejected; the other 10 nonduplicate key values will be inserted into the table.
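As a rough sketch (the table and column names below are hypothetical, with a two-column primary key like the one described in the question), the option can be set when the key is created, or enabled later by rebuilding the index:

```sql
-- Minimal sketch: dbo.ImportTarget and its columns are made up for illustration.
CREATE TABLE dbo.ImportTarget
(
    KeyPart1 int          NOT NULL,
    KeyPart2 int          NOT NULL,
    Payload  varchar(100) NULL,
    CONSTRAINT PK_ImportTarget
        PRIMARY KEY CLUSTERED (KeyPart1, KeyPart2)
        WITH (IGNORE_DUP_KEY = ON)   -- duplicate rows are discarded with a warning
);

-- For an existing primary key, the option can be turned on by rebuilding its index:
ALTER INDEX PK_ImportTarget ON dbo.ImportTarget
    REBUILD WITH (IGNORE_DUP_KEY = ON);
```

With this in place, a bulk insert that contains some duplicate keys only drops those rows and raises a "Duplicate key was ignored" warning instead of failing the whole batch.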


You can raise FastLoadMaxInsertCommitSize to, say, 5k, which will speed up your inserts greatly. Then set the Error Output of the destination to redirect rows: any 5k batch that contains an error row gets sent on to a second destination. (This next bit is from memory!) If you configure that second destination not to use fast load, it will insert the good rows individually, and you can pass its error output to an error table or something like a row count task.

You can play with the FastLoadMaxInsertCommitSize figures until you find something that works well for you.
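The answer above mentions passing the error output to an error table. A minimal sketch of such a table follows; the table and column names are hypothetical, and the ErrorCode/ErrorColumn columns assume you map the metadata columns SSIS adds to its error output:

```sql
-- Hypothetical error table for the redirected rows; mirror the real
-- destination's columns. It has no PK or unique constraint, so rows with
-- duplicate keys are accepted and can be inspected later.
CREATE TABLE dbo.ImportTarget_Errors
(
    KeyPart1    int           NOT NULL,
    KeyPart2    int           NOT NULL,
    Payload     varchar(100)  NULL,
    ErrorCode   int           NULL,  -- from the SSIS error output, if mapped
    ErrorColumn int           NULL,  -- from the SSIS error output, if mapped
    LoadedAt    datetime2     NOT NULL DEFAULT SYSDATETIME()
);
```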
