Compare an array with a "very large" table of a SQL Server database

In a C# program I have an array with about 100,000 elements.

I also have a SQL Server 2008 table whose primary key column contains nearly all elements of the array (but a few are missing). The table can have up to 30,000,000 rows.

Now I want to determine which elements of the array do not exist in the table. How can this be achieved efficiently?


The most efficient method would probably be to bulk-insert those 100,000 elements into a temp table and then perform the comparison within the database itself.

(Note that I haven't tested this theory; it's just an educated guess.)
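For illustration, here is a minimal sketch of that idea using SqlBulkCopy into a session-scoped temp table, followed by a LEFT JOIN to find the values with no match. The table name dbo.BigTable, the int key column Id, connectionString and the sample values array are all placeholders for your own names and data.

// Sketch only: dbo.BigTable, Id, connectionString and values are placeholders.
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

string connectionString = "...";          // your connection string
int[] values = { 1, 2, 3 };               // in practice, your ~100,000 elements
var missing = new List<int>();

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // Session-scoped temp table, created on the same connection the bulk copy uses.
    using (var create = new SqlCommand("CREATE TABLE #Candidates (Id int PRIMARY KEY);", conn))
        create.ExecuteNonQuery();

    // Bulk-copy the array into the temp table.
    var staging = new DataTable();
    staging.Columns.Add("Id", typeof(int));
    foreach (int v in values)
        staging.Rows.Add(v);

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Candidates" })
        bulk.WriteToServer(staging);

    // Let the database compute the set difference.
    const string sql = @"
        SELECT c.Id
        FROM #Candidates AS c
        LEFT JOIN dbo.BigTable AS b ON b.Id = c.Id
        WHERE b.Id IS NULL;";
    using (var cmd = new SqlCommand(sql, conn))
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            missing.Add(reader.GetInt32(0));
}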


Query the table with something like:

SELECT <primarykey> FROM <table> WHERE <primarykey> IN (<the primary keys from your C# list>)

This should be faster than inserting all the values into a table and then using an EXCEPT/MINUS query to find the missing elements, because it does not involve any write operations.

Once you have the list of primary keys that exist in both, pull it back into C# and compare.
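For example, here is a rough sketch of this approach, batched because SQL Server limits a single command to roughly 2,100 parameters, so you cannot parameterize all 100,000 values in one IN list. Again, dbo.BigTable, Id, connectionString and the sample array are placeholders, not names from the question.

// Sketch only: batched, parameterized IN lists plus a set difference in C#.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

string connectionString = "...";
int[] values = { 1, 2, 3 };               // in practice, your ~100,000 elements
var found = new HashSet<int>();
const int batchSize = 1000;               // stay well under the parameter limit

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    for (int offset = 0; offset < values.Length; offset += batchSize)
    {
        int count = Math.Min(batchSize, values.Length - offset);

        // Build "@p0, @p1, ..." for this batch.
        string placeholders = string.Join(", ",
            Enumerable.Range(0, count).Select(i => "@p" + i));

        using (var cmd = new SqlCommand(
            "SELECT Id FROM dbo.BigTable WHERE Id IN (" + placeholders + ");", conn))
        {
            for (int i = 0; i < count; i++)
                cmd.Parameters.AddWithValue("@p" + i, values[offset + i]);

            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    found.Add(reader.GetInt32(0));
        }
    }
}

// Anything the queries did not return is missing from the table.
var missing = values.Where(v => !found.Contains(v)).ToList();

The HashSet keeps the final membership check on the C# side cheap once the matching keys have come back.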


A way to avoid creating temp tables would be to use a stored procedure that accepts a table-valued parameter (TVP) of a user-defined table type (UDTT). The type would have a single column whose data type matches the element type of your array.

If you populate a DataTable (with a schema matching the UDTT) with your array values and supply it as the stored proc's parameter, you can pass all 100,000 items up in their SQL binary format. The proc can then join the 30M-row table against the table-valued parameter and return the items in the TVP that have no match in the master table.

This avoids needing to build massive IN statements.
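As a sketch of what that could look like (all object names here — dbo.IdList, dbo.FindMissingIds, dbo.BigTable, Id — are illustrative, not prescribed): first the one-time SQL setup, then the C# call that passes a DataTable as a structured parameter.

-- One-time setup (illustrative names): a table type and a proc that returns
-- the values from the TVP that have no match in the big table.
CREATE TYPE dbo.IdList AS TABLE (Id int PRIMARY KEY);
GO
CREATE PROCEDURE dbo.FindMissingIds
    @Ids dbo.IdList READONLY
AS
    SELECT i.Id
    FROM @Ids AS i
    LEFT JOIN dbo.BigTable AS b ON b.Id = i.Id
    WHERE b.Id IS NULL;
GO

// C# side (sketch): fill a DataTable matching the type's schema and pass it
// as a structured parameter; connectionString and values are placeholders.
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

string connectionString = "...";
int[] values = { 1, 2, 3 };               // in practice, your ~100,000 elements

var tvp = new DataTable();
tvp.Columns.Add("Id", typeof(int));
foreach (int v in values)
    tvp.Rows.Add(v);

var missing = new List<int>();
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.FindMissingIds", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    var p = cmd.Parameters.AddWithValue("@Ids", tvp);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.IdList";

    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            missing.Add(reader.GetInt32(0));
}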

EDIT: Regarding the comment from @Kyro below

I'm now less confident in this approach. I found an article showing the under-the-covers row-by-row inserts that Kyro describes. Whatever you gain by sending binary data over the network instead of a huge T-SQL WHERE ... IN () statement may well be lost to that overhead on the SQL side. However, it's a fairly simple approach to code, so it might be worth a quick test. Let us know how you get on.

