
Find binary duplicate records SQL Server 2008 (Data type image)

Source: https://www.devze.com, 2023-03-06 17:32 (via the web)
I have inherited a database that has a table (300 GB) full of the SQL image data type. I understand this data type is deprecated.

As a routine cleanup I want to delete all duplicate images from the table where certain conditions are met.

How do I compare binary data using SQL efficiently? Is the = equality operator sufficient?

Here is a scenario:

Table 'Paperwork'
  int ID
  int EmployeeID
  int AnotherID
  int AnotherFKID
  image Attachment

I want to find all rows where the Attachment, EmployeeID, AnotherID and AnotherFKID are the same. It needs to be done with minimal impact on the database, as the table contains 1,116,313 rows.

Edit

The SQL Server Image data type does not support LIKE or the usual comparison operators.

Edit

Thanks to @Martin, who suggested casting the Image to varbinary. I have extended this to get the MD5 checksum using HASHBYTES:

HASHBYTES('MD5', CAST([Attachment] AS varbinary(max))) AS AttachmentMD5
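Putting that together, a duplicate-finding query might be sketched as follows. Note two caveats: the bare CAST(... AS varbinary) in the snippet above would silently truncate to 30 bytes (the default length), so varbinary(max) must be spelled out, and on SQL Server 2008 HASHBYTES only accepts up to 8000 bytes of input, so attachments larger than that need a different approach (such as the staged answer below).

```sql
-- Sketch: find duplicate groups by the three int columns plus an MD5 of
-- the attachment. Assumes every attachment fits within the 8000-byte
-- HASHBYTES input limit of SQL Server 2008.
SELECT  EmployeeID,
        AnotherID,
        AnotherFKID,
        HASHBYTES('MD5', CAST([Attachment] AS varbinary(max))) AS AttachmentMD5,
        COUNT(*) AS DuplicateCount,
        MIN(ID)  AS KeepID          -- one candidate row to keep per group
FROM    Paperwork
GROUP BY EmployeeID, AnotherID, AnotherFKID,
         HASHBYTES('MD5', CAST([Attachment] AS varbinary(max)))
HAVING  COUNT(*) > 1;
```

Be aware that this single query still reads the entire image column, so on a 300 GB table it carries the buffer-cache cost the answer below warns about.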


Jeremy,

Any all-in-one script will flush the buffer cache when it reads in the 300 GB. Break the job down into several tasks.

Task 1

  • create a working table with the Paperwork ID and a GroupID that marks which rows share the same values in the three int columns

Table example

    TableID  PaperWorkID  GroupID
          1           14        1
          2           15        1
          3           21        2
          4           55        2

Now we know PaperWorkIDs 14 and 15 share the same three int columns because they are in the same group.
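Task 1 might be sketched in T-SQL like this. The working-table name dbo.DupWork and its exact shape are illustrative, not part of the original answer; the key point is that only the three int columns are read, leaving the 300 GB image column untouched.

```sql
-- Working table: one row per Paperwork row that shares its three int
-- columns with at least one other row.
CREATE TABLE dbo.DupWork (
    TableID     int IDENTITY(1,1) PRIMARY KEY,
    PaperWorkID int NOT NULL,
    GroupID     int NOT NULL
);

INSERT INTO dbo.DupWork (PaperWorkID, GroupID)
SELECT  p.ID,
        -- same three-column values => same GroupID
        DENSE_RANK() OVER (ORDER BY p.EmployeeID, p.AnotherID, p.AnotherFKID)
FROM    Paperwork AS p
WHERE   EXISTS (SELECT 1                       -- keep only rows with a partner
                FROM   Paperwork AS d
                WHERE  d.EmployeeID  = p.EmployeeID
                AND    d.AnotherID   = p.AnotherID
                AND    d.AnotherFKID = p.AnotherFKID
                AND    d.ID <> p.ID);
```

An index on (EmployeeID, AnotherID, AnotherFKID) would make the self-join far cheaper on a million-row table.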

Task 2

  • add a bigint column to the table and populate it with the DATALENGTH of the Image column in Paperwork, matched on PaperWorkID
  • remove all non-duplicates based on the DATALENGTH and GroupID
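One way Task 2 might look, again using the illustrative dbo.DupWork working table. DATALENGTH is cheap compared with hashing, and two images of different lengths can never be duplicates, so this pass prunes candidates before any image bytes are hashed.

```sql
-- Record each candidate's attachment size.
ALTER TABLE dbo.DupWork ADD AttachmentLength bigint NULL;

UPDATE  w
SET     AttachmentLength = DATALENGTH(p.Attachment)
FROM    dbo.DupWork AS w
JOIN    Paperwork   AS p ON p.ID = w.PaperWorkID;

-- Discard rows whose (GroupID, length) pair occurs only once:
-- differing sizes prove the images cannot match.
DELETE  w
FROM    dbo.DupWork AS w
WHERE   NOT EXISTS (SELECT 1
                    FROM   dbo.DupWork AS d
                    WHERE  d.GroupID          = w.GroupID
                    AND    d.AttachmentLength = w.AttachmentLength
                    AND    d.TableID         <> w.TableID);
```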

Task 3

  • add a varbinary(max) column to the table
  • fill the column with the MD5 hash of the Image column, matched on PaperWorkID
  • remove all non-duplicates from the table based on the MD5 hash and the GroupID
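A sketch of Task 3 against the illustrative dbo.DupWork table. An MD5 digest is 16 bytes, so varbinary(16) suffices for the new column; only the rows that survived the length filter are hashed. Remember the SQL Server 2008 caveat: HASHBYTES rejects input longer than 8000 bytes, so attachments above that size would need chunked hashing or a CLR hash function.

```sql
-- MD5 output is always 16 bytes.
ALTER TABLE dbo.DupWork ADD AttachmentMD5 varbinary(16) NULL;

-- Hash only the remaining candidates (consider batching this UPDATE
-- with TOP (n) in a loop to limit buffer-cache pressure).
UPDATE  w
SET     AttachmentMD5 = HASHBYTES('MD5', CAST(p.Attachment AS varbinary(max)))
FROM    dbo.DupWork AS w
JOIN    Paperwork   AS p ON p.ID = w.PaperWorkID;

-- Drop rows whose hash is unique within their group.
DELETE  w
FROM    dbo.DupWork AS w
WHERE   NOT EXISTS (SELECT 1
                    FROM   dbo.DupWork AS d
                    WHERE  d.GroupID       = w.GroupID
                    AND    d.AttachmentMD5 = w.AttachmentMD5
                    AND    d.TableID      <> w.TableID);
```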

Task 4

  • make two backups of the PaperWork table
  • remove the duplicate records in Paperwork identified by the rows remaining in the working table
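The final deletion could be sketched like this, keeping the lowest PaperWorkID in each duplicate set (the dbo.DupWork name and the keep-the-lowest-ID policy are illustrative; run it only after the backups above are verified):

```sql
-- Every (GroupID, AttachmentMD5) set left in the working table holds
-- two or more true duplicates; keep the lowest ID and delete the rest.
DELETE  p
FROM    Paperwork   AS p
JOIN    dbo.DupWork AS w ON w.PaperWorkID = p.ID
WHERE   w.PaperWorkID NOT IN (SELECT MIN(PaperWorkID)
                              FROM   dbo.DupWork
                              GROUP BY GroupID, AttachmentMD5);
```

Wrapping the DELETE in an explicit transaction, and checking @@ROWCOUNT against the expected duplicate count before committing, is a cheap safety net on top of the backups.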

If the data in the image column was scanned from paper, there is very little chance two scans will produce the exact same image. If the data was uploaded twice, then you are in luck.
