Let's say I've got a SQL Server database table with X (> 1,000,000) records in it that need to be processed (get data, perform external action, update status in db) one-by-one by some worker processes (console apps, Windows services, Azure worker roles, etc.). I need to guarantee each row is only processed once. Ideally, exclusivity would be guaranteed no matter how many machines/processes were spun up to process the messages. I'm mostly worried about two SELECTs grabbing the same rows simultaneously.
I know there are better datastores for queuing out there, but I don't have that luxury for this project. I have ideas for accomplishing this, but I'm looking for more.
I've had this situation.
Add an InProcess column to the table, default = 0. In the consumer process:
UPDATE tbl SET InProcess = @myMachineID WHERE rowID =
    (SELECT MIN(rowID) FROM tbl WHERE InProcess = 0)
Now that machine owns the row, and you can query its data without fear. Usually your next line will be something like this:
SELECT * FROM tbl WHERE rowID =
    (SELECT MAX(rowID) FROM tbl WHERE InProcess = @myMachineID)
You'll also have to add a Done flag of some kind to the row, so you can tell if the row was claimed but processing was incomplete.
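For reference, here's a minimal sketch of the kind of table this assumes (the table name and payload column are hypothetical; rowID, InProcess, and Done are the columns used in the snippets above):

CREATE TABLE tbl (
    rowID     INT IDENTITY(1,1) PRIMARY KEY,
    payload   NVARCHAR(MAX) NULL,       -- whatever data the worker needs
    InProcess INT NOT NULL DEFAULT 0,   -- 0 = unclaimed, otherwise the claiming machine's ID
    Done      BIT NOT NULL DEFAULT 0    -- set once processing completes
)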
Edit
The UPDATE gets an exclusive lock (see MSDN). I'm not sure whether the SELECT in the subquery can be split from the UPDATE; if it can, you'd have to put them in a transaction.
@Will A posts a link which suggests that beginning your batch with this will guarantee it:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
...but I haven't tried it.
@Martin Smith's link also makes some good points, looking at the OUTPUT clause (added in SQL 2005).
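To illustrate (this isn't from the original answer; the table and column names are the hypothetical ones above), one common way to use OUTPUT here is to collapse the claim and the read into a single atomic statement, with UPDLOCK/READPAST hints so competing workers skip each other's locked rows:

DECLARE @myMachineID INT
SET @myMachineID = 42   -- this worker's ID, as in the snippets above

;WITH nextRow AS (
    SELECT TOP (1) rowID, payload, InProcess
    FROM tbl WITH (ROWLOCK, UPDLOCK, READPAST)   -- UPDLOCK serializes claimers, READPAST skips locked rows
    WHERE InProcess = 0
    ORDER BY rowID
)
UPDATE nextRow
SET InProcess = @myMachineID
OUTPUT inserted.rowID, inserted.payload          -- the claimed row comes straight back to the caller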
One last edit
Very interesting exchange in the comments, I definitely learned a few things here. And that's what SO is for, right?
Just for color: when I used this approach back in 2004, I had a bunch of web crawlers dumping URLs-to-search into a table, then pulling their next URL-to-crawl from that same table. Since the crawlers were attempting to attract malware, they were liable to crash at any moment.
I'd consider having the process fetch the top N records whose "processed" flag is zero into a local collection. I would actually have three values for the processed flag: NotProcessed (0), Processing (2), and Processed (1). Then loop through your collection and issue the following SQL:
update table_of_records_to_process
set processed = 2
where record_id = 123456
and processed = 0
...that way, if some other process has already grabbed that record ID, your UPDATE will not set the processed field to 2. You'll want to verify that record ID 123456 is truly set to 2:
select count(*)
from table_of_records_to_process
where record_id = 123456
and processed = 2
...then you can process that one. If the count returned is zero, then you'll move on to the next record in your collection and try again. If you get to the end of your collection and some other process already modified all those records, go fetch N more records.
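A rough sketch of that guarded claim as one batch (table and column names follow the snippets above; checking @@ROWCOUNT right after the UPDATE is a slight tightening of the verification step, since it tells you whether your own statement claimed the row rather than whether anyone did):

DECLARE @record_id INT
SET @record_id = 123456   -- next ID from the locally fetched collection

UPDATE table_of_records_to_process
SET processed = 2                     -- 2 = Processing
WHERE record_id = @record_id
  AND processed = 0                   -- 0 = NotProcessed; affects zero rows if someone else got there first

IF @@ROWCOUNT = 1
BEGIN
    -- This worker owns the record: do the external work here, then mark it finished.
    UPDATE table_of_records_to_process
    SET processed = 1                 -- 1 = Processed
    WHERE record_id = @record_id
END
-- Otherwise another process claimed it; move on to the next ID in the collection.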