Synchronizing Database Access in a Distributed App_问答_开发者

A common bit of programming logic I find myself implementing often is something like the following pseudo-code:

Let X = some value
Let Database = some external Database handle

if !Database.contains(X):
   SomeCalculation()
   Database.insert(X)

However, in a multi-threaded program we have a race condition here. Thread A might check if X is in Database, find that it's not, and then proceed to call SomeCalculation(). Meanwhile, Thread B will also check if X is in Database, find that it's not, and insert a duplicate entry.

So of course, this needs to be synchronized like:

Let X = some value
Let Database = some external Database handle

LockMutex()
if !Database.contains(X):
   SomeCalculation()
   Database.insert(X)
UnlockMutex()

This is fine, except what if the application is a distributed app, running across multiple computers, all of which communicate with the same back-end database machine? In this case, a Mutex is useless, because it only synchronizes a single instance of the app with other local threads. To make this work, we'd need some kind of "global" distributed synchronization technique. (Assume that simply disallowing duplicates in Database is not a feasible strategy.)

In general, what are some practical solutions to this problem?

I realize this question is very generic, but I don't want to make this a language-specific question because this is an issue that comes up across multiple languages and multiple Database technologies. 开发者_开发问答

I intentionally avoided specifying whether I'm talking about an RDBMS or SQL Database, versus something like a NoSQL Database, because again - I'm looking for generalized answers based on industry practices. For example, is this situation something that Atomic Stored Procedures might solve? Or Atomic Transactions? Or is this something that requires something like a "Distributed Mutex"? Or more generally, is this problem generally addressed by the Database system, or is it something the Application itself should handle?

If it turns out this question is impossible to answer at all without further information, please tell me so I can modify it.

One sure way to ensure against data stomping is to lock the data row. Many databases allow you to do that, via transactions. Some don't support transactions.

However, this is overkill for most cases, where contention is low in general. You might want to read up on Isolation levels to get more background on the topic.

A better general approach is often Optimistic Concurrency. The idea behind it is that each data row includes a signature, a timestamp works fine but the signature need not be time oriented. It could be a hash value, for example. This is a general concurrency management approach and is not limited to relational stores.

The app that changes data first reads the row, and then performs whatever calculations it requires, and then at some point, writes the updated data back to the data store. Via Optimistic concurrency, the app writes the update with the stipulation (expressed in SQL if it is a SQL database) that the data row must be updated only if the signature has not changed in the interim. And, each time a data row is updated, the signature must be updated as well.

The result is that updates don't get stomped on. But for a more rigorous explanation of the concurrency issues, refer to that article on DB Isolation levels.

All distributed updaters must follow the OCC convention (or something stronger, like transactional locking) in order for this to work.

You can obviously move the "synch" part to the DB layer itself, using an exclusive lock on a specific resource.

This is a bit extreme (in most cases, attempting the insert and managing the exception when you actually discover that someone already inserted the row) would be more adequate, I think.

Well, since you ask an general question, I will try to provide another option. Its not very orthodox, but may be useful: You could "define" a machine or a process responsible for doing that. For example:

Let X = some value
Let Database = some external Database handle

xResposible = Definer.defineResponsibleFor(X);

if(xResposible == me)
    if !Database.contains(X):
       SomeCalculation()
       Database.insert(X)

The trick here is to make defineResponsibleFor always return the same value independent of who is calling. So, if you have a fair distributed range of X and a fair Definer all machines will have work to do. And you could use simple threading mutex to avoid race conditions. Of course, now you have to take care of fail-tolerance (if a machine or process is out of business your Definer must know and not define any job for it). But you should do this anyawy... :)