Best way to handle concurrency issues_问答_开发者

i have a LAPP (linux, apache, postgresql and php) environment, but the question is pretty the same both on Postgres or Mysql.

I have an cms app i developed, that handle clients, documents (estimates, invoices, etc..) and other data, structured in 1 postgres DB with many schemas (one for each our customer using the app); let's assume around 200 schemas, each of them used concurrently by 15 people (avg).

EDIT: I do have an timestamp field named last_update on every table, and a trigger that update the timestamp every time the row is update.

The situation is:

People Foo and Bar are editing the document 0001, using a form with every document details.
Foo change the shipment details, for example.
Bar change the phone numbers, and some items in the document.
Foo press the 'Save' button, the app update the db.
Bar press the 'Save' button after bar, resending the form with the old shipment details.
In the database, the Foo changes have been lost.

The situation i want to have:

People Foo, Bar, John, Mary, Paoul are editing the document 0001, using a form with every document details.
Foo change the shipment details, for example.
Bar and the others change something else.
Foo press the 'Save' button, the app update the db.
Bar and the others get an alert 'Warning! this document has been changet by someone else. Click here to load the actuals data'.

I've wondered to use ajax to do this; simply using an hidden field with the id of the document and the last-updated timestamp, every 5 seconds check if the last-updated time is the same and do nothing, else, show the alert dialog box.

So, the page check-last-update.php should look something like:

<?php
//[connect to db, postgres or mysql]
$documentId = isset($_POST['document-id']) ? $_POST['document-id'] : 0;
$lastUpdateTime = isset($_POST['last-update-time']) ? $_POST['last-update-time'] : 0;
//in the real life i sanitize the data and use prepared statements;
$qr = pg_query("
    SELECT
        last_update_time
    FRO开发者_开发百科M
        documents
    WHERE
        id = '$documentId'
");
$ray = pg_fetch_assoc($qr);
if($ray['last_update_time'] > $lastUpdateTime){
    //someone else updated the document since i opened it!
    echo 'reload';
}else{
    echo 'ok';
}
?>

But i dont like to stress the db every 5 seconds for every user that have one (or more...) documents opened.

So, what can be another efficent solution without nuking the db?

I thought to use files, creating for example an empty txt file for each document, and everytime the document is updated, i 'touch' the file updating the 'last modified time' as well... but i guess that this would be slower than db and give problems when i have much users editing the same document.

If someone else have a better idea or any suggestion, please describe it in details!

* - - - - - UPDATE - - - - - *

I definitely choosen to NOT hit the db for check the 'last update timestamp', dont mind if the query will be pretty fast, the (main) database server has other tasks to fullfill, dont like the idea to increase his overload for that thing.

So, im taking this way:

Every time a document is updated by someone, i must do something to sign the new timestamp outside the db environment, e.g. without asking the db. My ideas are:
1. File-system: for each document i create an empry txt files named as the id of the document, everytime the document is update, i 'touch' the file. Im expecting to have thousands of those empty files.
2. APC, php cache: this will be probably a more flexible way than the first one, but im wondering if keeping thousands and thousands of data permanently in the apc wont slow down the php execution itself, or consume the server memory. Im little bit afraid to choose this way.
3. Another db, sqlite or mysql (that are faster and lighter with simple db structures) used to store just the documents ID and timestamps.
Whatever way i choose (files, apc, sub-db) im seriously thinking to use another web-server (lighttp?) on a sub-domain, to handle all those.. long-polling requests.

YET ANOTHER EDIT:

The file's way wouldnt work.

APC can be the solution.

Hitting the DB can be the solution too, creating a table just to handle the timestamps (with only two column, document_id and last_update_timestamp) that need to be as fast and light as possible.

Long polling: that's the way i'll choose, using lighttpd under apache to load static files (images, css, js, etc..), and just for this type of long-polling; This will lighten the apache2 load, specially for the polling.

Apache will proxy-up all those request to lighttpd.

Now, i only have to decide between db solution and APC solution..

p.s: thanks to all whom already answered me, you have been really usefull!

I agree that I probably wouldn't hit the database for this. I suppose I would use APC cache (or some other in-memory cache) to maintain this information. What you are describing is clearly optimistic locking at the detailed record level. The higher the level in the database structure the less you need to deal with. It sounds like you want to check with multiple tables within a structure.

I would maintain a cache (in APC) of the IDs and the timestamps of the last updated time keyed by the table name. So for example I might have an array of table names where each entry is keyed by ID and the actual value is the last updated timestamp. There are probably many ways to set this up with arrays or other structures but you get the idea. I would probably add a timeout to the cache so that entries in the cache are removed after a certain period of time - i.e., I wouldn't want the cache to grow and assume that 1 day old entries aren't useful anymore).

With this architecture you would need to do the following (in addition to setting up APC):

on any update to any (applicable) table, update the APC cache entry with the new timestamp.
within ajax just go as far "back" as php (to obtain the APC cache to check the entry) rather than all of the way "back" to the database.

I think you can use a condition in the UPDATE statement like WHERE ID=? AND LAST_UPDATE=?.

The idea is that you will only succeed in updating when you are the last one reading that row. If someone else has committed something, you will fail, and once you know you've failed, you can query the changes.

Hibernate uses a version field to do that. Give every table such a field and use a trigger to increment it on every update. When storing an update, compare the current version with the version when the data was read earlier. If those don't match, throw an exception. Use transactions to make the check-and-update atomic.

You will need some type of version stamp field for each record. What it is doesn't matter as long as you can guarantee that making any change to a record will result in that version stamp being different. Best practice is to then check and make sure the loaded record's version stamp is the same as the version stamp in the DB when the user clicks save, and if it's different handle it.

How you handle it is up to you. At the very least you'd want to offer to reload from the DB so the user can verify that they still want to save. One up from that would be to attempt to merge their changes into the new DB record and then ask them to verify that the merge worked correctly.

If you want to periodically poll any DB capable of handling your system should be able to take the poll load. 10 users polling once every 5 seconds is 2 transactions per second. This is a trivial load, and should be no problem at all. To keep the average load close to the actual load, just jitter the polling time slightly (instead of doing it exactly every 5 seconds, do it every 4-6 seconds, for example).

Donnie's answer (polling) is probably your best option - simple and works. It'll cover almost every case (its unlikely a simple PK lookup would hurt performance, even on a very popular site).

For completeness, and if you wanted to avoid polling, you can use a push-model. There's various ways described in the Wikipedia article. If you can maintain a write-through cache (everytime you update the record, you update the cache), then you can almost completely eliminate the database load.

Don't use a timestamp "last_updated" column, though. Edits within the same second aren't unheard of. You could get away with it if you add extra information (server that did the update, remote address, port, etc) to ensure that, if two requests came in at the same second, to the same server, you could detect the difference. If you need that precision, though, you might as well use a unique revision field (it doesn't necessarily have to be an incrementing integer, just unique within that record's lifespan).

Someone mentioned persistent connections - this would reduce the setup cost of the polling queries (every connection consumes resources on the database and host machine, naturally). You would keep a single connection (or as few as possible) open all the time (or as long as possible) and use that (in combination with caching and memoization, if desired).

Finally, there are SQL statements that allow you to add a condition on UPDATE or INSERT. My SQl is really rusting, but I think its something like UPDATE ... WHERE .... To match this level of protection, you would have to do your own row locking prior to sending the update (and all the error handling and cleanup that might entail). Its unlikely you'd need this; I'm just mentioning it for completness.

Edit:

Your solution sounds fine (cache timestamps, proxy polling requests to a another server). The only change I'd make is to update the cached timestamps on every save. This will keep the cache fresher. I'd also check the timestamp directly from the db when saving to prevent a save sneaking in due to stale cache data.

If you use APC for caching, then a second HTTP server doesn't make sense - you'd have to run it on the same machine (APC uses shared memory). The same physical machine would be doing the work, but with the additional overhead of a second HTTP server. If you want to off load the polling requests to a second server (lighttpd, in your case), then it would be better to setup lightttpd in front of Apache on a second physical machine and use a shared caching server (memcache) so that the lighttpd server can read the cached timestamps, and Apache can update the cached timestamps. The rationale for putting lighttpd in front of Apache is, if most requests are polling requests, to avoid the heavier-weight Apache process usage.

You probably don't need a second server at all, really. Apache should be able to handle the additional requests. If it can't, then I'd revisit your configuration (specifically the directives that control how many worker processes you run and how many requests they are allowed to handle before being killed).

Your approach of querying the database is the best one. If you do it every 5 seconds and you have 15 concurrent users then you're looking at ~3 queries a second. It should be a very small query too, returning only one row of data. If your database can't handle 3 transactions a second then you might have to look at a better database because 3 queries/second is nothing.

Timestamp the records in the table so you can quickly see if anything has changed without having to diff each field.

This is slightly off topic, but you can use the PEAR package (or PECL package, I forget which) xdiff to send back good user guidance when you do get a collision.

First off only update the fields that have changed on when writing to the database, this will decrease database load.

Second, query the timestamp of the last update, if you have a older timestamp then the current version in the database then throw the warning to the client.

Third is to somehow push this information to the client, though some kind of persistent connection with the server, enabling a concurrent two way connection.

Polling is rarely a nice solution.
You could do the timstamp check only when the user (with the open document) is doing something active with the document like scrolling, moving the mouse over it or starts to edit. Then the user gets an alert if the document has been changed.

.....
I know it was not what you asked for but ... why not a edit-singleton?
The singleton could be a userID column in the document-table.
If a user wants to edit the document, the document is locked for edit by other users.

Or have edit-singletons on the individual fields/groups of information.

Only one user can edit the document at a time. If another user has the document open and want to edit a single timestamp check reveal that the document has been altered and is reloaded.

With a singleton there is no polling and only one timestamp check when the user "touches" and/or wants to edit the document.

But perhaps a singleton mechanism doesn't fit your system.

Regards
Sigersted

Ahhh, i though it was easyer.

So, lets make the point: i have a generic database (pgsql or mysql doesn't matter), that contains many generic objects.

I have $x (actually $x = 200, but is growing, hoping will reach 1000 soon) of exact copy of this database, and for each of them up to 20 (avg 10) users for 9 hours at day.

If one of those users is viewing a record, any record, i must advice him if someone edit the same record.

Let's say Foo is watching the document 0001, sit up for a coffee, Bar open and edit the same document, when Foo come back he must see an 'Warning, someone else edited this document! click here to refresh tha page.'.

That'all i need atm, probably i'll extend this situation, adding a way to see the changes and rollback, but this is not the point.

Some of you suggested to check the 'last update' timestamp only when foo try to save the document; Can be a solution too, but i need something in real-time ( 10 sec deelay ).

Long polling, bad way, but seem to be the only one.

So, what i've done:

Installed Lighttp on my machine (and php5 as fastcgi);
Loaded apache2's proxy module (all, or 403 error will hit you);
Changed the lighttpd port from 80 (that is used by apache2) to 81;
Configured apache2 to proxying the request from mydomain.com/polling/* to polling.mydomain.com (served with Lighttp)
Now, i have another sub http-service that i'll use both for polling and load static content (images, etc..), in order to reduce the apache2's load.
Becose i dont want to nuke the database for the timestamp check, i've tryed some caches system (that can be called from php).
1. APC: quite simple to install and manage, very lightweight and faster, this would be my first choice.. if only the cache would be sharable between two cgi process (i need to store in cache a value from apache2's php process, and read it from lighttpd's php process)
2. Memcached: around 4-5 times slower than APC, but run as a single process that can be touched everywhere in my environment. I'll go with this one, atm. (even if is slower, the use i'll do of it is relatively simple).

Now, i just have to try this system loading some test datas to see ho will move 'under pressure' and optimize it.

I suppost this environment will work for other long-polling situations (chat?)

Thanks to everyone who gave me hear!

I suggest: when you first query the record that might be changed, hang onto a local copy. When "updating", compare the copy in the locked table/row against your copy, and if it's changed, kick it back to the user.