Patterns for Schema Changes in Document Databases_问答_开发者

befo开发者_Python百科re I start I'd like to apologize for the rather generic type of my questions - I am sure a whole book could be written on that particular topic.

Lets assume you have a big document database with multiple document schemas and millions of documents for each of these schemas. During the life time of the application the need arises to change the schema (and content) of the already stored documents frequently.

Such changes could be

adding new fields
recalculating field values (split Gross into Net and VAT)
drop fields
move fields into an embedded document

I my last project where we used a SQL DB we had some very similar challanges which resulted in some significant offline time (for a 24/7 product) when the changes became to drastic as SQL DBs usually do a LOCK on a table when changes occur. I want to avoid such a scenario.

Another related question is how to handle schema changes from within the used programming language environment. Usually schema changes happen by changing the Class definition (I will be using Mongoid a OR-Mapper for MongoDB and Ruby). How do I handle old versions of documents that do not conform any more to my latest Class definition.

That is a very good question.

The good part of document oriented databases as MongoDB is that documents from the same collection doesn't need to have the same fields. Having different fields do not raise an error, per se. It's called flexibility. It also a bad part, for the same reasons.

So the problem and also the solution comes from the logic of your application.

Let say we have a model Person and we want to add a field. Currently in the database we have 5.000.000 people saved. The problem is: How do we add that field and have the less downtime?

Possible solution:

Change the logic of the application so that it can cope with both a person with that field and a person without that field.
Write a task that add that field to each person in the database.
Update the production deployment with the new logic.
Run the script.

So the only downtime is the few seconds that it takes to redeploy. Nonetheless, we need to spend time with the logic.

So basically we need to choose which is more valuable the uptime or our time.

Now let say we want to recalculate a field such as the VAT value. We can not do the same as before, because having some products with VAT A and other with VAT B doesn't make sense.

So, a possible solution would be:

Change the logic of the application so that it shows that the VAT values are being updated and disable the operations that could use it, such as buys.
Write the script to update all the VAT values.
Redeploy with the new code.
Run the script. When it finish:
Redeploy with the full operation code.

So there is not absolute downtime, but just partial shutdown of some specifics part. The user could keep seeing the description of products and using the other parts of the application.

Now let say, that we want to drop a field. The process would be pretty much the same as the first one.

Now, moving fields into embed documents; that's is a good one! The process would be similar to the first one. But instead of checking the existence of the field we need to check if it is a embedded document or a field.

The conclusion is that with a document oriented database you have a lot of flexibility. And so you have elegant options at your hands. Whether you use it or not depends or whether you value more you development time or your client's time.