Is MongoDB appropriate for sites like StackOverflow?
Put simply: Yes, it could be.
Let's break down the various pages/features and see how they could be stored/reproduced in MongoDB.
All the information on this page could be stored in a single document in the questions collection. This could include "sub-documents" for each answer to keep the retrieval of this page fast.
Edit: as @beagleguy pointed out, you could hit the document size limit of 4MB quite quickly this way, so it would be better to store answers in separate documents and link them to the question by storing the ObjectIDs in an array.
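To make the linked layout concrete, here is a minimal sketch in plain Python. It only models the document shapes with dicts and does not talk to MongoDB; the names questions, answers, answer_ids and the integer ids standing in for ObjectIDs are all assumptions for illustration.

```python
from itertools import count

# In-memory stand-ins for two collections, keyed by _id (hypothetical names).
questions = {}
answers = {}
_next_id = count(1)  # stands in for ObjectID generation


def add_question(title):
    qid = next(_next_id)
    questions[qid] = {"_id": qid, "title": title, "answer_ids": []}
    return qid


def add_answer(question_id, body):
    aid = next(_next_id)
    answers[aid] = {"_id": aid, "question_id": question_id, "body": body}
    # The question only stores the link, so the question document stays small.
    questions[question_id]["answer_ids"].append(aid)
    return aid


def load_page(question_id):
    question = questions[question_id]
    # Second lookup: fetch the linked answers by id, in stored order.
    return question, [answers[aid] for aid in question["answer_ids"]]
```

The cost compared with embedding is the second lookup in load_page; the gain is that a question with many long answers no longer grows towards the per-document size limit.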
The votes could be stored in a separate collection, with simple links to the question and to the user who voted. A db.eval() call could be executed to increment/decrement the vote count directly in the document when a vote is added (though it blocks, so it wouldn't be very performant), or a MapReduce call could be made regularly to offset that work. It could work the same way for favourites.
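As a rough illustration of the periodic recount, the sketch below sums vote documents per question in plain Python. It mimics the shape of a map/reduce job rather than calling MongoDB, and the field names (question_id, user_id, value, score) are assumptions.

```python
from collections import Counter

# Hypothetical vote documents: each links a user to a question.
votes = [
    {"question_id": 1, "user_id": "alice", "value": 1},
    {"question_id": 1, "user_id": "bob", "value": 1},
    {"question_id": 1, "user_id": "carol", "value": -1},
    {"question_id": 2, "user_id": "alice", "value": -1},
]


def tally_votes(vote_docs):
    """Sum vote values per question, like a map step emitting
    (question_id, value) followed by a reduce step that adds them up."""
    totals = Counter()
    for vote in vote_docs:
        totals[vote["question_id"]] += vote["value"]
    return dict(totals)


def refresh_scores(question_docs, vote_docs):
    """Write the recomputed totals back onto the question documents."""
    totals = tally_votes(vote_docs)
    for question in question_docs:
        question["score"] = totals.get(question["_id"], 0)
```

Running refresh_scores on a schedule keeps the per-vote write cheap, at the price of scores that lag slightly behind the votes collection.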
Things like the "viewed" numbers, logging users' access times, etc. would generally be handled using a modifier operation to increment a counter. Since v1.3 there is a new "Find and Modify" command which can issue an update when retrieving the document, saving you an extra call.
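The idea of updating and fetching in one step can be sketched like this. It is a plain-Python model over a dict, not the real command, and the function name, the views field and the return_new flag are hypothetical.

```python
import copy


def find_and_modify(collection, doc_id, field, amount=1, return_new=False):
    """Increment a counter and hand back the document in one call.

    `collection` is a plain dict keyed by _id; `return_new` is a
    hypothetical flag choosing the pre- or post-update version.
    """
    doc = collection[doc_id]
    before = copy.deepcopy(doc)
    # A missing counter is treated as zero before incrementing.
    doc[field] = doc.get(field, 0) + amount
    return copy.deepcopy(doc) if return_new else before
```

For a page view you would call it once per request, e.g. find_and_modify(questions, qid, "views", return_new=True), and render the returned document.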
Any sort of statistical data (such as reputation, badges, unique tags) could be collected using MapReduce and pushed to specific collections. Things like notifications could be pushed to another collection acting as a job queue, with a number of workers listening for new items in the queue (think badge notifications, new answers since user's last access time, etc).
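The job-queue part might look roughly like the following sketch. It uses Python's standard queue and threading modules in place of a MongoDB collection, and the workers drain a fixed batch and exit rather than listening forever; run_workers and the job fields are assumptions.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=2):
    """Push jobs onto a queue and let several workers process them."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker stops
            outcome = handler(job)
            with lock:  # protect the shared results list
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each job is taken by exactly one worker, which is the property you would also need from a queue backed by a collection; the order in which results arrive is not guaranteed.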
The Questions page and its filters could all be handled with capped collections rather than querying for that data immediately.
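The behaviour being relied on here is that a capped collection keeps insertion order and discards the oldest entries once it is full. A small plain-Python model of that, capped by document count for simplicity (real capped collections are sized in bytes), could be:

```python
from collections import deque


class CappedList:
    """Fixed-size, insertion-ordered store; a hypothetical stand-in
    for a capped collection holding the newest questions."""

    def __init__(self, max_docs):
        # deque drops items from the left once maxlen is reached.
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        self._docs.append(doc)

    def newest_first(self):
        return list(reversed(self._docs))
```

The front page then just reads newest_first() instead of sorting the whole questions collection on every request.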
Ultimately, YMMV. As with all tools, there are advantages and costs. There are some SO features which would take a lot of work in an RDBMS but could be handled quite simply in Mongo, and vice-versa.
I think the main advantage of Mongo over RDBMSs is the schema-less approach and replication. Changing the schema regularly in a "live" RDBMS-based app can be painful, even impossible if it's heavily used with large amounts of data - those types of ops can lock the tables for far too long. In Mongo, adding new fields is trivial since you may not need to add them to every document. If you do, it's a relatively quick operation to run a map/reduce to update documents.
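Both options (tolerating missing fields at read time, or backfilling them) can be sketched in plain Python over dicts. The field names favourite_count and tags, and both helper functions, are made up for the example.

```python
def read_question(doc):
    """Return a view of the document with defaults for newer fields,
    so older documents without them still render."""
    defaults = {"favourite_count": 0, "tags": []}
    # Values present in the stored document win over the defaults.
    return {**defaults, **doc}


def backfill(docs, field, default):
    """Add `field` to every document that lacks it; returns how many
    documents were changed. Intended for immutable defaults."""
    updated = 0
    for doc in docs:
        if field not in doc:
            doc[field] = default
            updated += 1
    return updated
```

The first function is the "no migration at all" route; the second is the one-off pass you would run if every document really does need the new field.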
As for replication, Mongo has the advantage that the DB doesn't need to be paused to take a snapshot for slaves. Many RDBMSs can't set up replication without this approach, which on large DBs can take the master down for a long time (I'm looking at you, MySQL!). This can be a blessing for StackOverflow-type sites, where you need to scale over time - no taking the master down every time you need to add a node.
I think it is.
You can store the question itself, the answers and the comments on the question + answers as one mongo-document. The max doc size is 4 MB, so no document on stackoverflow will be too big for mongo. I've downloaded the content of stackoverflow (data dump) with BitTorrent and I've been able to import this content into mongo.
Importing this data into mongo is not trivial because the dump of stackoverflow consists of multiple xml files and each xml file matches one relational table, so you have to recombine this data into document format.
I've also added the display name + reputation of the OP + answerers + commenters to this document. This does mean that if a user changes his/her display name, you have to update all the documents with his/her userid. There is a price to pay if you denormalize your data. The same applies if the reputation of a user changes.
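To show what that price looks like, here is a plain-Python sketch of the rename pass over denormalized question documents. The document shape (an author sub-document with user_id and display_name on the question, on each answer and on each comment) is an assumption, not the layout of my actual import.

```python
def rename_user(question_docs, user_id, new_name):
    """Rewrite the embedded display name wherever the user appears.

    Returns the number of documents that had to be touched.
    """
    touched_docs = 0
    for doc in question_docs:
        touched = False
        # The question itself and each embedded answer are "posts".
        for post in [doc] + doc.get("answers", []):
            # Each post has an author and may carry embedded comments.
            for entry in [post] + post.get("comments", []):
                author = entry.get("author", {})
                if author.get("user_id") == user_id:
                    author["display_name"] = new_name
                    touched = True
        if touched:
            touched_docs += 1
    return touched_docs
```

A single rename has to visit every document the user ever posted or commented in, which is the trade-off against getting the whole page in one read.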
The idea is that all the data that you see on a page like this is contained in one mongo-document. You have all the necessary information with one lookup and no joins.
Here you can download the data dump of stackoverflow: https://blog.stackoverflow.com/category/cc-wiki-dump/
I would say no, it's not a great fit. The more complicated your objects get, the more an object/document database makes sense, but if you look at SO, most of it isn't complicated object relationships.
There's a questions table, with however many properties, then a collection of answers... but all these need to be accessed independently depending on which view you're coming from, e.g. your activity screen or the question/answer screens. Since you're accessing it from so many angles and each piece is comparatively simple, a relational model works better.
There are queries running in the background for badges and such, and you need to quickly check whether you're hitting reputation caps for votes... a lot of relational queries that are simpler in an RDBMS given the complexity of the object model.
This is of course my opinion; maybe SO's structure is way more complicated than it appears to be.
With an RDBMS for the OLTP side of your application and proper caching, it should work gracefully.
Actually - there's an open source stackoverflow clone that uses RoR & MongoDB. :)
You can also use $inc for vote tracking (with a negative value to decrement, since there is no $dec operator), so there's no need to use db.eval.
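A rough model of what such an increment does to a single document, written in plain Python rather than against the database; apply_inc and the score field are hypothetical names.

```python
def apply_inc(doc, increments):
    """Model of an update like {"$inc": {"score": -1}} on one document.

    A field that is missing is created with the increment as its value.
    """
    for field, amount in increments.items():
        doc[field] = doc.get(field, 0) + amount
    return doc
```

An upvote is then apply_inc(question, {"score": 1}) and a downvote is apply_inc(question, {"score": -1}); in the database the equivalent update runs on the server, so no read-modify-write round trip is needed.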
I think it would be a good fit. There are a lot of reasons to use non-relational databases like MongoDB on sites that function similarly to StackOverflow. Think about how RDBMSs store data to disk, and keep filesystem block size and similar disk attributes in mind when planning your layout. I like taking advantage of documents that span multiple filesystem blocks and store lots of related information within themselves, nice and flattened. I find that the storage is less spread out, and a single block can be written containing a LOT of information where multiple blocks would be written using other solutions.
For me, MongoDB is really great for any website that doesn't need transactions.