DB design and optimization considerations for a social application_问答_开发者

The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).

Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.

I have come to the following dilemma here: i can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number, whose uploaders I am following / for each person I follow, select his photos )

An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.

The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n meaning the number of people I follow, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).

Yet, a third approach that comes to mind, is even a less denormalized s开发者_StackOverflow社区chema where the server side application will take some part of the complexity off the DB. I saw that some social apps such as friendfeed, heavily rely on the storage of serialized objects such as JSON objects in the DB.

I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.

If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.

The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).

If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.

Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs. If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup. Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.

Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your times worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.

There are many options you can take

Add more hardware, Memory, CPU -- Enter cloud hosting
Hows 24GB of memory sound? Most of your importantly accessed DB information can fit just in memory.
Choose a host with expandable SSDs.
Use an events based system in your application to write the "history" of all users. So it will be like so: id, user_id, event_name, date, event_parameters' -- an example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture> and most important of all, this table will be in memory. No longer need to worry about write performance. After the records go past i.e. 3 days they can be purged into another table (in non-memory) and included into the query results, if the user chooses to go back that far. By having all this in one table you remove having to do multiple queries and SELECTs to build up this information.
Consider using INNODB for the history/feeds table.

Good Resources to read

Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook

I would probably start with using a normalized schema so that you can write quickly and compactly. Then use non transactional (no locking) reads to pull the information back out making sure to use a cursor so that you can process the results as they're coming back as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particular critical implications you don't really need to worry about a lock of the concerns that would normally push you away from transactional reads.

These kind of problems are why currently NOSql solutions used these days. What I did in my previos projecs is really simple. I don't keep user->wall user->history which contains purely feed'ids in memory stores(my favorite is redis). so in every insert I do 1 insert operation on database and (n*read optimization) insert operation in memory store. I design memory store to optimize my reads. if I want to filter user history (or wall) for videos I put a push feedid to a list like user::{userid}::wall::videos.

Well ofcourse you can purely build the system in memstores aswell but its nice to have 2 systems doing what they are doing the best.

edit : checkout these applications to get an idea:

http://retwis.antirez.com/

http://twissandra.com/

I'm reading more and more about NoSQL solutions and people suggesting them, however no one ever mentions drawbacks of such choice. Most obvious for me is lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).

But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.

In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in similar manner - by sharding data across network (naturally, there are more choices but this is the most obvious one). Since NoSQL does less work (no SQL layer so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.

As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.