This is just a question about large-scale database design. For instance, if you were going to build a database that would hold 10 million users, how would you architect it?
My main curiosity is database replication: does it actually speed anything up?
When building a db of this size, say the fields are "username", "name", "company", "dob", and "gender", what else should be considered at that scale beyond making one table? Indexes?
10 million is not particularly huge, but it's large enough that you should consider your options carefully.
Replication can help - a lot. Assuming that you're reading your users table a lot more than you're writing to it, you might consider a master database which handles only writes. Any reads your application does will come from one of N slave boxes.
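For example, here's a minimal Python sketch of application-level read/write splitting; the DSN strings, hostnames, and the rule for deciding what counts as a write are assumptions for illustration, not anything prescribed by a particular driver:

```python
import random

# A sketch of application-level read/write splitting. The DSN strings and
# hostnames below are assumptions for illustration only.
MASTER_DSN = "postgresql://master.db.internal/users"
REPLICA_DSNS = [
    "postgresql://replica1.db.internal/users",
    "postgresql://replica2.db.internal/users",
]

def dsn_for(query: str) -> str:
    """Route writes to the master; spread reads across the slaves."""
    verb = query.lstrip().split(None, 1)[0].upper()
    is_write = verb in ("INSERT", "UPDATE", "DELETE", "REPLACE")
    return MASTER_DSN if is_write else random.choice(REPLICA_DSNS)

print(dsn_for("SELECT name FROM users WHERE username = 'bob'"))   # a replica
print(dsn_for("UPDATE users SET company = 'Acme' WHERE id = 42"))  # the master
```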
Indexes are massively important, of course. You'll want indexes on any columns that are frequently searched -- either in WHERE clauses, or as the result of relationships with other tables (read: JOINs). Much has been written about how to analyse the sorts of queries your application makes, and how to smartly define indexes based on that analysis. If you're just learning about that stuff, go do some reading, and come back to SO with more focused questions.
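To make that concrete, here's a small sketch using Python's built-in sqlite3 module (the table and index names are made up for this example); EXPLAIN QUERY PLAN lets you check that a lookup actually uses the index instead of scanning the whole table:

```python
import sqlite3

# In-memory SQLite stands in for whatever RDBMS you actually use; the
# table and index names are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        name TEXT, company TEXT, dob TEXT, gender TEXT
    )
""")

# Index the columns your WHERE clauses and JOINs actually touch.
conn.execute("CREATE UNIQUE INDEX idx_users_username ON users (username)")
conn.execute("CREATE INDEX idx_users_company ON users (company)")

# The plan should report a SEARCH using idx_users_username, not a full SCAN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE username = ?", ("alice",)
).fetchall()
print(plan)
```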
Beyond single-master replication (and careful indexing), as you start to get really big, you might start thinking about partitioning -- but that's something I've only ever read about, so I don't want to say too much about it.
As always, this depends on the use cases: what queries are you going to run on the database?
Some applications only retrieve users by username or uid; for that, a key-value store is perfect and infinitely scalable.
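As a sketch of that approach, assuming a Redis server on localhost and a made-up "user:&lt;username&gt;" key scheme, a login lookup becomes a single GET:

```python
import json
import redis  # pip install redis; assumes a Redis server on localhost

# The "user:<username>" key scheme is an assumption for this sketch.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_user(username, profile):
    r.set("user:" + username, json.dumps(profile))

def get_user(username):
    raw = r.get("user:" + username)
    return json.loads(raw) if raw is not None else None

save_user("alice", {"name": "Alice", "company": "Acme", "dob": "1990-01-01"})
print(get_user("alice"))  # one O(1) key lookup, no query planner involved
```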
If you have additional search queries, then you can put the data into SQL (with indexes on the appropriate columns) or use an external full-text search engine (Lucene, Sphinx). You can also build different indexes on different replicas so each of them can serve specific queries while still achieving good insert performance (of course not for the users table itself, but for user-related data).
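Dedicated engines like Lucene or Sphinx have their own APIs, but as a self-contained stand-in, here's the full-text idea sketched with SQLite's FTS5 module (it requires an SQLite build with FTS5 enabled, which most modern ones are): a MATCH query consults an inverted index rather than scanning every row.

```python
import sqlite3

# SQLite's FTS5 module stands in here for an external engine like Lucene
# or Sphinx; it requires an SQLite build with FTS5 enabled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE users_fts USING fts5(username, name, company)")
conn.executemany(
    "INSERT INTO users_fts (username, name, company) VALUES (?, ?, ?)",
    [
        ("alice", "Alice Smith", "Acme Widgets"),
        ("bob", "Bob Jones", "Widget Factory"),
    ],
)

# MATCH consults the inverted index rather than scanning every row.
for row in conn.execute(
    "SELECT username FROM users_fts WHERE users_fts MATCH 'widget*'"
):
    print(row)  # ('alice',) and ('bob',)
```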
If you have complex queries joining multiple tables, then SQL is probably the only way to go, but you might still be able to scale by sharding on username and user-related data; this, again, depends on what the queries are. For off-line queries (stats, reports) you can run the joins on each shard and merge the result sets (MapReduce or Gearman frameworks can help here).
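Here's a rough Python sketch of both ideas -- hash-based sharding on username, plus a scatter-gather merge for an offline report; the in-memory dicts are stand-ins for real per-shard databases:

```python
import hashlib

# The in-memory dicts below stand in for real per-shard databases.
NUM_SHARDS = 4
shard_dbs = [{} for _ in range(NUM_SHARDS)]

def shard_for(username):
    """Stable hash so a given user always lands on the same shard."""
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def save_user(username, profile):
    shard_dbs[shard_for(username)][username] = profile

def count_by_company(company):
    """Offline report: run the query on every shard, then merge the results."""
    return sum(
        1
        for db in shard_dbs
        for profile in db.values()
        if profile.get("company") == company
    )

save_user("alice", {"company": "Acme"})
save_user("bob", {"company": "Acme"})
print(count_by_company("Acme"))  # 2, gathered across all shards
```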
Finally, you can mix all these approaches: use key-value for login, use SQL for complex queries, and use replication for both durability and performance.
Ten million records is not necessarily a large database. Some people would consider a large database one that consists of hundreds of millions of rows or more and terabytes or petabytes of storage.
Beyond typical normalization, if nothing can be done to reduce the depth (number of rows) of the tables, then indexes are certainly going to be helpful.
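As a concrete example of that normalization step, here's a sketch (again using sqlite3, with made-up table and column names) that pulls the repeated free-text "company" value out into its own table and indexes the foreign key:

```python
import sqlite3

# Table and column names are made up; SQLite stands in for your RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,
        name TEXT, dob TEXT, gender TEXT,
        company_id INTEGER REFERENCES companies(id)  -- FK instead of free text
    );
    -- Index the FK so joins back to companies stay cheap.
    CREATE INDEX idx_users_company_id ON users (company_id);
""")
```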