mysql design: verbose tables_问答_开发者_运维开发者技术经验分享

Thinking about a re-design for a database. Wondering what the differences/caveats are to something like:

users    fields  
-----    ------
id       id
         user_id
         page_id
         field_id
         field_data

as opposed to

users    address  
-----    ------
id       id
         user_id
         street1
         street2
         etc...

Are there issues with using the former that I'm not considering? Is there a performance hit if/when the fields table were to become huge? Is this just bad practice? For some reason this makes me think of nosql开发者_开发百科, but I could be mistaken.

Thanks!

The first approach you outline is a "property bag"; it implies a lack of structure to the set of data that's going to be stored against your user. The second approach is much more of a structured approach, where you have a specific address table with specific entries, etc.

The type of data that you have really determines which approach is appropriate. For largely unstructured or indeterminable data, the property bag approach is quite appropriate; however, if you know that each user is definitely going to have an address, and the structure of the address is going to be in a precise form, it makes sense in that case to have an address table, with specific columns for the values.

Determining which of these types of structures is appropriate really is about having some knowledge or expectation of the data you'll be receiving, and the structure inherent in it (if any). I've frequently seen a mix of these two approaches, in situations where there was a set of well-known, well-structured data for each record, and there was also a (potential) broad set of un(der)structured data that could (but might not) be added for a record.

Edit: To address specifically performance concerns, yes, performance can become an issue when tables become large in the first case. And even before tables become large, performance can become an issue depending on the types of queries you'll need to run. Using the property bag approach handicaps your ability to use indexing to speed up queries (for example, indexing on zip code).

It's not entirely clear what you're showing, but I'll hazard a guess that the first design doesn't list the specific address columns, so the idea is to be more generic. The first design will be easier to extend - instead of adding columns, you're just adding a new field_id, and field data. Having said that you and others using it may find it more of a pain to work with...

Instead of something like:

SELECT street1, street2, city, postal_code FROM address where user_id = someval

you'll do:

SELECT field_id, field_data FROM fields WHERE field_id in 
('street1', street2', 'city', 'postal_code') 
AND user_id = someval

In the first case it may be easier in a program to assign variables to the results, in the second case it may be a bit more of a pain (depending on the language), since you may have to test each field_id to see what it contains.

In most cases, I think you should just go with the second approach - especially if you know in advance what most of the fields are. The first approach is more for when you think you'll need to add many previously-unknown, and unknowable fields later.

The second approach is better - reason is:

Your address table defines what an address is so it's easy to change without changing every table that uses an address,
if you find you need multiple addresses per user, it's much tidier and lets the user alias each address. Amazon does this. Here's what your tables could look like:

Here's an example db structure

create table user (
user_id int,
home_address_id,
postal_address_id
...
);

create table address (
address_id int,
alias text, -- what the user calls this - eg "Uncle bob's house"
street text,
suburb text,
...
);

create table sale (
sale_id int,
user_id int,  -- actually, this is not required since you can go via address, but leave it
delivery_address_id int,
...
}

This would let users send stuff to other addresses that are reusable - maybe I always send uncle bob a gift at xmas - no problem - I have his address in my pick list.

Regarding the performance question, there are a few things to think about.

One is on INSERT: Using an auto_increment ID will not scale horizontally, if/when you move from one database server to a replicated (especially multiple-master) configuration. So, table size will start to cause problems earlier with your first approach.

MyISAM also does table-level locking on INSERT and UPDATE, so if you are using MyISAM (for instance, for read speed) instead of InnoDB, the first approach is also more problematic. Basically, it will require you to use InnoDB.

Of course, the biggest thing to consider is how reads will be impacted. If you are doing SELECTs by field properties, understand that the indexes on these columns will be larger if you use the first approach. And ORDER BY will be more problematic -- if you have ORDER BY address.city in a query and there's an index on the city column, then that's just a sequential read off the disk of the index file in the second scenario, but a lot of seeking in the first to skip over the non-city rows (where the equivalent query is WHERE field_id = "city" ORDER BY field_data).

So, as Paul says, it mostly depends on your data. If you need to be able to store a lot of arbitrary data sets (i.e. "long tail"), then the NoSQL-style may be appropriate, and on a single database server, using InnoDB, its performance disadvantages can be minimized. I've built systems (i.e. for storing arbitrary user-created HTML form data) that work this way. A simple example application that uses both approaches is !WordPress -- for known, core data types that it stores over and over (users, blog posts, comments) it uses standard normalized relational tables, but it also stores a lot of arbitrary user-defined metadata, for which it uses the denormalized approach.

If you're considering a fully denormalized structure and need high performance, I'd strongly advise you consider a NoSQL database like CouchDB or MongoDB. MySQL and its indexes are not optimal for these kinds of queries (you will end up with a lot of self-joins, and with the discontinous index problem I explained), whereas NoSQL is built for property-based indexes or even indexes of nested sub-properties. You can do full Map/Reduce or, with MongoDB for instance, index the result of an arbitrary Javascript function run on your data records.