Database: Foreign keys or denormalized tables for immutable data?_问答_开发者

Given:

Person[id, name, city, state, country]

When am I better off storing city names directly in the main table versus using foreign keys to a separate cities table? For the purpose of this discussion, assume that city names are immutable. Here is what I was thinking:

Option 1: Store values inline.

Person[id, name, city, state, country]

Pro: Easy inserts. Fast queries.
Cons: Increased diskspace/memory usage
Ideal for small values, infrequent duplicates.

Option 2: Foreign keys to a separate table

Person[id, name, city_id, state_id, country_id]
Cities[id, name]
States[id, name]
Countries[id, name]

Pro: Decreased diskspace/memory usage. Potentially easier to cache frequently-used values.
Cons: Inserts are more complicated (does a city already exist or should it be 开发者_如何学Pythoninserted?)
Ideal for large values, frequent duplicates.

To quote http://en.wikipedia.org/wiki/Database_normalization: "The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships." It seems to me that these benefits fly out the window when dealing with immutable data since it never needs to be updated. As for insertion and deletion anomalies, we can use nullable columns (if needed).

What is the best-practice for this case?

Option 2: Foreign keys to a separate table

Person[id, name, city_id, state_id, country_id]
Cities[id, name]
States[id, name]
Countries[id, name]

"Normalization" doesn't mean "replace names with id numbers". You should hunt down whoever taught you that, and poke them in the eye with your finger. (Or better yet, both eyes. Two fingers.)

"Normalization" involves identifying associations between columns ("functional dependencies"), and isolating them in a different table ("projection"). Normalization increases data integrity by reducing or eliminating certain kinds of INSERT, UPDATE, and DELETE anomalies.

Later . . .

Some example data for the table of people . . .

id  name          city   state           country
--
1   John Smith    York   Alabama         United States of America
2   John Doe      York   Maine           United States of America
3   Jane Smith    York   Nebraska        United States of America
4   Jane Doe      York   South Carolina  United States of America

You're right, the original table is in 2NF. It's not in 3NF, because "country" is a fact about {city, state}. So you can replace the original table of people with these two tables.

people
id  name          city   state           
--
1   John Smith    York   Alabama         
2   John Doe      York   Maine           
3   Jane Smith    York   Nebraska        
4   Jane Doe      York   South Carolina  

cities -- key is (city, state)
city   state           country  
--
York   Alabama         United States of America
York   Maine           United States of America
York   Nebraska        United States of America
York   South Carolina  United States of America

Two things to watch: 1) Decomposition removed a column from the original table. 2) Decomposition didn't involve adding an arbitrary id number.

What normal form is each of these tables in now?

You can reduce the storage space needed by replacing country names with country codes. You might begin by storing the ISO country codes along with the other attributes in the "cities" table.

cities -- key is (city, state)
city   state           country                    iso_cc
--
York   Alabama         United States of America    US
York   Maine           United States of America    US
York   Nebraska        United States of America    US
York   South Carolina  United States of America    US

But by adding one column, we've increased the number of functional dependencies from one to four.

{city, state} -> country
{city, state} -> iso_cc
country -> iso_cc
iso_cc -> country

We can remove the two transitive dependencies by creating a table of countries. It makes sense to remove the column "country", and retain the column "iso_cc" for two reasons. The column "iso_cc" is shorter, and humans can read it. Since humans can read it, we usually won't have to join the table "countries".

cities -- key is (city, state)
city   state           iso_cc
--
York   Alabama         US
York   Maine           US
York   Nebraska        US
York   South Carolina  US

countries -- keys are iso_cc and country
iso_cc   country
--
US       United States of America

Note that the table "countries" has two candidate keys. Each column is unique. In my experience, most databases don't enforce both those constraints. Developers who simply replace names with id numbers frequently miss that second one. (That "country" is unique, not only the id number.)

A normalized database will, in addition to your points, standardize the naming of the cities. However, then you need to populate your city table with all the cities in the world. Otherwise, you'll end up with something like New York and New-York.

Unless you are planning on having city information beyond just the name (e.g. size, geo-coordinates, etc), I say stick with string city names. If your code is good, you'll be able to normalize if ever the need comes up.

You could certainly have a Countries table, with PK = ISO code. Just search the net, and you will find a ready made list. I would not care about the rest for adresses, unless your row number is really high (> 10^6), or you plan some specialised usage as Dimitry pointed.
If you decide to go that way, you could end up with a Locations table, with PK = (CountryISO, PostalCode) and extra field = City.
But are you going to save much ?