
Data normalization and writing queries


I'm a jr. developer (5 months on the job), and I have a question about data normalization. Now, as I understand it, the general principle behind data normalization is to create an RDBMS where data redundancy is kept to a minimum. In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, i.e., a table has two or three columns and that's it. Now, when it comes to writing SQL queries, it has become something of a minor hassle, since each query involves combing through several different tables and joining them together. I was wondering if this is a side effect of data normalization? Or does this point to something else?

I know that the easiest thing to do, for me, would be to write tables based off the queries I have to write. This will create a DB with a lot of redundant data, but I was curious if there is a happy medium?

Just as a postscript, I don't want to come across like I'm whining about my work, but I'm genuinely curious to learn more about this. My work environment is not the most friendly, so I don't feel comfortable posing this question to my colleagues. However, I would appreciate any thoughts, books, tutorials or opinions from more experienced people.

Thanks.


general principle behind data normalization is to create an RDBMS where data redundancy is kept to a minimum.

Only partly true.

Normalization is not about "redundancy".

It's about "update anomalies".

1NF is the "don't use arrays" rule. Breaking 1NF means a row isn't atomic but a collection, and independent updates within that collection wouldn't work out well. There'd be locking and slowness.
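
To make "don't use arrays" concrete, here is a minimal sketch (the table and column names are invented for illustration, not taken from the question):

    -- 1NF violation: one column holds a collection of values.
    CREATE TABLE orders_bad (
        order_id    INTEGER PRIMARY KEY,
        item_codes  VARCHAR(200)  -- e.g. 'A1,B7,C3': an array packed into a string
    );

    -- 1NF-friendly: one value per row, in a child table.
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL,
        item_code  VARCHAR(10) NOT NULL,
        PRIMARY KEY (order_id, item_code)
    );

With the second design, two sessions can add or remove different items on the same order without contending for a single row.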

2NF is the "one key" rule. Each row has exactly one key and everything in the row depends on the key. There are no dependencies on part of the key. Some folks like to talk about candidate keys and natural keys and foreign keys; they may exist or they may not. 2NF is satisfied when all attributes depend on one key. If the key is a single-column surrogate key, this normal form is trivially satisfied.

If 2NF is violated, you've got columns which depend on part of a key, but not the whole key. Suppose you had a table with (Part Number, Revision Number) as a key, and attributes of color and weight, where weight depends on the whole key but color depends only on the part number. You'd have a 2NF problem: you could update some part colors but not others, creating data anomalies.
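
A sketch of the usual fix for that example (column types are assumptions): move color into a table keyed by part number alone.

    CREATE TABLE parts (
        part_number  INTEGER PRIMARY KEY,
        color        VARCHAR(20)         -- depends only on part_number
    );

    CREATE TABLE part_revisions (
        part_number      INTEGER NOT NULL REFERENCES parts (part_number),
        revision_number  INTEGER NOT NULL,
        weight           DECIMAL(10,2),  -- depends on the whole key
        PRIMARY KEY (part_number, revision_number)
    );

A part's color now lives in exactly one row, so it can't be updated for some revisions and missed for others.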

3NF is the "only the key" rule. If you put derived data in a row and then change the derived result, it no longer matches the source columns. If you change a source column without updating the derived value, you have a problem too. Yes, triggers are a bad hackaround to allow 3NF design violations. That's not the point. The point is merely to define 3NF and show that it prevents an update problem.
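
A hypothetical illustration (the order_lines table is invented): instead of storing a derived total that can drift out of sync with its source columns, leave it out and compute it in queries.

    CREATE TABLE order_lines (
        order_id    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        quantity    INTEGER NOT NULL,
        unit_price  DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (order_id, line_no)
        -- no line_total column: it would be derived data
    );

    -- Derive the total on the way out instead.
    SELECT order_id, line_no, quantity, unit_price,
           quantity * unit_price AS line_total
    FROM   order_lines;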

each query involves combing through several different tables and joining them together. I was wondering if this is a side effect of data normalization?

It is.


Now, as I understand it, the general principle behind data normalization is to create an RDBMS where data redundancy is kept to a minimum.

Ummm, ok.

In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, i.e., a table has two or three columns and that's it.

The number of tables doesn't say anything about whether the design is good or bad. Some businesses need one or two. Others need more. I've worked on databases at Fortune 500s that had thousands of tables.

The number of columns doesn't say anything about whether the design is good or bad, either. And the number of columns has nothing to do with fragmentation. I will say that tables with relatively few columns are generally a good sign. Not always a good sign, but generally a good sign.

Now, when it comes to writing SQL queries, it has become something of a minor hassle, since each query involves combing through several different tables and joining them together. I was wondering if this is a side effect of data normalization? Or does this point to something else?

There are two different, common reasons for that.

When you normalize a table, you reduce redundancy (and increase data integrity) by identifying functional dependencies, isolating the functionally dependent columns in one or more new tables, and removing them from the original table. So normalizing a table, in the sense of moving from a lower normal form to a higher normal form

  • always increases the number of tables,
  • always reduces the number of columns in the original table, and
  • sometimes requires a join to retrieve data for humans.

Another common practice is to replace strings with id numbers. This has nothing to do with normalization. (There's no such thing as "id number normal form".) Replacing strings with id numbers

  • always increases the number of tables,
  • doesn't change the number of columns in the original table (unless done at the same time as normalization),
  • always requires a join to retrieve data for humans.
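
To make the second case concrete, a hypothetical sketch (the colors and products tables are invented for illustration):

    CREATE TABLE colors (
        color_id  INTEGER PRIMARY KEY,
        name      VARCHAR(20) NOT NULL
    );

    CREATE TABLE products (
        product_id    INTEGER PRIMARY KEY,
        product_name  VARCHAR(50) NOT NULL,
        color_id      INTEGER NOT NULL REFERENCES colors (color_id)
    );

    -- products stores an id, not the color string, so showing
    -- the color to a human always costs a join.
    SELECT p.product_name, c.name AS color
    FROM   products p
    JOIN   colors   c ON c.color_id = p.color_id;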

There seems to be some confusion in other parts of this thread. I realize that, strictly speaking, none of the following is directly related to the OP's question.

1NF is the "one value" principle. It doesn't have anything to do with a row being "atomic". In the relational model, atomic doesn't refer to rows; it refers to values.

"One value" means that each intersection of a row and a column contains a single value. (In other words, the value is "atomic". But the word atomic has some unfortunate connotations, so most modern practitioners avoid it.) That value doesn't need to be simple; it can be arbitrarily complex. But if it has parts that themselves have meaning, the dbms either completely ignores those parts, or it provides functions to manipulate them. (You don't have to write functions to manipulate the parts.)

I think the simplest example is a date. Dates have parts, consisting of a year, month, and day. The dbms either ignores those parts (as in SELECT CURRENT_DATE) or it provides functions to manipulate them (as in SELECT EXTRACT(YEAR FROM CURRENT_DATE)).

Attempts to dodge the "one value" principle lead to a corollary: the "no repeating groups" principle.

A repeating group involves multiple values from one domain, all values having the same meaning. So a table like the following one is an example of one kind of repeating group. (There are other kinds.) Values for both "phone_1" and "phone_2" come from the same domain, and they have the same meaning--user 'n' has phone numbers (phone_1 and phone_2). (Primary key is "user_id".)

user_id    phone_1           phone_2    
1          (111) 222-3333    (111) 222-3334
2          (111) 222-3335    (111) 222-3336

But the next table, although it's very similar, doesn't have a repeating group. The values come from the same domain, but they don't have the same meaning. (Primary key is "user_id".)

user_id    home_phone        work_phone    
3          (111) 222-3333    (111) 222-3334
4          (111) 222-3335    (111) 222-3336
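
If that first table's repeating group did need fixing, the usual decomposition gives each phone number its own row. A minimal sketch (the user_phones name and column types are assumptions):

    CREATE TABLE user_phones (
        user_id  INTEGER NOT NULL,
        phone    VARCHAR(20) NOT NULL,
        PRIMARY KEY (user_id, phone)
    );

One row per phone number means a user can have one, two, or ten numbers without the table design changing.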

2NF is the "whole key" principle. It doesn't have anything to do with the number of keys; a table having 'n' columns could have 'n' keys. (See, for example, this other SO answer.) In the relational model (and, by extension, when you're doing normalization exercises), if you see the word key by itself, think "candidate key".

Instead, 2NF has to do with candidate keys that have multiple columns. When a candidate key has multiple columns, 2NF requires that every non-prime attribute be functionally dependent on all the columns of every candidate key, not on just some of the columns of any candidate key. (A non-prime attribute is an attribute that's not part of any candidate key.)

The following example is adapted from the Wikipedia entry on 2NF. (Primary key is {employee, skill}.)

Table: employee_skills
employee        skill            current_work_location
--
Jones           Typing           114 Main Street
Jones           Shorthand        114 Main Street
Jones           Whittling        114 Main Street
Bravo           Light Cleaning   73 Industrial Way
Ellis           Alchemy          73 Industrial Way
Ellis           Flying           73 Industrial Way
Harrison        Light Cleaning   73 Industrial Way

While it's true that the non-prime column current_work_location is functionally dependent on the primary key {employee, skill}, it's also functionally dependent on just part of the primary key, "employee". That table isn't in 2NF.

You can't dodge a 2NF problem by assigning each row a surrogate key. (Primary key is es_id; there's a UNIQUE constraint on the former primary key, {employee, skill}).

Table: employee_skills
es_id   employee        skill            current_work_location
--
1       Jones           Typing           114 Main Street
2       Jones           Shorthand        114 Main Street
3       Jones           Whittling        114 Main Street
4       Bravo           Light Cleaning   73 Industrial Way
5       Ellis           Alchemy          73 Industrial Way
6       Ellis           Flying           73 Industrial Way
7       Harrison        Light Cleaning   73 Industrial Way

It should be obvious that adding the id number did nothing to remove the partial dependency employee->current_work_location. Without removing the partial dependency, this table is still not in 2NF.
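
What does remove the partial dependency is decomposition: move current_work_location into a table keyed by employee alone. A sketch (column types are assumptions):

    CREATE TABLE employees (
        employee               VARCHAR(30) PRIMARY KEY,
        current_work_location  VARCHAR(50)
    );

    CREATE TABLE employee_skills (
        employee  VARCHAR(30) NOT NULL REFERENCES employees (employee),
        skill     VARCHAR(30) NOT NULL,
        PRIMARY KEY (employee, skill)
    );

Each employee's work location now appears in exactly one row, so updating it can't leave some skill rows pointing at the old address.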

3NF is the "no transitive dependencies" principle. It doesn't necessarily have anything to do with derived or calculated data, as you can tell from the Wikipedia example, adapted here. (Primary key is {tournament, year}. This table is not in 3NF.)

Table: tournament_winners
tournament             year  winner            winner_date_of_birth
--
Indiana Invitational   1998  Al Fredrickson    21 July 1975
Cleveland Open         1999  Bob Albertson     28 September 1968
Des Moines Masters     1999  Al Fredrickson    21 July 1975
Indiana Invitational   1999  Chip Masterson    14 March 1977

Two dependencies show that this table has a transitive dependency.

  1. The values in winner_date_of_birth appear to be functionally dependent on the primary key. Each primary key value determines one and only one value for winner_date_of_birth. But . . .
  2. The values in winner_date_of_birth also appear to be functionally dependent on winner. Each value for winner determines one and only one value for winner_date_of_birth.

Given those two apparent functional dependencies and an understanding of what the words tournament, winner, and date of birth mean, we can say that

  • winner -> winner_date_of_birth is a functional dependency, and
  • {tournament, year} -> winner is a functional dependency, and
  • {tournament, year} -> winner_date_of_birth is a transitive dependency.
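
The standard fix, sketched here along the lines of the Wikipedia decomposition (column types are assumptions), isolates winner -> winner_date_of_birth in its own table:

    CREATE TABLE players (
        player         VARCHAR(30) PRIMARY KEY,
        date_of_birth  DATE NOT NULL
    );

    CREATE TABLE tournament_winners (
        tournament  VARCHAR(30) NOT NULL,
        year        INTEGER     NOT NULL,
        winner      VARCHAR(30) NOT NULL REFERENCES players (player),
        PRIMARY KEY (tournament, year)
    );

Al Fredrickson's date of birth is now stored once, no matter how many tournaments he wins.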


Database views are a critically important tool in this situation. This excellent introduction says:

Here's the good news: you don't have to work with the normalized tables! ... It is very easy (at least for DBAs) to create an abstraction layer of joined views on top of the normalized data tables, putting the base tables completely "behind the scenes", and out of sight.
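
As a sketch of that idea (reusing the tournament/players decomposition from the earlier answer; the view name is an assumption):

    -- The view performs the join once, behind the scenes.
    CREATE VIEW v_tournament_winners AS
    SELECT tw.tournament, tw.year, tw.winner, p.date_of_birth
    FROM   tournament_winners tw
    JOIN   players            p ON p.player = tw.winner;

    -- Report writers then query it as if it were one flat table.
    SELECT * FROM v_tournament_winners WHERE year = 1999;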


It sounds like data normalization, but I would have to know more about the schema, the business case, etc. to make that call reliably. If you had control of the database, you could write a view that represents common queries that link tables. To increase performance, you could create an indexed or materialized view (the name depends on the database platform; in this case, Oracle vs. SQL Server).
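
A sketch of what that can look like (Oracle-style syntax with invented table names; on SQL Server the rough equivalent is an indexed view, i.e. a SCHEMABINDING view with a unique clustered index):

    -- The join result is physically stored and refreshed (on commit or on a
    -- schedule, depending on configuration), so readers skip the join work.
    CREATE MATERIALIZED VIEW mv_customer_orders AS
    SELECT o.order_id, o.order_date, c.customer_name
    FROM   orders    o
    JOIN   customers c ON c.customer_id = o.customer_id;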

Just about any database primer will help you along with these concepts. If you are using SQL Server and are really interested in learning more, SQL Server Books Online is an excellent resource.


Having a large number of tables is definitely a symptom of a well-normalized database design. This can be a pain when writing queries, but it is much better than getting data out of sync.

I sometimes write reports that run off of a database with thousands of tables. Each night, we have a program that runs and dumps data from the production tables into a data warehouse so that we can report against it more easily. The data warehouse tables are much less normalized, and this makes writing queries much simpler. You may want to consider something like this if it makes sense in your situation.
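
Conceptually, that nightly job boils down to a few statements like this (a sketch with invented names, not our actual program):

    -- Flatten the normalized production tables into one wide reporting table.
    INSERT INTO dw_order_report (order_id, order_date, customer_name, product_name)
    SELECT o.order_id, o.order_date, c.customer_name, p.product_name
    FROM   orders      o
    JOIN   customers   c ON c.customer_id = o.customer_id
    JOIN   order_items i ON i.order_id    = o.order_id
    JOIN   products    p ON p.product_id  = i.product_id;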


Without seeing the data, it is hard to say whether your data is over-normalized (or just not normalized correctly: spreading fields over several tables doesn't mean it's normalized). Generally speaking, though, you will probably have to join several tables to see useful data in a well-normalized database.

You can create views that join tables together, then you can query the view. This will probably help when selecting data.


In a well-designed database, the joins you need in your queries should be pretty easy to code up. The downside is that you have verbose SQL. The upsides are enormous:

  • Consistent, easy-to-update tables.
  • Quickly adapt to business needs. Well-designed DBs can usually handle queries that were not even thought of during the original design.
  • Quickly accommodate new entities. It's relatively easy to add new data entities and attributes to a well-designed database. It can be a nightmare incorporating seemingly simple changes into a denormalized database.