Is normalization required in this case?

Let's say I have a database with a table like this:

id | fundraiser_id | donation_amount | name | sex | university

This is an analogous version of my real table. This table tracks donations during a fundraiser. It is very likely that the same person will donate multiple times for each fundraiser (they are very generous).

A user would enter this data like a spreadsheet. They won't necessarily care that they are entering the same name, sex, and university for multiple rows. After all, they aren't paying for the storage space.

My question is this: should I normalize this table by attempting, on table update, to extract distinct values for the set of (name, sex, university) and save them in a People table? Then my new table would look like this:

id | fundraiser_id | donation_amount | people_id

Having the user predefine the people they will use in their table is not an option.
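
In other words, something like this (a sketch; the column types are just illustrative guesses):

-- Hypothetical lookup table holding each distinct person once.
CREATE TABLE people (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(45) NOT NULL,
  sex CHAR(1) NOT NULL,
  university VARCHAR(45) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Donations reference a person instead of repeating their details.
CREATE TABLE donations (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  fundraiser_id INT UNSIGNED NOT NULL,
  donation_amount DECIMAL(10,2) NOT NULL,
  people_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (people_id) REFERENCES people (id)
) ENGINE=InnoDB;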

I discussed this situation in a previous question, but I felt that I didn't give enough detail. Here are the pros and cons that I (and the others that helped me in that previous question) came up with:

Pros:

  • Saves space by storing each person's details once in a lookup table

Cons:

  • Takes resources to ensure that I don't add duplicate people to the People table
  • Will take a Cron job to kill orphaned entries in the People table
  • Extra queries required no matter how I look at it

Thank you for any advice.

Just one more clarification:

I'm not concerned at all about two people having the same name, sex, and university. That information is just displayed and not acted on, so two different people are still one and the same to my application. I am only suggesting this normalization as a way to save space by creating a lookup table.


Given your updated question, an updated answer:

I really don't see any benefit in splitting up this table:

id | some_unique_field | name | sex | university

You should definitely split up this table:

id | fundraiser_id | donation_amount | name | sex | university

Into this:

donation
id | fundraiser_id | donation_amount | donator_id

fundraiser
id | charity | ....

donator
id | name | sex | university

This will make it easier to get the amount donated per person.
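
For example, the total donated per person becomes a straightforward join and GROUP BY (a sketch against the tables above):

SELECT d.name, d.sex, d.university,
       SUM(dn.donation_amount) AS total_donated
FROM donator d
JOIN donation dn ON dn.donator_id = d.id
GROUP BY d.id, d.name, d.sex, d.university
ORDER BY total_donated DESC;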

If you want to import your spreadsheet, I'd recommend doing something like this:

Step 1: create a blackhole table like so:

DROP TABLE IF EXISTS `test`.`bh_donations`;
CREATE TABLE  `test`.`bh_donations` (
  `fundraiser_name` varchar(45) NOT NULL,
  `donation_amount` decimal(10,2) NOT NULL,
  `name` varchar(45) NOT NULL,
  `sex` char(1) NOT NULL,
  `university` varchar(45) NOT NULL
) ENGINE=BLACKHOLE DEFAULT CHARSET=latin1;    

You don't need an id here, but if it simplifies your code, by all means add it.

Step 2: add a trigger to the blackhole table to process it.

DELIMITER $$

CREATE TRIGGER bi_bh_donations BEFORE INSERT ON bh_donations FOR EACH ROW
BEGIN
  DECLARE mydonator_id INTEGER;
  DECLARE myfundraiser_id INTEGER;

  -- Look up the fundraiser by name.
  SELECT f.id INTO myfundraiser_id FROM fundraiser f
    WHERE f.name = new.fundraiser_name LIMIT 1;

  -- Abort the insert if the fundraiser is unknown. Selecting from a
  -- nonexistent table is a deliberate hack to raise an error inside a
  -- trigger (on MySQL 5.5+ you could use SIGNAL instead).
  IF myfundraiser_id IS NULL THEN BEGIN
    SELECT error_fundraiser_is_unknown FROM table_error;
  END; END IF;

  -- Look up the donator; create them if they don't exist yet.
  SELECT d.id INTO mydonator_id FROM donator d
    WHERE d.name = new.name AND d.sex = new.sex AND d.university = new.university
  LIMIT 1;

  IF mydonator_id IS NULL THEN BEGIN
    INSERT INTO donator (name, sex, university)
    VALUES (new.name, new.sex, new.university);
    SET mydonator_id = LAST_INSERT_ID();
  END; END IF;

  INSERT INTO donation (fundraiser_id, donation_amount, donator_id)
    VALUES (myfundraiser_id, new.donation_amount, mydonator_id);
END$$

DELIMITER ;

Step 3: use LOAD DATA INFILE to import the data:

LOAD DATA INFILE 'data.csv' INTO TABLE bh_donations
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

Now you can import the spreadsheet by saving the Excel file as a CSV and using LOAD DATA INFILE to read it into the blackhole table; the trigger fills the normalized tables for you.

See:

  • blackhole tables: http://dev.mysql.com/doc/refman/5.0/en/blackhole-storage-engine.html
  • more blackhole: Creative uses for the blackhole engine
  • triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html
  • load data infile: http://dev.mysql.com/doc/refman/5.5/en/load-data.html

Hope this helps.


Let's try to put together some reasonable sample data. Assume that the column 'id' is the primary key, and that people don't make spelling mistakes.

Table: donations
id   fundraiser_id   donation_amount   name              sex   university
--
1    100             $100              Kim Stack         M     Rivier College
2    100             $150              Kim Stack         M     Rivier College
3    100             $45               Marguerite Meade  F     Rivier College
4    100             $100              Marie Dew         F     Rivier College
5    100             $100              Kim Stack         F     Midway College
6    100             $100              Kim Stack         F     Mars Hill College
...
98   200             $135              Kim Stack         M     Rivier College
99   200             $400              Kim Stack         M     Midway College

Let's try to find a transitive dependency having to do with people.

  • name->sex: No, there are males and females named "Kim Stack".
  • name->university: No.
  • name, sex->university: No, two females named "Kim Stack" go to different colleges.
  • sex, university->name: No, there are two different females at Rivier College.
  • name, university->sex: No, there's a male and female student named "Kim Stack" at Midway College.

(You can keep going.)
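
You can also check a candidate dependency mechanically. For instance, this query (a sketch against the sample donations table) lists every name that maps to more than one sex, i.e. every counterexample to name->sex:

SELECT name, COUNT(DISTINCT sex) AS distinct_sexes
FROM donations
GROUP BY name
HAVING COUNT(DISTINCT sex) > 1;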

This table has no transitive dependencies, and "id" is the only candidate key: this table is already in 5NF.

As useful as it might be, replacing a name with an id number—or replacing {name, sex, university} with an id number—has nothing to do with normalization.


Your design violates the database normalization principle that non-key fields should depend solely on the key column.

Your data can be looked up using either the id column or the some_unique_field column; in effect you have two key columns.

In database design, if natural data is available to use as a key, you should use it (e.g. a passport number). In your case, there is no need for the id column. However, if you must have it (for example, for performance reasons you may want a small key instead of some very long value, so you get more index entries per I/O page and thus faster index lookups), then do it.
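
The trade-off looks roughly like this (a sketch using the passport example; sizes are illustrative):

-- Natural key: the identifying data itself is the primary key.
CREATE TABLE citizen_natural (
  passport_number VARCHAR(20) NOT NULL,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (passport_number)
);

-- Surrogate key: a compact integer keeps index entries small, while a
-- UNIQUE constraint still protects the natural identifier.
CREATE TABLE citizen_surrogate (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  passport_number VARCHAR(20) NOT NULL,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (passport_number)
);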

Your "normalization" attempt actually doesn't address the fundamental problem of having "two keys" - your new table still has this problem.


You should always try to normalize. The only case in which you wouldn't is when you have a significant gain from not doing so. In this case, you only need a lookup to check whether the data is already in the database, and since those fields would likely be indexed, it'd be a fairly cheap query. Also, if you're using MySQL (I don't know about other databases at the moment), you can use INSERT IGNORE to avoid that lookup, as sketched below.
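
A minimal sketch of the INSERT IGNORE approach (it relies on a UNIQUE index over the person columns; the table and column names are illustrative):

-- With a UNIQUE index on (name, sex, university) in place, inserting a
-- duplicate person is silently skipped instead of raising an error.
INSERT IGNORE INTO people (name, sex, university)
VALUES ('Kim Stack', 'M', 'Rivier College');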

About the orphaned entries: you could use a TRIGGER to ensure that you don't leave any orphaned rows, unless rows referencing the People table are deleted so often that per-delete cleanup becomes expensive; see the sketch below.
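
For example (a sketch, assuming a donations table whose people_id column references people.id; the names are illustrative):

DELIMITER $$

-- After a donation is removed, drop the person if no donations
-- reference them any more.
CREATE TRIGGER ad_donations AFTER DELETE ON donations FOR EACH ROW
BEGIN
  DELETE FROM people
  WHERE id = old.people_id
    AND NOT EXISTS (SELECT 1 FROM donations WHERE people_id = old.people_id);
END$$

DELIMITER ;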


You could create a UNIQUE index on (name, sex, university). That would prevent duplicate entries from getting into the database to start with.
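
For example (the people lookup table and the index name here are illustrative):

-- Enforce one row per (name, sex, university) combination.
CREATE UNIQUE INDEX uq_person ON people (name, sex, university);

-- The first insert of a person succeeds; repeating the exact same
-- statement then fails with a duplicate-key error (ER_DUP_ENTRY).
INSERT INTO people (name, sex, university)
VALUES ('Marie Dew', 'F', 'Rivier College');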

There's a separate problem, however, which may be of concern to you: detecting spelling variations (e.g., "My U." vs. "My Univ." vs. "My University").
