I am currently setting up a database which has a large number of many-to-many relations. Every relationship was modeled via a link table. Example:
A person has a number of jobs, jobs are fulfilled by a number of persons. A person has a number of houses, houses are occupie开发者_C百科d by a number of persons. A person has a number of restaurants he likes, restaurants have a number of persons who like the restaurant.
First I designed this as follows:
Tables: Person, Job, House, Restaurant, Person_Job, Person_House, Person_Restaurant.
Relationships 1 - n: Person -> Person_Job, Person -> Person_House, Person -> Person_Restaurant, Job -> Person_Job, House -> Person_House, Restaurant -> Person_Restaurant.
This leads pretty quickly to a crowded and complex ER model.
Trying to simplify this I modeled it as follows:
Tabels: Person, Job, House, Restaurant, Person_Attributes
Relationships 1 - n: Person -> Person_Attributes, Job -> Person_Attributes, House -> Person_Attributes, Restaurant -> Person_Attributes
The Person_Attributes table should look something like this: personId jobId houseId restaurantId
If a person - job relationship exists, I'll add an entry looking like:
P1, J1, NULL, NULL
If a person - house relationship exists, I'll add an entry looking like:
P1, NULL, H1, NULL
So the attributes table in the second example will have the same number of entries as the link tables of the first examples added up.
This simplyfies the ER Model a lot, and as long as I build indexes for personId + jobId, personId + houseId and personId + restaurantId, there won't be a lot of performance impact, I think.
My questions are: Is the second method a correct way of modelling this? If not, why? Am I right about performance impact? If not, why?
MySQL Workbench example of what I mean can be found here:
http://www.2shared.com/file/3GBnodEZ/example.html
Your design violates Fourth Normal Form. You're trying to store multiple "facts" in one table, and it leads to anomalies.
The Person_Attributes table should look something like this: personId jobId houseId restaurantId
So if I associate with one job, one house, but two restaurants, do I store the following?
personId jobId houseId restaurantId
1234 42 87 5678
1234 42 87 9876
And if I add a third restaurant, I copy the other columns?
personId jobId houseId restaurantId
1234 123 87 5678
1234 123 87 9876
1234 42 87 13579
Done! Oh, wait, what happened there? I changed jobs at the same time as adding the new restaurant. Now I'm incorrectly associated with two jobs, but there's no way to distinguish between that and correctly being associated with two jobs.
Also, even if it is correct to be associated with two jobs, shouldn't the data look like this?
personId jobId houseId restaurantId
1234 123 87 5678
1234 123 87 9876
1234 123 87 13579
1234 42 87 5678
1234 42 87 9876
1234 42 87 13579
It starts looking like a Cartesian product of all distinct values of jobId, houseId, and restaurantId. In fact, it is -- because this table is trying to store multiple independent facts.
Correct relational design requires a separate intersection table for each many-to-many relationship. Sorry, you have not found a shortcut.
(Many articles about normalization say the higher normal forms past 3NF are esoteric, and one never has to worry about 4NF or 5NF. Let this example disprove that claim.)
Re your comment about using NULL: Then you have a problem enforcing uniqueness, because a PRIMARY KEY
constraint requires that all columns be NOT NULL.
personId jobId houseId restaurantId
1234 123 87 5678
1234 NULL NULL 9876
1234 NULL NULL 13579
Also, if I add a second house or a second jobId to the above table, which row do I put it in? You could end up with this:
personId jobId houseId restaurantId
1234 123 87 5678
1234 NULL NULL 9876
1234 42 NULL 13579
Now if I disassociate restaurantId 9876, I could update it to NULL. But that leaves a row of all NULLs, which I really should just delete.
personId jobId houseId restaurantId
1234 123 87 5678
1234 NULL NULL NULL
1234 42 NULL 13579
Whereas if I had disassociated restaurant 13579, I could update it to NULL and leave the row in place.
personId jobId houseId restaurantId
1234 123 87 5678
1234 NULL NULL 9876
1234 42 NULL NULL
But shouldn't I consolidate rows, moving the jobId to another row, provided there's a vacancy in that column?
personId jobId houseId restaurantId
1234 123 87 5678
1234 42 NULL 9876
The trouble is, now it's getting more and more complex to add or remove associations, requiring multiple SQL statements for changes. You're going to have to write a lot of tedious application code to handle this complexity.
However, all the various changes are easy if you define one table per many-to-many relationship. You do need the complexity of having that many more tables, but by doing that you will simplify your application code.
Adding an association to a restaurant is simply an INSERT
to the Person_Restaurant table. Removing that association is simply a DELETE
. It doesn't matter how many associations there are to jobs or houses. And you can define a primary key constraint in each of these intersection tables to enforce uniqueness.
Your simplified version does not represent a proper relational model. It's more of a metadata model.
The number of tables in your database should represent the number of logical entities in your domain. That should not change based on some arbitrary idea of how many entities is too many.
I do not think the second method is correct because your Person_Attributes table would contain redundant data. For example: say a person likes 10 restaurants and works on 2 jobs, has 3 houses you would have as many as 10*2*3 entries where it should be 10 + 2 + 3(in 3 link tables...as per approach#1). Think of drawbacks having million users and if you had more than 3 attributes in Person_Attributes table to handle... so I would go with approach 1 in your question.
Say for example your Person_Attributes table has following entry:
personId | houseId | jobId | restaurantId
------------------------------------------
P1 H1 J1 R1
now if the person likes restaurants R2 and R3...table looks like
P1 H1 J1 R1
P2 H1 J1 R2
P2 H1 J1 R3
table already has redundant data he adds Job J2 at a later point.. your table will look like
P1 H1 J1 R1
P2 H1 J1 R2
P2 H1 J1 R3
P1 H1 J2 R1
P2 H1 J2 R2
P2 H1 J2 R3
Now consider he adds another home H2.. so on and so forth...Do you see my point?
In my humble opinion I would go for the first model. It's probably a more complex model but in the end it will make things easier when you're extracting info from tables and the application code could get dirtier or more unreadable for other programmers. Beside, there are some authors that wouldn't reccommend to use multipurpose tables like that.
In the end you must go with whatever suits you better. We don't know the whole context so can't help you too much to decide. But, for what you're saying and I'd definitely go for option number one.
The second model is a problem from several perspectives. First it is likely to create blocking issues as everything goes to the one meta table. Second it is far more likely to have data integrity issues as you can't enforce the foreign key constraints. It is a SQL antipattern to model that way. The first model was correct.
精彩评论