How to model a database with many m:n relations on a table

I am currently setting up a database which has a large number of many-to-many relations. Every relationship was modeled via a link table. Example:

A person has a number of jobs, jobs are fulfilled by a number of persons. A person has a number of houses, houses are occupied by a number of persons. A person has a number of restaurants he likes, restaurants have a number of persons who like the restaurant.

First I designed this as follows:

Tables: Person, Job, House, Restaurant, Person_Job, Person_House, Person_Restaurant.

Relationships 1 - n: Person -> Person_Job, Person -> Person_House, Person -> Person_Restaurant, Job -> Person_Job, House -> Person_House, Restaurant -> Person_Restaurant.

This leads pretty quickly to a crowded and complex ER model.

Trying to simplify this I modeled it as follows:

Tables: Person, Job, House, Restaurant, Person_Attributes

Relationships 1 - n: Person -> Person_Attributes, Job -> Person_Attributes, House -> Person_Attributes, Restaurant -> Person_Attributes

The Person_Attributes table should look something like this: personId jobId houseId restaurantId

If a person - job relationship exists, I'll add an entry looking like:

P1, J1, NULL, NULL

If a person - house relationship exists, I'll add an entry looking like:

P1, NULL, H1, NULL

So the attributes table in the second example will have the same number of entries as the link tables of the first examples added up.

This simplifies the ER model a lot, and as long as I build indexes for personId + jobId, personId + houseId and personId + restaurantId, I don't think there will be much of a performance impact.

My questions are: Is the second method a correct way of modelling this? If not, why? Am I right about performance impact? If not, why?

MySQL Workbench example of what I mean can be found here:

http://www.2shared.com/file/3GBnodEZ/example.html

Your design violates Fourth Normal Form. You're trying to store multiple "facts" in one table, and it leads to anomalies.

The Person_Attributes table should look something like this: personId jobId houseId restaurantId

So if I associate with one job, one house, but two restaurants, do I store the following?

personId jobId houseId restaurantId
    1234    42      87         5678
    1234    42      87         9876

And if I add a third restaurant, I copy the other columns?

personId jobId houseId restaurantId
    1234   123      87         5678
    1234   123      87         9876
    1234    42      87        13579

Done! Oh, wait, what happened there? I changed jobs at the same time as adding the new restaurant. Now I'm incorrectly associated with two jobs, but there's no way to distinguish between that and correctly being associated with two jobs.

Also, even if it is correct to be associated with two jobs, shouldn't the data look like this?

personId jobId houseId restaurantId
    1234   123      87         5678
    1234   123      87         9876
    1234   123      87        13579 
    1234    42      87         5678
    1234    42      87         9876
    1234    42      87        13579

It starts looking like a Cartesian product of all distinct values of jobId, houseId, and restaurantId. In fact, it is -- because this table is trying to store multiple independent facts.
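The six rows above can be generated mechanically, which is a good sign that the table holds no information beyond the three independent lists. A minimal Python sketch, assuming the ids from the example; the function name `combined_rows` and house id 88 are made up for illustration:

```python
from itertools import product

# Independent facts about person 1234, taken from the example rows above.
person_id = 1234
jobs = [123, 42]
houses = [87]
restaurants = [5678, 9876, 13579]

def combined_rows(person_id, jobs, houses, restaurants):
    """Every (job, house, restaurant) combination for one person."""
    return [(person_id, j, h, r) for j, h, r in product(jobs, houses, restaurants)]

rows = combined_rows(person_id, jobs, houses, restaurants)
for row in rows:
    print(row)

# Hypothetical: a second house (id 88 is made up) doubles the table,
# although only one new fact was recorded.
rows_with_second_house = combined_rows(person_id, jobs, houses + [88], restaurants)
print(len(rows), "->", len(rows_with_second_house))
```

Any row missing from that product would imply a pairing rule (this job goes with that restaurant) that the data never stated.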

Correct relational design requires a separate intersection table for each many-to-many relationship. Sorry, you have not found a shortcut.

(Many articles about normalization say the higher normal forms past 3NF are esoteric, and one never has to worry about 4NF or 5NF. Let this example disprove that claim.)

Re your comment about using NULL: Then you have a problem enforcing uniqueness, because a PRIMARY KEY constraint requires that all columns be NOT NULL.

personId jobId houseId restaurantId
    1234   123      87         5678
    1234  NULL    NULL         9876
    1234  NULL    NULL        13579
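To see the uniqueness problem concretely, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption; the question uses MySQL). Since the nullable columns cannot be part of a PRIMARY KEY in MySQL, the sketch falls back on a UNIQUE constraint, which treats NULLs as distinct from each other and therefore lets the same association in twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Person_Attributes (
        personId     INTEGER NOT NULL,
        jobId        INTEGER,
        houseId      INTEGER,
        restaurantId INTEGER,
        UNIQUE (personId, jobId, houseId, restaurantId)
    )
""")

# The same person-restaurant association, inserted twice.
row = (1234, None, None, 9876)
conn.execute("INSERT INTO Person_Attributes VALUES (?, ?, ?, ?)", row)
conn.execute("INSERT INTO Person_Attributes VALUES (?, ?, ?, ?)", row)  # not rejected

duplicates = conn.execute(
    "SELECT COUNT(*) FROM Person_Attributes "
    "WHERE personId = 1234 AND restaurantId = 9876"
).fetchone()[0]
print(duplicates)  # both copies were stored
```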

Also, if I add a second house or a second jobId to the above table, which row do I put it in? You could end up with this:

personId jobId houseId restaurantId
    1234   123      87         5678
    1234  NULL    NULL         9876
    1234    42    NULL        13579

Now if I disassociate restaurantId 9876, I could update it to NULL. But that leaves a row of all NULLs, which I really should just delete.

personId jobId houseId restaurantId
    1234   123      87         5678
    1234  NULL    NULL         NULL
    1234    42    NULL        13579

Whereas if I had disassociated restaurant 13579, I could update it to NULL and leave the row in place.

personId jobId houseId restaurantId
    1234   123      87         5678
    1234  NULL    NULL         9876
    1234    42    NULL         NULL

But shouldn't I consolidate rows, moving the jobId to another row, provided there's a vacancy in that column?

personId jobId houseId restaurantId
    1234   123      87         5678
    1234    42    NULL         9876

The trouble is, now it's getting more and more complex to add or remove associations, requiring multiple SQL statements for changes. You're going to have to write a lot of tedious application code to handle this complexity.

However, all the various changes are easy if you define one table per many-to-many relationship. You do need the complexity of having that many more tables, but by doing that you will simplify your application code.

Adding an association to a restaurant is simply an INSERT to the Person_Restaurant table. Removing that association is simply a DELETE. It doesn't matter how many associations there are to jobs or houses. And you can define a primary key constraint in each of these intersection tables to enforce uniqueness.
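As a rough sketch of what that looks like, here is the Person_Restaurant table in Python's built-in `sqlite3` module (used as a stand-in for MySQL; the column names and the stripped-down Person and Restaurant tables are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE Person     (personId     INTEGER PRIMARY KEY);
    CREATE TABLE Restaurant (restaurantId INTEGER PRIMARY KEY);
    CREATE TABLE Person_Restaurant (
        personId     INTEGER NOT NULL REFERENCES Person(personId),
        restaurantId INTEGER NOT NULL REFERENCES Restaurant(restaurantId),
        PRIMARY KEY (personId, restaurantId)
    );
    INSERT INTO Person VALUES (1234);
    INSERT INTO Restaurant VALUES (5678), (9876);
""")

# Adding an association is a single INSERT.
conn.execute("INSERT INTO Person_Restaurant VALUES (1234, 5678)")
conn.execute("INSERT INTO Person_Restaurant VALUES (1234, 9876)")

# The composite primary key rejects a duplicate association.
try:
    conn.execute("INSERT INTO Person_Restaurant VALUES (1234, 5678)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Removing an association is a single DELETE.
conn.execute(
    "DELETE FROM Person_Restaurant WHERE personId = 1234 AND restaurantId = 9876"
)
remaining = conn.execute(
    "SELECT restaurantId FROM Person_Restaurant WHERE personId = 1234"
).fetchall()
print(duplicate_rejected, remaining)
```

Person_Job and Person_House would follow the same pattern, and none of these statements needs to know anything about the other two tables.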

Your simplified version does not represent a proper relational model. It's more of a metadata model.

The number of tables in your database should represent the number of logical entities in your domain. That should not change based on some arbitrary idea of how many entities is too many.

I do not think the second method is correct, because your Person_Attributes table would contain redundant data. For example, if a person likes 10 restaurants, works 2 jobs, and has 3 houses, you would have as many as 10 * 2 * 3 = 60 entries where it should be 10 + 2 + 3 = 15 (in 3 link tables, as per approach #1). Think of the drawbacks with a million users, or with more than 3 attributes to handle in the Person_Attributes table. So I would go with approach 1 in your question.

Say, for example, your Person_Attributes table has the following entry:

personId | houseId | jobId | restaurantId
------------------------------------------
P1       | H1      | J1    | R1

Now if the person also likes restaurants R2 and R3, the table looks like this:

P1       | H1      | J1    | R1
P1       | H1      | J1    | R2
P1       | H1      | J1    | R3

The table already has redundant data. If he adds job J2 at a later point, your table will look like this:

P1       | H1      | J1    | R1
P1       | H1      | J1    | R2
P1       | H1      | J1    | R3
P1       | H1      | J2    | R1
P1       | H1      | J2    | R2
P1       | H1      | J2    | R3

Now consider what happens when he adds another home H2, and so on and so forth. Do you see my point?
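The growth in the walkthrough above can be checked with a few lines of Python. This is only a counting sketch; the function names are made up, and it assumes the combined table stores every combination of values for the person:

```python
from math import prod

def combined_rows(houses, jobs, restaurants):
    """Rows in one combined table: every combination of the values."""
    return prod([len(houses), len(jobs), len(restaurants)])

def link_rows(houses, jobs, restaurants):
    """Total rows across three link tables: one row per association."""
    return len(houses) + len(jobs) + len(restaurants)

# The same steps as in the example, for person P1.
steps = [
    ("start",           ["H1"],       ["J1"],       ["R1"]),
    ("likes R2 and R3", ["H1"],       ["J1"],       ["R1", "R2", "R3"]),
    ("adds job J2",     ["H1"],       ["J1", "J2"], ["R1", "R2", "R3"]),
    ("adds home H2",    ["H1", "H2"], ["J1", "J2"], ["R1", "R2", "R3"]),
]
for label, houses, jobs, restaurants in steps:
    print(f"{label}: combined={combined_rows(houses, jobs, restaurants)}, "
          f"link tables={link_rows(houses, jobs, restaurants)}")
```

The combined count multiplies with every new association while the link-table count only grows by one each time.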

In my humble opinion, I would go for the first model. It is probably the more complex model, but in the end it will make things easier when you are extracting information from the tables; with the second model, the application code could get messier and less readable for other programmers. Besides, some authors recommend against using multipurpose tables like that.

In the end you must go with whatever suits you better. We don't know the whole context, so we can't help you much with the decision. But from what you're saying, I'd definitely go for option number one.

The second model is a problem from several perspectives. First, it is likely to create blocking issues, as everything goes to the one meta table. Second, it is far more likely to have data integrity issues, as you can't enforce the foreign key constraints. It is a SQL antipattern to model that way. The first model was correct.