In our database model we have a Beneficiary entity. A beneficiary can be a physical person or a corporate beneficiary; a phisical beneficiary has a number of attributes such as name, surname, sex, etc.; in addition, a beneficiary (either corporate or physical person) can either be foreign or not; this further distinction translate into different domain values for a "common" set of attributes (for example in Italy, where I live, tax ids may have a different data format than UK's tax ids).
We are now re-engineering our Beneficiary table, since the developer who initially worked on DB analysis & modeling did a (IMO) short-sighted choice. He put the primary key constraint on attribute BeneficiaryName, wich has been used to store either the Corporate name (e.g. "Microsoft Corporation") in case of Corporate beneficiary or the surname (e.g. Smith) for the physical beneficiary. This way we have the (unacceptable) constraint that we CAN'T have more than 1 beneficiary with surname "Smith" (or a corporate named "Smith") in our DB.
My approach for this "re-factoring" would introduce a generalization for the Beneficiary entity; I would
- Clean Beneficiary table, keeping only common data;
- Add a surrogate primary key to Beneficiary table, let's call it BeneficiaryID;
- Split Beneficiary table, creating two sub-entityes (CorporateBeneficiary & PhysicalBeneficiary, discriminated by a flag in master Beneficiary table), with a 1..1 association to Beneficiary table (a foreign key will reference BeneficiaryID)
- Find (significative) primary keys for CorporateBeneficiary & PhysicalBeneficiary;
This should address the aforementioned problem of uniqueness on BeneficiaryName. Seems ok so far?
The real pro开发者_C百科blem I have is: how can/should I handle the further complication added by "foreign" attribute in this model? Should I leave Foreign as it tis, i.e. a flag attribute in Beneficiary? If so, how can I handle the need for different attributes' for a conceptually similar piece of information (i.e. zipcode, tax id) withoud duplicating the attributes (zipcode_foreign, zipcode, taxid_foreign, taxid etc.)? Should I really strive to accomodate different domain values into one field?
Any suggestion would be welcome...
"Clean Beneficiary table, keeping only common data;"
Exactly what there is to do.
"Add a surrogate primary key to Beneficiary table, let's call it BeneficiaryID;"
May be useful, but don't forget that IF there exists a "natural" identifier, then the uniqueness of this should be enforced too.
"Split Beneficiary table, creating two sub-entityes (CorporateBeneficiary & PhysicalBeneficiary"
Yup. Observe that it will be hard to enforce "absolute" data integrity (enforcing at the same time that all NaturalBeneficiaries are Beneficiaries, that all NonNaturalBeneficiaries are Beneficiaries too, and that all Beneficiaries are either Natural or NonNatural Beneficairies).
"discriminated by a flag in master Beneficiary table"
Nope. Wouldn't do that. The flag is redundant, and redundancy adds complexity without adding value. If you want to know whether a Beneficiary is Natural or NonNatural, check the table where that fact is recorded.
"Find (significative) primary keys for CorporateBeneficiary & PhysicalBeneficiary;"
If You introduce a surrogate for Benficiaries in general, you don't need to replicate the natural identifiers in these other tables. That's once again redundancy, adding complexity without adding value.
"The real problem I have is: how can/should I handle the further complication added by "foreign" attribute in this model?""
You could apply the same approach, distinguishing National and ExtraNational (for both Corporate and Physical Benficiaries), and that might be anything from advisable to absolutely required if data integrity is of key importance when it concerns, say, at least the National Benficiaries. For example, legislation might apply that forces you to verify that National SSN numbers or National corporation identifying numbers are "valid" according to the National rules. If such legislation applies, it is likely to be crucial that such rules are checked in and by the DBMS, not just your app. Of course for Non-Nationals, similar checks are typically nor required, or even not possible in general.
If you take such a distinction between National and Non-National into account in your database structure, you will very likely also want to create a view that "unions" the two (National and Non-National) together, and then you will have to "transform" your data to a "unified" "common" format, which will likely be just CHAR (even if you know that, say, for the National PhysicalBeneficiaries, the contents will be their SSN number which you know consists of some fixed number of digits).
If you don't have to take such a distinction between National and Non-National into account in your database structure, then you will be forced to use that same "unified" "common" format in your single table that will be holding the data for both National and Foreign.
Accomodating different domain values in one field will replicate the type of problem you are now having with the main beneficiary table - at some point in the future. I would adopt this approach:
Amend your proposed keys in the CorporateBeneficiary/PhysicalBeneficiary table as follows:
- Add the surrogate primary key as you suggested (BeneficiaryID)
- Use the appropriate database or site mechanisms to generate BeneficiaryID as unique, generated value.
- Make the primary key of CorporateBeneficiary (CB) and PhysicalBenificiary (PB) the same - ie BeneficiaryID.
- Make the primary key of CB and PB a foreign key referencing the Beneficiary table.
Then for the foreign details adopt a similar approach:
- Add a table per country with common columns, but referencing the appropriate domain.
- Setup the new tables in a similar manner to CB/PB - ie as relationships to the Beneficiary table.
Then create views based on left joins to create "virtual tables" - Beneficiary/PB/Foreign detail. These views will be the basis for access to the information.
If you have very large datasets (> 10^7 rows), you may find that my answer to a problem of validating postal codes is of interest. This is a very long response; is not simple; and is massive overkill unless volumes are extremely large.
EDIT
If your data volumes are lower, the approach suggested by @James Anderson, of multiple (for eg address) lines will be perfectly appropriate.
You are thinking along the correct lines.
I would suggest you have a separate "address" table to be shared by all beneficiaries. (you are really describing a maildrop which is the same for people and companies!). But the simplest way to handle the i18n problems here is to keep it simple Line1, Line2, Line3, Line4 and country code.
I would not even go far as having a specific zip/post code as these vary from a simple 4 digits, to the British /[A-Z0-9]{2,4} [A-Z0-9]{2,4}/ or even the Irish Republic with err no post codes whatsoever (Official reasoning "Oh we don't need anything like that, we know where you live."). Additionally the conventional position for the postcode varies from country to country (UK on a line on its own at the end, CH before the city name etc. etc.), some domains follow the city,county,state patterns other places don't have states or counties or do not use them in mail addresses.
For generalization there are three concepts how you can design them for a database:
(explanation for 1 generalization, 2 subtypes)
1. One table for all
Here you have all attributes from generalization and subtypes in one table.
Pros:
- no JOINs, no UNIONs needed (=therefore it is often fastest possibility)
- easy and fast to get things from whole population
- easy and fast for getting things out of generalization attributes
- easy and fast for getting things out of subtypes attributes
Cons:
- table redesign when having new subtypes with new attributes
- many NULL values for unused attributes
- additional attribute to mark what subtype a specific record is
- (it looks like ugly design, but often it's so incredible much faster ...)
PKs/FKs: one PK, no FKs
For your example: Only one table for "Beneficiary".
2. Two tables
Only tables for subtypes. No table for generalization.
Pros:
- very easy and fast access to whole data of one specific subtype
- easy to add new attributes for a specific subtype
Cons:
- hard to get things from whole population (UNION = slow)
- hard to add new attributes to generalization (all tables must be altered)
PKs/FKs: every table it's own PK, no FKs
For your example: Two tables for "CorporateBeneficiary" and "PhysicalBeneficiary"
3. Three tables
Have an own table for generalization and for every subtype. (This is what you choosed.)
Pros:
- easy access to generalization attributes
- easy access to subtype attributes
- easy to add new attributes
Cons:
- expensive to get all attributes of a whole entity (JOIN = slow)
PKs/FKs: Ideally you have one PK Sequence used by all tables. (i.e. Oracle Sequence) In the subtype-tables the pk-column is at the same time a fk-column to the generalization table. This may be a bit complicated in DBMS where you do not have something like Oracle Sequences. There it might be necessary to have seperate pk and fk columns in the subtypes and an own PK sequence for every table. (i.e. because sequence gen is a column property)
Which one you choose depends on your requirements. 1.) and 3.) you will see very often where cases for 2.) are very rare and many developers do not know about this design.
Personally, when I do not have other boundary conditions I choose 3, because I find this the cleanest solution. But I also choosed option 1 in the past due to performance reasons. Cannot remember that I've done something like option 2 since university. :p
精彩评论