I'm developing a data warehouse and have come up against a problem I'm not sure how to fix. The current schema is defined below:
DimInstructor <- Dimension table for instructors DimStudent <- Dimension table for students
I want to implement a scenario whereby if details of an instructor change in my OLTP database, I want to add a new record in the DimInstructor table for historical reporting reasons.
Now, I'm wanting to create a lesson dimension table called DimLesson. In DimLesson I want to create a reference to the instructor.
The DimInstructor table contains:
InstructorDW开发者_Python百科ID <- Identity field when entered into DW InstructorID <- The instructor ID that has come from the OLTP database
Now, I can't make InstructorID a primary key because it isn't guaranteed to be unique (if the instructor changes their name, there will be 2 records in the DW with the same InstructorID value).
So my question is, how do I reference the instructor from DimLesson? Do I use the InstructorDWID? If so, should I have 2 entries for an instructor in DimInstructor, it would make queries more complicated when I'm wanting to look at all lessons by a specific instructor.
Any help would be appreciated!
What you are describing here is usually called type 2 dimension. Kimball data warehouse books have whole sections on type 2 dimensions and ETL for the type -- do read.
The first thing to understand is the difference between the primary key and the business key. The primary key uniquely identifies a row in the table, while the business key uniquely identifies an entity that the table describes, like an instructor. For example, if an instructor changes name, the dimInstructor table may look something like:
InstructorKey InstructorBusinessKey FirstName LastName row_ValidFrom row_ValidTo row_Status
1234 jane_doe_7211 Jane Doe 2000-03-11 2010-08-12 expired
7268 jane_doe_7211 Jane Smith 2010-08-12 3000-01-01 current
Now, providing that the dimLesson is proper design for your business model (as opposed to having some kind of fact) the dimLesson would have a column called InstructorKey
. During ETL process, when delivering the new row (7258) to the dimInstructor table, replace all references to row 1234 in the dimLesson with 7268 .
Paul,
There are multiple ways you can handle this. You can use an effective date/inactive date, sequence number or a version number to differentiate the records with the same InstructorID.
The DIM that captures all relevant details would be like..
create table DIM_INSTRUCTOR(
instr_guid number, --populated through a sequence -----Composite pk-Part1
istr_oid number, --direct id from the OLTP system -----cmposite pk-part2
instr_name number,
other_attr varchar2(25),
eff_date date,
expiration_date date
);
instr_guid is directly generated from a sequence and is independent of the OLTP system.
This would let you capture all the details for a given instructor. You can use just the instr_guid as the foreign key to the fact table, but including both of them (instr_guid,instr_guid) would increase the ease of querying .. which is one of the goals of Datawarehousing.
Useful Links:
http://en.wikipedia.org/wiki/Surrogate_key http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2
Use a guid/uuid as the primary key or a combination of columns
精彩评论