I am designing a data warehouse and I have a sticky issue with time. The grain I need is hourly (to calculate aggregate counts of events per hour) and I also have to accommodate a shift pattern that does not conveniently fit inside a 24 hour period (in fact it is possible that 'blue' shift wont cover the same time of day for several days).
With this in mind I am contemplating one of 3 approaches
- a single time dimension with 175K rows in it.
- a snowflake time dimension with 7300 rows in a calendar dimension and 175k rows in a time dimension
- separate dimensions so that the fact table has foreign keys for the event date and for the event time.
I am tendin开发者_如何学Pythong towards approach 3 as it allows the small calendar dimension to be referenced separately in joins, but I would appreciate any thoughts.
Yes, manufacturing shifts are tricky and do change over time, often one shift starts day before, etc.
Keep in mind that there are two calendars here. One is the standard calendar and the other is the production calendar -- the shift belongs to the production calendar. In general, a day in production calendar may last more (or less) than 24 hours.
For example:
Part produced on Monday, 2011-02-07 23:45 may look like
TimeOfProduction = '2011-02-07 23:45'
DateKey = 20110207
TimeKey = 2345
ProductionDateKey = 20110208 (the first shift of the next day started at 22:00)
ProductionTimeKey = 145 (1 hour and 45 minutes of the current production date)
ShiftKey = 1
ShiftTimeKey = 145 (1 hour and 45 minutes of the current shift)
So, my suggestion is:
- Plain
Date Dimension
(one row per date) - Plain
Time Dimension
(one row per minute for 24 hours = 1440 rows + see note below) Shift Dimension
-- type 2 dimension withrw_ValidFrom, (rw_ValidTo) , rw_IsCurrent
- Role-play the
DateKey
intoProductionDateKey
- Role-play the
TimeKey
into aProductionTimeKey
andShiftTimeKey
. - Keep the
TimeOfProduction (datetime)
in the fact table too. - During the ETL process, apply the current shift logic to attach
ProductionDateKey, ProductionTimeKey, ShiftKey, ShiftTimeKey
to each row of thefactPart
table.
Note that you may need to add extra rows to the Time Dimension
if a production day can last more than 24 hours. It usually can if a local time is used and there is a daylight savings time jump.
So, the star may look something like this
My £0.02 for what it is worth:
Assuming that there is no additional issue arising from consideration of the shift (@Andriy M's question):
I would tend to discount option 2 unless there is a specific benefit (performance, simplification of a class of query, etc.) you can see from adopting it. You do not describe any such benefit, so it seems that you are adding complexity for its own sake.
My personal preference would be for option 1 - conceptually the simplest, the most direct, and the (IMO) best fit to data warehouse approaches.
Option 3 has the advantages you mention, but I have the nagging suspicion that it covers two alternatives: in both the calendar dimension is as you describe it, but the choices for the time dimension are 175k rows, or 24. I cannot at present provide arguments for either of these alternatives, only a gut feeling that there are two such choices. If the shift issue IS relevant here, it might influence the choice between these alternatives (if they are genuine alternatives).
If you wish to take option 2 further, the alternatives set out for option 3 are also relevant.
I would choose option 3. - Separate dimensions. Benefits:
Simplicity - two relatively small tables - with Time dimension loaded only once as there's fixed number of minutes in a day.
Reuse - two separete dimensions are more likely to be shared with other fact tables that can have only Date or Time dimension
Easy partitioning by having separate attribute for Date dimension in a fact table
Extensibility - think of attributes you could add to Date and Time dimensions as your reporting needs grow. For a Date dimension this could be (to avoid extracting this information each time from date): year, quarter, month, day, week, date label (like "12th September 2011"), month name, weekday name, various indicators (holiday indicator, end of quarter, end of month, etc.). For a Time dimension (which could - for accuracy - contain each second of a day) this could be: hour, minute, second, day part label (like "morning", "evening"), working time indicator (seconds from 8:00:00 to 17:00:00), etc. But having it all in just one dimension would mean a lot of redundancy.
Shifts that are not aligned with day start / end look to me as a good candidate for a separate fact fable recording start and end timestamp for each shift - I mean (factless) fact table with the following foreign keys: id_date_start, id_time_start, id_date_end, id_time_end. Then you can "drill-across" from the events fact table to the shifts table to get aggregate results for each shift.
Edit: Or model shifts just as another dimension - that depends on the fact if for you shift is an important business process that you want to track independently with its attributes (but at the moment I can't think of any other attributes then Date & Time... Location, perhaps?) or if it's just a context of an event (and therefore should be just a dimension).
精彩评论