We're starting to load up a data warehouse with data from event logs. We have a normal star schema where a row in the fact table represents one event. Our dimension tables are a typical combination of user_agent, ip, referal, page, etc. One dimension table looks like this:
create table referal_dim (
    id integer,
    domain varchar(255),
    subdomain varchar(255),
    page_name varchar(4096),
    query_string varchar(4096),
    path varchar(4096)
)
Where we autogenerate the id to eventually join against the fact table. My question is: what's the best way to identify duplicate records in our bulk load process? We upload all the records from a log file into temp tables before doing the actual insert into the persistent store; however, the id is just auto-incremented, so two identical dim records from two different days would end up with different ids. Would creating a hash of the value columns and comparing on that be appropriate? It seems like comparing on each value column individually would be slow. Are there any best practices for a situation like this?
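For concreteness, this is the kind of flow I have in mind (a rough sketch in PostgreSQL syntax; the staging table referal_stage and the row_hash column are names I've made up for illustration, and id is assumed to be filled in by a sequence default on referal_dim):

-- One-time: add a hash of the value columns to the dimension and index it.
-- The '|' delimiter and coalesce() guard against NULLs and against
-- column-boundary collisions ('ab'||'c' vs. 'a'||'bc').
alter table referal_dim add column row_hash char(32);

update referal_dim
   set row_hash = md5(coalesce(domain, '')       || '|' ||
                      coalesce(subdomain, '')    || '|' ||
                      coalesce(page_name, '')    || '|' ||
                      coalesce(query_string, '') || '|' ||
                      coalesce(path, ''));

create index referal_dim_row_hash_idx on referal_dim (row_hash);

-- Each load: hash every distinct staged row once, then anti-join on the
-- hash so only genuinely new dimension rows are inserted.
insert into referal_dim (domain, subdomain, page_name, query_string, path, row_hash)
select domain, subdomain, page_name, query_string, path, row_hash
from (
    select distinct domain, subdomain, page_name, query_string, path,
           md5(coalesce(domain, '')       || '|' ||
               coalesce(subdomain, '')    || '|' ||
               coalesce(page_name, '')    || '|' ||
               coalesce(query_string, '') || '|' ||
               coalesce(path, '')) as row_hash
    from referal_stage
) s
where not exists (
    select 1 from referal_dim d where d.row_hash = s.row_hash
);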
An auto-increment integer for a surrogate PK is OK, but (according to Mr. Kimball) a dimension table should also have a natural key. So a hashed NaturalKey column would be in order; a Status column for "current" or "expired" may also be useful to allow for SCD type 2.
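Something along these lines (a sketch only; the column names natural_key and status are assumptions, and the natural-key hash should cover the columns that identify the member, not any attributes you expect to track changes on):

create table referal_dim (
    id           integer primary key,   -- surrogate key, sequence-generated
    natural_key  char(32) not null,     -- e.g. md5 hash of the identifying columns
    domain       varchar(255),
    subdomain    varchar(255),
    page_name    varchar(4096),
    query_string varchar(4096),
    path         varchar(4096),
    status       varchar(8) not null
                 default 'current'      -- 'current' or 'expired' (SCD type 2)
);
create index referal_dim_nk_idx on referal_dim (natural_key, status);

-- When an attribute of an existing member changes, expire the old row and
-- insert a fresh 'current' row with the same natural_key but a new surrogate id:
update referal_dim
   set status = 'expired'
 where natural_key = '<hash of the incoming row''s identifying columns>'
   and status = 'current';

The fact table always joins on the surrogate id, so expired rows keep old facts pointing at the attribute values that were current when those events happened.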