I'm getting my first exposure to data warehousing, and I’m wondering is it necessary to have foreign key constraints between facts and dimensions. Are开发者_JAVA百科 there any major downsides for not having them? I’m currently working with a relational star schema. In traditional applications I’m used to having them, but I started to wonder if they were needed in this case. I’m currently working in a SQL Server 2005 environment.
UPDATE: For those interested I came across a poll asking the same question.
Most data-warehouses (DW) do not have foreign keys implemented as constraints, because:
In general, foreign key constraint would trigger on: an insert into a fact table, any key-updates, and a delete from a dimension table.
During loading, indexes and constraints are dropped to speed-up the loading process, data integrity is enforced by the ETL application.
Once tables are loaded, DW is essentially read-only -- the constraint does not trigger on reads.
Any required indexes are re-built after the loading.
Deleting in a DW is a controlled process. Before deleting rows from dimensions, fact tables are queried for keys of rows to be deleted -- deleting is allowed only if those keys do not exists in any of fact tables.
Just in case, it is common to periodically run queries to detect orphan records in fact tables.
We use them, and we're happy with it.
Is it good practice to have foreign keys in a datawarehouse (relationships)?
There is overhead, but you can always disable the constraint during load and then re-enable it.
Having the constraint in place can catch ETL bugs and modelling defects.
I think in theory, you need that. But it depends on how you separate your data over database. If all of them in the same database, foreign key can help you because setting foreign key will help the database do selecting faster based on the indexing. If you share tables over many database, you need to check it on your application level
You can have your database check it for you but it can be slow. And generally, in data warehouse, we don't care about redundancy or integrity. We already have a lot of data and a few integrity and redundancy will not affect the general aggregate data
I don't know about necessary, but I feel they are good for data integrity reasons. You want to make sure that your fact table is always pointing to a valid record in the dimension table. Even if you are sure this will happen, why not have the database validate the requirement for you?
The reasons for using integrity constraints in a data warehouse are exactly the same as in any other database: to guarantee the integrity of the data. Assuming you and your users care about the data being accurate then you need some way of ensuring that it remains so and that business rules are being correctly applied.
As far as I know FKs, speed up queries. Also, many BI solutions exploit them in their integration layer. So for me they are a must in DWs.
Hope this thread is still active. My thinking is: for large fact tables with many dimensions and records, foreign keys will slow inserts and updates so that a fact table becomes too slow to load especially as it increases in size. Indexes are used for querying AFTER the table is loaded, so they can be disabled during inserts/updates and then rebuilt. The foreign key RELATION is important NOT the foreign key itself: this is really implicit in the ETL process. I have found that foreign keys make things TOO slow in the real world Datawarehouse. You need to use a VIRTUAL foreign key: the relation is their but not the constraint. If you damage the foreign key relations in a Datawarehouse you are doing something wrong. If you disable them during inserts and there is an mismatch or orphan, you won't be able to reenable them, so what's the point. The whole point of the DW is fast access and querying. Foreign keys make that impossible. Interesting debate: not easy to find this question on the Net Kev
精彩评论