Which is better, ETL or ELT? [closed]

Developer https://www.devze.com 2023-01-04 05:40 Source: Internet

Closed. This question is opinion-based. It is not currently accepting answers.

Closed 8 years ago.

Having spent some time working on data warehousing, I have created both ETL (extract, transform, load) and ELT (extract, load, transform) processes. ELT seems to be the newer approach to populating data warehouses, and it can more easily take advantage of cluster computing resources. I would like to hear what other people think the advantages of ETL and ELT are relative to each other, and when you should use one or the other.

Which is better is hard to answer -- it depends on the problem.

I prefer multi-step ETL -- ECCD (Extract, Clean, Conform, Deliver) whenever possible. I also keep intermediate CSV files after each extract, clean, and conform step; this takes some disk space, but is quite useful. Whenever the DW has to be reloaded due to bugs in the ETL or DW schema changes, there is no need to query the source systems again -- the data is already in flat files. It is also quite convenient to be able to grep, sed and awk through flat files in the staging area when needed. When several source systems feed into the same DW, only the extract steps have to be developed (and maintained) for each source system -- the clean, conform, and deliver steps are all common.
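To make the staged approach above concrete, here is a minimal Python sketch that writes one CSV file per step. The column names (cust_id, name, country), the file names, and the country mapping are all made up for illustration; a real pipeline would have its own, and the deliver step would be a bulk load into the DW rather than a function returning rows.

```python
import csv
from pathlib import Path

FIELDS = ["cust_id", "name", "country"]  # hypothetical columns


def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def extract(source_rows, staging):
    # Source-specific step: dump the source rows unchanged.
    path = staging / "01_extract.csv"
    write_csv(path, source_rows)
    return path


def clean(extract_path, staging):
    # Common step: trim whitespace and drop rows without a key.
    rows = [{k: v.strip() for k, v in r.items()} for r in read_csv(extract_path)]
    rows = [r for r in rows if r["cust_id"]]
    path = staging / "02_clean.csv"
    write_csv(path, rows)
    return path


def conform(clean_path, staging):
    # Common step: map source values onto warehouse conventions.
    codes = {"germany": "DE", "de": "DE", "france": "FR", "fr": "FR"}
    rows = read_csv(clean_path)
    for r in rows:
        r["country"] = codes.get(r["country"].lower(), "UNKNOWN")
    path = staging / "03_conform.csv"
    write_csv(path, rows)
    return path


def deliver(conform_path):
    # Stand-in for the DW load: return the rows that would be loaded.
    return read_csv(conform_path)


def run_pipeline(source_rows, staging_dir):
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    return deliver(conform(clean(extract(source_rows, staging), staging), staging))
```

Because every step reads the previous step's file, a reload after a bug fix in conform can restart from 02_clean.csv without touching the source system.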

So after having played thoroughly with both ETL and ELT, I have come to the conclusion that you should avoid ELT at all costs. ETL prepares the data for your warehouse before you actually load it in. ELT, however, loads the raw data into the warehouse and you transform it in place. That is problematic if you have a busy data warehouse. If a reporting query is running on a table that you are attempting to update, your update will get blocked. Consequently, it is possible for reporting queries to hold up or block updates.

Now some of you might say reporting queries do not need to block an update and you can set your isolation level to allow for dirty reads. Reporting queries, however, are not generally executed by software engineers. They are executed by business users, so you can't rely on them to set their isolation levels properly. Also, not all reports can tolerate dirty reads.

There are cases where ELT can work; however, introducing it to your data warehouse is dangerous, and consequently I recommend that, for your sanity and for maintainability, you avoid it.

I use both. It's simply a matter of convenience and functionality. It all depends on the case. Sometimes I do TEL - i.e. the transform is done in the source database (in a stored procedure or view) and then extracted and loaded directly.
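A rough sketch of what TEL can look like, using Python's sqlite3 module with two connections standing in for the source database and the warehouse. The table and view names (sales, v_daily_sales, fact_daily_sales) are invented for the example, and a view is only one option; a stored procedure in the source database would play the same role.

```python
import sqlite3


def tel_copy(source, target):
    # Transform: lives in the source database as a view.
    source.execute("""
        CREATE VIEW IF NOT EXISTS v_daily_sales AS
        SELECT sale_date, SUM(amount) AS total_amount
        FROM sales
        GROUP BY sale_date
    """)
    # Extract: read the already-transformed rows.
    rows = source.execute(
        "SELECT sale_date, total_amount FROM v_daily_sales ORDER BY sale_date"
    ).fetchall()
    # Load: write them straight into the target table.
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_daily_sales "
        "(sale_date TEXT PRIMARY KEY, total_amount REAL)"
    )
    with target:  # commits on success, rolls back on error
        target.executemany(
            "INSERT OR REPLACE INTO fact_daily_sales VALUES (?, ?)", rows
        )
    return len(rows)
```

The trade-off is that the aggregation runs on the source system, which is only acceptable if that system can spare the load.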

I prefer ELT. One can say it goes against the norm. It does require a change in mentality and design approach compared with traditional methods. But it utilizes existing hardware and skill sets, further reducing the cost and risk in the development process.

If we want to ensure referential integrity in the ETL approach, the data must be downloaded from the target to the ETL server (engine). But we don't need to do that in the ELT approach.
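As an illustration of the referential integrity point, here is a small sketch where the raw rows are loaded into a staging table and the check is done with a join inside the target database itself, so no dimension data is pulled out to a separate engine. It uses sqlite3 as a stand-in warehouse; the table names (dim_customer, stg_orders, fact_orders) are assumptions made for the example.

```python
import sqlite3


def load_and_transform(dw, raw_orders):
    dw.executescript("""
        CREATE TABLE IF NOT EXISTS dim_customer (customer_id INTEGER PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS stg_orders (
            order_id INTEGER, customer_id INTEGER, amount REAL);
        CREATE TABLE IF NOT EXISTS fact_orders (
            order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    """)
    with dw:  # one transaction for load + transform
        # Load: raw rows go into the warehouse untouched.
        dw.execute("DELETE FROM stg_orders")
        dw.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_orders)
        # Transform: only rows with a matching dimension key reach the fact table.
        dw.execute("""
            INSERT OR REPLACE INTO fact_orders
            SELECT s.order_id, s.customer_id, s.amount
            FROM stg_orders s
            JOIN dim_customer d ON d.customer_id = s.customer_id
        """)
        # Orphans: staged rows whose customer is missing from the dimension.
        orphans = dw.execute("""
            SELECT s.order_id
            FROM stg_orders s
            LEFT JOIN dim_customer d ON d.customer_id = s.customer_id
            WHERE d.customer_id IS NULL
            ORDER BY s.order_id
        """).fetchall()
    return [row[0] for row in orphans]
```

The function returns the order IDs that failed the check, so they can be reported or parked for a later run.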

To get the best from an ELT approach requires an open mind.