Strategy to avoid OutOfMemoryException during ETL in .NET

I have written an ETL process that needs to process more than 100 million rows overall, covering two years' worth of records. To avoid out-of-memory issues, we chunk the data loading down to 7-day ranges. For each chunk, the process loads all the required reference data, then opens a SQL connection, loads the source data one row at a time, transforms it, and writes it to the data warehouse.
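For context, a minimal sketch of the 7-day chunked loop described above; the delegates (loadReferenceData, loadSourceRows, transformRow, writeToWarehouse) are hypothetical placeholders for the real data-access and transformation code, which is not shown here:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the 7-day chunked ETL loop described in the question.
static class ChunkedEtl
{
    public static void Run<TRef, TRow, TResult>(
        DateTime start,
        DateTime end,
        Func<DateTime, DateTime, TRef> loadReferenceData,
        Func<DateTime, DateTime, IEnumerable<TRow>> loadSourceRows,
        Func<TRow, TRef, TResult> transformRow,
        Action<TResult> writeToWarehouse)
    {
        for (var chunkStart = start; chunkStart < end; chunkStart = chunkStart.AddDays(7))
        {
            var chunkEnd = chunkStart.AddDays(7) < end ? chunkStart.AddDays(7) : end;

            // Reference data is reloaded (and released) per chunk to bound memory use.
            TRef referenceData = loadReferenceData(chunkStart, chunkEnd);

            // Source rows are streamed and transformed one at a time.
            foreach (var row in loadSourceRows(chunkStart, chunkEnd))
            {
                writeToWarehouse(transformRow(row, referenceData));
            }
        }
    }
}
```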

The drawback of processing the data in chunks is that it is slow.

This process has been working fine for most of the tables, but for one table I still run into out-of-memory errors because the process loads too much reference data. I would like to avoid chunking the data down to 3 days so that performance stays decent.

Are there any other strategies I can use to avoid the OutOfMemoryException?

For example: a local database, writing the reference data to files, spawning another .NET process to hold more memory in Windows, using a CLR stored procedure to do the ETL...

Environment: Windows 7 32-bit OS, 4 GB of RAM, SQL Server Standard Edition.

The only solution I can see is to use a stored procedure and let SQL Server handle the ETL. However, I am trying to avoid that because the program needs to support Oracle as well. Other performance improvements I have tried are adding indexes to speed up the loading queries and creating a custom data access class that loads only the necessary columns instead of loading the entire row into memory.
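As an illustration of that column-trimming approach, a minimal sketch assuming SQL Server via System.Data.SqlClient; the table, column, and class names are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Narrow data access: only the columns the transformation needs are selected,
// and rows are streamed through a reader instead of being materialized.
sealed class SourceOrder
{
    public long OrderId { get; set; }
    public decimal Amount { get; set; }
    public DateTime OrderDate { get; set; }
}

static class NarrowLoader
{
    public static IEnumerable<SourceOrder> LoadOrders(string connectionString, DateTime from, DateTime to)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT OrderId, Amount, OrderDate FROM dbo.SourceOrders " +
            "WHERE OrderDate >= @from AND OrderDate < @to", connection))
        {
            command.Parameters.Add("@from", SqlDbType.DateTime).Value = from;
            command.Parameters.Add("@to", SqlDbType.DateTime).Value = to;

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    yield return new SourceOrder
                    {
                        OrderId = reader.GetInt64(0),
                        Amount = reader.GetDecimal(1),
                        OrderDate = reader.GetDateTime(2)
                    };
                }
            }
        }
    }
}
```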

Thanks


Without knowing exactly how you process the data it is hard to say, but a naive solution that can be applied in any case is to use a 64-bit OS and compile your application as 64-bit. In 32-bit mode the .NET heap will only grow to about 1.5 GB, which might be what is limiting you.
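As a quick sanity check that the build is actually running as a 64-bit process (a 64-bit OS alone is not enough if the executable is forced to x86), something like this can be logged at startup:

```csharp
using System;

// Reports whether the current process and OS are 64-bit.
static class BitnessCheck
{
    public static void Report()
    {
        Console.WriteLine("64-bit OS:      {0}", Environment.Is64BitOperatingSystem);
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
        Console.WriteLine("Pointer size:   {0} bytes", IntPtr.Size);
    }
}
```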


I know this is an old post, but this is for people searching for better ways to write data-processing operations in a programming language.

I am not sure if you have considered studying how ETL tools perform their data-loading operations and replicating a similar strategy in your code.

One such technique is parallel data pipes. Each pipe performs the ETL on a single chunk, based on a partitioning of the source data; for example, you could spawn processes for different weeks of data in parallel. This does not by itself solve the memory issue within a single process, but it helps once you hit the memory-allocation limit of a single process's heap, and it also lets you read the data in parallel with random access. It does require a master process to coordinate the pipes and complete them as a single ETL operation, as in the sketch below.
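A minimal sketch of the partition-and-parallelize idea, using tasks within one process rather than separate OS processes; processPartition is a hypothetical placeholder for the per-chunk ETL work:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Partitions the date range into weeks and processes the partitions in parallel.
static class ParallelPipes
{
    public static void Run(DateTime start, DateTime end, Action<DateTime, DateTime> processPartition)
    {
        var partitions = new List<Tuple<DateTime, DateTime>>();
        for (var s = start; s < end; s = s.AddDays(7))
        {
            partitions.Add(Tuple.Create(s, s.AddDays(7)));
        }

        // Limit the degree of parallelism so the pipes do not exhaust memory or connections.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
        Parallel.ForEach(partitions, options, p => processPartition(p.Item1, p.Item2));
    }
}
```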

I assume your transformation performs a lot of lookup operations before finally writing the data to the database, and that the master transaction table is huge while the reference data is small. In that case you need to focus on data structures and algorithms. There are a few tips below; refer to the characteristics of your data before choosing what suits your algorithm best.

Generally, lookup data (reference data) is stored in a cache. Choose a simple data structure that is efficient for read and search operations, such as an array list. If possible, sort this array by the key you will join on so that the search can be efficient, as in the sketch below.
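A minimal sketch of such a sorted lookup cache, assuming a simple long key and string value; the shape of the reference row is illustrative only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Load the reference data once, sort it by the join key, then binary-search
// per lookup instead of scanning linearly.
sealed class ReferenceCache
{
    private readonly long[] _keys;
    private readonly string[] _values;

    public ReferenceCache(IEnumerable<KeyValuePair<long, string>> rows)
    {
        var sorted = rows.OrderBy(r => r.Key).ToArray();
        _keys = sorted.Select(r => r.Key).ToArray();
        _values = sorted.Select(r => r.Value).ToArray();
    }

    public bool TryLookup(long key, out string value)
    {
        int index = Array.BinarySearch(_keys, key);
        if (index >= 0)
        {
            value = _values[index];
            return true;
        }
        value = null;
        return false;
    }
}
```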

There are different strategies for the lookup operations in your transformation tasks; in the database world these are join operations.

Merge join: ideal when the sources are already sorted on the join key. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans encounter matching sets at the same time. For sample code, see https://en.wikipedia.org/wiki/Sort-merge_join
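A minimal generic sketch of a sort-merge join over two inputs that are already sorted ascending by the join key (an illustration of the algorithm, not code from the question):

```csharp
using System;
using System.Collections.Generic;

// Merge-joins two key-sorted sequences, yielding one pair per matching key
// combination. Duplicate keys on the right are buffered so every left row
// meets every matching right row.
static class MergeJoin
{
    public static IEnumerable<Tuple<TLeft, TRight>> Join<TLeft, TRight, TKey>(
        IEnumerable<TLeft> left, IEnumerable<TRight> right,
        Func<TLeft, TKey> leftKey, Func<TRight, TKey> rightKey)
        where TKey : IComparable<TKey>
    {
        using (var l = left.GetEnumerator())
        using (var r = right.GetEnumerator())
        {
            bool hasLeft = l.MoveNext();
            bool hasRight = r.MoveNext();

            while (hasLeft && hasRight)
            {
                int cmp = leftKey(l.Current).CompareTo(rightKey(r.Current));
                if (cmp < 0) { hasLeft = l.MoveNext(); }
                else if (cmp > 0) { hasRight = r.MoveNext(); }
                else
                {
                    // Buffer the run of right rows sharing this key.
                    TKey key = rightKey(r.Current);
                    var matches = new List<TRight>();
                    while (hasRight && rightKey(r.Current).CompareTo(key) == 0)
                    {
                        matches.Add(r.Current);
                        hasRight = r.MoveNext();
                    }
                    // Pair every left row with that key against the buffered matches.
                    while (hasLeft && leftKey(l.Current).CompareTo(key) == 0)
                    {
                        foreach (var m in matches)
                            yield return Tuple.Create(l.Current, m);
                        hasLeft = l.MoveNext();
                    }
                }
            }
        }
    }
}
```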

Nested loop join: works like a nested loop, where each value of the outer loop's index is taken as a limit (or starting point, or whatever is applicable) for the inner loop's index, and the corresponding actions are performed on the statements inside the inner loop. So if the outer loop executes R times and, for each such execution, the inner loop executes S times, the total cost, or time complexity, of the nested loop join is O(RS).

Nested-loop joins provide efficient access when tables are indexed on the join columns. Furthermore, in many small transactions, such as those affecting only a small set of rows, index nested-loop joins are far superior to both sort-merge joins and hash joins.
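For comparison, a minimal sketch of a plain nested-loop join; with an index or keyed lookup on the inner side it becomes an index nested-loop join:

```csharp
using System;
using System.Collections.Generic;

// For each outer row, scan all inner rows and emit the matching pairs: O(R * S).
static class NestedLoopJoin
{
    public static IEnumerable<Tuple<TOuter, TInner>> Join<TOuter, TInner>(
        IEnumerable<TOuter> outer, IList<TInner> inner,
        Func<TOuter, TInner, bool> matches)
    {
        foreach (var o in outer)          // executes R times
        {
            foreach (var i in inner)      // executes S times per outer row
            {
                if (matches(o, i))
                    yield return Tuple.Create(o, i);
            }
        }
    }
}
```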

I am only describing two methods that can be used for your lookup operation. The main idea to remember is that ETL is all about looking up and retrieving the tuples (as a set) for further operation. The search is based on a key, and the resulting transaction keys are used to extract all the relevant records (a projection). Take this set and load the rows from the file in one read operation. This is more of a suggestion in case you don't need all the records for the transformation operations.

Another very costly operation is writing back to the database. There may be a tendency to process the extraction, transformation, and loading one row at a time. Think about operations that can be vectorized, performing them in bulk as a data structure operation: for example, a lambda operation over a multi-dimensional vector rather than looping over every row one at a time and applying the transformations across all columns for each row. We can then write this vector to a file or to the database in one go, which avoids memory pressure.
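A minimal sketch of batching the load side with SqlBulkCopy instead of issuing one INSERT per row; SqlBulkCopy is SQL Server specific (the Oracle provider has its own bulk/array-binding APIs), and the destination table name is illustrative:

```csharp
using System.Data;
using System.Data.SqlClient;

// Accumulate rows into a DataTable, then write them to the server in bulk.
static class BulkWriter
{
    public static void WriteBatch(string connectionString, DataTable batch)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.FactSales";
            bulkCopy.BatchSize = 5000;            // rows per round trip
            bulkCopy.WriteToServer(batch);
        }
    }
}
```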


This was a very old question, and it is more of a design question, so I am sure there are many solutions to it unless I get into more specific details.

Ultimately, I wrote a SQL stored procedure using MERGE to handle the ETL process for the data type that took too long to process through the C# application. In addition, the business requirements changed such that we dropped Oracle support and only support 64-bit servers, which reduced maintenance cost and avoided the ETL out-of-memory issue.

In addition, we added many indexes whenever we saw an opportunity to improve query performance.

Instead of chunking only by date range, the ETL process also chunks the data by row count (5,000) and commits each batch in its own transaction. This reduced the transaction log file size, and if the ETL fails, the process only needs to roll back a subset of the data.
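A minimal sketch of that count-based batching, with commitBatch standing in for the real write-plus-commit step:

```csharp
using System;
using System.Collections.Generic;

// Accumulates rows and commits them every batchSize records, so a failure
// only rolls back the current batch.
static class CountChunking
{
    public static void Process<T>(IEnumerable<T> rows, int batchSize, Action<IList<T>> commitBatch)
    {
        var batch = new List<T>(batchSize);
        foreach (var row in rows)
        {
            batch.Add(row);
            if (batch.Count >= batchSize)
            {
                commitBatch(batch);   // write + commit this transaction
                batch.Clear();
            }
        }
        if (batch.Count > 0)
        {
            commitBatch(batch);       // final partial batch
        }
    }
}
```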

Lastly, we implemented (key, value) caches so that frequently referenced data within the ETL date range is held in memory, reducing database querying.
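A minimal sketch of such a (key, value) cache; the loader delegate stands in for the actual reference-data query and is only hit on a cache miss:

```csharp
using System;
using System.Collections.Generic;

// In-memory key/value cache for reference data with lazy loading on miss.
sealed class ReferenceDataCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _cache = new Dictionary<TKey, TValue>();
    private readonly Func<TKey, TValue> _loadFromDatabase;

    public ReferenceDataCache(Func<TKey, TValue> loadFromDatabase)
    {
        _loadFromDatabase = loadFromDatabase;
    }

    public TValue Get(TKey key)
    {
        TValue value;
        if (!_cache.TryGetValue(key, out value))
        {
            value = _loadFromDatabase(key);  // database hit only on cache miss
            _cache[key] = value;
        }
        return value;
    }
}
```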
