I am trying to develop a piece of code in Java that can process large amounts of data fetched by a JDBC driver from an SQL database and then persist the results back to the DB.
I thought of creating a manager containing one reader thread, one writer thread and a customizable number of worker threads processing data. The reader thread would read data into DTOs and pass them to a queue labeled 'ready for processing'. Worker threads would process the DTOs and put the processed objects on another queue labeled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching data? Are there any ready-made libraries in Java for doing this sort of thing that I am not aware of?
Whether your proposed approach is optimal depends crucially on how expensive the processing is relative to reading the data from the DB and writing the results back. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism, which may or may not be significant to the overall throughput).
The only way to be sure is to benchmark the three stages separately, and then decide on the optimal design.
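As a rough illustration of timing the stages in isolation, here is a minimal sketch. The class `StageTimer` and the helper `simulateWork` are hypothetical names, and the sleeps are placeholders for your real JDBC read, processing, and write code. Plain `System.nanoTime()` timing ignores JIT warm-up, so treat the numbers as indicative only; a harness such as JMH is better suited to fine-grained measurements.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StageTimer {
    /** Runs one stage and returns its wall-clock duration in milliseconds. */
    static long timeMillis(Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Placeholder for real work (assumption: swap in the actual read/process/write code).
    static void simulateWork(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the stages in insertion order when printing.
        Map<String, Runnable> stages = new LinkedHashMap<>();
        stages.put("read", () -> simulateWork(30));
        stages.put("process", () -> simulateWork(60));
        stages.put("write", () -> simulateWork(30));
        stages.forEach((name, stage) ->
                System.out.println(name + ": " + timeMillis(stage) + " ms"));
    }
}
```

If one stage clearly dominates, that is the stage worth parallelizing; if reading and writing dominate, extra worker threads are unlikely to help much.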
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
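A bounded version of the two-queue design could look roughly like the sketch below. It assumes the DTOs are plain strings, uses `toUpperCase` as a stand-in for the real processing, and collects results in a list instead of writing to a database; `BoundedPipeline`, `run`, and the `POISON` end-of-stream marker are names made up for this example. `ArrayBlockingQueue.put` blocks when the queue is full, so a fast reader cannot fill the heap while the workers or the writer lag behind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedPipeline {
    // End-of-stream marker ("poison pill"); compared by identity, never by equals().
    private static final String POISON = new String("POISON");

    /** Assumes workers >= 1 and capacity >= 1. */
    static List<String> run(List<String> rows, int workers, int capacity) {
        BlockingQueue<String> readyForProcessing = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<String> readyForPersistence = new ArrayBlockingQueue<>(capacity);
        List<String> persisted = new ArrayList<>(); // stands in for the DB; touched only by the writer

        List<Thread> threads = new ArrayList<>();
        threads.add(new Thread(() -> { // reader
            try {
                for (String row : rows) readyForProcessing.put(row); // blocks while the queue is full
                for (int i = 0; i < workers; i++) readyForProcessing.put(POISON); // one pill per worker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        for (int i = 0; i < workers; i++) {
            threads.add(new Thread(() -> { // worker
                try {
                    String row;
                    while ((row = readyForProcessing.take()) != POISON) {
                        readyForPersistence.put(row.toUpperCase(Locale.ROOT)); // placeholder processing
                    }
                    readyForPersistence.put(POISON); // tell the writer this worker is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        threads.add(new Thread(() -> { // writer
            try {
                int finishedWorkers = 0;
                while (finishedWorkers < workers) {
                    String item = readyForPersistence.take();
                    if (item == POISON) finishedWorkers++;
                    else persisted.add(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        threads.forEach(Thread::start);
        try {
            for (Thread t : threads) t.join(); // join() makes the writer's updates visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return persisted;
    }

    public static void main(String[] args) {
        // Output order can vary between runs because the workers run concurrently.
        System.out.println(run(Arrays.asList("a", "b", "c", "d"), 2, 2));
    }
}
```

The queue capacity is a tuning knob: too small and the stages stall waiting on each other, too large and you lose the memory protection that bounding was meant to provide.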
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
- It eliminates the need to extract large amounts of data
- It eliminates the need to persist large amounts of data
- It enables set-based processing (which typically outperforms row-by-row procedural processing)
- If your database supports it, you can make use of parallel execution
- It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example: a long time ago I implemented a Java program whose purpose was to load purchases, payments and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one by one and, for each piece of data, perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally, this did not scale once the volume increased.
Then I made another mistake. I decided that the database was the problem (because I had heard that SELECT is slow), so I pulled all the data out of the database, did ALL the processing in Java, and finally persisted all the data back to the database. I implemented all kinds of layers with callback mechanisms to make the load process easy to extend, but I just couldn't get it to perform well.
Looking in the rear-view mirror, what I should have done was insert the (laughably small amount of) 100,000 rows temporarily into a table and process them from there. What took nearly half a day to process would have taken a few minutes at most if I had played to the strengths of all the technologies I had at my disposal.
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
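A minimal sketch of that idea, assuming each unit of work can be expressed as an independent `Callable`. The class `ExecutorProcessing`, the method `processAll`, and the squaring step are placeholders invented for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExecutorProcessing {
    /** Processes every input on a fixed-size pool and returns results in input order. */
    static List<Integer> processAll(List<Integer> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (Integer value : inputs) {
                tasks.add(() -> value * value); // placeholder for the real processing step
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll waits for all tasks and returns futures in the same order as the tasks.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // let the worker threads exit once the work is done
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList(1, 2, 3, 4), 2));
    }
}
```

One caveat: the pool returned by `Executors.newFixedThreadPool` queues pending tasks in an unbounded queue, so if you submit tasks faster than they complete you may still want to throttle submission yourself.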
You're describing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing with it. It provides parallel and multithreaded processing, several different database readers and writers, and a whole bunch of other stuff.
Use Spring Batch! That is exactly what you need.