I am developing a Java application that queries tables which may hold over 1,000,000 records. I have tried everything I can to be as efficient as possible, but I am only able to achieve, on average, about 5,000 records a minute, with a maximum of 10,000 at one point. I have tried reverse engineering the Data Loader, and my code seems very similar, but still no luck.
Is threading a viable solution here? I have tried this but with very minimal results.
I have been reading and have applied everything possible, it seems (compressing requests/responses, threads, etc.), but I cannot achieve Data Loader-like speeds.
Of note, the queryMore method seems to be the bottleneck.
Does anyone have any code samples or experiences they can share to steer me in the right direction?
Thanks
An approach I've used in the past is to query just for the IDs that you want (which makes the queries significantly faster). You can then parallelize the retrieve() calls across several threads.
That looks something like this:
[query thread] -> BlockingQueue -> [thread pool doing retrieve()] -> BlockingQueue
The first thread does query() and queryMore() as fast as it can, writing all ids it gets into the BlockingQueue. queryMore() isn't something you should call concurrently, as far as I know, so there's no way to parallelize this step. All ids are written into a BlockingQueue. You may wish to package them up into bundles of a few hundred to reduce lock contention if that becomes an issue. A thread pool can then do concurrent retrieve() calls on the ids to get all the fields for the SObjects and put them in a queue for the rest of your app to deal with.
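For concreteness, here's a minimal sketch of that pipeline, assuming the WSC Partner API (com.sforce.soap.partner). The field list, object name, bundle size, and pool size are illustrative placeholders, not part of any real schema:

```java
// Sketch of the query -> BlockingQueue -> retrieve() pipeline described above.
// Note: a single PartnerConnection may not be safe to share across threads;
// in practice, give each worker its own connection for the same session.
import com.sforce.soap.partner.PartnerConnection;
import com.sforce.soap.partner.QueryResult;
import com.sforce.soap.partner.sobject.SObject;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelRetrieve {
    private static final int WORKERS = 8;
    private static final int BUNDLE_SIZE = 200;                 // ids per retrieve() call
    private static final String FIELD_LIST = "Name, Industry"; // only the fields you need

    public static void run(PartnerConnection conn, BlockingQueue<SObject> out)
            throws Exception {
        BlockingQueue<String[]> idBundles = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

        // Worker threads: take a bundle of ids, fetch the full records.
        for (int i = 0; i < WORKERS; i++) {
            pool.submit(() -> {
                String[] ids;
                while ((ids = idBundles.take()).length > 0) {   // empty array = stop signal
                    for (SObject record : conn.retrieve(FIELD_LIST, "Account", ids)) {
                        out.put(record);
                    }
                }
                return null;
            });
        }

        // Single query thread: pull only Ids, as fast as queryMore() allows.
        List<String> bundle = new ArrayList<>(BUNDLE_SIZE);
        QueryResult qr = conn.query("SELECT Id FROM Account");
        while (true) {
            for (SObject record : qr.getRecords()) {
                bundle.add(record.getId());
                if (bundle.size() == BUNDLE_SIZE) {
                    idBundles.put(bundle.toArray(new String[0]));
                    bundle.clear();
                }
            }
            if (qr.isDone()) break;
            qr = conn.queryMore(qr.getQueryLocator());
        }
        if (!bundle.isEmpty()) idBundles.put(bundle.toArray(new String[0]));
        for (int i = 0; i < WORKERS; i++) idBundles.put(new String[0]); // stop workers
        pool.shutdown();
    }
}
```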
I wrote a Java library for using the SF API that may be useful. http://blog.teamlazerbeez.com/2011/03/03/a-new-java-salesforce-api-library/
With the Salesforce API, the batch size limit is what can really slow you down. When you use the query/queryMore methods, the maximum batch size is 2000. However, even though you specify 2000 as the batch size in your SOAP header, Salesforce may send smaller batches in response. Its batch size decision is based on server activity as well as the output of your original query.
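For reference, this is how the batch size hint is set with the WSC Partner API (a sketch; as noted above, 2000 is only a ceiling, and the SOQL here is illustrative):

```java
import com.sforce.soap.partner.PartnerConnection;
import com.sforce.soap.partner.QueryResult;
import com.sforce.ws.ConnectionException;

public class BatchSizeHint {
    static QueryResult firstBatch(PartnerConnection conn) throws ConnectionException {
        // Ask for up to 2000 records per query/queryMore round trip.
        // Salesforce treats this as a hint, not a guarantee, and may
        // return far smaller batches depending on the fields selected.
        conn.setQueryOptions(2000);
        return conn.query("SELECT Id, Name FROM Account");
    }
}
```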
I have noticed that if I submit a query that includes any "text" fields, the batch size is limited to 50.
My suggestion would be to make sure your queries pull only the data you need. A lot of Salesforce tables end up with many custom fields that aren't needed for every integration.
Salesforce documentation on this subject
We have about 14,000 records in our Accounts object, and it takes quite some time to get all of them. The query takes about a minute, but SF only returns batches of no more than 500, even though I set the batch size to 2000. Each queryMore operation also takes 45 seconds to a minute. This limitation is quite frustrating when you need to get bulk data.
Use the Bulk API to query any number of records from Java. I'm using it and it performs very effectively; you can get results back in seconds. The returned String is comma-separated. You can also keep batches at 10k records or fewer and read them either as CSV (using OpenCSV) or directly as a String.
Let me know if you need help with the code.
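As a starting point, here's a minimal Bulk API query sketch, assuming the WSC async classes (com.sforce.async). Connection setup and retry handling are omitted, and the SOQL and sleep interval are illustrative:

```java
// Submit a Bulk API query job, poll until it completes, and stream back
// the first CSV result set (very large queries may return several).
import com.sforce.async.*;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class BulkQueryExample {
    public static InputStream queryAccounts(BulkConnection bulk) throws Exception {
        JobInfo job = new JobInfo();
        job.setObject("Account");
        job.setOperation(OperationEnum.query);
        job.setContentType(ContentType.CSV);
        job = bulk.createJob(job);

        String soql = "SELECT Id, Name FROM Account";
        BatchInfo batch = bulk.createBatchFromStream(
                job, new ByteArrayInputStream(soql.getBytes("UTF-8")));

        // Poll until Salesforce finishes the batch server-side.
        while (true) {
            BatchInfo info = bulk.getBatchInfo(job.getId(), batch.getId());
            if (info.getState() == BatchStateEnum.Completed) break;
            if (info.getState() == BatchStateEnum.Failed)
                throw new RuntimeException(info.getStateMessage());
            Thread.sleep(5000);
        }

        String resultId = bulk.getQueryResultList(job.getId(), batch.getId())
                              .getResult()[0];
        InputStream csv = bulk.getQueryResultStream(job.getId(), batch.getId(), resultId);
        bulk.closeJob(job.getId());
        return csv;
    }
}
```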
Latency is going to be a killer in this type of situation, and the solution will be either multi-threading or asynchronous operations (using NIO). I would start by running 10 worker threads in parallel and see what difference it makes (assuming the back end supports simultaneous gets).
I don't have any concrete code I can provide here, sorry - just painful experience with API calls going over high-latency networks.