I have to download 2.5k+ files using curl. I'm using Drupal's built-in Batch API to fire the curl script without it timing out, but it's taking well over 10 minutes to grab and save the files.
Add to that the processing of the actual files, and the potential runtime of this script is around 30 minutes. Server performance isn't an issue, as both the dev/staging and live servers are more than powerful enough.
I'm looking for suggestions on how to improve the speed. The overall execution time isn't too big of a deal, as this is meant to be run once, but it would be nice to know the alternatives.
Let's assume for a second that the problem is end-to-end latency, not bandwidth or CPU. Latency in this case comes from making a system call out to curl, building up the HTTP connection, requesting the file and tearing down the connection.
One approach is to shard out your requests and run them in parallel. You mention Drupal, so I assume you're talking about PHP here. Let's also assume that the 2.5k files are listed in an array in URL form. You can do something like this:
<?php
$urls = array(...);
$workers = 4;
$shard_size = (int) floor(count($urls) / $workers);

for ($i = 0; $i < $shard_size; $i++) {
    // Kick off $workers - 1 downloads in the background...
    for ($j = 0; $j < $workers - 1; $j++) {
        $url = escapeshellarg($urls[$i * $workers + $j]);
        system("curl -sO " . $url . " > /dev/null 2>&1 &");
    }
    // ...and run the last one in the foreground so each batch paces itself.
    $url = escapeshellarg($urls[$i * $workers + $workers - 1]);
    system("curl -sO " . $url);
}
?>
This is pretty lame, but you get the idea. It forks off $workers - 1 subprocesses to fetch files in the background and runs the last worker in the foreground, so you get some pacing. It should scale roughly linearly with the number of workers. It does not take into account the edge case where the size of the data set doesn't divide evenly by the number of workers. I bet you can take this approach and make something reasonably fast.
Curl also supports requesting multiple files on the same command line, but I don't know if it's smart enough to reuse an existing HTTP connection. It might be.
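If you want to test that, a rough sketch (the example.com URLs are just placeholders) is to build a single curl invocation with one -O per URL and check with -v whether the connection actually gets reused:
<?php
// Hypothetical batch of URLs; -O saves each file under its remote name.
$batch = array(
    'http://example.com/files/1.jpg',
    'http://example.com/files/2.jpg',
    'http://example.com/files/3.jpg',
);
$cmd = 'curl -s';
foreach ($batch as $url) {
    $cmd .= ' -O ' . escapeshellarg($url);
}
system($cmd);
?>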
After playing around with a few different methods, I came to the conclusion that you just have to bite the bullet and go for it.
The script takes a while to run, but it has a lot of data to churn through.