Using AWS for parallel processing with R

I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon Web Services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:

The segue package;
A presentation on parallel web-services using Amazon.

What I don't understand is:

How do I get the data into the ten nodes?
How do I send and execute the R functions on the nodes?

I would be very grateful if you could share suggestions and hints to point me in the right direction.

P.S. I am using the free usage tier on AWS, but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).

You can build everything manually on AWS. You have to set up your own Amazon compute cluster from several instances. There is a nice tutorial video available from Amazon: http://www.youtube.com/watch?v=YfCgK1bmCjw

But it will take you several hours to get everything running:

starting 11 EC2 instances (one instance per group plus one head instance)
installing R and MPI on all machines (check for preinstalled images)
configuring MPI correctly (and probably adding a security layer)
ideally, setting up a file server that is mounted on all nodes (to share data)
with this infrastructure in place, the best solution is to use the snow or foreach package (with Rmpi)
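
The last step above can be sketched with R's bundled parallel package, which builds on snow's cluster functions. This is a minimal sketch, not a tested EC2 setup: it starts two local worker processes as a stand-in for the EC2 nodes, and the data set, column names, and fit_group() function are made up for illustration. On a real cluster, makeCluster() would be given the nodes' host names instead, or type = "MPI" if snow and Rmpi are installed.

```r
library(parallel)

# Hypothetical stand-in data: 200 customers with 20 visits each.
set.seed(1)
purchases <- data.frame(
  customer_id = rep(1:200, each = 20),
  visit       = rep(1:20, times = 200),
  spend       = runif(4000, min = 5, max = 100)
)

# Assign each customer to one of ten groups, then split the data by group.
n_groups <- 10
group_of_row <- (purchases$customer_id %% n_groups) + 1
groups <- split(purchases, group_of_row)

# Hypothetical per-group job: fit one linear model per customer in the group
# and return the fitted coefficients.
fit_group <- function(group_data) {
  by_customer <- split(group_data, group_data$customer_id)
  lapply(by_customer, function(d) coef(lm(spend ~ visit, data = d)))
}

# Two local workers keep the try-out light; on EC2 this would be one worker
# per node, identified by host name.
n_workers <- 2
cl <- makeCluster(n_workers)
results <- parLapply(cl, groups, fit_group)  # ships each group and the function to a worker
stopCluster(cl)

length(results)       # one result per group
length(results[[1]])  # number of customers fitted in the first group
```

parLapply() serializes each list element and the function and sends them to the workers, which covers both "getting the data onto the nodes" and "executing the R function there" for data of this size. A shared file server only becomes necessary when the data is too large to ship this way.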

The segue package is nice but you will definitely get data communication problems!

The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform gives you easy access to computer clusters in the cloud. You can test a small cluster for 5 hours for free. Check the slides from the useR conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference

I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)

A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:

How many customers are in the training data?
How big is the training data (what you will send to AWS)?
What's the expected average run time to fit a model to one customer, and the total for all runs?
When you fit your model to one customer, how much data is generated (what you will return from AWS)?
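
These questions can be answered roughly on a laptop before any cloud setup. Below is a minimal sketch with made-up data and a placeholder model; the training data frame, its columns, and fit_one() are assumptions for illustration, not the actual Dunnhumby data or a recommended model.

```r
# Hypothetical stand-in for the training data.
set.seed(42)
training <- data.frame(
  customer_id = rep(1:500, each = 30),
  visit       = rep(1:30, times = 500),
  spend       = rnorm(15000, mean = 40, sd = 10)
)

# Placeholder model for a single customer.
fit_one <- function(d) coef(lm(spend ~ visit, data = d))

by_customer <- split(training, training$customer_id)

# How many customers, and how big is the data that would be sent?
n_customers <- length(by_customer)
bytes_sent  <- length(serialize(by_customer, NULL))

# Time a small sample of customers and extrapolate to all of them.
sample_ids <- head(names(by_customer), 20)
elapsed <- system.time(
  sample_fits <- lapply(by_customer[sample_ids], fit_one)
)[["elapsed"]]
secs_per_customer <- elapsed / length(sample_ids)
est_total_secs    <- secs_per_customer * n_customers

# How much data comes back per customer?
bytes_per_result <- length(serialize(sample_fits[[1]], NULL))

cat("customers:            ", n_customers, "\n")
cat("serialized input (MB):", round(bytes_sent / 1e6, 2), "\n")
cat("est. total run (s):   ", round(est_total_secs, 2), "\n")
cat("bytes per result:     ", bytes_per_result, "\n")
```

If the estimated total run time is only a few minutes, a single machine is probably enough. If it runs to many hours while the input stays small, the job is CPU-bound and is a reasonable candidate for parallel workers.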

Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may, or may not, be a good tool to help you solve.

In answer to your specific question about how to get the data onto AWS, Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, then using the Elastic Map Reduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose), and Segue takes care of the rest.

Keep in mind that the very first thing Segue does when you call emrlapply() is to serialize (sometimes slowly) and upload your data to S3. So depending on the size of the data and the upload speed of your internet connection, this can be slow.

I take issue with Markus' assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300MB/1GB with some regularity. But I tend to run these simulations from an AWS instance, so I am sending and receiving from one AWS rack to another, which makes everything much faster.
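
From the user's side, the workflow described above comes down to building a list and writing a function. The sketch below prepares both and tries them locally with lapply(). The data and fit_shopper() are hypothetical, and the Segue call appears only as a comment: its cluster argument is an assumption recalled from the package's README rather than verified here, so check it against the Segue documentation.

```r
# Hypothetical shopper data; the real input would be the Dunnhumby training set.
set.seed(7)
shoppers <- data.frame(
  customer_id = rep(c("a", "b", "c"), each = 10),
  visit       = rep(1:10, times = 3),
  spend       = runif(30, min = 10, max = 80)
)

# One list element per shopper, each a data frame of that shopper's rows.
shopper_list <- split(shoppers, shoppers$customer_id)

# Hypothetical model-fitting function: takes one shopper's data frame and
# returns something small (here, two regression coefficients).
fit_shopper <- function(d) {
  coef(lm(spend ~ visit, data = d))
}

# Local dry run: confirms the list and the function work together before
# any data is uploaded.
local_results <- lapply(shopper_list, fit_shopper)
str(local_results)

# With Segue, the same list and function would be passed to emrlapply()
# (assumed form; my_cluster would come from Segue's cluster setup):
# results <- emrlapply(my_cluster, shopper_list, fit_shopper)
```

Running the function locally on a few list elements first is cheap insurance, since an error discovered only after the upload costs the full serialization and transfer time.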

If you want to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install or build much. Once you have set up your SSH keys, you can use scp to upload files to your running instance.

I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.