MapReduce or a batch job?

I have a function that needs to be called on a lot of files (thousands of them). Each file is independent of the others and can be processed in parallel. The output of the function for each file does not (currently) need to be combined with the others. I have a lot of servers I can scale this out on, but I'm not sure what to do:

1) Run a MapReduce job over the files.

2) Create thousands of jobs, each working on a different file.

Would one solution be preferable to the other?

Thanks!


MapReduce provides significant value for distributing workloads over large datasets. In your case, with small, independent jobs on small, independent data files, it would in my opinion be overkill.

So I would prefer to run a bunch of dynamically created batch jobs.
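For example, here is a minimal sketch of that approach in plain shell. The command your_processing, the files/ input directory, and the jobs/ and out/ directories are placeholder names, and xargs -P is only used to cap how many of the generated scripts run at once:

    # Generate one small batch script per input file (placeholder paths).
    mkdir -p jobs out
    for f in files/*; do
        name=$(basename "$f")
        printf '#!/bin/sh\nyour_processing "%s" > "out/%s.out"\n' "$f" "$name" > "jobs/$name.sh"
        chmod +x "jobs/$name.sh"
    done
    # Launch the generated scripts, e.g. 8 at a time on the local machine.
    ls jobs/*.sh | xargs -n 1 -P 8 sh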

Or, alternatively, use a cluster manager and job scheduler such as SLURM: https://computing.llnl.gov/linux/slurm/

SLURM: A Highly Scalable Resource Manager

SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.


Since it is only thousands of files (and not billions of them), a full-blown Hadoop setup is probably overkill. GNU Parallel tries to fill the gap between sequential scripts and Hadoop:

    ls files | parallel -S server1,server2 your_processing {} '>' out{}

You will probably want to learn about --sshloginfile. Depending on where the files are stored, you may want to learn about --trc, too.
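For instance, a sketch of those two options, assuming nodes.txt lists one server per line and your_processing is the per-file command (both names are placeholders):

    # --sshloginfile reads the list of remote servers from nodes.txt.
    # --trc {}.out transfers each input file to the remote host, returns the
    # named result file, and cleans up the remote copies afterwards.
    ls files/* | parallel --sshloginfile nodes.txt --trc {}.out your_processing {} '>' {}.out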

Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ


Use a job array in SLURM. There is no need to submit thousands of jobs, just one: the array job.

This will kick off the same program on as many nodes/cores as are available within the resources you specify, and it will eventually churn through all of the files. Your only issue is how to map the array index to a file to process. The simplest way is to prepare a text file listing all the paths, one per line. Each element of the job array then reads the i-th line of this file and uses it as the path of the file to process.
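A minimal sketch of such a submission script, assuming filelist.txt holds one input path per line, your_processing is the per-file command, and the logs/ and results/ directories already exist (all placeholder names):

    #!/bin/bash
    #SBATCH --job-name=process-files
    #SBATCH --array=1-1000
    #SBATCH --output=logs/%A_%a.out

    # Pick the line of filelist.txt that matches this task's array index.
    FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
    your_processing "$FILE" > "results/$(basename "$FILE").out"

Submitting the script once with sbatch creates the whole array, and SLURM schedules the individual tasks onto whatever nodes and cores become available.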
