Is this parallelizable?

I have a huge tab-delimited file (10,000 subjects as rows and >1 million assays as columns). I have a mapping file which has information related to each of the 1 million columns. For every subject and every assay (i.e. for every cell), I need to look up a value in the mapping file and replace the existing value.

In Python or Perl, I would have to read through every row, split it, and look each cell up in the mapping file.

In R, I could read one column at a time and, for all rows, get the info from the mapping file.

Either way, the whole process of looping through every row or column takes a lot of time, since a look-up needs to be done for every cell.

Is there a way I could parallelize this? How should I be thinking about it if I want to parallelize this and make it go faster?

Also, I am interested in learning how to approach this in a map/reduce style.

The sample data file is as follows (tab-separated):

ID  S1  S2  S3  S4  S5  
1   AA  AB  BA  BB  AB  
2   BA  BB  AB  AA  AA  
3   BA  AB  AB  AB  AB  
4   BA  AB  AB  BB  AA  
5   AA  AB  BA  BB  AB  
6   AA  BB  AB  AA  AA  

The mapping file is as follows:

SID  Al_A  Al_B    
S1    A     C  
S2    G     T  
S3    C     A  
S4    G     T  
S5    A     C  

So for every cell in the data file, each A and B has to be looked up in the mapping file to see what A maps to (that assay's Al_A column) and what B maps to (its Al_B column).
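
For example, building the look-up table as a dictionary keyed by assay ID makes each cell translation a constant-time operation. This is a minimal sketch in Python, assuming the sample file names above and the two-letter cell format shown (AA, AB, ...):

# Minimal sketch: build the look-up table once, then translate single cells.
# "mapping.txt" and the two-letter cell format follow the samples above.
mapping = {}
with open("mapping.txt") as fh:
    next(fh)                           # skip the "SID Al_A Al_B" header
    for line in fh:
        sid, al_a, al_b = line.split()
        mapping[sid] = {"A": al_a, "B": al_b}

def translate(cell, sid):
    """Replace each A/B in a cell (e.g. "AB") using that assay's mapping."""
    return "".join(mapping[sid][allele] for allele in cell)

print(translate("AB", "S1"))           # -> "AC"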


Simple parallelism

python parse.py assays.txt | python lookup.py mapping.txt | python reformat.py >result.txt

Where parse.py reads the "assays" file of "10,000 subjects as rows and >1-million assays as columns". It parses the rows and writes the data to stdout.

lookup.py reads the mapping file (the "get some value for it" part) to populate an internal dictionary. It reads the data from stdin, does the lookups, and writes the results to stdout.

reformat.py reads stdin and reformats it to write the final report, which appears to mirror the input structure.
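
As a rough illustration, a lookup.py along these lines would work, assuming parse.py emits one subject<TAB>assay<TAB>cell line per data cell and reformat.py reassembles those lines into the original matrix shape; that intermediate format is an assumption, not something specified in the answer:

#!/usr/bin/env python
# lookup.py -- sketch of the middle pipeline stage. Assumes the stream format
# "subject<TAB>assay<TAB>cell", one line per data cell (an assumption).
import sys

mapping = {}
with open(sys.argv[1]) as fh:          # e.g. mapping.txt
    next(fh)                           # skip the "SID Al_A Al_B" header
    for line in fh:
        sid, al_a, al_b = line.split()
        mapping[sid] = {"A": al_a, "B": al_b}

for line in sys.stdin:
    subject, assay, cell = line.rstrip("\n").split("\t")
    translated = "".join(mapping[assay][allele] for allele in cell)
    sys.stdout.write(f"{subject}\t{assay}\t{translated}\n")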

While this isn't "embarrassingly" parallel, it does break the job up into some trivially parallel steps. It's surprisingly robust and can shave some time off the process.


What you probably want, however, is something that reflects the embarrassingly parallel nature of the problem. There are 10,000*1,000,000 == 10 billion individual values, all of which appear to be completely independent.

Another (more complex) approach is this. It depends on http://docs.python.org/library/multiprocessing.html.

  1. A simple reader Process partitions the input file, writing records to n different Queues. This means each Queue gets 10,000/n records. n can be a big number between 10 and 100. Yes. 100 Queues, each of which gets 100 records. It's okay if they wait to be scheduled on the paltry few cores on your server. The cores will be 100% busy. That's a good thing.

  2. Each of the n Queues is serviced by a worker Process which does the lookup thing for each assay in a record and puts the resulting record into an output Queue. You can tweak n to a wide variety of values to see what happens. At some point, a larger number of workers will slow things down. It's hard to predict where that level is, so experiment.

  3. The output Queue is read by a worker Process which simply formats the output file from what it finds in the queue.

This means you need some kind of "Assay" object which you can serialize from the input file and enqueue into a Python multiprocessing Queue.
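
Below is a condensed sketch of that reader/worker/writer layout using the multiprocessing module. The file names (assays.txt, mapping.txt, result.txt), the worker count, and the record handling are illustrative assumptions, not taken from the question:

# Condensed sketch of the reader/worker/writer layout described above.
from multiprocessing import Process, Queue

N_WORKERS = 4                                   # "n" in the description; tune experimentally

def load_mapping(path):
    mapping = {}
    with open(path) as fh:
        next(fh)                                # skip the "SID Al_A Al_B" header
        for line in fh:
            sid, al_a, al_b = line.split()
            mapping[sid] = {"A": al_a, "B": al_b}
    return mapping

def reader(path, in_queues):
    """Round-robin the data rows across the worker queues."""
    with open(path) as fh:
        header = fh.readline().split()          # ID S1 S2 ...
        for q in in_queues:
            q.put(header)                       # every worker needs the column names
        for i, line in enumerate(fh):
            in_queues[i % len(in_queues)].put(line)
    for q in in_queues:
        q.put(None)                             # sentinel: no more rows

def worker(mapping_path, in_q, out_q):
    """Translate every cell of every row this worker receives."""
    mapping = load_mapping(mapping_path)
    assay_ids = in_q.get()[1:]                  # header minus the leading "ID"
    while (line := in_q.get()) is not None:
        fields = line.split()
        row_id, cells = fields[0], fields[1:]
        translated = ["".join(mapping[sid][a] for a in cell)
                      for sid, cell in zip(assay_ids, cells)]
        out_q.put("\t".join([row_id] + translated))
    out_q.put(None)                             # tell the writer this worker is done

def writer(out_q, out_path):
    """Drain the output queue into the result file (header row omitted for brevity)."""
    finished = 0
    with open(out_path, "w") as fh:
        while finished < N_WORKERS:
            record = out_q.get()
            if record is None:
                finished += 1
            else:
                fh.write(record + "\n")

if __name__ == "__main__":
    in_queues = [Queue() for _ in range(N_WORKERS)]
    out_q = Queue()
    procs = [Process(target=reader, args=("assays.txt", in_queues)),
             Process(target=writer, args=(out_q, "result.txt"))]
    procs += [Process(target=worker, args=("mapping.txt", q, out_q))
              for q in in_queues]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Note that rows will reach the writer in whatever order the workers finish them; if the original row order matters, tag each record with its row index and have the writer reorder before writing.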


As I understand your problem, each of your data cells is independent of the others. Given this, there is a really straightforward way to parallelize the work without changing any of your pre-existing code. All you have to do is pre-process your data file with a command like the split command-line tool, then process each of the resulting files in parallel with whatever pre-existing code you have, and finally cat them all back together at the end.

Here is an example of the commands you might run:

split -l 100 data_file.tsv data_            # 100-row chunks: data_aa, data_ab, ...
ls data_?? | xargs -L 1 -P 4 perl your_processing_file.pl   # up to 4 chunks at once; data_?? avoids re-matching data_file.tsv
cat output_* > total_output                 # stitch the per-chunk results back together
rm output_*

This assumes that your script will take a file named data_$x and create a new output file named output_$x; you may have to change your script slightly for that.
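
The commands above assume a Perl script; if you would rather write the per-chunk worker in Python, a minimal sketch might look like the following (the name process_chunk.py is hypothetical, and it reads the assay IDs from the original file's header because split only leaves that header line in the first chunk):

#!/usr/bin/env python
# process_chunk.py -- hypothetical Python stand-in for your_processing_file.pl.
# Reads one chunk (e.g. data_aa), translates every cell via mapping.txt and
# writes the matching output_aa file; names follow the commands above.
import sys

# split(1) leaves the header line only in the first chunk, so take the
# assay IDs from the original file instead.
with open("data_file.tsv") as fh:
    assay_ids = fh.readline().split()[1:]       # S1 S2 ...

mapping = {}
with open("mapping.txt") as fh:
    next(fh)                                    # skip the "SID Al_A Al_B" header
    for line in fh:
        sid, al_a, al_b = line.split()
        mapping[sid] = {"A": al_a, "B": al_b}

chunk_path = sys.argv[1]                        # e.g. data_aa
out_path = chunk_path.replace("data_", "output_", 1)

with open(chunk_path) as src, open(out_path, "w") as dst:
    for line in src:
        fields = line.split()
        if fields[0] == "ID":                   # the header row landed in this chunk
            continue
        row_id, cells = fields[0], fields[1:]
        translated = ["".join(mapping[sid][a] for a in cell)
                      for sid, cell in zip(assay_ids, cells)]
        dst.write("\t".join([row_id] + translated) + "\n")

It would then be invoked in place of the Perl script, e.g. ls data_?? | xargs -L 1 -P 4 python process_chunk.py.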

This is actually a very common approach to parallelizing a problem like this.
