I am new to Hadoop and I am learning by working through a few examples. I am currently trying to pass in a file of random integers. I want each number to be doubled, based on the number of iterations the user specifies at runtime.
3536 5806 2545 249 485 5467 1162 8941 962 6457 665 6754 889 5159 3161 5401 704 4897 135 907 8111 1059 4971 5195 3031 630 6265 827 5882 9358 9212 9540 676 3191 4995 8401 9857 4884 8002 3701 931 875 6427 6945 5483 545 4322 5120 1694 2540 9039 5524 872 840 8730 4756 2855 718 6612 4125
Above is the file sample.
For example, when the user specifies at runtime
jar ~/dissertation/workspace/TestHadoop/src/DoubleNum.jar DoubleNum Integer Output 3
the output for, say, the first line will be 3536 * 8 5806 * 8 2545 * 8 249 * 8 485 * 8 5467 * 8 1162 * 8 8941 * 8 962 * 8 6457 * 8
Because the number is doubled on each iteration, after 3 iterations every number is multiplied by 2^3 = 8. How can I achieve this using MapReduce?
For chaining one job into the next, check out: Chaining multiple MapReduce jobs in Hadoop
Also, this may be a good time to learn about sequence files, as they provide an efficient way of passing data from one map/reduce job to another.
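Here is a rough sketch of how two chained jobs can hand data to each other through SequenceFiles. It assumes the newer org.apache.hadoop.mapreduce API; the job names and the "intermediate" path are just placeholders, and mapper/output classes are left at their defaults:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ChainWithSequenceFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // First job writes its output as a SequenceFile instead of plain text.
            Job first = Job.getInstance(conf, "pass-1");
            FileInputFormat.addInputPath(first, new Path(args[0]));
            first.setOutputFormatClass(SequenceFileOutputFormat.class);
            SequenceFileOutputFormat.setOutputPath(first, new Path("intermediate"));
            first.waitForCompletion(true);

            // Second job reads that SequenceFile directly, with no re-parsing of text.
            Job second = Job.getInstance(conf, "pass-2");
            second.setInputFormatClass(SequenceFileInputFormat.class);
            SequenceFileInputFormat.addInputPath(second, new Path("intermediate"));
            FileOutputFormat.setOutputPath(second, new Path(args[1]));
            second.waitForCompletion(true);
        }
    }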
As for your particular problem, you don't need reducers here, so make it a map-only job by setting the number of reducers to zero. Sending the output to reducers would only incur extra network overhead. (However, be careful about the number of files you create over time; eventually the NameNode will not appreciate it. Each mapper will create one file.)
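A minimal driver sketch for the map-only setup might look like this. It assumes the newer mapreduce API, that the arguments follow your command line (input dir, output dir, iteration count), and a hypothetical DoubleMapper class (sketched after the next paragraph); the "double.iterations" property name is also my own invention:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DoubleNum {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Pass the iteration count (e.g. 3) from the command line to the mappers.
            conf.setInt("double.iterations", Integer.parseInt(args[2]));

            Job job = Job.getInstance(conf, "double-numbers");
            job.setJarByClass(DoubleNum.class);
            job.setMapperClass(DoubleMapper.class);
            job.setNumReduceTasks(0);               // map-only: no shuffle, no reducers
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }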
I understand that you are trying to use this as an example of perhaps something more complex... but in this case you can use a common optimization technique: if you find yourself wanting to chain one map-only job into another map/reduce job, you can squash the two mappers together. For example, instead of multiplying by 2, then by 2 again, then by 2 again, why not just multiply by 2 and by 2 and by 2 in the same mapper? Basically, if all your operations are applied independently to each number or line, you can run all the iterations within the same mapper, per record. This will reduce the overhead significantly.
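Here is one possible mapper along those lines, matching the driver sketch above: it reads the iteration count once in setup() and multiplies every number on a line by 2^n in a single pass. The key/value types and the "double.iterations" property name are assumptions, not taken from your code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DoubleMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private long factor;

        @Override
        protected void setup(Context context) {
            // 2^n, where n is the iteration count supplied at job submission.
            int iterations = context.getConfiguration().getInt("double.iterations", 1);
            factor = 1L << iterations;
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Multiply every whitespace-separated integer on the line by 2^n at once.
            StringBuilder out = new StringBuilder();
            for (String token : value.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.append(Long.parseLong(token) * factor).append(' ');
                }
            }
            context.write(NullWritable.get(), new Text(out.toString().trim()));
        }
    }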