Hadoop Streaming Multiple Files per Map Job

https://www.devze.com 2023-02-15 08:32 Source: web
I have a Hadoop streaming setup that works; however, there is a bit of overhead when initializing the mappers, which is done once per file. Since I am processing many files, I notice I'm spending a lot of time in initialization.

Is there a way, without writing any Java, to specify that I want to reuse the same mapper instance for multiple files to amortize the initialization cost?


In $HADOOP_HOME/conf/mapred-site.xml, add or edit the following property:

<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>#</value>
</property>

The # can be set to the number of times a JVM is to be reused (the default is 1), or set to -1 for no limit on the reuse count.

It's also possible to specify it per job by setting the job configuration property mapred.job.reuse.jvm.num.tasks to the desired value.
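For a streaming job, the per-job setting can be passed on the command line with the generic -D option, which must appear before the streaming-specific options. A minimal sketch, where the input/output paths and the mapper/reducer scripts are placeholders:

```shell
# Reuse one JVM for an unlimited number of tasks in this job (-1 = no limit).
# Paths and script names below are placeholders for your own setup.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -D mapred.job.reuse.jvm.num.tasks=-1 \
    -input /data/input \
    -output /data/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Note that JVM reuse amortizes startup cost across tasks on the same node within a single job; each streaming task still launches your mapper script as a new process, so per-script initialization (e.g. loading data in mapper.py) is not shared.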

