Managing dependencies with Hadoop Streaming?

Developer https://www.devze.com 2022-12-31 18:41 Source: Internet
I have a quick Hadoop Streaming question. If I'm using Python streaming and my mappers/reducers require Python packages that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?


If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zip file, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar -mapper mapper.py -reducer reducer.py -input input/foo -output output -file /tmp/foo.py -file /tmp/lib.zip

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596
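
If the archive ends up in the task's working directory still zipped, one possible workaround is to put the zip file itself on sys.path, since Python can import pure-Python modules directly from a zip archive. The following is only a sketch under assumptions: it presumes lib.zip (the name from the invocation above) sits in the task's current working directory, and the helper names add_shipped_zip and run_mapper are made up for illustration. Whether the file is actually unpacked may depend on your Hadoop version.

```python
import os
import sys


def add_shipped_zip(zip_name="lib.zip", base_dir=None):
    """Put a shipped zip archive on sys.path so its modules can be imported.

    Assumption: the file sent with -file lands in the task's working
    directory. Returns the path that was added, or None if no such file.
    """
    zip_path = os.path.join(base_dir or os.getcwd(), zip_name)
    if not os.path.isfile(zip_path):
        return None
    if zip_path not in sys.path:
        # Python's zipimport machinery handles zip files found on sys.path.
        sys.path.insert(0, zip_path)
    return zip_path


def run_mapper(lines, out):
    """Toy word-count mapper: emit 'word<TAB>1' for every word read."""
    for line in lines:
        for word in line.split():
            out.write("%s\t1\n" % word)
```

In mapper.py you would call add_shipped_zip() before importing anything that lives inside the archive, then call run_mapper(sys.stdin, sys.stdout). This only works for pure-Python code; compiled extension modules cannot be imported from inside a zip.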


If you use Dumbo you can use -libegg to distribute egg files and auto-configure the Python runtime:

https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars
https://github.com/klbostee/dumbo/wiki/Configuration-files
