Merging multiple files into one within Hadoop

I get multiple small files into my input directory which I want to merge into a single file without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig?

Thanks!


In order to keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and reducer (basically a no-op); add compression using the MR flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression, add:

    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
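
Put together, the full command with compression enabled would look something like the following (a sketch using the same $INPUT, $OUTPUT and $QUEUE placeholders as above; the -D generic options must come before -input/-output):

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat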


hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
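
Note that getmerge writes the merged result to the local filesystem, so if the single file needs to end up back in HDFS (as the question asks), it still has to be copied up again, for example (the /tmp path and the destination are placeholders):

hadoop fs -getmerge <dir_of_input_files> /tmp/merged_file
hadoop fs -put /tmp/merged_file <hdfs_destination_file>
rm /tmp/merged_file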


okay...I figured out a way using hadoop fs commands -

hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]

It worked when I tested it...any pitfalls one can think of?

Thanks!
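
For example, with hypothetical paths (both the input directory and the destination file below are placeholders):

hadoop fs -cat /user/me/input/* | hadoop fs -put - /user/me/merged.txt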


If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.

For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:

hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt

Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
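
For reference, the mount step might look roughly like this; it's only a sketch, assuming the hadoop-fuse-dfs wrapper shipped with some distributions and a NameNode at namenode.example.com:8020 (both the binary name and the URI format vary by distribution and version):

# create the mount point and mount HDFS there (typically run as root)
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs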


You can use the tool HDFSConcat, new in HDFS 0.21, to perform this operation without incurring the cost of a copy.


If you are working on a Hortonworks cluster and want to merge multiple files present in an HDFS location into a single file, you can run the 'hadoop-streaming-2.7.1.2.3.2.0-2950.jar' jar, which runs a single reducer and writes the merged file to the HDFS output location.

$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
    -Dmapred.reduce.tasks=1 \
    -input "/hdfs/input/dir" \
    -output "/hdfs/output/dir" \
    -mapper cat \
    -reducer cat

You can download the hadoop-streaming jar if it is not already available on your cluster.

If you are writing Spark jobs and want to get a merged file, to avoid multiple RDD creations and performance bottlenecks, use this piece of code before writing out your RDD:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

This will merge all the part files into one and save it back to the HDFS location.


Addressing this from an Apache Pig perspective:

To merge two files with an identical schema via Pig, the UNION command can be used:

 A = load 'tmp/file1' Using PigStorage('\t') as ....(schema1);
 B = load 'tmp/file2' Using PigStorage('\t') as ....(schema1);
 C = UNION A, B;
 store C into 'tmp/fileoutput' Using PigStorage('\t');


All the solutions are equivalent to doing a

hadoop fs -cat [dir]/* > tmp_local_file
hadoop fs -copyFromLocal tmp_local_file [destination file]

which only means that the local machine's I/O is on the critical path of the data transfer.
