I want to merge 2 bzip2'ed files. I tried appending one to another: cat file1.bzip2 file2.bzip2 > out.bzip2
which seems to work (this file decompressed correctly), but I want to use this file as a Hadoop input file, and I get errors about corrupted blocks.
What's the best way to merge 2 bzip2'ed files without decompressing them?
Handling concatenated bzip is fixed on trunk, or should be: https://issues.apache.org/jira/browse/HADOOP-4012. There are examples of it working: https://issues.apache.org/jira/browse/MAPREDUCE-477?focusedCommentId=12871993&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12871993 Make sure you're running a recent version of Hadoop and you should be fine.
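For what it's worth, the concatenated file itself is a valid multi-stream bzip2 file, which is why the command-line tool decompresses it fine; the trouble described in the question is on the reader's side. A minimal sketch to check this locally, assuming the bzip2 tool is installed (the scratch file names and contents are made up for illustration):

```shell
# Hypothetical scratch files; names and contents are illustrative only.
tmp=$(mktemp -d)
printf 'alpha\n' > "$tmp/file1"
printf 'beta\n'  > "$tmp/file2"

# Compress each file separately (-k keeps the originals).
bzip2 -k "$tmp/file1" "$tmp/file2"

# Concatenate the two compressed streams, as in the question.
cat "$tmp/file1.bz2" "$tmp/file2.bz2" > "$tmp/out.bz2"

# -t tests integrity; -dc decompresses to stdout.
bzip2 -t "$tmp/out.bz2" && echo "integrity check passed"
merged=$(bzip2 -dc "$tmp/out.bz2")
printf '%s\n' "$merged"

rm -rf "$tmp"
```

If both streams come back in order, the file is fine and any remaining errors point at the Hadoop version in use rather than at the merge.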
You could compress (well, store) them both into a new bz2? It'd mean you'd have to do 3 decompressions to get the contents of the 2 archives, but might work with your scenario.
This question is quite old, but I just came across it, so in case anyone else searches for this, here is what I found for joining multiple bz2 files in HDFS into one without using the local filesystem. It also works for any text files.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input foo \
-output foo_merged \
-mapper /bin/cat \
-reducer /bin/cat
This joins all the files in folder foo and writes a single file (part-00000) to folder foo_merged.
You can use wildcards for the input folder, or use as many -input options as you need to include all the files that are going to be joined.
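As a rough sketch of the repeated -input form, the snippet below assembles one -input option per path and prints the resulting command as a dry run instead of submitting a job. The paths foo/2010-01, foo/2010-02 and bar/*.bz2 are hypothetical, and the rest of the command is copied from the example above:

```shell
# Hypothetical input locations in HDFS; replace with your own.
inputs='foo/2010-01 foo/2010-02 bar/*.bz2'

# Disable local globbing so the wildcard stays unexpanded in the argument list.
set -f
input_args=''
for path in $inputs; do
  input_args="$input_args -input $path"
done
set +f

# \$ keeps HADOOP_HOME unexpanded so the printed command is portable.
cmd="\$HADOOP_HOME/bin/hadoop jar \$HADOOP_HOME/hadoop-streaming.jar$input_args -output foo_merged -mapper /bin/cat -reducer /bin/cat"

# Dry run: print the command for review rather than running it.
echo "$cmd"
```

When running the printed command by hand, quote any path that contains a wildcard so that the local shell does not try to expand it against local files.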
The output file will be uncompressed. If you want the output also compressed in bz2, you should specify these two options:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input foo \
-output foo_merged \
-mapper /bin/cat \
-reducer /bin/cat
Replace BZip2Codec with whichever codec you want to use.
You wouldn't necessarily have to merge the files to use them as Hadoop input. You can pass either:
- file_name* (a pattern), or
- file_name_1,file_name_2 (a comma-separated list of inputs),
and Hadoop will handle it.
Otherwise, you can use Hadoop streaming to merge them (with decompression).
You could produce the list of files from a pattern like this:
FILES_LIST="$(ls -m template*.bz2)"
INPUT_FILE="$(echo $FILES_LIST | tr -d ' ')"
Here $(...) is command substitution: ls -m prints the names as a comma-separated list, and tr -d ' ' removes the spaces. You can pass $INPUT_FILE as a variable to your script via the CLI.
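If you would rather not parse the output of ls, a plain glob loop can build the same comma-separated list. This is only a sketch: the scratch directory and the template_N.bz2 files are created here purely for illustration, and only the template*.bz2 pattern and the INPUT_FILE name come from the snippet above.

```shell
# Hypothetical scratch directory with sample files, for illustration only.
workdir=$(mktemp -d)
touch "$workdir/template_1.bz2" "$workdir/template_2.bz2" "$workdir/template_3.bz2"

INPUT_FILE=''
for f in "$workdir"/template*.bz2; do
  # Skip the unexpanded pattern if nothing matched.
  [ -e "$f" ] || continue
  # Keep only the file name, dropping the directory part.
  name=${f##*/}
  # Append with a comma separator, but without a leading comma.
  INPUT_FILE="${INPUT_FILE:+$INPUT_FILE,}$name"
done

echo "$INPUT_FILE"   # prints template_1.bz2,template_2.bz2,template_3.bz2
rm -rf "$workdir"
```

This avoids surprises from ls wrapping long lists across lines, at the cost of a few more lines of script.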
Also consider the CombineFileInputFormat class as InputFormat.