What is the difference between
hadoop distcp
and
hadoop distcp -update
Both seem to do the same work, with only a slight difference in how they are invoked. Neither of them overwrites a file that already exists at the destination. What is the point of having two different commands?
The difference is that plain distcp skips any file that already exists at the destination, while "distcp -update" replaces a destination file if its size differs from the source file's size.
The documentation is a bit confusing on this point, since distcp's default behavior is to skip a file that already exists in order to prevent collisions.
From the docs:
"As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file."
Keep in mind that -update is not a delta-transfer algorithm like rsync. It only does a size check, which misses the case where the source and destination files are the same size but contain different data.
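To see why a size-only check can miss a change, here is a small local sketch. It does not involve Hadoop at all: the two temp files and the use of cksum are just my stand-ins for a source and destination file and for a content comparison.

```shell
# Local-filesystem illustration only; no Hadoop involved.
tmpdir=$(mktemp -d)
printf 'alpha' > "$tmpdir/src"   # 5 bytes
printf 'omega' > "$tmpdir/dst"   # 5 bytes, different content

# Size comparison: the only criterion the quoted docs describe for -update.
src_size=$(wc -c < "$tmpdir/src" | tr -d ' ')
dst_size=$(wc -c < "$tmpdir/dst" | tr -d ' ')

# Content comparison via checksum, for contrast.
src_sum=$(cksum < "$tmpdir/src")
dst_sum=$(cksum < "$tmpdir/dst")

if [ "$src_size" -eq "$dst_size" ]; then size_verdict=skip; else size_verdict=copy; fi
if [ "$src_sum" = "$dst_sum" ]; then sum_verdict=skip; else sum_verdict=copy; fi

echo "size check: $size_verdict, checksum check: $sum_verdict"
# prints: size check: skip, checksum check: copy

rm -rf "$tmpdir"
```

The size check says the destination is up to date even though the contents differ, which is exactly the situation to watch out for.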
I should also explain that distcp -overwrite will overwrite the destination file regardless of whether the sizes match. It is a destructive operation, so make sure that you really want to do this.
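The three behaviors above can be summed up as a decision rule for a file that already exists at the destination. The function below is only a model of that rule as I have described it; distcp_decision is a made-up helper name, not anything that ships with Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): models what happens to ONE file
# that already exists at the destination, following the description above.
# usage: distcp_decision MODE SRC_SIZE DST_SIZE
distcp_decision() {
  mode=$1
  src_size=$2
  dst_size=$3
  case "$mode" in
    default)   echo skip ;;   # existing files are never touched
    update)    if [ "$src_size" -ne "$dst_size" ]; then echo copy; else echo skip; fi ;;
    overwrite) echo copy ;;   # always replaced, sizes are irrelevant
    *)         echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

distcp_decision default   100 200   # skip
distcp_decision update    100 200   # copy
distcp_decision update    100 100   # skip
distcp_decision overwrite 100 100   # copy
```

Files that do not exist at the destination yet are copied in all three modes, so they are left out of the model.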
Some great examples can be found here: http://hadoop.apache.org/common/docs/r0.19.2/distcp.html#uo
I also want to give an example of what I do in a sync operation between two clusters:
hadoop distcp -pugp -i -delete -update hftp://hdfs-nn1:50070/clustera hdfs://hdfs-nn2:9000/clustera
This will update every file on hdfs-nn2 whose size does not match its counterpart on hdfs-nn1, and delete any extraneous files. If you are using .Trash, any deleted files are placed in the Trash of the user invoking distcp.
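If you want a feel for the -update plus -delete semantics before pointing anything at a real cluster, here is a rough model of them on two local temp directories. This is plain shell, not distcp: the file names and the size_of helper are invented for the illustration, and real deletes on HDFS may go through Trash, which this sketch does not model.

```shell
# Local-directory model of the -update -delete behavior described above.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'new contents' > "$src/a"   # size differs at dst  -> replaced
printf 'same'         > "$src/b"   # same size at dst     -> left alone
printf 'old'          > "$dst/a"
printf 'same'         > "$dst/b"
printf 'stale'        > "$dst/c"   # absent from src      -> deleted

# Hypothetical helper: file size in bytes.
size_of() { wc -c < "$1" | tr -d ' '; }

# "-update": copy when the file is missing at dst or the sizes differ.
for f in "$src"/*; do
  name=$(basename "$f")
  if [ ! -e "$dst/$name" ] || [ "$(size_of "$f")" -ne "$(size_of "$dst/$name")" ]; then
    cp "$f" "$dst/$name"
  fi
done

# "-delete": remove anything at dst that no longer exists at src.
for f in "$dst"/*; do
  name=$(basename "$f")
  [ -e "$src/$name" ] || rm "$f"
done

ls "$dst"   # lists a and b; c is gone
# The temp directories are left in place so you can inspect them.
```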
I would experiment with it a bit so you can see the effect of the various options. It is painful to accidentally wipe out terabytes of data, so definitely use your Trash.