
Difference between 'distcp' and 'distcp -update'?

Developer https://www.devze.com 2023-02-02 11:07 Source: Internet

What is the difference between

hadoop distcp

and

hadoop distcp -update

Both of them seem to do the same work, with only a slight difference in how we call them. Neither of them overwrites a file that already exists at the destination. What's the point, then, of having two different commands?


The difference is that distcp by default skips any file that already exists at the destination, while "distcp -update" replaces the destination file if its size differs from the source file's size.

The documentation is a bit confusing on this point, since the default behavior of distcp is to skip a file that already exists in order to prevent collisions.

From the docs:

"As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file."

Keep in mind that -update is not a delta-transfer algorithm like rsync. It only does a size check, which isn't perfect: a file whose data differs but whose size is the same will not be updated.

I should also add that distcp -overwrite will overwrite the destination file whether or not the sizes match. It's a destructive operation, so make sure that you really want to do this.
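To make the three behaviors easier to compare, here is a minimal sketch of the per-file decision as described above. It is a toy model in plain shell, not Hadoop code: the function name distcp_action, the mode names, and the "-" placeholder for a missing destination file are all made up for illustration, and it assumes the size-only comparison quoted from the r0.19 docs.

```shell
# Toy model of the per-file decision described above; it never calls Hadoop.
# distcp_action MODE SRC_SIZE DST_SIZE   (hypothetical helper)
#   MODE:     default | update | overwrite
#   DST_SIZE: size in bytes, or "-" when the file is missing at the destination
distcp_action() {
    mode=$1
    src_size=$2
    dst_size=$3

    # A file that does not exist at the destination is copied in every mode.
    if [ "$dst_size" = "-" ]; then
        echo copy
        return 0
    fi

    case "$mode" in
        default)
            # Plain distcp: an existing destination file is left alone.
            echo skip
            ;;
        update)
            # -update: replace only when the sizes differ (size is the only check).
            if [ "$src_size" -ne "$dst_size" ]; then
                echo replace
            else
                echo skip
            fi
            ;;
        overwrite)
            # -overwrite: replace regardless of size.
            echo replace
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}
```

For example, `distcp_action update 100 50` prints `replace`, while `distcp_action update 100 100` prints `skip` even if the contents differ, which is exactly the same-size blind spot mentioned above.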

Some great examples can be found here: http://hadoop.apache.org/common/docs/r0.19.2/distcp.html#uo

I also want to give an example of what I do in a sync operation between two clusters:

hadoop distcp -pugp -i -delete -update hftp://hdfs-nn1:50070/clustera hdfs://hdfs-nn2:9000/clustera

This will update every file on hdfs-nn2 whose size doesn't match the copy on hdfs-nn1, and delete any extraneous files that exist only on hdfs-nn2. If you are using .Trash, any deleted files are placed in the Trash of the user invoking distcp.

I would experiment with it a bit so you can see the effect of the various options. It can be painful to accidentally wipe out TBs of data, so definitely use your Trash.
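Before running anything with -delete, it may be worth checking that Trash is actually on. The sketch below assumes a Hadoop release newer than the 0.19 docs linked above, one that ships the `hdfs getconf` command, and it assumes Trash is controlled by the `fs.trash.interval` setting (in minutes, where 0 means disabled). The helper name trash_enabled is made up for illustration. Note that `getconf` reads the configuration visible to the client, which may not match what the cluster itself uses.

```shell
# trash_enabled INTERVAL   (hypothetical helper)
# Succeeds when INTERVAL, a value of fs.trash.interval in minutes, is a
# positive integer. Empty or non-numeric input is treated as "disabled".
trash_enabled() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Usage (needs a Hadoop client on PATH, so it is left commented out here):
# if trash_enabled "$(hdfs getconf -confKey fs.trash.interval)"; then
#     echo "Trash looks enabled in the client configuration"
# else
#     echo "Trash looks disabled; -delete may remove files permanently" >&2
# fi
```

The check is kept separate from the `hdfs` call so the logic can be tried without a cluster.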
