How do I prevent `hadoop fs -rmr <uri>` from creating $folder$ files?

We're using Amazon's Elastic Map Reduce to perform some large file processing jobs. As a part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:

hadoop fs -rmr s3://mybucket/a/b/myfile.log

This removes the file from S3 appropriately, but in its place leaves an empty file named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on this file.

(Note that it doesn't seem to matter whether we use -rmr or -rm, or whether we use s3:// or s3n:// as the scheme: all of these exhibit the described behavior.)
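
In the meantime we can delete the leftover marker by hand; a minimal cleanup (the single quotes are my addition here, to keep the shell from expanding the $ characters in the name) looks like this:

hadoop fs -rmr 's3://mybucket/a/b_$folder$'

Doing that after every removal is exactly the sort of extra step we'd like to avoid, though.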

How do I use the hadoop fs interface to remove files from S3 and be sure not to leave these troublesome files behind?


I wasn't able to figure out if it's possible to use the hadoop fs interface in this way. However, the s3cmd interface does the right thing (but only for one key at a time):

s3cmd del s3://mybucket/a/b/myfile.log

This requires configuring a ~/.s3cfg file with your AWS credentials first. s3cmd --configure will interactively help you create this file.
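
If more than one key has to be removed, a minimal sketch (the second key name below is just a placeholder, not a path from our workflow) is to wrap the per-key deletion in a shell loop:

# assumes ~/.s3cfg already exists (created via s3cmd --configure)
for key in a/b/myfile.log a/b/otherfile.log; do
    s3cmd del "s3://mybucket/$key"
done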


This is just how the S3 support is implemented in Hadoop; see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html.

So use s3cmd.
