rsync files to hadoop

I have 6 servers and each contains a lot of logs. I'd like to put these logs into the Hadoop fs via rsync. Right now I'm using FUSE, and rsync writes directly to the fuse-mounted fs /mnt/hdfs. But there is a big problem: after about a day, the fuse daemon occupies 5 GB of RAM and it's not possible to do anything with the mounted fs. So I have to remount fuse, and then everything is OK again, but only for a while. The rsync command is:

rsync --port=3360 -az --timeout=10 --contimeout=30 server_name::ap-rsync/archive /mnt/hdfs/logs
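
A fuse-dfs mount like the one on /mnt/hdfs is typically created with the fuse_dfs wrapper that ships in Hadoop contrib, roughly like this (a sketch only; the wrapper path, NameNode host and port are placeholders):

# mount HDFS through FUSE (placeholder NameNode address)
./fuse_dfs_wrapper.sh dfs://namenode.example.com:9000 /mnt/hdfs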

Rsync produces these error messages after some time:

rsync error: timeout in data send/receive (code 30) at io.c(137) [sender=3.0.7]
rsync: connection unexpectedly closed (498784 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [receiver=3.0.7]
rsync: connection unexpectedly closed (498658 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]


Fuse-hdfs does not support O_RDWR and O_EXCL, so rsync gets an EIO error. If you want to use rsync with fuse-hdfs, you need to patch the code. There are two ways to do this; either one works, but I recommend the second method (a rebuild sketch follows the patch below).

  1. Patch fuse-hdfs; the patch can be found in the Hadoop JIRA:

    https://issues.apache.org/jira/browse/HDFS-861

  2. Patch rsync (version 3.0.8):

    diff -r rsync-3.0.8.no_excl/syscall.c rsync-3.0.8/syscall.c
    
    234a235,252
    > #if defined HAVE_SECURE_MKSTEMP && defined HAVE_FCHMOD && (!defined HAVE_OPEN64 || defined HAVE_MKSTEMP64)
    >   {
    >       int fd = mkstemp(template);
    >       if (fd == -1)
    >           return -1;
    >       if (fchmod(fd, perms) != 0 && preserve_perms) {
    >           int errno_save = errno;
    >           close(fd);
    >           unlink(template);
    >           errno = errno_save;
    >           return -1;
    >       }
    > #if defined HAVE_SETMODE && O_BINARY
    >       setmode(fd, O_BINARY);
    > #endif
    >       return fd;
    >   }
    > #else
    237c255,256
    <   return do_open(template, O_WRONLY|O_CREAT, perms);
    ---
    >   return do_open(template, O_RDWR|O_EXCL|O_CREAT, perms);
    > #endif
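
After editing syscall.c as shown, rebuild and reinstall rsync in the usual way (an illustrative sequence; the install prefix is an assumption). Note that the diff lists the patched tree (rsync-3.0.8.no_excl) first, so the "<" line is the code you end up with and the ">" lines are what stock 3.0.8 contains.

    cd rsync-3.0.8
    ./configure --prefix=/usr/local
    make
    sudo make install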
    


I don't know what exactly the question is, but I think that to properly solve the generic problem of moving logs to HDFS you may want to consider using Flume: https://github.com/cloudera/flume - http://www.cloudera.com/resource/hw10_flume_reliable_distributed_streaming_log_collection


I would use hadoop fs -copyFromLocal /path/to/logs hdfs:///path/to/logs/$DATE. No need for rsync since you're putting the logs in dated directories. No need for FUSE either, which is good for prototyping but unreliable as you have seen.
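
A minimal daily wrapper for that, run from cron, could look like this (a sketch; the local and HDFS paths and the date format are assumptions):

#!/bin/sh
# copy today's logs into a dated HDFS directory (illustrative paths)
DATE=$(date +%Y-%m-%d)
hadoop fs -copyFromLocal /path/to/logs hdfs:///path/to/logs/$DATE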
