I have 6 servers and each contains a lot of logs. I'd like to push these logs to the Hadoop filesystem via rsync. Currently I'm using FUSE, and rsync writes directly to the FUSE-mounted filesystem /mnt/hdfs. But there is a big problem: after about a day, the FUSE daemon occupies 5 GB of RAM and it's no longer possible to do anything with the mounted filesystem. So I have to remount FUSE and then everything is OK again, but only for a while. The rsync command is
rsync --port=3360 -az --timeout=10 --contimeout=30 server_name::ap-rsync/archive /mnt/hdfs/logs
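(For reference, the remount workaround looks roughly like this; I'm assuming the fuse_dfs_wrapper.sh script from the Hadoop contrib fuse-dfs build, and the namenode address is a placeholder:)

# unmount the bloated fuse_dfs instance, then mount it again
fusermount -u /mnt/hdfs
fuse_dfs_wrapper.sh dfs://namenode:9000 /mnt/hdfs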
Rsync produces these error messages after some time:
rsync error: timeout in data send/receive (code 30) at io.c(137) [sender=3.0.7]
rsync: connection unexpectedly closed (498784 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [receiver=3.0.7]
rsync: connection unexpectedly closed (498658 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]
Fuse-hdfs does not support O_RDWR and O_EXCL, so rsync gets an EIO error.
If you want to use rsync with fuse-hdfs, you need to patch the code. There are two ways to do it, and either one works; I recommend the second method.
1. Patch fuse-hdfs; the patch can be found in the Hadoop JIRA:
https://issues.apache.org/jira/browse/HDFS-861
2. Patch rsync (version 3.0.8):
diff -r rsync-3.0.8.no_excl/syscall.c rsync-3.0.8/syscall.c
234a235,252
> #if defined HAVE_SECURE_MKSTEMP && defined HAVE_FCHMOD && (!defined HAVE_OPEN64 || defined HAVE_MKSTEMP64)
> {
>     int fd = mkstemp(template);
>     if (fd == -1)
>         return -1;
>     if (fchmod(fd, perms) != 0 && preserve_perms) {
>         int errno_save = errno;
>         close(fd);
>         unlink(template);
>         errno = errno_save;
>         return -1;
>     }
> #if defined HAVE_SETMODE && O_BINARY
>     setmode(fd, O_BINARY);
> #endif
>     return fd;
> }
> #else
237c255,256
<     return do_open(template, O_WRONLY|O_CREAT, perms);
---
>     return do_open(template, O_RDWR|O_EXCL|O_CREAT, perms);
> #endif
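Rebuilding rsync with this change might look roughly like the sketch below. I'm assuming the diff above is saved as no_excl.diff (a file name chosen here for illustration); note that the diff was generated with the patched tree as the first argument, so it is applied in reverse (-R) to a pristine rsync-3.0.8 tree:

# unpack a clean rsync-3.0.8 source tree
tar xzf rsync-3.0.8.tar.gz
cd rsync-3.0.8
# the diff above is "diff patched stock", so -R turns the stock syscall.c into the patched variant
patch -R syscall.c < ../no_excl.diff
# rebuild and install the patched rsync
./configure && make && sudo make install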
I'm not sure what exactly the question is, but I think that to properly solve the general problem of moving logs to HDFS you may want to consider using Flume: https://github.com/cloudera/flume - http://www.cloudera.com/resource/hw10_flume_reliable_distributed_streaming_log_collection
I would use hadoop fs -copyFromLocal /path/to/logs hdfs:///path/to/logs/$DATE. No need for rsync since you're putting the logs in dated directories. No need for FUSE either, which is good for prototyping but unreliable as you have seen.
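A hypothetical daily cron job on each log server, using the placeholder paths from above, could be as simple as:

# copy today's logs into a dated HDFS directory
DATE=$(date +%Y-%m-%d)
hadoop fs -mkdir hdfs:///path/to/logs/$DATE
hadoop fs -copyFromLocal /path/to/logs/* hdfs:///path/to/logs/$DATE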