Configuring Hadoop logging to avoid too many log files

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories), which looks like the same problem in this question: Error in Hadoop MapReduce

My question is: does anyone know how to configure Hadoop to roll the log dir or otherwise prevent this? I'm trying to avoid just setting the "mapred.userlog.retain.hours" and/or "mapred.userlog.limit.kb" properties because I want to actually keep the log files.
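
For reference, these are the properties I mean; in mapred-site.xml they would look something like this (the values here are just illustrative):

<property>
  <name>mapred.userlog.retain.hours</name>
  <value>24</value>   <!-- example only: hours to keep task logs -->
</property>
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>1024</value> <!-- example only: per-task log size cap in KB -->
</property>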

I was also hoping to configure this in log4j.properties, but looking at the Hadoop 0.20.2 source, it writes directly to log files instead of actually using log4j. Perhaps I don't fully understand how it uses log4j.

Any suggestions or clarifications would be greatly appreciated.


Unfortunately, there isn't a configurable way to prevent that. Every task for a job gets one directory in history/userlogs, which will hold the stdout, stderr, and syslog task log output files. The retain hours will help keep too many of those from accumulating, but you'd have to write a good log rotation tool to auto-tar them.
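
If it helps, here is a minimal sketch of the kind of rotation script I mean, run from cron on each node; the paths and the 7-day cutoff are assumptions, not anything Hadoop ships with:

#!/bin/sh
# Hypothetical helper: tar up and delete per-attempt userlogs directories
# older than 7 days. The HADOOP_LOG_DIR default and the retention window
# are assumptions; adjust for your cluster.
USERLOGS="${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs"
ARCHIVE="${HADOOP_LOG_DIR:-/var/log/hadoop}/userlogs-archive"
mkdir -p "$ARCHIVE"
find "$USERLOGS" -mindepth 1 -maxdepth 1 -type d -mtime +7 | while read -r dir; do
  name="$(basename "$dir")"
  tar -czf "$ARCHIVE/$name.tar.gz" -C "$USERLOGS" "$name" && rm -rf "$dir"
done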

We had this problem too when we were writing to an NFS mount, because all nodes would share the same history/userlogs directory. This means one job with 30,000 tasks would be enough to break the FS. Logging locally is really the way to go when your cluster actually starts processing a lot of data.

If you are already logging locally and still manage to process 30,000+ tasks on one machine in less than a week, then you are probably creating too many small files, causing too many mappers to spawn for each job.


I had this same problem. Set the environment variable "HADOOP_ROOT_LOGGER=WARN,console" before starting Hadoop.

export HADOOP_ROOT_LOGGER="WARN,console"
hadoop jar start.jar
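
If you want this to apply to client-side hadoop commands without exporting it in every shell, it should also work (as far as I can tell) to set it in conf/hadoop-env.sh, which the hadoop script sources:

# conf/hadoop-env.sh -- keep only WARN and above, logged to the console
export HADOOP_ROOT_LOGGER="WARN,console"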


Configuring hadoop to use log4j and setting

log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10

as described on this wiki page doesn't work?
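
For context, a fuller sketch of what I have in mind (the appender name FILE_AP1 and the file path are placeholders, not something Hadoop defines):

# log4j.properties -- size-bounded rolling file appender (sketch)
log4j.rootLogger=INFO,FILE_AP1
log4j.appender.FILE_AP1=org.apache.log4j.RollingFileAppender
log4j.appender.FILE_AP1.File=${hadoop.log.dir}/hadoop.log
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
log4j.appender.FILE_AP1.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_AP1.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n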

Looking at the LogLevel source code, it seems Hadoop uses Commons Logging, which tries to use log4j by default, or the JDK logger if log4j is not on the classpath.

Btw, it's possible to change log levels at runtime; take a look at the commands manual.
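
For example, with the daemonlog command (the host:port and logger name below are placeholders for your own daemon and class):

hadoop daemonlog -getlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker
hadoop daemonlog -setlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker WARN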


According to the documentation, Hadoop uses log4j for logging. Maybe you are looking in the wrong place ...


I also ran into the same problem. Hive produces a lot of logs, and when the node's disk is full, no more containers can be launched. In YARN, there is currently no option to disable logging. One particularly huge file is the syslog file, which generated GBs of logs in a few minutes in our case.

Setting the property yarn.nodemanager.log.retain-seconds to a small value in yarn-site.xml does not help. Setting yarn.nodemanager.log-dirs to file:///dev/null is not possible because a directory is needed. Removing write permission (chmod -r /logs) did not work either.
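
For reference, this is the kind of setting I tried (the value is only an example, and it did not solve the disk-filling problem for us):

<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>3600</value> <!-- example: keep container logs for one hour -->
</property>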

One solution could be to use a "null blackhole" directory. Check here: https://unix.stackexchange.com/questions/9332/how-can-i-create-a-dev-null-like-blackhole-directory

Another solution that works for us is to disable logging before running the jobs. For instance, in Hive, starting the script with the following lines works:

set yarn.app.mapreduce.am.log.level=OFF;
set mapreduce.map.log.level=OFF;
set mapreduce.reduce.log.level=OFF;
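
For plain MapReduce jobs the same properties can presumably be passed at submission time, assuming the job driver uses ToolRunner/GenericOptionsParser (the jar, class and paths below are placeholders):

hadoop jar myjob.jar MyDriver \
  -Dyarn.app.mapreduce.am.log.level=OFF \
  -Dmapreduce.map.log.level=OFF \
  -Dmapreduce.reduce.log.level=OFF \
  input output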