I'm running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:
> dim(basetrainf)  # this is a data frame
[1] 58168   118
The only pre-modeling step that significantly increases memory consumption is converting the data frame to a model matrix, which I have to do because caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the resulting matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can't use a sparse representation.)
> lsos()
                 Type      Size PrettySize   Rows Columns
mergem         matrix 879205616   838.5 Mb 115562     943
trainf     data.frame  80613120    76.9 Mb 106944     119
inttrainf      matrix  76642176    73.1 Mb    907   10387
mergef     data.frame  58264784    55.6 Mb 115562      75
dfbase     data.frame  48031936    45.8 Mb  54555     115
basetrainf data.frame  40369328    38.5 Mb  58168     118
df2        data.frame  34276128    32.7 Mb  54555     103
tf         data.frame  33182272    31.6 Mb  54555      98
m.gbm           train  20417696    19.5 Mb     16      NA
res.glmnet       list  14263256    13.6 Mb      4      NA
Also, since many R models don't support example weights, I first had to oversample the minority class, which doubled the size of my dataset (and is why trainf, mergef, and mergem have twice as many rows as basetrainf).
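(Roughly, that oversampling step amounts to something like the following; this is just an illustration using caret's upSample(), not my exact code, and the object names are placeholders:)

trainf.up = upSample(x = basetrainf[, setdiff(names(basetrainf), response)],
                     y = basetrainf[[response]], yname = response)  # duplicates minority-class rows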
R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.
The next thing I do is:
> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')
Bam - in a few seconds, the Linux out-of-memory killer kills rsession.
I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?
FWIW, I've never run into this problem with other ML software (Weka, Orange, etc.), even without pruning out any of my factors, perhaps because of both example weighting and "data frame" support, across all models.
Complete script follows:
library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)

response = 'class'
repr = 'dummy'
do.impute = F

xmode = function(xs) names(which.max(table(xs)))

read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair) if (grepl('C', pair[1])) 'numeric' else 'factor')
  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names,
                      na.strings='?', colClasses=classes, comment.char='')
  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair)
    !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]
  df
}

train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}

# From
sparse.cor = function(x){
  memory.limit(size=10000)
  n
  # ... (the rest of sparse.cor and the data loading / merging / oversampling
  #      steps were lost in the paste; the script picks up again below) ...

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x) is.factor(x) && (nlevels(x) > 200))
mergef = mergef[, -which(badfactors)]

print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]

print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  # use the sparse variant only for the 'sparse' representation
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}

print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]

print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}
res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
With that much data, the resampled error estimates and the random forest OOB error estimates should be pretty close. Try using trainControl(method = "oob") and train() will not fit the extra models on resampled data sets.
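For example, something along these lines with the objects from the question ("oob" in lowercase is the value trainControl() expects):

ctrl = trainControl(method = 'oob')  # use the forest's out-of-bag error; no resampling loop
m = train(mergem[mergef$istrain, ], mergef[mergef$istrain, response],
          method = 'rf', trControl = ctrl)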
Also, avoid the formula interface like the plague.
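That is, keep passing x and y directly, as the question already does, rather than something like this (shown only for contrast):

# formula interface: builds a full model frame and copies the data internally
# m = train(as.formula(paste(response, '~ .')), data = mergef[mergef$istrain, ], method = 'rf')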
You also might try bagging instead. Since there is no random selection of predictors at each split, you can get good results with 50-100 resamples (instead of the many more that random forests need to be effective).
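A rough sketch of that, assuming caret's 'treebag' method (bagged CART via ipred) and that nbagg is passed through train()'s "..." to the underlying bagging() call; the value 50 is just the lower end of the range above:

m.bag = train(mergem[mergef$istrain, ], mergef[mergef$istrain, response],
              method = 'treebag', nbagg = 50)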
Others may disagree, but I also think that modeling all the data you have is not always the best approach. Unless the predictor space is large, many of the data points will be very similar to others and won't contribute much to the model fit (besides the additional computational complexity and the footprint of the resulting object). caret has a function called maxDissim that might be helpful for thinning the data (although it is not terribly efficient either).
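Roughly, thinning with it looks like this (the subset sizes are arbitrary, and it can be slow on a matrix this wide):

set.seed(1)
start = sample(nrow(mergem), 500)                   # random seed subset
pool  = setdiff(seq_len(nrow(mergem)), start)
picks = maxDissim(mergem[start, ], mergem[pool, ], n = 5000)  # rows of the pool most dissimilar to what we already have
mergem.small = mergem[c(start, pool[picks]), ]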
Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.
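For example, tuneLength = 1 evaluates a single mtry value instead of the default grid of three:

m = train(mergem[mergef$istrain, ], mergef[mergef$istrain, response],
          method = 'rf', tuneLength = 1,
          trControl = trainControl(method = 'oob'))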
Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.
At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
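A by-hand fit along those lines might look like the sketch below (ntree and mtry are just starting values; keep.forest and importance are standard randomForest() arguments):

library(randomForest)
x = mergem[mergef$istrain, ]
y = mergef[mergef$istrain, response]
m.rf = randomForest(x, y, ntree = 250, mtry = floor(sqrt(ncol(x))),
                    keep.forest = FALSE,   # don't keep the fitted trees around
                    importance = FALSE)    # skip permutation importance
m.rf   # printing the object still shows the OOB error estimate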
You can try to use the ff package, which implements "memory-efficient storage of large data on disk and fast access functions".
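If you go that route, the raw table could be read straight into an on-disk ffdf, e.g. (the file name below is a placeholder, and the other arguments mirror the read.table() call in the question; read.table.ffdf forwards them to read.table):

library(ff)
dfbase.ff = read.table.ffdf(file = 'train.tab', header = TRUE, sep = '\t',
                            quote = '', na.strings = '?', comment.char = '')
dim(dfbase.ff)   # rows and columns, with the data kept on disk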