I built an rpart tree model and now want to extract the variables it uses from a large prediction data frame (over 7,000 variables). I have to do some calculations on this data frame before prediction, and running them on all columns exceeds memory.
I don't know how to extract the variables from the rpart model. For randomForest models there is the function varUsed, but perhaps the problem can be solved in a general way, so that it also works for a glm model.
Calling names() on the rpart model returns:
"frame" "where" "call" "terms" "cptable" "splits" "method"
"parms" "control" "functions" "model" "y" "ordered"
The splits element contains:
                   count ncat     improve index        adj
m24_a_ec_fakt       6000   -1 0.026346646  0.15 0.00000000
m24_a_ec_fakt_dwl   6000   -1 0.026346646  0.15 0.00000000
m3_a_fak_rech       6000   -1 0.022821246  0.30 0.00000000
m9_a_ec_fakt        6000   -1 0.021599372  0.05 0.00000000
m9_a_ec_fakt_dwl    6000   -1 0.021599372  0.05 0.00000000
...
splits is a matrix, and the variable names seem to be its row names (or its first column?).
Can I somehow use this matrix to filter the variables of my prediction data frame by name?
Something like:
newPredDM <- oldPredDM[ --GET THE VARIABLE NAMES FROM the rpart model somehow-- ]
Regards, and thanks for the help, Rainer
See help("rpart.object")
for the structure of the returned value. Since
frame: data frame with one row for each node in the tree. [...] Elements of ‘frame’ include ‘var’, a factor giving the variable used in the split at each node
you can use levels(fit$frame$var)[-1]
to get the columns as a character string vector and use something like
newPredDM<- oldPredDM[, levels(fit$frame$var)[-1]]
for your selection.
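A slightly more defensive variant, as a sketch only: depending on the rpart and R versions, fit$frame$var may be stored as a character vector rather than a factor, in which case levels() returns NULL. Going through as.character() and dropping the "<leaf>" marker should work either way, and it keeps only variables that actually appear in a split. For the more general case raised in the question (e.g. glm), all.vars() on the model's terms lists the predictors named in the formula. The iris and mtcars datasets below are stand-ins for the real data, and the names used_vars and glm_vars are made up for illustration.

```r
library(rpart)

# Toy tree on a built-in dataset (iris stands in for the real training data).
fit <- rpart(Species ~ ., data = iris)

# frame$var holds the split variable for each node and "<leaf>" for terminal
# nodes. as.character() handles both factor and character storage.
used_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Keep only the columns the tree splits on (iris stands in for oldPredDM).
# drop = FALSE keeps a data frame even if only one variable is used.
newPredDM <- iris[, used_vars, drop = FALSE]

# More general: predictors named in the formula of any model that stores
# terms, shown here for a glm. This lists every predictor in the formula,
# not only the ones that end up mattering.
glm_fit  <- glm(mpg ~ wt + hp, data = mtcars)
glm_vars <- all.vars(delete.response(terms(glm_fit)))
```

Note that the row names of fit$splits can also include variables from competitor and surrogate splits, so unique(rownames(fit$splits)) may return more names than the frame-based approach.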
Hope this helps.