Often a data.frame contains 20+ variables and you want to get a first overview (of the correlation structure). Even on a 30" screen you quickly run out of space and it remains hard to grasp the message. Are there any established strategies to highlight what's important? I am aware this question is somewhat general, but I have wondered about it over and over again and never found a panacea. And yup, I have heard of summary.
Well I just have to post about my own package here:)
You can use qgraph to visualize a correlation matrix as a network. This will plot variables as nodes and correlations as edges connecting the nodes. Green edges indicate positive correlations and red edges indicate negative correlations. The wider and more saturated the edges, the stronger the absolute correlation.
For example (this is the first example from the help page), the following code will plot the correlation matrix of a 240-variable dataset.
library("qgraph")
data(big5)
data(big5groups)
qgraph(cor(big5),minimum=0.25,cut=0.4,vsize=2,groups=big5groups,legend=TRUE,borders=FALSE)
title("Big 5 correlations",line=-2,cex.main=2)
You can also cluster strongly correlated nodes together (this uses the Fruchterman-Reingold algorithm), which gives quite a clear picture of what the structure of your correlation matrix actually looks like.
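A minimal sketch of that clustered view, reusing the data and settings from the example above. As far as I know, `layout = "spring"` is the qgraph argument that requests the Fruchterman-Reingold placement; the other arguments are carried over unchanged.

```r
library("qgraph")
data(big5)
data(big5groups)

# layout = "spring" asks for the force-directed (Fruchterman-Reingold) placement,
# so strongly correlated variables end up close together
qgraph(cor(big5), layout = "spring", minimum = 0.25, cut = 0.4, vsize = 2,
       groups = big5groups, legend = TRUE, borders = FALSE)
title("Big 5 correlations (spring layout)", line = -2, cex.main = 2)
```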
And a lot more. For more examples, take a look at my site:
http://sachaepskamp.com/?page_id=73
I have used heatmap() (or, rather, the underlying image() function) for that purpose. I do not have the code handy anymore, and as I recall I had to fiddle with the colormap to get something that made sense for the [-1, 1] range.
Here is a simple example:
R> set.seed(42)
R> X <- matrix(rnorm(100*20), nrow=100)
R> XC <- cor(X)
R> image(XC) # color range could do with improvements here
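One possible way to improve the color range, sketched in base R: build a diverging palette and pin the scale to [-1, 1] so that zero always maps to the middle color. The blue/white/red choice and the 101 color steps are arbitrary assumptions, not anything from the original code.

```r
set.seed(42)
X  <- matrix(rnorm(100 * 20), nrow = 100)
XC <- cor(X)

# Diverging palette: blue for negative, white near zero, red for positive
pal <- colorRampPalette(c("blue", "white", "red"))(101)

# Fixing zlim to c(-1, 1) keeps the palette centred on zero,
# whatever range the observed correlations happen to cover
n <- ncol(XC)
image(1:n, 1:n, XC, zlim = c(-1, 1), col = pal,
      xlab = "variable", ylab = "variable")
```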
You can play further tricks by blanking the lower or upper triangle and putting text there. The PerformanceAnalytics package has a function chart.Correlation() that does that (from the raw data matrix), but it is much slower and will not scale to large matrices as per your original question. I am sure there are others...
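The triangle trick can be sketched in base R roughly as follows. This is not how chart.Correlation() works internally, just an illustration: colors go in one triangle and the rounded coefficients are printed in the other. The matrix size (8 variables) and the palette are assumptions chosen to keep the labels readable.

```r
set.seed(42)
X  <- matrix(rnorm(100 * 8), nrow = 100)
XC <- cor(X)
n  <- ncol(XC)

# Keep colours only in the upper triangle; NA cells are left blank by image()
upper <- XC
upper[lower.tri(upper, diag = TRUE)] <- NA

pal <- colorRampPalette(c("blue", "white", "red"))(101)
image(1:n, 1:n, upper, zlim = c(-1, 1), col = pal, xlab = "", ylab = "")

# Print the coefficients in the blanked triangle;
# image() puts matrix rows on the x axis and columns on the y axis
idx <- which(lower.tri(XC), arr.ind = TRUE)
text(idx[, "row"], idx[, "col"], labels = sprintf("%.2f", XC[idx]), cex = 0.7)
```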
With pairs you can generate some scatterplot matrices quickly. If too many variables are present, you could use one of the tools of Rattle.
In fact, rattle itself does not do most of the analysis (as dwin correctly pointed out), but it offers (imho) easy tools to quickly run a PCA, correlation tree, or correlation matrix like above, without having to manipulate your dataset to make sure that, e.g., only numeric variables are present.
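Without Rattle, the "numeric variables only" step can be done by hand before calling pairs. A small sketch using the built-in iris data, which is an assumption made purely for illustration:

```r
# Keep only the numeric columns (iris also contains the factor Species)
num <- iris[, sapply(iris, is.numeric)]

# Scatterplot matrix of the remaining variables
pairs(num)
```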
What about doing a PCA on the correlation matrix? Then the angles between variables show their correlations.
library(HSAUR)  # provides the heptathlon data
library(vegan)  # provides rda()
data("heptathlon", package = "HSAUR")
heptathlon
round(cor(heptathlon[, -8]), 2)             # correlations [without score]
PCA <- rda(heptathlon[, -8], scale = TRUE)  # run a PCA
biplot(PCA, display = "species")            # correlation biplot
# The angles between descriptors in the biplot reflect their correlations
Often the columns of a matrix are presented in an arbitrary order. In that case, I'd look to do some reordering. For visualizing and working with sparse matrices, I often do some sort of reordering, such as Reverse Cuthill-McKee or some other form of bandwidth reordering, and this could be applied to other contexts to make visualizations easier.
For a correlation matrix, you can squash low-magnitude correlations (e.g. within (-eps, +eps)) to zero to create sparsity, then reorder to examine the structure.
What is nice is if you can find blocks of related objects. This reordering plus the heatmaps (using one color gradient for negative correlation, another for positive correlation) can be very helpful.
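A rough sketch of the squash-then-reorder idea in base R. Note the substitution: base R has no Reverse Cuthill-McKee routine that I know of, so this uses the leaf order of a hierarchical clustering instead, which serves the same goal of bringing related variables next to each other. The simulated data, the cut-off `eps = 0.2`, and the palette are all assumptions for illustration.

```r
set.seed(1)
# Hypothetical data: three blocks of four variables, each driven by one latent factor
f <- matrix(rnorm(200 * 3), ncol = 3)
X <- f[, rep(1:3, each = 4)] + matrix(rnorm(200 * 12, sd = 0.7), ncol = 12)
X <- X[, sample(ncol(X))]   # shuffle the columns to hide the block structure
C <- cor(X)

eps <- 0.2                  # assumed cut-off for "low magnitude"
Cs  <- C
Cs[abs(Cs) < eps] <- 0      # squash weak correlations to create sparsity

# Reorder so that strongly correlated variables become neighbours
# (hierarchical clustering stands in for a bandwidth reordering here)
ord <- hclust(as.dist(1 - abs(Cs)), method = "average")$order
Cr  <- Cs[ord, ord]

# One colour gradient for negative, another for positive correlations
pal <- colorRampPalette(c("blue", "white", "red"))(101)
n   <- ncol(Cr)
image(1:n, 1:n, Cr, zlim = c(-1, 1), col = pal, axes = FALSE, xlab = "", ylab = "")
axis(1, at = 1:n, labels = ord, las = 2)
axis(2, at = 1:n, labels = ord, las = 2)
```

With data like this, the reordered heatmap should show the related variables as square blocks along the diagonal, which is hard to see in the shuffled column order.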