I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well the corresponding position in the genome is covered. Here is a graphical example for a data set:
I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.
Note that the size of the valleys I'm looking for is not really known; it may range from 50 bases to a few thousand. Defining what counts as a valley is of course one of the questions I'm struggling with, but the examples above are relatively easy for me:
What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.
Thanks!
We do peak detection (and valley detection) using running medians and the median absolute deviation. You can specify how much deviation from the running median counts as a peak.
In a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.
The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function to it.
The application of the weighted median:
center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)
Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values on each side of the data vector. The weights are constructed using a customized function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.
Runnable sample code:
require(aroma.light)
# make.weights : function to make weights on the basis of a kernel
# (Gaussian by default). n is the window size.
make.weights <- function(n,
    type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
  type <- match.arg(type)
  x <- seq(-1,1,length.out=n)
  out <- switch(type,
    gaussian     = (1/sqrt(2*pi))*exp(-0.5*(3*x)^2),
    epanechnikov = 0.75*(1-x^2),
    biweight     = 15/16*(1-x^2)^2,
    triweight    = 35/32*(1-x^2)^3,
    cosinus      = pi/4*cos(x*pi/2)
  )
  # rescale so the weights sum to the window size n
  out <- out/sum(out)*n
  return(out)
}
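For example, for a window of 21 points, the default Gaussian kernel gives symmetric weights that peak in the centre and sum to n:
w <- make.weights(21)
sum(w)        # 21
which.max(w)  # 11: the centre of the window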
# score.test : function to obtain a p-value based on the score test.
# Uses a normal approximation, but is still quite accurate when p0 is
# pretty small.
# The test is one-sided: it tests whether the observed proportion
# is bigger than the hypothesized proportion.
score.test <- function(x,p0,w){
  n <- length(x)
  if(missing(w)) w <- rep(1,n)
  # drop missing values, keeping weights and data aligned
  w <- w[!is.na(x)]
  x <- x[!is.na(x)]
  # rescale the remaining weights so they sum to n
  if(sum(w)!=n) w <- w/sum(w)*n
  phat <- sum(x*w)/n
  z <- (phat-p0)/sqrt(p0*(1-p0)/n)
  p <- 1-pnorm(z)
  return(p)
}
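For example, 15 flagged positions out of 20 under a null proportion of 0.25 give a very small one-sided p-value:
score.test(c(rep(1,15),rep(0,5)), p0=0.25)
# about 1.2e-07: far more "peaks" in this window than expected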
# embed.na is a modification of embed(), adding NA values
# to the beginning and end of x. Window size = 2n+1.
embed.na <- function(x,n){
  # pad with n NAs on each side so every position gets a full window
  extra <- rep(NA,n)
  x <- c(extra,x,extra)
  out <- embed(x,2*n+1)
  return(out)
}
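A quick check of what embed.na() produces (note that embed() puts the most recent value in the first column, so each row is a reversed window):
embed.na(1:5, 1)
# 5 rows, one window of size 3 per input position; the NA padding keeps
# the output length equal to the input length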
# running.score : function to calculate the weighted p-value for the chance
# of being in a run of peaks. This chance is based on the weighted proportion
# of the neighbourhood.
# The null hypothesis proportion is the weighted proportion of detected
# peaks in the whole dataset. This lessens the need for parameter tuning
# and makes the method more automatic.
# For a correct calculation, the weights have to sum to n.
running.score <- function(sel,n=20,w,p0){
  if(missing(w)) w <- rep(1,2*n+1)
  # null hypothesis: the overall proportion of detected peaks
  if(missing(p0)) p0 <- sum(sel,na.rm=TRUE)/length(sel[!is.na(sel)])
  out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
  return(out)
}
# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
  wdw <- 2*n+1
  if(missing(w)) w <- rep(1,wdw)
  center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=TRUE)
  # median absolute deviation around the running median, scaled for normality
  mad <- median(abs(x-center))*cte
  return(list(med=center,mad=mad))
}
##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)   # a random walk as test data
# flag points below the 5% quantile of their local window of 2n+1 values
peaks <- apply(embed.na(series,n),1,
               function(x) x[n+1] < quantile(x,probs=0.05,na.rm=TRUE))
pweight <- make.weights(0.2*n+1)        # weights for a window of 201 points
p.val <- running.score(peaks,n=n/10,w=pweight)
plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")
The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series longer than 20,000 values you need a whole lot of memory; I can't run it on my computer any more.
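If memory is the limiting factor, a minimal sketch of a lighter alternative: stats::runmed computes an (unweighted) running median without building the large embed() matrix, so it scales to millions of values. The unweighted median and the 2*MAD cutoff below are my substitutions, not equivalents of the weighted version above:
# unweighted running median; k must be odd
center <- runmed(series, k=2*n+1, endrule="median")
resid  <- series - center
madev  <- mad(resid)            # scaled median absolute deviation of residuals
valley <- resid < -2*madev      # points far below their local median
points((1:length(series))[valley], series[valley], col="green")
The points() call adds the flagged positions to the plot drawn in the sample code above.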
Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:
# first derivative of the lowess-smoothed series
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
# second derivative
f.sec.deriv <- diff(f.deriv)
# minima and maxima of f.deriv: where f.sec.deriv changes sign
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)
op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
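To pick out valley bottoms specifically, look for the points where the first derivative of the smoothed series crosses zero from negative to positive, i.e. the local minima. A minimal sketch building on the objects above:
# valley bottoms: the smoothed first derivative crosses zero upwards
smoothed <- lowess(series,f=n/length(series),delta=1)$y
d1 <- diff(smoothed)
crossing <- diff(sign(d1))            # +2 at a minimum, -2 at a maximum
bottoms  <- which(crossing == 2) + 1  # indices of the local minima
plot(series,type="l")
points(bottoms, series[bottoms], col="red", pch=19)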
You can define a valley by different criteria:
- depth
- width
- volume (depth*width)
You might also have a valley inside a big mountain; do you want those too?
For example, there is a valley here (the run of 800s between the two 1000 plateaus): 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3. A small sketch of scoring it by the criteria above follows.
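A minimal sketch, with the valley boundaries picked by hand purely for illustration (automatic boundary detection is the hard part):
x <- c(1,2,3,4,1000,1000,800,800,800,1000,1000,500,200,3)
inside <- 7:9                    # hand-picked indices of the valley floor
rim    <- min(x[6], x[10])       # lower of the two shoulders: 1000
depth  <- rim - min(x[inside])   # 200
width  <- length(inside)         # 3
volume <- sum(rim - x[inside])   # 600 (equals depth*width for a flat floor)
c(depth=depth, width=width, volume=volume)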
Try to explain in more detail how YOU (or any expert in your field) would choose the valleys, given the data.
You might want to look at the watershed algorithm (as used in image segmentation).
You might want to try the peak detection function to identify the regions of interest. The desired minimum width of the valleys can be specified with the span parameter.
It might be a good idea to smooth the data first, to get rid of noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter should be enough.
The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the valley's extreme value with the median level of the data.
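A minimal sketch of that pipeline (the span-based minima finder below is a hand-rolled stand-in, not a specific library function; it reuses the series object from the example above, and the span of 51 and the 2*mad cutoff are arbitrary starting points):
# smooth first with a simple moving average via stats::filter
span <- 51                                   # minimum valley width; must be odd
smoothed <- stats::filter(series, rep(1/span, span), sides=2)
# a point is a candidate if it is the smallest value in its span-wide window
is.min <- sapply(seq_along(smoothed), function(i){
  lo <- max(1, i-(span-1)/2); hi <- min(length(smoothed), i+(span-1)/2)
  !is.na(smoothed[i]) && smoothed[i] == min(smoothed[lo:hi], na.rm=TRUE)
})
# keep only valleys that are deep enough relative to the overall level
deep <- is.min & (smoothed < median(series) - 2*mad(series))
which(deep)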