I managed to write a for loop
to compare letters in the following vector:
bases <- c("G","C","A","T")
test <- sample(bases, replace=T, 20)
test
will return
[1] "T" "G" "T" "G" "C" "A" "A" "G" "A" "C" "A" "T" "T" "T" "T" "C" "A" "G" "G" "C"
With the function Comp()
I can check whether each letter matches the next letter:
Comp <- function(data)
{
output <- vector()
for(i in 1:(length(data)-1))
{
if(data[i]==data[i+1])
{
output[i] <-1
}
else
{
output[i] <-0
}
}
return(output)
}
Resulting in:
> Comp(test)
[1] 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 0
This works, but it is very slow on large vectors. Therefore I tried lapply():
Comp <- function(x,i) if(x[i]==x[i+1]) 1 else 0
unlist(lapply(test, Comp, test))
Unfortunately it's not working... (Error in i + 1 : non-numeric argument to binary operator). I have trouble figuring out how to access the adjacent letter in the vector to compare it. Also the length(data)-1, needed to "not compare" the last letter, might become a problem.
Thank you all for the help!
Cheers Lucky
Just "lag" test and use ==, which is vectorized.
bases <- c("G","C","A","T")
set.seed(21)
test <- sample(bases, replace=TRUE, 20)
lag.test <- c(tail(test,-1),NA)
#lag.test <- c(NA,head(test,-1))
test == lag.test
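To get the same 0/1 vector that Comp() returns, without the trailing NA, the two shifted copies can be compared directly. The lapply() attempt in the question fails because lapply(test, Comp, test) passes each letter as x and the whole vector as i, so i + 1 is applied to characters; iterating over indices instead works, but it is still a loop in R. A minimal sketch (comp_vec and comp_apply are names made up for this illustration, not part of the original code):

```r
bases <- c("G", "C", "A", "T")
set.seed(21)
test <- sample(bases, 20, replace = TRUE)

# Vectorized: compare elements 1..(n-1) with elements 2..n in a single call.
comp_vec <- function(data) {
  n <- length(data)
  if (n < 2L) return(integer(0))  # nothing to compare
  as.integer(data[-n] == data[-1L])
}

# Index-based apply: iterate over positions, not over the letters themselves.
# vapply() is used here as a type-checked alternative to sapply().
comp_apply <- function(data) {
  idx <- seq_len(max(length(data) - 1L, 0L))
  vapply(idx, function(i) as.integer(data[i] == data[i + 1L]), integer(1))
}

identical(comp_vec(test), comp_apply(test))
```

Both functions also handle a vector of length one, where 1:(length(data)-1) in the original loop would count down from 1 to 0.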
Update:
Also, your Comp function is slow because you don't specify the length of output when you initialize it. I suspect you were trying to pre-allocate, but vector() creates a zero-length vector that must be expanded during every iteration of your loop. Your Comp function is significantly faster if you change the call to vector() to vector(length=NROW(data)-1).
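Comp.prealloc, as used in the timings, is not spelled out in this answer. A minimal sketch of what it presumably looks like, assuming the only change from the original Comp is the vector() call:

```r
# Assumed definition: identical to Comp except for the pre-allocated output.
Comp.prealloc <- function(data)
{
  # Allocate the full result up front instead of growing it inside the loop
  output <- vector(length = NROW(data) - 1)
  for (i in 1:(length(data) - 1))
  {
    if (data[i] == data[i + 1])
    {
      output[i] <- 1
    }
    else
    {
      output[i] <- 0
    }
  }
  return(output)
}
```

vector(length = n) starts out as a logical vector; the first numeric assignment converts it to numeric, so the result has the same type as the output of the original Comp.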
set.seed(21)
test <- sample(bases, replace=T, 1e5)
system.time(orig <- Comp(test))
# user system elapsed
# 34.760 0.010 34.884
system.time(prealloc <- Comp.prealloc(test))
# user system elapsed
# 1.18 0.00 1.19
identical(orig, prealloc)
# [1] TRUE
As @Joshua wrote, you should of course use vectorization - it is way more efficient.
...But just for reference, your Comp function can still be optimized a bit. The result of a comparison is TRUE/FALSE, which are glorified versions of 1/0, so the if/else is unnecessary. Also, storing the result as integer instead of numeric uses half the memory.
Comp.opt <- function(data)
{
output <- integer(length(data)-1L)
for(i in seq_along(output))
{
output[[i]] <- (data[[i]]==data[[i+1L]])
}
return(output)
}
...and the speed difference:
> system.time(orig <- Comp(test))
user system elapsed
21.10 0.00 21.11
> system.time(prealloc <- Comp.prealloc(test))
user system elapsed
0.49 0.00 0.49
> system.time(opt <- Comp.opt(test))
user system elapsed
0.41 0.00 0.40
> all.equal(opt, orig) # opt is integer, orig is double
[1] TRUE
Have a look at this:
> x = c("T", "G", "T", "G", "G","T","T","T")
>
> res = sequence(rle(x)$lengths)-1
>
> dt = data.frame(x,res)
>
> dt
x res
1 T 0
2 G 0
3 T 0
4 G 0
5 G 1
6 T 0
7 T 1
8 T 2
Might work faster.
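Note that res counts the position within each run rather than giving the 0/1 match vector from the question. If the 0/1 form is wanted, it can be derived from res; a small sketch, where matches_next is a name made up for this illustration:

```r
x <- c("T", "G", "T", "G", "G", "T", "T", "T")
res <- sequence(rle(x)$lengths) - 1

# res > 0 marks letters equal to the one before them. Dropping the first
# element shifts this so that position i says whether x[i] == x[i + 1].
matches_next <- as.integer(res[-1] > 0)
matches_next
# [1] 0 0 0 1 0 1 1
```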