I have a text file like this:
V1 V2 V3
X N aaaaaabbbabab
C T ababaaabaaabb
V H babbbabaabbba
What I want to do is count how many a's and how many b's there are at each character position (column) of V3, across all rows.
So the output would be like this:
col1 col2 col3 ....... col13
a 2 2 2 1
b 1 1 1 2
How can this be done?
I tried the count function along with substring, but it did not work.
Thanks
Assuming dat contains your data, we process V3 using strsplit() to split each string into single characters and arrange them in a matrix, one row per string:
tt <- matrix(unlist(strsplit(as.character(dat$V3), split = "")), ncol = 13, byrow = TRUE)
giving:
> tt
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a" "a" "a" "a" "a" "a" "b" "b" "b" "a" "b" "a" "b"
[2,] "a" "b" "a" "b" "a" "a" "a" "b" "a" "a" "a" "b" "b"
[3,] "b" "a" "b" "b" "b" "a" "b" "a" "a" "b" "b" "b" "a"
We can then get the desired result by tabulating each column, taking care to set the factor levels explicitly so that zero counts are kept:
apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
which gives:
> apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
To automate the selection of appropriate levels, we could do something like:
> lev <- levels(factor(tt))
> apply(tt, 2, function(x, levels) c(table(factor(x, levels = levels))),
+ levels = lev)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
where in the first line we treat tt as a vector, and extract the levels after temporarily converting tt to a factor. We then supply these levels (lev) to the apply() step, instead of stating the levels explicitly.
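As a minimal, self-contained sketch of the steps above: the data frame below is a hypothetical stand-in for the file in the question, and the names chars and counts are illustrative only. It also avoids hard-coding ncol = 13 by stacking the split strings with do.call(rbind, ...), which assumes that every string in V3 has the same length.

```r
# Hypothetical stand-in for the data read from the question's text file
dat <- data.frame(
  V1 = c("X", "C", "V"),
  V2 = c("N", "T", "H"),
  V3 = c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba"),
  stringsAsFactors = FALSE
)

# Split every string into single characters (one list element per row)
chars <- strsplit(as.character(dat$V3), split = "")

# Stack into a character matrix: one row per string, one column per position.
# Assumption: all strings have equal length.
tt <- do.call(rbind, chars)

# Derive the levels from the data, then tabulate each column with fixed levels
# so that characters absent from a column are reported as 0
lev <- levels(factor(tt))
counts <- apply(tt, 2, function(x) c(table(factor(x, levels = lev))))
counts
```

Every column of counts should sum to nrow(dat), which is a quick sanity check that no character was dropped.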
EDIT: solution corrected following comments from Gavin Simpson. This works now.
To avoid many conversions to factor, you can use the following trick with the indices and tapply:
tt <- c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba")
ttstr <- strsplit(tt,"")
ttf <- factor(unlist(ttstr))
n <- length(ttstr[[1]])
k <- length(ttstr)
> do.call(cbind,tapply(ttf,rep(1:n,k),table))
1 2 3 4 5 6 7 8 9 10 11 12 13
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
This gives a speedup of about 7 times over the method shown by @Gavin:
> benchmark(method1(tt),method2(tt),replications=1)
test replications elapsed relative user.self
1 method1(tt) 1 0.89 1.000000 0.89
2 method2(tt) 1 6.99 7.853933 6.98
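The tapply trick above works because unlist() concatenates the split strings in order, so the position of each character simply cycles through 1..n, k times. Under the same assumption (all strings have equal length), a closely related alternative, not the method benchmarked above, is to cross-tabulate character against position in a single table() call. The names pos, ch and tab are illustrative.

```r
tt <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
ttstr <- strsplit(tt, "")

n <- length(ttstr[[1]])  # characters per string (assumed equal for all strings)
k <- length(ttstr)       # number of strings

ch  <- unlist(ttstr)               # all characters, string after string
pos <- rep(seq_len(n), times = k)  # position of each character within its string

# Two-way table: rows are characters, columns are positions.
# Combinations that never occur are reported as 0.
tab <- table(ch, pos)
tab
```

Because the result is a two-way table, a count for one cell can be looked up by name, for example tab["b", "6"].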
Here is a new version to answer the actual question. It still uses gregexpr, but this time using the match indexes. I have to go out of my way a bit to account for zero-count cells (which I can't seem to get from table?)
foo <- data.frame(
V1 = c("X","C","V"),
V2 = c("N","T","H"),
V3 = c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba"))
n <- nchar(as.character(foo$V3)[1])
tabA <- table(unlist(gregexpr("a",foo$V3)),exclude=-1)
tabB <- table(unlist(gregexpr("b",foo$V3)),exclude=-1)
res <- matrix(0,2,n)
res[1,as.numeric(names(tabA))] <- tabA
res[2,as.numeric(names(tabB))] <- tabB
rownames(res) <- c("a","b")
res
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a 2 2 2 1 2 3 1 1 2 2 1 1 1
b 1 1 1 2 1 0 2 2 1 1 2 2 2
Without zero-count cells you could simply do rbind(tabA,tabB).
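On the zero-count question: one possible way to keep the empty cells without pre-allocating a matrix is base R's tabulate(), which counts the integers 1..nbins and silently ignores values outside that range (including the -1 that gregexpr returns when there is no match). This is a sketch of an alternative rather than the approach used above; the helper name count_at and the vector v3 are hypothetical.

```r
v3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
n <- nchar(v3[1])  # assumes all strings have the same length

# Hypothetical helper: how often does character `ch` occur at each position?
count_at <- function(ch, strings, n) {
  # Match positions of `ch` in every string; -1 marks "no match"
  pos <- unlist(gregexpr(ch, strings, fixed = TRUE))
  # tabulate() returns one count per position 1..n, zeros included,
  # and ignores values that are not in 1..n (such as -1)
  tabulate(pos, nbins = n)
}

res <- rbind(a = count_at("a", v3, n), b = count_at("b", v3, n))
res
```

The same idea also works with table() if the positions are first converted to a factor with levels 1:n, at the cost of an extra conversion.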