
Count the number of each character in a string by column

Developer https://www.devze.com 2023-03-08 11:50 Source: Internet

I have a text file like this:

V1 V2   V3
X  N    aaaaaabbbabab
C  T    ababaaabaaabb
V  H    babbbabaabbba

What I want to do is count how many a's and how many b's there are at each character position (column) of the strings in V3.

So the output would be like this:

   col1  col2 col3 .......  col13
a  2     2    2             1
b  1     1    1             2

How can this be done?

I tried the count function along with substring, but it did not work.

Thanks


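Assuming dat contains your data, we process V3 using strsplit() to split each string into single characters, and arrange them in a matrix with one row per string:

tt <- matrix(unlist(strsplit(as.character(dat$V3), split = "")), ncol = 13, byrow = TRUE)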

giving:

> tt
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "a"   "b"   "a"   "b"  
[2,] "a"  "b"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  "a"   "a"   "b"   "b"  
[3,] "b"  "a"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"   "b"   "b"   "a"

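We can then get the desired result with the following, taking care to set the factor levels explicitly so that a letter missing from a column is still reported with a count of 0: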

apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))

which gives:

> apply(tt, 2, function(x) c(table(factor(x, levels = c("a","b")))))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2
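To see why the levels matter, here is a small sketch using only column 6 of the example, which contains no "b". The variable names are illustrative and not from the original answer. Without explicit levels, table() returns results of different lengths for different columns, and apply() would then return a list rather than a matrix.

```r
# Column 6 of the example data contains only "a"
col6 <- c("a", "a", "a")

# Without explicit levels, table() reports only the values it sees
plain <- table(col6)

# With explicit levels, the absent "b" is reported with a count of 0
fixed <- table(factor(col6, levels = c("a", "b")))

print(plain)
print(fixed)
```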

To automate the selection of appropriate levels, we could do something like:

> lev <- levels(factor(tt))
> apply(tt, 2, function(x, levels) c(table(factor(x, levels = levels))), 
+       levels = lev)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

where in the first line we treat tt as a vector, and extract the levels after temporarily converting tt to a factor. We then supply these levels (lev) to the apply() step, instead of stating the levels explicitly.
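As a rough sketch, the same idea can be wrapped in a function that also reads the number of columns from the strings instead of hard-coding 13. The helper name count_by_position is hypothetical, and the sketch assumes all strings have the same length.

```r
# Hypothetical helper, not part of the original answers.
# Assumes every string has the same number of characters.
count_by_position <- function(strings) {
  strings <- as.character(strings)               # guard against factor input
  stopifnot(length(unique(nchar(strings))) == 1)

  # One row per string, one column per character position
  tt <- do.call(rbind, strsplit(strings, split = ""))

  # All characters that occur anywhere, used as common factor levels
  lev <- sort(unique(as.vector(tt)))

  # Count each level in each column; zero counts are kept
  counts <- vapply(
    seq_len(ncol(tt)),
    function(j) as.vector(table(factor(tt[, j], levels = lev))),
    FUN.VALUE = integer(length(lev))
  )
  matrix(counts, nrow = length(lev), dimnames = list(lev, NULL))
}

res <- count_by_position(c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba"))
print(res)
```

vapply() is used here instead of apply() only so that the result type is checked; the counting logic is the same as above.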


EDIT: solution corrected after comments from Gavin Simpson. This works now.


To avoid many conversions to factor, you can use the following trick with indices and tapply:

tt <- c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba")

ttstr <- strsplit(tt,"")
ttf <- factor(unlist(ttstr))
n <- length(ttstr[[1]])
k <- length(ttstr)

> do.call(cbind,tapply(ttf,rep(1:n,k),table))
  1 2 3 4 5 6 7 8 9 10 11 12 13
a 2 2 2 1 2 3 1 1 2  2  1  1  1
b 1 1 1 2 1 0 2 2 1  1  2  2  2

This gives a speedup of about 7 times over the method shown by @Gavin:

> benchmark(method1(tt),method2(tt),replications=1)
         test replications elapsed relative user.self 
1 method1(tt)            1    0.89 1.000000      0.89   
2 method2(tt)            1    6.99 7.853933      6.98     
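The tapply() trick works because unlist() concatenates the split strings one after another, so rep(1:n, k) labels every character with its position inside its own string. A minimal sketch of that alignment follows. The final two-way table() call is a variation added here for illustration, not the code that was benchmarked; it keeps zero counts because both the characters and the positions are cross-tabulated.

```r
tt <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")

ttstr <- strsplit(tt, "")
n <- length(ttstr[[1]])   # characters per string
k <- length(ttstr)        # number of strings

chars <- unlist(ttstr)              # all characters, string after string
pos <- rep(seq_len(n), times = k)   # position of each character in its string

# Show how characters and positions line up across the first string boundary
print(head(data.frame(chars, pos), 15))

# Cross-tabulate character by position
res <- table(chars, pos)
print(res)
```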


Here is a new version to answer the actual question. It still uses gregexpr, but this time works with the match positions. I have to go out of my way a bit to account for zero-count cells (which I can't seem to get from table?).

foo <- data.frame(
    V1 = c("X","C","V"),
    V2 = c("N","T","H"),
    V3 = c("aaaaaabbbabab","ababaaabaaabb","babbbabaabbba"))

n <- nchar(as.character(foo$V3)[1])
tabA <- table(unlist(gregexpr("a",foo$V3)),exclude=-1)
tabB <- table(unlist(gregexpr("b",foo$V3)),exclude=-1)

res <- matrix(0,2,n)

res[1,as.numeric(names(tabA))] <- tabA
res[2,as.numeric(names(tabB))] <- tabB

rownames(res) <- c("a","b")
res
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
a    2    2    2    1    2    3    1    1    2     2     1     1     1
b    1    1    1    2    1    0    2    2    1     1     2     2     2

Without zero-count cells you could simply do rbind(tabA, tabB).
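One possible way to get the zero-count cells directly is base R's tabulate(), which counts the integers 1 to nbins and returns 0 for values that never occur. The sketch below is an alternative to the table() step, not the original author's code, and assumes all strings in V3 have the same length.

```r
V3 <- c("aaaaaabbbabab", "ababaaabaaabb", "babbbabaabbba")
n <- nchar(V3[1])

# Positions of every match across all strings; gregexpr() gives -1 for no match
posA <- unlist(gregexpr("a", V3, fixed = TRUE))
posB <- unlist(gregexpr("b", V3, fixed = TRUE))

# Drop the -1 "no match" markers explicitly
posA <- posA[posA > 0]
posB <- posB[posB > 0]

# tabulate() returns one count per position 1..n, including zeros
res <- rbind(a = tabulate(posA, nbins = n),
             b = tabulate(posB, nbins = n))
print(res)
```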
