I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps probes to something in descriptions, if that something is there. Here's how I'm going about this:
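probe2gene <- list()
for (i in 1:length(probes)) {
  # split the i-th description on '//'
  strings <- strsplit(descriptions[i], '//')
  # keep the second field only when the split produced more than one piece
  if (length(strings[[1]]) > 1) {
    probe2gene[probes[i]] = strings[[1]][2]
  }
}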
This works fine but seems slow, much slower than the roughly equivalent Python:
probe2gene = {}
for p, d in zip(probes, descriptions):
    try:
        # take the second '//'-separated field of this description
        probe2gene[p] = d.split('//')[1]
    except IndexError:
        pass
My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?
Edit: a typical good "description" looks like this:
"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"
A bad "description" looks like this:
"-----"
though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probes and descriptions vectors are the same length and correspond completely to each other, i.e. probes[i] maps to descriptions[i].
It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.
> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"
Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
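For what it's worth, an environment is another option sometimes used as a string-keyed map in base R. This is only a minimal sketch with made-up keys and values, not a claim that it is faster for this problem; the keys still have to be strings.

```r
# An environment used as a string-keyed map (keys and values are made up)
probe2gene <- new.env(hash = TRUE)

# Two equivalent ways to store a value under a string key
assign("10", "bar", envir = probe2gene)
probe2gene[["20"]] <- "baz"

# Look up a single key
get("10", envir = probe2gene)

# Check for a key without searching enclosing environments
exists("30", envir = probe2gene, inherits = FALSE)

# Look up several keys at once; returns a named list
mget(c("10", "20"), envir = probe2gene)
```

Unlike a list, an environment is modified in place, so it is not copied when it is passed to a function.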
If I understand correctly, you are looking to save each probe-description combination where there is more than one (split) value in the description?
Probe and Description are the same length?
This is kind of messy, but here is a quick first pass at it:
a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))
names(b) <- a
matches <- which(lapply(b, length)>1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only
That's my first attempt. If you have a sample dataset that would be very useful.
Best regards,
Jay
Another way.
probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA NA "hair"
In the sapply call, we take advantage of the fact that in R the subsetting operator is also a function, named '[', to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives an NA value. On the left-hand side of the same line, we use the fact that we can assign to a vector of indices in any order and with gaps; the positions that are skipped are filled with NA.
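The three behaviours described above can be checked in isolation. This is just an illustrative sketch with made-up vectors:

```r
x <- c("red", "hair")

# The subsetting operator called as an ordinary function; same as x[2]
second <- `[`(x, 2)

# An out-of-range index gives NA instead of an error
missing <- x[5]

# Passing `[` to sapply, with 2 as the extra argument (the index)
picked <- sapply(list(c("a", "b"), "c"), `[`, 2)

# Assigning to indices out of order and with gaps; position 2 and 3 become NA
out <- character()
out[c(4, 1)] <- c("d", "a")
```

Here second is "hair", missing is NA, picked is c("b", NA), and out is c("a", NA, NA, "d").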
Here's another approach that should be fast. Note that this doesn't remove the empty descriptions. It could be adapted to do that or you could clean those in a post processing step using lapply. Is it the case that you'll never have a valid description of length one?
make_desc <- function(n)
{
    word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
    if (runif(1) < 0.70)
        paste(sapply(seq_len(n), word), collapse = "//")
    else
        "----"
}
description <- sapply(seq_len(10), make_desc)
probes <- seq_len(length(description))
desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)
> description
[1] "fmbec"
[2] "----"
[3] "----"
[4] "frrii//yjxsa//wvkce//xbpkc"
[5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"
[6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"
[7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"
[8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"
[9] "----"
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"
> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
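The post-processing step mentioned above might look something like the sketch below. It assumes that a usable description has at least two "//"-separated fields and that the second field is the one wanted, as in the question. The sample vectors are made up for illustration, and trimws() is there because the real descriptions in the question have spaces around the separators.

```r
# Hypothetical sample data in the style of the question
description <- c("NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1",
                 "----",
                 "abc // Xyz")
probes <- c(101, 102, 103)

# Split every description once, in a single vectorized call
desc_parts <- strsplit(description, "//", fixed = TRUE)

# Keep only the descriptions that actually split into more than one field
keep <- lengths(desc_parts) > 1

# Take the second field of each kept description and strip surrounding spaces
second <- vapply(desc_parts[keep], function(x) trimws(x[2]), character(1))

# Named character vector: names are the probes (as strings), values the genes
probe2gene <- setNames(second, probes[keep])
```

After this, probe2gene[["101"]] is "Rb1cc1" and probe 102 has no entry at all.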