I have a data frame with two columns: an identifier column (ID) and a column with names. Each identifier appears several times in the ID column (see below).
ID Names
uc001aag.1 DKFZp686C24272
uc001aag.1 DQ786314
uc001aag.1 uc001aag.1
uc001aah.2 AK056232
uc001aah.2 FLJ00038
uc001aah.2 uc001aah.1
uc001aah.2 uc001aah.2
uc001aai.1 AY217347
Now I want to create a dataframe like this:
ID Names
uc001aag.1 DKFZp686C24272 | DQ786314 | uc001aag.1
uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
uc001aai.1 AY217347
Can anyone help me?
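(For reference, the table above can be turned into a runnable input like this. The construction is a sketch added here, not part of the original question; treating both columns as character vectors is an assumption.)

```r
# Reconstruct the example input table from the question
Data <- data.frame(
  ID    = c(rep("uc001aag.1", 3), rep("uc001aah.2", 4), "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)
```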
aggregate is quite fast, but you can use an sapply solution to parallelize the code. On Windows this is easily done with snowfall:
require(snowfall)
sfInit(parallel=TRUE, cpus=2)
sfExport("Data")            # copy Data to every worker; Data must already exist
ID <- unique(Data$ID)
CombNames <- sfSapply(ID, function(i){
  paste(Data$Names[Data$ID==i], collapse=" | ")
})
data.frame(ID, CombNames)
sfStop()
The parallel version will give you an extra speedup, but the single sapply solution is actually slower than aggregate. tapply is a bit faster, but can't be parallelized using snowfall. On my computer:
n <- 3000
m <- 3
Data <- data.frame(ID = rep(1:n, m),
                   Names = rep(LETTERS[1:m], each = n))
# using snowfall for parallel sapply
system.time({
ID <- unique(Data$ID)
CombNames <- sfSapply(ID,function(i){
paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
})
user system elapsed
0.02 0.00 0.33
# using tapply
system.time({
CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)
})
user system elapsed
0.44 0.00 0.44
# using aggregate
system.time(
aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ")
)
user system elapsed
0.47 0.00 0.47
# using the normal sapply
system.time({
ID <- unique(Data$ID)
CombNames <- sapply(ID,function(i){
paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
})
user system elapsed
0.75 0.00 0.75
Note: for the record, the better sapply solution would be:
CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)
which is equivalent to tapply. But parallelizing this one is actually slower, as you have to move more data around within the sfSapply. The speed of the parallel version comes from copying the dataset to every CPU first. This is what you have to keep in mind when your dataset is huge: you pay for the speed with more memory usage.
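As a quick sanity check (a toy example added here, not part of the original answer), the split-based sapply and tapply do return the same collapsed strings:

```r
# Toy data: two groups ("a", "b") with two names each
Data <- data.frame(ID = c("a", "a", "b", "b"),
                   Names = c("x", "y", "z", "w"),
                   stringsAsFactors = FALSE)

# tapply version
res1 <- tapply(Data$Names, Data$ID, paste, collapse = " | ")

# split + sapply version
res2 <- sapply(split(Data$Names, Data$ID), paste, collapse = " | ")

# both give c(a = "x | y", b = "z | w")
all(res1 == res2)
```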
You can use aggregate:
R> aggregate(Names ~ ID, data=tmp, FUN=paste, collapse=" | ")
ID Names
1 uc001aag.1 DKFZp686C24272 | DQ786314 | uc001aag.1
2 uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
3 uc001aai.1 AY217347