I very recently started learning R as a result of a need and so far, so good, I开发者_如何学Python think. But I'm still in the very early stages. I am however faced with this major urgent challenge in R that I will greatly appreciate some help with. My programming skills are quite obviously amateur and will most definitely accept any help I can get. Here goes:
- To create a list of datasets (gdslist) to be retrieved from the GEO database using the GEOquery package
- To convert the gdslist items (gdsid) to expression data. That is, data that can work with my analysis. For this, a GDS2eSet function does the work fine.
- To read in this converted expression data in a way that a class/levels file (.cls) can be created. The GDS3715 dataset for example, has 3 levels – insulin resistant, insulin sensitive and diabetic. Sometimes, the datasets are as straightforward as that. But other times, like in this case, the levels will be 6 for analysis purposes, because although there are phenotypically 3 levels, they have been divided into treated and untreated groups. There is often an added “agent” column in such cases. Each class/level needs to be assigned a numerical number (0,1,2...). That’s pretty much the general format of a .cls file.
- To run a Siggenes/SAM analysis (also a package in R), two files are needed for each dataset: an expression file (the converted file from 2 above) and the accompanying cluster file (from 3).
- To be able to run this process for the gdslist items in a kind of loop and have my data stored in a specified directory.
I can currently only get to step 2. I think step 3 is the crux of the challenge...
Many thanks in anticipation.
Script so far:
> gdslist = c('GDS3715','GDS3716','GDS3717'...)#up to perhaps 100 datasets
> analysisfunc = function(gdsid) {
gdsdat = getGEO(gdsid,destdir=".")
gdseset = GDS2eSet(gdsdat)
pData(gdseset)$disease.state #Needed assignment, etc...Step 3 stuff ;Siggenes/SAM can perhaps be done here
return(sprintf("Results from %s should be here",gdsid))
}
> resultlist = sapply(gdslist,analysisfunc) #loop function
This should work for all the gds datasets.
GEOSAM.analysis <- function( gdsid, destdir = getwd() ) {
require( 'GEOquery' )
require( 'siggenes' )
## test if gdsid is gdsid
if( length(grep('GDS', gdsid)) == 0 ){
stop()
}
gdsdat = getGEO( gdsid, destdir = destdir )
gdseset = GDS2eSet( gdsdat )
gdseset.pData <- pData( gdseset )
gds.factors <- names( gdseset.pData )
gds.factors[gds.factors == 'sample'] <- NA
gds.factors[gds.factors == 'description'] <- NA
gds.factors <- gds.factors[!is.na( gds.factors )]
cl.list <- sapply( gdseset.pData[gds.factors], as.character)
cl.list <- factor( apply( cl.list, 1, function(x){ paste( x , collapse = '-' )} ) )
if( length( levels ( cl.list ) ) == 2 ){
levels( cl.list ) <- 0:length( levels( cl.list ) )
} else {
levels( cl.list ) <- 1:length( levels( cl.list ) )
}
sam.gds <- sam( gdseset, cl.list )
results.file <- file.path( destdir, paste( gdsid, '.sam.gds.rdata', sep ='' ) )
save( sam.gds, file = results.file )
return( sprintf( "Results from %s are saved in '%s'. These can be loaded by 'load('%s')'.",gdsid, results.file, results.file ) )
}
gdslist = c('GDS3715', 'GDS3716', 'GDS3717')
resultlist = sapply(gdslist, GEOSAM.analysis)
print(resultlist)
精彩评论