Using GEOquery and SAM/Siggenes in R_问答_开发者

开发者 https://www.devze.com 2023-04-04 10:07 出处：网络

I very recently started learning R as a result of a need and so far, so good, I开发者_如何学Python think. But I\'m still in the very early stages. I am however faced with this major urgent challenge i

相关专题：bioconductor

I very recently started learning R as a result of a need and so far, so good, I开发者_如何学Python think. But I'm still in the very early stages. I am however faced with this major urgent challenge in R that I will greatly appreciate some help with. My programming skills are quite obviously amateur and will most definitely accept any help I can get. Here goes:

To create a list of datasets (gdslist) to be retrieved from the GEO database using the GEOquery package
To convert the gdslist items (gdsid) to expression data. That is, data that can work with my analysis. For this, a GDS2eSet function does the work fine.
To read in this converted expression data in a way that a class/levels file (.cls) can be created. The GDS3715 dataset for example, has 3 levels – insulin resistant, insulin sensitive and diabetic. Sometimes, the datasets are as straightforward as that. But other times, like in this case, the levels will be 6 for analysis purposes, because although there are phenotypically 3 levels, they have been divided into treated and untreated groups. There is often an added “agent” column in such cases. Each class/level needs to be assigned a numerical number (0,1,2...). That’s pretty much the general format of a .cls file.
To run a Siggenes/SAM analysis (also a package in R), two files are needed for each dataset: an expression file (the converted file from 2 above) and the accompanying cluster file (from 3).
To be able to run this process for the gdslist items in a kind of loop and have my data stored in a specified directory.

I can currently only get to step 2. I think step 3 is the crux of the challenge...

Many thanks in anticipation.

Script so far:

> gdslist = c('GDS3715','GDS3716','GDS3717'...)#up to perhaps 100 datasets
> analysisfunc = function(gdsid) {
    gdsdat = getGEO(gdsid,destdir=".")
    gdseset = GDS2eSet(gdsdat)
    pData(gdseset)$disease.state #Needed assignment, etc...Step 3 stuff ;Siggenes/SAM can perhaps be done here
    return(sprintf("Results from %s should be here",gdsid))
  }
> resultlist = sapply(gdslist,analysisfunc) #loop function

This should work for all the gds datasets.

 GEOSAM.analysis <- function( gdsid, destdir = getwd() ) {
       require( 'GEOquery' )
       require( 'siggenes' )
       ## test if gdsid is gdsid
       if( length(grep('GDS', gdsid)) == 0 ){
        stop()
       }
       gdsdat = getGEO( gdsid, destdir = destdir )
       gdseset = GDS2eSet( gdsdat )
       gdseset.pData <- pData( gdseset )
       gds.factors <- names( gdseset.pData )
       gds.factors[gds.factors == 'sample'] <- NA
       gds.factors[gds.factors == 'description'] <- NA
       gds.factors <- gds.factors[!is.na( gds.factors )]
       cl.list <- sapply( gdseset.pData[gds.factors], as.character)
       cl.list <- factor( apply( cl.list, 1, function(x){ paste( x , collapse = '-' )} ) )
       if( length( levels ( cl.list ) ) == 2 ){
        levels( cl.list ) <- 0:length( levels( cl.list ) )
       } else {
        levels( cl.list ) <- 1:length( levels( cl.list ) )
       }
       sam.gds <- sam( gdseset, cl.list )
       results.file <- file.path( destdir, paste( gdsid, '.sam.gds.rdata', sep =''  ) )
       save( sam.gds, file = results.file )
       return( sprintf( "Results from %s are saved in '%s'. These can be loaded by 'load('%s')'.",gdsid, results.file, results.file ) )
  }

  gdslist = c('GDS3715', 'GDS3716', 'GDS3717')
  resultlist = sapply(gdslist, GEOSAM.analysis)  
  print(resultlist)