Reading the NCBI's GEO microarray SOFT files in R/BioConductor

This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:

GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.

GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.

GEO Series (GSE)
Lists of GSM files that together form a single experiment.

GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).

As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you'll need to get all the GSM files instead (which are listed in the GDS or GSE file).

To me, it was natural to ask: How can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? (answer) And while we're at it, how to load the GEO Platform annotation (GPL file) too? (answer)

In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. Its a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way...

It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis' GEOquery (released roughly December 2005).

Installing GEOquery

Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("GEOquery")

Running bioCLite version 0.1.6  with R version  2.3.1 
...

For those of you on an older version of BioConductor, you will have to download and install it by hand from here.

If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.

On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.

Loading a GDS file with GEOquery

Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:

library(Biobase)
library(GEOquery)

#Download GDS file, put it in the current directory, and load it:
gds858 <- getGEO('GDS858', destdir=".")

#Or, open an existing GDS file (even if its compressed):
gds858 <- getGEO(filename='GDS858.soft.gz')

I'm using GDS858 as input. The SOFT file is available in compressed form here GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.

Loading this file from the hard disk takes about two minutes on my laptop.

There are two main things the GDS object gives us, meta data (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First lets have a look at the metadata:

> Meta(gds858)$channel_count
[1] "1"
> Meta(gds858)$description
[1] "Comparison of lung epithelial Calu-3 cells infected ..."
>  Meta(gds858)$feature_count
[1] "22283"
>  Meta(gds858)$platform
[1] "GPL96"
> Meta(gds858)$sample_count
[1] "19"
> Meta(gds858)$sample_organism
[1] "Homo sapiens"
> Meta(gds858)$sample_type
[1] "cDNA"
> Meta(gds858)$title
[1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison"
> Meta(gds858)$type
[1] "gene expression array-based"

Useful stuff, and now the expression data table:

> colnames(Table(gds858))
 [1] "ID_REF"     "IDENTIFIER" "GSM14498"   "GSM14499"   "GSM14500"  
 [6] "GSM14501"   "GSM14513"   "GSM14514"   "GSM14515"   "GSM14516"  
[11] "GSM14506"   "GSM14507"   "GSM14508"   "GSM14502"   "GSM14503"  
[16] "GSM14504"   "GSM14505"   "GSM14509"   "GSM14510"   "GSM14511"  
[21] "GSM14512"

> Table(gds858)[1:10,1:6]
      ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501
1  1007_s_at     U48705   3736.9   3811.0   3699.6   3897.6
2    1053_at     M87338    343.0    500.3    288.3    341.3
3     117_at     X51757    120.9     34.3    145.8    110.5
4     121_at     X69699   1523.8   1281.1   1281.9   1493.4
5  1255_g_at     L36861     51.6     15.9     45.9      8.1
6    1294_at     L13852    253.2    164.8    200.0    205.2
7    1316_at     X55005    199.6    250.7    290.3    218.6
8    1320_at     X79510     81.7     13.4     13.9     88.7
9  1405_i_at     M21121     18.9      5.6     11.0      9.5
10   1431_at     J02843     99.7     74.5     72.6    114.8

Now, lets turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:

> eset <- GDS2eSet(gds858, do.log2=TRUE)
> eset
Expression Set (exprSet) with 
        22283 genes
        19 samples
                 phenoData object with 4 variables and 19 cases
         varLabels
                : sample
                : infection
                : genotype/variation
                : description

> geneNames(eset)[1:10]
 [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at"
 [6] "1294_at"   "1316_at"   "1320_at"   "1405_i_at" "1431_at"

> sampleNames(eset)
 [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513"
 [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507"
[11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505"
[16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"

GEOquery does an excellent job of extracting the phenotype data, as you can see:

> pData(eset)$infection
 [1] FRD1       FRD1       FRD1       FRD1       FRD440    
 [6] FRD440     FRD440     FRD440     FRD875     FRD875    
[11] FRD875     FRD875     FRD1234    FRD1234    FRD1234   
[16] uninfected uninfected uninfected uninfected
Levels: FRD1 FRD1234 FRD440 FRD875 uninfected

> pData(eset)$"genotype/variation"
 [1] control                control               
 [3] control                control               
 [5] mucoid                 mucoid                
 [7] mucoid                 mucoid                
 [9] motile                 motile                
[11] motile                 motile                
[13] non-mucoid, non-motile non-mucoid, non-motile
[15] non-mucoid, non-motile non-mucoid, non-motile
[17] non-mucoid, non-motile non-mucoid, non-motile
[19] non-mucoid, non-motile
Levels: control motile mucoid non-mucoid, non-motile

As with any expression set object, its easy to pull out a subset of the data:

> eset["1320_at","GSM14504"]
Expression Set (exprSet) with 
        1 genes
        1 samples
                 phenoData object with 4 variables and 1 cases
         varLabels
                : sample
                : infection
                : genotype/variation
                : description

> exprs(eset["1320_at","GSM14504"])
        GSM14504
1320_at  6.70044

You should be able to produce a heatmap of differentially expressed genes easily enough using this page, especially as the phenotype/sub-sample information has been sorted out for you.

Loading a GPL (Annotation) file with GEOquery

In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find this out from the GDS858 meta information:

>  Meta(gds858)$platform
[1] "GPL96"

So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.

Now let's load up the GPL file and have a look at it (its a big file, about 12 MB, so this takes a while!):

library(Biobase)
library(GEOquery)

#Download GPL file, put it in the current directory, and load it:
gpl96 <- getGEO('GPL96', destdir=".")

#Or, open an existing GPL file:
gpl96 <- getGEO(filename='GPL96.soft')

As with the GDS object, we can use the Meta and Table functions to extract information:

> Meta(gpl96)$title
[1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A"

> colnames(Table(gpl96))
 [1] "ID"                               "Species.Scientific.Name"         
 [3] "Annotation.Date"                  "GB_LIST"                         
 [5] "SPOT_ID"                          "Sequence.Source"                 
 [7] "Representative.Public.ID"         "Gene.Title"                      
 [9] "Gene.Symbol"                      "Entrez.Gene"                     
[11] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
[13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

Lets look at the first four columns, for the first ten genes:

> Table(gpl96)[1:10,1:4]
          ID Species.Scientific.Name Annotation.Date GB_LIST
1  1007_s_at            Homo sapiens       16-Sep-05  U48705
2    1053_at            Homo sapiens       16-Sep-05  M87338
3     117_at            Homo sapiens       16-Sep-05  X51757
4     121_at            Homo sapiens       16-Sep-05  X69699
5  1255_g_at            Homo sapiens       16-Sep-05  L36861
6    1294_at            Homo sapiens       16-Sep-05  L13852
7    1316_at            Homo sapiens       16-Sep-05  X55005
8    1320_at            Homo sapiens       16-Sep-05  X79510
9  1405_i_at            Homo sapiens       16-Sep-05  M21121
10   1431_at            Homo sapiens       16-Sep-05  J02843

This shows a hand picked selection of the columns, again for the first ten genes:

> Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
          ID GB_LIST                                            Gene.Title Gene.Symbol Entrez.Gene
1  1007_s_at  U48705            discoidin domain receptor family, member 1        DDR1         780
2    1053_at  M87338           replication factor C (activator 1) 2, 40kDa        RFC2        5982
3     117_at  X51757                  heat shock 70kDa protein 6 (HSP70B')       HSPA6        3310
4     121_at  X69699                                     paired box gene 8        PAX8        7849
5  1255_g_at  L36861               guanylate cyclase activator 1A (retina)      GUCA1A        2978
6    1294_at  L13852                   ubiquitin-activating enzyme E1-like       UBE1L        7318
7    1316_at  X55005   thyroid hormone receptor, alpha (erythroblastic...)        THRA        7067
8    1320_at  X79510    protein tyrosine phosphatase, non-receptor type 21      PTPN21       11099
9  1405_i_at  M21121                        chemokine (C-C motif) ligand 5        CCL5        6352
10   1431_at  J02843 cytochrome P450, family 2, subfamily E, polypeptide 1      CYP2E1        1571

The above all used the 12MB file GPL96.soft, but you can also get a much smaller 3MB file GPL96.annot (compressed as GPL96.annot.gz) which has slightly different information in it... see here.

Using the BioConductor hgu133a package

Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.

First of all, we need to install the package:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("hgu133a")

Running bioCLite version 0.1  with R version  2.1.1 
...

Then we can load the newly installed library:

> library(hgu133a)

There is any easy way to check when this was lasted updated, and what it can translate the Affy probe names into:

> hgu133a()

Quality control information for  hgu133a 
Date built: Created: Tue May 17 13:02:12 2005  
 
Number of probes: 22277 
Probe number missmatch: None 
Probe missmatch: None 
Mappings found for probe based rda files: 
         hgu133aACCNUM found 22277 of 22277
         hgu133aCHRLOC found 20195 of 22277
         hgu133aCHR found 21283 of 22277
         hgu133aENZYME found 2507 of 22277
         hgu133aGENENAME found 18726 of 22277
         hgu133aGO found 18647 of 22277
         hgu133aLOCUSID found 21747 of 22277
         hgu133aMAP found 21183 of 22277
         hgu133aOMIM found 15109 of 22277
         hgu133aPATH found 5067 of 22277
         hgu133aPMID found 21004 of 22277
         hgu133aREFSEQ found 21002 of 22277
         hgu133aSUMFUNC found 0 of 22277
         hgu133aSYMBOL found 21303 of 22277
         hgu133aUNIGENE found 21128 of 22277 
Mappings found for non-probe based rda files:
         hgu133aCHRLENGTHS found 25
         hgu133aENZYME2PROBE found 663
         hgu133aGO2ALLPROBES found 5912
         hgu133aGO2PROBE found 4326
         hgu133aORGANISM found 1
         hgu133aPATH2PROBE found 142
         hgu133aPMID2PROBE found 96291

And now lets test some of those mappings on the fourth gene 121_at in the GPL file:

> Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
      ID   GB_LIST          Gene.Title   Gene.Symbol   Entrez.Gene
4 121_at    X69699   paired box gene 8          PAX8          7849

Now, what does the annotation file have to say?

> mget("121_at",hgu133aACCNUM)
$"121_at"
[1] "X69699"

> mget("121_at",hgu133aGENENAME)
$"121_at"
[1] "paired box gene 8"

> mget("121_at",hgu133aSYMBOL)
$"121_at"
[1] "PAX8"

> mget("121_at",hgu133aUNIGENE)
$"121_at"
[1] "Hs.469728"

You will notice that there is some overlap between the information in the GEO annotation table, and the hgu133a package (which compiles its data from a range of sources). See help(hgu133a) .

You should also read this introduction, Bioconductor: Annotation Package Overview .

The R Project

BioConductor
for GEOquery

GEO

NCBI