Reading the NCBI's GEO microarray SOFT files in R/BioConductor
This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:
GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.
GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.
GEO Series (GSE)
Lists of GSM files that together form a single experiment.
GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. just the VALUE field from the GSM file).
As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you'll need to get all the GSM files instead (which are listed in the GDS or GSE file).
To me, it was natural to ask: How can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? (answer) And while we're at it, how to load the GEO Platform annotation (GPL file) too? (answer)
In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. Its a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way...
It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis' GEOquery (released roughly December 2005).
Installing GEOquery
Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:
> source("http://www.bioconductor.org/biocLite.R") > biocLite("GEOquery") Running bioCLite version 0.1.6 with R version 2.3.1 ...
For those of you on an older version of BioConductor, you will have to download and install it by hand from here.
If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.
On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.
Loading a GDS file with GEOquery
Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:
library(Biobase) library(GEOquery) #Download GDS file, put it in the current directory, and load it: gds858 <- getGEO('GDS858', destdir=".") #Or, open an existing GDS file (even if its compressed): gds858 <- getGEO(filename='GDS858.soft.gz')
I'm using GDS858 as input. The SOFT file is available in compressed form here GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.
Loading this file from the hard disk takes about two minutes on my laptop.
There are two main things the GDS object gives us, meta data (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First lets have a look at the metadata:
> Meta(gds858)$channel_count [1] "1" > Meta(gds858)$description [1] "Comparison of lung epithelial Calu-3 cells infected ..." > Meta(gds858)$feature_count [1] "22283" > Meta(gds858)$platform [1] "GPL96" > Meta(gds858)$sample_count [1] "19" > Meta(gds858)$sample_organism [1] "Homo sapiens" > Meta(gds858)$sample_type [1] "cDNA" > Meta(gds858)$title [1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison" > Meta(gds858)$type [1] "gene expression array-based"
Useful stuff, and now the expression data table:
> colnames(Table(gds858)) [1] "ID_REF" "IDENTIFIER" "GSM14498" "GSM14499" "GSM14500" [6] "GSM14501" "GSM14513" "GSM14514" "GSM14515" "GSM14516" [11] "GSM14506" "GSM14507" "GSM14508" "GSM14502" "GSM14503" [16] "GSM14504" "GSM14505" "GSM14509" "GSM14510" "GSM14511" [21] "GSM14512" > Table(gds858)[1:10,1:6] ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501 1 1007_s_at U48705 3736.9 3811.0 3699.6 3897.6 2 1053_at M87338 343.0 500.3 288.3 341.3 3 117_at X51757 120.9 34.3 145.8 110.5 4 121_at X69699 1523.8 1281.1 1281.9 1493.4 5 1255_g_at L36861 51.6 15.9 45.9 8.1 6 1294_at L13852 253.2 164.8 200.0 205.2 7 1316_at X55005 199.6 250.7 290.3 218.6 8 1320_at X79510 81.7 13.4 13.9 88.7 9 1405_i_at M21121 18.9 5.6 11.0 9.5 10 1431_at J02843 99.7 74.5 72.6 114.8
Now, lets turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:
> eset <- GDS2eSet(gds858, do.log2=TRUE) > eset Expression Set (exprSet) with 22283 genes 19 samples phenoData object with 4 variables and 19 cases varLabels : sample : infection : genotype/variation : description > geneNames(eset)[1:10] [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" [6] "1294_at" "1316_at" "1320_at" "1405_i_at" "1431_at" > sampleNames(eset) [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513" [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507" [11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505" [16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"
GEOquery does an excellent job of extracting the phenotype data, as you can see:
> pData(eset)$infection [1] FRD1 FRD1 FRD1 FRD1 FRD440 [6] FRD440 FRD440 FRD440 FRD875 FRD875 [11] FRD875 FRD875 FRD1234 FRD1234 FRD1234 [16] uninfected uninfected uninfected uninfected Levels: FRD1 FRD1234 FRD440 FRD875 uninfected > pData(eset)$"genotype/variation" [1] control control [3] control control [5] mucoid mucoid [7] mucoid mucoid [9] motile motile [11] motile motile [13] non-mucoid, non-motile non-mucoid, non-motile [15] non-mucoid, non-motile non-mucoid, non-motile [17] non-mucoid, non-motile non-mucoid, non-motile [19] non-mucoid, non-motile Levels: control motile mucoid non-mucoid, non-motile
As with any expression set object, its easy to pull out a subset of the data:
> eset["1320_at","GSM14504"] Expression Set (exprSet) with 1 genes 1 samples phenoData object with 4 variables and 1 cases varLabels : sample : infection : genotype/variation : description > exprs(eset["1320_at","GSM14504"]) GSM14504 1320_at 6.70044
You should be able to produce a heatmap of differentially expressed genes easily enough using this page, especially as the phenotype/sub-sample information has been sorted out for you.
Loading a GPL (Annotation) file with GEOquery
In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find this out from the GDS858 meta information:
> Meta(gds858)$platform [1] "GPL96"
So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.
Now let's load up the GPL file and have a look at it (its a big file, about 12 MB, so this takes a while!):
library(Biobase) library(GEOquery) #Download GPL file, put it in the current directory, and load it: gpl96 <- getGEO('GPL96', destdir=".") #Or, open an existing GPL file: gpl96 <- getGEO(filename='GPL96.soft')
As with the GDS object, we can use the Meta and Table functions to extract information:
> Meta(gpl96)$title [1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A" > colnames(Table(gpl96)) [1] "ID" "Species.Scientific.Name" [3] "Annotation.Date" "GB_LIST" [5] "SPOT_ID" "Sequence.Source" [7] "Representative.Public.ID" "Gene.Title" [9] "Gene.Symbol" "Entrez.Gene" [11] "RefSeq.Transcript.ID" "Gene.Ontology.Biological.Process" [13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"
Lets look at the first four columns, for the first ten genes:
> Table(gpl96)[1:10,1:4] ID Species.Scientific.Name Annotation.Date GB_LIST 1 1007_s_at Homo sapiens 16-Sep-05 U48705 2 1053_at Homo sapiens 16-Sep-05 M87338 3 117_at Homo sapiens 16-Sep-05 X51757 4 121_at Homo sapiens 16-Sep-05 X69699 5 1255_g_at Homo sapiens 16-Sep-05 L36861 6 1294_at Homo sapiens 16-Sep-05 L13852 7 1316_at Homo sapiens 16-Sep-05 X55005 8 1320_at Homo sapiens 16-Sep-05 X79510 9 1405_i_at Homo sapiens 16-Sep-05 M21121 10 1431_at Homo sapiens 16-Sep-05 J02843
This shows a hand picked selection of the columns, again for the first ten genes:
> Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")] ID GB_LIST Gene.Title Gene.Symbol Entrez.Gene 1 1007_s_at U48705 discoidin domain receptor family, member 1 DDR1 780 2 1053_at M87338 replication factor C (activator 1) 2, 40kDa RFC2 5982 3 117_at X51757 heat shock 70kDa protein 6 (HSP70B') HSPA6 3310 4 121_at X69699 paired box gene 8 PAX8 7849 5 1255_g_at L36861 guanylate cyclase activator 1A (retina) GUCA1A 2978 6 1294_at L13852 ubiquitin-activating enzyme E1-like UBE1L 7318 7 1316_at X55005 thyroid hormone receptor, alpha (erythroblastic...) THRA 7067 8 1320_at X79510 protein tyrosine phosphatase, non-receptor type 21 PTPN21 11099 9 1405_i_at M21121 chemokine (C-C motif) ligand 5 CCL5 6352 10 1431_at J02843 cytochrome P450, family 2, subfamily E, polypeptide 1 CYP2E1 1571
The above all used the 12MB file GPL96.soft, but you can also get a much smaller 3MB file GPL96.annot (compressed as GPL96.annot.gz) which has slightly different information in it... see here.
Using the BioConductor hgu133a package
Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.
First of all, we need to install the package:
> source("http://www.bioconductor.org/biocLite.R") > biocLite("hgu133a") Running bioCLite version 0.1 with R version 2.1.1 ...
Then we can load the newly installed library:
> library(hgu133a)
There is any easy way to check when this was lasted updated, and what it can translate the Affy probe names into:
> hgu133a() Quality control information for hgu133a Date built: Created: Tue May 17 13:02:12 2005 Number of probes: 22277 Probe number missmatch: None Probe missmatch: None Mappings found for probe based rda files: hgu133aACCNUM found 22277 of 22277 hgu133aCHRLOC found 20195 of 22277 hgu133aCHR found 21283 of 22277 hgu133aENZYME found 2507 of 22277 hgu133aGENENAME found 18726 of 22277 hgu133aGO found 18647 of 22277 hgu133aLOCUSID found 21747 of 22277 hgu133aMAP found 21183 of 22277 hgu133aOMIM found 15109 of 22277 hgu133aPATH found 5067 of 22277 hgu133aPMID found 21004 of 22277 hgu133aREFSEQ found 21002 of 22277 hgu133aSUMFUNC found 0 of 22277 hgu133aSYMBOL found 21303 of 22277 hgu133aUNIGENE found 21128 of 22277 Mappings found for non-probe based rda files: hgu133aCHRLENGTHS found 25 hgu133aENZYME2PROBE found 663 hgu133aGO2ALLPROBES found 5912 hgu133aGO2PROBE found 4326 hgu133aORGANISM found 1 hgu133aPATH2PROBE found 142 hgu133aPMID2PROBE found 96291
And now lets test some of those mappings on the fourth gene 121_at in the GPL file:
> Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")] ID GB_LIST Gene.Title Gene.Symbol Entrez.Gene 4 121_at X69699 paired box gene 8 PAX8 7849
Now, what does the annotation file have to say?
> mget("121_at",hgu133aACCNUM) $"121_at" [1] "X69699" > mget("121_at",hgu133aGENENAME) $"121_at" [1] "paired box gene 8" > mget("121_at",hgu133aSYMBOL) $"121_at" [1] "PAX8" > mget("121_at",hgu133aUNIGENE) $"121_at" [1] "Hs.469728"
You will notice that there is some overlap between the information in the GEO annotation table, and the hgu133a package (which compiles its data from a range of sources). See help(hgu133a) .
You should also read this introduction, Bioconductor: Annotation Package Overview .