Downloading GEO data and preparing to load into R

This document assumes you have already installed R and Bioconductor.

Downloading data

Next we need to download our data set.

The data set you are going to use is available from the Gene Expression Omnibus database (GEO), hosted by NCBI. First go to their website:

http://www.ncbi.nlm.nih.gov/geo/

Next, query Datasets for "GDS858" (or the dataset you have been allocated). (Experiment accessions start with GSE; dataset accessions start with GDS.)

On the first result, click the icon of a heat map on the right hand side. It will look like this:
[Heatmap icon]

About half way down the new page, to the left of the heat map icon, choose:

Data->Download->DataSet SOFT file

Save the file (GDS858.soft.gz or similar) to your computer, making a note of where you save it.

Also, from the same menu, pick:

Data->Download->Annotation SOFT file

Note that the annotation file (GPL96.annot.gz or similar) is specific to the array and the platform used in the experiment, not the experiment itself.

Again save the file in the same folder. Before leaving the GEO webpage, make sure you have a careful note of which sample name corresponds to each condition.

The data is stored in a compressed form (*.gz files) so it takes less time to download, which means we need to uncompress it. If you right-click the file and accept the default location for unzipping, the uncompressed file (or a new folder containing it) should appear in the same folder as the download. If your computer cannot open .gz files, you probably need to download WinZip from the web.
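If you would rather not rely on an unzipping tool, the same step can be scripted. The following is only a minimal sketch in Python, using the standard gzip module; the function name gunzip is made up for illustration and is not part of the course material.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(gz_path):
    """Uncompress a .gz file into the same folder and return the new path.

    Hypothetical helper for illustration only.
    """
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # e.g. GDS858.soft.gz -> GDS858.soft
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the contents, no need to hold it all in memory
    return out_path
```

For example, gunzip("GDS858.soft.gz") would write GDS858.soft next to the download, assuming the file was saved in the current folder.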

If you complete all these steps successfully, you will have your dataset and its annotations as two files called GDS858.soft and GPL96.annot, or similar. (The annotations give a little information about each gene.) You should open the files in Excel to see what the data looks like.

Aside:
Some of you may be interested in the SOFT file format definition, but you won't need to worry about this for the assignment.

Also you can download the GDS data SOFT files and the GPL annotation SOFT files by FTP if you find that easier. -Peter

The files start with several lines beginning with !, ^ or #. These lines carry different bits of information about the data. Some will be obvious in meaning; the majority will not. A little way down you will get to a table containing either the actual data (if you are looking at the .soft file) or a description of the genes (if you are looking at the .annot file).

For example, the GPL96.soft file loaded into Excel:
[Excel screenshot]
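To get a feel for how much of a file is header information and how much is table, you could count the lines by their first character. This is only a sketch: the function name count_line_types is made up for illustration, and the sample lines in the usage note are invented, not copied from a real GEO file.

```python
from collections import Counter


def count_line_types(lines):
    """Count lines starting with '!', '^' or '#'; everything else counts as 'table'.

    Hypothetical helper for illustration only. Blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        first = line[0]
        counts[first if first in ("!", "^", "#") else "table"] += 1
    return counts
```

You could call it on an opened file, for example count_line_types(open("GDS858.soft")), assuming the uncompressed file is in the current folder. Note that marker rows beginning with ! inside the table area are counted with the other ! lines.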

These tables are what we are going to load into R. The ID_REF column in the table is a unique identification label for each gene; it allows us to associate expression data from one file with information about the genes from the annotation file.
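The idea of matching rows by their identifier can be sketched with a couple of toy tables. This is an illustration only: the function name annotate is made up, and it assumes both tables use a column headed ID_REF as described above, so check the actual heading in your own annotation file.

```python
def annotate(expression_rows, annotation_rows, key="ID_REF"):
    """Attach annotation fields to each expression row that shares the same key.

    Hypothetical helper for illustration only. Rows are dictionaries mapping
    column headings to cell values; rows with no matching annotation are kept
    unchanged.
    """
    by_id = {row[key]: row for row in annotation_rows}
    merged = []
    for row in expression_rows:
        extra = by_id.get(row[key], {})
        merged.append({**extra, **row})  # expression values win if a heading clashes
    return merged
```

With a made-up expression row {"ID_REF": "probe_1", "GSM1": "12.5"} and a made-up annotation row {"ID_REF": "probe_1", "Gene symbol": "GENE1"}, the result is a single row carrying all three fields.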

Preparing Data to load into R

When we looked at the data in Excel we saw that the files contain a lot of additional information, which we don't want to load into R. The easiest way for us to leave it out is to copy the data we want into a new Excel worksheet, leaving behind the information we don't want. Be sure to copy the column headings, as these will be important to us later.

Depending on which dataset you are using, you may find rows (at the start or end of the table) containing something like '!table_begin' or '!table_end'. We should discard these rows as well. Your final data in the Excel sheet should look a little like that in Figures 2 and 3 (albeit with many more rows!); the exact content depends on the data set you are dealing with.
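The copy-and-discard step above can also be sketched in code, as an alternative to doing it by hand in Excel. This is a minimal sketch, not the method the course expects: the function name extract_table is made up, and it assumes every line we want to leave behind begins with !, ^ or #, which also covers the table begin and end marker rows.

```python
def extract_table(lines):
    """Return the header and data rows as lists of cells.

    Hypothetical helper for illustration only. Lines starting with '!', '^'
    or '#' (including the table begin/end markers) and blank lines are dropped.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line[0] in "!^#":
            continue
        rows.append(line.split("\t"))  # the table columns are tab separated
    return rows
```

The first row returned holds the column headings, and every later row is one line of the table.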

With the annotation data it is important that you select all of the columns, even though most of the entries appear to be blank.

In your file containing the expression data we need to decide what to do with missing data values. These will be stored in the table with the value 'Null'. Do a search (Edit->Search) for 'Null' to see if there are any in your data set; we will deal with these later on. We now save the two data files in tab-delimited text format as Data.txt and Annot.txt.

To do this, go to File->Save As and choose "Text (Tab delimited) (*.txt)" in the "Save as type" box. Ignore the set of warnings that Excel will throw at you, and proceed to save. Make a note of where you save these files.
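For those who prefer to script it, the search for missing values and the tab-delimited save can be sketched as follows. Both function names are made up for illustration, the rows are assumed to be lists of text cells with the headings in the first row, and the search ignores case so that it catches both 'Null' and 'null'.

```python
import csv


def count_missing(rows, marker="null"):
    """Count cells whose text equals the marker, ignoring case.

    Hypothetical helper for illustration only.
    """
    return sum(cell.strip().lower() == marker for row in rows for cell in row)


def save_tab_delimited(rows, path):
    """Write rows (lists of text cells) to a tab-delimited text file.

    Hypothetical helper for illustration only.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

For example, save_tab_delimited(rows, "Data.txt") would write the expression table, and the same call with the annotation rows and "Annot.txt" would write the annotation table.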

Figure 2:

Your annotation table, Annot.txt, should look something like this:
[Excel screenshot]

Figure 3:

Your gene expression table, Data.txt, should look very similar to this:
[Excel screenshot]

Next, loading this GEO data into R...
