Envisage: Linear Models for Microarray Analysis
Introduction and Motivation
High-throughput microarray analyses can simultaneously measure the expression levels of a large number of genes (tens of thousands), providing a genetic "fingerprint", or snapshot, of the transcriptome (the set of all transcribed genes) at a moment in time. This is a very powerful technique, allowing us to compare the response in transcriptional regulation between biological conditions. For instance, we may be interested to see how the genetic machinery is changed by treatment with a particular drug. By comparing the transcriptomic fingerprint between tissue samples from treated and untreated patients, we can see what effects the drug has on cellular function at the transcript level.
In general, microarray analyses are designed to look for changes in gene expression across one, or perhaps two, conditions. Typical conditions include changes over time, differences between healthy and diseased tissues, and the effects of drug treatment. One of the most important areas of gene expression analysis is significance analysis:
We want to find genes that show a change in their expression levels from one condition to another (e.g. genes that change expression levels when a drug is administered). Given the variability in the data, a change in expression of 2-fold or greater (up or down) is often considered to represent a significant change.
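The 2-fold rule of thumb can be sketched in a few lines of Python. This is only an illustration: the gene names and intensity values are made up, and the helper names `log2_fold_change` and `passes_fold_threshold` are hypothetical rather than part of any package described here.

```python
import math

def log2_fold_change(treated, control):
    """Return log2(mean(treated) / mean(control)) for one gene."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treated / mean_control)

def passes_fold_threshold(treated, control, fold=2.0):
    """True when mean expression changes by at least `fold`, up or down."""
    return abs(log2_fold_change(treated, control)) >= math.log2(fold)

# Hypothetical raw intensities: (treated samples, untreated samples) per gene.
genes = {
    "gene_up": ([400.0, 440.0, 360.0], [100.0, 110.0, 90.0]),
    "gene_down": ([50.0, 45.0, 55.0], [200.0, 210.0, 190.0]),
    "gene_flat": ([105.0, 95.0, 100.0], [100.0, 98.0, 102.0]),
}

# Keep the genes whose mean expression changes at least 2-fold in either direction.
changed = [name for name, (t, c) in genes.items() if passes_fold_threshold(t, c)]
```

Working on the log2 scale makes up- and down-regulation symmetric: a 2-fold increase is +1 and a 2-fold decrease is -1. A fold-change cut-off of this kind ignores the spread of the replicates, which is why a significance test is normally used alongside it.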
ANOVA is a standard method for significance analysis, and is often used to calculate the significance of changes in gene expression between conditions. Typically, ANOVA calculations are performed for a single variable (1-way ANOVA), or for two or more variables (multi-way ANOVA) in a factorial design experiment where all possible conditions spanning the variable classes are considered. In these cases, a model (see below) is fit for all genes, with gene expression as the response variable and the variable(s) of interest (and their interactions if a multi-way ANOVA is used) as the predictor variables. One issue with this method is that the same model is fit for every gene, which fails to account for the variability between genes. Many genes may show a significant change only when a subset of the model terms is considered. The problem of fitting a saturated model to all genes grows as more variables are included in the model.
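For a single gene and a single factor, the F statistic behind a 1-way ANOVA can be computed as in the following sketch. The function name `one_way_anova_f` and the expression values are assumptions made for illustration; turning the statistic into a p-value needs the F distribution, which is left out here.

```python
def one_way_anova_f(groups):
    """Return the 1-way ANOVA F statistic for one gene.

    `groups` holds one list of expression values per factor level.
    """
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    k = len(groups)          # number of factor levels
    n_total = len(values)    # total number of samples
    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        mean = sum(group) / len(group)
        # Variation of the level means around the grand mean.
        ss_between += len(group) * (mean - grand_mean) ** 2
        # Variation of the samples around their own level mean.
        ss_within += sum((v - mean) ** 2 for v in group)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical expression values for one gene at three levels of one factor.
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

In a microarray setting the same calculation would be repeated for every gene, with the same grouping of samples each time.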
In the majority of microarray experiments, particularly clinical studies, the variables of interest to the experimenter will not be the only sources of variation. Environmental, phenotypic, technical and random variation will all be present, albeit to varying degrees. Variation can be roughly split into two groups:
- Parameters: variables that are under the direct control of the experimenter
- Covariates: sources of variation that may influence the relationship between the response variable and the experimental parameters
It is important to consider all sources of variation to ensure that results correspond to true biological events, and are not related to, say, the phenotypic variation of the sample set. Including more variables in the analysis when using ANOVA may mean missing many interesting effects for genes whose expression changes in response to only a subset of the experiment variables, so a method is needed that fits a model to each gene individually. ANOVA also requires the variables of interest to be factorial in nature, with samples falling into one of a finite number of groups, or levels. However, many covariates, particularly phenotypic variables, are numeric in nature and cannot be included in such an analysis. Linear models extend ANOVA to allow the inclusion of numeric variables.
Linear Models
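For a gene $g \in \{1, \dots, G\}$ with gene expression $Y_g = (y_{g1}, \dots, y_{gn})$ over $n$ samples, a linear model can be applied to $y_{gi}$ with experiment variables $x_1 = (x_{11}, \dots, x_{1n})$, $x_2 = (x_{21}, \dots, x_{2n})$, etc. as predictor variables:
$$y_{gi} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_{gi}$$
Here $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients of the $p$ predictor variables, and $\varepsilon_{gi}$ is the residual error for gene $g$ in sample $i$.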
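As a rough illustration of the model above, the sketch below estimates the coefficients for one gene by ordinary least squares, using a two-level factor coded as 0/1 together with a numeric covariate. The variable names (`treatment`, `age`), the data, and the helper functions are all hypothetical, and solving the normal equations directly is chosen only to keep the example free of dependencies; it is not taken from Envisage.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_linear_model(y, predictors):
    """Least-squares estimates [b0, b1, ..., bp] for y_i = b0 + sum_j b_j * x_ji + e_i."""
    n = len(y)
    # Design matrix: a column of ones for the intercept, then one column per predictor.
    design = [[1.0] + [x[i] for x in predictors] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(design[i][j] * design[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(xtx, xty)

# Hypothetical data for one gene over six samples.
treatment = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]     # factor with two levels, coded 0/1
age = [30.0, 40.0, 50.0, 35.0, 45.0, 55.0]     # numeric covariate
expression = [8.0, 9.0, 10.0, 10.5, 11.5, 12.5]

intercept, treatment_effect, age_effect = fit_linear_model(expression, [treatment, age])
```

The example data were constructed without noise so that the fit recovers an intercept of 5, a treatment effect of 2 and an age effect of 0.1. Real expression data would leave non-zero residuals, and the per-gene fit would be repeated for each of the G genes.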
Envisage: Model-Based Significance Analysis of Microarray Gene Expression Data
Envisage (Enables Numerous Variables in Significance Analysis of Gene Expression) is a package written in the statistical programming language R that uses linear models to find genes that show a significant change across variables of interest, as described above. An automated stepwise model fitting procedure, based on the Akaike Information Criterion, is used to determine a candidate model consisting of the experimental variables of interest, both parameters and covariates. For each gene, the significance of each term (main effects and interactions) is determined using a Type II sum of squares F-test statistic, which identifies the variables that elicit a response in the gene expression.
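The idea of choosing model terms by the Akaike Information Criterion can be illustrated with a much simplified forward-selection sketch for a single gene. This is not Envisage's procedure: it only adds main effects, never removes terms, and ignores interactions. The AIC is computed up to an additive constant as n ln(RSS/n) + 2k, where k is the number of fitted coefficients, and the data and all function and variable names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(y, columns):
    """Residual sum of squares of the least-squares fit of y on an intercept plus `columns`."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(design[i][j] * design[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(design[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def aic(y, columns):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * (number of coefficients)."""
    n = len(y)
    return n * math.log(rss(y, columns) / n) + 2 * (len(columns) + 1)

def forward_select(y, candidates):
    """Greedily add the variable that lowers the AIC most; stop when none lowers it."""
    selected = []
    best = aic(y, [])  # intercept-only model
    while True:
        trials = [(aic(y, [candidates[v] for v in selected + [name]]), name)
                  for name in candidates if name not in selected]
        if not trials:
            break
        score, name = min(trials)
        if score >= best:
            break
        selected.append(name)
        best = score
    return selected

# Hypothetical data for one gene over eight samples.
candidates = {
    "treatment": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "age": [30.0, 40.0, 50.0, 60.0, 30.0, 40.0, 50.0, 60.0],
    "batch": [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
}
expression = [8.1, 8.9, 9.9, 11.1, 9.9, 11.1, 12.1, 12.9]

chosen = forward_select(expression, candidates)
```

In this constructed example the expression values depend on `age` and `treatment` but not on `batch`, so the selection keeps the first two and stops when adding `batch` would raise the AIC. A fuller stepwise procedure would also consider dropping terms and adding interactions at each step.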
This process can therefore be used in three ways:
- To check how much of an effect variables other than those of interest have on the expression of genes. For example, we can check whether batching of samples has a significant effect on gene expression.
- To find interesting variables in your data.
- To find genes that change based on the variables of interest whilst taking the effects of other variables into account. This is particularly useful for clinical data.
This project has been carried out in conjunction with Agilent Technologies, the producers of Genespring GX. This is one of the most widely used gene expression analysis suites, and is the principal piece of software used in my project. Working together with Helen Bird from Biological Sciences, Heather Turner from the Statistics Department, and Ewan Hunter from Agilent, I have created a program that applies linear models to microarray data analysis. It can be used independently in R (together with relevant packages from Bioconductor), or as an add-on to the Genespring GX analysis suite. Using the program through Genespring provides a simple way of preparing data and curating genes prior to analysis. A graphical user interface has been implemented in Tcl/Tk to make the program as simple as possible to use.