Envisage: Linear Models for Microarray Analysis

Introduction and Motivation

High-throughput microarray analyses can simultaneously measure the expression levels of a large number of genes (tens of thousands), providing a genetic "fingerprint", or snapshot, of the transcriptome (the set of all transcribed genes) at a moment in time. This is a very powerful technique, allowing us to compare the response in transcriptional regulation between biological conditions. For instance, we may be interested to see how the genetic machinery is changed by treatment with a particular drug. By comparing the transcriptomic fingerprint between tissue samples from treated and untreated patients, we can see what effects the drug has on cellular function at the transcript level.

In general, microarray analyses are designed to look for changes in gene expression across one, or maybe two, conditions. Typical conditions include changes over time, differences between healthy and diseased tissues, and the effects of drug treatment. Two of the most important areas of gene expression analysis are:

1. Analysis for differential expression

We want to find genes that show a change in their expression levels from one condition to another (e.g. genes that change expression levels when a drug is administered). Given the variability in the data, a change in expression of 2-fold or greater (up or down) is often considered to represent a meaningful change.

2. Statistical significance analysis

We want to ensure that the changes that we see in expression between conditions reflect true biological variation, and are not simply changes expected by chance. To guard against this, we run microarray experiments using replicates for each condition (typically n=3 as a minimum) and average over them. Increasing the sample size will increase the power of any statistical analysis, reducing the risk of missing interesting changes (false negatives), and makes it less likely that changes explained by chance are mistaken for real effects (false positives). A typical test of statistical significance is analysis of variance (ANOVA), which looks for differences in the means of the replicate groups while taking the variance into account. A p-value is produced, which is the probability of seeing a difference in the group means at least as large as the one observed if there were in fact no true difference between the groups. A p-value of 0.05 is typically used as a cutoff for "significant changes", meaning that a gene with no true change in expression would be called significant 5% of the time purely by chance.
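As a rough illustration of the two analyses above, the following Python sketch computes a fold change and a one-way ANOVA F statistic for a single gene. The expression values, the function names and the choice of Python are assumptions made for illustration only; none of this is taken from Envisage. Converting the F statistic into a p-value requires the F distribution, which is not in the Python standard library, so that step is left to a statistics package.

```python
import math

def fold_change(treated, control):
    """Return (ratio, log2 ratio) of mean treated to mean control expression."""
    ratio = (sum(treated) / len(treated)) / (sum(control) / len(control))
    return ratio, math.log2(ratio)

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the replicates around their own group mean.
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical raw intensities for one gene, three replicates per condition.
control = [100.0, 110.0, 90.0]
treated = [210.0, 190.0, 230.0]

ratio, log_ratio = fold_change(treated, control)
two_fold = abs(log_ratio) >= 1.0  # |log2 ratio| >= 1 means 2-fold up or down

# The F statistic is computed here on log2-transformed values (an assumption).
log_groups = [[math.log2(v) for v in control], [math.log2(v) for v in treated]]
f_stat, df_between, df_within = one_way_anova_f(log_groups)

print(f"fold change {ratio:.2f}, 2-fold or greater: {two_fold}")
print(f"F = {f_stat:.1f} on ({df_between}, {df_within}) degrees of freedom")
```

Note that the fold change uses only the group means, whereas the F statistic also depends on how tightly the replicates cluster, which is why the two analyses can disagree for noisy genes.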

ANOVA is a standard method for significance analysis, and is often used to calculate the significance of changes in gene expression between conditions. Typically, ANOVA calculations are performed for a single variable (1-way ANOVA), or for two or more variables (multi-way ANOVA) in a factorial design experiment where all possible conditions spanning the variable classes are considered. In these cases, a model (see below) is fit for all genes, with gene expression as the response variable and the variable(s) of interest (and their interactions if a multi-way ANOVA is used) as the predictor variables. One issue with this method is that the same model is fit for all genes, which fails to account for the variability between the genes. Many genes may show a significant change only when a subset of the model terms is considered. The problem of fitting a saturated model to all genes becomes larger as more variables are included in the model.

In the majority of microarray experiments, particularly clinical studies, the variables of interest to the experimenter will not be the only sources of variation. Environmental, phenotypic, technical and random variation will all be present, albeit to varying degrees. Variation can be roughly split into two groups:

Parameters: Variables that are under the direct control of the experimenter
Covariates: Sources of variation that may influence the relationship between the response variable and experimental parameters

It is important to consider all sources of variation to ensure that results correspond to true biological events, and are not related to, say, the phenotypic variation of the sample set. Including more variables in the analysis when using ANOVA may result in missing a lot of interesting effects for genes whose expression changes in response to only a subset of the experiment variables, so a method must be used to fit a model to each gene individually. Also, ANOVA requires variables of interest to be factorial in nature, with samples falling into one of a finite number of groups, or levels. However, many covariates, particularly phenotypic variables, may be numeric in nature, and these cannot be considered in this analysis. Linear models extend ANOVA to allow the inclusion of numeric variables.

Linear Models

For some gene g ∈ {1,...,G} with gene expression Y_g = (y_g1,...,y_gn) over n samples, a linear model can be applied to y_gi with experiment variables x_1 = (x_11,...,x_1n), x_2 = (x_21,...,x_2n), etc. as predictor variables:

y_gi = β_0 + β_1 x_1i + β_2 x_2i + ... + β_p x_pi + ε_gi

Where β_0 is the intercept (the baseline expression, which equals the mean expression over all samples when the predictors are centred)

β_j is the model coefficient for numeric explanatory variable x_j

p is the number of variables

ε_gi is the error term for gene g in sample i

By modelling the gene expression for each gene in this way, we can use a number of different experiment variables in the model and see how much of an effect each variable has on the expression of the gene (i.e. for variables measured on comparable scales, those that have a large effect on the expression of the gene will have a coefficient β_j that is larger in magnitude).

Note that this linear model contains only main effect terms for the sake of clarity. However, the model can also contain higher-order interaction terms (modifications of the combined main effects caused by interdependencies between the variables).
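To make the main-effects model concrete, here is a minimal Python sketch that fits the intercept and coefficients for one gene by ordinary least squares, using one two-level factor coded as 0/1 and one numeric covariate. The variable names (treatment, age), the data values and the helper functions are all hypothetical, and the data are deliberately noise-free so that the fitted coefficients can be checked by eye. It is not the fitting routine used by Envisage.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_linear_model(y, predictors):
    """Least-squares fit of y on an intercept plus the given predictor columns.

    Returns [intercept, coefficient for predictors[0], ...].
    """
    n = len(y)
    design = [[1.0] + [x[i] for x in predictors] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(xtx, xty)

# Hypothetical experiment: treatment is a factor (0 = untreated, 1 = treated),
# age is a numeric covariate that plain ANOVA could not use directly.
treatment = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
age = [30.0, 40.0, 50.0, 35.0, 45.0, 55.0]
# Made-up expression generated as 5 + 2*treatment + 0.1*age, with no error term.
expression = [5.0 + 2.0 * t + 0.1 * a for t, a in zip(treatment, age)]

coefficients = fit_linear_model(expression, [treatment, age])
print([round(c, 6) for c in coefficients])
```

With real data the error term is not zero, so the fitted coefficients are estimates, and their significance has to be assessed with a test such as the F-test.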

Envisage: Model-Based Significance Analysis of Microarray Gene Expression Data

Envisage (Enables Numerous Variables in Significance Analysis of Gene Expression) is a package written in the statistical programming language R that uses linear models to find genes that show a significant change across variables of interest as described. An automated stepwise model fitting procedure, based on the Akaike Information Criterion, is used to determine a candidate model consisting of experimental variables of interest, both parameters and covariates. For each gene, the significance of each term (main effects and interactions) is determined using a Type II sum of squares F-test statistic, which determines which variables elicit a response in the gene expression.
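The idea behind stepwise selection with the Akaike Information Criterion (AIC) can be sketched as follows. This is a simplified, forward-only illustration in Python for a single gene, written under several assumptions: the AIC is taken as n·ln(RSS/n) + 2k for a least-squares fit with k coefficients, the data and variable names are invented, and interactions are ignored. The procedure in Envisage itself may differ in its direction of search and in other details.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(y, columns):
    """Residual sum of squares after fitting y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))

def aic(y, columns):
    """AIC (up to an additive constant) of the least-squares fit."""
    n = len(y)
    k = len(columns) + 1  # coefficients, including the intercept
    return n * math.log(rss(y, columns) / n) + 2 * k

def forward_select(y, candidates):
    """Add, one at a time, the variable that lowers the AIC most; stop when none does."""
    selected, best = [], aic(y, [])
    while True:
        scores = {name: aic(y, [candidates[s] for s in selected + [name]])
                  for name in candidates if name not in selected}
        if not scores:
            break
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        selected.append(name)
        best = scores[name]
    return selected, best

# Hypothetical gene: expression responds to treatment but not to batch.
expression = [5.1, 4.9, 5.0, 5.2, 7.0, 7.1, 6.9, 7.2]
candidates = {
    "treatment": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "batch": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
}

selected, final_aic = forward_select(expression, candidates)
print(selected, round(final_aic, 2))
```

For this made-up gene the treatment term lowers the AIC substantially, while adding batch does not improve the fit enough to justify the extra coefficient, so only treatment is retained.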

This process can therefore be used in three ways:

To check how much of an effect variables other than those of interest have on the expression of genes. For example, we can check whether batching of samples has a significant effect on the gene expression.
To find interesting variables in your data.
To find genes that change based on the variables of interest whilst taking the effects of other variables into account. This is particularly useful for clinical data.

This project has been carried out in conjunction with Agilent Technologies, the producers of Genespring GX. This is one of the most widely used gene expression analysis suites, and is the principal piece of software used in my project. Working together with Helen Bird from Biological Sciences, Heather Turner from the Statistics Department, and Ewan Hunter from Agilent, I have created a program that allows the use of linear models for microarray data analysis. It can be used independently in R (together with relevant packages from Bioconductor), or as an add-on to the Genespring GX analysis suite. Using the program through Genespring provides a simple way of preparing data and curating genes prior to analysis. A graphical user interface has been implemented in Tcl/Tk to make the program as simple as possible to use.

For more information, and to download the necessary files, please see the downloads page.

Sam Robson

Contact Me:

Email:

S.C.Robson@warwick.ac.uk

Address:

MOAC Doctoral Training Centre

Coventry House

University of Warwick

Coventry

CV4 7AL