Using R to draw a Heatmap from Microarray Data
The first section of this page uses R to analyse an Acute lymphocytic leukemia (ALL) microarray dataset, producing a heatmap (with dendrograms) of genes differentially expressed between two types of leukemia.
There is a follow on page dealing with how to do this from Python using RPy.
The original citation for the raw data is "Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival" by Chiaretti et al. Blood 2004. (PMID: 14684422)
The analysis is a "step by step" recipe based on this paper, Bioconductor: open software development for computational biology and bioinformatics, Gentleman et al. 2004. Their Figure 2 Heatmap, which we recreate, is reproduced here:
Heatmaps from R
Assuming you have a recent version of R (from The R Project) and BioConductor (see Windows XP installation instructions), the example dataset can be downloaded as the BioConductor ALL package.
You should be able to install this from within R as follows:
> source("http://www.bioconductor.org/biocLite.R") > biocLite("ALL") Running bioCLite version 0.1 with R version 2.1.1 ...
Alternatively, you can download the package by hand from here or here.
If you are using Windows, download ALL_1.0.2.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.
On Linux, download ALL_1.0.2.tar.gz (or similar) and use sudo R CMD INSTALL ALL_1.0.2.tar.gz at the command prompt.
With that out of the way, you should be able to start R and load this package with the library and data commands:> library("ALL") Loading required package: Biobase Loading required package: tools Welcome to Bioconductor Vignettes contain introductory material. To view, simply type: openVignette() For details on reading vignettes, see the openVignette help page. > data("ALL")
If you inspect the resulting ALL variable, it contains 128 samples with 12625 genes, and associated phenotypic data.
> ALL Expression Set (exprSet) with 12625 genes 128 samples phenoData object with 21 variables and 128 cases varLabels cod: Patient ID diagnosis: Date of diagnosis sex: Gender of the patient age: Age of the patient at entry BT: does the patient have B-cell or T-cell ALL remission: Complete remission(CR), refractory(REF) or NA. Derived from CR CR: Original remisson data date.cr: Date complete remission if achieved t(4;11): did the patient have t(4;11) translocation. Derived from citog t(9;22): did the patient have t(9;22) translocation. Derived from citog cyto.normal: Was cytogenetic test normal? Derived from citog citog: original citogenetics data, deletions or t(4;11), t(9;22) status mol.biol: molecular biology fusion protein: which of p190, p210 or p190/210 for bcr/able mdr: multi-drug resistant kinet: ploidy: either diploid or hyperd. ccr: Continuous complete remission? Derived from f.u relapse: Relapse? Derived from f.u transplant: did the patient receive a bone marrow transplant? Derived from f.u f.u: follow up data available date last seen: date patient was last seen
We can looks at the results of molecular biology testing for the 128 samples:
> ALL$mol.biol [1] BCR/ABL NEG BCR/ABL ALL1/AF4 NEG NEG NEG NEG NEG [10] BCR/ABL BCR/ABL NEG E2A/PBX1 NEG BCR/ABL NEG BCR/ABL BCR/ABL [19] BCR/ABL BCR/ABL NEG BCR/ABL BCR/ABL NEG ALL1/AF4 BCR/ABL ALL1/AF4 ... [127] NEG NEG Levels: ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
Ignoring the samples which came back negative on this test (NEG), most have been classified as having a translocation between chromosomes 9 and 22 (BCR/ABL), or a translocation between chromosomes 4 and 11 (ALL1/AF4).
For the purposes of this example, we are only interested in these two subgroups, so we will create a filtered version of the dataset using this as a selection criteria:
> eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")]
The resulting variable, eset, contains just 47 samples - each with the full 12,625 gene expression levels.
This is far too much data to draw a heatmap with, but we can do one for the first 100 genes as follows:
> heatmap(exprs(eset[1:100,]))
According to the BioConductor paper we are following, the next step in the analysis was to use the lmFit function (from the limma package) to look for genes differentially expressed between the two groups. The fitted model object is further processed by the eBayes function to produce empirical Bayes test statistics for each gene, including moderated t-statistics, p-values and log-odds of differential expression.
> library(limma) > f <- factor(as.character(eset$mol.biol)) > design <- model.matrix(~f) > fit <- eBayes(lmFit(eset,design))
If the limma package isn't installed, you'll need to install it first:
> source("http://www.bioconductor.org/biocLite.R") > biocLite("limma") Running bioCLite version 0.1 with R version 2.1.1 ...
We can now reproduce Figure 1 from the paper.
> topTable(fit, coef=2) ID M A t P.Value B 1016 1914_at -3.076231 4.611284 -27.49860 5.887581e-27 56.32653 7884 37809_at -3.971906 4.864721 -19.75478 1.304570e-20 44.23832 6939 36873_at -3.391662 4.284529 -19.61497 1.768670e-20 43.97298 10865 40763_at -3.086992 3.474092 -17.00739 7.188381e-18 38.64615 4250 34210_at 3.618194 8.438482 15.45655 3.545401e-16 35.10692 11556 41448_at -2.500488 3.733012 -14.83924 1.802456e-15 33.61391 3389 33358_at -2.269730 5.191015 -12.96398 3.329289e-13 28.76471 8054 37978_at -1.036051 6.937965 -10.48777 6.468996e-10 21.60216 10579 40480_s_at 1.844998 7.826900 10.38214 9.092033e-10 21.27732 330 1307_at 1.583904 4.638885 10.25731 1.361875e-09 20.89145
The leftmost numbers are row indices, ID is the Affymetrix HGU95av2 accession number, M is the log ratio of expression, A is the log average expression, t the moderated t-statistic, and B is the log odds of differential expression.
Next, we select those genes that have adjusted p-values below 0.05, using a very stringent Holm method to select a small number (165) of genes.
> selected <- p.adjust(fit$p.value[, 2]) <0.05 > esetSel <- eset [selected, ]
The variable esetSel has data on (only) 165 genes for all 47 samples . We can easily produce a heatmap as follows (in R-2.1.1 this defaults to a yellow/red "heat" colour scheme):
> heatmap(exprs(esetSel))
If you have the topographical colours installed (yellow-green-blue), you can do this:
> heatmap(exprs(esetSel), col=topo.colors(100))
This is getting very close to Gentleman et al.'s Figure 2, except they have added a red/blue banner across the top to really emphasize how the hierarchical clustering has correctly split the data into the two groups (10 and 37 patients).
To do that, we can use the heatmap function's optional argument of ColSideColors. I created a small function to map the eselSet$mol.biol values to red (#FF0000) and blue (#0000FF), which we can apply to each of the molecular biology results to get a matching list of colours for our columns:
> color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" } > patientcolors <- unlist(lapply(esetSel$mol.bio, color.map)) > heatmap(exprs(esetSel), col=topo.colors(100), ColSideColors=patientcolors)
Looks pretty close now, doesn't it:
To recap, this is "all" we needed to type into R to achieve this:
library("ALL") data("ALL") eset <- ALL[, ALL$mol.biol %in% c("BCR/ABL", "ALL1/AF4")] library("limma") f <- factor(as.character(eset$mol.biol)) design <- model.matrix(~f) fit <- eBayes(lmFit(eset,design)) selected <- p.adjust(fit$p.value[, 2]) <0.05 esetSel <- eset [selected, ] color.map <- function(mol.biol) { if (mol.biol=="ALL1/AF4") "#FF0000" else "#0000FF" } patientcolors <- unlist(lapply(esetSel$mol.bio, color.map)) heatmap(exprs(esetSel), col=topo.colors(100), ColSideColors=patientcolors)
Heatmaps in R - More Options
One subtle point in the previous examples is that the heatmap function has automatically scaled the colours for each row (i.e. each gene has been individually normalised across patients). This can be disabled using scale="none", which you might want to do if you have already done your own normalisation (or this may not be appropriate for your data):
heatmap(exprs(esetSel), col=topo.colors(75), scale="none", ColSideColors=patientcolors, cexRow=0.5)
You might also have noticed in the above snippet, that I have shrunk the row captions which were so big they overlapped each other. The relevant options are cexRow and cexCol.
So far so good - but what if you wanted a key to the colours shown? The heatmap function doesn't offer this, but the good news is that heatmap.2 from the gplots library does. In fact, it offers a lot of other features, many of which I deliberately turn off in the following example:
library("gplots") heatmap.2(exprs(esetSel), col=topo.colors(75), scale="none", ColSideColors=patientcolors, key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)
By default, heatmap.2 will also show a trace on each data point (removed this with trace="none"). If you ask for a key (using key=TRUE) this function will actually give you a combined "color key and histogram", but that can be overridden (with density.info="none").
Don't like the colour scheme? Try using the functions bluered/redblue for a red-white-blue spread, or redgreen/greenred for the red-black-green colour scheme often used with two-colour microarrays:
library("gplots") heatmap.2(exprs(esetSel), col=redgreen(75), scale="row", ColSideColors=patientcolors, key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.5)
Heatmaps from Python
So, how can we do that from within Python? One way is using RPy (R from Python), and this is discussed on this page.
P.S. If you want to use heatmap.2 from within python using RPy, use the syntax heatmap_2 due to the differences in how R and Python handle full stops and underscores.
What about other microarray data?
Well, I have also documented how you can load NCBI GEO SOFT files into R as a BioConductor expression set object. As long as you can get your data into R as a matrix or data frame, converting it into an exprSet shouldn't be too hard.
This was all written and tested using Windows XP and R-2.1.1, but it should all work on other platforms.