Talk Abstracts
Most recombination in humans (and many other species) occurs in small regions, typically 1-2 kb in size, called recombination hotspots. Information about these hotspots can be gleaned directly, for example through sperm-typing, or indirectly, through linkage disequilibrium, the correlations between alleles at loci which are physically close to each other. The availability of extensive data documenting genome-wide patterns of variation in human populations, and the development of statistical methods for inferring hotspot locations from these data, have led to the characterisation of over 30,000 human recombination hotspots. Subsequent statistical and bioinformatics analyses identified a 13bp sequence motif associated with hotspot activity. Recently the motif has been shown not to operate in chimpanzees, suggesting rapid evolution of the hotspot motif, and potentially of the associated molecular machinery. Further in silico analyses, and independent experimental work by other groups, identified a zinc-finger protein, PRDM9, which binds to the motif. PRDM9 shows extensive variation across mammals, and within human populations.
We have focussed on genotyping large deletions (> 50bp) from the pilot phase of the 1000 Genomes Project, where 180 samples were sequenced at low coverage (~4X). We propose a probabilistic model that incorporates three distinct sources of information, which can be combined or used separately: the re-mapping of reads to the non-reference allele, the distance between the two mapped mates of a read pair, and the read depth within and outside the deletion breakpoints. SNP genotype information can be used to improve the results. It is also possible to extend the model to genotype other types of SVs such as small indels, large insertions or inversions.
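As a rough illustration (not the authors' implementation), the three evidence sources contribute additively on the log scale, and a prior, for example one derived from SNP genotypes, folds in the same way. The per-genotype log-likelihood values below are purely illustrative:

```python
import math

# Hypothetical sketch: combine three independent evidence sources for a
# deletion genotype (ref/ref, ref/del, del/del). The component
# log-likelihoods (split-read remapping, insert size, read depth) are
# assumed to be pre-computed; here they are just illustrative numbers.

GENOTYPES = ("ref/ref", "ref/del", "del/del")

def genotype_posterior(loglik_split, loglik_insert, loglik_depth, prior=None):
    """Combine per-genotype log-likelihoods from the three signals.

    Each argument maps genotype -> log-likelihood. Treating the signals
    as independent, log-likelihoods simply add; a genotype prior (e.g.
    informed by SNP data) can be folded in the same way.
    """
    prior = prior or {g: math.log(1.0 / 3) for g in GENOTYPES}
    logpost = {g: loglik_split[g] + loglik_insert[g] + loglik_depth[g] + prior[g]
               for g in GENOTYPES}
    norm = max(logpost.values())          # stabilise the exponentiation
    weights = {g: math.exp(lp - norm) for g, lp in logpost.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

The point of the additive structure is that any subset of the three terms can be dropped, which is what allows the signals to be used separately as well as jointly.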
The development of high-throughput sequencing technologies in recent years (Margulies 2005, Bentley 2006, Schuster 2008, Mardis 2008) has led to a massive increase in genomic data represented by counts. These count data are distinct from those acquired using bead and array technologies in that they are fundamentally discrete, rather than continuous, in nature. Rather than measurements of intensity, we acquire counts of the number of times a particular sequence is observed in a library, whether the source is genomic DNA, DNA fragments produced by immunoprecipitation, mRNA or small RNAs.
A key question in the analysis of such data remains: how can we identify elements of the data that show patterns of differential expression that correlate with our knowledge about the biological samples? Here we discuss an empirical Bayes method which performs at least as well as, and often better than, current alternatives. We consider a distribution for a given element of the data defined by a set of underlying parameters for which some prior distribution exists. By estimating this prior distribution from the total data available, then for a given model about the relatedness of our underlying parameters for multiple libraries, we are able to assess the posterior likelihood of the model. This approach allows the simultaneous comparison of multiple models for differential expression between samples and is thus readily applicable to complex experimental designs.
We show that for analyses of differential expression between genomic regions (e.g. genic regions) we can improve the accuracy of predictions further by adapting our methods to include information on the lengths of the genomic regions. Adapting this method further, we are able to distinguish between genomic regions that have genuine association with expressed sequencing reads and those that do not.
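The core of this kind of calculation, comparing the marginal likelihoods of a "same expression" model against a "differential expression" model, can be sketched in the simplest conjugate Poisson-Gamma case. The fixed Gamma(a, b) prior below stands in for the prior that the abstract's method estimates empirically from the full data set:

```python
import math

def log_marginal(counts, a=1.0, b=1.0):
    """Log marginal likelihood of Poisson counts with a Gamma(a, b)
    prior on the rate (a stand-in for an empirically estimated prior)."""
    n, s = len(counts), sum(counts)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + n)
            - sum(math.lgamma(x + 1) for x in counts))

def posterior_prob_DE(group1, group2, p_de=0.5):
    """Posterior probability of the 'differential expression' model
    (independent rates per group) versus a single shared rate."""
    log_same = log_marginal(group1 + group2)
    log_diff = log_marginal(group1) + log_marginal(group2)
    log_odds = math.log(p_de) + log_diff - (math.log(1 - p_de) + log_same)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Because each model is scored by a marginal likelihood, adding further models (for more complex designs) only adds terms to the comparison, which is what makes the approach extend naturally beyond two groups.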
Alena Mysickova and Martin Vingron, "Rank List Aggregation on Small RNA Digital Gene Expression Data"
data arising from these is of significant interest in efforts to calibrate stochastic models of gene expression and obtain information about sources of non-genetic variability.
We present a statistical inference framework that can be used to infer kinetic parameters of biochemical reactions from experimentally measured time series. The linear noise approximation is used to derive an explicit formula for the likelihood of observed fluorescent data. The method is embedded in a Bayesian paradigm, so that certain parameters can be informed from other experiments, allowing portability of results across different studies. Inference is performed using Markov chain Monte Carlo. We demonstrate the applicability of the method using a range of examples that include a model of single gene expression with extrinsic noise, a Bayesian hierarchical model for estimation of degradation rates and stochastic differential equations with bifurcations.
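To illustrate why the linear noise approximation (LNA) yields an explicit likelihood, consider a birth-death model of single-gene expression (production rate k, degradation rate g, our notation, not the talk's). Under the LNA the state is Gaussian, so the mean and variance evolve by simple ODEs and the likelihood of noisy observations has closed form. For simplicity this sketch treats observations as independent given the mean, ignoring the temporal correlations a full LNA likelihood would carry:

```python
import math

def lna_loglik(times, obs, k, g, obs_var=1.0, m0=0.0, v0=0.0, dt=0.01):
    """Gaussian log-likelihood of observations under a birth-death LNA.

    Mean:     dm/dt = k - g*m
    Variance: dv/dt = k + g*m - 2*g*v   (fluctuation equation)
    Measurement noise of variance obs_var is added at each time point.
    """
    loglik, m, v, t = 0.0, m0, v0, 0.0
    for t_obs, y in zip(times, obs):
        while t < t_obs:                  # Euler integration of moment ODEs
            m += dt * (k - g * m)
            v += dt * (k + g * m - 2 * g * v)
            t += dt
        s2 = v + obs_var
        loglik += -0.5 * (math.log(2 * math.pi * s2) + (y - m) ** 2 / s2)
    return loglik
```

Because this function returns a log-likelihood for any (k, g), it can be dropped directly into a Metropolis-Hastings acceptance ratio, which is the sense in which the LNA makes MCMC over kinetic parameters tractable.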
Biomarker discovery is a major challenge in computational biology, with reported biomarkers often being treated with suspicion due to lack of reproducibility. Here we address this problem in the context of a cluster of disease phenotypes that are all characterised by a strong inflammatory response (including HTLV1-associated myelopathy and multiple sclerosis). We develop and discuss a general statistical procedure (stability selection) that allows us to identify reliable proteomic disease markers. Our aim is to find a robust minimal set of predictors (from mass spectrometry data) that permit successful diagnosis of disease outcome. The concept of robustness is here defined by how probable it is that we would select the same predictors if presented with a new data set. As we show, these approaches can result in robust, diagnostically useful biomarkers. We demonstrate that dependencies between predictors affect the robustness of the selected set and argue that stability selection methods allow subsequent experimental investigations to be targeted toward the most clinically useful biomarkers.
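A minimal sketch of the stability selection idea: repeatedly subsample the data, run a base feature selector on each subsample, and keep the features whose selection frequency exceeds a threshold. The base selector here (top-k correlation ranking) is a simple stand-in for the sparse classifier one would apply to real mass-spectrometry data:

```python
import random

def correlation(xs, ys):
    """Pearson correlation, returning 0 for degenerate inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def stability_selection(X, y, k=2, n_subsamples=100, threshold=0.6, seed=0):
    """Return features selected in at least `threshold` of subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)        # subsample half the data
        scores = [abs(correlation([X[i][j] for i in idx],
                                  [y[i] for i in idx])) for j in range(p)]
        for j in sorted(range(p), key=lambda j: -scores[j])[:k]:
            counts[j] += 1
    return [j for j in range(p) if counts[j] / n_subsamples >= threshold]
```

The selection frequency is exactly the operational notion of robustness in the abstract: how often the same predictors would be chosen on a fresh draw of the data.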
Catherine Higham, Christina Cobbold, Daniel Haydon, Darren Monckton, "A hierarchical Bayesian approach to quantify DNA instability in the inherited neuromuscular disease myotonic dystrophy"
However, applying the model to each patient does not provide a basis for inference about the population, so we will present new work that investigates the distribution of DNA instability within the population using a hierarchical Bayesian approach. This is a richer statistical model that will provide more robust prognostic information for patients. DNA instability is also a quantitative trait that could be assessed in terms of its heritability and used as a biomarker to identify any trans-acting genetic, epigenetic or environmental effects. Our expectation is that these trans-acting genetic modifiers will also apply in the general population, where they will affect ageing, cancer, inherited disease and human genetic variation.
In order to give a probabilistic explanation for the rapidity of cis-regulatory evolution, we have addressed the following questions: (1) How long do we have to wait until a given transcription factor (TF) binding site emerges at random in a promoter sequence (by single nucleotide mutations)? and (2) How does the composition of a TF binding site affect this waiting time?
Using a Markovian model of sequence evolution and assuming neutral evolution, we can approximate the expected waiting time for every k-mer (k ranging from 3 to 10) to become fixed in one promoter of a species. The evolutionary rates of nucleotide substitution are estimated from a multiple species alignment (Homo sapiens, Pan troglodytes and Macaca mulatta). Since the CpG methylation deamination process (CG->TG and CG->CA) is the predominant evolutionary substitution process, we have also incorporated these neighbour-dependent substitution rates into our model.
Our findings indicate that new TF binding sites can appear on a small evolutionary time scale: for example, on average we only have to wait around 7,500 years for a given 5-mer to emerge in at least one of all the human promoters, for 8-mers around 350,000 years and for 10-mers about 4.8 million years, i.e. a time span shorter than the divergence time of human and chimpanzee. Furthermore, we can conclude that the composition of a TF binding site plays a crucial role in determining its waiting time, e.g. some particular 10-mers can even be created in only 700,000 years. Our results suggest that the CpG methylation deamination substitution process is probably one of the driving forces in generating new TF binding sites.
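A back-of-the-envelope version of the waiting-time argument: a k-mer is created by a single substitution whenever some promoter window differs from it at exactly one position, so the expected waiting time is roughly the reciprocal of the genome-wide creation rate. All numerical inputs below are illustrative, assuming uniform base composition and a uniform per-site substitution rate, unlike the talk's model, which uses context-dependent rates estimated from primate alignments:

```python
def expected_waiting_time_years(k, n_promoters=20000, promoter_len=1000,
                                mu_per_site_per_year=5e-10):
    """Crude expected waiting time for a given k-mer to appear somewhere
    in the promoterome via a single substitution (illustrative values)."""
    windows = n_promoters * (promoter_len - k + 1)
    # P(a window is exactly one mismatch from the k-mer), uniform bases
    p_one_off = k * (3.0 / 4.0) * (0.25 ** (k - 1))
    # a substitution at the mismatched site hits the right base 1/3 of the time
    creation_rate = windows * p_one_off * mu_per_site_per_year * (1.0 / 3.0)
    return 1.0 / creation_rate
```

The exponential decay of the one-mismatch probability in k is what drives the steep growth of the waiting time from 5-mers to 10-mers, and it is also why site composition (via non-uniform, context-dependent rates) can shift the answer by an order of magnitude.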
In this work we aim to assess the importance of several factors in determining the evolutionary forces acting at each gene in the Saccharomyces cerevisiae genome. We took 39 resequenced S. cerevisiae genomes and calculated Tajima’s D for each gene, producing a genome-wide map of selection in yeast. This statistic was then compared to several predictive variables of each gene in a linear regression to ascertain their ability to explain selection in the genome.
We found that the only significant predictor of Tajima's D is a measure of the time since the last duplication of the gene in question. This is in contrast to previous analyses using the ratio of non-synonymous to synonymous substitutions as their outcome variable, which find factors such as expression level and degree in the protein interaction network to be predictive of evolutionary pressures.
This work shows that recently duplicated genes are under different selective pressures from non-duplicates, showing a tendency to be under positive selection. The results also show that different tests for selection can give distinct and complementary information.
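For reference, the per-gene statistic used here compares the mean number of pairwise differences (pi) with the number of segregating sites (S); a sketch of the standard calculation, not the authors' pipeline, is:

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D from aligned haplotype sequences of equal length.

    Negative values suggest an excess of rare variants (e.g. purifying
    selection or expansion); positive values an excess of
    intermediate-frequency variants (e.g. balancing selection).
    """
    n, L = len(seqs), len(seqs[0])
    S = sum(1 for j in range(L) if len({s[j] for s in seqs}) > 1)
    if S == 0:
        return 0.0
    pi = (sum(sum(a != b for a, b in zip(s1, s2))
              for s1, s2 in combinations(seqs, 2))
          / (n * (n - 1) / 2))
    # standard constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

Applied per gene across 39 genomes, this yields the genome-wide selection map described above.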
More accurate metabolic networks of pathogens and parasites are required to support the identification of important enzymes or transporters that could be potential targets for new drugs. The aim of this work is to build a probabilistic model of metabolic network evolution that can help us reach a new level of quality in automated metabolic network reconstruction.
We focus here on the task of filling pathway holes, given an initial metabolic network as a starting point. For this task we have developed a methodology that, given a set of holes, returns candidate enzymes to fill the gaps. The key point of this methodology is its superfamily perspective, which allows us to search within our target species for remote homologues of domains known to perform the required enzymatic function in other species.
We expand this evolutionary perspective to extract further information that is useful for the ranking of our candidates, including the identification of evolutionary events that may indicate changes in enzymatic function. Other types of information, such as the analysis of network properties, are also taken into account.
We present the results of our method as applied to pathway holes for the human malaria parasite, Plasmodium falciparum.
Yang Luo, Chia Lim, Alfonso Martinez-Arias and Lorenz Wernisch, "Modelling stochastic heterogeneity in cultures of stem cells"
outlier cells with either extremely high or low expression reconstitute the two-peak distribution after several days. Different cell culture media change the dynamics of the system in a way that enables the inference and fitting of dynamical models.
In our study, we analyse this behaviour using two complementary approaches. In the first, a system of stochastic differential equations describes the possible dynamics of the underlying genetic circuit. The second approach models the stochastic behaviour nonparametrically by fitting a Gaussian mixture model of the potential function of the system as well as a stochastic noise term. We use a range of methods for parameter fitting and model selection, such as MCMC, Approximate Bayesian Computation, and nested sampling. We demonstrate that stochastic fluctuations can have both a quantitative and a qualitative impact on the behaviour of gene regulatory networks and are a major source of phenotypic variation.
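The mechanism by which noise regenerates a two-peak distribution can be illustrated with a minimal Euler-Maruyama simulation of a one-dimensional SDE with a double-well potential U(x) = x^4/4 - x^2/2: noise-driven transitions between the two wells make the population distribution bimodal regardless of the starting state. The potential and noise level here are illustrative choices, not the fitted model from the talk:

```python
import math
import random

def simulate(x0=0.0, sigma=0.6, dt=0.01, steps=200000, seed=0):
    """Euler-Maruyama simulation of dx = (x - x^3) dt + sigma dW."""
    rng = random.Random(seed)
    x, traj = x0, []
    for _ in range(steps):
        drift = x - x ** 3               # -dU/dx for the double-well potential
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj
```

A histogram of the returned trajectory shows two modes near x = -1 and x = +1; a cell started at an extreme value relaxes to this same bimodal distribution, mirroring the reconstitution behaviour described above.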
We present a Bayesian inference methodology for learning the structure and the parameters of a transcription regulatory network where multiple transcription factors (TFs) can jointly regulate target genes. Gaussian process priors are used for the TF protein activity functions and spike and slab sparse priors are assigned to connectivity weights. Fully Bayesian inference is considered through the use of MCMC by exploiting the information given by a finite set of noisy observations of the gene expression functions.
Inference of the TF activities can be improved by incorporating observations of the corresponding gene expression, when these are available and the TF is itself under transcriptional control. This is achieved through the use of a protein translation ODE model, which allows these observations to be incorporated into the Bayesian estimation framework.
The above model can be used i) for systems identification of moderate-sized networks (e.g. consisting of 5 TFs and 100 possible target genes) and ii) for genome-wide transcription factor target identification, where the model is first fitted to a small set of training genes and then used to identify the connections in test genes by computing Bayesian predictive distributions.
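The spike and slab idea can be sketched on a single connectivity weight w: a priori w = 0 with probability 1 - p ("spike"), and w ~ N(0, s2) with probability p ("slab"). With a Gaussian likelihood y_i = w x_i + noise, both component evidences are available in closed form, giving the posterior inclusion probability of the edge. This one-weight calculation is only the building block; the talk's model couples many such weights with GP priors on the TF activities, and all values below are illustrative:

```python
import math

def inclusion_probability(x, y, noise_var=1.0, slab_var=1.0, p=0.5):
    """Posterior probability that w != 0 under a spike-and-slab prior."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    syy = sum(yi * yi for yi in y)
    n = len(x)
    # log evidence under the spike (w = 0)
    log_spike = -0.5 * (n * math.log(2 * math.pi * noise_var) + syy / noise_var)
    # log evidence under the slab: w integrated out of the Gaussian model
    prec = sxx / noise_var + 1.0 / slab_var
    log_slab = (-0.5 * n * math.log(2 * math.pi * noise_var)
                - 0.5 * math.log(slab_var * prec)
                - 0.5 * (syy / noise_var - (sxy / noise_var) ** 2 / prec))
    log_odds = math.log(p) - math.log(1 - p) + log_slab - log_spike
    return 1.0 / (1.0 + math.exp(-log_odds))
```

In the full MCMC scheme, conditional updates of this form are what let the sampler switch individual TF-gene connections on and off while learning the network structure.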
Frank Dondelinger, Sophie Lebre and Dirk Husmeier, "Inferring Developmental Gene Networks using Heterogeneous Dynamic Bayesian Networks with Information Sharing"
We evaluate our method on synthetic data, and show that it outperforms the unconstrained method without information sharing. We also investigate the differences between a global information sharing scheme (taking all segments into account) and a sequential information sharing scheme (where only information from the previous segment is propagated forward). We then apply our method to the problem of inferring the gene regulation networks for the embryo, larva, pupa and adult stages of Drosophila melanogaster from a muscle development gene expression time series, inferring both the change points and the network structure. We compare the results with those obtained by alternative published methods, and note that we get better agreement with the known morphogenic transitions. Furthermore, the changes we have detected in the gene regulatory interactions are consistent with independent biological findings.
Bayesian inference involving large numbers of correlated parameters can be difficult for both deterministic and stochastic models. Here we present methods to utilise the correlation structure of the posterior distribution - a generic feature of sloppy systems - in order to create more efficient Monte Carlo samplers. These methods are presented here in an approximate Bayesian framework but are applicable more widely to MCMC and SMC methods. We show that it is possible to develop improved perturbation kernels for sloppy systems and discuss their use in the context of p53 oscillations and MAPK phosphorylation dynamics.
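One simple way to exploit posterior correlation, sketched here in two dimensions and not necessarily the kernels from the talk, is a covariance-matched perturbation kernel for ABC-SMC: rather than perturbing each parameter independently, estimate the covariance of the current accepted particles (which captures the sloppy, correlated directions) and propose along it:

```python
import math
import random

def covariance(particles):
    """Sample covariance (cxx, cxy, cyy) of a list of 2D particles."""
    n = len(particles)
    mx = sum(p[0] for p in particles) / n
    my = sum(p[1] for p in particles) / n
    cxx = sum((p[0] - mx) ** 2 for p in particles) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in particles) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in particles) / (n - 1)
    return cxx, cxy, cyy

def perturb(particle, cov, rng, scale=2.0):
    """Propose from N(particle, scale * cov) via a 2x2 Cholesky factor."""
    cxx, cxy, cyy = (scale * c for c in cov)
    l11 = math.sqrt(cxx)
    l21 = cxy / l11
    l22 = math.sqrt(cyy - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (particle[0] + l11 * z1, particle[1] + l21 * z1 + l22 * z2)
```

Proposals aligned with the stiff/sloppy axes of the posterior waste far fewer samples on directions the data already constrain tightly, which is the efficiency gain at stake.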
The success of current methods for estimating Lyapunov exponents (LEs) from observed time series relies on the availability of copious and virtually noise-free observations. This explains their apparent underuse in the analysis of biological system dynamics, where the associated time series are often both “short” and “noisy”.
With reference to classical and biological dynamical systems, we identify the key sources of error in a standard LE estimation procedure. We then show that it is possible to estimate LEs of such systems, and their uncertainty, within a Bayesian framework. Here we can employ the flexibility of state-space models and obtain reliable estimates of both the noiseless states and parameters of the (embedded) system using a dual Unscented Kalman Filter. We will discuss how these methods can be applied in the context of innate immune response signalling (HES-1 and P38 MAP Kinase).
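A minimal illustration of why LE estimation is so data-hungry: for the logistic map x -> r x (1 - x), the largest Lyapunov exponent is the long-run average of log |f'(x)| along a trajectory (log 2 at r = 4). With a long, clean series the estimate converges; short or noisy series of the kind typical in biology degrade it badly, which is what motivates the state-space/UKF treatment in the talk. The toy estimator below assumes noise-free access to the states and the known map:

```python
import math

def logistic_le(r=4.0, x0=0.3, n=100000, burn_in=1000):
    """Largest Lyapunov exponent of the logistic map from a clean orbit."""
    x = x0
    for _ in range(burn_in):             # discard the transient
        x = r * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        acc += math.log(abs(r * (1 - 2 * x)))   # log |f'(x)| along the orbit
        x = r * x * (1 - x)
    return acc / n
```

Truncating n to a few dozen points, or adding measurement noise to x before taking the derivative, quickly destroys the estimate; the Bayesian state-space approach instead infers the noiseless states and parameters jointly and propagates the resulting uncertainty into the LE.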