Talk Abstracts
Most recombination in humans (and many other species) occurs in small regions, typically 1-2 kb in size, called recombination hotspots. Information about these hotspots can be gleaned directly, for example through sperm-typing, or indirectly, through linkage disequilibrium, the correlations between alleles at loci which are physically close to each other. The availability of extensive data documenting genome-wide patterns of variation in human populations, and the development of statistical methods for inferring hotspot locations from these data, have led to the characterisation of over 30,000 human recombination hotspots. Subsequent statistical and bioinformatics analyses identified a 13bp sequence motif associated with hotspot activity. Recently the motif has been shown not to operate in chimpanzees, suggesting rapid evolution of the hotspot motif, and potentially of the associated molecular machinery. Further in silico analyses, and independent experimental work by other groups, identified a zinc-finger protein, PRDM9, which binds to the motif. PRDM9 shows extensive variation across mammals, and within human populations.
We have focussed on genotyping large deletions (> 50bp) from the pilot phase of the 1000 Genomes Project, where 180 samples were sequenced at low coverage (~4X). We propose a probabilistic model that incorporates three distinct sources of information, which can be combined or used separately: the re-mapping of reads to the non-reference allele, the distance between the two mapped mates of a read pair, and the read depth within and outside the deletion breakpoints. SNP genotype information can be used to improve the results. It is also possible to extend the model to genotype other types of SVs such as small indels, large insertions or inversions.
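As a rough illustration (not the authors' implementation), the three evidence sources contribute additively on the log scale, and a prior, for example one derived from SNP genotypes, folds in the same way. The per-genotype log-likelihood values below are purely illustrative:

```python
import math

# Hypothetical sketch: combine three independent evidence sources for a
# deletion genotype (ref/ref, ref/del, del/del). The component
# log-likelihoods (split-read remapping, insert size, read depth) are
# assumed to be pre-computed; here they are just illustrative numbers.

GENOTYPES = ("ref/ref", "ref/del", "del/del")

def genotype_posterior(loglik_split, loglik_insert, loglik_depth, prior=None):
    """Combine per-genotype log-likelihoods from the three signals.

    Each argument maps genotype -> log-likelihood. Treating the signals
    as independent, log-likelihoods simply add; a genotype prior (e.g.
    informed by SNP data) can be folded in the same way.
    """
    prior = prior or {g: math.log(1.0 / 3) for g in GENOTYPES}
    logpost = {g: loglik_split[g] + loglik_insert[g] + loglik_depth[g] + prior[g]
               for g in GENOTYPES}
    norm = max(logpost.values())          # stabilise the exponentiation
    weights = {g: math.exp(lp - norm) for g, lp in logpost.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

The point of the additive structure is that any subset of the three terms can be dropped, which is what allows the signals to be used separately as well as jointly.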
The development of high-throughput sequencing technologies in recent years (Margulies 2005, Bentley 2006, Schuster 2008, Mardis 2008) has led to a massive increase in genomic data represented by counts. These count data are distinct from those acquired using bead and array technologies in that they are fundamentally discrete, rather than continuous, in nature. Rather than measurements of intensity, we acquire counts of the number of times a particular sequence is observed in a library, whether the source is genomic DNA, DNA fragments produced by immunoprecipitation, mRNA or small RNAs.
A key question in the analysis of such data remains: how can we identify elements of the data that show patterns of differential expression that correlate with our knowledge about the biological samples? Here we discuss an empirical Bayes method which performs at least as well as, and often better than, current alternatives. We consider a distribution for a given element of the data defined by a set of underlying parameters for which some prior distribution exists. By estimating this prior distribution from the total data available, then for a given model about the relatedness of our underlying parameters for multiple libraries, we are able to assess the posterior likelihood of the model. This approach allows the simultaneous comparison of multiple models for differential expression between samples and is thus readily applicable to complex experimental designs.
We show that for analyses of differential expression between genomic regions (e.g. genic regions) we can improve the accuracy of predictions further by adapting our methods to include information on the lengths of the genomic regions. Adapting this method further, we are able to distinguish between genomic regions that have genuine association with expressed sequencing reads and those that do not.
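The core of this kind of calculation, comparing the marginal likelihoods of a "same expression" model against a "differential expression" model, can be sketched in the simplest conjugate Poisson-Gamma case. The fixed Gamma(a, b) prior below stands in for the prior that the abstract's method estimates empirically from the full data set:

```python
import math

def log_marginal(counts, a=1.0, b=1.0):
    """Log marginal likelihood of Poisson counts with a Gamma(a, b)
    prior on the rate (a stand-in for an empirically estimated prior)."""
    n, s = len(counts), sum(counts)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + n)
            - sum(math.lgamma(x + 1) for x in counts))

def posterior_prob_DE(group1, group2, p_de=0.5):
    """Posterior probability of the 'differential expression' model
    (independent rates per group) versus a single shared rate."""
    log_same = log_marginal(group1 + group2)
    log_diff = log_marginal(group1) + log_marginal(group2)
    log_odds = math.log(p_de) + log_diff - (math.log(1 - p_de) + log_same)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Because each model is scored by a marginal likelihood, adding further models (for more complex designs) only adds terms to the comparison, which is what makes the approach extend naturally beyond two groups.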
Alena Mysickova and Martin Vingron, "Rank List Aggregation on Small RNA Digital Gene Expression Data"
data arising from these is of significant interest in efforts to calibrate stochastic models of gene expression and obtain information about sources of non-genetic variability.
We present a statistical inference framework that can be used to infer kinetic parameters of biochemical reactions from experimentally measured time series. The linear noise approximation is used to derive an explicit formula for the likelihood of observed fluorescent data. The method is embedded in a Bayesian paradigm, so that certain parameters can be informed from other experiments, allowing portability of results across different studies. Inference is performed using Markov chain Monte Carlo. We demonstrate the applicability of the method using a range of examples that include a model of single gene expression with extrinsic noise, a Bayesian hierarchical model for estimation of degradation rates and stochastic differential equations with bifurcations.
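To illustrate why the linear noise approximation (LNA) yields an explicit likelihood, consider a birth-death model of single-gene expression (production rate k, degradation rate g, our notation, not the talk's). Under the LNA the state is Gaussian, so the mean and variance evolve by simple ODEs and the likelihood of noisy observations has closed form. For simplicity this sketch treats observations as independent given the mean, ignoring the temporal correlations a full LNA likelihood would carry:

```python
import math

def lna_loglik(times, obs, k, g, obs_var=1.0, m0=0.0, v0=0.0, dt=0.01):
    """Gaussian log-likelihood of observations under a birth-death LNA.

    Mean:     dm/dt = k - g*m
    Variance: dv/dt = k + g*m - 2*g*v   (fluctuation equation)
    Measurement noise of variance obs_var is added at each time point.
    """
    loglik, m, v, t = 0.0, m0, v0, 0.0
    for t_obs, y in zip(times, obs):
        while t < t_obs:                  # Euler integration of moment ODEs
            m += dt * (k - g * m)
            v += dt * (k + g * m - 2 * g * v)
            t += dt
        s2 = v + obs_var
        loglik += -0.5 * (math.log(2 * math.pi * s2) + (y - m) ** 2 / s2)
    return loglik
```

Because this function returns a log-likelihood for any (k, g), it can be dropped directly into a Metropolis-Hastings acceptance ratio, which is the sense in which the LNA makes MCMC over kinetic parameters tractable.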
Biomarker discovery is a major challenge in computational biology, with reported biomarkers often being treated with suspicion due to lack of reproducibility. Here we address this problem in the context of a cluster of disease phenotypes that are all characterised by a strong inflammatory response (including HTLV1-associated myelopathy and multiple sclerosis). We develop and discuss a general statistical procedure (stability selection) that allows us to identify reliable proteomic disease markers. Our aim is to find a robust minimal set of predictors (from mass spectrometry data) that permit successful diagnosis of disease outcome. The concept of robustness is here defined by how probable it is that we would select the same predictors if presented with a new data set. As we show, these approaches can result in robust, diagnostically useful biomarkers. We demonstrate that dependencies between predictors affect the robustness of the selected set and argue that stability selection methods allow subsequent experimental investigations to be targeted toward the most clinically useful biomarkers.
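A minimal sketch of the stability selection idea: repeatedly subsample the data, run a base feature selector on each subsample, and keep the features whose selection frequency exceeds a threshold. The base selector here (top-k correlation ranking) is a simple stand-in for the sparse classifier one would apply to real mass-spectrometry data:

```python
import random

def correlation(xs, ys):
    """Pearson correlation, returning 0 for degenerate inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def stability_selection(X, y, k=2, n_subsamples=100, threshold=0.6, seed=0):
    """Return features selected in at least `threshold` of subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)        # subsample half the data
        scores = [abs(correlation([X[i][j] for i in idx],
                                  [y[i] for i in idx])) for j in range(p)]
        for j in sorted(range(p), key=lambda j: -scores[j])[:k]:
            counts[j] += 1
    return [j for j in range(p) if counts[j] / n_subsamples >= threshold]
```

The selection frequency is exactly the operational notion of robustness in the abstract: how often the same predictors would be chosen on a fresh draw of the data.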
Catherine Higham, Christina Cobbold, Daniel Haydon, Darren Monckton, "A hierarchical Bayesian approach to quantify DNA instability in the inherited neuromuscular disease myotonic dystrophy"
However, applying the model to each patient does not provide a basis for inference about the population, so we will present new work that investigates the distribution of DNA instability within the population using a hierarchical Bayesian approach. This is a richer statistical model that will provide more robust prognostic information for patients. DNA instability is also a quantitative trait that could be assessed in terms of its heritability and used as a biomarker to identify any trans-acting genetic, epigenetic or environmental effects. Our expectation is that these trans-acting genetic modifiers will also apply in the general population, where they will affect ageing, cancer, inherited disease and human genetic variation.
In order to give a probabilistic explanation for the rapidity of cis-regulatory evolution, we have addressed the following questions: (1) How long do we have to wait until a given transcription factor (TF) binding site emerges at random in a promoter sequence (by single nucleotide mutations)? and (2) How does the composition of a TF binding site affect this waiting time?
Using a Markovian model of sequence evolution and assuming neutral evolution, we can approximate the expected waiting time for every k-mer (k ranging from 3 to 10) to become fixed in one promoter of a species. The evolutionary rates of nucleotide substitution are estimated from a multiple species alignment (Homo sapiens, Pan troglodytes and Macaca mulatta). Since the CpG methylation deamination process (CG->TG and CG->CA) is the predominant evolutionary substitution process, we have also incorporated these neighbour-dependent substitution rates into our model.
Our findings indicate that new TF binding sites can appear on a small evolutionary time scale: for example, on average we only have to wait around 7,500 years for a given 5-mer to emerge in at least one of all the human promoters, for 8-mers around 350,000 years and for 10-mers about 4.8 million years, i.e. a time span shorter than the divergence time of human and chimpanzee. Furthermore, we can conclude that the composition of a TF binding site plays a crucial role in determining its waiting time, e.g. some particular 10-mers can even be created in only 700,000 years. Our results suggest that the CpG methylation deamination substitution process is probably one of the driving forces in generating new TF binding sites.
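A back-of-the-envelope version of the waiting-time argument: a k-mer is created by a single substitution whenever some promoter window differs from it at exactly one position, so the expected waiting time is roughly the reciprocal of the genome-wide creation rate. All numerical inputs below are illustrative, assuming uniform base composition and a uniform per-site substitution rate, unlike the talk's model, which uses context-dependent rates estimated from primate alignments:

```python
def expected_waiting_time_years(k, n_promoters=20000, promoter_len=1000,
                                mu_per_site_per_year=5e-10):
    """Crude expected waiting time for a given k-mer to appear somewhere
    in the promoterome via a single substitution (illustrative values)."""
    windows = n_promoters * (promoter_len - k + 1)
    # P(a window is exactly one mismatch from the k-mer), uniform bases
    p_one_off = k * (3.0 / 4.0) * (0.25 ** (k - 1))
    # a substitution at the mismatched site hits the right base 1/3 of the time
    creation_rate = windows * p_one_off * mu_per_site_per_year * (1.0 / 3.0)
    return 1.0 / creation_rate
```

The exponential decay of the one-mismatch probability in k is what drives the steep growth of the waiting time from 5-mers to 10-mers, and it is also why site composition (via non-uniform, context-dependent rates) can shift the answer by an order of magnitude.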
In this work we aim to assess the importance of several factors in determining the evolutionary forces acting at each gene in the Saccharomyces cerevisiae genome. We took 39 resequenced S. cerevisiae genomes and calculated Tajima’s D for each gene, producing a genome-wide map of selection in yeast. This statistic was then compared to several predictive variables of each gene in a linear regression to ascertain their ability to explain selection in the genome.
We found that the only significant predictor of Tajima's D is a measure of the time since the last duplication of the gene in question. This is in contrast to previous analyses using the ratio of non-synonymous to synonymous substitutions as their outcome variable, which find factors such as expression level and degree in the protein interaction network to be predictive of evolutionary pressures.
This work shows that recently duplicated genes are under different selective pressures from non-duplicates, showing a tendency to be under positive selection. The results also show that different tests for selection can give distinct and complementary information.
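For reference, the per-gene statistic used here compares the mean number of pairwise differences (pi) with the number of segregating sites (S); a sketch of the standard calculation, not the authors' pipeline, is:

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D from aligned haplotype sequences of equal length.

    Negative values suggest an excess of rare variants (e.g. purifying
    selection or expansion); positive values an excess of
    intermediate-frequency variants (e.g. balancing selection).
    """
    n, L = len(seqs), len(seqs[0])
    S = sum(1 for j in range(L) if len({s[j] for s in seqs}) > 1)
    if S == 0:
        return 0.0
    pi = (sum(sum(a != b for a, b in zip(s1, s2))
              for s1, s2 in combinations(seqs, 2))
          / (n * (n - 1) / 2))
    # standard constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

Applied per gene across 39 genomes, this yields the genome-wide selection map described above.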
More accurate metabolic networks of pathogens and parasites are required to support the identification of important enzymes or transporters that could be potential targets for new drugs. The aim of this work is to build a probabilistic model of metabolic network evolution that can help us reach a new level of quality in automated metabolic network reconstruction.
We focus here on the task of filling pathway holes, given an initial metabolic network as a starting point. For this task we have developed a methodology that, given a set of holes, returns candidate enzymes to fill the gaps. The key point of this methodology is its superfamily perspective, which allows us to search within our target species for remote homologues of domains known to perform the required enzymatic function in other species.
We expand this evolutionary perspective to extract further information that is useful for the ranking of our candidates, including the identification of evolutionary events that may indicate changes in enzymatic function. Other types of information, such as the analysis of network properties, are also taken into account.
We present the results of our method as applied to pathway holes for the human malaria parasite, Plasmodium falciparum.
Yang Luo, Chia Lim, Alfonso Martinez-Arias and Lorenz Wernisch, "Modelling stochastic heterogeneity in cultures of stem cells"
outlier cells with either extremely high or low expression reconstitute the two-peak distribution after several days. Different cell culture media change the dynamics of the system in a way that enables the inference and fitting of dynamical models.
In our study, we analyse this behaviour using two complementary approaches. In the first, a system of stochastic differential equations describes the possible dynamics of the underlying genetic circuit. The second approach models the stochastic behaviour nonparametrically by fitting a Gaussian mixture model of the potential function of the system as well as a stochastic noise term. We use a range of methods for parameter fitting and model selection, such as MCMC, Approximate Bayesian Computation, and nested sampling. We demonstrate that stochastic fluctuations can have both a quantitative and a qualitative impact on the behaviour of gene regulatory networks and are a major source of phenotypic variation.
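The mechanism by which noise regenerates a two-peak distribution can be illustrated with a minimal Euler-Maruyama simulation of a one-dimensional SDE with a double-well potential U(x) = x^4/4 - x^2/2: noise-driven transitions between the two wells make the population distribution bimodal regardless of the starting state. The potential and noise level here are illustrative choices, not the fitted model from the talk:

```python
import math
import random

def simulate(x0=0.0, sigma=0.6, dt=0.01, steps=200000, seed=0):
    """Euler-Maruyama simulation of dx = (x - x^3) dt + sigma dW."""
    rng = random.Random(seed)
    x, traj = x0, []
    for _ in range(steps):
        drift = x - x ** 3               # -dU/dx for the double-well potential
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj
```

A histogram of the returned trajectory shows two modes near x = -1 and x = +1; a cell started at an extreme value relaxes to this same bimodal distribution, mirroring the reconstitution behaviour described above.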
We present a Bayesian inference methodology for learning the structure and the parameters of a transcription regulatory network where multiple transcription factors (TFs) can jointly regulate target genes. Gaussian process priors are used for the TF protein activity functions and spike and slab sparse priors are assigned to connectivity weights. Fully Bayesian inference is considered through the use of MCMC by exploiting the information given by a finite set of noisy observations of the gene expression functions.
Inference of the TF activities can be improved by incorporating observations of the corresponding gene expression, when these are available and the TF is itself under transcriptional control. This is achieved through the use of a protein translation ODE model, which allows these observations to be incorporated into the Bayesian estimation framework.
The above model can be used i) for systems identification of moderate-sized networks (e.g. consisting of 5 TFs and 100 possible target genes) and ii) for genome-wide transcription factor target identification, where the model is first fitted to a small set of training genes and then used to identify the connections in test genes by computing Bayesian predictive distributions.
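The spike and slab idea can be sketched on a single connectivity weight w: a priori w = 0 with probability 1 - p ("spike"), and w ~ N(0, s2) with probability p ("slab"). With a Gaussian likelihood y_i = w x_i + noise, both component evidences are available in closed form, giving the posterior inclusion probability of the edge. This one-weight calculation is only the building block; the talk's model couples many such weights with GP priors on the TF activities, and all values below are illustrative:

```python
import math

def inclusion_probability(x, y, noise_var=1.0, slab_var=1.0, p=0.5):
    """Posterior probability that w != 0 under a spike-and-slab prior."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    syy = sum(yi * yi for yi in y)
    n = len(x)
    # log evidence under the spike (w = 0)
    log_spike = -0.5 * (n * math.log(2 * math.pi * noise_var) + syy / noise_var)
    # log evidence under the slab: w integrated out of the Gaussian model
    prec = sxx / noise_var + 1.0 / slab_var
    log_slab = (-0.5 * n * math.log(2 * math.pi * noise_var)
                - 0.5 * math.log(slab_var * prec)
                - 0.5 * (syy / noise_var - (sxy / noise_var) ** 2 / prec))
    log_odds = math.log(p) - math.log(1 - p) + log_slab - log_spike
    return 1.0 / (1.0 + math.exp(-log_odds))
```

In the full MCMC scheme, conditional updates of this form are what let the sampler switch individual TF-gene connections on and off while learning the network structure.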
Frank Dondelinger, Sophie Lebre and Dirk Husmeier, "Inferring Developmental Gene Networks using Heterogeneous Dynamic Bayesian Networks with Information Sharing"
We evaluate our method on synthetic data, and show that it outperforms the unconstrained method without information sharing. We also investigate the differences between a global information sharing scheme (taking all segments into account) and a sequential information sharing scheme (where only information from the previous segment is propagated forward). We then apply our method to the problem of inferring the gene regulation networks for the embryo, larva, pupa and adult stages of Drosophila melanogaster from a muscle development gene expression time series, inferring both the change points and the network structure. We compare the results with those obtained by alternative published methods, and note that we get better agreement with the known morphogenic transitions. Furthermore, the changes we have detected in the gene regulatory interactions are consistent with independent biological findings.
Bayesian inference involving large numbers of correlated parameters can be difficult for both deterministic and stochastic models. Here we present methods to utilise the correlation structure of the posterior distribution - a generic feature of sloppy systems - in order to create more efficient Monte Carlo samplers. These methods are presented here in an approximate Bayesian framework but are applicable more widely to MCMC and SMC methods. We show that it is possible to develop improved perturbation kernels for sloppy systems and discuss their use in the context of p53 oscillations and MAPK phosphorylation dynamics.
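One simple way to exploit posterior correlation, sketched here in two dimensions and not necessarily the kernels from the talk, is a covariance-matched perturbation kernel for ABC-SMC: rather than perturbing each parameter independently, estimate the covariance of the current accepted particles (which captures the sloppy, correlated directions) and propose along it:

```python
import math
import random

def covariance(particles):
    """Sample covariance (cxx, cxy, cyy) of a list of 2D particles."""
    n = len(particles)
    mx = sum(p[0] for p in particles) / n
    my = sum(p[1] for p in particles) / n
    cxx = sum((p[0] - mx) ** 2 for p in particles) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in particles) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in particles) / (n - 1)
    return cxx, cxy, cyy

def perturb(particle, cov, rng, scale=2.0):
    """Propose from N(particle, scale * cov) via a 2x2 Cholesky factor."""
    cxx, cxy, cyy = (scale * c for c in cov)
    l11 = math.sqrt(cxx)
    l21 = cxy / l11
    l22 = math.sqrt(cyy - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (particle[0] + l11 * z1, particle[1] + l21 * z1 + l22 * z2)
```

Proposals aligned with the stiff/sloppy axes of the posterior waste far fewer samples on directions the data already constrain tightly, which is the efficiency gain at stake.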
The success of current methods for estimating Lyapunov exponents (LEs) from observed time series relies on the availability of copious and virtually noise-free observations. This explains their apparent underuse in the analysis of biological system dynamics, where the associated time series are often both “short” and “noisy”.
With reference to classical and biological dynamical systems, we identify the key sources of error in a standard LE estimation procedure. We then show that it is possible to estimate LEs of such systems, and their uncertainty, within a Bayesian framework. Here we can employ the flexibility of state-space models and obtain reliable estimates of both the noiseless states and parameters of the (embedded) system using a dual Unscented Kalman Filter. We will discuss how these methods can be applied in the context of innate immune response signalling (HES-1 and P38 MAP Kinase).
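A minimal illustration of why LE estimation is so data-hungry: for the logistic map x -> r x (1 - x), the largest Lyapunov exponent is the long-run average of log |f'(x)| along a trajectory (log 2 at r = 4). With a long, clean series the estimate converges; short or noisy series of the kind typical in biology degrade it badly, which is what motivates the state-space/UKF treatment in the talk. The toy estimator below assumes noise-free access to the states and the known map:

```python
import math

def logistic_le(r=4.0, x0=0.3, n=100000, burn_in=1000):
    """Largest Lyapunov exponent of the logistic map from a clean orbit."""
    x = x0
    for _ in range(burn_in):             # discard the transient
        x = r * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        acc += math.log(abs(r * (1 - 2 * x)))   # log |f'(x)| along the orbit
        x = r * x * (1 - x)
    return acc / n
```

Truncating n to a few dozen points, or adding measurement noise to x before taking the derivative, quickly destroys the estimate; the Bayesian state-space approach instead infers the noiseless states and parameters jointly and propagates the resulting uncertainty into the LE.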