EPSRC Symposium Workshop on Information extraction from complex data sets (INF)

14-17 September 2009

Organisers: D Wild, S Mukherjee, Z Ghahramani (Cambridge)

ABSTRACTS

Ziv Bar-Joseph

Cross species analysis of functional genomics data

Recent advances in genomics are enabling researchers to accumulate large
datasets in multiple species. These include sequence data as well as
functional information such as the level of gene expression and various
types of interactions. However, while the sequence and function of genes
are highly conserved between close species, expression and interaction
data appear to be much less conserved. For example, even though there
is 90% sequence similarity between human and mouse, both their expression
and interaction similarities are lower than 20%.

In this talk I will present methods that utilize graphical models and
constrained clustering for integrating sequence and functional data from
multiple species. We used these methods to study two biological systems:
cell cycle and immune response. As we show, using these methods we can
improve on the sets of genes recovered for each species independently.
More importantly, these methods allow us to recover the core set of
genes for specific biological systems, indicating that data integration
across species can overcome problems associated with the analysis of
genomics data.

Mark Girolami

Riemann Manifold MCMC for very high dimensional data

Information extraction from complex data sets such as those
produced from functional genomic and proteomic technologies is typically
model-based. Statistical models of high-dimensional observations or
multiple sources themselves have complex and oftentimes high-dimensional
parameterisations, bringing with them challenges in performing inference.
In such a setting performing Markov Chain Monte Carlo based inference
efficiently is an ongoing theme of methodological research. This talk
presents a Riemannian Manifold Hamiltonian Monte Carlo sampler to
resolve the shortcomings of existing Monte Carlo algorithms when
sampling from target densities that may be high dimensional and exhibit
strong correlations. The method provides a fully automated adaptation
mechanism that circumvents the costly pilot runs required to tune
proposal densities for Metropolis-Hastings or indeed Hybrid Monte Carlo
and Metropolis Adjusted Langevin Algorithms. This allows for highly
efficient sampling even in very high dimensions where different scalings
may be required for the transient and stationary phases of the Markov
chain. The proposed method exploits the Riemannian structure of the
parameter space of statistical models and thus automatically adapts to
the local manifold structure at each step based on the metric tensor. A
semi-explicit second order symplectic integrator for non-separable
Hamiltonians is derived for simulating paths across this manifold, which
provides highly efficient convergence and exploration of the target
density. The performance of the Riemannian Manifold Hamiltonian Monte
Carlo method is assessed by performing posterior inference on logistic
regression models, log-Gaussian Cox point processes, stochastic
volatility models, and Bayesian estimation of parameter posteriors of
dynamical systems described by nonlinear differential equations.
Substantial improvements in the time normalised Effective Sample Size
are reported when compared to alternative sampling approaches.
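
For orientation, the Hamiltonian usually written down for this kind of sampler in the literature is sketched below. It is not stated in the abstract, and the symbols are assumptions introduced here: \theta is the D-dimensional parameter vector, p the auxiliary momentum, \pi(\theta) the target density and G(\theta) the metric tensor.

```latex
H(\theta, p) = -\log \pi(\theta)
  + \tfrac{1}{2} \log\!\left( (2\pi)^{D} \, \lvert G(\theta) \rvert \right)
  + \tfrac{1}{2} \, p^{\top} G(\theta)^{-1} p
```

Because the kinetic term depends on \theta through G(\theta), the Hamiltonian does not split into a position-only part and a momentum-only part, which is why an integrator for non-separable Hamiltonians is needed.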

Christopher Holmes

Some issues in robust Bayesian inference for functional genomics

Experiments in functional genomics typically produce highly structured
data sets, with thousands of measurements on tens to hundreds of
individuals. The nature of the assays and the sheer number of
measurements taken make analysis of such data prone to influence by
outliers that arise from bad samples or bad measurements. This
influence is especially problematic within discovery driven studies
which often apply simple statistical models to multiple subsets of the
data with the resulting findings ranked in some fashion, such as when
using microarrays to test for differential gene expression under two
treatments. In these scenarios, semi-automated robust Bayesian
inference provides an attractive inferential framework. We will
discuss our experience in the analysis of complex genomic data sets
using robust Bayesian methods, both parametric (robust Bayesian
ANOVA) and non-parametric (Bayesian Hidden Markov Models with mixture
of Dirichlet Process state sampling distributions, i.e. likelihoods), and
show that these lead to substantial gains in inference and resulting
findings.

Dirk Husmeier

Joint work with Marco Grzegorczyk

Learning gene regulatory networks from gene expression time series with
non-linear/non-stationary dynamic Bayesian networks

Feedback loops and recurrent structures are essential to the
regulation and stable control of complex biological systems. The
application of dynamic as opposed to static Bayesian networks is
promising in that, in principle, these feedback loops can be learned
from gene expression time series. However, we will show that the
widely applied BGe model is susceptible to learning spurious feedback
loops, which are a consequence of non-linear regulation and
autocorrelation in the data. We propose a non-linear/non-stationary
generalisation of the BGe model, based on a mixture model and
change-point process. We demonstrate that this approach has the
potential to successfully avoid spurious feedback loops that BGe is
susceptible to, which leads to a more accurate network reconstruction.

Neil Lawrence

Efficient Multiple Output Convolution Processes for Multiple Task Learning

Learning multiple correlated outputs with a Gaussian process
presents problems both in specifying the covariance (kernel) function
and efficiently inverting it. We consider the convolution process route
to generating covariance functions over structured outputs. We will show
how sparse approximations based on conditional independence assumptions
and variational methods can be used to make inference and learning
efficient. Given time we will give examples from multi-task learning,
computational biology, financial time series and human motion modeling.

Jure Leskovec

Meme-tracking and the Dynamics of the News Cycle

Tracking new topics, ideas, and "memes" across the Web has been an issue
of considerable interest. Recent work has developed methods for tracking
topic shifts over long time scales, as well as abrupt spikes in the
appearance of particular named entities. However, these approaches are
less well suited to the identification of content that spreads widely and
then fades over time scales on the order of days - the time scale at
which we perceive news and events.

We develop a framework for tracking short, distinctive phrases that travel
relatively intact through on-line text; developing scalable algorithms for
clustering textual variants of such phrases, we identify a broad class of
memes that exhibit wide spread and rich variation on a daily basis. As our
principal domain of study, we show how such a meme-tracking approach can
provide a coherent representation of the news cycle - the daily
rhythms in the news media that have long been the subject of qualitative
interpretation but have never been captured accurately enough to permit
actual quantitative analysis. We tracked 1.6 million mainstream media
sites and blogs over a period of three months, for a total of 90 million
articles, and we find a set of novel and persistent temporal patterns in
the news cycle. In particular, we observe a typical lag of 2.5 hours
between the peaks of attention to a phrase in the news media and in blogs
respectively, with divergent behavior around the overall peak and a
"heartbeat"-like pattern in the handoff between news and blogs. We also
develop and analyze a mathematical model for the kinds of temporal
variation that the system exhibits.

Guido Sanguinetti

Approximate inference for Markov Jump Processes with applications
in systems and developmental biology

Markov Jump Processes represent a convenient mathematical
model of many chemical reactions involving low numbers of molecular
species. Inference in these models is hampered by the necessity to solve
very large systems of ODEs giving the forward-backward relations. In
this talk, I will present some work (in collaboration with Manfred
Opper) on using a variational mean field approach to reduce the
inference problem to a more tractable size. I will give some examples of
applications, including an application to reaction-diffusion systems in
morphogenesis of Drosophila embryos (joint work with M. Dewar, M.
Opper, V. Kadirkamanathan).

Eric Schadt

Networks as the Sensors and Drivers of Disease

Common human diseases and drug response are complex traits
that involve entire networks of changes at the molecular level driven
by genetic and environmental perturbations. Efforts to elucidate
disease and drug response traits have focused on single dimensions of
the system. Studies focused on identifying changes in DNA that
correlate with changes in disease or drug response traits, changes in
gene expression that correlate with disease or drug response traits,
or changes in other molecular traits (e.g., metabolite, methylation
status, protein phosphorylation status, and so on) that correlate with
disease or drug response are fairly routine and have met with great
success in many cases. However, to further our understanding of the
complex network of molecular and cellular changes that impact disease
risk, disease progression, severity, and drug response, these multiple
dimensions must be considered together. Here I present an approach
for integrating a diversity of molecular and clinical trait data to
uncover models that predict complex system behavior. By integrating
diverse types of data on a large scale I demonstrate that some forms
of common human diseases are most likely the result of perturbations
to specific gene networks that in turn cause changes in the states of
other gene networks both within and between tissues that drive
biological processes associated with disease. These models not only
elucidate primary drivers of disease and drug response, but also
provide a context within which to interpret biological function,
beyond what could be achieved by looking at one dimension alone. That
some forms of common human diseases are the result of complex
interactions among networks has significant implications for drug
discovery: designing drugs or drug combinations to impact entire
network states rather than designing drugs that target specific
disease associated genes.

Ricardo Silva

Joint work with Katherine Heller, Zoubin Ghahramani and Edoardo Airoldi

Ranking Relations Using Analogies

We develop an approach to relational learning which, given a
set of pairs of objects S = {A1:B1, A2:B2, ..., AN:BN}, measures
how well other pairs A:B fit in with the set S. Our work addresses
the question: is the relation between objects A and B analogous to
those relations found in S? Such questions are particularly relevant
in information retrieval, where an investigator might want to search
for analogous pairs of objects that match the query set of interest.
Analogical reasoning depends fundamentally on the ability to learn
and generalize about relations between objects. There are many ways
in which objects can be related, making the task very challenging.
We recast this classical problem as a problem of Bayesian analysis
of relational data and function spaces, and illustrate its potential in
the domain of text analysis. A detailed application on searching for
protein-protein interactions is discussed.

John Skilling

The Nested Sampling Algorithm

The "Nested Sampling" algorithm is designed for probabilistic inference,
where a function of arbitrary complexity is to be both integrated (for
model selection) and sampled (for inferred parameters). It is an
iterative Monte Carlo scheme based on sampling within a progressive
constraint on function value. This constraint compresses the remaining
available volume smoothly and systematically, so that exploration is (at
least in principle) independent of quirks of function behaviour. This
property is particularly valuable in multi-modal problems, where peaks
of different heights and volumes need to be correctly balanced.

Early applications are in cosmological model selection and the modelling
of nano-materials.
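
The iteration described above can be sketched in a few lines of Python. This is an illustration only, not the speaker's implementation: the function names, the toy Gaussian likelihood, the uniform prior on [-5, 5] and the use of plain rejection sampling from the prior to satisfy the constraint are all assumptions made here. The remaining prior volume after i steps is approximated by its expected shrinkage, exp(-i / n_live).

```python
import math
import random


def nested_sampling(log_likelihood, sample_prior, n_live=100, n_iter=600, rng=None):
    """Estimate log-evidence, log Z = log of the integral of L(x) * prior(x) dx."""
    rng = rng or random.Random(0)
    live = [sample_prior(rng) for _ in range(n_live)]
    live_logl = [log_likelihood(x) for x in live]

    evidence = 0.0
    x_prev = 1.0  # fraction of prior volume still enclosed by the constraint
    for i in range(1, n_iter + 1):
        # The worst live point defines the new likelihood constraint.
        worst = min(range(n_live), key=live_logl.__getitem__)
        logl_star = live_logl[worst]

        # Expected shrinkage of the enclosed prior volume at step i.
        x_curr = math.exp(-i / n_live)
        evidence += math.exp(logl_star) * (x_prev - x_curr)
        x_prev = x_curr

        # Replace the worst point by a prior draw that satisfies the constraint.
        # Rejection from the prior is simple but becomes slow as the volume shrinks.
        while True:
            candidate = sample_prior(rng)
            candidate_logl = log_likelihood(candidate)
            if candidate_logl > logl_star:
                break
        live[worst] = candidate
        live_logl[worst] = candidate_logl

    # Add the contribution of the volume still covered by the live points.
    evidence += x_prev * sum(math.exp(l) for l in live_logl) / n_live
    return math.log(evidence)


def toy_log_likelihood(x):
    # Standard normal density (hypothetical test problem).
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def toy_prior(rng):
    # Uniform prior on [-5, 5] (hypothetical test problem).
    return rng.uniform(-5.0, 5.0)


log_z = nested_sampling(toy_log_likelihood, toy_prior)
print(log_z)
```

For this toy problem the true evidence is close to 0.1 (the Gaussian integrates to almost 1 over the interval and the prior density is 1/10), so the printed value should be near log(0.1). Practical implementations replace the rejection step with a constrained sampler.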

Michael Stumpf

Model selection from single cell data

For the vast majority of biological systems we lack reliable models, let
alone model parameters. Using well defined simulation models and real
biological data collected for a range of biological signalling systems,
we explore how much can be learned about biological systems from
temporally resolved transcriptomic or proteomic data. We pay particular
attention to qualitative properties of the underlying dynamical system
and their impact on our ability to infer the system's dynamics. We then
illustrate how approximate Bayesian computation approaches can be
employed to gain insights into the inferability of model parameters, and
for model selection in the context of dynamical systems of signalling
networks in systems biology. We will pay particular attention to the
analysis of single-cell data and discuss the relative advantages of
different experimental setups to study cellular variability.
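
As a rough illustration of approximate Bayesian computation in its simplest (rejection) form, the sketch below infers a single decay rate from noisy time-course data. It is not taken from the talk: the exponential-decay model, the uniform prior on [0, 2], the Euclidean distance, the tolerance and all function names are assumptions chosen to keep the example small.

```python
import math
import random


def simulate(k, times, rng, noise_sd=0.05):
    """Hypothetical model: x(t) = exp(-k * t) observed with Gaussian noise."""
    return [math.exp(-k * t) + rng.gauss(0.0, noise_sd) for t in times]


def abc_rejection(observed, times, n_draws, eps, rng):
    """Keep prior draws whose simulated data lie within eps of the observations."""
    accepted = []
    for _ in range(n_draws):
        k = rng.uniform(0.0, 2.0)  # assumed prior on the decay rate
        simulated = simulate(k, times, rng)
        distance = math.sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed)))
        if distance < eps:
            accepted.append(k)
    return accepted


rng = random.Random(1)
times = [0.5 * i for i in range(10)]
observed = simulate(0.7, times, rng)  # synthetic "data" with true k = 0.7
posterior = abc_rejection(observed, times, n_draws=20000, eps=0.3, rng=rng)
print(len(posterior), sum(posterior) / len(posterior))
```

The spread of the accepted values gives a crude picture of how well the parameter can be inferred from the data; the same accept/reject scheme extends to model selection by drawing a model index alongside the parameters.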

Simon Tavaré

Joint work with Christiana Spyrou, Rory Stark, Andy Lynch

Some statistical issues in the analysis of Illumina sequencing experiments

High-throughput sequencing technologies have become popular for the
study of genome organization, gene expression, methylation and
protein-DNA interactions. For example, chromatin immunoprecipitation
followed by sequencing of the resulting samples produces large amounts
of data that can be used to map transcription factor binding sites,
histone modifications and origins of replication.

In this talk I will discuss some of the statistical issues arising from such
data, focussing primarily on ChIP-seq experiments. I will describe
some research from the CRI in which ChIP-seq has proved invaluable,
and illustrate a statistical method for calling enriched
regions. BayesPeak uses a fully Bayesian hidden Markov model to detect
enriched locations in the genome. The structure accommodates the
natural features of Illumina sequencing data and allows for
overdispersion in the abundance of reads in different
regions. Moreover, a control sample can be incorporated in the
analysis to account for experimental and sequence biases. Markov chain
Monte Carlo algorithms are applied to estimate the posterior
distributions of the model parameters, and posterior probabilities are
used to identify the sites of interest. I will give some comparisons
with existing approaches, and describe related applications, such as
mapping origins of replication using BrdU-IP-seq, for which novel
statistical problems arise.

John Winn

Modelling complex disease phenotype data with Infer.NET

When trying to understand the genetic basis of disease, a common
approach is to treat presence or absence of a disease as a binary
target. Because many diseases involve multiple, complex systems,
disease symptoms may be due to a failure in a subset of a large
number of relevant cellular mechanisms across multiple systems. For
example, asthma symptoms may arise from problems with the immune
system, bronchial hyper-sensitivity or difficulties during lung
development - or some combination of these with varying severity.
Hence, before we can understand the genetic basis of a disease, it is
important to identify and decompose the system-level basis of the
disease. Genetic associations to these underlying system-level
factors can then be found, instead of to the disease label, making it
possible to detect associations that were previously lost.

Our approach to understanding the system-level basis of disease is to
construct a graphical model of rich disease phenotype data. This
approach allows us to combine physiological, clinical, environmental
and sociological variables relevant to the disease whilst also taking
into account expert clinical knowledge. To construct these rich
models, we use the Infer.NET graphical modelling and inference tool
developed at Microsoft Research Cambridge. Infer.NET allows very
rapid development, testing and refinement of the model, whilst also
scaling to very large datasets. I illustrate the talk with a
detailed example of how Infer.NET was used to model asthma phenotype
data as part of a project undertaken with the University of
Manchester.

Eric Xing

Time (and Space)-Varying Networks: Reverse engineering rewiring genetic interactions

A plausible representation of the relational information among entities
in dynamic systems such as a living cell is a stochastic network which
is topologically rewiring and semantically evolving over time (or
space). While there is a rich literature in modeling static or
temporally invariant networks, until recently, little has been done
toward modeling the dynamic processes underlying rewiring networks, and
on recovering such networks when they are not observable. In this talk,
I will present a new formalism for modeling network evolution over time
based on temporal exponential random graphs, and several new algorithms
based on temporal extensions of the sparse graphical logistic
regression, for reverse-engineering the latent time/space varying
networks. These algorithms can be cast as standard convex-optimization
problems and solved efficiently using generic solvers scalable to large
networks. I will show some promising results on recovering the latent
sequence of evolving gene networks over more than 4000 genes during the
life cycle of Drosophila melanogaster from a microarray time course, at a
time resolution limited only by sample frequency (i.e., the method works even
when only a single snapshot of node-values from each time-specific network is
available). I will also sketch some theoretical results on asymptotic
sparsistency of the proposed methods, which differ significantly from
traditional sparsistency analysis of static structure estimation based
on iid samples because of the temporal relatedness of samples.

Internet Access at Warwick:
Where possible, visitors should obtain an EDUROAM account from their own university to enable internet access whilst at Warwick.

Registration:
You can register for any of the symposia or workshops online.

Contact:
Mathematics Research Centre
Zeeman Building
University of Warwick
Coventry CV4 7AL - UK
E-mail: MRC@warwick.ac.uk