# EPSRC Symposium Workshop on Information extraction from complex data sets (INF)

### 14-17 September 2009

Organisers: D Wild, S Mukherjee, Z Ghahramani (Cambridge)

### ABSTRACTS

**Ziv Bar-Joseph**

**Cross species analysis of functional genomics data**

Recent advances in genomics are enabling researchers to accumulate large

datasets in multiple species. These include sequence data as well as

functional information such as the level of gene expression and various

types of interactions. However, while the sequence and function of genes

are highly conserved between close species, expression and interaction

data appear to be much less conserved. For example, even though there

is 90% sequence similarity between humans and mice, both their expression

and interaction similarities are lower than 20%.

In this talk I will present methods that utilize graphical models and

constrained clustering for integrating sequence and functional data from

multiple species. We used these methods to study two biological systems:

cell cycle and immune response. As we show, using these methods we can

improve on the sets of genes recovered for each species independently.

More importantly, these methods allow us to recover the core set of

genes for specific biological systems indicating that data integration

across species can overcome problems associated with the analysis of

genomics data.

**Mark Girolami**

**Riemann Manifold MCMC for very high dimensional data**

Information extraction from complex data sets such as those

produced from functional genomic and proteomic technologies is typically

model-based. Statistical models of high-dimensional observations or

multiple sources themselves have complex and oftentimes high-dimensional

parameterisations bringing with them challenges in performing inference.

In such a setting performing Markov Chain Monte Carlo based inference

efficiently is an ongoing theme of methodological research. This talk

presents a Riemannian Manifold Hamiltonian Monte Carlo sampler to

resolve the shortcomings of existing Monte Carlo algorithms when

sampling from target densities that may be high dimensional and exhibit

strong correlations. The method provides a fully automated adaptation

mechanism that circumvents the costly pilot runs required to tune

proposal densities for Metropolis-Hastings or indeed Hybrid Monte Carlo

and Metropolis Adjusted Langevin Algorithms. This allows for highly

efficient sampling even in very high dimensions where different scalings

may be required for the transient and stationary phases of the Markov

chain. The proposed method exploits the Riemannian structure of the

parameter space of statistical models and thus automatically adapts to

the local manifold structure at each step based on the metric tensor. A

semi-explicit second order symplectic integrator for non-separable

Hamiltonians is derived for simulating paths across this manifold which

provides highly efficient convergence and exploration of the target

density. The performance of the Riemannian Manifold Hamiltonian Monte

Carlo method is assessed by performing posterior inference on logistic

regression models, log-Gaussian Cox point processes, stochastic

volatility models, and Bayesian estimation of parameter posteriors of

dynamical systems described by nonlinear differential equations.

Substantial improvements in the time normalised Effective Sample Size

are reported when compared to alternative sampling approaches.

**Christopher Holmes**

**Some issues in robust Bayesian inference for functional genomics**

Experiments in functional genomics typically produce highly structured

data sets, with thousands of measurements on tens to hundreds of

individuals. The nature of the assays and the sheer number of

measurements taken makes analysis of such data prone to influence by

outliers that arise from bad samples or bad measurements. This

influence is especially problematic within discovery driven studies

which often apply simple statistical models to multiple subsets of the

data with the resulting findings ranked in some fashion, such as when

using microarrays to test for differential gene expression under two

treatments. In these scenarios, semi-automated robust Bayesian

inference provides an attractive inferential framework. We will

discuss our experience in the analysis of complex genomic data sets

using robust Bayesian methods, both parametric (robust Bayesian

ANOVA) and non-parametric (Bayesian Hidden Markov Models with mixture

of Dirichlet Process state sampling distributions, i.e. likelihoods), and

show these lead to substantial gains in inference and resulting

findings.

**Dirk Husmeier**

**Joint work with Marco Grzegorczyk**

**Learning gene regulatory networks from gene expression time series with non-linear/non-stationary dynamic Bayesian networks**

Feedback loops and recurrent structures are essential to the

regulation and stable control of complex biological systems. The

application of dynamic as opposed to static Bayesian networks is

promising in that, in principle, these feedback loops can be learned

from gene expression time series. However, we will show that the

widely applied BGe model is susceptible to learning spurious feedback

loops, which are a consequence of non-linear regulation and

autocorrelation in the data. We propose a non-linear/non-stationary

generalisation of the BGe model, based on a mixture model and

change-point process. We demonstrate that this approach has the

potential to successfully avoid spurious feedback loops that BGe is

susceptible to, which leads to a more accurate network reconstruction.

**Neil Lawrence**

**Efficient Multiple Output Convolution Processes for Multiple Task Learning**

Learning multiple correlated outputs with a Gaussian process

presents problems both in specifying the covariance (kernel) function

and efficiently inverting it. We consider the convolution process route

to generating covariance functions over structured outputs. We will show

how sparse approximations based on conditional independence assumptions

and variational methods can be used to make inference and learning

efficient. Given time we will give examples from multi-task learning,

computational biology, financial time series and human motion modeling.

**Jure Leskovec**

**Meme-tracking and the Dynamics of the News Cycle**

Tracking new topics, ideas, and "memes" across the Web has been an issue

of considerable interest. Recent work has developed methods for tracking

topic shifts over long time scales, as well as abrupt spikes in the

appearance of particular named entities. However, these approaches are

less well suited to the identification of content that spreads widely and

then fades over time scales on the order of days - the time scale at

which we perceive news and events.

We develop a framework for tracking short, distinctive phrases that travel

relatively intact through on-line text; developing scalable algorithms for

clustering textual variants of such phrases, we identify a broad class of

memes that exhibit wide spread and rich variation on a daily basis. As our

principal domain of study, we show how such a meme-tracking approach can

provide a coherent representation of the news cycle, the daily

rhythms in the news media that have long been the subject of qualitative

interpretation but have never been captured accurately enough to permit

actual quantitative analysis. We tracked 1.6 million mainstream media

sites and blogs over a period of three months, for a total of 90 million

articles, and we find a set of novel and persistent temporal patterns in

the news cycle. In particular, we observe a typical lag of 2.5 hours

between the peaks of attention to a phrase in the news media and in blogs

respectively, with divergent behavior around the overall peak and a

"heartbeat"-like pattern in the handoff between news and blogs. We also

develop and analyze a mathematical model for the kinds of temporal

variation that the system exhibits.

**Guido Sanguinetti**

**Approximate inference for Markov Jump Processes with applications in systems and developmental biology**

Markov Jump Processes represent a convenient mathematical

model of many chemical reactions involving low numbers of molecular

species. Inference in these models is hampered by the necessity to solve

very large systems of ODEs giving the forward-backward relations. In

this talk, I will present some work (in collaboration with Manfred

Opper) on using a variational mean field approach to reduce the

inference problem to a more tractable size. I will give some examples of

applications, including an application to reaction-diffusion systems in

morphogenesis of Drosophila embryos (joint work with M. Dewar, M.

Opper, V. Kadirkamanathan).

**Eric Schadt**

**Networks as the Sensors and Drivers of Disease**

Common human diseases and drug response are complex traits

that involve entire networks of changes at the molecular level driven

by genetic and environmental perturbations. Efforts to elucidate

disease and drug response traits have focused on single dimensions of

the system. Studies focused on identifying changes in DNA that

correlate with changes in disease or drug response traits, changes in

gene expression that correlate with disease or drug response traits,

or changes in other molecular traits (e.g., metabolite, methylation

status, protein phosphorylation status, and so on) that correlate with

disease or drug response are fairly routine and have met with great

success in many cases. However, to further our understanding of the

complex network of molecular and cellular changes that impact disease

risk, disease progression, severity, and drug response, these multiple

dimensions must be considered together. Here I present an approach

for integrating a diversity of molecular and clinical trait data to

uncover models that predict complex system behavior. By integrating

diverse types of data on a large scale I demonstrate that some forms

of common human diseases are most likely the result of perturbations

to specific gene networks that in turn cause changes in the states of

other gene networks both within and between tissues that drive

biological processes associated with disease. These models elucidate

not only primary drivers of disease and drug response, but they

provide a context within which to interpret biological function,

beyond what could be achieved by looking at one dimension alone. That

some forms of common human diseases are the result of complex

interactions among networks has significant implications for drug

discovery: designing drugs or drug combinations to impact entire

network states rather than designing drugs that target specific

disease associated genes.

**Ricardo Silva**

**Joint work with Katherine Heller, Zoubin Ghahramani and Edoardo Airoldi**

**Ranking Relations Using Analogies**

We develop an approach to relational learning which, given a

set of pairs of objects S = {A1:B1, A2:B2, ..., AN:BN}, measures

how well other pairs A:B fit in with the set S. Our work addresses

the question: is the relation between objects A and B analogous to

those relations found in S? Such questions are particularly relevant

in information retrieval, where an investigator might want to search

for analogous pairs of objects that match the query set of interest.

Analogical reasoning depends fundamentally on the ability to learn

and generalize about relations between objects. There are many ways

in which objects can be related, making the task very challenging.

We recast this classical problem as a problem of Bayesian analysis

of relational data and function spaces, and illustrate its potential in

the domain of text analysis. A detailed application on searching for

protein-protein interactions is discussed.

**John Skilling**

**The Nested Sampling Algorithm**

The "Nested Sampling" algorithm is designed for probabilistic inference,

where a function of arbitrary complexity is to be both integrated (for

model selection) and sampled (for inferred parameters). It is an

iterative Monte Carlo scheme based on sampling within a progressive

constraint on function value. This constraint compresses the remaining

available volume smoothly and systematically, so that exploration is (at

least in principle) independent of quirks of function behaviour. This

property is particularly valuable in multi-modal problems, where peaks

of different heights and volumes need to be correctly balanced.

Early applications are in cosmological model selection and the modelling

of nano-materials.
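
As a rough illustration of the scheme described in this abstract, the following is a minimal sketch, not the speaker's implementation. It assumes a toy problem chosen here for illustration: a standard two-dimensional Gaussian likelihood and a uniform prior on the square [-5, 5]^2. The function names (`log_likelihood`, `sample_prior`, `nested_sampling`) and all parameter values are hypothetical, and the constrained draw uses plain rejection sampling from the prior, which is only practical for small toy problems.

```python
import math
import random


def log_likelihood(x, y):
    # Toy target (an assumption of this sketch): standard 2-D Gaussian,
    # normalised so that it integrates to 1 over the whole plane.
    return -0.5 * (x * x + y * y) - math.log(2.0 * math.pi)


def sample_prior(rng, half_width):
    # Uniform prior on the square [-half_width, half_width]^2.
    return (rng.uniform(-half_width, half_width),
            rng.uniform(-half_width, half_width))


def nested_sampling(n_live=200, n_iter=1200, half_width=5.0, seed=0):
    """Return an estimate of the log-evidence log Z for the toy problem."""
    rng = random.Random(seed)

    # Start with n_live points drawn from the prior.
    live = []
    for _ in range(n_live):
        point = sample_prior(rng, half_width)
        live.append((log_likelihood(*point), point))

    evidence = 0.0
    x_prev = 1.0  # fraction of prior volume still enclosed by the constraint

    for i in range(1, n_iter + 1):
        # The lowest-likelihood live point sets the current constraint L*.
        worst = min(range(n_live), key=lambda k: live[k][0])
        logl_star = live[worst][0]

        # Deterministic approximation: volume shrinks by exp(-1/n_live) per step.
        x_curr = math.exp(-i / n_live)
        evidence += math.exp(logl_star) * (x_prev - x_curr)
        x_prev = x_curr

        # Replace the worst point by a prior draw satisfying L > L*.
        while True:
            point = sample_prior(rng, half_width)
            logl = log_likelihood(*point)
            if logl > logl_star:
                live[worst] = (logl, point)
                break

    # Add the contribution of the volume still held by the live points.
    evidence += x_prev * sum(math.exp(logl) for logl, _ in live) / n_live
    return math.log(evidence)


if __name__ == "__main__":
    print("estimated log Z:", nested_sampling())
```

For this toy problem the evidence is essentially the prior density 1/100, since almost all of the Gaussian mass lies inside the square, so the estimate should land near log(1/100), about -4.61, up to Monte Carlo error. The discarded points, weighted by likelihood times volume decrement, could also be reused as posterior samples, which corresponds to the "sampled" half of the description above.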

**Michael Stumpf**

**Model selection from single cell data**

For the vast majority of biological systems we lack reliable models, let

alone model parameters. Using well defined simulation models and real

biological data collected for a range of biological signalling systems,

we explore how much can be learned about biological systems from

temporally resolved transcriptomic or proteomic data. We pay particular

attention to qualitative properties of the underlying dynamical system

and their impact on our ability to infer the system's dynamics. We then

illustrate how approximate Bayesian computation approaches can be

employed to gain insights into the inferability of model parameters, and

for model selection in the context of dynamical systems of signalling

networks in systems biology. We will pay particular attention to the

analysis of single-cell data and discuss the relative advantages of

different experimental setups to study cellular variability.

**Simon Tavaré**

**Joint work with Christiana Spyrou, Rory Stark, Andy Lynch**

**Some statistical issues in the analysis of Illumina sequencing experiments**

High-throughput sequencing technologies have become popular for the

study of genome organization, gene expression, methylation and

protein-DNA interactions. For example, chromatin immunoprecipitation

followed by sequencing of the resulting samples produces large amounts

of data that can be used to map transcription factor binding sites,

histone modifications and origins of replication.

In this talk I will discuss some of the statistical issues arising from such

data, focussing primarily on ChIP-seq experiments. I will describe

some research from the CRI in which ChIP-seq has proved invaluable,

and illustrate a statistical method for calling enriched

regions. BayesPeak uses a fully Bayesian hidden Markov model to detect

enriched locations in the genome. The structure accommodates the

natural features of Illumina sequencing data and allows for

overdispersion in the abundance of reads in different

regions. Moreover, a control sample can be incorporated in the

analysis to account for experimental and sequence biases. Markov chain

Monte Carlo algorithms are applied to estimate the posterior

distributions of the model parameters, and posterior probabilities are

used to identify the sites of interest. I will give some comparisons

with existing approaches, and describe related applications, such as

mapping origins of replication using BrdU-IP-seq, for which novel

statistical problems arise.

**John Winn**

**Modelling complex disease phenotype data with Infer.NET**

When trying to understand the genetic basis of disease, a common

approach is to treat presence or absence of a disease as a binary

target. Because many diseases involve multiple, complex systems,

disease symptoms may be due to a failure in a subset of a large

number of relevant cellular mechanisms across multiple systems. For

example, asthma symptoms may arise from problems with the immune

system, bronchial hyper-sensitivity or difficulties during lung

development - or some combination of these with varying severity.

Hence, before we can understand the genetic basis of a disease, it is

important to identify and decompose the system-level basis of the

disease. Genetic associations to these underlying system-level

factors can then be found, instead of to the disease label, making it

possible to detect associations that were previously lost.

Our approach to understanding the system-level basis of disease is to

construct a graphical model of rich disease phenotype data. This

approach allows us to combine physiological, clinical, environmental

and sociological variables relevant to the disease whilst also taking

into account expert clinical knowledge. To construct these rich

models, we use the Infer.NET graphical modelling and inference tool

developed at Microsoft Research Cambridge. Infer.NET allows very

rapid development, testing and refinement of the model, whilst also

scaling to very large datasets. I illustrate the talk with a

detailed example of how Infer.NET was used to model asthma phenotype

data as part of a project undertaken with the University of

Manchester.

**Eric Xing**

**Time (and Space)-Varying Networks: Reverse engineering rewiring genetic interactions**

A plausible representation of the relational information among entities

in dynamic systems such as a living cell is a stochastic network which

is topologically rewiring and semantically evolving over time (or

space). While there is a rich literature in modeling static or

temporally invariant networks, until recently, little has been done

toward modeling the dynamic processes underlying rewiring networks, and

on recovering such networks when they are not observable. In this talk,

I will present a new formalism for modeling network evolution over time

based on temporal exponential random graphs, and several new algorithms

based on temporal extensions of the sparse graphical logistic

regression, for reverse-engineering the latent time/space varying

networks. These algorithms can be cast as standard convex-optimization

problems and solved efficiently using generic solvers scalable to large

networks. I will show some promising results on recovering the latent

sequence of evolving gene networks over more than 4000 genes during the

life cycle of Drosophila melanogaster from microarray time-course data, at a

time resolution limited only by sample frequency (i.e., the approach works even when

only a single snapshot of node values from each time-specific network is

available). I will also sketch some theoretical results on asymptotic

sparsistency of the proposed methods, which differ significantly from

traditional sparsistency analysis of static structure estimation based

on iid samples because of the temporal relatedness of samples.



**Contact:**

Mathematics Research Centre

Zeeman Building

University of Warwick

Coventry CV4 7AL - UK

**E-mail:**

mrc@maths.warwick.ac.uk