# Never Mind the Big Data, Here's the Big Models

### December 15, 2015

Big Data, and all that it promises to deliver, has captivated politicians, policy makers and leading scientists across disciplines. As advanced instrumentation has made it feasible to collect and curate ever larger data sets, there has been a parallel growth in the size and complexity of the mathematical models, usually based on differential equations, developed to study natural and physical phenomena such as weather systems, polar ice sheet flows and cellular dynamic processes. It can be argued that the research challenges presented by the advent of Big Models, and the need to quantify uncertainty over them, are more demanding than those presented by Big Data alone, and this workshop will concentrate on that theme. Four focussed talks by world-leading experts in uncertainty quantification, spatial statistics and computational mathematics will explore the current issues being addressed in the rise of the Big Models.

### Speakers

- Robert Scheichl (University of Bath, Dept of Mathematical Sciences)
- Shiwei Lan (University of Warwick, Dept of Statistics)
- Konstantinos Zygalakis (University of Southampton, Dept of Mathematical Sciences)
- Dan Simpson (University of Bath, Dept of Mathematical Sciences)

### Location

The workshop will take place in the Mathematics and Statistics Department, Zeeman Building, at the University of Warwick. Talks will be held in **MS.03**.

### Programme

- 11:30 Registration opens
- 12:00 - 12:50 Lunch
- 12:55 - 13:00 Opening remarks (Mark Girolami)
- 13:00 - 14:00 Konstantinos Zygalakis
- 14:00 - 15:00 Shiwei Lan
- 15:00 - 15:30 Coffee
- 15:30 - 16:30 Dan Simpson
- 16:30 - 17:30 Rob Scheichl
- 17:30 Closing remarks

### Registration

Registration for this workshop is now open.

### Abstracts

**Robert Scheichl, Petascale Multigrid Performance and Beyond with Applications in the Earth Sciences**

There is clear demand in the earth sciences (e.g. in weather, climate, ocean or oil reservoir simulations) for petascale elliptic solvers. For example, a 5-day global weather forecast that resolves all the physically relevant processes requires at least a 1 km horizontal grid resolution and about 200 vertical layers, leading to about $10^{11}$ grid points. To improve stability and to allow for larger time steps, the leading weather centres typically use semi-implicit time-stepping methods, which require a global pressure correction at each time step. Even so, the method is not unconditionally stable and the time step cannot exceed about 30 seconds on a 1 km grid, so a 5-day forecast requires about 15,000 time steps. The current time window for the global weather forecast at the UK Met Office is about 1 hour every night, of which about 15 minutes is allowed for the pressure correction. This means that about 1000 elliptic systems (with $10^{11}$ unknowns each) have to be solved in about 1 minute, leading to a clear need for petascale capabilities.
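As a quick back-of-envelope check of the figures quoted in the abstract (this arithmetic is not part of the talk itself):

```python
# Sanity-check of the abstract's numbers: a 5-day forecast with a
# 30-second time step, and a 15-minute budget for all pressure solves.

seconds_per_forecast = 5 * 24 * 3600      # 5-day forecast window
dt = 30                                   # max stable time step (s) on a 1 km grid
n_steps = seconds_per_forecast // dt      # 14,400 steps; abstract rounds to 15,000

solver_budget_min = 15                    # minutes allowed for pressure corrections
solves_per_minute = n_steps / solver_budget_min
print(n_steps, solves_per_minute)         # 14400 960.0 -> ~1000 solves per minute
```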

I will show in this talk that multigrid methods (with their theoretically optimal algorithmic scalability) can offer these capabilities, and also scale almost optimally in practice up to $10^{12}$ unknowns on the currently largest supercomputers. We present results with black-box AMG solvers in DUNE (Heidelberg) and in Hypre (Lawrence Livermore Lab) and with bespoke geometric multigrid methods, both for CPU and GPU clusters. All the solvers scale almost optimally up to the largest currently available processor counts on various architectures. Since elliptic solvers, like all grid-based codes, are essentially memory limited, the bespoke geometric multigrid algorithms, which do not store the entire system matrix and interleave many of the algebraic operations ("kernel fusion"), are significantly faster than the black-box algorithms. On the GPU-based Titan - currently the second-largest supercomputer in the world, with 18688 16-core CPUs each with an Nvidia Tesla GPU - we managed to achieve about 0.8 petaflops with our bespoke multi-GPU elliptic solver using 16384 GPUs and exploiting up to 50% of the peak memory bandwidth.

However, the demands do not end here. Such detailed computations will need to be fed with more accurate data, leading to significantly higher demands on the efficiency and accuracy of data assimilation and uncertainty quantification techniques than those currently available. In the last part of the talk I will present novel multilevel uncertainty quantification tools that have the potential to bring about unprecedented efficiency gains and may lead to truly exascale future applications. I will present these multilevel Monte Carlo methods for a model problem in subsurface flow where it can be shown that they are essentially "optimal", in the sense that their cost is proportional to the cost of one deterministic solve for an individual realisation.
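The multilevel Monte Carlo idea mentioned above can be illustrated in a few lines. The sketch below uses a deliberately trivial "level-$l$ solver" rather than the subsurface-flow problem from the talk; all names and the toy model are illustrative only.

```python
import numpy as np

# Toy multilevel Monte Carlo (MLMC) sketch. The level-l "solver"
# P_l(z) = z**2 * (1 + 2**-l) converges to z**2, so E[P_l] -> E[z**2] = 1.

rng = np.random.default_rng(0)

def P(l, z):
    return z**2 * (1.0 + 2.0**(-l))

def mlmc(L, n0=100_000):
    """Telescoping estimator E[P_L] = E[P_0] + sum_l E[P_l - P_{l-1}].
    Each correction uses the SAME random input z on both levels (the
    coupling), so its variance shrinks with l and the cheap coarse
    levels can carry most of the samples."""
    est = 0.0
    for l in range(L + 1):
        n = max(n0 // 2**l, 100)          # fewer samples on expensive fine levels
        z = rng.standard_normal(n)
        y = P(l, z) - (P(l - 1, z) if l > 0 else 0.0)
        est += y.mean()
    return est

print(mlmc(L=6))  # close to the limit E[z**2] = 1
```

The "optimality" claimed in the abstract corresponds to the total cost being dominated by the coarsest level, with only a handful of samples needed on the finest, most expensive one.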

**Shiwei Lan, Geometric Dimension Independent MCMC**

Bayesian inverse problems often involve sampling probability distributions on functions. Traditional MCMC algorithms degenerate under mesh refinement: their acceptance rates and mixing deteriorate as the dimension of the discretisation grows. Recently, a variety of dimension-independent MCMC methods have emerged, but few of them take the geometry of the posterior into account. In this work, we try to blend recent developments in finite-dimensional manifold samplers with dimension-independent MCMC. The goal of such a marriage is to speed up the mixing of the Markov chain by using geometry, whilst remaining robust under mesh refinement. The key idea is to employ the manifold methods on an appropriately chosen finite-dimensional subspace.
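For readers unfamiliar with dimension-independent samplers, the canonical example is the preconditioned Crank-Nicolson (pCN) proposal, which the geometric methods in this talk build on. The sketch below is a minimal illustration on a toy Gaussian target, not the speaker's algorithm; the names and target are assumptions.

```python
import numpy as np

# Minimal preconditioned Crank-Nicolson (pCN) sketch with an identity
# prior covariance and a toy likelihood (illustrative only).

rng = np.random.default_rng(1)

def log_likelihood(u):
    # Toy misfit; in an inverse problem this would call the forward model.
    return -0.5 * np.sum((u - 1.0)**2)

def pcn_step(u, beta=0.2):
    """One pCN step targeting mu(du) prop. to exp(log_likelihood(u)) N(0, I)(du).
    The proposal preserves the Gaussian prior, so the accept ratio involves
    only the likelihood -- this is what survives mesh refinement."""
    xi = rng.standard_normal(u.shape)            # draw from the prior
    v = np.sqrt(1 - beta**2) * u + beta * xi     # pCN proposal
    log_a = log_likelihood(v) - log_likelihood(u)
    return v if np.log(rng.uniform()) < log_a else u

u = np.zeros(50)                                 # "mesh" dimension; try 500, 5000, ...
for _ in range(5000):
    u = pcn_step(u)
print(u.mean())  # roughly 0.5, between the prior mean 0 and likelihood mode 1
```

Increasing the dimension of `u` leaves the acceptance behaviour essentially unchanged, which is the dimension-independence property; the talk's contribution is to add posterior geometry on top of schemes like this.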

**Konstantinos Zygalakis, Ergodic stochastic differential equations and big data**

Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally expensive. Both the calculation of the acceptance probability and the creation of informed proposals usually require an iteration through the whole data set. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem in three ways: it generates proposals based only on a subset of the data, it skips the accept-reject step, and it uses a sequence of decreasing step sizes. In this talk, using some recent developments in backward error analysis for stochastic differential equations, we will investigate the properties of the SGLD algorithm (and propose new variants of it) when the step size remains fixed. Our findings will be illustrated by a variety of different examples.
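The three ingredients listed above (mini-batch gradients, no accept-reject step, and here a fixed rather than decreasing step size, the regime the talk analyses) can be sketched as follows on a toy Gaussian model. All names and the model are illustrative assumptions, not the speaker's code.

```python
import numpy as np

# Stochastic gradient Langevin dynamics (SGLD) sketch with a FIXED step
# size, estimating the mean of a Gaussian from a large data set.

rng = np.random.default_rng(2)
data = rng.normal(3.0, 1.0, size=10_000)        # "big" data set, true mean 3

def grad_log_post(theta, batch):
    # N(0, 100) prior plus Gaussian likelihood; the mini-batch gradient is
    # rescaled by n/m to be an unbiased estimate of the full-data gradient.
    n, m = len(data), len(batch)
    return -theta / 100.0 + (n / m) * np.sum(batch - theta)

h = 1e-5                                        # fixed step size (no decreasing schedule)
theta, samples = 0.0, []
for _ in range(2000):
    batch = rng.choice(data, size=100)          # subsample: no full-data pass
    theta += 0.5 * h * grad_log_post(theta, batch) + np.sqrt(h) * rng.standard_normal()
    samples.append(theta)                       # note: no accept-reject step
print(np.mean(samples[500:]))  # near the posterior mean, ~3
```

With the step size held fixed, the chain no longer targets the posterior exactly; quantifying and correcting that bias is precisely what the backward error analysis in the talk addresses.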

**Dan Simpson, To avoid fainting, keep repeating 'It's only a model'...**

Big models are slightly terrifying statistical beasts. Festooned with complicated specifications of expert knowledge, it can be difficult to understand how uncertainty propagates through the model. In particular, it is difficult to assess in what way deep distributional assumptions affect the posterior quantities of interest. A second problem is the challenge of using data to inform these models. There is typically not enough information to overcome the prior modelling assumptions ("with low power comes great responsibility"), so any thinking on big models requires the concept of big data. This makes an already challenging computational task even more hair-raising, as in these types of problems the "big data" is often cobbled together from heterogeneous data sources of varying quality. In this talk, I will forgo discussions of the computational problems and look, through some examples, at the statistical challenges associated with fitting these types of models.