# CRiSM Day on Bayesian Intelligence

The Warwick CRiSM Day on "Bayesian Intelligence" will be held on 20th March 2019.

**Speakers**

Xiao-Li Meng, Professor, Department of Statistics, Harvard University, USA. Title: "Artificial Bayesian Monte Carlo Integration: A Practical Resolution to the Bayesian (Normalizing Constant) Paradox"

Julien Stoehr, Université Paris-Dauphine, France. Title: "Gibbs Sampler and ABC methods"

Arthur Ulysse Jacot-Guillarmod, Ecole polytechnique fédérale de Lausanne, Switzerland. Title: "Neural Tangent Kernel: Convergence and Generalization of Deep Neural Networks"

Antonietta Mira, Professor, Università della Svizzera italiana, Switzerland and University of Insubria, Italy. Title: "Bayesian identifications of the data intrinsic dimensions"

**Schedule**

9:00-10:00 Registration (To pre-register for lunch, see below. Pre-registration ends on 7th March.)

10:00-11:00 Xiao-Li Meng

11:00-12:00 Julien Stoehr

12:00-14:00 Lunch Buffet (Zeeman Building Atrium)

14:00-15:00 Arthur Ulysse Jacot-Guillarmod

15:00-16:00 Antonietta Mira

16:00 Snacks and Wine (Zeeman Building Atrium)

**Pre-registration for lunch**

Pre-registration is used only to arrange lunch. If you would like to join us for the hot buffet lunch, please register by **7 March 2019**.

**Venue**

University of Warwick, Zeeman Building, *MS.01*

**(How to get to Zeeman building?)**

**Abstracts**

**Professor Xiao-Li Meng: "Artificial Bayesian Monte Carlo Integration: A Practical Resolution to the Bayesian (Normalizing Constant) Paradox"**

Advances in Markov chain Monte Carlo in the past 30 years have made Bayesian analysis a routine practice. However, there is virtually no practice of performing Monte Carlo integration from the Bayesian perspective; indeed, this problem has earned the “paradox” label in the context of computing normalizing constants (Wasserman, 2013). We first use the modeling-what-we-ignore idea of Kong et al. (2003) to explain that the crux of the paradox is not with the likelihood theory, which is essentially the same as that for standard non-parametric probability/density estimation (Vardi, 1985), although, through its use of group theory, it provides a richer framework for modeling the trade-off between statistical efficiency and computational efficiency. But there is a real Bayesian paradox: Bayesian analysis cannot be applied exactly for solving Bayesian computation, because performing exact Bayesian Monte Carlo integration would require more computation than is needed to solve the original Monte Carlo problem. We then show that there is a practical resolution to this paradox using the profile likelihood obtained in Kong et al. (2006) and that this approximation is second-order valid asymptotically. We also investigate a more computationally efficient approximation via an artificial likelihood of Geyer (1994). This artificial likelihood approach is only first-order valid, but there is a computationally trivial adjustment that makes it second-order valid. We demonstrate empirically the efficiency of these approximated Bayesian estimators, compared to the usual frequentist-based Monte Carlo estimators, such as bridge sampling estimators (Meng and Wong, 1996). [Joint work with Masatoshi Uehara]
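For readers unfamiliar with the frequentist baseline mentioned at the end of this abstract, the following is a minimal sketch of a bridge sampling estimate of a ratio of normalizing constants. It is not the Bayesian estimator of the talk. The two Gaussian targets, the sample size, and the geometric bridge function are assumptions chosen so that the true ratio (0.5) is known.

```python
import math
import random

rng = random.Random(1)


def q1(x):
    # Unnormalized N(0, 1) density; normalizing constant c1 = sqrt(2*pi).
    return math.exp(-0.5 * x * x)


def q2(x):
    # Unnormalized N(1, 2^2) density; normalizing constant c2 = 2*sqrt(2*pi).
    return math.exp(-0.5 * ((x - 1.0) / 2.0) ** 2)


n = 20000  # illustrative sample size per density
draws_p1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
draws_p2 = [rng.gauss(1.0, 2.0) for _ in range(n)]

# Bridge identity: c1 / c2 = E_p2[q1 * alpha] / E_p1[q2 * alpha].
# With the geometric bridge alpha = 1 / sqrt(q1 * q2), q1 * alpha = sqrt(q1 / q2).
numerator = sum(math.sqrt(q1(x) / q2(x)) for x in draws_p2) / n
denominator = sum(math.sqrt(q2(x) / q1(x)) for x in draws_p1) / n
ratio_estimate = numerator / denominator  # estimates c1 / c2, true value 0.5
print(ratio_estimate)
```

The estimator only needs draws from each density and pointwise evaluations of the unnormalized densities, which is the setting in which the normalizing-constant question arises.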

**Julien Stoehr: Gibbs Sampler and ABC methods**

ABC methods are now well-known tools for dealing with intractable models. A limitation of such methods is that they become exponentially less efficient as the dimension of the parameter space grows. We propose to reduce this difficulty by exploring the parameter space with a Gibbs sampler in which ABC approximations of the conditionals are used as surrogates. The resulting algorithm is shown to be convergent under some conditions. We also obtain a loose upper bound on the total variation distance between the resulting law and the law of the exact Gibbs sampler. Several numerical simulations show the efficiency of this approach, although it shares the limitations of both the Gibbs sampler and ABC methods, such as the compatibility of the conditionals and the choice of summary statistics.
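To make the idea of using ABC approximations of the conditionals inside a Gibbs scheme concrete, here is a toy sketch. It is not the speaker's algorithm as specified in the talk: the hierarchical Gaussian model, the made-up observed group means, the "keep the closest of a few proposals" ABC step, and all names are assumptions for illustration only.

```python
import random

rng = random.Random(2)

# Toy hierarchical model (an assumption for illustration):
#   alpha ~ Uniform(-4, 4),  mu_j | alpha ~ N(alpha, 1),  xbar_j | mu_j ~ N(mu_j, 1/n)
n = 10
xbar_obs = [-1.0, 0.0, 0.5, 1.0, 2.0]  # made-up observed group means
J = len(xbar_obs)
n_prop = 30  # number of proposals per ABC step


def abc_alpha(mu):
    """ABC surrogate for alpha | mu: keep the prior draw whose simulated
    mean of the mu_j is closest to the current mean of mu."""
    target = sum(mu) / J
    best, best_dist = 0.0, float("inf")
    for _ in range(n_prop):
        cand = rng.uniform(-4.0, 4.0)
        sim_mean = sum(rng.gauss(cand, 1.0) for _ in range(J)) / J
        dist = abs(sim_mean - target)
        if dist < best_dist:
            best, best_dist = cand, dist
    return best


def abc_mu(alpha, xbar_j):
    """ABC surrogate for mu_j | alpha, xbar_j: keep the draw from N(alpha, 1)
    whose simulated group mean is closest to the observed one."""
    best, best_dist = 0.0, float("inf")
    for _ in range(n_prop):
        cand = rng.gauss(alpha, 1.0)
        sim_xbar = rng.gauss(cand, (1.0 / n) ** 0.5)
        dist = abs(sim_xbar - xbar_j)
        if dist < best_dist:
            best, best_dist = cand, dist
    return best


alpha, mu = 0.0, list(xbar_obs)
alpha_chain = []
for it in range(1500):
    alpha = abc_alpha(mu)                        # update the hyperparameter
    mu = [abc_mu(alpha, xj) for xj in xbar_obs]  # update each group parameter
    if it >= 500:                                # discard burn-in
        alpha_chain.append(alpha)

alpha_mean = sum(alpha_chain) / len(alpha_chain)
print(alpha_mean)
```

Each ABC step only has to match a one-dimensional summary, which is the motivation for the component-wise scheme. In this toy model the chain average of `alpha` should land near the average of the observed group means.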

**Arthur Ulysse Jacot-Guillarmod: Neural Tangent Kernel: Convergence and Generalization of Deep Neural Networks**

We show that the behaviour of a Deep Neural Network (DNN) during gradient descent is described by a new kernel: the Neural Tangent Kernel (NTK). More precisely, as the parameters are trained using gradient descent, the network function (which maps the network inputs to the network outputs) follows a so-called kernel gradient descent w.r.t. the NTK. We prove that as the network layers get wider and wider, the NTK converges to a deterministic limit at initialization, which stays constant during training. This implies in particular that if the NTK is positive definite, the network function converges to a global minimum. The NTK also describes how DNNs generalise outside the training set: for a least squares cost, the network function converges in expectation to the NTK kernel ridgeless regression, explaining how DNNs generalise in the so-called overparametrized regime, which is at the heart of most recent developments in deep learning.
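The statement that the network function follows kernel gradient descent with respect to the NTK can be checked numerically to first order. The sketch below assumes a hypothetical one-hidden-layer tanh network with scalar input, a made-up three-point training set, and a cost of 0.5 times the sum of squared residuals. It illustrates only the finite-width identity for a single small step, not the infinite-width limit proved in the talk.

```python
import math
import random

rng = random.Random(0)
m = 200  # hidden width (illustrative)
s = 1.0 / math.sqrt(m)
# params = (a, w, b) for f(x) = s * sum_k a[k] * tanh(w[k] * x + b[k])
params = tuple([rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(3))


def f(x, params):
    a, w, b = params
    return s * sum(ak * math.tanh(wk * x + bk) for ak, wk, bk in zip(a, w, b))


def grad(x, params):
    """Gradient of f(x) with respect to all parameters, flattened as a + w + b."""
    a, w, b = params
    t = [math.tanh(wk * x + bk) for wk, bk in zip(w, b)]
    g_a = [s * tk for tk in t]
    g_b = [s * ak * (1.0 - tk * tk) for ak, tk in zip(a, t)]
    g_w = [x * g for g in g_b]
    return g_a + g_w + g_b


def ntk(x1, x2, params):
    """Empirical NTK: inner product of the parameter gradients at x1 and x2."""
    return sum(u * v for u, v in zip(grad(x1, params), grad(x2, params)))


xs, ys = [-1.0, 0.0, 1.0], [0.5, -0.3, 0.8]  # made-up training set
residuals = [f(x, params) - y for x, y in zip(xs, ys)]
lr, x_test = 1e-4, 0.5

# Kernel gradient descent prediction for the change of f(x_test) after one step.
predicted = -lr * sum(ntk(x_test, x, params) * r for x, r in zip(xs, residuals))

# Actual gradient descent step in parameter space on the same cost.
flat = [p for group in params for p in group]
cost_grad = [sum(r * g for r, g in zip(residuals, col))
             for col in zip(*(grad(x, params) for x in xs))]
new_flat = [p - lr * g for p, g in zip(flat, cost_grad)]
new_params = (new_flat[:m], new_flat[m:2 * m], new_flat[2 * m:])
actual = f(x_test, new_params) - f(x_test, params)
print(predicted, actual)
```

The two printed numbers should agree up to terms of second order in the learning rate, which is the sense in which the parameter update induces a kernel update on the network function.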

**Professor Antonietta Mira: Bayesian identifications of the data intrinsic dimensions**

Even if they are defined on a space with a large dimension, data points usually lie on a hypersurface, or manifold, with a much smaller intrinsic dimension (ID). The recent TWO-NN method (Facco et al., 2017, Scientific Reports) allows the ID to be estimated when all points lie on a single manifold. TWO-NN only assumes that the density of points is approximately constant in a small neighborhood around each point. Under this hypothesis, the ratio of the distances of a point from its first and second neighbour follows a Pareto distribution that depends parametrically only on the ID, allowing for an immediate estimation of the latter. We extend the TWO-NN model to the case in which the data lie on several manifolds with different IDs. While the idea behind the extension is simple (the Pareto is replaced by a mixture of $K$ Pareto distributions), a non-trivial Bayesian scheme is required for estimating the model and assigning each point to the correct manifold. Applying this method, which we dub Hidalgo (heterogeneous intrinsic dimension algorithm), we uncover a surprising ID variability in several real-world datasets. Hidalgo obtains remarkable results, but its main limitation is that the number of components in the mixture must be fixed a priori. To adopt a fully Bayesian approach, a possible extension would be the specification of a prior distribution for the parameter $K$. Instead, we employ a flexible Bayesian nonparametric approach and model the data as an infinite mixture of Pareto distributions using a Dirichlet Process Mixture Model. This approach makes it possible to evaluate the uncertainty about the number of mixture components. Since the posterior distribution has no closed form, we employ the Slice Sampler algorithm for posterior inference. In preliminary analyses performed on simulated data, the model provides promising results. [Joint work with Michele Allegra, Francesco Denti, Elena Facco, Alessandro Laio, Michele Guindani]
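The single-manifold case described at the start of this abstract can be sketched in a few lines. The example below assumes that the ratio of second to first neighbour distance is Pareto distributed on [1, ∞) with shape equal to the ID, and uses the plain maximum-likelihood estimator of that shape; the fitting procedure used in the TWO-NN paper or in Hidalgo may differ. The synthetic data (a flat two-dimensional sheet embedded in five dimensions) and the function name are made up for illustration.

```python
import math
import random


def two_nn_id(points):
    """Maximum-likelihood ID estimate from ratios of second to first
    nearest-neighbour distances."""
    log_ratios = []
    for i, p in enumerate(points):
        # Find the two smallest distances from p to any other point.
        r1, r2 = math.inf, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    # If mu = r2 / r1 is Pareto with shape d on [1, inf), then log(mu) is
    # Exponential with rate d, so the MLE of d is n / sum(log mu).
    return len(log_ratios) / sum(log_ratios)


rng = random.Random(0)
# Synthetic data: a 2-dimensional flat sheet embedded in 5 ambient dimensions.
points = []
for _ in range(600):
    u, v = rng.random(), rng.random()
    points.append((u, v, u + v, u - v, 2.0 * u))

estimate = two_nn_id(points)
print(round(estimate, 2))
```

Although the points have five coordinates, the estimate should be close to 2, the dimension of the sheet they were sampled from.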

**References**

Wasserman, L. (2013) All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media. Also see https://normaldeviate.wordpress.com/2012/10/05/the-normalizing-constant-paradox/

Kong, A., P. McCullagh, X.-L. Meng, D. Nicolae, and Z. Tan (2003). A theory of statistical models for Monte Carlo integration (with Discussions). J. R. Statist. Soc. B 65, 585-604. http://stat.harvard.edu/XLM/JRoyStatSoc/JRoyStatSocB65-3_585-618_2003.pdf

Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist. 13 (1), 178-203. https://projecteuclid.org/download/pdf_1/euclid.aos/1176346585

Kong, A., P. McCullagh, X.-L. Meng, and D. Nicolae (2006). Further explorations of likelihood theory for Monte Carlo integration. In Advances in Statistical Modeling and Inference: Essays in Honor of Kjell A. Doksum (Ed: V. Nair), 563-592. World Scientific Press. http://www.stat.harvard.edu/XLM/books/kmmn.pdf

Geyer, C. J. (1994). Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568, School of Statistics, University of Minnesota, Minneapolis. https://scholar.google.com/scholar?cluster=6307665497304333587&hl=en&as_sdt=0,22

Meng, X.-L. and Wong, W.H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6, 831-860.