# Big Data and Computational Scalability

### July 1, 2015

The most obvious challenge of working with "big" data is that the volume can exceed what is feasible to compute with. Traditional methods can fail to scale up. Instead, we need new ways to scale state-of-the-art methods, or new methods with tractable computational complexity. Working in our favour is the fact that data often exhibits sparsity, and high support for important features. Speakers from across mathematics, statistics, machine learning and computer science will highlight techniques for scaling up to big data.

### Speakers

- Emmanuel Candes (Professor, Stanford University) [slides]
- Martin Wainwright (Professor, University of California at Berkeley) [slides]
- Alexander J. Smola (Researcher, Google and Professor, Carnegie Mellon University)
- Graham Cormode (Professor, University of Warwick) [slides]

Graham Cormode (11:00) and Martin Wainwright's (1:15:30) presentations. [Download]

### Location

The workshop will take place in the Mathematics and Statistics Department, Zeeman Building, at the University of Warwick. Talks will be held in MS.02. Travel information

### Programme

10.15--11.00 Registration

11.00--12.00 **Emmanuel Candes
**12.00--13.00 Lunch

13.30--14.30

**Graham Cormode**

14.30--15.30

**Martin Wainwright**

15.30--16.00 Coffee

16.00--17.00

**Alexander Smola**

### Registration

Registration for this workshop has now closed.

### Abstracts

**Emmanuel Candes,** **Around the Reproducibility of Scientific Research in the Big Data Era: A Knockoff Filter for Controlling the False Discovery Rate**

The big data era has created a new scientific paradigm: collect data first, ask questions later. When the universe of scientific hypotheses that are being examined simultaneously is not taken account, inferences are likely to be false. The consequence is that follow up studies are likely not to be able to reproduce earlier reported findings or discoveries. This reproducibility failure bears a substantial cost and this talk is about new statistical tools to address this issue. Imagine that we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)---the expected fraction of false discoveries among all discoveries---is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. We discuss the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, and the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. This work is joint with Rina Foygel Barber.

**Graham Cormode, Compact summaries over large datasets**

A fundamental challenge in processing the massive quantities of information generated by modern applications is in extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is in the design and analysis of compact summaries: data structures which capture key features of the data, and which can be created effectively over distributed data sets. Popular summary structures include the count distinct algorithms, which compactly approximate item set cardinalities, and sketches which allow vector norms and products to be estimated. These are very attractive, since they can be computed in parallel and combined to yield a single, compact summary of the data. This talk gives an overview of the concepts behind, and examples of, compact summaries.

**Martin Wainwright, Randomized algorithms for high-dimensional optimization: Statistical and computational guarantees**

Sketches and other randomized projection schemes are generic tools for dimensionality reduction. In this talk, we discuss some recent developments in the use of sketching for obtaining fast but approximate solutions to various types of high-dimensional convex programs, ranging from relatively simple least-squares problems to more complex semidefinite programs. In the context of least-squares programs, we begin by proving that the standard version of least-squares sketching is highly suboptimal for solution approximation. Motivated by this deficiency, we propose a novel scheme known as the iterative Hessian sketch, and provide sharp bounds on its solution error. This scheme also has a generalization to arbitrary twice-differentiable objectives, leading to a randomized optimization algorithm known as the Newton sketch.

Based on joint work with Mert Pilanci and Yun Yang, UC Berkeley.

**Alex Smola, Random Function Classes for Machine Learning**

Random function classes offer an extremely versatile tool for describing nonlinearities, as they are commonly employed in machine learning. This ranges from compact summaries of distributions to nonlinear function expansions. We show that Bloom Filters, the Count-Min sketch, and a new family of Semidefinite Sketches can all be viewed as attempts at finding the most conservative solution of a convex optimization problem (and with matching guarantees) when querying properties of a distribution. Moreover, the sketches themselves prove useful, e.g. when representing high-dimensional functions, thus leading to the hash kernel for generalized linear models and recommender systems. Next we discuss random kitchen sinks and their accelerated variants, fastfood, a-la-carte and deep-fried convnets. They offer memory-efficient alternatives to incomplete matrix factorization and decomposition for kernel functions. Finally, we combine this approach with sketches, using Pagh's compressed matrix multiplication construction, yielding computationally efficient two-sample and independence tests.