Skip to main content Skip to navigation

Event Diary

Show all calendar items

CRiSM Seminar - Nicolai Meinshausen

- Export as iCalendar
Location: A1.01

Nicolai Meinshausen (University of Oxford)

Min-wise hashing for large-scale regression and classification.

We take a look at large-scale regression analysis in a "large p, large n" context for a linear regression or classification model.

In a high-dimensional "large p, small n" setting, we can typically only get good estimation if there exists a sparse regression vector that approximates the observations. No such assumptions are required for large-scale regression analysis where the number of observations n can (but does not have to) exceed the number of variables p. The main difficulty is that computing an OLS or ridge-type estimator is computationally infeasible for n and p in the millions and we need to find computationally efficient ways to approximate these solutions without increasing the prediction error by a large amount. Trying to find interactions amongst millions of variables seems to be an even more daunting task. We study a small variation of the b-bit minwse-hashing scheme (Li and Konig, 2011) for sparse datasets and show that the regression problem can be solved in a much lower-dimensional setting as long as the product of the number of non-zero elements in each observation and the l2-norm of a good approximation vector is small. We get finite-sample bounds on the prediction error. The min-wise hashing scheme is also shown to fit interaction models. Fitting interactions does not require an adjustment to the method used to approximate linear models, it just requires a higher-dimensional mapping.

Show all calendar items