Events

2024 2023 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 2012

CRiSM Seminar - Nicolai Meinshausen

16:00 - 17:00, Thu, 27 Jun '13 Export as iCalendar

Location: A1.01

Nicolai Meinshausen (University of Oxford)

Min-wise hashing for large-scale regression and classification.

We take a look at large-scale regression analysis in a "large p, large n" context for a linear regression or classification model.

In a high-dimensional "large p, small n" setting, we can typically only get good estimation if there exists a sparse regression vector that approximates the observations. No such assumptions are required for large-scale regression analysis where the number of observations n can (but does not have to) exceed the number of variables p. The main difficulty is that computing an OLS or ridge-type estimator is computationally infeasible for n and p in the millions and we need to find computationally efficient ways to approximate these solutions without increasing the prediction error by a large amount. Trying to find interactions amongst millions of variables seems to be an even more daunting task. We study a small variation of the b-bit minwse-hashing scheme (Li and Konig, 2011) for sparse datasets and show that the regression problem can be solved in a much lower-dimensional setting as long as the product of the number of non-zero elements in each observation and the l2-norm of a good approximation vector is small. We get finite-sample bounds on the prediction error. The min-wise hashing scheme is also shown to fit interaction models. Fitting interactions does not require an adjustment to the method used to approximate linear models, it just requires a higher-dimensional mapping.

Tags: CRiSM Seminars Seminars CRiSM

Show all calendar items