Statistics » Forum
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/
The latest posts to Statistics » Forum. (C) 2022 University of Warwick. Fri, 27 Oct 2017 14:34:01 GMT

supplements on RKHS and basics on GP
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a1785d866535dc90166785d16401560
<p>Dear all,</p>
<p>Thanks again to Sigurd for today. I personally wasn’t expecting an RKHS/Functional Analysis perspective on this for GPLVM, but it makes sense to want to look at it that way first, so thank you twice.</p>
<p>A) For those not that familiar with the RKHS view (I think the reading group went over it a bit last year), the following resources will be useful:</p>
<p>https://alex.smola.org/papers/2001/SchHerSmo01.pdf</p>
<p>The main book in this area, again by Schölkopf and Smola (Sigurd, I think your worldview preference matches nicely with Schölkopf's, as he is one of the main ML theoreticians in this space):<br>https://mitpress.mit.edu/books/learning-kernels [I have a copy if anyone wants to borrow it; the library should have more]</p>
<p>J.S-T and N. C. book:<br>https://people.eecs.berkeley.edu/~jordan/kernels/0521813972pre_pi-xiv.pdf [chapter 3 has most of basic intro and theory for RKHS and Mercer’s theorem etc]</p>
<p>B) Theory on GPs in terms of compactness, universal approximation properties, and proofs of denseness.</p>
<p>That's one of the main papers in this area (in ML-land; surely there is much more in Stats/Math-land disguised under the Brownian motion view or kriging estimators):<br>http://jmlr.csail.mit.edu/papers/volume7/micchelli06a/micchelli06a.pdf</p>
<p>A lot of it is a bit beyond me at the moment, but maybe someone can guide us through it one day. Relevant follow-ups:</p>
<p>http://www.jmlr.org/papers/volume12/sriperumbudur11a/sriperumbudur11a.pdf [section 3.1 is relevant]</p>
<p>https://arxiv.org/pdf/1708.08157.pdf [on kernel embeddings of measures, mmd, quite heavy in theory]</p>
<p> </p>
<p>Now of course this is for the standard setting, and I am not sure how much of it would carry through given the RV inputs of the GPLVM/Deep GP construction. I would guess a lot, since it's a composition of GPs, but it sounds messy to show and I haven’t seen any theory like that developed in this space. There might be a nice theoretical paper in this. I don’t think Damianou does/shows anything like that in his PhD/papers.</p>
<p>Finally, for the DGP construction, a somewhat relevant paper (although not on the RKHS view) on better understanding what that composition entails for effective depth, ergodicity, sampling etc. is by some suspects well known to us:<br>https://arxiv.org/pdf/1711.11280.pdf</p>
<p>Best, Theo</p>
Mon, 15 Oct 2018 15:34:18 GMT, Sigurd Assing

Re: Neural Networks Slides
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841b619e73380161e2500e633221
<p>Oh also, the link to the optimiser visualisations is here; the link in the slides is wrong:</p>
<p>https://imgur.com/a/Hqolp</p>
Thu, 01 Mar 2018 16:05:52 GMT, Henry Jia

Neural Networks Slides
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841b619e73380161e24db32a3215
<p>Hey</p>
<p>Here are the slides for my final neural networks talk.</p>
<p>Hengjian (Henry) Jia</p>
<div>
<p><strong>Attachments</strong> <small class="muted text-muted">(follow link to download)</small></p>
<ul>
<li>
<a href="https://warwick.ac.uk/sitebuilder2/file/fac/sci/statistics/news/mlrg/ml-seminar/8a17841b619e73380161e24db2573214/neural_networks_sota.pdf?sbrPage=/fac/sci/statistics/news/mlrg/ml-seminar&attachment=8a17841b619e73380161e24db32b3216&forceOpenSave=true">neural_networks_sota.pdf</a> <small class="muted text-muted">(297 KB)</small>
</li>
</ul>
</div>
Thu, 01 Mar 2018 16:03:18 GMT, Henry Jia

Re: Stein paradox
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841a6012ba7a01605cb581354d69
<p>when I scrolled through this, it rang a bell. I know of course Stein's lemma, which is simply Gaussian partial integration, and when I heard of Stein's lemma a long time ago, people must have mentioned Stein's paradox too.</p>
<p>however, I noticed that the shrinkage estimator used for showing the paradox does NOT require any knowledge of the true parameter, while the shrunken estimators I discussed, but also those shahin referenced in his second forum post, depend on the true parameter, which would be the true distribution function in the "worst" case. and, different to the paradox, these shrunken estimators work in all dimensions. and the linear shrinkage ones would shrink both mean square error and variance---checked for the unbiased case. but bias would go up after performing shrinkage, though I could calculate the bound [latex]\min\{(c-\theta_0)/\mathrm{var}\,\hat{\theta},\,\mathrm{var}\,\hat{\theta}/(c-\theta_0)\}[/latex], again in the unbiased case (the lower bound would be the reciprocal of the sum of the two terms under the min).</p>
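For anyone who wants to see the paradox in numbers: a minimal simulation sketch (dimension, seed and trial count are all illustrative). Note the positive-part James-Stein estimator uses only the observed vector, no knowledge of the true mean, which is exactly the point made above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_trials = 10, 2000              # any dimension >= 3 exhibits the paradox
theta = rng.normal(size=p)          # unknown true mean, used only for scoring

mse_mle = mse_js = 0.0
for _ in range(n_trials):
    x = theta + rng.normal(size=p)                     # one observation, unit variance
    shrink = max(0.0, 1.0 - (p - 2) / np.sum(x ** 2))  # positive-part James-Stein factor
    mse_mle += np.sum((x - theta) ** 2)                # risk of the plain estimate x
    mse_js += np.sum((shrink * x - theta) ** 2)        # risk after shrinking toward 0

mse_mle /= n_trials                 # close to p
mse_js /= n_trials                  # strictly smaller on average for p >= 3
```

The estimated risk of the shrunken estimator comes out clearly below that of the maximum likelihood estimate, even though the shrinkage target 0 is chosen without looking at the true mean.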
<p>so, doing shrinkage repeatedly may give a biased estimator in the end, but there is hope that the bias is on the same scale as the shrunken variance in the penultimate step if [latex]c-\theta_0[/latex] is good enough, AND THIS WOULD JUSTIFY SHRINKING THE VARIANCE WITHOUT THINKING ABOUT CREATING BIAS. verifying the latter seems easy for linear shrinkage but might be involved for other shrinkage methods---care has to be taken. eventually, if the sample size is big enough, then the empirical distribution should be close enough to the true one, and hence it should be fine to start the shrinkage iteration with the empirical distribution, which means bootstrapping.</p>
Sat, 16 Dec 2017 00:24:47 GMT, Sigurd Assing

Re: Examples of shrinkage estimators
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841b60025f91016059f2450d4860
<p>shahin, thanks for these links---great stuff. I screened through the papers, and below is a short sketch of how this connects to what was discussed yesterday. but first, a quick reminder of the linear shrinkage ansatz I stressed several times yesterday: the shrunken estimator of the functional relation between X and Y is [latex](1-t_0)\hat{f}+t_0c[/latex], where [latex]t_0[/latex] is chosen in a way minimizing some distance to the true function [latex]f[/latex]. for me, [latex]\hat{f}(\cdot)=\sum_{j=1}^N N_j(\cdot)\hat{\theta}_j[/latex], where the [latex]N_j[/latex] might be eigenvectors with respect to some PDE problem, or cubic splines, or whatever.</p>
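To make the ansatz concrete in the simplest scalar case: for an unbiased [latex]\hat{\theta}[/latex] with variance [latex]v[/latex], minimizing the mean squared distance to the true [latex]\theta[/latex] gives the oracle weight [latex]t_0=v/(v+(c-\theta)^2)[/latex]. A minimal sketch (all numbers illustrative; the oracle weight uses the true [latex]\theta[/latex], matching the caveat elsewhere in this thread that these shrunken estimators depend on the true parameter):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, v, c = 2.0, 1.0, 1.5               # true value, variance, reference point (illustrative)
t0 = v / (v + (c - theta) ** 2)           # oracle weight minimizing mean squared error

n = 100_000
theta_hat = theta + np.sqrt(v) * rng.normal(size=n)   # unbiased estimates
shrunk = (1 - t0) * theta_hat + t0 * c                # linear shrinkage toward c

mse_plain = np.mean((theta_hat - theta) ** 2)    # roughly v
mse_shrunk = np.mean((shrunk - theta) ** 2)      # roughly v*(c-theta)^2/(v+(c-theta)^2) < v
```

The shrunken estimator always improves the mean squared error at the optimal weight, at the cost of a bias of size [latex]t_0(c-\theta)[/latex].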
<p>COVARIANCE SHRINKAGE: this is about a shrinkage method for sample eigenvalues [latex]\hat{\lambda}_i\,,\,i=1,\dots,p[/latex] when estimating a covariance. therefore, the corresponding eigenvectors don't play the same role as the [latex]N_j[/latex] in my structure for [latex]\hat{f}[/latex]. the shrinkage is nonlinear because the shrunken estimator is obtained by [latex]\hat{\lambda}_i/[1-c-c\hat{\lambda}_i\check{m}_F(\hat{\lambda}_i)]^2[/latex]. here [latex]\check{m}_F(\hat{\lambda}_i)[/latex] plays the role of [latex]t_0[/latex] but it is not found by simply minimizing some distance, and hence further thinking is required when applying this in other contexts. Furthermore, the reference point [latex]c[/latex] is not a vector in [latex]R^p[/latex] but a constant being the same for all principal directions, though the shrinkage is different in each direction. Applying this to my smoothing spline example would mean to shrink each [latex]\hat{\theta}_j[/latex] separately by a similar method, which is different to shrinking things [latex]x[/latex] by [latex]x[/latex] as discussed yesterday.</p>
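A quick illustration of why eigenvalue shrinkage pays off when [latex]p[/latex] is comparable to [latex]n[/latex], using scikit-learn's Ledoit-Wolf estimator. Note this is the simpler *linear* shrinkage toward a scaled identity, not the nonlinear per-eigenvalue formula above; dimensions and seed are illustrative:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(2)
p, n = 50, 60                       # p comparable to n: sample eigenvalues spread out badly
X = rng.normal(size=(n, p))         # true covariance is the identity

sample_cov = np.cov(X, rowvar=False)
lw = LedoitWolf().fit(X)            # linear shrinkage toward a scaled identity

err_sample = np.linalg.norm(sample_cov - np.eye(p))   # Frobenius error of the sample covariance
err_lw = np.linalg.norm(lw.covariance_ - np.eye(p))   # much smaller after shrinkage
```

The sample covariance overestimates the large eigenvalues and underestimates the small ones in this regime; shrinking them toward a common value repairs most of the damage.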
<p>SPECTRAL DENSITY MATRIX SHRINKAGE: this is about a [latex]p[/latex]-dimensional weakly stationary discrete time series, and one is after the spectral density matrix denoted by [latex]f[/latex]. so, this [latex]f[/latex] is a matrix-valued function depending on frequencies [latex]\omega[/latex], and the shrunken estimate for [latex]f(\omega)[/latex] is [latex](1-W_T(\omega))\hat{f}(\omega)+W_T(\omega)\tilde{V}(\omega)[/latex]. this is linear shrinkage as discussed yesterday. First, the shrinkage is worked out [latex]\omega[/latex] by [latex]\omega[/latex], and [latex]\omega[/latex] plays the role of [latex]x[/latex], of course. Second, [latex]W_T(\omega)[/latex] plays the role of [latex]t_0[/latex], with [latex]T[/latex] being the time horizon of the time series (this time dependence should change when the series is NOT stationary). Third, [latex]\tilde{V}(\omega)[/latex] and [latex]\hat{f}(\omega)[/latex] are a parametric and a nonparametric estimate of [latex]f\,[/latex], respectively, with [latex]\tilde{V}(\omega)[/latex] playing the role of [latex]c[/latex]. So, this is indeed the same as what I wanted to do, because I wanted to take a linear regression for [latex]c[/latex] and do the whole thing [latex]x[/latex] by [latex]x[/latex].</p>
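A scalar sketch of this frequency-by-frequency linear shrinkage: a raw periodogram (nonparametric, very noisy) is shrunk toward a fitted AR(1) spectrum (parametric). The constant weight W is purely illustrative; the paper derives [latex]W_T(\omega)[/latex] adaptively:

```python
import numpy as np

rng = np.random.default_rng(3)
T, phi = 4096, 0.6

# simulate a univariate AR(1) series (a scalar stand-in for the matrix-valued case)
x = np.zeros(T)
eps = rng.normal(size=T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + eps[t]

freqs = np.fft.rfftfreq(T)                      # frequencies in cycles per sample
f_hat = np.abs(np.fft.rfft(x)) ** 2 / T         # raw periodogram: noisy nonparametric estimate

# parametric target: fit an AR(1) and plug into its spectral density
phi_hat = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])
sigma2 = np.mean((x[1:] - phi_hat * x[:-1]) ** 2)
v_tilde = sigma2 / (1 - 2 * phi_hat * np.cos(2 * np.pi * freqs) + phi_hat ** 2)

W = 0.5                                         # illustrative constant weight
f_shrunk = (1 - W) * f_hat + W * v_tilde        # linear shrinkage, frequency by frequency

# compare against the true AR(1) spectrum (unit innovation variance)
true_f = 1.0 / (1 - 2 * phi * np.cos(2 * np.pi * freqs) + phi ** 2)
err_hat = np.mean((f_hat - true_f) ** 2)
err_shrunk = np.mean((f_shrunk - true_f) ** 2)
```

Because the raw periodogram has variance on the order of the squared spectrum at each frequency, even a crude parametric target with a fixed weight cuts the error substantially.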
<p>ALL IN ALL, what was suggested yesterday for smoothing splines is applied in practice, and one could even go beyond the linear shrinkage ansatz, though this might be less intuitive. the next step, to be discussed in january, would be to bring in bootstrapping and to understand how this can be applied repeatedly. for christmas, everybody is welcome to think about what shrinks to what, and how much, when bootstrapping repeatedly.</p>
Fri, 15 Dec 2017 11:32:17 GMT, Sigurd Assing

Examples of shrinkage estimators
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841a6012ba7a016055846cef0198
<p>## covariance shrinkage:</p>
<p>nonlinear: http://www.econ.uzh.ch/dam/jcr:ffffffff-935a-b0d6-ffff-ffff9fd1f079/AOS989.pdf</p>
<p>http://www.econ.uzh.ch/dam/jcr:ed299b2a-26bb-45cd-a563-bb795a4139d3/jmva_2015.pdf</p>
<p>## spectral density matrix shrinkage</p>
<p>http://www.jstor.org/stable/23024918</p>
Thu, 14 Dec 2017 14:53:50 GMT, Shahin Tavakoli

Stein paradox
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841b60025f91016055769d0945c3
<p>Here is a link to the wiki page on the topic I mentioned towards the end of today's reading group:</p>
<p>https://en.wikipedia.org/wiki/Stein%27s_example</p>
<p>It is worth knowing about the Stein paradox since it is the precursor of the idea of shrinkage and penalized estimation in statistics.</p>
Thu, 14 Dec 2017 14:38:45 GMT, Shahin Tavakoli

Convolutional Neural Networks and Capsuling
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841a6012ba7a01601445750530f9
<p>During the discussions in/after the last reading group, I mentioned this talk on convolutional networks, which explains some major discrepancies between convolutional nets and the human visual cortex.</p>
<p>Here it is:</p>
<p>https://www.youtube.com/watch?v=rTawFwUvnLE</p>
Fri, 01 Dec 2017 22:49:44 GMT, Henry Jia

effective degrees of freedom
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841a5f2ace02015f641b034456c2
<p>i want to compare the two different smoother matrices used for smoothing splines and kernel smoothers, respectively. for unexplained notation please refer to the book.</p>
<p>in the case of smoothing splines, the smoother matrix</p>
<p>[latex]S_\lambda=N(N^TN+\lambda\Omega_N)^{-1}N^T[/latex]</p>
<p>is symmetric + positive semidefinite, and it can be expressed in Reinsch form because the lambda-dependence is based on a very simple operation of perturbation type. this has two consequences. first, the eigenvalues are all positive, and their size might decay to zero quickly except for a few bigger ones, and hence [latex]trace(S_\lambda)[/latex], which is the sum of these eigenvalues, kind of gives the dimension of the subspace associated with the bigger eigenvalues justifying that this trace has something to do with degrees of freedom. second, the Reinsch form makes it possible to understand how [latex]trace(S_\lambda)[/latex] depends on lambda, and therefore one can calculate meaningful lambdas for given degrees of freedom.</p>
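Both consequences are easy to check numerically with a penalized-spline stand-in: a truncated-power cubic basis for [latex]N[/latex] and a second-difference penalty for [latex]\Omega_N[/latex] (both illustrative choices, not the book's exact construction). The trace decreases monotonically in lambda, from the basis dimension down toward the dimension of the penalty's null space:

```python
import numpy as np

x = np.linspace(0, 1, 50)

# basis matrix N: truncated-power cubic spline basis (an illustrative stand-in)
knots = np.linspace(0.1, 0.9, 10)
N = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                    [np.clip(x - k, 0, None)**3 for k in knots])

# penalty Omega_N: second-difference penalty D^T D on the coefficients
D = np.diff(np.eye(N.shape[1]), n=2, axis=0)
Omega = D.T @ D

def edf(lam):
    """Effective degrees of freedom trace(S_lambda) for the smoother matrix
    S_lambda = N (N^T N + lam * Omega)^{-1} N^T."""
    S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)
    return np.trace(S)

lams = [1e-4, 1e-2, 1.0, 100.0]
dofs = [edf(l) for l in lams]    # decreases monotonically as lambda grows
```

The monotone dependence on lambda is what makes it possible to work backward from a prescribed number of degrees of freedom to a unique lambda in the spline case.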
<p>i now think that both consequences break down in the case of kernel smoothers in the following sense: one would have to do some extra work to find out whether the situation is APPROXIMATELY the same as in the case of smoothing splines.</p>
<p>let's have a look at the smoother matrix itself whose ith row reads</p>
<p>[latex]S_\lambda[i,]=b(x_i)^T(B^TW_\lambda(x_i)B)^{-1}B^TW_\lambda(x_i)[/latex]</p>
<p>where [latex]x_1\dots x_N[/latex] is the training input. it has to be given row by row and is NOT just a product of matrices like in the case of smoothing splines. furthermore, the lambda-dependence is much more complex and non-linear, and there won't be a Reinsch form in general. worse, such smoother matrices are usually neither symmetric nor positive semidefinite. </p>
<p>to demonstrate this, i have calculated two examples---see the attached pdf. the first kernel is linear and the corresponding smoother matrix is almost symmetric, and there is symmetry in the rows. this can easily be broken by making the kernel kind of non-linear, as in the case of my second kernel. still one sees some symmetries in the smoother matrix, and as a consequence the eigenvalues are still real. i think this is due to the symmetry and homogeneity of my [latex]x_i[/latex] values [latex]-2,-1,0,1,2[/latex]. i am pretty sure i can construct smoother matrices with complex eigenvalues using less homogeneous training input.</p>
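The asymmetry is easy to reproduce with a local linear smoother and a Gaussian kernel, built row by row exactly as in the formula above (bandwidth and inputs are illustrative):

```python
import numpy as np

def local_linear_smoother(x, h):
    """Build S row by row: S[i] = b(x_i)^T (B^T W(x_i) B)^{-1} B^T W(x_i)."""
    n = len(x)
    B = np.column_stack([np.ones(n), x])             # local linear basis b(x) = (1, x)
    S = np.zeros((n, n))
    for i in range(n):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)     # Gaussian kernel weights centred at x_i
        BtW = B.T * w                                # B^T W(x_i)
        S[i] = B[i] @ np.linalg.solve(BtW @ B, BtW)
    return S

x_irr = np.array([-2.0, -1.5, 0.0, 0.3, 2.0])   # less homogeneous inputs (illustrative)
S = local_linear_smoother(x_irr, h=1.0)

asym = np.linalg.norm(S - S.T)    # strictly positive: S is not symmetric
row_sums = S.sum(axis=1)          # local linear reproduces constants, so each row sums to 1
eigs = np.linalg.eigvals(S)       # need not be real for general designs
```

Even though S is not symmetric, the rows still sum to one, since the local linear fit reproduces constant functions exactly.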
<p>but what is then the connection between the trace and the decay of eigenvalues?? i think one should rather look at the sum of the absolute values of the eigenvalues. but then the problem would be how to choose the equivalent kernel as this choice would be based on possibly complex eigenvectors...</p>
<p>all in all, i think that the MEANINGFUL use of [latex]trace(S_\lambda)[/latex] as a kind of number of degrees of freedom would pretty much depend on the structure of the training data. but even if the data is nice and the meaning justified, as there is no Reinsch form, it's not clear that the trace depends on lambda in a monotone way, and hence one would have to discuss all lambdas when working backward from a given trace (i would not know if standard software produces all possible lambdas or if it would only return a convenient one).</p>
<div>
<p><strong>Attachments</strong> <small class="muted text-muted">(follow link to download)</small></p>
<ul>
<li>
<a href="https://warwick.ac.uk/sitebuilder2/file/fac/sci/statistics/news/mlrg/ml-seminar/8a17841a5f2ace02015f641b02f856c1/smoothing-kernels.pdf?sbrPage=/fac/sci/statistics/news/mlrg/ml-seminar&attachment=8a17841a5f2ace02015f641b034456c3&forceOpenSave=true">smoothing-kernels.pdf</a> <small class="muted text-muted">(1.6 MB)</small>
</li>
</ul>
</div>
Sat, 28 Oct 2017 17:50:12 GMT, Sigurd Assing

Re: kernel of penalty functional
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841b5f2acca4015f5e4faf442fc2
<p>hi shahin,</p>
<p>this makes sense to me, indeed. and thanks for bringing this up. it was great you came today.</p>
<p>one way to bring in intercepts could be to use a cut-off for R^d. the corresponding<br>kernels would then be based on eigen-expansions where the eigenfunctions would play the role of "waves" with certain frequencies. have to think about this.</p>
<p>cheers<br>sigurd</p>
Fri, 27 Oct 2017 14:50:01 GMT, Sigurd Assing

kernel of penalty functional
https://warwick.ac.uk/fac/sci/statistics/news/mlrg/ml-seminar/?post=8a17841a5f2ace02015f5e4ed1f94b93
<p><strong>Shahin sent the below on 19 Oct:</strong></p>
<p>I was just thinking again about what we discussed, and I noticed that constant functions are in the null space of [latex]J(f)=\int |f''|^2[/latex], but this couldn’t be seen with the formula [latex]J(f)=\int |\tilde{f}(\omega)|^2\omega^4\,d\omega[/latex] because it’s only valid for square-integrable functions. If f(x) is square-integrable, then f(x) + ax + b is square-integrable iff a=b=0, so the formula is valid only for the representative with 0 coefficient in the null space. This is why it seemed that the null space is zero for the frequency-domain formula of J(f) given above.</p>
<p>I guess this makes things more complicated, as we *want* to use basis functions in the kernel of J(.) for expressing f (we want to allow for vertical shifts).</p>
<p>I think we can come to the following remarks:</p>
<p>1- If the penalty J(.) is given in the L2 domain, the kernel of J(.) is important. But we can use the frequency-domain representation of J(.) to understand what it does.<br>2- If the penalty J(.) is given in the frequency domain, we have to remember that it is only valid for square-integrable (L2) functions f. It is not clear what we do with intercept shifts, etc. (or do we?)<br>3- If a kernel is specified, then it is implicitly assumed that we only look at functions orthogonal to the null space of the induced J(.) — that’s what the representer theorem says, essentially.</p>
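A crude finite-difference check of remark 1: affine functions sit in the null space of [latex]J(f)=\int|f''|^2[/latex], while a curved function is penalized (grid and test functions are illustrative):

```python
import numpy as np

x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

def J(f_vals):
    """Finite-difference approximation of the penalty: the integral of |f''|^2."""
    f2 = np.diff(f_vals, n=2) / dx**2   # second derivative on the interior grid
    return np.sum(f2**2) * dx

J_affine = J(3.0 * x - 1.0)     # affine: second derivative vanishes, so the penalty is 0
J_gauss = J(np.exp(-x**2))      # curved: strictly positive penalty
```

The affine function has zero penalty in the L2-domain formula even though it is not square-integrable, which is exactly why the frequency-domain formula cannot see it.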
<p>Does this make sense?</p>
<p>Cheers,<br>Shahin</p>
Fri, 27 Oct 2017 14:49:04 GMT, Sigurd Assing