Least Squares Approximation

The method of least squares minimisation is a well studied and well used one. It is a technique that originates with Gauss and Legendre. It consists of the following method.

Given N pairs of data $(x^i,y^i)\in \mathbb{R}^{d} \mathbb{R}^{d}$ and a chosen function space $\mathscr{W}$ , one finds the function $f$ in $\mathscr{W}$ with $f:\mathbb{R}^d\to\mathbb{R}^d$ which minimises:

$S= \sum_{i=1}^N |f_k(x^i)-y^i_k|^2$
for $k=1,\ldots,d$ . This is of particular interest and of use when one has error with the values $y^i$ , as then one does not try to perfectly fit the information given.

By considering the error in observing $y^i$ , we find ourselves within a probabilistic setting. Suppose the function $f$ can be described by choosing some parameters $\beta$ which are viewed as random, then since $y^i$ is considered to be a Gaussian random sample drawn from a distribution centred at the true state of the system $\widetilde{y}^i$ , we have that the maximum likelihood estimation of the $\beta$ coincides with the minimum least squares distance, since the likelihood here is
$\mathcal{L}\left(x^1,\dots , x^N\right)=\prod_{i=1}^N f_k(x^i|\beta) \propto e^{-\sum_{i=1}^N\left|f_k( x^i|\beta) -y^i_k\right|^2}$

The minimisation problem first defined of finding a function $f\colon\mathbb{R}^d\to\mathbb{R}^d$ is actually $d$ minimisation problems of finding $f_k\colon\mathbb{R}^d\to\mathbb{R}$ where $\{f_k\}_{k=1}^d$ are the $d$ components of $f$ . To implement on a computer, one must parameterise a projection of the function space onto each of its components by finitely many parameters. In other words, a basis $\{\phi_j:\mathbb{R}^d\to\mathbb{R}^d\}_{j=1}^K$ of $\mathscr{W}$ is chosen and then projected onto each of the coordinate axes of $\mathbb{R}^d$ . Then, for each $f$ in $\mathscr{W}$ and $k=1,\ldots,d$ , one can write the functions $f_k$ by truncating its expansion:
$f_k(x)= f_k(x,\beta_k) =\sum_{j=1}^K \beta_{j,k} \phi_{j,k}(x),$
where $\phi_{jk}$ is clearly the projection of $\phi_j$ onto the $k^{\text{th}}$ coordinate. For the sake of brevity, we will suppress the index $k$ and treat $f$ as a function from $\mathbb{R}^d$ to $\mathbb{R}$ . A minimiser, if one exists, of the problem must have differential zero in $\beta$ , namely
$\nabla_\beta S(\beta) = 0$
and one can write this, using $X_{ij}= \frac{\partial }{\partial \beta_j}(f(x^i,\beta))$
as
$\beta = (X^T X)^{-1} X^T y$
using the Gauss Markov theorem.

Gauss-Markov Theorem
Suppose that $X^TX$ is invertible. Then among all unbiased linear estimators $K$ such that $\beta=Ky$ the one that minimises the least squares error is
$K= (X^T X)^{-1} X^T.$

Constrained Minimisation

Furthermore, if it is physically relevant to consider some constraint in the situation, then a constrained least squares minimisation will be more suitable. While no results are given numerically later on this, it is still a useful consideration to make. This is of the form
$\text{Minimise } \sum_{i=1}^N |f(x^i)-y^i_k|^2 \text{ in } \mathscr{W} \text{ subject to } \|\nabla f\|_{L^2}^2 \leq \gamma$
for some $\gamma >0$ . and can be considered equivalent to
$\text{Minimise } \|C\alpha - d\|_{2}^2 + \|A\alpha\|_2^2 \text{ in } \mathscr{W}.$
This equivalence can be seen by considering the minimisation problem
$\min_{\gamma}\left(\min_{\|C\alpha\|\leq\gamma} \|A\alpha - d\|_{2}^2 + \gamma^2\right).$
This is the situation of Tikhonov regularisation:

Tikhonov Regularisation
Let $A:\mathcal{H}\to \mathcal{K}$ be a linear operator between Hilbert Spaces such that $R(A)$ is a closed subspace of $\mathcal{K}$ . Let $Q:\mathcal{H}\to \mathcal{H}$ be self adjoint and positive definite, and $b\in \mathcal{K}$ and $x_0\in \mathcal{H}$ be given as well. Then
$\hat{x}\in \mathrm{argmin}_{x\in \mathcal{H}}\left(\|Ax-b\|^2+\|x-x_0\|_Q^2\right)\iff (A^\star A+Q)\hat{x}=A^\star b +Qx_0$

Choice of Minimisation Space

An important decision to make in the minimisation is the choice of space where one minimises. One can choose a basis of this space which has local support, for example

$\psi_{z}(w_1,\cdots,w_d)= \begin{cases}0 &\text{if } \max\limits_{k=1,\ldots,d}{\frac{|w_k-z_k|}{l_k}} \geq 1\\1- \max\limits_{k=1,\ldots,d}{\frac{|w_k-z_k|}{l_k}}& \text{else}\end{cases}$

This has advantages that the matrix $C$ as given above will be sparse, so computation is faster, although if the data that one is using is clustered in a small subset of the space one is considering then this will lead to ill posedness of the minimisation.

An alternative type of basis is one which has global support, for example
$\gamma_{i_1,\ldots,i_d}(x_1,x_2,\ldots,x_d) = \cos\left(\frac{x_1-\min I_1}{2\pi|I_1|}i_1\right)\times \cdots \times \cos\left(\frac{x_d-\min I_d}{2\pi|I_d|}i_d\right)$

This is particularly useful if one wants to approximate the function from data clustered in a small subset of the space and extrapolate information outside of this region. However it does require more computation as in general the matrix $C$ will have all non zero entries.