
Regression

Supervised Learning

Training set of \(n\) observations: \(\mathcal{D} = \{(x_1, y_1) \dots (x_n, y_n)\}\).

The input vectors are \(x_i \in \mathbb{R}^D\) and the output (dependent variable) is \(y_i \in \mathbb{R}\).

Let \(X \in \mathbb{R}^{D \times n}\) be the design matrix: column \(i\) of \(X\) corresponds to the input vector \(x_i\):

\begin{equation} X = \begin{bmatrix} x_1 & \dots & x_n \end{bmatrix} \in \mathbb{R}^{D \times n} \end{equation}

and let \(y \in \mathbb{R}^n\) be a vector where each element is the output:

\begin{equation} y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}. \end{equation}

Given a previously unseen input \(x_\ast\), can we predict the corresponding output \(y_\ast\)?

  • Given \(\mathcal{D}\) find \(y = f(x)\)

Applications

  • Depth data from monocular images
  • Mapping terrain
  • Mapping an environmental quantity (such as temperature)
  • Classification

Linear Regression

The Model

  1. \(y = w_0 + w_1 x_1 + \cdots + w_D x_D\), where \(x_j\) is the \(j\)-th component of the input vector. The goal is to find the weights \(w\).
  2. By setting \(x_0 = 1\), we can write the model as \(y = w^T x\).

Matrix Form

  1. Stack all the \(x\) vectors: \(X = \begin{bmatrix}x_1 & \dots & x_n\end{bmatrix}\)
  2. Stack all the measurements: \(y = \begin{bmatrix}y_1 & \dots & y_n \end{bmatrix}^T\)
  3. The overall equation is

    \begin{equation} y = X^T w \end{equation}
  4. Note that in general, \(X^T \in \mathbb{R}^{n \times D}\)

Least Squares

The regression equation \(y = X^T w\) cannot, in general, be solved exactly because there are more measurements than unknown weights. Instead, we minimize the squared error:

\begin{equation} w_\ast = \mathrm{arg}\min_w ||X^Tw - y||_2^2 \end{equation}

So \(w_\ast = (X X^T)^{-1} X y\) when \(n > D\) (there are more measurements than the dimension of the data \(x\)) and \(X\) has full row rank.
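
As a quick numerical check (not part of the original notes), the following Python sketch builds a small synthetic dataset, computes \(w_\ast\) from the closed-form expression above, and compares it against numpy's least-squares solver; the data, seed, and variable names are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

D, n = 3, 50                                  # input dimension and number of measurements
w_true = rng.normal(size=D)                   # "true" weights for the synthetic data
X = rng.normal(size=(D, n))                   # design matrix with one input per column
y = X.T @ w_true + 0.1 * rng.normal(size=n)   # noisy measurements

# Closed-form solution w = (X X^T)^{-1} X y (valid when n > D and X has full row rank)
w_closed = np.linalg.solve(X @ X.T, X @ y)

# Library solver for min_w ||X^T w - y||_2^2; the two should agree closely
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))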

Minimizing with the SVD

  1. Take the Singular Value Decomposition (SVD) of \(X^T \in \mathbb{R}^{n \times D}\):
    1. \(X^T = U \Sigma V^T\).
    2. All matrices have an SVD.
    3. \(U \in \mathbb{R}^{n \times n}\), \(U^T U = U U^T = I\) (\(U\) is an orthogonal matrix).
    4. \(V \in \mathbb{R}^{D \times D}\), \(V^T V = V V^T = I\) (\(V\) is an orthogonal matrix).
    5. \(\Sigma = \begin{bmatrix} \bar{\Sigma} & 0 \\ 0 & 0\end{bmatrix} \in \mathbb{R}^{n \times D}\), where \(r = \mathrm{rank}(X)\) and \(\bar{\Sigma} \in \mathbb{R}^{r \times r}\) is diagonal with positive diagonal entries.
  2. We have \(||U A ||_2^2 = ||A||_2^2\) for an orthogonal matrix \(U\) because
    • \(||U A||_2^2 = (UA)^T(UA) = A^T U^T U A = A^T A = ||A||_2^2\)
  3. Also, \(||X^Tw - y||_2^2 = ||U \Sigma V^T w -y ||_2^2 = || U^T (U \Sigma V^T w -y) ||_2^2 = ||\Sigma V^T w -U^Ty||_2^2\)
  4. In block form the equations are separated

    \begin{align} ||\begin{bmatrix} \bar{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix}V_R^T \\ V_N^T\end{bmatrix} w - \begin{bmatrix}U_R^T \\ U_N^T\end{bmatrix}y||_2^2 &= || \begin{bmatrix}\bar{\Sigma}V_R^T w - U_R^T y \\ - U_N^T y \end{bmatrix}||_2^2 \\ &=\begin{bmatrix}(\bar{\Sigma}V_R^T w - U_R^T y)^T & (- U_N^T y)^T \end{bmatrix}\begin{bmatrix}\bar{\Sigma}V_R^T w - U_R^T y \\ - U_N^T y \end{bmatrix}\\ &=||\bar{\Sigma}V_R^T w - U_R^T y||_2^2 + ||U_N^Ty||_2^2 \end{align}
  5. The above is minimized when \(w = V_R\bar{\Sigma}^{-1}U_R^Ty\) (a numerical check of this appears after the list)
  6. When \(n > D\) and \(X\) has full rank, \(\mathrm{rank}(X^T) = D\) and thus \(V = V_R\) and \(X^T = \begin{bmatrix}U_R & U_N\end{bmatrix}\begin{bmatrix}\bar{\Sigma} \\ 0\end{bmatrix}V_R^T = U_R \bar{\Sigma}V_R^T\)
    • Therefore, \((X X^T)^{-1}X = (V_R \bar{\Sigma} U_R^T U_R \bar{\Sigma} V_R^T)^{-1} V_R \bar{\Sigma} U_R^T = V_R \bar{\Sigma}^{-1}\bar{\Sigma}^{-1}V_R^{T}V_R \bar{\Sigma}U_R^{T}= V_R\bar{\Sigma}^{-1}U_R^{T}\)
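
The SVD route can be verified numerically. Here is a minimal sketch (my own illustration, using synthetic data) that forms \(w = V_R \bar{\Sigma}^{-1} U_R^T y\) from the thin SVD of \(X^T\) and compares it with the normal-equation solution.

import numpy as np

rng = np.random.default_rng(1)
D, n = 4, 100
X = rng.normal(size=(D, n))                   # inputs as columns, n > D
y = rng.normal(size=n)                        # arbitrary measurements

# Thin SVD of X^T: X^T = U_R diag(s) V_R^T, with U_R (n x D), s (D,), V_R (D x D)
U_R, s, VT_R = np.linalg.svd(X.T, full_matrices=False)

# w = V_R Sigma_bar^{-1} U_R^T y
w_svd = VT_R.T @ ((U_R.T @ y) / s)

# Compare with the normal-equation solution (X X^T)^{-1} X y
w_ne = np.linalg.solve(X @ X.T, X @ y)
print(np.allclose(w_svd, w_ne))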

Linear Regression, Gaussian Noise

The Model

  1. The model: \(y_i = f(x_i) + \epsilon_i\), where \(f(x) = x^T w\) and \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\).
    • Noise is independent of the function value and other noise values: that is

      \begin{equation} \epsilon = \begin{bmatrix}\epsilon_1 \\ \vdots \\ \epsilon_n\end{bmatrix} \sim \mathcal{N}(0, \sigma^2 I_{n \times n}) \end{equation}
    • The goal of the regression is to find \(f(x)\), which in this case consists of finding the weights \(w \in \mathbb{R}^D\)
    • Before seeing any data, the weights are assumed to be distributed according to the prior:

      \begin{equation} w \sim \mathcal{N}(0, \Sigma) \end{equation}

Finding the Weights

  1. Write the posterior distribution of the weights using Bayes rule:

    \begin{equation} p(w | y, X) = \frac{p(y | X, w) p(w)}{p(y | X)} \end{equation}
    • We can apply Bayes rule directly, because we know \(p(w)\) and can compute \(p(y | X, w)\) from the formula for the model.
  2. The likelihood is \(p(y | X, w) = \mathcal{N}(X^T w, \sigma^2 I_{n \times n})\).
    • This formula follows from the independence of each \(\epsilon_i\) and because \(y_i\) is a linear transform of a Gaussian random variable (\(\epsilon\)):

      \begin{align} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} &= \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} w + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}\\ y &= X^T w + \epsilon \end{align}
  3. The normalization constant (marginal likelihood) can be computed as

    \begin{align} p(y | X) &= \int_{-\infty}^{\infty}p(y, w | x)dw \\ & = \int_{-\infty}^{\infty}p(y | x, w)p(w)dw \end{align}
  4. After substituting the distributions into Bayes rule, we find

    \begin{equation} p(w | y, X) \sim \mathcal{N}(\bar{w}, A^{-1}), \end{equation}

    with

    \begin{equation} A = \frac{1}{\sigma^2}X X^T + \Sigma^{-1} \end{equation}

    and

    \begin{equation} \bar{w}= \frac{1}{\sigma^2}A^{-1} X y. \end{equation}
  5. To make predictions from the model (that is, to find \(f_\ast = f(x_\ast)\) for a new input \(x_\ast\)), use the probability distribution

    \begin{align} p(f_\ast | x_\ast, X, y) &= \int_{-\infty}^{\infty}p(f_\ast, w | x_\ast, X, y) dw \\ &=\int_{-\infty}^{\infty}p(f_\ast | x_\ast, w)p(w | X, y) dw\\ &= \mathcal{N}(\frac{1}{\sigma^2}x_\ast^T A^{-1} X y, x_\ast^T A^{-1}x_\ast)\\ &= \mathcal{N}(x_\ast^T \bar{w}, x_\ast^T A^{-1}x_\ast) \end{align}
    • The mean of the prediction is just a weighted sum of the elements of the input vector, where the weights are the most probable (posterior mean) weights \(\bar{w}\)
    • The predictive variance grows with the magnitude of the input (a small numerical sketch of these formulas follows this list)
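
A minimal numerical sketch of these formulas (my own illustration, assuming a prior covariance \(\Sigma = I\), a known noise level, and synthetic data) computes \(A\), \(\bar{w}\), and the predictive mean and variance at a test input.

import numpy as np

rng = np.random.default_rng(2)
D, n = 2, 30
sigma2 = 0.1**2                               # noise variance sigma^2 (assumed known)
Sigma_p = np.eye(D)                           # prior covariance of the weights, w ~ N(0, Sigma_p)

w_true = rng.normal(size=D)
X = rng.normal(size=(D, n))
y = X.T @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

# Posterior over the weights: p(w | y, X) = N(w_bar, A^{-1})
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma2

# Predictive mean and variance at a new input x_*
x_star = rng.normal(size=D)
mean_star = x_star @ w_bar
var_star = x_star @ np.linalg.solve(A, x_star)
print(mean_star, var_star)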

Non-linear Functions

  • Instead of using \(x_i\) for each input data point, map it to an \(N\)-dimensional feature space via \(\phi(x_i)\). Since \(\phi(x_i)\) does not depend on the weights, we can simply replace \(x_i\) with \(\phi(x_i)\) in the previous expressions (and let \(\Phi\) be \(\phi\) applied to the columns of \(X\)). A worked example appears after this list.
  • Thus, for non-linear basis functions, we get \begin{equation} p(f_\ast | x_\ast, X, y) = \mathcal{N}(\frac{1}{\sigma^2}\phi(x_\ast)^T A^{-1}\Phi y, \phi(x_\ast)^T A^{-1} \phi(x_\ast)) \end{equation}
  • Here \(A = \frac{1}{\sigma^2}\Phi \Phi^T + \Sigma^{-1}\).
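
For example (an illustration, not from the original notes), a one-dimensional quadratic feature map \(\phi(x) = [1, x, x^2]^T\) fits a parabola with exactly the same machinery by replacing \(X\) with \(\Phi\); the prior \(\Sigma = I\) and the data are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
n, N = 40, 3                                  # number of data points and number of features
sigma2 = 0.05**2
Sigma_p = np.eye(N)

x = rng.uniform(-1, 1, size=n)
y = 1.0 - 2.0 * x + 0.5 * x**2 + np.sqrt(sigma2) * rng.normal(size=n)

# Feature matrix Phi with phi(x_i) = [1, x_i, x_i^2]^T as columns (N x n)
Phi = np.vstack([np.ones_like(x), x, x**2])

# Same posterior formulas as before, with X replaced by Phi
A = Phi @ Phi.T / sigma2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, Phi @ y) / sigma2

x_star = 0.3
phi_star = np.array([1.0, x_star, x_star**2])
print(phi_star @ w_bar, phi_star @ np.linalg.solve(A, phi_star))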

Some Manipulations

  • In its original form, \(p(f_\ast | x_\ast, X, y)\) requires inverting an \(N \times N\) matrix
    • With some manipulation we can make it so that we only need to invert an \(n \times n\) matrix.
  • \(A \in \mathbb{R}^{N \times N}\), that is its dimensions scale with the number of features
  • Let \(K = \Phi^T \Sigma \Phi\). Then note that

    \begin{equation} \frac{1}{\sigma^2}A^{-1} \Phi = \Sigma \Phi (K + \sigma^2 I)^{-1} \end{equation}

    and (from the Matrix Inversion Lemma a.k.a Woodbury Matrix Identity)

    \begin{equation} \phi_\ast^T A^{-1} \phi_\ast = \phi_\ast^T \Sigma \phi_\ast - \phi_\ast^T \Sigma \Phi(K + \sigma^2 I)^{-1}\Phi^T \Sigma \phi_\ast, \end{equation}
  • After applying these transformations, the estimate becomes

    \begin{equation} \mathcal{N}(\phi_\ast^T\Sigma\Phi(K + \sigma^2I)^{-1}y,\; \phi_\ast^T\Sigma\phi_\ast - \phi_\ast^T\Sigma\Phi(K + \sigma^2 I)^{-1}\Phi^T\Sigma\phi_\ast) \end{equation}
  • A hand-wavy observation about dimensionality:
    • We have an infinite number of basis functions (\(N \to \infty\)) but only a finite number of inputs \(n\), so the \(n \times n\) formulation remains tractable.
  • The covariance kernel \(k(x, x') = \phi(x)^T \Sigma \phi(x')\).
  • Note that \(k(x, x')\) is an inner product because \(\Sigma\) is positive definite.
  • The kernel trick:
    • A computational trick
    • If we know \(k(x, x')\), we can replace all of the feature-vector inner products with evaluations of the kernel function on the original inputs (a numerical check follows this list).
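
The equivalence can be checked numerically: with the quadratic feature map from the earlier sketch and \(\Sigma = I\), the kernel is just \(k(x, x') = \phi(x)^T\phi(x')\), and the \(n \times n\) kernel formulation gives the same prediction as the \(N \times N\) weight-space one. A rough sketch under those assumptions:

import numpy as np

rng = np.random.default_rng(4)
n = 40
sigma2 = 0.05**2

x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + np.sqrt(sigma2) * rng.normal(size=n)

def phi(t):
    # explicit quadratic feature map; with Sigma = I the kernel is k(a, b) = phi(a) . phi(b)
    return np.vstack([np.ones_like(t), t, t**2])

Phi = phi(x)                                  # N x n feature matrix
K = Phi.T @ Phi                               # n x n Gram (kernel) matrix

x_star = 0.3
phi_star = phi(np.array([x_star]))[:, 0]
k_star = Phi.T @ phi_star                     # vector of k(x_i, x_*)

# Kernel-form prediction: only an n x n system is solved
Kn = K + sigma2 * np.eye(n)
mean_kernel = k_star @ np.linalg.solve(Kn, y)
var_kernel = phi_star @ phi_star - k_star @ np.linalg.solve(Kn, k_star)

# Weight-space prediction for comparison (N x N system); the two should match closely
A = Phi @ Phi.T / sigma2 + np.eye(Phi.shape[0])
mean_weights = phi_star @ np.linalg.solve(A, Phi @ y) / sigma2
print(np.isclose(mean_kernel, mean_weights), var_kernel)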

Summary

  • Given labeled training data
  • Map inputs into a feature space
  • Given a prior on the weights and some noise statistics, we can predict the distribution of the output given a new input
  • Using the kernel trick avoids needing to explicitly evaluate the features.

Random Process

  1. A random vector is a vector of random variables with a joint distribution
    • For example, Jointly Gaussian random vector
  2. (Loosely) when the length of a random vector grows to infinity it becomes a random process
  3. A random process is an infinite collection of random variables (indexed by elements of some set) on the same probability space
    • Any finite collection of these random variables has a joint distribution.
  4. In a Gaussian random process, the joint distribution of any finite subset of the collection is jointly Gaussian
    • Can be indexed by elements of \(\mathbb{R}^n\) (for example, time, spatial dimensions, etc.)
    • Defined by a mean function \(m(x)\) and covariance function \(k(x, x')\).
      • The mean function provides the mean at the given index and the covariance function computes the covariance between two indices.
      • Since the joint distribution between any two indices is Gaussian, these functions fully specify the distribution
  5. In Gaussian process regression, the function we are attempting to learn is assumed to be described by a Gaussian process
    • This is less restrictive than it seems
      • We saw how the linear regression model could be expressed in terms of the covariance function \(k(x, x')\) via the "kernel trick"
      • Thus, the inner product and the linear basis functions define a "kernel"
      • It turns out that we can choose a kernel and that kernel defines some set of basis functions
      • Some kernels correspond to an infinite number of basis functions; thus, the kernel lets us express functions with an infinite-dimensional basis in a finite manner (a sampling sketch follows this list).
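
To make "a distribution over functions" concrete, the sketch below (my addition, assuming a squared-exponential covariance \(k(x, x') = \exp(-(x - x')^2 / 2\ell^2)\) with an arbitrary length scale \(\ell\)) draws sample functions from a zero-mean Gaussian process prior. Plotting the columns of samples against x shows smooth random functions whose wiggliness is set by \(\ell\).

import numpy as np

rng = np.random.default_rng(5)

def sq_exp_kernel(a, b, ell=0.3):
    # squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

x = np.linspace(0, 1, 200)                    # a finite set of indices
K = sq_exp_kernel(x, x)                       # covariance of the corresponding GP values

# Any finite collection of GP values is jointly Gaussian, so sample from N(0, K)
jitter = 1e-6 * np.eye(len(x))                # small diagonal term for numerical stability
L = np.linalg.cholesky(K + jitter)
samples = L @ rng.normal(size=(len(x), 3))    # three function draws from the prior
print(samples.shape)                          # (200, 3)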

Another look at linear regression

Given the mean and covariance functions of a Gaussian process (we assume zero mean here), we can write the joint distribution of the process values, partitioned into the points we have measured and the points at which we want a prediction:

\begin{equation} \begin{bmatrix} f_\ast \\ y \end{bmatrix} \sim \mathcal{N}(0, \begin{bmatrix} K(X_\ast, X_\ast) & K(X_\ast, X) \\ K(X, X_\ast) & K(X, X) + \sigma^2 I \end{bmatrix}), \end{equation}

where \(K(X, X_\ast)\) is the \(n \times n_\ast\) covariance matrix formed by evaluating \(k(x, x_\ast)\) at all combinations of training and test points (the other blocks are defined analogously), and \(\sigma^2\) is the measurement noise variance.

The above is just a jointly Gaussian distribution over the test outputs and the (noisy) training outputs. Thus, we can simply apply the formula for conditioning a Gaussian to obtain the distribution of \(f_\ast\):

\begin{equation} p(f_\ast | X_\ast, X, y) = \mathcal{N}(K(X_\ast, X)(K(X, X) + \sigma^2 I)^{-1}y,\; K(X_\ast, X_\ast) - K(X_\ast, X)(K(X, X) + \sigma^2 I)^{-1}K(X, X_\ast)) \end{equation}
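
The conditional formula above is essentially all that a basic Gaussian process regressor needs. A minimal numpy sketch, assuming the squared-exponential kernel from the previous example, a known noise variance, and synthetic training data:

import numpy as np

rng = np.random.default_rng(6)
sigma2 = 0.05**2                              # assumed-known noise variance

def sq_exp_kernel(a, b, ell=0.3):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

# Noisy training observations of an unknown function
X_train = rng.uniform(0, 1, size=15)
y_train = np.sin(2 * np.pi * X_train) + np.sqrt(sigma2) * rng.normal(size=15)

# Test inputs where we want the distribution of f_*
X_test = np.linspace(0, 1, 100)

K = sq_exp_kernel(X_train, X_train)           # K(X, X)
K_s = sq_exp_kernel(X_test, X_train)          # K(X_*, X)
K_ss = sq_exp_kernel(X_test, X_test)          # K(X_*, X_*)

# Conditional Gaussian formulas from the equation above
Kn = K + sigma2 * np.eye(len(X_train))
mean = K_s @ np.linalg.solve(Kn, y_train)
cov = K_ss - K_s @ np.linalg.solve(Kn, K_s.T)
print(mean.shape, cov.shape)                  # (100,), (100, 100)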

Model Selection and Hyperparameters

  • The data can be used to select the covariance model and hyperparameters
  • There is a hierarchy of parameters
    1. The actual regression itself (the weights)
    2. The hyperparameters: parameters that appear in the covariance function (for example, length scales)
    3. The model structure itself.
  • Apply Bayes rule at each level of the hierarchy:

    \begin{equation} p(w | y, X, \theta, H_i) = \frac{p(y | X, w, H_i)p(w|\theta, H_i)}{p(y | X, \theta, H_i)} \end{equation} \begin{equation} p(\theta | y, X, H_i) = \frac{p(y | X, \theta, H_i)p(\theta |H_i)}{p(y | X, H_i)} \end{equation}

The prior over models, \(p(H_i)\), is often chosen to be flat so as not to favor any particular model.

\begin{equation} p(H_i | y, X) = \frac{p(y | X, H_i)p(H_i)}{p(y | X)} \end{equation} \begin{equation} p(y | X) = \sum_i p(y | X, H_i) p(H_i) \end{equation}
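
In practice, the hyperparameter level of this hierarchy is often handled by maximizing the log marginal likelihood \(\log p(y | X, \theta)\) rather than computing a full posterior over \(\theta\). A rough sketch of that approach (my addition, assuming a squared-exponential kernel, a known noise variance, and optimizing only the length scale with scipy):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
sigma2 = 0.05**2
X_train = rng.uniform(0, 1, size=25)
y_train = np.sin(2 * np.pi * X_train) + np.sqrt(sigma2) * rng.normal(size=25)

def neg_log_marginal_likelihood(log_ell):
    # negative of log p(y | X, ell) for a squared-exponential kernel with length scale ell
    ell = np.exp(log_ell)
    K = np.exp(-(X_train[:, None] - X_train[None, :])**2 / (2 * ell**2))
    Kn = K + sigma2 * np.eye(len(X_train))
    L = np.linalg.cholesky(Kn)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    # log p(y | X, ell) = -0.5 y^T Kn^{-1} y - 0.5 log|Kn| - (n/2) log(2 pi)
    log_ml = (-0.5 * y_train @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X_train) * np.log(2 * np.pi))
    return -log_ml

result = minimize_scalar(neg_log_marginal_likelihood, bounds=(-4.0, 1.0), method="bounded")
print("selected length scale:", np.exp(result.x))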

Resources

These notes are largely derived from Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams, The MIT Press, 2006. Available at http://gaussianprocess.org/

Author: Matthew Elwin