Regression
Supervised Learning
Training set of \(n\) observations: \(\mathcal{D} = \{(x_1, y_1) \dots (x_n, y_n)\}\).
The input vectors are \(x_i \in \mathbb{R}^D\) and the output (dependent variable) is \(y_i \in \mathbb{R}\).
Let \(X \in \mathbb{R}^{D \times n}\) be the design matrix: column \(i\) of \(X\) corresponds to the input vector \(x_i\):
\begin{equation} X = \begin{bmatrix} x_1 & \dots & x_n \end{bmatrix} \in \mathbb{R}^{D \times n} \end{equation}and let \(y \in \mathbb{R}^n\) be a vector where each element is the output:
\begin{equation} y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}. \end{equation}Given a previously unseen input \(x_\ast\), can we predict the output \(y_\ast\)?
- Given \(\mathcal{D}\) find \(y = f(x)\)
Applications
- Depth data from monocular images
- Mapping terrain
- Mapping an environmental quantity (such as temperature)
- Classification
Linear Regression
The Model
- \(y = w_0 + w_1 x_1 + \cdots + w_D x_D\). The goal is to find the weights \(w\).
- By setting \(x_0 = 1\), we can write the model as \(y = w^T x\).
Matrix Form
- Stack all the \(x\) vectors: \(X = \begin{bmatrix}x_1 & \dots & x_n\end{bmatrix}\)
- Stack all the measurements: \(y = \begin{bmatrix}y_1 & \dots & y_n \end{bmatrix}^T\)
The overall equation is
\begin{equation} y = X^T w \end{equation}
- Note that in general, \(X^T \in \mathbb{R}^{n \times D}\)
Least Squares
The regression equation \(y = X^T w\) cannot in general be solved exactly because there are more measurements (equations) than unknown weights. Instead we minimize the squared error:
\begin{equation} w_\ast = \mathrm{arg}\min_w ||X^Tw - y||_2^2 \end{equation}So \(w_\ast = (X X^T)^{-1} X y\) when \(n > D\) (there are more measurements than the dimension of the data \(x\)) and \(X\) has full rank.
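As a concrete sanity check, here is a minimal NumPy sketch (the variable names and sizes are illustrative, not from the notes) that computes \(w_\ast\) from the normal equations and compares it against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 3, 50                        # input dimension and number of measurements
w_true = rng.standard_normal(D)     # ground-truth weights (for this synthetic example)

X = rng.standard_normal((D, n))                    # design matrix, columns are inputs x_i
y = X.T @ w_true + 0.1 * rng.standard_normal(n)    # noisy measurements

# Normal-equations solution: w = (X X^T)^{-1} X y (valid when n > D and X has full rank)
w_ls = np.linalg.solve(X @ X.T, X @ y)

# Reference solution minimizing ||X^T w - y||_2^2
w_ref, *_ = np.linalg.lstsq(X.T, y, rcond=None)

print(np.allclose(w_ls, w_ref))  # True, up to numerical precision
```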
Minimizing with the SVD
- Take the Singular Value Decomposition (SVD) of \(X^T \in \mathbb{R}^{n \times D}\):
- \(X^T = U \Sigma V^T\).
- All matrices have an SVD.
- \(U \in \mathbb{R}^{n \times n}\), \(U^T U = U U^T = I\) (\(U\) is an orthogonal matrix).
- \(V \in \mathbb{R}^{D \times D}\), \(V^T V = V V^T = I\) (\(V\) is an orthogonal matrix).
- \(\Sigma = \begin{bmatrix} \bar{\Sigma} & 0 \\ 0 & 0\end{bmatrix} \in \mathbb{R}^{n \times D}\), where \(r = \mathrm{rank}(X)\) and \(\bar{\Sigma} \in \mathbb{R}^{r \times r}\) is diagonal with positive diagonal entries.
- We have \(||U A ||_2^2 = ||A||_2^2\) for any vector \(A\) and orthogonal matrix \(U\) because
- \(||U A||_2^2 = (UA)^T(UA) = A^T U^T U A = A^T A = ||A||_2^2\)
- Also, \(||X^Tw - y||_2^2 = ||U \Sigma V^T w -y ||_2^2 = || U^T (U \Sigma V^T w -y) ||_2^2 = ||\Sigma V^T w -U^Ty||_2^2\)
In block form the terms separate. Writing \(U = \begin{bmatrix}U_R & U_N\end{bmatrix}\) and \(V = \begin{bmatrix}V_R & V_N\end{bmatrix}\), where \(U_R\) and \(V_R\) contain the first \(r\) columns,
\begin{align} ||\begin{bmatrix} \bar{\Sigma} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix}V_R^T \\ V_N^T\end{bmatrix} w - \begin{bmatrix}U_R^T \\ U_N^T\end{bmatrix}y||_2^2 &= || \begin{bmatrix}\bar{\Sigma}V_R^T w - U_R^T y \\ - U_N^T y \end{bmatrix}||_2^2 \\ &=\begin{bmatrix}(\bar{\Sigma}V_R^T w - U_R^T y)^T & (- U_N^T y)^T \end{bmatrix}\begin{bmatrix}\bar{\Sigma}V_R^T w - U_R^T y \\ - U_N^T y \end{bmatrix}\\ &=||\bar{\Sigma}V_R^T w - U_R^T y||_2^2 + ||U_N^Ty||_2^2 \end{align}
- The above is minimized when \(w = V_R\bar{\Sigma}^{-1}U_R^Ty\)
- When \(n > D\), then \(\mathrm{rank}(X^T) = D\) and thus \(V = V_R\) and \(X^T = \begin{bmatrix}U_R & U_N\end{bmatrix}\begin{bmatrix}\bar{\Sigma} \\ 0\end{bmatrix}V_R^T = U_R \bar{\Sigma}V_R^T\)
- Therefore, \((X X^T)^{-1}X = (V_R \bar{\Sigma} U_R^T U_R \bar{\Sigma} V_R^T)^{-1} V_R \bar{\Sigma} U_R^T = V_R \bar{\Sigma}^{-1}\bar{\Sigma}^{-1}V_R^{T}V_R \bar{\Sigma}U_R^{T}= V_R\bar{\Sigma}^{-1}U_R^{T}\)
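The same solution can be computed numerically from the thin SVD. A short sketch under the same illustrative setup, forming \(w = V_R \bar{\Sigma}^{-1} U_R^T y\):

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 3, 50
X = rng.standard_normal((D, n))
y = rng.standard_normal(n)

# Thin SVD of X^T: X^T = U_R diag(s) V_R^T, with U_R (n x D) and V_R (D x D) when rank(X) = D
U_R, s, VT_R = np.linalg.svd(X.T, full_matrices=False)

# w = V_R Sigma_bar^{-1} U_R^T y
w_svd = VT_R.T @ ((U_R.T @ y) / s)

# Agrees with the normal-equations solution (X X^T)^{-1} X y
w_ne = np.linalg.solve(X @ X.T, X @ y)
print(np.allclose(w_svd, w_ne))
```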
Linear Regression, Gaussian Noise
The Model
- The model: \(y_i = f(x_i) + \epsilon_i\), where \(f(x) = x^T w\) and \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\).
Noise is independent of the function value and other noise values: that is
\begin{equation} \epsilon = \begin{bmatrix}\epsilon_1 \\ \vdots \\ \epsilon_n\end{bmatrix} \sim \mathcal{N}(0, \sigma^2 I_{n \times n}) \end{equation}
- The goal of the regression is to find \(f(x)\), which in this case consists of finding the weights \(w \in \mathbb{R}^D\)
Before seeing any data, the weights are assigned a zero-mean Gaussian prior:
\begin{equation} w \sim \mathcal{N}(0, \Sigma) \end{equation}
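For intuition, a minimal sketch of sampling synthetic data from this generative model; the particular prior covariance and noise level below are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D, n = 2, 100
sigma = 0.3                    # noise standard deviation (illustrative)
Sigma_p = np.eye(D)            # prior covariance of the weights (illustrative)

w = rng.multivariate_normal(np.zeros(D), Sigma_p)    # w ~ N(0, Sigma)
X = rng.uniform(-1.0, 1.0, size=(D, n))              # inputs as columns of X
eps = rng.normal(0.0, sigma, size=n)                  # eps ~ N(0, sigma^2 I)
y = X.T @ w + eps                                     # y_i = x_i^T w + eps_i
```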
Finding the Weights
Write the posterior distribution of the weights using Bayes rule:
\begin{equation} p(w | y, X) = \frac{p(y | X, w) p(w)}{p(y | X)} \end{equation}
- We can apply Bayes rule directly, because we know \(p(w)\) and can compute \(p(y | X, w)\) from the formula for the model.
- The likelihood is \(p(y | X, w) = \mathcal{N}(X^T w, \sigma^2 I_{n \times n})\).
This formula follows from the independence of the \(\epsilon_i\) and the fact that \(y\) is an affine transform of the Gaussian random vector \(\epsilon\):
\begin{align} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} &= \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} w + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}\\ y &= X^T w + \epsilon \end{align}
The normalization constant (marginal likelihood) can be computed as
\begin{align} p(y | X) &= \int_{-\infty}^{\infty}p(y, w | X)dw \\ & = \int_{-\infty}^{\infty}p(y | X, w)p(w)dw \end{align}After substituting the distributions into Bayes rule, we find
\begin{equation} p(w | y, X) \sim \mathcal{N}(\bar{w}, A^{-1}), \end{equation}with
\begin{equation} A = \frac{1}{\sigma^2}X X^T + \Sigma^{-1} \end{equation}and
\begin{equation} \bar{w}= \frac{1}{\sigma^2}A^{-1} X y. \end{equation}To make predictions from the model, that is, to find \(f_\ast = f(x_\ast)\) for a new input \(x_\ast\), use the predictive distribution
\begin{align} p(f_\ast | x_\ast, X, y) &= \int_{-\infty}^{\infty}p(f_\ast, w | x_\ast, X, y) dw \\ &=\int_{-\infty}^{\infty}p(f_\ast | x_\ast, w)p(w | X, y) dw\\ &= \mathcal{N}(\frac{1}{\sigma^2}x_\ast^T A^{-1} X y, x_\ast^T A^{-1}x_\ast)\\ &= \mathcal{N}(x_\ast^T \bar{w}, x_\ast^T A^{-1}x_\ast) \end{align}
- The mean of the prediction is a weighted sum of the elements of the input vector, where the weights are the most probable (posterior mean) weights
- The variance grows with the magnitude of the input (see the sketch below)
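A hedged NumPy sketch of these formulas (the helper names `blr_posterior` and `blr_predict` are my own, not a standard API), computing \(A\), \(\bar{w}\), and the predictive mean and variance:

```python
import numpy as np

def blr_posterior(X, y, sigma, Sigma_p):
    """Posterior N(w_bar, A^{-1}) for Bayesian linear regression.

    X: (D, n) design matrix with inputs as columns; y: (n,) outputs.
    """
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_p)   # A = (1/sigma^2) X X^T + Sigma^{-1}
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma**2                   # w_bar = (1/sigma^2) A^{-1} X y
    return w_bar, A_inv

def blr_predict(x_star, w_bar, A_inv):
    """Predictive mean x_*^T w_bar and variance x_*^T A^{-1} x_*."""
    return x_star @ w_bar, x_star @ A_inv @ x_star

# Usage, continuing the synthetic data from the earlier sketch:
# w_bar, A_inv = blr_posterior(X, y, sigma, Sigma_p)
# mean, var = blr_predict(np.array([0.5, -0.2]), w_bar, A_inv)
```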
Non-linear functions:
- Instead of using \(x_i\) directly, map each input data point into an \(N\)-dimensional feature space via \(\phi(x_i)\). Since \(\phi(x_i)\) does not depend on the weights, we can simply replace \(x_i\) with \(\phi(x_i)\) in the previous expressions (and let \(\Phi = \Phi(X)\) be \(\phi\) applied to the columns of \(X\)).
- Thus, for non-linear basis functions, we get \begin{equation} p(f_\ast | x_\ast, X, y) = \mathcal{N}(\frac{1}{\sigma^2}\phi(x_\ast)^T A^{-1}\Phi y, \phi(x_\ast)^T A^{-1} \phi(x_\ast)). \end{equation}
- Here \(A = \frac{1}{\sigma^2}\Phi \Phi^T + \Sigma^{-1}\).
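The same sketch carries over to basis functions by replacing \(X\) with \(\Phi\); for example, with polynomial features of a scalar input (the degree and the `poly_features` helper below are illustrative assumptions):

```python
import numpy as np

def poly_features(x, degree=3):
    """Map scalar inputs x (shape (n,)) to phi(x) = [1, x, x^2, ..., x^degree],
    returned as an (N, n) matrix with feature vectors as columns."""
    return np.vstack([x**k for k in range(degree + 1)])

# Phi = poly_features(x_train); then reuse the hypothetical blr_posterior / blr_predict
# helpers sketched above, with Phi in place of X and phi(x_*) in place of x_*.
```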
Some Manipulations
- In its original form, \(p(f_\ast | x_\ast, X, y)\) requires inverting an \(N \times N\) matrix
- With some manipulation we can make it so that we only need to invert an \(n \times n\) matrix.
- \(A \in \mathbb{R}^{N \times N}\); that is, its dimensions scale with the number of features
Let \(K = \Phi^T \Sigma \Phi\). Then note that
\begin{equation} \frac{1}{\sigma^2} A^{-1} \Phi = \Sigma \Phi (K + \sigma^2 I)^{-1} \end{equation}and (from the Matrix Inversion Lemma, a.k.a. the Woodbury Matrix Identity)
\begin{equation} \phi_\ast^T A^{-1} \phi_\ast = \phi_\ast^T \Sigma \phi_\ast - \phi_\ast^T \Sigma \Phi(K + \sigma^2 I)^{-1}\Phi^T \Sigma \phi_\ast, \end{equation}where \(\phi_\ast = \phi(x_\ast)\). After applying these transformations, the estimate becomes
\begin{equation} \mathcal{N}(\phi_\ast^T\Sigma\Phi(K + \sigma^2I)^{-1}y,\; \phi_\ast^T\Sigma\phi_\ast - \phi_\ast^T\Sigma\Phi(K + \sigma^2 I)^{-1}\Phi^T\Sigma\phi_\ast) \end{equation}
- A hand-wavy thought about dimensionality:
- We may have an enormous (even infinite) number of basis functions, \(N \to \infty\), but only a finite number of inputs \(n\).
- Define the covariance kernel \(k(x, x') = \phi(x)^T \Sigma \phi(x')\).
- Note that \(k(x, x')\) is an inner product, since \(\Sigma\) is positive definite.
- The kernel trick:
- A computational trick
- If we know \(k(x, x')\), we can replace all of these feature-space inner products with evaluations of the kernel function on the original inputs (see the sketch below).
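A minimal sketch of the kernelized predictive equations under the same illustrative setup; note that only the \(n \times n\) matrix \(K = \Phi^T \Sigma \Phi\) is ever inverted:

```python
import numpy as np

def kernel_predict(Phi, y, phi_star, sigma, Sigma_p):
    """Predictive mean and variance using K = Phi^T Sigma Phi (an n x n matrix),
    avoiding inversion of the N x N matrix A."""
    K = Phi.T @ Sigma_p @ Phi                             # n x n
    G = np.linalg.inv(K + sigma**2 * np.eye(K.shape[0]))  # (K + sigma^2 I)^{-1}
    k_star = Phi.T @ Sigma_p @ phi_star                   # n-vector: Phi^T Sigma phi_*
    mean = k_star @ G @ y                                 # phi_*^T Sigma Phi (K + sigma^2 I)^{-1} y
    var = phi_star @ Sigma_p @ phi_star - k_star @ G @ k_star
    return mean, var
```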
Summary
- Given labeled training data
- Map inputs into a feature space
- Given a prior on the weights and some noise statistics, we can predict the distribution of the output given a new input
- Using the kernel trick avoids needing to explicitly evaluate the features.
Random Process
- A random vector is a vector of random variables with a joint distribution
- For example, Jointly Gaussian random vector
- (Loosely) when the length of a random vector grows to infinity it becomes a random process
- A random process is an infinite collection of random variables (indexed by elements of some set) on the same probability space
- Any finite collection of these random variables has a joint distribution.
- In a Gaussian random process, the joint distribution of any finite subset of the collection is jointly Gaussian
- can be indexed by elements of \(\mathbb{R}^D\) (for example, time, spatial dimensions, etc.)
- Defined by a mean function \(m(x)\) and covariance function \(k(x, x')\).
- The mean function provides the mean at the given index and the covariance function computes the covariance between two indices.
- Since the joint distribution over any finite set of indices is Gaussian, these two functions fully specify the process
- In Gaussian process regression, the function we are attempting to learn is assumed to be described by a Gaussian process
- This is less restrictive than it seems
- We saw how the linear regression model could be expressed in terms of the covariance function \(k(x, x')\) via the "kernel trick"
- Thus, the inner product and the linear basis functions define a "kernel"
- It turns out that we can choose a kernel and that kernel defines some set of basis functions
- Some kernels correspond to an infinite number of basis functions; thus, the kernel lets us express functions that have an infinite-dimensional basis in a finite manner.
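For intuition, a minimal sketch that draws sample functions from a zero-mean Gaussian process prior; the squared-exponential kernel and its lengthscale are illustrative assumptions, not the only choice:

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=0.5):
    """Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2)) for scalar inputs."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / lengthscale**2)

x = np.linspace(-3, 3, 200)
K = sq_exp_kernel(x, x) + 1e-8 * np.eye(x.size)   # small jitter for numerical stability
rng = np.random.default_rng(3)
samples = rng.multivariate_normal(np.zeros(x.size), K, size=5)  # five sample functions
```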
Another look at linear regression
Given the mean and covariance functions of a Gaussian process (assuming zero mean here), we can write the joint distribution of the process values, partitioned into the points we have measured and the points at which we want predictions:
\begin{equation} \begin{bmatrix} f_\ast \\ y \end{bmatrix} \sim \mathcal{N}(0, \begin{bmatrix} K(X_\ast, X_\ast) & K(X_\ast, X) \\ K(X, X_\ast) & K(X, X) + \sigma^2 I \end{bmatrix}), \end{equation}where \(K(X, X_\ast)\) is the \(n \times n_\ast\) covariance matrix formed by evaluating \(k(x, x_\ast)\) at all combinations of training and test points, and the \(\sigma^2 I\) term accounts for the measurement noise in \(y\).
The above is just a jointly Gaussian distribution over the prediction points and the measured points. Thus, we can simply apply the formula for a conditional Gaussian to obtain the distribution of \(f_\ast\).
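A compact NumPy sketch of this conditioning step, implementing the predictive mean and covariance given immediately below (the `gp_predict` helper and the reuse of the earlier `sq_exp_kernel` example are illustrative assumptions):

```python
import numpy as np

def gp_predict(X_train, y, X_test, kernel, sigma):
    """GP regression: condition the joint Gaussian on the observed y.

    Returns the predictive mean and covariance of f_* at X_test.
    """
    K = kernel(X_train, X_train) + sigma**2 * np.eye(len(X_train))  # K(X, X) + sigma^2 I
    K_s = kernel(X_test, X_train)                                   # K(X_*, X)
    K_ss = kernel(X_test, X_test)                                   # K(X_*, X_*)
    K_inv_y = np.linalg.solve(K, y)
    mean = K_s @ K_inv_y
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

# e.g. mean, cov = gp_predict(x_train, y_train, x_test, sq_exp_kernel, sigma=0.1)
```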
\begin{equation} \mathcal{N}(K(X_\ast, X)(K(X, X) + \sigma^2 I)^{-1}y,\; K(X_\ast, X_\ast) - K(X_\ast, X)(K(X, X) + \sigma^2 I)^{-1}K(X, X_\ast)) \end{equation}
Model Selection and Hyperparameters
- The data can be used to select the covariance model and hyperparameters
- There is a hierarchy of parameters
- The actual regression itself (the weights)
- The hyperparameters: These are parameters that form part of the covariance function
- The model structure itself.
Apply Bayes rule each time:
\begin{equation} p(w | y, X, \theta, H_i) = \frac{p(y | X, w, H_i)p(w|\theta, H_i)}{p(y | X, \theta, H_i)} \end{equation} \begin{equation} p(\theta | y, X, H_i) = \frac{p(y | X, \theta, H_i)p(\theta |H_i)}{p(y | X, H_i)} \end{equation}
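In practice the hyperparameter level is often handled by maximizing the log marginal likelihood \(\log p(y | X, \theta, H_i)\). A hedged sketch for the Gaussian process case, assuming a kernel function that takes \(\theta\) as its last argument:

```python
import numpy as np

def gp_log_marginal_likelihood(theta, X_train, y, kernel, sigma):
    """log p(y | X, theta) = -1/2 y^T K_y^{-1} y - 1/2 log|K_y| - n/2 log(2 pi),
    where K_y = K_theta(X, X) + sigma^2 I."""
    n = len(y)
    K_y = kernel(X_train, X_train, theta) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(K_y)                              # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # K_y^{-1} y via two triangular solves
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

# Hyperparameters theta can then be chosen by maximizing this quantity,
# e.g. by minimizing its negative with a generic numerical optimizer.
```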
The prior over models, \(p(H_i)\), is often taken to be flat so as not to bias toward any particular model.
\begin{equation} p(H_i | y, X) = \frac{p(y | X, H_i)p(H_i)}{p(y | X)} \end{equation} \begin{equation} p(y | X) = \sum_i p(y | X, H_i) p(H_i) \end{equation}
Resources
These notes are largely derived from Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Chris Williams, The MIT Press, 2006. Available at http://gaussianprocess.org/