Least Squares Regression - Examples

Linear least squares regression is a method for estimating the parameters of a linear model. This article shows how to find the optimal parameters and gives some examples.

Introduction

In this article, we will first consider the case of a linear model with a single variable for illustration. However, the formulas will be kept as general as possible. The term hyperplane will also be used to refer to a line in the single-variable case.

To illustrate the problem, we will take the following example:

The least squares method is used to find the best-fitting hyperplane (here, a line) for a set of data points. To find this line, we look for the parameters that minimize the error (the distance between the points and the line). The smaller the error, the better the fit.

If we plot the error as a function of the slope and intercept of the line, we obtain a paraboloid (a bowl-shaped surface). The argmin of this surface gives the best parameters for the hyperplane.

Single variable optimization

In the two-dimensional case (a single variable), the error formula is:

$$ E(m, b) = \sum_{i=1}^{n} (y_i - (mx_i + b))^2 $$ where $(x_i, y_i)$ is the $i^{th}$ point of the dataset, $m$ is the slope of the line and $b$ is the intercept of the line.
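To make the formula concrete, here is a minimal Python sketch (the function name squared_error is mine; x and y are assumed to be numpy arrays holding the data points):

import numpy as np

def squared_error(m, b, x, y):
    # Sum of squared residuals between the data and the line y = m*x + b
    return np.sum((y - (m * x + b)) ** 2)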

If we plot the mean error as a function of $m$ and $b$ for the previous data:

Visually, we can see that the minimum seems to be:

  • $b = 3$
  • $m = 2$

(To show this minimum, I simply tried many values for $b$ and $m$ and plotted a scatter plot of the results.) Finding the best-fitting line is a matter of optimization: finding this line is equivalent to finding the minimum of this surface.
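For reference, here is a rough sketch of that brute-force search (the grid bounds and step sizes are arbitrary choices of mine; x and y are assumed to be the data as numpy arrays):

import numpy as np

# Evaluate the error on a grid of (m, b) values and keep the smallest one
ms = np.linspace(0, 4, 81)
bs = np.linspace(0, 6, 121)
best_m, best_b = min(
    ((m, b) for m in ms for b in bs),
    key=lambda p: np.sum((y - (p[0] * x + p[1])) ** 2),
)
print(best_m, best_b)  # expected to land near m = 2, b = 3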

We want to find the minimum of the error function $E(m, b)$.

The problem (with one variable) can be written as: $$ argmin_{m, b} E(m, b) = argmin_{m, b} \sum_{i=1}^n (y_i - (m x_i + b))^2 $$

The analytical solution is the following:

  • $m = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$
  • $b = \bar{y} - m \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively. The proof is not left as an exercise! You can find it here.
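In Python, these two formulas translate almost literally (a minimal sketch; fit_line is my own name, and x, y are assumed to be numpy arrays):

import numpy as np

def fit_line(x, y):
    # Closed-form least squares estimates of the slope m and intercept b
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - m * x.mean()
    return m, b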

However, what happens if we have more than one variable?

Multiple variable optimization

We now have to find not a line $y = ax + b$ but a hyperplane $y = a_1 x_1 + a_2 x_2 + \ldots + a_n x_n + b$.

We first rewrite the model as a matrix equation: $y = Ax$, where $A$ is the matrix whose rows are the data points, with an extra column of ones appended (so that the intercept $b$ can be folded into the parameter vector), and $x$ is the vector of unknown coefficients. Note that from here on, $x$ denotes the vector of unknown parameters, while the data values go into $A$.

This means we have to find the vector that comes closest to solving the following system:

$$\begin{cases} y_1 = a_{1,1}x_1 + a_{1,2}x_2 + \ldots + a_{1,n}x_n + x_0 \\ y_2 = a_{2,1}x_1 + a_{2,2}x_2 + \ldots + a_{2,n}x_n + x_0 \\ \ldots \\ y_n = a_{n,1}x_1 + a_{n,2}x_2 + \ldots + a_{n,n}x_n + x_0 \\ \end{cases} $$

Just as a quick example, if your two data points have coordinates $(1, 1)$ and $(2, 2)$, the matrix $A$ is: $A = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 1 \end{pmatrix}$ and the unknown vector $X$ is: $X = \begin{pmatrix} x_1 \\ x_2 \\ x_0 \end{pmatrix}$. We need to find this vector $X$.
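In numpy, this toy system would look like the following (just a sketch of the shapes involved):

import numpy as np

# One row per data point: the point's coordinates followed by a 1 for the intercept
A = np.array([[1, 1, 1],
              [2, 2, 1]])
# The unknown vector X = (x_1, x_2, x_0) has one entry per column of A
print(A.shape)  # (2, 3)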

If the points lie exactly on a hyperplane, you can use a method like Gaussian elimination to find a vector $X$ with an error of 0. In every other case, we need to find the best vector $X$, the one that minimizes the error.

The error function is now:

$$ E(x) = \sum_{i=1}^n (y_i - (a_{i,1}x_1 + a_{i,2}x_2 + \ldots + a_{i,n}x_n + x_0))^2 $$

Minimizing this error is equivalent to finding the minimum of the surface. As with a simple one-variable function, to find the minimum we compute the derivative (here, the gradient) of the function and set it to 0. We can do the same for the error function.

$$ x_{opt} = argmin_x E(x) \quad \text{with} \quad \nabla_x E(x_{opt}) = 0 $$

We can rewrite it as:

$$ x_{opt} = argmin_x \sum_{i=1}^n (A_i x - y_i)^2 $$ where $A_i$ is the $i^{th}$ row of $A$.

$$ x_{opt} = argmin_x ||Ax - y||^2_2 $$

The notation $||x||_2$ is the Euclidean norm of $x$ (the length of the vector $x$), and $||x||_2^2$ is the square of that norm. For example, if $x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$, then $||x||_2 = \sqrt{1^2 + 2^2} = \sqrt{5}$ and $||x||_2^2 = 1^2 + 2^2 = 5$.

We can now compute the derivative and try to solve the following equation:

$$ \nabla_x \frac{1}{2}||Ax - y||^2_2 = 0 $$

The factor $\frac{1}{2}$ is only there to simplify the algebra; it does not change the result, because $f(x)$ and $\frac{1}{2}f(x)$ have the same argmin.

$$ \frac{1}{2} \nabla_x ||Ax - y||^2_2 = 0 $$

By the chain rule:

$$ \frac{1}{2} A^T \nabla_{Ax} ||Ax - y||^2_2 = 0 $$

Using the gradient of the squared norm, $\nabla_z ||z||^2_2 = 2z$, the factor $2$ cancels the $\frac{1}{2}$:

$$ A^T(Ax - y) = 0 $$

$$ A^TAx = A^Ty $$

Assuming $A^TA$ is invertible, we can solve for $x$:

$$ x_{opt} = (A^TA)^{-1}A^Ty $$

We now have a closed-form expression for the optimal parameter vector $x_{opt}$.
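As a compact sketch of this result (the helper name is mine; it calls np.linalg.solve on the normal equations instead of computing the inverse explicitly, which is equivalent whenever $A^TA$ is invertible but numerically safer):

import numpy as np

def least_squares(A, y):
    # Solve the normal equations A^T A x = A^T y for x
    return np.linalg.solve(A.T @ A, A.T @ y)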

In the next section, we will apply this formula to a small example dataset with Python.

With Python

To illustrate least squares regression, I will generate some points following a linear model and then find the straight line that passes closest to all of them.

import numpy as np

Generate the data

There are many ways to generate data. For this example:

x = np.linspace(0, 10, 100)
y = 2*x + 3 + np.random.normal(0, 1, 100)

If we plot the data, we can see the expected linear relationship between $x$ and $y$.
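For instance, with a quick scatter plot (assuming matplotlib is installed; the figure itself is not reproduced here):

import matplotlib.pyplot as plt

plt.scatter(x, y, s=10)  # the noisy data points
plt.xlabel("x")
plt.ylabel("y")
plt.show()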

Computation

The first step is to create the matrix A. Its first column is the vector x; its second column is a column of ones.

We add the column of ones to A so that we can estimate the y-intercept $b$ (the value of $y$ when $x = 0$). The first column of A is used to estimate the slope $a$ (how much $y$ changes when $x$ increases by 1). If this is unclear (and you want it to be), you may want to go back to the theory in the first sections of this article.

A = np.vstack([x, np.ones(len(x))]).T

Then we just need to apply the formula: $X = (A^T \cdot A)^{-1} \cdot A^T \cdot y$

X = np.linalg.inv(A.T @ A) @ A.T @ y  # X = (A^T A)^{-1} A^T y
line = X[0] * x + X[1]                # fitted values: y = a*x + b
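As a sanity check, numpy's built-in np.linalg.lstsq solves the same problem with a more robust method and should return almost identical coefficients:

X_check, *_ = np.linalg.lstsq(A, y, rcond=None)  # robust least squares solve
print(X_check)  # should be very close to X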

We obtain here $a \approx 2.0$ for the slope and $b \approx 2.8$ for the intercept, close to the true values of 2 and 3 used to generate the data. Fair enough!

If we plot the line with the points, we finally get the graph at the beginning of the article.
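For completeness, here is a minimal matplotlib sketch of that final plot (again assuming matplotlib is available):

import matplotlib.pyplot as plt

plt.scatter(x, y, s=10, label="data")
plt.plot(x, line, color="red", label="least squares fit")
plt.legend()
plt.show()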

We can do the same thing in higher dimensions: simply add one column to $A$ per additional variable and apply exactly the same formula, as in the short sketch below.
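For instance, with two input variables (a hedged sketch; the coefficients used to generate the data are arbitrary):

# Generate data from a plane y = 1.5*x1 - 0.5*x2 + 4 plus noise
x1 = np.random.uniform(0, 10, 100)
x2 = np.random.uniform(0, 10, 100)
y2 = 1.5 * x1 - 0.5 * x2 + 4 + np.random.normal(0, 1, 100)

# Same recipe: one column per variable, plus a column of ones for the intercept
A2 = np.vstack([x1, x2, np.ones(len(x1))]).T
X2 = np.linalg.inv(A2.T @ A2) @ A2.T @ y2
print(X2)  # approximately [1.5, -0.5, 4]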