The Mathematics of Least Squares

The term least squares is one most people have heard at one point or another. It is arguably one of the most fundamental concepts in areas such as regression modeling and machine learning. That very familiarity, however, can lead to the false assumption of a solid foundational understanding. This article presents the mathematics of least squares independent of any particular application such as linear regression (which will be covered in another article).

Both the calculus and the linear algebra approaches to least squares will be reviewed. The focus will be on the univariate case (that is, with only one \(x\) variable), though multivariate data and polynomials beyond degree 1 (a line) will also be covered; from a mathematical point of view these are straightforward extensions of the univariate case. Weighted least squares will be explored as well, along with comments on the software implementation of the presented mathematics. The material is organized into the following eight sections.

Section 1: The Concept of Least Squares

To begin the exploration of least squares, consider the following simple dataset and the corresponding scatter plot:

Simple Dataset
\(x\) \(y\)
1 1.5
2 0.5
3 3.5
4 2.0
5 1.7

From the scatter plot one might suspect that a linear relationship exists between \(x\) and \(y\). The goal, then, is to find the equation of the line, expressed in slope-intercept form \(y = mx + b\) (where \(m\) is the slope and \(b\) is the y-intercept), which best fits the data. To be consistent with commonly used notation this line is usually written as \(\hat{y} = \beta_{0} + \beta_{1}x\). Said another way, for the \(x\) values in the dataset, the aim is to find the equation whose output values, referred to here as \(\hat{y}\), best fit, or are close to, the observed \(y\) values. Doing so requires finding the coefficients \(\beta_{0}\) (the y-intercept) and \(\beta_{1}\) (the slope). This line will be referred to as the linear regression line. The term linear is used because \(\hat{y} = \beta_{0} + \beta_{1}x\) is a linear function of the coefficients \(\beta_{0}\) and \(\beta_{1}\), not of the variable \(x\) (as will be seen when polynomials of degree greater than 1 are discussed below).

The obvious question becomes what is meant by best fits. As one can imagine, there are infinitely many lines that could be drawn. Which is the best one? To answer this question, consider any single candidate for the linear regression line, with some chosen values for \(\beta_{0}\) and \(\beta_{1}\). The line will most likely not pass through every observed data point, so for any given \(x\) there will be a vertical difference between the observed \(y\) value and the \(\hat{y}\) value produced by the equation of the line. For a given \(x_{i}\) in the dataset, there is an associated observed value \(y_{i}\) and a generated value \(\hat{y}_{i}\), from which the vertical distance can be computed: \(e_{i} = y_{i} - \hat{y}_{i}\). The line that best fits the data will somehow minimize these distances across all \(i\). The following diagram illustrates this idea for the simple dataset above.

An intuitive approach would be to find the coefficients \(\beta_{0}\) and \(\beta_{1}\) of the least squares regression line which minimize \(e_{i} = y_{i} - \hat{y}_{i}\) across all \(i\). Assume for the moment that \(\beta_{0} = 1.27\) and \(\beta_{1} = 0.19\), which yields the least squares regression line \(\hat{y} = 1.27 + 0.19x\). The following table presents the values of \(\hat{y}_{i}\) and \(e_{i}\) for all \(i\).

Simple Dataset
\(x\) \(y\) \(\hat{y}\) \(e = y - \hat{y}\)
1 1.5 1.46 0.04
2 0.5 1.65 -1.15
3 3.5 1.84 1.66
4 2.0 2.03 -0.03
5 1.7 2.22 -0.52

From the diagram, it is clear that some observed points fall above the least squares regression line and some below, so positive and negative differences cancel one another; in fact, for the least squares line the \(e_{i}\)'s sum exactly to zero, as is the case here. To avoid this cancellation, it is the squared vertical distances which are of interest. That is, the goal is to minimize \(e^{2}_{i} = (y_{i} - \hat{y}_{i})^{2}\) across all \(i\). The values of \(\beta_{0}\) and \(\beta_{1}\) which accomplish this produce the least squares regression line, \(\hat{y} = \beta_{0} + \beta_{1}x\), that best fits the data.

It is this idea of minimizing \(e^{2}_{i} = (y_{i} - \hat{y}_{i})^{2}\) (the squared differences between the observed data and the fitted line) that is at the heart of the least squares process.
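As a quick illustration, the following is a minimal sketch in Python (standard library only; the variable names are illustrative) that computes the residuals and their sums for the candidate line \(\hat{y} = 1.27 + 0.19x\) over the simple dataset:

# Residuals and sum of squared errors for the candidate line y_hat = b0 + b1*x
# applied to the simple dataset from the table above.
x = [1, 2, 3, 4, 5]
y = [1.5, 0.5, 3.5, 2.0, 1.7]
b0, b1 = 1.27, 0.19  # candidate intercept and slope

y_hat = [b0 + b1 * xi for xi in x]
e = [yi - yhi for yi, yhi in zip(y, y_hat)]

print("residuals:", [round(ei, 2) for ei in e])            # [0.04, -1.15, 1.66, -0.03, -0.52]
print("sum of residuals:", round(sum(e), 10))              # ~0, positive and negative cancel
print("sum of squared residuals:", round(sum(ei ** 2 for ei in e), 4))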


Section 2: The Calculus Approach To Least Squares

The minimization of \(e^{2}_{i}\) across all \(i\) implies wanting to minimize \(\sum_{i=1}^{n} e^{2}_{i}\), where \(n\) is the number of observations. That is

\[ \begin{align} \sum_{i=1}^{n} e^{2}_{i} &= \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \\ &= \sum_{i=1}^{n} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^{2} \end{align} \]

Notice that there are two unknowns, \(\beta_{0}\) and \(\beta_{1}\). Defining this as a function of the unknowns,

\[ E(\beta_{0},\beta_{1}) = \sum_{i=1}^{n} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^{2} \]

the minimization of \(E(\beta_{0},\beta_{1})\) is required. From calculus, a differentiable function \(f(x)\) attains a minimum (or maximum) at a point where \(f'(x) = 0\). Since \(E\) is a function of two variables, the goal is to find the partial derivatives of \(E\), set them equal to zero, and solve. (Because \(E\) is a sum of squares, and hence a convex function of \(\beta_{0}\) and \(\beta_{1}\), the critical point found this way is indeed a minimum.)

\[ \begin{align} \frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{0}} &= 0 \\ \frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{1}} &= 0 \end{align} \]

Before moving forward it is helpful to rewrite \(E(\beta_{0},\beta_{1})\) expanding the summation:

\[ \begin{align} E(\beta_{0},\beta_{1}) &= \sum_{i=1}^{n} [y_{i} - (\beta_{0} + \beta_{1}x_{i})]^{2} \\ &= [y_{1} - (\beta_{0} + \beta_{1}x_{1})]^{2} + [y_{2} - (\beta_{0} + \beta_{1}x_{2})]^{2} + \dots + [y_{n} - (\beta_{0} + \beta_{1}x_{n})]^{2} \\ &= [y_{1} - \beta_{0} - \beta_{1}x_{1}]^{2} + [y_{2} - \beta_{0} - \beta_{1}x_{2}]^{2} + \dots + [y_{n} - \beta_{0} - \beta_{1}x_{n}]^{2} \end{align} \]

Consider \(\frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{0}}\) first.

\[ \begin{align} \frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{0}} &= 2(y_{1} - \beta_{0} - \beta_{1}x_{1})(-1) + 2(y_{2} - \beta_{0} - \beta_{1}x_{2})(-1) + \dots + 2(y_{n} - \beta_{0} - \beta_{1}x_{n})(-1) \\ &= -2\left[(y_{1} - \beta_{0} - \beta_{1}x_{1}) + (y_{2} - \beta_{0} - \beta_{1}x_{2}) + \dots + (y_{n} - \beta_{0} - \beta_{1}x_{n})\right] \\ &= -2 \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1}x_{i}) \end{align} \]

Now set \(\frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{0}} = 0\) and solve for \(\beta_{0}\):

\[ \begin{align} -2 \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1}x_{i}) &= 0 \\ \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1}x_{i}) &= 0 \\ \sum_{i=1}^{n} y_{i} - \sum_{i=1}^{n} \beta_{0} - \sum_{i=1}^{n} \beta_{1}x_{i} &= 0 \\ \sum_{i=1}^{n} y_{i} - n\beta_{0} - \beta_{1} \sum_{i=1}^{n} x_{i} &= 0 \\ n\beta_{0} &= \sum_{i=1}^{n} y_{i} - \beta_{1} \sum_{i=1}^{n} x_{i} \\ \beta_{0} &= \frac{\sum_{i=1}^{n} y_{i} - \beta_{1} \sum_{i=1}^{n} x_{i}}{n} \\ \beta_{0} &= \frac{\sum_{i=1}^{n} y_{i}}{n} - \beta_{1}\frac{\sum_{i=1}^{n} x_{i}}{n} \\ \beta_{0} &= \bar{y} - \beta_{1} \bar{x} \end{align} \]

Hence, \(\beta_{0} = \bar{y} - \beta_{1} \bar{x}\). Notice that \(\beta_{0}\) is expressed not only in terms of \(\beta_{1}\), but also in terms of the means of \(y\) and \(x\), \(\bar{y}\) and \(\bar{x}\), respectively.

Next consider \(\frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{1}}\).

\[ \begin{align} \frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{1}} &= 2(y_{1} - \beta_{0} - \beta_{1}x_{1})(-x_{1}) + 2(y_{2} - \beta_{0} - \beta_{1}x_{2})(-x_{2}) + \dots + 2(y_{n} - \beta_{0} - \beta_{1}x_{n})(-x_{n}) \\ &= -2\left[(y_{1} - \beta_{0} - \beta_{1}x_{1})(x_{1}) + (y_{2} - \beta_{0} - \beta_{1}x_{2})(x_{2}) + \dots + (y_{n} - \beta_{0} - \beta_{1}x_{n})(x_{n})\right] \\ &= -2 \sum_{i=1}^{n} x_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i}) \end{align} \]

Now set \(\frac{\partial E(\beta_{0},\beta_{1})}{\partial \beta_{1}} = 0\), substituting \(\beta_{0} = \bar{y} - \beta_{1} \bar{x}\) and solve for \(\beta_{1}\):

\[ \begin{align} -2 \sum_{i=1}^{n} x_{i}(y_{i} - \beta_{0} - \beta_{1}x_{i}) &= 0 \\ -2 \sum_{i=1}^{n} x_{i}\left(y_{i} - (\bar{y} - \beta_{1} \bar{x}) - \beta_{1}x_{i}\right) &= 0 \\ \sum_{i=1}^{n} \left(x_{i} y_{i} - x_{i}\bar{y} + \beta_{1}x_{i} \bar{x} - \beta_{1}x_{i}^{2}\right) &= 0 \\ \sum_{i=1}^{n} (x_{i} y_{i} - x_{i}\bar{y}) - \beta_{1} \sum_{i=1}^{n} (x_{i}^{2} - x_{i} \bar{x}) &= 0 \\ \beta_{1} \sum_{i=1}^{n} (x_{i}^{2} - x_{i} \bar{x}) &= \sum_{i=1}^{n} (x_{i} y_{i} - x_{i}\bar{y}) \\ \beta_{1} &= \frac{\sum_{i=1}^{n} (x_{i} y_{i} - x_{i}\bar{y})}{\sum_{i=1}^{n} (x_{i}^{2} - x_{i} \bar{x})} \\ \beta_{1} &= \frac{\sum_{i=1}^{n} x_{i}y_{i} - \bar{y}\sum_{i=1}^{n}x_{i}}{\sum_{i=1}^{n} x_{i}^{2} - \bar{x} \sum_{i=1}^{n} x_{i}} \end{align} \]

But \(\frac{\sum_{i=1}^{n} x_{i}}{n} = \bar{x}\), thus \(\sum_{i=1}^{n} x_{i} = n \bar{x}\). So:

\[ \beta_{1} = \frac{\sum_{i=1}^{n} x_{i}y_{i} - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_{i}^{2} - n\bar{x}^2} \]

This expression for \(\beta_{1}\) can be algebraically manipulated using a variety of equivalences and can thus be expressed in a number of forms. One of the most common algebraically equivalent forms is the ratio of the sample covariance of \(x\) and \(y\) to the sample variance of \(x\) (the normalizing factors cancel, leaving only the sums):

\[ \beta_{1} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} \]
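As a quick sanity check, the following Python sketch (standard library only; variable names are illustrative) confirms on the simple dataset that the two forms of \(\beta_{1}\) agree:

# Verify that the two algebraically equivalent forms of beta_1 give the same value.
x = [1, 2, 3, 4, 5]
y = [1.5, 0.5, 3.5, 2.0, 1.7]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

form1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) \
        / (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
form2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
print(form1, form2)   # both 0.19 (up to floating point)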

Now that the solutions for the coefficients \(\beta_{0}\) and \(\beta_{1}\) are known, the best fit least squares regression line \(\hat{y} = \beta_{0} + \beta_{1}x\) can be explicitly defined for a given dataset. Here are the computations for the simple dataset above, where \(\bar{x} = 3.0\) and \(\bar{y} = 1.84\):

Simple Dataset Computations
\(i\) \(x_{i}\) \(y_{i}\) \((x_{i} - \bar{x})\) \((y_{i} - \bar{y})\) \((x_{i} - \bar{x})^{2}\) \((x_{i} - \bar{x})(y_{i} - \bar{y})\)
1 1 1.5 -2 -0.34 4 0.68
2 2 0.5 -1 -1.34 1 1.34
3 3 3.5 0 1.66 0 0.00
4 4 2.0 1 0.16 1 0.16
5 5 1.7 2 -0.14 4 -0.28

The sums are \(\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y}) = 1.9\) and \(\sum_{i=1}^{n} (x_{i} - \bar{x})^{2} = 10\). The coefficients are:

\[ \beta_{1} = \frac{1.9}{10} = 0.19 \\ \beta_{0} = 1.84 - 0.19(3.0) = 1.27 \]

Hence, the equation of the best fit least squares regression line is: \(\hat{y} = 1.27 + 0.19x\). The plot of the simple dataset with the best fit least squares regression line included is as follows:

An interesting thing to note is that the point \((\bar{x},\bar{y})\) is on the least squares regression line. In this simple dataset example \(\bar{x} = 3.0\) and \(\bar{y} = 1.84\), and the point \((3.0,1.84)\) can be easily verified to be a solution to \(\hat{y} = 1.27 + 0.19x\). The significance of this will be made clearer in a subsequent article.

In summary, then, to find the linear regression line that best fits the data, \(\hat{y} = \beta_{0} + \beta_{1}x\), the coefficients \(\beta_{0}\) and \(\beta_{1}\) are:

\[ \begin{align} \beta_{1} &= \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} \\ \beta_{0} &= \bar{y} - \beta_{1} \bar{x} \end{align} \]
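These two closed-form expressions translate directly into code. The following is a minimal sketch in Python (standard library only; variable names are illustrative) applied to the simple dataset:

# Closed-form univariate least squares coefficients for the simple dataset.
x = [1, 2, 3, 4, 5]
y = [1.5, 0.5, 3.5, 2.0, 1.7]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# beta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
# beta_0 = y_bar - beta_1 * x_bar
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)   # 1.27 and 0.19 (up to floating point)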

In the multivariate case, where there are \(n\) \(x\) variables (here \(n\) counts variables rather than observations), the linear regression equation is \(\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{n}x_{n}\) (note, again, that this is a linear function of the coefficients \(\beta_{0}, \cdots, \beta_{n}\)). Though the mathematics becomes somewhat more involved, the approach taken to find the coefficients would be the same.


Section 3: The Linear Algebra Approach

This section presents the linear algebra approach to finding the linear least squares regression line that best fits the data. Some familiarity with linear algebra concepts is assumed, but sufficient information will be provided to develop and conceptually understand the linear algebra least squares approach. The reader should refer to any sufficiently detailed linear algebra text for a deeper understanding of concepts such as column and null spaces, span, and projection.

In linear algebra terms, the least squares problem is that of finding the best approximate solution to the matrix equation \(A\vec{x} = \vec{b}\) when it has no exact solution (an inconsistent system). First, consider the following set of \(n\) equations in \(n\) unknowns:

\[ \begin{align} a_{11}x_{1} + a_{12}x_{2} + \cdots + a_{1n}x_{n} &= b_{1} \\ a_{21}x_{1} + a_{22}x_{2} + \cdots + a_{2n}x_{n} &= b_{2} \\ \vdots &= \vdots \\ a_{n1}x_{1} + a_{n2}x_{2} + \cdots + a_{nn}x_{n} &= b_{n} \end{align} \]

This system of linear equations can be expressed as \(A\vec{x} = \vec{b}\), where \(A\) is a matrix whose columns are the coefficients of the unknowns, \(\vec{x}\) is the vector of unknowns and \(\vec{b}\) is the solution vector, as follows:

\[ \begin{align} \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \vdots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} &= \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{n} \end{bmatrix} \end{align} \]

\(A\vec{x} = \vec{b}\) will have a solution if \(\vec{b}\) is in the column space of \(A\). That is, \(\vec{b}\) must be in the span of the columns of \(A\) (the linear combination of the column vectors of \(A\)). The following diagram illustrates this idea.

If, in addition, the columns of \(A\) are linearly independent, then \(A\) is non-singular and the inverse matrix, \(A^{-1}\), exists. The solution to the system is then:

\[ \begin{align} A\vec{x} &= \vec{b} \\ A^{-1}A\vec{x} &= A^{-1}\vec{b} \\ \vec{x} &= A^{-1}\vec{b} \end{align} \]

Next, consider the system of linear equations with \(m\) equations and \(n\) unknowns, where \(m > n\):

\[ \begin{align} a_{11}x_{1} + a_{12}x_{2} + \cdots + a_{1n}x_{n} &= b_{1} \\ a_{21}x_{1} + a_{22}x_{2} + \cdots + a_{2n}x_{n} &= b_{2} \\ \vdots &= \vdots \\ a_{m1}x_{1} + a_{m2}x_{2} + \cdots + a_{mn}x_{n} &= b_{m} \end{align} \]

This system of linear equations can be expressed as \(A\vec{x} = \vec{b}\) as follows:

\[ \begin{align} \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \vdots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix} &= \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{m} \end{bmatrix} \end{align} \]

Since there are more equations than unknowns, this system is referred to as an overdetermined system. In most cases an overdetermined system has no exact solution, and that is the case of interest here. The coefficient matrix \(A\) is not square, so no inverse \(A^{-1}\) exists, and when \(\vec{b}\) is not in the column space of \(A\) there is no solution. The following diagram illustrates this idea.

The method of least squares can be used to find an approximate solution: the goal is to find a vector \(\vec{\hat{x}}\) such that \(A\vec{\hat{x}}\) is a vector in the column space of \(A\) which is closest to \(\vec{b}\). This is accomplished by projecting \(\vec{b}\) onto the column space of \(A\), which produces the smallest error (the vector difference between \(\vec{b}\) and \(A\vec{\hat{x}}\)). Calling this projection \(\vec{p} = A\vec{\hat{x}}\), the goal is to minimize the length of \(\vec{e} = \vec{b} - \vec{p}\). It turns out that \(\vec{\hat{x}}\) should be chosen such that \(\vec{p} = A\vec{\hat{x}}\) is the orthogonal projection of \(\vec{b}\) onto the column space of \(A\), which makes the error vector \(\vec{e}\) perpendicular to that column space. The following diagram illustrates this idea:

Since \(\vec{e}\) is perpendicular to the column space of \(A\), it is orthogonal to every column of \(A\); equivalently, \(\vec{e}\) lies in the null space of \(A^{T}\) (where \(A^{T}\) is the transpose of \(A\)), so \(A^{T}\vec{e} = \vec{0}\). Thus

\[ \begin{align} A^{T}\vec{e} &= \vec{0} \\ A^{T} (\vec{b} - \vec{p}) &= \vec{0} \\ A^{T} (\vec{b} - A\vec{\hat{x}}) &= \vec{0} \\ A^{T}A\vec{\hat{x}} &= A^{T} \vec{b} \end{align} \]

To solve for \(\vec{\hat{x}}\), use the fact that \(A^{T}A\) is non-singular (has an inverse) whenever the columns of \(A\) are linearly independent, which is assumed here. Making use of this:

\[ \begin{align} A^{T}A\vec{\hat{x}} &= A^{T}\vec{b} \\ (A^{T}A)^{-1}A^{T}A\vec{\hat{x}} &= (A^{T}A)^{-1}A^{T}\vec{b} \\ \vec{\hat{x}} &= (A^{T}A)^{-1}A^{T}\vec{b} \end{align} \]

This equation, \(\vec{\hat{x}} = (A^{T}A)^{-1}A^{T}\vec{b}\), is the linear algebra least squares solution; the intermediate equation \(A^{T}A\vec{\hat{x}} = A^{T}\vec{b}\) is known as the normal equations.
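The following sketch shows this solution in Python with NumPy (an assumption here; the values are those of the simple dataset above). In practice an explicit inverse of \(X^{T}X\) is usually avoided in favor of a more numerically stable solver, but the normal-equation form mirrors the derivation:

import numpy as np

# Design matrix with a column of ones (intercept) and the x values,
# plus the observed y vector, from the simple dataset above.
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1.5, 0.5, 3.5, 2.0, 1.7])

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)   # approximately [1.27, 0.19]

# Numerically preferable equivalents: solve X^T X beta = X^T y directly,
# or use NumPy's least squares solver.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)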

Now that the linear algebra least squares solution is defined, the simple dataset described above will be used to apply this solution within the context of finding the linear least squares regression line. For convenience, the simple dataset and the associated scatter plot are presented here again:

Simple Dataset
\(x\) \(y\)
1 1.5
2 0.5
3 3.5
4 2.0
5 1.7

The goal is to find \(\beta_0\) and \(\beta_{1}\) such that \(\hat{y} = \beta_{0} + \beta_{1}x\) is the line which best fits the data. The data contain five observations, from which the following five equations result:

\[ \begin{align} \beta_{0} + \beta_{1}(1) &= 1.5 \\ \beta_{0} + \beta_{1}(2) &= 0.5 \\ \beta_{0} + \beta_{1}(3) &= 3.5 \\ \beta_{0} + \beta_{1}(4) &= 2.0 \\ \beta_{0} + \beta_{1}(5) &= 1.7 \end{align} \]

This linear system of equations can be expressed as \(X \vec{\beta} = \vec{y}\) as follows:

\[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.5 \\ 3.5 \\ 2.0 \\ 1.7 \end{bmatrix} \]

There is no solution to this system. An approximate solution using least squares would be \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\), where

\[ \vec{\hat{\beta}} = \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} \]

represents the best approximate solution vector, which is in the column space of \(X\) (note: some notational liberty is taken here to avoid overcomplicating things; strictly speaking, the solved coefficients are estimates and should be written \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\)). Stepping through the computations:

\[ \begin{align} (X^{T}X) &= \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{bmatrix} \\ (X^{T}X) &= \begin{bmatrix} 5 & 15 \\ 15 & 55\end{bmatrix} \end{align} \]

To find \((X^{T}X)^{-1}\) recall that for a 2 x 2 matrix, \(A\), the inverse is:

\[ \begin{align} A &= \begin{bmatrix} a & b \\ c & d \end{bmatrix} \\ A^{-1} &= \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \end{align} \]

so:

\[ \begin{align} (X^{T}X)^{-1} &= \frac{1}{(5)(55) - (15)(15)} \begin{bmatrix} 55 & -15 \\ -15 & 5\end{bmatrix} \\ (X^{T}X)^{-1} &= \frac{1}{50} \begin{bmatrix} 55 & -15 \\ -15 & 5\end{bmatrix} \end{align} \]

Then

\[ \begin{align} (X^{T}X)^{-1}X^{T} &= \frac{1}{50} \begin{bmatrix} 55 & -15 \\ -15 & 5\end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \end{bmatrix} \\ (X^{T}X)^{-1}X^{T} &= \frac{1}{50} \begin{bmatrix} 40 & 25 & 10 & -5 & -20 \\ -10 & -5 & 0 & 5 & 10 \end{bmatrix} \end{align} \]

Putting it all together:

\[ \begin{align} (X^{T}X)^{-1}X^{T}\vec{y} &= \frac{1}{50} \begin{bmatrix} 40 & 25 & 10 & -5 & -20 \\ -10 & -5 & 0 & 5 & 10 \end{bmatrix} \begin{bmatrix} 1.5 \\ 0.5 \\ 3.5 \\ 2.0 \\ 1.7 \end{bmatrix} \\ (X^{T}X)^{-1}X^{T}\vec{y} &= \frac{1}{50} \begin{bmatrix} 63.5 \\ 9.5\end{bmatrix} \end{align} \]

Since \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\),

\[ \begin{align} \vec{\hat{\beta}} &= (X^{T}X)^{-1}X^{T}\vec{y} \\ \vec{\hat{\beta}} &= \frac{1}{50} \begin{bmatrix} 63.5 \\ 9.5\end{bmatrix} \\ \vec{\hat{\beta}} &= \begin{bmatrix} 1.27 \\ 0.19 \end{bmatrix} \end{align} \]

Hence, the equation of the best fit least squares regression line is \(\hat{y} = 1.27 + 0.19x\), the same line determined above using calculus. The following is the plot of the data along with the regression line which is, again, the same as above:


Section 4: Higher Degree Polynomials

The discussion thus far has been the finding of the least squares regression line, \(\hat{y} = \beta_{0} + \beta_{1}x\). This is a degree one polynomial. There is nothing to preclude the finding of a higher order polynomial which best fits the data. For example, for a dataset with \(n\) observations and with one variable, \(x\) (univariate), a degree \(m\) polynomial that best fits the data would be:

\[ \hat{y} = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \cdots + \beta_{m}x^{m} \]

Note that this is still a linear function with respect to the coefficients even though there are higher powers of \(x\). To find the coefficients, the linear algebra approach will be used. The corresponding system of equations for \(n\) observations, for which a degree \(m\) regression polynomial is desired, becomes:

\[ \begin{align} \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{1}^{2} + \cdots + \beta_{m} x_{1}^{m} &= y_{1} \\ \beta_{0} + \beta_{1} x_{2} + \beta_{2} x_{2}^{2} + \cdots + \beta_{m} x_{2}^{m} &= y_{2} \\ \vdots &= \vdots \\ \beta_{0} + \beta_{1} x_{n} + \beta_{2} x_{n}^{2} + \cdots + \beta_{m} x_{n}^{m} &= y_{n} \\ \end{align} \]

This system of linear equations can be expressed as

\[ \begin{bmatrix} 1 & x_{1} & x_{1}^{2} & \dots & x_{1}^{m} \\ 1 & x_{2} & x_{2}^{2} & \dots & x_{2}^{m} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n} & x_{n}^{2} & \dots & x_{n}^{m} \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{m} \end{bmatrix} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} \]

The solution is the same as described above \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\), where (again, taking some liberties with notation)

\[ \vec{\hat{\beta}} = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{m} \end{bmatrix} \]

Using the simple dataset as an example, a degree two polynomial that best fits the data would be \(\hat{y} = \beta_{0} + \beta_{1}x + \beta_{2}x^{2}\). The system is:

\[ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.5 \\ 3.5 \\ 2.0 \\ 1.7 \end{bmatrix} \]

Solving \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\) yields

\[ \vec{\hat{\beta}} = \begin{bmatrix} -0.28 \\ 1.518571 \\ -0.2214286 \end{bmatrix} \]

Hence, the degree two polynomial that best fits the data is \(\hat{y} = -0.28 + 1.518571x - 0.2214286x^{2}\).
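A sketch of this fit in Python with NumPy (assumed available), building the design matrix with columns \(1, x, x^{2}\):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.5, 0.5, 3.5, 2.0, 1.7])

# Columns 1, x, x^2: a Vandermonde-style design matrix for a degree 2 fit.
X = np.column_stack([np.ones_like(x), x, x ** 2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [-0.28, 1.518571, -0.221429]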

As an example of fitting a degree 3 polynomial consider another simple dataset and its scatter plot:

Simple Dataset 2
\(x\) \(y\)
1 2.1
2 3.5
3 4.2
4 3.1
5 4.4
6 6.8

The following are the systems used to find degree 1, 2 and 3 regression polynomials. First consider a degree 1 regression polynomial:

\[ \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \end{bmatrix} = \begin{bmatrix} 2.1\\ 3.5 \\ 4.2 \\ 3.1 \\ 4.4 \\ 6.8 \end{bmatrix} \]

The solution for the coefficients \(\beta_{0}\) and \(\beta_{1}\) is:

\[ \begin{bmatrix} \beta_{0} \\\beta_{1} \end{bmatrix} = \begin{bmatrix} 1.5067 \\ 0.71714\end{bmatrix} \]

The resulting regression polynomial is: \(\hat{y} = 1.5067 + 0.71714x\). Next consider a degree 2 regression polynomial:

\[ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \end{bmatrix} = \begin{bmatrix} 2.1\\ 3.5 \\ 4.2 \\ 3.1 \\ 4.4 \\ 6.8 \end{bmatrix} \]

The solution for the coefficients \(\beta_{0}\), \(\beta_{1}\) and \(\beta_{2}\) is:

\[ \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \end{bmatrix} = \begin{bmatrix} 2.74 \\ -0.207857 \\ 0.132143 \end{bmatrix} \]

The resulting regression polynomial is: \(\hat{y} = 2.74 - 0.207857x + 0.132143x^{2}\). Finally, consider a degree 3 regression polynomial:

\[ \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & 2 & 4 & 8\\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \\ 1 & 5 & 25 & 125 \\ 1 & 6 & 36 & 216 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \\ \beta_{3} \end{bmatrix} = \begin{bmatrix} 2.1\\ 3.5 \\ 4.2 \\ 3.1 \\ 4.4 \\ 6.8 \end{bmatrix} \]

The solution for the coefficients \(\beta_{0}\), \(\beta_{1}\), \(\beta_{2}\) and \(\beta_{3}\) is:

\[ \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \\ \beta_{3} \end{bmatrix} = \begin{bmatrix} -2.3 \\ 6.13214 \\ -1.96786 \\ 0.2 \end{bmatrix} \]

The resulting regression polynomial is: \(\hat{y} = -2.3 + 6.13214x - 1.96786x^{2} + 0.2x^{3}\). The following plot shows all three regression polynomials together, giving a visualization of how well each fits the data (a short code sketch for reproducing all three fits follows):
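All three fits can also be sketched with NumPy's polyfit (assuming NumPy is available); note that np.polyfit returns coefficients from the highest power down, the reverse of the \(\beta_{0}, \beta_{1}, \dots\) ordering used above:

import numpy as np

# Simple Dataset 2 from the table above.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.5, 4.2, 3.1, 4.4, 6.8])

for degree in (1, 2, 3):
    # np.polyfit returns coefficients highest power first.
    coeffs = np.polyfit(x, y, degree)
    print(degree, coeffs[::-1])   # reversed to read beta_0, beta_1, ...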


Section 5: Additional Examples

Here are three more (simple) datasets with which to explore finding the linear least squares solution that best fits the data. The linear algebra approach, \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\), will be used, though the calculus approach would certainly yield the same results.

Consider the first example dataset:

Example Dataset 1
\(x\) \(y\)
0.21 0.1726
0.20 0.1707
0.19 0.1637
0.18 0.1640
0.17 0.1613
0.16 0.1617
0.15 0.1598

The matrix system is:

\[ \begin{bmatrix} 1 & 0.21 \\ 1 & 0.2 \\ 1 & 0.19 \\ 1 & 0.18 \\ 1 & 0.17 \\ 1 & 0.16 \\ 1 & 0.15 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \end{bmatrix} = \begin{bmatrix} 0.1726 \\ 0.1707 \\ 0.1637 \\ 0.164 \\ 0.1613 \\ 0.1617 \\ 0.1598 \end{bmatrix} \]

The solution is:

\[ \vec{\hat{\beta}} = \begin{bmatrix} 0.127028 \\ 0.21 \end{bmatrix} \]

Hence, the equation of the best fit least squares regression line is \(\hat{y} = 0.127028 + 0.21x\). The scatter plot and the best fit line is:

Consider the second example dataset:

Example Dataset 2
\(x\) \(y\)
2 20
6 18
20 10
30 6
40 2

The matrix system is:

\[ \begin{bmatrix} 1 & 2 \\ 1 & 6 \\ 1 & 20 \\ 1 & 30 \\ 1 & 40 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \end{bmatrix} = \begin{bmatrix} 20 \\ 18 \\ 10 \\ 6 \\ 2 \end{bmatrix} \]

The solution is:

\[ \vec{\hat{\beta}} = \begin{bmatrix} 20.615385 \\ -0.480377 \end{bmatrix} \]

Hence, the equation of the best fit least squares regression line is \(\hat{y} = 20.615385 - 0.480377x\). The scatter plot and the best fit line is:

And, finally, consider the third example dataset:

Example Dataset 3
\(x\) \(y\)
50 3.3
50 2.8
50 2.9
70 2.3
70 2.6
70 2.1
80 2.5
80 2.9
80 2.4
90 3.0
90 3.1
90 2.8
100 3.3
100 3.5
100 3.0

The matrix system is:

\[ \begin{bmatrix} 1 & 50 \\ 1 & 50 \\ 1 & 50 \\ 1 & 70 \\ 1 & 70 \\ 1 & 70 \\ 1 & 80 \\ 1 & 80 \\ 1 & 80 \\ 1 & 90 \\ 1 & 90 \\ 1 & 90 \\ 1 & 100 \\ 1 & 100 \\ 1 & 100 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \end{bmatrix} = \begin{bmatrix} 3.3 \\ 2.8 \\ 2.9 \\ 2.3 \\ 2.6 \\ 2.1 \\ 2.5 \\ 2.9 \\ 2.4 \\ 3.0 \\ 3.1 \\ 2.8 \\ 3.3 \\ 3.5 \\ 3.0 \end{bmatrix} \]

The solution is:

\[ \vec{\hat{\beta}} = \begin{bmatrix} 2.306306 \\ 0.0067568 \end{bmatrix} \]

Hence, the equation of the best fit least squares regression line is \(\hat{y} = 2.306306 + 0.0067568x\). The scatter plot and the best fit line is:

Using this third example dataset the following is the matrix system to find a degree 2 polynomial:

\[ \begin{bmatrix} 1 & 50 & 2500 \\ 1 & 50 & 2500 \\ 1 & 50 & 2500 \\ 1 & 70 & 4900 \\ 1 & 70 & 4900 \\ 1 & 70 & 4900 \\ 1 & 80 & 6400 \\ 1 & 80 & 6400 \\ 1 & 80 & 6400 \\ 1 & 90 & 8100 \\ 1 & 90 & 8100 \\ 1 & 90 & 8100 \\ 1 & 100 & 10000 \\ 1 & 100 & 10000 \\ 1 & 100 & 10000 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \end{bmatrix} = \begin{bmatrix} 3.3 \\ 2.8 \\ 2.9 \\ 2.3 \\ 2.6 \\ 2.1 \\ 2.5 \\ 2.9 \\ 2.4 \\ 3.0 \\ 3.1 \\ 2.8 \\ 3.3 \\ 3.5 \\ 3.0 \end{bmatrix} \]

The solution is:

\[ \vec{\hat{\beta}} = \begin{bmatrix} 7.960481 \\ -0.153711 \\ 0.00107560 \end{bmatrix} \]

Hence, the degree two polynomial that best fits the data is \(\hat{y} = 7.960481 - 0.153711x + 0.00107560x^{2}\). The scatter plot and the best fit degree two polynomial is:


Section 6: Multivariate Data

Up to now the discussion has focused on the univariate case, the pairing of a single \(x\) with \(y\). In the multivariate case there is more than one input variable, \(x_{1}, \cdots, x_{n}\). The goal remains the same, namely to find the linear regression equation that best fits the data (note that the geometry of the resulting plot can become complicated as the number of variables increases). That is, for \(n\) \(x\) variables, the linear regression equation is:

\[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{n}x_{n} \]

The corresponding system of equations for \(m\) observations with \(n\) variables become:

\[ \begin{align} \beta_{0} + \beta_{1} x_{11} + \beta_{2} x_{12} + \cdots + \beta_{n} x_{1n} &= y_{1} \\ \beta_{0} + \beta_{1} x_{21} + \beta_{2} x_{22} + \cdots + \beta_{n} x_{2n} &= y_{2} \\ \vdots &= \vdots \\ \beta_{0} + \beta_{1} x_{m1} + \beta_{2} x_{m2} + \cdots + \beta_{n} x_{mn} &= y_{m} \\ \end{align} \]

In general, \(y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{n}x_{in}\) for \(i = 1, \dots , m\). This system of linear equations can be expressed as

\[ \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1n} \\ 1 & x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{n} \end{bmatrix} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} \]

The solution is the same as described above \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\), where

\[ \vec{\hat{\beta}} = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{n} \end{bmatrix} \]

To show the ability to use many \(x\) variables, consider the following example with 6 variables and 17 observations:

Multivariate Dataset
\(x_{1}\) \(x_{2}\) \(x_{3}\) \(x_{4}\) \(x_{5}\) \(x_{6}\) \(y\)
3149 1 7500 220 0 40 3389
653 1 1975 200 0 100 1101
810 0 600 205 60 111 1131
448 1 675 160 60 120 596
844 1 750 185 70 83 896
1450 1 2500 80 60 80 1767
493 1 350 154 80 98 807
941 0 1500 200 70 93 1111
547 1 375 137 60 105 645
392 1 1050 167 60 74 628
1283 1 0 180 60 80 1360
458 1 450 160 64 60 652
722 1 1750 135 90 79 860
384 0 2000 160 60 80 500
501 0 4500 180 0 100 781
405 0 1500 170 90 120 1070
1520 1 3000 180 0 129 1754

The matrix representation of this data is:

\[ \begin{bmatrix} 1 & 3149 & 1 & 7500 & 220 & 0 & 40 \\ 1 & 653 & 1 & 1975 & 200 & 0 & 100 \\ 1 & 810 & 0 & 600 & 205 & 60 & 111 \\ 1 & 448 & 1 & 675 & 160 & 60 & 120 \\ 1 & 844 & 1 & 750 & 185 & 70 & 83 \\ 1 & 1450 & 1 & 2500 & 80 & 60 & 80 \\ 1 & 493 & 1 & 350 & 154 & 80 & 98 \\ 1 & 941 & 0 & 1500 & 200 & 70 & 93 \\ 1 & 547 & 1 & 375 & 137 & 60 & 105 \\ 1 & 392 & 1 & 1050 & 167 & 60 & 74 \\ 1 & 1283 & 1 & 0 & 180 & 60 & 80 \\ 1 & 458 & 1 & 450 & 160 & 64 & 60 \\ 1 & 722 & 1 & 1750 & 135 & 90 & 79 \\ 1 & 384 & 0 & 2000 & 160 & 60 & 80 \\ 1 & 501 & 0 & 4500 & 180 & 0 & 100 \\ 1 & 405 & 0 & 1500 & 170 & 90 & 120 \\ 1 & 1520 & 1 & 3000 & 180 & 0 & 129 \end{bmatrix} \begin{bmatrix} \beta_{0} \\\beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \beta_{4} \\ \beta_{5} \\ \beta_{6} \end{bmatrix} = \begin{bmatrix} 3389 \\ 1101 \\ 1131 \\ 596 \\ 896 \\ 1767 \\ 807 \\ 1111 \\ 645 \\ 628 \\ 1360 \\ 652 \\ 860 \\ 500 \\ 781 \\ 1070 \\ 1754 \end{bmatrix} \]

The solution is

\[ \vec{\hat{\beta}} = \begin{bmatrix} -186.34654 \\ 0.94921 \\ -13.15023 \\ 0.05278 \\ 0.29434 \\1.30413 \\ 2.91795 \end{bmatrix} \]

Hence, the regression equation:

\[ \hat{y} = -186.34654 + 0.94921x_{1} - 13.15023x_{2} + 0.05278x_{3} + 0.29434x_{4} + 1.30413x_{5} + 2.91795x_{6} \]
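The same multivariate fit can be reproduced directly from the table with NumPy (assumed); np.linalg.lstsq is used here rather than an explicit inverse for numerical robustness:

import numpy as np

# Multivariate dataset from the table above: columns x1..x6 and y.
data = np.array([
    [3149, 1, 7500, 220,  0,  40, 3389],
    [ 653, 1, 1975, 200,  0, 100, 1101],
    [ 810, 0,  600, 205, 60, 111, 1131],
    [ 448, 1,  675, 160, 60, 120,  596],
    [ 844, 1,  750, 185, 70,  83,  896],
    [1450, 1, 2500,  80, 60,  80, 1767],
    [ 493, 1,  350, 154, 80,  98,  807],
    [ 941, 0, 1500, 200, 70,  93, 1111],
    [ 547, 1,  375, 137, 60, 105,  645],
    [ 392, 1, 1050, 167, 60,  74,  628],
    [1283, 1,    0, 180, 60,  80, 1360],
    [ 458, 1,  450, 160, 64,  60,  652],
    [ 722, 1, 1750, 135, 90,  79,  860],
    [ 384, 0, 2000, 160, 60,  80,  500],
    [ 501, 0, 4500, 180,  0, 100,  781],
    [ 405, 0, 1500, 170, 90, 120, 1070],
    [1520, 1, 3000, 180,  0, 129, 1754],
], dtype=float)

X_raw, y = data[:, :6], data[:, 6]
X = np.column_stack([np.ones(len(X_raw)), X_raw])   # prepend the intercept column

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [-186.35, 0.949, -13.15, 0.053, 0.294, 1.304, 2.918]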


Section 7: Weighted Least Squares

There are times when weights need to be, or should be, used. Without going into detail beyond the scope of this article, a common use for weights is when it is thought that not every observation should be treated equally when determining the best fit line (or polynomial). That is, when determining the coefficients, each observation is assigned a particular influence, or weight. Though not a perfect analogy, it may help to relate this to a weighted average. Weighted least squares is the process used to compute the coefficients under these conditions.

The solution to weighted least squares is exactly the same as above, except for the inclusion of a matrix \(W\), a diagonal matrix with the assigned weight for each observation, \(w_{i}\), on the diagonal:

\[ W = \begin{bmatrix} w_{1} & 0 & \cdots & 0 \\ 0 & w_{2} & \cdots & 0 \\ \vdots & \vdots &\ddots & \vdots\\ 0 & 0 & \cdots & w_{n} \end{bmatrix} \]

Hence, the solution to weighted least squares becomes:

\[ \beta = \left(X^{T}WX\right)^{-1}X^{T}WY \]
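A minimal sketch of this weighted solution in Python with NumPy (assumed); here X is a design matrix with an intercept column, y the observations, and w the vector of weights, with the example values below purely hypothetical:

import numpy as np

def weighted_least_squares(X, y, w):
    """Solve beta = (X^T W X)^{-1} X^T W y with W = diag(w).

    Multiplying every weight by the same positive constant leaves the
    solution unchanged, which is why equal weights reproduce ordinary
    least squares.
    """
    W = np.diag(w)
    # Solve the weighted normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Hypothetical example: three observations, intercept plus one x column.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5])
w = np.array([1.0, 0.5, 2.0])
print(weighted_least_squares(X, y, w))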

Consider the following dataset with no specific weights assigned:

Weighted Least Squares Example Dataset
\(x\) \(y\)
3000 81464
3150 72661
3085 72344
5225 90743
5350 98588
6090 96507
8925 126574
9015 114133
8885 115814
8950 123181
9000 131434
11345 140564
12275 151352
12400 146926
12525 130963
12310 144630
13700 147041
15000 179021
15175 166200
14995 180732
15050 178187
15200 185304
15150 155931
16800 172579
16500 188851
17830 192424
19500 203112
19200 192482
19000 218715
19350 214317

Finding the linear regression line that best fits the data using the method described above \(\vec{\hat{\beta}} = (X^{T}X)^{-1}X^{T}\vec{y}\) yields the solution

\[ \vec{\hat{\beta}} = \begin{bmatrix} 49443.38381 \\ 8.04844 \end{bmatrix} \]

The linear least squares regression line is \(\hat{y} = 49443.38381 + 8.04844x\). The plot of the data and the regression line is:

This solution was found under the condition that all observations were equal in terms of their influence; that is, they were given equal weights. To see this, let the weight matrix \(W\) be defined as

\[ W = \begin{bmatrix} w & 0 & \cdots & 0 \\ 0 & w & \cdots & 0 \\ \vdots & \vdots &\ddots & \vdots\\ 0 & 0 & \cdots & w \end{bmatrix} \]

In other words, all the weights are equal; the particular value of \(w\) does not matter as long as it is the same for every observation. Finding the solution using \(\beta = \left(X^{T}WX\right)^{-1}X^{T}WY\) yields:

\[ \vec{\hat{\beta}} = \begin{bmatrix} 49443.38381 \\ 8.04844 \end{bmatrix} \]

which is exactly as found above. The weight matrix, \(W\), was first created with \(w = 1\) and then with \(w = \frac{1}{n}\) where \(n = 30\) (the number of observations). In each case, the solutions were the same.

Next consider that there were specific weights assigned via a process beyond the scope here as follows:

Weighted Least Squares Example Dataset
\(x\) \(y\) \(w\)
3000 81464 0.00000006220
3150 72661 0.00000005800
3085 72344 0.00000005970
5225 90743 0.00000002990
5350 98588 0.00000002900
6090 96507 0.00000002480
8925 126574 0.00000001600
9015 114133 0.00000001580
8885 115814 0.00000001610
8950 123181 0.00000001600
9000 131434 0.00000001590
11345 140564 0.00000001230
12275 151352 0.00000001130
12400 146926 0.00000001120
12525 130963 0.00000001100
12310 144630 0.00000001130
13700 147041 0.00000001000
15000 179021 0.00000000910
15175 166200 0.00000000899
14995 180732 0.00000000910
15050 178187 0.00000000907
15200 185304 0.00000000897
15150 155931 0.00000000900
16800 172579 0.00000000806
16500 188851 0.00000000822
17830 192424 0.00000000757
19500 203112 0.00000000689
19200 192482 0.00000000700
19000 218715 0.00000000708
19350 214317 0.00000000695

Finding the solution using \(\beta = \left(X^{T}WX\right)^{-1}X^{T}WY\) yields:

\[ \vec{\hat{\beta}} = \begin{bmatrix} 50974.30026 \\ 7.92263 \end{bmatrix} \]

The linear least squares regression line using weighted least squares is \(\hat{y} = 50974.30026 + 7.92263x\). Though close to the unweighted solution found above, this small difference may actually have significant implications. The plot of the data and the regression line found using weighted least squares is:

The following plot shows both regression lines superimposed on the data:

Due to the scale the two lines show little difference but, again, this difference can have significant implications.


Section 8: Software Implementation

The mathematics of least squares was detailed above. Solutions were presented for both the calculus approach (for univariate data) and the linear algebra approach. The linear algebra approach is perhaps more universal in that, by using matrices, the solution remains the same regardless of whether one is working with univariate or multivariate data, seeking a higher degree polynomial, or applying weights. The question becomes how to implement these solutions in software, for though the computations are doable by hand for a small number of observations, they quickly become impractical.

All the mathematics presented, and thus the solutions, were implemented in Python using personally developed routines:

  • ols.py
    • Implements both the calculus approach (for univariate data) and the linear algebra approach for univariate, multivariate and higher order polynomials
    • That is, for the calculus approach: \[ \begin{align} \beta_{1} &= \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}} \\ \beta_{0} &= \bar{y} - \beta_{1} \bar{x} \end{align} \]
    • For the linear algebra approach: \[ \beta = (X^{T}X)^{-1}X^{T}Y \]
  • wls.py
    • Implements the linear algebra approach: \[\beta = \left(X^{T}WX\right)^{-1}X^{T}WY \]

There are enormous benefits to implementing these algorithms oneself. First and foremost, doing so enhances one’s understanding of the algorithms and the mathematics. This allows for a deeper understanding and appreciation of the mathematics involved and can lay the foundation for understanding related, yet more complex, algorithms. In addition, it allows these algorithms to be integrated into applications being developed, perhaps with unique tweaks for real-time performance, parameter modification or other reasons, without having to rely on any “out-of-the-box” implementations.

As important and useful as these benefits are, an often overlooked reason for implementing the algorithms oneself is verification that the “out-of-the-box” implementations are indeed correct. This not only gives one confidence that their understanding is correct but also assurance that the “out-of-the-box” implementations can be used with confidence.

There are many least squares “out-of-the-box” implementations. One highly recommended option is the statistical computing package R. The solutions presented above were verified in R via the lm function, which is used to fit linear models.

For the simple dataset first seen above, here is the R solution for the best-fit linear least squares line; it is in full agreement with the computed solution:

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Coefficients:
## (Intercept)            x  
##        1.27         0.19

The construction of the best-fit degree 2 polynomial in R was done and it, too, is in full agreement with the computed solution:

## 
## Call:
## lm(formula = y ~ poly(x, 2, raw = TRUE), data = df)
## 
## Coefficients:
##             (Intercept)  poly(x, 2, raw = TRUE)1  poly(x, 2, raw = TRUE)2  
##                 -0.2800                   1.5186                  -0.2214

For the multivariate case the R solution was in complete agreement with the computed solution:

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dfM)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4  
##  -186.34654      0.94921    -13.15023      0.05278      0.29434  
##          x5           x6  
##     1.30413      2.91795

And, finally, for the weighted least squares case, the R solution was in complete agreement with the computed solution:

## 
## Call:
## lm(formula = y ~ x, data = dfW, weights = w)
## 
## Coefficients:
## (Intercept)            x  
##   50974.300        7.923
