

Least Squares






The Best Approximate Solution

When a linear system has no solution, the least-squares method finds the vector that comes closest, the one minimizing the squared distance between $A\mathbf{x}$ and $\mathbf{b}$. The answer is a projection: the least-squares solution produces the point in the column space nearest to $\mathbf{b}$, and the normal equations encode the orthogonality condition that defines "nearest."



The Problem

The system $A\mathbf{x} = \mathbf{b}$ may have no solution: $\mathbf{b}$ may not lie in the column space of $A$. This is typical when the system is overdetermined: more equations than unknowns, with the equations imposing contradictory constraints.

When no exact solution exists, the goal shifts from solving $A\mathbf{x} = \mathbf{b}$ to minimizing the error:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2$$


The quantity $\|A\mathbf{x} - \mathbf{b}\|^2 = \sum_i (A\mathbf{x} - \mathbf{b})_i^2$ is the sum of squared residuals. The vector $\hat{\mathbf{x}}$ that minimizes this sum is the least-squares solution.

The Geometric Interpretation

The set of all vectors $A\mathbf{x}$ as $\mathbf{x}$ ranges over $\mathbb{R}^n$ is the column space of $A$. Minimizing $\|A\mathbf{x} - \mathbf{b}\|$ means finding the point in the column space closest to $\mathbf{b}$. That closest point is the orthogonal projection $\hat{\mathbf{b}} = \text{proj}_{\text{Col}(A)}\mathbf{b}$.

The least-squares solution $\hat{\mathbf{x}}$ satisfies $A\hat{\mathbf{x}} = \hat{\mathbf{b}}$: it produces the projection, not the original $\mathbf{b}$. The residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}} = \mathbf{b} - \hat{\mathbf{b}}$ is the component of $\mathbf{b}$ orthogonal to the column space. It lies in $\text{Col}(A)^\perp = \text{Null}(A^T)$.

The orthogonality condition $A^T\mathbf{r} = \mathbf{0}$ (the residual is perpendicular to every column of $A$) is the geometric content of the least-squares solution. It is this condition that leads to the normal equations.

The Normal Equations

The orthogonality condition $A^T(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$ rearranges to

$$A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$$


These are the normal equations. They form a square $n \times n$ system regardless of the shape of $A$.

The matrix $A^TA$ is always symmetric and positive semi-definite. When $A$ has full column rank (the columns are linearly independent), $A^TA$ is positive definite and invertible, giving a unique least-squares solution:

$$\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b}$$


When $A$ does not have full column rank, $A^TA$ is singular and the normal equations have infinitely many solutions. All produce the same projection $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$, but the $\hat{\mathbf{x}}$ vectors differ. The minimum-norm solution, the one with smallest $\|\hat{\mathbf{x}}\|$, is selected by the pseudoinverse.
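As a sketch of the full-rank case, the normal equations can be solved directly in NumPy (random data, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # tall matrix: 6 equations, 3 unknowns
b = rng.standard_normal(6)

# Solve the square n x n system (A^T A) x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The residual is orthogonal to every column of A: A^T r = 0
r = b - A @ x_hat
assert np.allclose(A.T @ r, 0)
```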

Worked Example: Fitting a Line

Fit a line $y = c_0 + c_1 x$ to the data points $(1, 2)$, $(2, 3)$, $(3, 6)$, $(4, 7)$.

The model $y_i = c_0 + c_1 x_i$ for each data point gives the system $A\mathbf{c} = \mathbf{y}$ with

$$A = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 2 \\ 3 \\ 6 \\ 7 \end{pmatrix}$$


Four equations in two unknowns: overdetermined. The normal equations use $\sum y_i = 18$ and $\sum x_i y_i = 2 + 6 + 18 + 28 = 54$:

$$A^TA = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}, \quad A^T\mathbf{y} = \begin{pmatrix} 18 \\ 54 \end{pmatrix}$$

Solving: $\det(A^TA) = 120 - 100 = 20$, so $(A^TA)^{-1} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix}$.

$$\hat{\mathbf{c}} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix}\begin{pmatrix} 18 \\ 54 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 540 - 540 \\ -180 + 216 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 0 \\ 36 \end{pmatrix} = \begin{pmatrix} 0 \\ 1.8 \end{pmatrix}$$

The best-fit line is $y = 1.8x$. The residuals are $2 - 1.8 = 0.2$, $3 - 3.6 = -0.6$, $6 - 5.4 = 0.6$, $7 - 7.2 = -0.2$. Their sum of squares $0.04 + 0.36 + 0.36 + 0.04 = 0.8$ is the minimum achievable error for any line through these data.
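The hand computation can be checked with NumPy's `lstsq`, which returns both the coefficients and the residual sum of squares:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])

A = np.column_stack([np.ones_like(x), x])   # columns: 1, x
c_hat, rss, rank, sv = np.linalg.lstsq(A, y, rcond=None)

assert np.allclose(c_hat, [0.0, 1.8])   # intercept 0, slope 1.8
assert np.isclose(rss[0], 0.8)          # minimum sum of squared residuals
```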

Worked Example: Fitting a Parabola

Fit a parabola $y = c_0 + c_1 x + c_2 x^2$ to the same data $(1, 2)$, $(2, 3)$, $(3, 6)$, $(4, 7)$.

The design matrix gains a third column:

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{pmatrix}$$


The normal equations $A^TA\hat{\mathbf{c}} = A^T\mathbf{y}$ now form a $3 \times 3$ system. The machinery is identical; only the model matrix changes. A higher-degree model provides a closer fit (the residual sum of squares cannot increase when the model gains flexibility), but it also risks fitting noise rather than signal.

The framework generalizes to any linear model: $y = c_0 f_0(x) + c_1 f_1(x) + \cdots + c_k f_k(x)$ where the functions $f_i$ are chosen in advance. Each choice produces a different design matrix $A$, and the normal equations produce the best coefficients in the least-squares sense.
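Any fixed set of basis functions follows the same pattern. A sketch with an arbitrarily chosen basis $1, x, \sin x$ (the basis itself is an assumption, not from the text):

```python
import numpy as np

# Basis functions chosen in advance (an arbitrary example)
basis = [lambda t: np.ones_like(t), lambda t: t, lambda t: np.sin(t)]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])

# One column of the design matrix per basis function
A = np.column_stack([f(x) for f in basis])
c_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# The residual is orthogonal to the column space, whatever the basis
r = y - A @ c_hat
assert np.allclose(A.T @ r, 0)
```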

The Projection Matrix

The projection of $\mathbf{b}$ onto the column space is $\hat{\mathbf{b}} = P\mathbf{b}$ where

$$P = A(A^TA)^{-1}A^T$$


When the columns of $A$ are orthonormal ($A = Q$ with $Q^TQ = I$), this simplifies to $P = QQ^T$.

The projection matrix is symmetric ($P^T = P$) and idempotent ($P^2 = P$). The complementary matrix $I - P$ projects onto the orthogonal complement $\text{Col}(A)^\perp$ and extracts the residual: $\mathbf{r} = (I - P)\mathbf{b}$.

The minimum squared error is

$$\|\mathbf{r}\|^2 = \|(I - P)\mathbf{b}\|^2 = \|\mathbf{b}\|^2 - \|P\mathbf{b}\|^2$$


by the Pythagorean theorem, since $P\mathbf{b}$ and $(I - P)\mathbf{b}$ are orthogonal.
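These properties are easy to verify numerically (a random full-column-rank matrix, used only as an illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection onto Col(A)

assert np.allclose(P, P.T)             # symmetric: P^T = P
assert np.allclose(P @ P, P)           # idempotent: P^2 = P
r = (np.eye(5) - P) @ b                # residual via the complement I - P
assert np.allclose(A.T @ r, 0)         # r orthogonal to every column of A
```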

The Pseudoinverse

The matrix $A^+ = (A^TA)^{-1}A^T$ (when $A$ has full column rank) is called the left pseudoinverse of $A$. The least-squares solution is $\hat{\mathbf{x}} = A^+\mathbf{b}$.

The pseudoinverse satisfies $A^+A = I_n$ (it is a left inverse of $A$), but $AA^+ \neq I_m$ in general; the product $AA^+$ equals the projection matrix $P$.

When $A$ does not have full column rank, the Moore-Penrose pseudoinverse $A^+$ is defined through the singular value decomposition: if $A = U\Sigma V^T$, then $A^+ = V\Sigma^+ U^T$, where $\Sigma^+$ inverts the nonzero singular values and transposes the shape. The Moore-Penrose pseudoinverse gives the minimum-norm least-squares solution: the $\hat{\mathbf{x}}$ of smallest length among all minimizers of $\|A\mathbf{x} - \mathbf{b}\|$.
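A rank-deficient sketch, with a hand-picked matrix and null-space direction (both assumptions for illustration), showing that `pinv` picks the shortest minimizer:

```python
import numpy as np

# Rank-deficient A: the second column is twice the first
A = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [2.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

x_min = np.linalg.pinv(A) @ b          # minimum-norm least-squares solution

# (2, -1) spans the null space of A, so adding it changes x but not Ax
x_other = x_min + np.array([2.0, -1.0])
assert np.allclose(A @ x_min, A @ x_other)               # same projection
assert np.linalg.norm(x_min) < np.linalg.norm(x_other)   # x_min is shortest
```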

Least Squares via QR

The QR decomposition $A = QR$ provides a numerically superior route to the least-squares solution, avoiding the explicit formation of $A^TA$.

Substituting $A = QR$ into $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ gives $R^TQ^TQR\hat{\mathbf{x}} = R^TQ^T\mathbf{b}$. Since $Q^TQ = I$, this simplifies to $R^TR\hat{\mathbf{x}} = R^TQ^T\mathbf{b}$. Multiplying both sides by $(R^T)^{-1}$:

$$R\hat{\mathbf{x}} = Q^T\mathbf{b}$$


The right-hand side $Q^T\mathbf{b}$ is a vector of $n$ dot products. The left-hand side is an upper triangular system, solved by back substitution.
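A sketch of the QR route, with the back substitution written out explicitly (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)   # reduced QR: Q is 6x3 with orthonormal columns
rhs = Q.T @ b            # n dot products

# Back substitution on the upper triangular system R x = Q^T b
n = R.shape[0]
x_hat = np.zeros(n)
for i in range(n - 1, -1, -1):
    x_hat[i] = (rhs[i] - R[i, i+1:] @ x_hat[i+1:]) / R[i, i]

# Agrees with the normal-equations solution
assert np.allclose(x_hat, np.linalg.solve(A.T @ A, A.T @ b))
```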

The critical advantage is numerical. Forming $A^TA$ explicitly squares the condition number of $A$, amplifying rounding errors. The QR approach works with $Q$ and $R$ directly, preserving the original conditioning. This is why QR-based least squares is the standard algorithm in numerical software, from MATLAB to Python's NumPy.

Regression as Least Squares

The entire framework of linear regression is a least-squares problem in matrix form.

Simple linear regression fits $y = \beta_0 + \beta_1 x + \epsilon$ to data $(x_i, y_i)$. The design matrix has a column of ones and a column of $x_i$ values. The normal equations produce the slope $\hat{\beta}_1$ and intercept $\hat{\beta}_0$ that minimize $\sum(y_i - \beta_0 - \beta_1 x_i)^2$.

Multiple linear regression fits $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$ where $X$ is the design matrix with rows for observations and columns for predictors. The normal equations $X^TX\hat{\boldsymbol{\beta}} = X^T\mathbf{y}$ give the coefficient estimates.

The projection matrix $P = X(X^TX)^{-1}X^T$ is called the hat matrix in statistics because it puts the "hat" on $\mathbf{y}$: $\hat{\mathbf{y}} = P\mathbf{y}$. The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (I - P)\mathbf{y}$ is the component of $\mathbf{y}$ orthogonal to the column space of $X$, and $\|\mathbf{e}\|^2$ is the residual sum of squares.
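A regression sketch with made-up data (the numbers are illustrative, not from the text), showing the hat matrix at work:

```python
import numpy as np

# Illustrative data for simple linear regression
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])    # design matrix: intercept, slope
beta = np.linalg.lstsq(X, y, rcond=None)[0]

P = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
y_fit = P @ y                                # "puts the hat on y"
e = y - y_fit                                # residual vector

assert np.allclose(y_fit, X @ beta)          # P y equals the fitted values
assert np.allclose(X.T @ e, 0)               # residuals orthogonal to Col(X)
```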

The Minimum Error

The least-squares error $\|\mathbf{r}\| = \|\mathbf{b} - A\hat{\mathbf{x}}\|$ is the distance from $\mathbf{b}$ to the column space of $A$. It is the length of the orthogonal component of $\mathbf{b}$ with respect to $\text{Col}(A)$.

By the Pythagorean theorem, since $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$ and $\mathbf{r} = \mathbf{b} - \hat{\mathbf{b}}$ are perpendicular:

$$\|\mathbf{b}\|^2 = \|\hat{\mathbf{b}}\|^2 + \|\mathbf{r}\|^2$$


The error is what remains after the projection accounts for as much of $\mathbf{b}$ as possible. The ratio $\|\hat{\mathbf{b}}\|^2 / \|\mathbf{b}\|^2$ (computed with centered data in regression) is the coefficient of determination $R^2$: the fraction of the total variation explained by the model. An $R^2$ close to $1$ means the column space captures nearly all of $\mathbf{b}$; close to $0$ means the model explains little.
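The Pythagorean split can be confirmed numerically (random data as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((7, 2))
b = rng.standard_normal(7)

x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
b_hat = A @ x_hat                # projection of b onto Col(A)
r = b - b_hat                    # orthogonal residual

# ||b||^2 = ||b_hat||^2 + ||r||^2
assert np.isclose(np.linalg.norm(b)**2,
                  np.linalg.norm(b_hat)**2 + np.linalg.norm(r)**2)
```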

The error is zero if and only if $\mathbf{b} \in \text{Col}(A)$, that is, if and only if the original system $A\mathbf{x} = \mathbf{b}$ has an exact solution. In that case, the least-squares solution is the exact solution, and the two problems coincide.