When a linear system has no solution, the least-squares method finds the vector that comes closest — the one minimizing the squared distance between Ax and b. The answer is a projection: the least-squares solution produces the point in the column space nearest to b, and the normal equations encode the orthogonality condition that defines "nearest."
The Problem
The system Ax = b may have no solution — b may not lie in the column space of A. This is typical when the system is overdetermined: more equations than unknowns, with the equations imposing contradictory constraints.
When no exact solution exists, the goal shifts from solving Ax=b to minimizing the error:
x̂ = argminₓ ∥Ax − b∥²

The quantity ∥Ax − b∥² = Σᵢ (Ax − b)ᵢ² is the sum of squared residuals. The vector x̂ that minimizes this sum is the least-squares solution.
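As a concrete check (a minimal sketch assuming NumPy; the data here are illustrative, not from the text), `np.linalg.lstsq` minimizes this sum directly:

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns (illustrative data).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# np.linalg.lstsq minimizes ||Ax - b||^2 over x.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

residual = b - A @ x_hat
print(x_hat)                  # least-squares solution
print(np.sum(residual**2))    # sum of squared residuals
```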
The Geometric Interpretation
The set of all vectors Ax as x ranges over Rⁿ is the column space of A. Minimizing ∥Ax − b∥ means finding the point in the column space closest to b. That closest point is the orthogonal projection b̂ = proj_Col(A) b.
The least-squares solution x̂ satisfies Ax̂ = b̂ — it produces the projection, not the original b. The residual r = b − Ax̂ = b − b̂ is the component of b orthogonal to the column space. It lies in Col(A)⊥ = Null(Aᵀ).
The orthogonality condition Aᵀr = 0 — the residual is perpendicular to every column of A — is the geometric content of the least-squares solution. It is this condition that leads to the normal equations.
The Normal Equations
The orthogonality condition Aᵀ(b − Ax̂) = 0 rearranges to
AᵀA x̂ = Aᵀb
These are the normal equations. They form a square n×n system regardless of the shape of A.
The matrix AᵀA is always symmetric and positive semi-definite. When A has full column rank (the columns are linearly independent), AᵀA is positive definite and invertible, giving a unique least-squares solution:
x̂ = (AᵀA)⁻¹Aᵀb
When A does not have full column rank, AᵀA is singular and the normal equations have infinitely many solutions. All produce the same projection b̂ = Ax̂, but the x̂ vectors differ. The minimum-norm solution — the one with smallest ∥x̂∥ — is selected by the pseudoinverse.
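A small NumPy sketch of both cases, using illustrative data and a deliberately duplicated column to break full rank:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 3.0, 6.0, 7.0])

# Full column rank: the normal equations have a unique solution.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Make A rank-deficient by duplicating its second column.
A_def = np.column_stack([A, A[:, 1]])

# A_def.T @ A_def is singular, so solving the normal equations directly
# would fail; np.linalg.pinv selects the minimum-norm solution instead.
x_min = np.linalg.pinv(A_def) @ b

# Every least-squares solution yields the same projection b_hat.
print(np.allclose(A @ x_hat, A_def @ x_min))
```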
Worked Example: Fitting a Line
Fit a line y = c₀ + c₁x to the data points (1,2), (2,3), (3,6), (4,7).
The model yᵢ = c₀ + c₁xᵢ for each data point gives the system Ac = y with

A = [ 1  1 ]        y = [ 2 ]
    [ 1  2 ]            [ 3 ]
    [ 1  3 ]            [ 6 ]
    [ 1  4 ]            [ 7 ]
Four equations in two unknowns — overdetermined. The normal equations AᵀA ĉ = Aᵀy become

[  4  10 ] [ c₀ ]   [ 18 ]
[ 10  30 ] [ c₁ ] = [ 54 ]

with solution c₀ = 0, c₁ = 1.8.
The best-fit line is y = 1.8x. The residuals are 2 − 1.8 = 0.2, 3 − 3.6 = −0.6, 6 − 5.4 = 0.6, 7 − 7.2 = −0.2. Their sum of squares 0.04 + 0.36 + 0.36 + 0.04 = 0.8 is the minimum achievable error for any line through these data.
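Solving the normal equations for these points in NumPy (a minimal sketch; at this scale, solving the normal equations directly is fine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])

# Design matrix: a column of ones and a column of x values.
A = np.column_stack([np.ones_like(x), x])

# Normal equations: (A^T A) c = A^T y.
c = np.linalg.solve(A.T @ A, A.T @ y)
r = y - A @ c
print(c)              # fitted coefficients [c0, c1]
print(np.sum(r**2))   # residual sum of squares
```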
Worked Example: Fitting a Parabola
Fit a parabola y = c₀ + c₁x + c₂x² to the same data (1,2), (2,3), (3,6), (4,7).
The design matrix gains a third column:
A = [ 1  1   1 ]
    [ 1  2   4 ]
    [ 1  3   9 ]
    [ 1  4  16 ]
The normal equations AᵀA ĉ = Aᵀy now form a 3×3 system. The machinery is identical — only the model matrix changes. A higher-degree model provides a closer fit (the residual sum of squares cannot increase when the model gains flexibility), but it also risks fitting noise rather than signal.
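A sketch comparing the quadratic and linear fits on the same data (assuming NumPy; `lstsq` is used here rather than forming AᵀA explicitly):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])

# Quadratic design matrix: columns 1, x, x^2.
A2 = np.column_stack([np.ones_like(x), x, x**2])
c2, *_ = np.linalg.lstsq(A2, y, rcond=None)
rss2 = np.sum((y - A2 @ c2) ** 2)

# Linear design matrix for comparison: the first two columns.
A1 = A2[:, :2]
c1, *_ = np.linalg.lstsq(A1, y, rcond=None)
rss1 = np.sum((y - A1 @ c1) ** 2)

# The extra column can only lower (or match) the residual sum of squares.
print(rss1, rss2)
```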
The framework generalizes to any linear model: y = c₀f₀(x) + c₁f₁(x) + ⋯ + cₖfₖ(x) where the functions fᵢ are chosen in advance. Each choice produces a different design matrix A, and the normal equations produce the best coefficients in the least-squares sense.
The Projection Matrix
The projection of b onto the column space is b̂ = Pb where
P = A(AᵀA)⁻¹Aᵀ
When the columns of A are orthonormal (A = Q with QᵀQ = I), this simplifies to P = QQᵀ.
The projection matrix is symmetric (Pᵀ = P) and idempotent (P² = P). The complementary matrix I − P projects onto the orthogonal complement Col(A)⊥ and extracts the residual: r = (I − P)b.
The matrix A⁺ = (AᵀA)⁻¹Aᵀ (when A has full column rank) is called the left pseudoinverse of A. The least-squares solution is x̂ = A⁺b.
The pseudoinverse satisfies A⁺A = Iₙ (it is a left inverse of A), but AA⁺ ≠ Iₘ in general — the product AA⁺ equals the projection matrix P.
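These identities are easy to confirm numerically (a sketch with an illustrative full-column-rank A, assuming NumPy):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])

# Projection onto Col(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P, P.T))      # symmetric
print(np.allclose(P @ P, P))    # idempotent

# With an orthonormal basis Q for Col(A), P = Q Q^T.
Q, _ = np.linalg.qr(A)
print(np.allclose(P, Q @ Q.T))

# Left pseudoinverse: A_plus @ A = I, while A @ A_plus = P.
A_plus = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_plus @ A, np.eye(2)))
print(np.allclose(A @ A_plus, P))
```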
When A does not have full column rank, the Moore-Penrose pseudoinverse A⁺ is defined through the singular value decomposition: if A = UΣVᵀ, then A⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts the nonzero singular values and transposes the shape. The Moore-Penrose pseudoinverse gives the minimum-norm least-squares solution — the x̂ of smallest length among all minimizers of ∥Ax − b∥.
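The SVD construction can be sketched directly (assuming NumPy; the rank-deficient matrix and tolerance choice below are illustrative):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])   # rank 2: the last two columns are equal
b = np.array([2.0, 3.0, 6.0, 7.0])

# Pseudoinverse via the SVD: A+ = V S+ U^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]   # invert only the nonzero singular values
A_plus = Vt.T @ np.diag(s_inv) @ U.T

# Minimum-norm least-squares solution.
x_min = A_plus @ b
print(np.allclose(A_plus, np.linalg.pinv(A)))
```

Because the duplicated columns must share the total coefficient, the minimum-norm solution splits it evenly between them.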
Least Squares via QR
The QR decomposition A = QR provides a numerically superior route to the least-squares solution, avoiding the explicit formation of AᵀA.
Substituting A = QR into AᵀA x̂ = Aᵀb gives RᵀQᵀQR x̂ = RᵀQᵀb. Since QᵀQ = I, this simplifies to RᵀR x̂ = RᵀQᵀb. Multiplying both sides by (Rᵀ)⁻¹:
R x̂ = Qᵀb
The right-hand side Qᵀb is a vector of n dot products. The left-hand side is an upper triangular system, solved by back substitution.
The critical advantage is numerical. Forming ATA explicitly squares the condition number of A, amplifying rounding errors. The QR approach works with Q and R directly, preserving the original conditioning. This is why QR-based least squares is the standard algorithm in numerical software, from MATLAB to Python's NumPy.
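A sketch of the QR route in NumPy; `np.linalg.solve` stands in for a dedicated triangular solver (SciPy's `solve_triangular` would exploit the structure of R):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 3.0, 6.0, 7.0])

# Reduced QR: Q is 4x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b; since R is triangular, back substitution suffices.
x_hat = np.linalg.solve(R, Q.T @ b)
print(x_hat)
```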
Regression as Least Squares
The entire framework of linear regression is a least-squares problem in matrix form.
Simple linear regression fits y = β₀ + β₁x + ϵ to data (xᵢ, yᵢ). The design matrix has a column of ones and a column of xᵢ values. The normal equations produce the slope β̂₁ and intercept β̂₀ that minimize Σ(yᵢ − β₀ − β₁xᵢ)².
Multiple linear regression fits y = Xβ + ϵ where X is the design matrix with rows for observations and columns for predictors. The normal equations XᵀX β̂ = Xᵀy give the coefficient estimates.
The projection matrix P = X(XᵀX)⁻¹Xᵀ is called the hat matrix in statistics because it puts the "hat" on y: ŷ = Py. The residual vector e = y − ŷ = (I − P)y is the component of y orthogonal to the column space of X, and ∥e∥² is the residual sum of squares.
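In NumPy (a sketch with illustrative data; the hat matrix is formed explicitly only for demonstration, which would be too costly for large n):

```python
import numpy as np

# Illustrative data: four observations, one predictor plus an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])
X = np.column_stack([np.ones_like(x), x])

# Hat matrix: puts the hat on y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
e = (np.eye(len(y)) - H) @ y        # residual vector

print(np.sum(e**2))                 # residual sum of squares
print(np.allclose(X.T @ e, 0.0))    # residuals orthogonal to Col(X)
```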
The Minimum Error
The least-squares error ∥r∥ = ∥b − Ax̂∥ is the distance from b to the column space of A. It is the length of the orthogonal component of b with respect to Col(A).
By the Pythagorean theorem, since b̂ = Ax̂ and r = b − b̂ are perpendicular:
∥b∥² = ∥b̂∥² + ∥r∥²
The error is what remains after the projection accounts for as much of b as possible. The ratio ∥b̂∥²/∥b∥² (computed with centered data in regression) is the coefficient of determination R² — the fraction of the total variation explained by the model. An R² close to 1 means the column space captures nearly all of b; close to 0 means the model explains little.
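Both claims can be checked numerically (a sketch assuming NumPy, with illustrative data; the R² here uses the centered total sum of squares, which is valid because the design matrix contains an intercept column):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 3.0, 6.0, 7.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_hat
r = b - b_hat

# Pythagorean split: ||b||^2 = ||b_hat||^2 + ||r||^2.
print(np.isclose(b @ b, b_hat @ b_hat + r @ r))

# R^2 from centered data (design matrix includes an intercept column).
ss_tot = np.sum((b - b.mean())**2)
r2 = 1 - np.sum(r**2) / ss_tot
print(r2)
```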
The error is zero if and only if b∈Col(A) — if and only if the original system Ax=b has an exact solution. In that case, the least-squares solution is the exact solution, and the two problems coincide.