When a linear system has no solution, the least-squares method finds the vector that comes closest — the one minimizing the squared distance between Ax and b. The answer is a projection: the least-squares solution produces the point in the column space nearest to b, and the normal equations encode the orthogonality condition that defines "nearest."
The Problem
The system Ax = b may have no solution — b may not lie in the column space of A. This is typical when the system is overdetermined: more equations than unknowns, with the equations imposing contradictory constraints.
When no exact solution exists, the goal shifts from solving Ax=b to minimizing the error:
x̂ = argminₓ ∥Ax − b∥²

The quantity ∥Ax − b∥² = Σᵢ (Ax − b)ᵢ² is the sum of squared residuals. The vector x̂ that minimizes this sum is the least-squares solution.
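As a concrete check (a minimal sketch assuming NumPy; the data here are illustrative, not from the text), `np.linalg.lstsq` minimizes this sum directly:

```python
import numpy as np

# Overdetermined system: 3 equations, 2 unknowns (illustrative data).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# np.linalg.lstsq minimizes ||Ax - b||^2 over x.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

residual = b - A @ x_hat
print(x_hat)                  # least-squares solution
print(np.sum(residual**2))    # sum of squared residuals
```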
The Geometric Interpretation
The set of all vectors Ax as x ranges over Rⁿ is the column space of A. Minimizing ∥Ax − b∥ means finding the point in the column space closest to b. That closest point is the orthogonal projection b̂ = proj_Col(A) b.
The least-squares solution x̂ satisfies Ax̂ = b̂ — it produces the projection, not the original b. The residual r = b − Ax̂ = b − b̂ is the component of b orthogonal to the column space. It lies in Col(A)⊥ = Null(Aᵀ).
The orthogonality condition Aᵀr = 0 — the residual is perpendicular to every column of A — is the geometric content of the least-squares solution. It is this condition that leads to the normal equations.
The Normal Equations
The orthogonality condition Aᵀ(b − Ax̂) = 0 rearranges to
AᵀA x̂ = Aᵀb
These are the normal equations. They form a square n×n system regardless of the shape of A.
The matrix AᵀA is always symmetric and positive semi-definite. When A has full column rank (the columns are linearly independent), AᵀA is positive definite and invertible, giving a unique least-squares solution:
x̂ = (AᵀA)⁻¹Aᵀb
When A does not have full column rank, AᵀA is singular and the normal equations have infinitely many solutions. All produce the same projection b̂ = Ax̂, but the x̂ vectors differ. The minimum-norm solution — the one with smallest ∥x̂∥ — is selected by the pseudoinverse.
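A small NumPy sketch of both cases, using illustrative data and a deliberately duplicated column to break full rank:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 3.0, 6.0, 7.0])

# Full column rank: the normal equations have a unique solution.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Make A rank-deficient by duplicating its second column.
A_def = np.column_stack([A, A[:, 1]])

# A_def.T @ A_def is singular, so solving the normal equations directly
# would fail; np.linalg.pinv selects the minimum-norm solution instead.
x_min = np.linalg.pinv(A_def) @ b

# Every least-squares solution yields the same projection b_hat.
print(np.allclose(A @ x_hat, A_def @ x_min))
```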
Worked Example: Fitting a Line
Fit a line y = c₀ + c₁x to the data points (1,2), (2,3), (3,6), (4,7).
The model yᵢ = c₀ + c₁xᵢ for each data point gives the system Ac = y with

A = [ 1  1 ]        y = [ 2 ]
    [ 1  2 ]            [ 3 ]
    [ 1  3 ]            [ 6 ]
    [ 1  4 ]            [ 7 ]
Four equations in two unknowns — overdetermined. The normal equations AᵀA ĉ = Aᵀy become

[  4  10 ] [ c₀ ]   [ 18 ]
[ 10  30 ] [ c₁ ] = [ 54 ]

with solution c₀ = 0, c₁ = 1.8.
The best-fit line is y = 1.8x. The residuals are 2 − 1.8 = 0.2, 3 − 3.6 = −0.6, 6 − 5.4 = 0.6, 7 − 7.2 = −0.2. Their sum of squares 0.04 + 0.36 + 0.36 + 0.04 = 0.8 is the minimum achievable error for any line through these data.
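Solving the normal equations for these points in NumPy (a minimal sketch; at this scale, solving the normal equations directly is fine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])

# Design matrix: a column of ones and a column of x values.
A = np.column_stack([np.ones_like(x), x])

# Normal equations: (A^T A) c = A^T y.
c = np.linalg.solve(A.T @ A, A.T @ y)
r = y - A @ c
print(c)              # fitted coefficients [c0, c1]
print(np.sum(r**2))   # residual sum of squares
```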
Worked Example: Fitting a Parabola
Fit a parabola y = c₀ + c₁x + c₂x² to the same data (1,2), (2,3), (3,6), (4,7).
The design matrix gains a third column:
A = [ 1  1   1 ]
    [ 1  2   4 ]
    [ 1  3   9 ]
    [ 1  4  16 ]
The normal equations AᵀA ĉ = Aᵀy now form a 3×3 system. The machinery is identical — only the model matrix changes. A higher-degree model provides a closer fit (the residual sum of squares cannot increase when the model gains flexibility), but it also risks fitting noise rather than signal.
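A sketch comparing the quadratic and linear fits on the same data (assuming NumPy; `lstsq` is used here rather than forming AᵀA explicitly):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])

# Quadratic design matrix: columns 1, x, x^2.
A2 = np.column_stack([np.ones_like(x), x, x**2])
c2, *_ = np.linalg.lstsq(A2, y, rcond=None)
rss2 = np.sum((y - A2 @ c2) ** 2)

# Linear design matrix for comparison: the first two columns.
A1 = A2[:, :2]
c1, *_ = np.linalg.lstsq(A1, y, rcond=None)
rss1 = np.sum((y - A1 @ c1) ** 2)

# The extra column can only lower (or match) the residual sum of squares.
print(rss1, rss2)
```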
The framework generalizes to any linear model: y = c₀f₀(x) + c₁f₁(x) + ⋯ + cₖfₖ(x) where the functions fᵢ are chosen in advance. Each choice produces a different design matrix A, and the normal equations produce the best coefficients in the least-squares sense.
The Projection Matrix
The projection of b onto the column space is b̂ = Pb where
P = A(AᵀA)⁻¹Aᵀ
When the columns of A are orthonormal (A = Q with QᵀQ = I), this simplifies to P = QQᵀ.
The projection matrix is symmetric (Pᵀ = P) and idempotent (P² = P). The complementary matrix I − P projects onto the orthogonal complement Col(A)⊥ and extracts the residual: r = (I − P)b.
The matrix A⁺ = (AᵀA)⁻¹Aᵀ (when A has full column rank) is called the left pseudoinverse of A. The least-squares solution is x̂ = A⁺b.
The pseudoinverse satisfies A⁺A = Iₙ (it is a left inverse of A), but AA⁺ ≠ Iₘ in general — the product AA⁺ equals the projection matrix P.
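These identities are easy to confirm numerically (a sketch with an illustrative full-column-rank A, assuming NumPy):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])

# Projection onto Col(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P, P.T))      # symmetric
print(np.allclose(P @ P, P))    # idempotent

# With an orthonormal basis Q for Col(A), P = Q Q^T.
Q, _ = np.linalg.qr(A)
print(np.allclose(P, Q @ Q.T))

# Left pseudoinverse: A_plus @ A = I, while A @ A_plus = P.
A_plus = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(A_plus @ A, np.eye(2)))
print(np.allclose(A @ A_plus, P))
```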
When A does not have full column rank, the Moore-Penrose pseudoinverse A⁺ is defined through the singular value decomposition: if A = UΣVᵀ, then A⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts the nonzero singular values and transposes the shape. The Moore-Penrose pseudoinverse gives the minimum-norm least-squares solution — the x̂ of smallest length among all minimizers of ∥Ax − b∥.
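The SVD construction can be sketched directly (assuming NumPy; the rank-deficient matrix and tolerance choice below are illustrative):

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])   # rank 2: the last two columns are equal
b = np.array([2.0, 3.0, 6.0, 7.0])

# Pseudoinverse via the SVD: A+ = V S+ U^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]   # invert only the nonzero singular values
A_plus = Vt.T @ np.diag(s_inv) @ U.T

# Minimum-norm least-squares solution.
x_min = A_plus @ b
print(np.allclose(A_plus, np.linalg.pinv(A)))
```

Because the duplicated columns must share the total coefficient, the minimum-norm solution splits it evenly between them.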
Least Squares via QR
The QR decomposition A = QR provides a numerically superior route to the least-squares solution, avoiding the explicit formation of AᵀA.
Substituting A = QR into AᵀA x̂ = Aᵀb gives RᵀQᵀQR x̂ = RᵀQᵀb. Since QᵀQ = I, this simplifies to RᵀR x̂ = RᵀQᵀb. Multiplying both sides by (Rᵀ)⁻¹:
R x̂ = Qᵀb
The right-hand side Qᵀb is a vector of n dot products. The left-hand side is an upper triangular system, solved by back substitution.
The critical advantage is numerical. Forming ATA explicitly squares the condition number of A, amplifying rounding errors. The QR approach works with Q and R directly, preserving the original conditioning. This is why QR-based least squares is the standard algorithm in numerical software, from MATLAB to Python's NumPy.
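A sketch of the QR route in NumPy; `np.linalg.solve` stands in for a dedicated triangular solver (SciPy's `solve_triangular` would exploit the structure of R):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 3.0, 6.0, 7.0])

# Reduced QR: Q is 4x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b; since R is triangular, back substitution suffices.
x_hat = np.linalg.solve(R, Q.T @ b)
print(x_hat)
```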
Regression as Least Squares
The entire framework of linear regression is a least-squares problem in matrix form.
Simple linear regression fits y = β₀ + β₁x + ϵ to data (xᵢ, yᵢ). The design matrix has a column of ones and a column of xᵢ values. The normal equations produce the slope β̂₁ and intercept β̂₀ that minimize Σ(yᵢ − β₀ − β₁xᵢ)².
Multiple linear regression fits y = Xβ + ϵ where X is the design matrix with rows for observations and columns for predictors. The normal equations XᵀX β̂ = Xᵀy give the coefficient estimates.
The projection matrix P = X(XᵀX)⁻¹Xᵀ is called the hat matrix in statistics because it puts the "hat" on y: ŷ = Py. The residual vector e = y − ŷ = (I − P)y is the component of y orthogonal to the column space of X, and ∥e∥² is the residual sum of squares.
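In NumPy (a sketch with illustrative data; the hat matrix is formed explicitly only for demonstration, which would be too costly for large n):

```python
import numpy as np

# Illustrative data: four observations, one predictor plus an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 7.0])
X = np.column_stack([np.ones_like(x), x])

# Hat matrix: puts the hat on y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
e = (np.eye(len(y)) - H) @ y        # residual vector

print(np.sum(e**2))                 # residual sum of squares
print(np.allclose(X.T @ e, 0.0))    # residuals orthogonal to Col(X)
```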
The Minimum Error
The least-squares error ∥r∥ = ∥b − Ax̂∥ is the distance from b to the column space of A. It is the length of the orthogonal component of b with respect to Col(A).
By the Pythagorean theorem, since b̂ = Ax̂ and r = b − b̂ are perpendicular:
∥b∥² = ∥b̂∥² + ∥r∥²
The error is what remains after the projection accounts for as much of b as possible. The ratio ∥b̂∥²/∥b∥² (computed with centered data in regression) is the coefficient of determination R² — the fraction of the total variation explained by the model. An R² close to 1 means the column space captures nearly all of b; close to 0 means the model explains little.
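Both claims can be checked numerically (a sketch assuming NumPy, with illustrative data; the R² here uses the centered total sum of squares, which is valid because the design matrix contains an intercept column):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([2.0, 3.0, 6.0, 7.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_hat
r = b - b_hat

# Pythagorean split: ||b||^2 = ||b_hat||^2 + ||r||^2.
print(np.isclose(b @ b, b_hat @ b_hat + r @ r))

# R^2 from centered data (design matrix includes an intercept column).
ss_tot = np.sum((b - b.mean())**2)
r2 = 1 - np.sum(r**2) / ss_tot
print(r2)
```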
The error is zero if and only if b∈Col(A) — if and only if the original system Ax=b has an exact solution. In that case, the least-squares solution is the exact solution, and the two problems coincide.