Mathematical Proof: RSS/σ² ~ χ²ₙ₋ₚ
Model Assumptions
Normal Linear Model:
- y = Xβ + ε
- ε ~ N(0, σ²I) [multivariate normal]
- X is n × p design matrix with rank(X) = p
- RSS = ||y - Xβ̂||² = residual sum of squares
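As a concrete reference point, here is a minimal NumPy sketch that simulates data from this model and computes RSS; the sizes n = 50 and p = 3, the coefficients, σ, and the random seed are arbitrary illustrative choices, not values from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3            # illustrative sample size and number of parameters
sigma = 2.0             # assumed true error standard deviation (demo only)

# Design matrix with an intercept column, plus arbitrary true coefficients.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -0.5, 2.0])

eps = rng.normal(scale=sigma, size=n)   # ε ~ N(0, σ²I)
y = X @ beta + eps                      # y = Xβ + ε

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: β̂ = (X'X)⁻¹X'y
rss = np.sum((y - X @ beta_hat) ** 2)          # RSS = ||y - Xβ̂||²
print(rss)
```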
Foundational Concepts
What is σ²?
σ² is the true error variance (population parameter):
- σ² = Var(εᵢ): Variance of individual error terms (scalar)
- σ² = Var(yᵢ | X): Conditional variance of response given predictors
- Population parameter: Fixed, unknown constant describing the data-generating process
Why Var(yᵢ | X) = σ²?
Given yᵢ = Xᵢβ + εᵢ and conditioning on X:
Var(yᵢ | X) = Var(Xᵢβ + εᵢ | X)
= Var(Xᵢβ | X) + Var(εᵢ | X) [no cross term: Xᵢβ is a constant given X, so Cov(Xᵢβ, εᵢ | X) = 0]
= 0 + Var(εᵢ) [variance of a constant is 0; Var(εᵢ | X) = Var(εᵢ) since εᵢ ⊥ X]
= σ²
This works because:
- εᵢ ⊥ X (errors independent of design matrix)
- Var(εᵢ | X) = Var(εᵢ) (homoscedasticity assumption)
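As a quick numerical sanity check of this claim, the sketch below holds one design row xᵢ fixed (i.e., conditions on X), redraws only the error term, and compares the sample variance of yᵢ to σ². The row values, coefficients, and σ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
beta = np.array([1.0, -0.5, 2.0])
x_i = np.array([1.0, 0.3, -1.2])   # one fixed design row (conditioning on X)

# x_i'β is fixed; only ε_i varies across replications.
reps = 200_000
y_i = x_i @ beta + rng.normal(scale=sigma, size=reps)

print(np.var(y_i), sigma**2)       # sample variance of y_i ≈ σ² = 4.0
```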
Step 1: Express RSS in Terms of Residuals
The OLS estimator: β̂ = (X'X)⁻¹X'y
Dimensions of (X'X)⁻¹X'y:
Let's verify the dimensions work out correctly:
- X: n × p matrix (n observations, p parameters)
- X': p × n matrix (transpose swaps dimensions)
- X'X: (p × n)(n × p) = p × p matrix
- (X'X)⁻¹: p × p matrix (inverse of square matrix has same dimensions)
- y: n × 1 column vector
- X'y: (p × n)(n × 1) = p × 1 vector
- (X'X)⁻¹X'y: (p × p)(p × 1) = p × 1 vector
So β̂ is a p × 1 coefficient vector, as expected.
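The same bookkeeping can be checked mechanically. In the sketch below (same illustrative n = 50, p = 3 as before), NumPy represents the n × 1 and p × 1 column vectors as 1-D arrays, but the shapes line up exactly as in the list above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))        # arbitrary full-rank design
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y       # (X'X)⁻¹X'y

print((X.T @ X).shape)             # (p, p)
print(XtX_inv.shape)               # (p, p)
print((X.T @ y).shape)             # (p,)  -- the p × 1 vector X'y
print(beta_hat.shape)              # (p,)  -- β̂ has p entries, as expected
```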
The fitted values: ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy
Where H = X(X'X)⁻¹X' is the hat matrix (projection onto column space of X).
The residuals: ê = y - ŷ = (I - H)y
Therefore: RSS = ê'ê = y'(I - H)'(I - H)y = y'(I - H)y
(Since (I - H)' = I - H and (I - H)² = I - H; both follow from H being symmetric and idempotent, proved in Step 2)
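A short check that the quadratic-form expression agrees with the direct residual computation, on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

rss_direct = np.sum((y - X @ beta_hat) ** 2)    # ||y - Xβ̂||²
rss_quadratic = y @ (np.eye(n) - H) @ y         # y'(I - H)y
print(np.isclose(rss_direct, rss_quadratic))    # True (up to floating point)
```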
Step 2: Properties of the Hat Matrix H
First, prove H is idempotent:
H = X(X'X)⁻¹X'
Step-by-step matrix multiplication:
H² = [X(X'X)⁻¹X'][X(X'X)⁻¹X']
Using associativity of matrix multiplication:
= X(X'X)⁻¹ × [X' × X] × (X'X)⁻¹X'
= X(X'X)⁻¹ × [X'X] × (X'X)⁻¹X'
Now the crucial step - the middle terms X'X and (X'X)⁻¹ cancel:
= X × [(X'X)⁻¹(X'X)] × (X'X)⁻¹X'
= X × [I] × (X'X)⁻¹X'
= X(X'X)⁻¹X'
= H
So H² = H (H is idempotent).
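Numerically, idempotence is easy to confirm on a randomly generated design (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))                 # arbitrary full-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H @ H, H))                 # True: H² = H
```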
Now prove (I - H) properties:
(I - H) is symmetric:
(I - H)' = I' - H' = I - H
(since I' = I and H' = H, proved next)
Proof that H' = H (symmetry): Using the transpose rule (ABC)' = C'B'A':
H' = [X(X'X)⁻¹X']' = (X')'[(X'X)⁻¹]'X'
= X(X'X)⁻¹X' = H    [since (X')' = X and [(X'X)⁻¹]' = [(X'X)']⁻¹ = (X'X)⁻¹, because X'X is symmetric]
(I - H) is idempotent:
(I - H)² = (I - H)(I - H)
= I² - IH - HI + H²
= I - H - H + H [since I² = I, IH = HI = H, H² = H]
= I - H
So (I - H)² = I - H.
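The same kind of spot check covers symmetry and idempotence of both H and I - H (again on an arbitrary simulated design):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                            # M denotes I - H here

print(np.allclose(H, H.T))                   # H' = H
print(np.allclose(M, M.T))                   # (I - H)' = I - H
print(np.allclose(M @ M, M))                 # (I - H)² = I - H
```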
Other key properties:
1. Proof that rank(H) = p:
The hat matrix H = X(X'X)⁻¹X' projects vectors onto the column space of X.
What is the Column Space of X? The column space col(X) is the set of all possible linear combinations of the columns of X. If X is n × p with rank(X) = p, then col(X) is p-dimensional.
Why H Projects onto col(X): When we compute ŷ = Hy, we get:
Hy = X(X'X)⁻¹X'y = X · [(X'X)⁻¹X'y] = X · β̂
Since β̂ is a p×1 vector of coefficients, Xβ̂ is a linear combination of the columns of X, so Hy ∈ col(X).
For any projection matrix P onto a subspace S: rank(P) = dim(S)
Therefore: rank(H) = dim(col(X)) = p
2. Proof that rank(I - H) = n - p:
Since H is a projection onto a p-dimensional subspace:
- dim(range(H)) = p
- dim(null(H)) = n - p [rank-nullity theorem: H maps ℝⁿ → ℝⁿ and rank(H) = p]
Now, (I - H) is the projection onto the orthogonal complement of col(X):
- range(I - H) = null(H) = [col(X)]⊥
- dim([col(X)]⊥) = n - p
Therefore: rank(I - H) = n - p
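Both rank statements can be verified numerically; for idempotent matrices the rank also equals the trace, which gives the same answer (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.linalg.matrix_rank(H), p)             # 3  3
print(np.linalg.matrix_rank(M), n - p)         # 47 47
print(round(np.trace(H)), round(np.trace(M)))  # 3  47 (rank = trace for idempotent matrices)
```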
3. Proof that H(I - H) = 0:
Direct algebraic proof:
H(I - H) = H·I - H·H = H - H²
Since H is idempotent (H² = H):
H(I - H) = H - H = 0
Geometric interpretation:
- H projects vectors onto col(X)
- (I - H) projects vectors onto [col(X)]⊥ (orthogonal complement)
- Since col(X) and [col(X)]⊥ are orthogonal subspaces, their projection matrices have zero product
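A quick numerical illustration of both the algebraic identity and its geometric meaning (fitted values orthogonal to residuals), on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.allclose(H @ M, np.zeros((n, n))))  # H(I - H) = 0
print(np.isclose((H @ y) @ (M @ y), 0.0))    # ŷ = Hy is orthogonal to ê = (I - H)y
```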
The Special Nature of the Hat Matrix: Why Hy = Xβ̂ is Fundamental
What Makes H = X(X'X)⁻¹X' Different from Generic Matrix Relationships
The equation Hy = Xβ̂ is fundamentally different from a generic equality like Hv₁ = Xv₂ because the vector β̂ is not arbitrary, but is explicitly defined in terms of X and y through the least-squares optimization process.
1. Specific vs. Arbitrary Vectors
For Hy = Xβ̂:
- The vectors y and β̂ are specific to the linear regression problem
- y is the observed response vector, and β̂ is the unique vector of coefficients that minimizes the sum of squared residuals
- The matrix H is not just any matrix; it is defined by the relationship H = X(X'X)⁻¹X'
- The equation Hy = Xβ̂ holds by definition and construction as a result of the least-squares derivation
For Hv₁ = Xv₂:
- The vectors v₁ and v₂ are arbitrary - they could be any vectors for which the equality happens to hold
- This equation simply states that at some specific point, the output of two different matrix transformations is the same
- It does not establish any general relationship between the matrices themselves
2. The Nature of the Transformation
The Hat Matrix H:
- The equation Hy = Xβ̂ is a statement about orthogonal projection
- The hat matrix H projects the vector y onto the column space of X
- The resulting vector ŷ = Hy is guaranteed to lie within the column space of X
- Because ŷ is in the column space of X, it can be written as a linear combination of the columns of X, which is precisely what Xβ̂ represents
- This is a fundamental geometric relationship derived from a specific optimization process (minimizing least-squares error)
Generic Matrix Multiplication:
- The equation Hv₁ = Xv₂ says nothing about projection
- It just states that the vector Hv₁ (a linear combination of the columns of H) is equal to the vector Xv₂ (a linear combination of the columns of X)
- There is no underlying geometric principle like “orthogonal projection” that guarantees this equality for specific matrices and vectors
3. Column Space Relationships
Why col(H) = col(X):
The statement that “H projects onto the column space of X” directly defines the relationship between the two matrices’ column spaces.
col(H) is a subset of col(X):
- Any vector in col(H) can be written as Hv for some vector v
- By definition, H projects any vector into the column space of X
- This means that for any vector v, the resulting vector Hv must lie within the subspace defined by the columns of X
- Therefore, if every vector of the form Hv is an element of col(X), then the entire set col(H) must be a subset of col(X)
col(X) is a subset of col(H):
- Any vector in col(X) can be written as Xβ for some vector β
- The projection matrix H has a special property: for any vector that is already in the subspace it projects onto, H leaves that vector unchanged
- We can verify: H(Xβ) = X(X'X)⁻¹X'(Xβ) = X[(X'X)⁻¹(X'X)]β = X(I)β = Xβ
- Since any vector Xβ ∈ col(X) can also be written as H(Xβ), every vector in col(X) is also in col(H)
Conclusion: Because col(H) ⊆ col(X) and col(X) ⊆ col(H), we have col(H) = col(X).
This equality of subspaces guarantees that rank(H) = dim(col(H)) = dim(col(X)) = rank(X) = p.
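Both inclusions can be illustrated numerically: HX = X shows that H fixes col(X), and appending the columns of H to X does not increase the rank, so col(H) adds nothing beyond col(X). The sizes are again arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 50, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

# col(X) ⊆ col(H): H fixes every vector already in col(X), so HX = X.
print(np.allclose(H @ X, X))

# col(H) ⊆ col(X): stacking the columns of H next to X leaves the rank at p.
print(np.linalg.matrix_rank(np.hstack([X, H])) == p)
```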
4. Implications for Rank
For the regression context:
- Because H is defined to project onto the column space of X, we have col(H) = col(X)
- This necessarily implies that their ranks are equal: rank(H) = rank(X) = p
For generic equations:
- The single vector equality Hv₁ = Xv₂ does not imply that the column spaces of H and X are the same
- It does not imply that their ranks are equal
- Counterexamples can easily be found where the ranks differ
Summary
The special property that distinguishes the linear regression equation is that the matrices and vectors involved have a specific, non-arbitrary relationship based on orthogonal projection. The hat matrix H is explicitly designed to project vectors onto the column space of X, which automatically links its rank to the dimension of that column space. A generic vector equality like Hv₁ = Xv₂ carries no such geometric constraints.
Step 3: Distribution of y Under Normality
Since ε ~ N(0, σ²I): y = Xβ + ε ~ N(Xβ, σ²I)
Step 4: Key Transformation - Standardize y
Let z = (y - Xβ)/σ, so z ~ N(0, I) (standard multivariate normal).
Then: y = Xβ + σz
Substituting into RSS:
RSS = y'(I - H)y = (Xβ + σz)'(I - H)(Xβ + σz)
Step 5: Simplify Using Projection Properties
Expanding:
RSS = (Xβ)'(I - H)(Xβ) + 2σ(Xβ)'(I - H)z + σ²z'(I - H)z
Crucial observation: Since H projects onto the column space of X, we have:
- HXβ = Xβ (Xβ is in the column space of X)
- Therefore: (I - H)Xβ = Xβ - HXβ = Xβ - Xβ = 0
This eliminates the first two terms; the cross term vanishes because (I - H) is symmetric, so (Xβ)'(I - H) = [(I - H)Xβ]' = 0. Hence: RSS = σ²z'(I - H)z
Therefore: RSS/σ² = z'(I - H)z
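A numerical check of Steps 4 and 5 on simulated data (all parameter values are illustrative): the projection kills Xβ, and RSS/σ² reduces exactly to z'(I - H)z.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sigma = 50, 3, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 2.0])
z = rng.normal(size=n)                       # z ~ N(0, I)
y = X @ beta + sigma * z                     # y = Xβ + σz

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.allclose(M @ (X @ beta), 0.0))      # (I - H)Xβ = 0
rss = y @ M @ y
print(np.isclose(rss / sigma**2, z @ M @ z)) # RSS/σ² = z'(I - H)z
```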
Step 6: Apply Chi-Square Characterization Theorem
Theorem: If z ~ N(0, I) and A is symmetric and idempotent with rank(A) = r, then z'Az ~ χ²ᵣ.
Verification for (I - H):
- (I - H) is symmetric: (I - H)' = I - H ✓
- (I - H) is idempotent: (I - H)² = I - H ✓
- rank(I - H) = n - p: proved in Step 2 (I - H projects onto the (n - p)-dimensional orthogonal complement of col(X)) ✓
Step 7: Final Result
Since:
- z ~ N(0, I)
- (I - H) is symmetric and idempotent with rank n - p
By the chi-square characterization theorem: RSS/σ² = z'(I - H)z ~ χ²ₙ₋ₚ
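As an empirical illustration of the result, the sketch below simulates RSS/σ² repeatedly with X held fixed and compares the sample mean and variance with the χ²ₙ₋ₚ values n - p and 2(n - p). All the specific numbers (n, p, σ, β, number of replications) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p, sigma = 50, 3, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 2.0])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

# Simulate RSS/σ² many times, redrawing only the errors.
reps = 50_000
samples = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    samples[r] = (y @ M @ y) / sigma**2

# A χ² variable with n - p degrees of freedom has mean n - p and variance 2(n - p).
print(samples.mean(), n - p)          # ≈ 47
print(samples.var(), 2 * (n - p))     # ≈ 94
```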
Intuitive Explanation
The degrees of freedom are n - p because:
- We start with n independent observations
- We estimate p parameters (β₀, β₁, …, βₚ₋₁)
- Each parameter estimation “uses up” one degree of freedom
- Remaining degrees of freedom: n - p
This is why the residual sum of squares, when properly scaled, follows a chi-square distribution with n - p degrees of freedom.
The Deep Geometric Connection
These properties reveal the fundamental geometry:
- H projects onto the model space (span of X columns) - dimension p
- (I - H) projects onto the error space (orthogonal to model space) - dimension n - p
- These spaces are orthogonal and complementary: their dimensions add to n
- The orthogonality H(I - H) = 0 ensures (together with normality) that Hy and (I - H)y are independent, and hence so are the model and residual sums of squares
This is why RSS has n - p degrees of freedom: it lives entirely in the (n - p)-dimensional error space, orthogonal to the p-dimensional model space where our parameter estimates live.