Mathematical Proof: RSS/σ² ~ χ²ₙ₋ₚ
Model Assumptions
Normal Linear Model:
- y = Xβ + ε
- ε ~ N(0, σ²I) [multivariate normal]
- X is n × p design matrix with rank(X) = p
- RSS = ||y - Xβ̂||² = residual sum of squares
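As a concrete reference point, here is a minimal NumPy sketch that simulates data from this model and computes RSS; the sizes n = 50 and p = 3, the coefficients, σ, and the random seed are arbitrary illustrative choices, not values from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3            # illustrative sample size and number of parameters
sigma = 2.0             # assumed true error standard deviation (demo only)

# Design matrix with an intercept column, plus arbitrary true coefficients.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -0.5, 2.0])

eps = rng.normal(scale=sigma, size=n)   # ε ~ N(0, σ²I)
y = X @ beta + eps                      # y = Xβ + ε

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: β̂ = (X'X)⁻¹X'y
rss = np.sum((y - X @ beta_hat) ** 2)          # RSS = ||y - Xβ̂||²
print(rss)
```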
Foundational Concepts
What is σ²?
σ² is the true error variance (population parameter):
- σ² = Var(εᵢ): Variance of individual error terms (scalar)
- σ² = Var(yᵢ | X): Conditional variance of response given predictors
- Population parameter: Fixed, unknown constant describing the data-generating process
Why Var(yᵢ | X) = σ²?
Given yᵢ = Xᵢβ + εᵢ and conditioning on X:
Var(yᵢ | X) = Var(Xᵢβ + εᵢ | X)
= Var(Xᵢβ | X) + Var(εᵢ | X) [no cross term: Xᵢβ is a constant given X, so Cov(Xᵢβ, εᵢ | X) = 0]
= 0 + Var(εᵢ) [variance of a constant is 0; Var(εᵢ | X) = Var(εᵢ) since εᵢ ⊥ X]
= σ²
This works because:
- εᵢ ⊥ X (errors independent of design matrix)
- Var(εᵢ | X) = Var(εᵢ) (homoscedasticity assumption)
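As a quick numerical sanity check of this claim, the sketch below holds one design row xᵢ fixed (i.e., conditions on X), redraws only the error term, and compares the sample variance of yᵢ to σ². The row values, coefficients, and σ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
beta = np.array([1.0, -0.5, 2.0])
x_i = np.array([1.0, 0.3, -1.2])   # one fixed design row (conditioning on X)

# x_i'β is fixed; only ε_i varies across replications.
reps = 200_000
y_i = x_i @ beta + rng.normal(scale=sigma, size=reps)

print(np.var(y_i), sigma**2)       # sample variance of y_i ≈ σ² = 4.0
```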
Step 1: Express RSS in Terms of Residuals
The OLS estimator: β̂ = (X'X)⁻¹X'y
Dimensions of (X'X)⁻¹X'y:
Let's verify the dimensions work out correctly:
- X: n × p matrix (n observations, p parameters)
- X': p × n matrix (transpose swaps dimensions)
- X'X: (p × n)(n × p) = p × p matrix
- (X'X)⁻¹: p × p matrix (inverse of square matrix has same dimensions)
- y: n × 1 column vector
- X'y: (p × n)(n × 1) = p × 1 vector
- (X'X)⁻¹X'y: (p × p)(p × 1) = p × 1 vector
So β̂ is a p × 1 coefficient vector, as expected.
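The same bookkeeping can be checked mechanically. In the sketch below (same illustrative n = 50, p = 3 as before), NumPy represents the n × 1 and p × 1 column vectors as 1-D arrays, but the shapes line up exactly as in the list above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))        # arbitrary full-rank design
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y       # (X'X)⁻¹X'y

print((X.T @ X).shape)             # (p, p)
print(XtX_inv.shape)               # (p, p)
print((X.T @ y).shape)             # (p,)  -- the p × 1 vector X'y
print(beta_hat.shape)              # (p,)  -- β̂ has p entries, as expected
```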
The fitted values: ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy
Where H = X(X'X)⁻¹X' is the hat matrix (projection onto column space of X).
The residuals: ê = y - ŷ = (I - H)y
Therefore: RSS = ê'ê = y'(I - H)'(I - H)y = y'(I - H)y
(Since (I - H)' = I - H and (I - H)² = I - H; both follow from H being symmetric and idempotent, proved in Step 2)
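A short check that the quadratic-form expression agrees with the direct residual computation, on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

rss_direct = np.sum((y - X @ beta_hat) ** 2)    # ||y - Xβ̂||²
rss_quadratic = y @ (np.eye(n) - H) @ y         # y'(I - H)y
print(np.isclose(rss_direct, rss_quadratic))    # True (up to floating point)
```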
Step 2: Properties of the Hat Matrix H
First, prove H is idempotent:
H = X(X'X)⁻¹X'
Step-by-step matrix multiplication:
H² = [X(X'X)⁻¹X'][X(X'X)⁻¹X']
Using associativity of matrix multiplication:
= X(X'X)⁻¹ × [X' × X] × (X'X)⁻¹X'
= X(X'X)⁻¹ × [X'X] × (X'X)⁻¹X'
Now the crucial step - the middle terms X'X and (X'X)⁻¹ cancel:
= X × [(X'X)⁻¹(X'X)] × (X'X)⁻¹X'
= X × [I] × (X'X)⁻¹X'
= X(X'X)⁻¹X'
= H
So H² = H (H is idempotent).
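Numerically, idempotence is easy to confirm on a randomly generated design (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))                 # arbitrary full-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H @ H, H))                 # True: H² = H
```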
Now prove (I - H) properties:
(I - H) is symmetric:
(I - H)' = I' - H' = I - H
(since I' = I and H' = H, proved next)
Proof that H' = H (symmetry): Using the transpose rule (ABC)' = C'B'A':
H' = [X(X'X)⁻¹X']' = (X')'[(X'X)⁻¹]'X'
= X(X'X)⁻¹X' = H    [since (X')' = X and [(X'X)⁻¹]' = [(X'X)']⁻¹ = (X'X)⁻¹, because X'X is symmetric]
(I - H) is idempotent:
(I - H)² = (I - H)(I - H)
= I² - IH - HI + H²
= I - H - H + H [since I² = I, IH = HI = H, H² = H]
= I - H
So (I - H)² = I - H.
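The same kind of spot check covers symmetry and idempotence of both H and I - H (again on an arbitrary simulated design):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                            # M denotes I - H here

print(np.allclose(H, H.T))                   # H' = H
print(np.allclose(M, M.T))                   # (I - H)' = I - H
print(np.allclose(M @ M, M))                 # (I - H)² = I - H
```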
Other key properties:
1. Proof that rank(H) = p:
The hat matrix H = X(X'X)⁻¹X' projects vectors onto the column space of X.
What is the Column Space of X? The column space col(X) is the set of all possible linear combinations of the columns of X. If X is n × p with rank(X) = p, then col(X) is p-dimensional.
Why H Projects onto col(X): When we compute ŷ = Hy, we get:
Hy = X(X'X)⁻¹X'y = X · [(X'X)⁻¹X'y] = X · β̂
Since β̂ is a p×1 vector of coefficients, Xβ̂ is a linear combination of the columns of X, so Hy ∈ col(X).
For any projection matrix P onto a subspace S: rank(P) = dim(S)
Therefore: rank(H) = dim(col(X)) = p
2. Proof that rank(I - H) = n - p:
Since H is a projection onto a p-dimensional subspace:
- dim(range(H)) = p
- dim(null(H)) = n - p [rank-nullity theorem: H maps ℝⁿ → ℝⁿ and rank(H) = p]
Now, (I - H) is the projection onto the orthogonal complement of col(X):
- range(I - H) = null(H) = [col(X)]⊥
- dim([col(X)]⊥) = n - p
Therefore: rank(I - H) = n - p
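Both rank statements can be verified numerically; for idempotent matrices the rank also equals the trace, which gives the same answer (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.linalg.matrix_rank(H), p)             # 3  3
print(np.linalg.matrix_rank(M), n - p)         # 47 47
print(round(np.trace(H)), round(np.trace(M)))  # 3  47 (rank = trace for idempotent matrices)
```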
3. Proof that H(I - H) = 0:
Direct algebraic proof:
H(I - H) = H·I - H·H = H - H²
Since H is idempotent (H² = H):
H(I - H) = H - H = 0
Geometric interpretation:
- H projects vectors onto col(X)
- (I - H) projects vectors onto [col(X)]⊥ (orthogonal complement)
- Since col(X) and [col(X)]⊥ are orthogonal subspaces, their projection matrices have zero product
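A quick numerical illustration of both the algebraic identity and its geometric meaning (fitted values orthogonal to residuals), on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.allclose(H @ M, np.zeros((n, n))))  # H(I - H) = 0
print(np.isclose((H @ y) @ (M @ y), 0.0))    # ŷ = Hy is orthogonal to ê = (I - H)y
```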
The Special Nature of the Hat Matrix: Why Hy = Xβ̂ is Fundamental
What Makes H = X(X'X)⁻¹X' Different from Generic Matrix Relationships
The equation Hy = Xβ̂ is fundamentally different from a generic equality like Hv₁ = Xv₂ because the vector β̂ is not arbitrary, but is explicitly defined in terms of X and y through the least-squares optimization process.
1. Specific vs. Arbitrary Vectors
For Hy = Xβ̂:
- The vectors y and β̂ are specific to the linear regression problem
- y is the observed response vector, and β̂ is the unique vector of coefficients that minimizes the sum of squared residuals
- The matrix H is not just any matrix; it is defined by the relationship H = X(X'X)⁻¹X'
- The equation Hy = Xβ̂ holds by definition and construction as a result of the least-squares derivation
For Hv₁ = Xv₂:
- The vectors v₁ and v₂ are arbitrary - they could be any vectors for which the equality happens to hold
- This equation simply states that at some specific point, the output of two different matrix transformations is the same
- It does not establish any general relationship between the matrices themselves
2. The Nature of the Transformation
The Hat Matrix H:
- The equation Hy = Xβ̂ is a statement about orthogonal projection
- The hat matrix H projects the vector y onto the column space of X
- The resulting vector ŷ = Hy is guaranteed to lie within the column space of X
- Because ŷ is in the column space of X, it can be written as a linear combination of the columns of X, which is precisely what Xβ̂ represents
- This is a fundamental geometric relationship derived from a specific optimization process (minimizing least-squares error)
Generic Matrix Multiplication:
- The equation Hv₁ = Xv₂ says nothing about projection
- It just states that the vector Hv₁ (a linear combination of the columns of H) is equal to the vector Xv₂ (a linear combination of the columns of X)
- There is no underlying geometric principle like “orthogonal projection” that guarantees this equality for specific matrices and vectors
3. Column Space Relationships
Why col(H) = col(X):
The statement that “H projects onto the column space of X” directly defines the relationship between the two matrices’ column spaces.
col(H) is a subset of col(X):
- Any vector in col(H) can be written as Hv for some vector v
- By definition, H projects any vector into the column space of X
- This means that for any vector v, the resulting vector Hv must lie within the subspace defined by the columns of X
- Therefore, if every vector of the form Hv is an element of col(X), then the entire set col(H) must be a subset of col(X)
col(X) is a subset of col(H):
- Any vector in col(X) can be written as Xβ for some vector β
- The projection matrix H has a special property: for any vector that is already in the subspace it projects onto, H leaves that vector unchanged
- We can verify: H(Xβ) = X(X'X)⁻¹X'(Xβ) = X[(X'X)⁻¹(X'X)]β = X(I)β = Xβ
- Since any vector Xβ ∈ col(X) can also be written as H(Xβ), every vector in col(X) is also in col(H)
Conclusion: Because col(H) ⊆ col(X) and col(X) ⊆ col(H), we have col(H) = col(X).
This equality of subspaces guarantees that rank(H) = dim(col(H)) = dim(col(X)) = rank(X) = p.
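Both inclusions can be illustrated numerically: HX = X shows that H fixes col(X), and appending the columns of H to X does not increase the rank, so col(H) adds nothing beyond col(X). The sizes are again arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 50, 3
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

# col(X) ⊆ col(H): H fixes every vector already in col(X), so HX = X.
print(np.allclose(H @ X, X))

# col(H) ⊆ col(X): stacking the columns of H next to X leaves the rank at p.
print(np.linalg.matrix_rank(np.hstack([X, H])) == p)
```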
4. Implications for Rank
For the regression context:
- Because H is defined to project onto the column space of X, we have col(H) = col(X)
- This necessarily implies that their ranks are equal: rank(H) = rank(X) = p
For generic equations:
- The single vector equality Hv₁ = Xv₂ does not imply that the column spaces of H and X are the same
- It does not imply that their ranks are equal
- Counterexamples can easily be found where the ranks differ
Summary
The special property that distinguishes the linear regression equation is that the matrices and vectors involved have a specific, non-arbitrary relationship based on orthogonal projection. The hat matrix H is explicitly designed to project vectors onto the column space of X, which automatically links its rank to the dimension of that column space. A generic vector equality like Hv₁ = Xv₂ carries no such geometric constraints.
Step 3: Distribution of y Under Normality
Since ε ~ N(0, σ²I): y = Xβ + ε ~ N(Xβ, σ²I)
Step 4: Key Transformation - Standardize y
Let z = (y - Xβ)/σ, so z ~ N(0, I) (standard multivariate normal).
Then: y = Xβ + σz
Substituting into RSS:
RSS = y'(I - H)y = (Xβ + σz)'(I - H)(Xβ + σz)
Step 5: Simplify Using Projection Properties
Expanding:
RSS = (Xβ)'(I - H)(Xβ) + 2σ(Xβ)'(I - H)z + σ²z'(I - H)z
Crucial observation: Since H projects onto the column space of X, we have:
- HXβ = Xβ (Xβ is in the column space of X)
- Therefore: (I - H)Xβ = Xβ - HXβ = Xβ - Xβ = 0
This eliminates the first two terms; the cross term vanishes because (I - H) is symmetric, so (Xβ)'(I - H) = [(I - H)Xβ]' = 0. Hence: RSS = σ²z'(I - H)z
Therefore: RSS/σ² = z'(I - H)z
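A numerical check of Steps 4 and 5 on simulated data (all parameter values are illustrative): the projection kills Xβ, and RSS/σ² reduces exactly to z'(I - H)z.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sigma = 50, 3, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 2.0])
z = rng.normal(size=n)                       # z ~ N(0, I)
y = X @ beta + sigma * z                     # y = Xβ + σz

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(np.allclose(M @ (X @ beta), 0.0))      # (I - H)Xβ = 0
rss = y @ M @ y
print(np.isclose(rss / sigma**2, z @ M @ z)) # RSS/σ² = z'(I - H)z
```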
Step 6: Apply Chi-Square Characterization Theorem
Theorem: If z ~ N(0, I) and A is symmetric and idempotent with rank(A) = r, then z'Az ~ χ²ᵣ.
Verification for (I - H):
- (I - H) is symmetric: (I - H)' = I - H ✓
- (I - H) is idempotent: (I - H)² = I - H ✓
- rank(I - H) = n - p: proved in Step 2 (I - H projects onto the (n - p)-dimensional orthogonal complement of col(X)) ✓
Step 7: Final Result
Since:
- z ~ N(0, I)
- (I - H) is symmetric and idempotent with rank n - p
By the chi-square characterization theorem: RSS/σ² = z'(I - H)z ~ χ²ₙ₋ₚ
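As an empirical illustration of the result, the sketch below simulates RSS/σ² repeatedly with X held fixed and compares the sample mean and variance with the χ²ₙ₋ₚ values n - p and 2(n - p). All the specific numbers (n, p, σ, β, number of replications) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p, sigma = 50, 3, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 2.0])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

# Simulate RSS/σ² many times, redrawing only the errors.
reps = 50_000
samples = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    samples[r] = (y @ M @ y) / sigma**2

# A χ² variable with n - p degrees of freedom has mean n - p and variance 2(n - p).
print(samples.mean(), n - p)          # ≈ 47
print(samples.var(), 2 * (n - p))     # ≈ 94
```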
Intuitive Explanation
The degrees of freedom are n - p because:
- We start with n independent observations
- We estimate p parameters (β₀, β₁, …, βₚ₋₁)
- Each parameter estimation “uses up” one degree of freedom
- Remaining degrees of freedom: n - p
This is why the residual sum of squares, when properly scaled, follows a chi-square distribution with n - p degrees of freedom.
The Deep Geometric Connection
These properties reveal the fundamental geometry:
- H projects onto the model space (span of X columns) - dimension p
- (I - H) projects onto the error space (orthogonal to model space) - dimension n - p
- These spaces are orthogonal and complementary: their dimensions add to n
- The orthogonality H(I - H) = 0 ensures (together with normality) that Hy and (I - H)y are independent, and hence so are the model and residual sums of squares
This is why RSS has n - p degrees of freedom: it lives entirely in the (n - p)-dimensional error space, orthogonal to the p-dimensional model space where our parameter estimates live.