Understanding the t² = F Identity in Linear Regression: A Complete Mathematical Proof

Jan 15, 2024 · Jiyuan (Jay) Liu · 14 min read

Introduction

In linear regression analysis, researchers often encounter two seemingly different approaches to hypothesis testing: the t-test and the F-test. While these tests appear distinct, they share a fundamental relationship when testing single contrasts. This post provides a complete mathematical proof that t² = F when the F-statistic has one degree of freedom in the numerator.

The Linear Model Framework

We begin with the general linear model:

$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n)$$

Where:

  • y: n×1 response vector
  • X: n×p full-rank design matrix
  • β: p×1 vector of unknown regression coefficients
  • ε: error vector with independent components, each ~ N(0, σ²)
  • σ²: the true error variance, representing the population variance of residual errors

This parameter σ² is crucial—it represents:

$$\sigma^2 = \text{Var}(\varepsilon_i) = \text{Var}(y_i | X)$$

Ordinary Least Squares Estimation

The OLS estimator is:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

The Fundamental Variance Property

To understand the variance properties, we need to establish a fundamental result about linear transformations of random vectors.

Key Theorem: For any random vector Z with covariance matrix Σ_Z and constant matrix A:

$$\text{Var}(AZ) = A\Sigma_Z A^\top$$

Complete Step-by-Step Proof of the Variance Property

Let’s prove that Var(Ay) = A Var(y) A^⊤ step by step (writing the random vector as y, since that is how we will apply the result):

Definitions and Initial Setup

This proof relies on the definition of variance for a random vector. For a random vector y, the variance-covariance matrix, Var(y), is defined as:

$$\text{Var}(y) = E[(y - E[y])(y - E[y])^\top]$$

where:

  • E is the expectation operator
  • y is a random vector of size n×1
  • A is a non-random matrix of size m×n
  • A^⊤ is the transpose of matrix A
  • E[y] is the mean vector of y

Step 1: Apply the definition of variance to Var(Ay)

$$\text{Var}(Ay) = E[((Ay) - E[Ay])((Ay) - E[Ay])^\top]$$

Step 2: Use linearity of expectation (E[Ay] = AE[y])

$$\text{Var}(Ay) = E[((Ay) - AE[y])((Ay) - AE[y])^\top]$$

Step 3: Factor out matrix A

$$\text{Var}(Ay) = E[(A(y - E[y]))(A(y - E[y]))^\top]$$

Step 4: Apply the transpose property (PQ)^⊤ = Q^⊤P^⊤

$$(A(y - E[y]))^\top = (y - E[y])^\top A^\top$$

Substituting:

$$\text{Var}(Ay) = E[(A(y - E[y]))((y - E[y])^\top A^\top)]$$

Step 5: Use linearity of expectation (A and A^⊤ are constants)

$$\text{Var}(Ay) = A \cdot E[(y - E[y])(y - E[y])^\top] \cdot A^\top$$

Step 6: Substitute the definition of Var(y)

$$\text{Var}(Ay) = A \cdot \text{Var}(y) \cdot A^\top$$

This completes the derivation, showing that the variance of a linear transformation of a random vector is equal to the transformation matrix times the variance of the original vector, times the transpose of the transformation matrix.
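
As a quick sanity check, the following minimal simulation sketch (my own illustration, with an arbitrary $A$ and $\Sigma_Z$) compares the empirical covariance of $AZ$ with $A\Sigma_Z A^\top$:

```python
# Illustrative simulation check of Var(AZ) = A Sigma_Z A^T.
# The matrices A and Sigma_Z below are arbitrary choices for the example.
import numpy as np

rng = np.random.default_rng(42)
Sigma_Z = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
A = np.array([[1.0, -1.0],
              [0.0,  3.0],
              [2.0,  1.0]])                     # any constant 3x2 matrix

Z = rng.multivariate_normal(np.zeros(2), Sigma_Z, size=100_000)
empirical = np.cov(Z @ A.T, rowvar=False)       # sample covariance of AZ
theoretical = A @ Sigma_Z @ A.T

print(np.round(empirical, 3))
print(np.round(theoretical, 3))                 # the two matrices should be close
```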

Applying to OLS Variance

Using this result with $A = (X^\top X)^{-1} X^\top$ and $\text{Var}(y) = \sigma^2 I_n$:

$$\begin{align} \text{Var}(\hat{\beta}) &= \text{Var}((X^\top X)^{-1} X^\top y) \\ &= (X^\top X)^{-1} X^\top \text{Var}(y) X (X^\top X)^{-1} \\ &= (X^\top X)^{-1} X^\top (\sigma^2 I_n) X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\ &= \sigma^2 (X^\top X)^{-1} \end{align}$$
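
The same property can be checked in the regression setting itself. The sketch below (a hypothetical design, coefficients, and seed of my own choosing) refits the model on many simulated error draws and compares the empirical covariance of $\hat{\beta}$ with $\sigma^2 (X^\top X)^{-1}$:

```python
# Illustrative Monte Carlo check that Cov(beta_hat) is sigma^2 (X^T X)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # fixed design
beta_true = np.array([1.0, -0.5, 2.0])

XtX_inv = np.linalg.inv(X.T @ X)
theory_cov = sigma**2 * XtX_inv

# Redraw the errors many times, refit by OLS, and collect the estimates.
draws = np.array([
    XtX_inv @ X.T @ (X @ beta_true + sigma * rng.normal(size=n))
    for _ in range(20_000)
])
empirical_cov = np.cov(draws, rowvar=False)

print(np.round(theory_cov, 4))
print(np.round(empirical_cov, 4))     # close to the theoretical covariance
```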

Residual Analysis and σ² Estimation

Residual Sum of Squares

The fitted values and residuals are:

  • Fitted values: $\hat{y} = X\hat{\beta}$
  • Residuals: $e = y - \hat{y}$
  • RSS: $\text{RSS} = e^\top e = (y - X\hat{\beta})^\top (y - X\hat{\beta})$

Under normal linear model assumptions:

$$\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$$

Why n-p Degrees of Freedom?

The degrees of freedom calculation follows from:

  • n observations provide n pieces of information
  • p estimated parameters consume p degrees of freedom
  • Remaining degrees of freedom = n - p

Unbiased Variance Estimator

Define the residual variance estimator:

$$s^2 = \frac{\text{RSS}}{n-p}$$

Since E[χ²_k] = k, we have:

$$E[s^2] = E\left[\frac{\text{RSS}}{n-p}\right] = \frac{\sigma^2 E[\chi^2_{n-p}]}{n-p} = \frac{\sigma^2(n-p)}{n-p} = \sigma^2$$

Thus s² is an unbiased estimator of σ².
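
A short Monte Carlo sketch (again with an arbitrary toy design) illustrates this: the average of $s^2$ over many simulated datasets settles near the true $\sigma^2$:

```python
# Illustrative check that s^2 = RSS/(n - p) is approximately unbiased for sigma^2.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = rng.normal(size=p)
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix

s2_draws = []
for _ in range(10_000):
    y = X @ beta_true + sigma * rng.normal(size=n)
    resid = y - H @ y                           # residuals e = (I - H) y
    s2_draws.append(resid @ resid / (n - p))    # s^2 = RSS / (n - p)

print(np.mean(s2_draws), sigma**2)              # the average is close to 2.25
```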

Single Contrast Testing

Setting Up the Hypothesis

We want to test:

$$H_0: c^\top \beta = 0$$

where c ∈ ℝ^p is a specified contrast vector.

The contrast estimate is:

$$\hat{L} = c^\top \hat{\beta}$$

Variance of the Contrast

Using the variance transformation property:

$$\begin{align} \text{Var}(\hat{L}) &= \text{Var}(c^\top \hat{\beta}) \\ &= c^\top \text{Var}(\hat{\beta}) c \\ &= c^\top (\sigma^2 (X^\top X)^{-1}) c \\ &= \sigma^2 c^\top (X^\top X)^{-1} c \end{align}$$

Step-by-Step Derivation of t-Statistic and F-Statistic Definitions

Fundamental Definitions

For any hypothesis test, we need to standardize our test statistic to follow a known distribution. The general pattern is:

$$\text{Test Statistic} = \frac{\text{Estimate} - \text{Hypothesized Value}}{\text{Standard Error of Estimate}}$$

Deriving the t-Statistic

Step 1: Start with the basic standardization formula

$$t = \frac{\hat{L} - 0}{\text{SE}(\hat{L})} = \frac{\hat{L}}{\text{SE}(\hat{L})}$$

Step 2: Find the standard error of $\hat{L}$

We previously showed that:

$$\text{Var}(\hat{L}) = \sigma^2 c^\top (X^\top X)^{-1} c$$

The standard error is the square root of the variance:

$$\text{SE}(\hat{L}) = \sqrt{\text{Var}(\hat{L})} = \sqrt{\sigma^2 c^\top (X^\top X)^{-1} c}$$

Step 3: Factor out $\sigma$ from the square root

$$\text{SE}(\hat{L}) = \sigma \sqrt{c^\top (X^\top X)^{-1} c}$$

Step 4: Substitute into the t-statistic formula

$$t = \frac{\hat{L}}{\sigma \sqrt{c^\top (X^\top X)^{-1} c}}$$

Step 5: Replace unknown $\sigma$ with its estimate $s$

Since $\sigma$ is unknown, we replace it with the sample standard deviation $s = \sqrt{s^2}$ where $s^2 = \frac{\text{RSS}}{n-p}$:

$$t = \frac{\hat{L}}{s \sqrt{c^\top (X^\top X)^{-1} c}}$$

Step 6: Final form, obtained by moving $s$ inside the square root

$$\boxed{t = \frac{\hat{L}}{\sqrt{s^2 c^\top (X^\top X)^{-1} c}}}$$
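
For concreteness, here is a minimal sketch that evaluates the boxed statistic for a hypothetical contrast $c = (0, 1, -1)^\top$ on simulated data; the design, coefficients, and contrast are illustrative assumptions, not anything prescribed by the derivation:

```python
# Illustrative computation of the single-contrast t-statistic and its p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 1.0]) + rng.normal(size=n)   # here c^T beta = 0 holds

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                             # s^2 = RSS / (n - p)

c = np.array([0.0, 1.0, -1.0])                           # hypothetical contrast
L_hat = c @ beta_hat
t = L_hat / np.sqrt(s2 * c @ XtX_inv @ c)
p_value = 2 * stats.t.sf(abs(t), df=n - p)               # two-sided p-value
print(t, p_value)
```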

Deriving the F-Statistic

Step 1: General F-statistic form

The F-statistic compares explained variation to unexplained variation:

$$F = \frac{\text{Explained Mean Square}}{\text{Unexplained Mean Square}}$$

For testing a single contrast, this becomes:

$$F = \frac{\text{(Constraint Sum of Squares)}/\text{(Constraint df)}}{\text{Residual Mean Square}}$$

Step 2: Identify the components for our single contrast

For the hypothesis $H_0: c^\top \beta = 0$:

  • Constraint degrees of freedom: 1 (testing one linear constraint)
  • Constraint Sum of Squares: This measures how much the constraint explains
  • Residual Mean Square: $s^2 = \frac{\text{RSS}}{n-p}$

The Constraint Sum of Squares: Complete Derivation

What is the Constraint Sum of Squares (CSS)?

The CSS measures how much the residual sum of squares increases when we enforce the constraint, or equivalently how much it decreases when we relax the constraint.

For our hypothesis $H_0: c^\top \beta = 0$:

  • Constrained model: Forces $c^\top \beta = 0$
  • Unconstrained model: Allows $c^\top \beta$ to take any value (our original OLS model)

$$\text{CSS} = \text{RSS}_{\text{constrained}} - \text{RSS}_{\text{unconstrained}}$$

Problem Setup

We want to derive, for the general linear hypothesis $H_0: R\beta = r$ (where R is a q×p constraint matrix and r is a q×1 vector; our single contrast is the special case $R = c^\top$, $r = 0$), why:

$$\text{CSS} = \text{RSS}_{\text{constrained}} - \text{RSS}_{\text{unconstrained}} = (R\hat{\beta} - r)^\top [R(X^\top X)^{-1}R^\top]^{-1} (R\hat{\beta} - r)$$

The Two Models:

Unconstrained Model:

  • Minimize: $\|y - X\beta\|^2$
  • Solution: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
  • RSS: $\text{RSS}_{\text{unconstrained}} = \|y - X\hat{\beta}\|^2$

Constrained Model:

  • Minimize: $\|y - X\beta\|^2$ subject to $R\beta = r$
  • Solution: $\tilde{\beta}$ (to be derived)
  • RSS: $\text{RSS}_{\text{constrained}} = \|y - X\tilde{\beta}\|^2$

Solving the Constrained Optimization Problem

Method: Lagrange Multipliers

Objective: Minimize $\frac{1}{2}\|y - X\beta\|^2$ subject to $R\beta = r$

Lagrangian:

$$L(\beta, \lambda) = \frac{1}{2}(y - X\beta)^\top(y - X\beta) + \lambda^\top(R\beta - r)$$

First-Order Conditions:

∂L/∂β = 0:

Let’s carefully expand the Lagrangian and take the gradient with respect to β.

The Lagrangian is:

$$L(\beta, \lambda) = \frac{1}{2}(y - X\beta)^\top(y - X\beta) + \lambda^\top(R\beta - r)$$

Step 1: Expand the first term

$$(y - X\beta)^\top(y - X\beta) = y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X\beta$$

Since $y^\top X\beta$ is a scalar, $y^\top X\beta = (y^\top X\beta)^\top = \beta^\top X^\top y$:

$$(y - X\beta)^\top(y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta$$

Step 2: Write the complete Lagrangian

$$L(\beta, \lambda) = \frac{1}{2}(y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta) + \lambda^\top R\beta - \lambda^\top r$$

Step 3: Take the partial derivative with respect to β

We need $\frac{\partial L}{\partial \beta}$. Taking derivatives term by term:

  • $\frac{\partial}{\partial \beta}[y^\top y] = 0$ (constant with respect to β)
  • $\frac{\partial}{\partial \beta}[-2\beta^\top X^\top y] = -2X^\top y$
  • $\frac{\partial}{\partial \beta}[\beta^\top X^\top X\beta] = 2X^\top X\beta$ (quadratic form derivative)

Detailed proof of the quadratic form derivative:

General Theorem: For any symmetric matrix $A$ and vector $x$, we have $\frac{\partial}{\partial x}[x^\top A x] = 2Ax$.

Complete Step-by-Step Proof:

Step 1: Expand the Quadratic Form using Summation Notation

Let $x$ be a vector with components $x_1, \ldots, x_n$ and $A$ be an $n \times n$ symmetric matrix with elements $A_{ij}$.

The quadratic form can be written as:

$$x^\top Ax = \sum_{j=1}^n \left(\sum_{i=1}^n x_i A_{ij}\right) x_j = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j$$

Step 2: Differentiate with Respect to a Single Component $x_k$

To find the gradient vector, we first find the partial derivative with respect to any component $x_k$ for $k \in \{1, \ldots, n\}$.

We need to identify all terms in the double summation that contain $x_k$. There are three cases:

  1. When $i = k$ but $j \neq k$: Terms of the form $x_k A_{kj} x_j$

    • Derivative: $\frac{\partial}{\partial x_k}[x_k A_{kj} x_j] = A_{kj} x_j$
  2. When $j = k$ but $i \neq k$: Terms of the form $x_i A_{ik} x_k$

    • Derivative: $\frac{\partial}{\partial x_k}[x_i A_{ik} x_k] = x_i A_{ik}$
  3. When both $i = k$ and $j = k$: The term $x_k A_{kk} x_k = A_{kk} x_k^2$

    • Derivative: $\frac{\partial}{\partial x_k}[A_{kk} x_k^2] = 2A_{kk} x_k$

Therefore:

$$\frac{\partial}{\partial x_k}[x^\top Ax] = \sum_{j \neq k} A_{kj} x_j + \sum_{i \neq k} x_i A_{ik} + 2A_{kk} x_k$$

Step 3: Utilize the Symmetry of Matrix $A$

Since $A$ is symmetric, $A_{ij} = A_{ji}$ for all $i$ and $j$. Therefore, $A_{ik} = A_{ki}$.

$$\frac{\partial}{\partial x_k}[x^\top Ax] = \sum_{j \neq k} A_{kj} x_j + \sum_{i \neq k} x_i A_{ki} + 2A_{kk} x_k$$

Step 4: Combine Summations

We can split the term $2A_{kk} x_k$ into two identical terms: $A_{kk} x_k + A_{kk} x_k$.

Now we can “absorb” one $A_{kk} x_k$ term into each summation to complete the range from 1 to $n$:

  • $\sum_{j \neq k} A_{kj} x_j + A_{kk} x_k = \sum_{j=1}^n A_{kj} x_j$
  • $\sum_{i \neq k} x_i A_{ki} + x_k A_{kk} = \sum_{i=1}^n x_i A_{ki}$

Therefore:

$$\frac{\partial}{\partial x_k}[x^\top Ax] = \sum_{j=1}^n A_{kj} x_j + \sum_{i=1}^n x_i A_{ki}$$

Step 5: Express in Matrix Notation

Recognizing the summations as components of matrix-vector products:

  • The first summation, $\sum_{j=1}^n A_{kj} x_j$, is the $k$-th component of the vector $Ax$

  • The second summation, $\sum_{i=1}^n x_i A_{ki}$, is the $k$-th component of the vector $A^\top x$. But since $A$ is symmetric, $A^\top = A$, so this is also the $k$-th component of $Ax$

Therefore:

$$\frac{\partial}{\partial x_k}[x^\top Ax] = (Ax)_k + (A^\top x)_k = (Ax)_k + (Ax)_k = 2(Ax)_k$$

Step 6: Complete the Gradient Vector

Since this holds for any $k \in \{1, \ldots, n\}$, the full gradient vector is:

$$\frac{\partial}{\partial x}[x^\top Ax] = 2Ax$$
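
The gradient formula is easy to spot-check numerically. The sketch below (my own illustration) compares a central finite-difference approximation of $\partial(x^\top A x)/\partial x$ with $2Ax$ for a randomly generated symmetric $A$:

```python
# Illustrative finite-difference check that d/dx [x^T A x] = 2 A x for symmetric A.
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                               # symmetrize A
x = rng.normal(size=n)

f = lambda v: v @ A @ v                         # the quadratic form x^T A x
eps = 1e-6
numeric_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)   # central difference
    for e in np.eye(n)
])

print(np.allclose(numeric_grad, 2 * A @ x, atol=1e-6))   # True
```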

Application to Our Problem:

Since $X^\top X$ is always symmetric ($(X^\top X)^\top = X^\top X$), we can directly apply this result:

$$\frac{\partial}{\partial \beta}[\beta^\top X^\top X\beta] = 2X^\top X\beta$$

Continuing with our Lagrangian derivatives:

  • $\frac{\partial}{\partial \beta}[\lambda^\top R\beta] = R^\top \lambda$
  • $\frac{\partial}{\partial \beta}[\lambda^\top r] = 0$ (constant)

Step 4: Combine all terms

$$\begin{align} \frac{\partial L}{\partial \beta} &= \frac{1}{2}(0 - 2X^\top y + 2X^\top X\beta) + R^\top \lambda \\ &= \frac{1}{2}(-2X^\top y + 2X^\top X\beta) + R^\top \lambda \\ &= -X^\top y + X^\top X\beta + R^\top \lambda \end{align}$$

Step 5: Set equal to zero (first-order condition)

$$-X^\top y + X^\top X\beta + R^\top \lambda = 0$$

Step 6: Rearrange to standard form

$$X^\top X\beta + R^\top\lambda = X^\top y \quad \text{...(1)}$$

∂L/∂λ = 0:

$$R\beta - r = 0 \quad \text{...(2)}$$

Solving the System of Equations:

From equation (1): $\beta = (X^\top X)^{-1}(X^\top y - R^\top\lambda)$

Substituting into equation (2):

$$\begin{align} R(X^\top X)^{-1}(X^\top y - R^\top\lambda) - r &= 0 \\ R(X^\top X)^{-1}X^\top y - R(X^\top X)^{-1}R^\top\lambda - r &= 0 \end{align}$$

Since $R(X^\top X)^{-1}X^\top y = R\hat{\beta}$:

$$R\hat{\beta} - R(X^\top X)^{-1}R^\top\lambda - r = 0$$

Solving for λ:

$$\lambda = [R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)$$

Finding the Constrained Solution

Substituting λ back into the expression for β:

$$\begin{align} \tilde{\beta} &= (X^\top X)^{-1}X^\top y - (X^\top X)^{-1}R^\top[R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r) \\ &= \hat{\beta} - (X^\top X)^{-1}R^\top[R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r) \end{align}$$
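
As an illustration, the sketch below evaluates this closed-form $\tilde{\beta}$ on simulated data with a hypothetical constraint $R = (0, 1, 1, 0)$, $r = 0.3$, and confirms two properties we expect: the constraint holds exactly, and the constrained RSS is never smaller than the unconstrained RSS:

```python
# Illustrative check of the constrained estimator from the Lagrange-multiplier formula.
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

R = np.array([[0.0, 1.0, 1.0, 0.0]])            # hypothetical constraint: beta_1 + beta_2 = 0.3
r = np.array([0.3])

middle = np.linalg.inv(R @ XtX_inv @ R.T)
beta_tilde = beta_hat - XtX_inv @ R.T @ middle @ (R @ beta_hat - r)

print(R @ beta_tilde)                           # ~[0.3]: the constraint is satisfied
rss_u = np.sum((y - X @ beta_hat) ** 2)
rss_c = np.sum((y - X @ beta_tilde) ** 2)
print(rss_c >= rss_u)                           # True: constraining never lowers the RSS
```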

Computing RSS_constrained

Step 1: Express the constrained residuals

$$\text{RSS}_{\text{constrained}} = \|y - X\tilde{\beta}\|^2$$

Substituting our expression for $\tilde{\beta}$:

$$y - X\tilde{\beta} = y - X\hat{\beta} + X(X^\top X)^{-1}R^\top[R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)$$

Let $e = y - X\hat{\beta}$ (unconstrained residuals) and $A = X(X^\top X)^{-1}R^\top$:

$$y - X\tilde{\beta} = e + A[R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)$$

Step 2: Expand the squared norm

$$\text{RSS}_{\text{constrained}} = \|e + A[R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)\|^2$$

Let $\delta = [R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)$ for simplicity:

$$\begin{align} \text{RSS}_{\text{constrained}} &= \|e + A\delta\|^2 = (e + A\delta)^\top(e + A\delta) \\ &= e^\top e + 2e^\top A\delta + \delta^\top A^\top A\delta \\ &= \text{RSS}_{\text{unconstrained}} + 2e^\top A\delta + \delta^\top A^\top A\delta \end{align}$$

Key Lemma: $e^\top A = 0$

Proof that the cross term vanishes:

We need to show that $e^\top A = 0$ where:

  • $e = y - X\hat{\beta}$ (unconstrained residuals)
  • $A = X(X^\top X)^{-1}R^\top$

$$\begin{align} e^\top A &= (y - X\hat{\beta})^\top X(X^\top X)^{-1}R^\top \\ &= (y^\top - \hat{\beta}^\top X^\top) X(X^\top X)^{-1}R^\top \\ &= y^\top X(X^\top X)^{-1}R^\top - \hat{\beta}^\top X^\top X(X^\top X)^{-1}R^\top \\ &= y^\top X(X^\top X)^{-1}R^\top - \hat{\beta}^\top R^\top \end{align}$$

Since $\hat{\beta} = (X^\top X)^{-1}X^\top y$:

$$\hat{\beta}^\top R^\top = y^\top X(X^\top X)^{-1}R^\top$$

Therefore: $e^\top A = y^\top X(X^\top X)^{-1}R^\top - y^\top X(X^\top X)^{-1}R^\top = 0$
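
Numerically, this lemma is just the familiar orthogonality of the OLS residuals to the column space of $X$ (and hence to every column of $A$), which the tiny sketch below confirms on illustrative data:

```python
# Illustrative check that OLS residuals are orthogonal to the columns of X,
# which is exactly why e^T A = 0 for A = X (X^T X)^{-1} R^T.
import numpy as np

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)
e = y - X @ np.linalg.inv(X.T @ X) @ X.T @ y    # unconstrained residuals
print(np.allclose(e @ X, 0.0))                  # True (up to rounding)
```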

Simplifying RSS_constrained

Since $e^\top A = 0$:

$$\text{RSS}_{\text{constrained}} = \text{RSS}_{\text{unconstrained}} + \delta^\top A^\top A\delta$$

Computing $A^\top A$:

$$\begin{align} A^\top A &= (X(X^\top X)^{-1}R^\top)^\top X(X^\top X)^{-1}R^\top \\ &= R(X^\top X)^{-1}X^\top X(X^\top X)^{-1}R^\top \\ &= R(X^\top X)^{-1}R^\top \end{align}$$

Therefore:

$$\text{RSS}_{\text{constrained}} = \text{RSS}_{\text{unconstrained}} + \delta^\top [R(X^\top X)^{-1}R^\top] \delta$$

Final Result: The CSS Formula

$$\begin{align} \text{CSS} &= \text{RSS}_{\text{constrained}} - \text{RSS}_{\text{unconstrained}} \\ &= \delta^\top [R(X^\top X)^{-1}R^\top] \delta \end{align}$$

Substituting back $\delta = [R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)$:

$$\text{CSS} = (R\hat{\beta} - r)^\top [R(X^\top X)^{-1}R^\top]^{-1} [R(X^\top X)^{-1}R^\top] [R(X^\top X)^{-1}R^\top]^{-1}(R\hat{\beta} - r)$$

The middle terms simplify:

$$[R(X^\top X)^{-1}R^\top]^{-1} [R(X^\top X)^{-1}R^\top] = I$$

Therefore:

$$\boxed{\text{CSS} = (R\hat{\beta} - r)^\top [R(X^\top X)^{-1}R^\top]^{-1} (R\hat{\beta} - r)}$$
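
Both routes to the CSS can be compared directly: compute the RSS difference from the constrained and unconstrained fits, and compute the boxed quadratic form, then check that they agree. The sketch below does this for a hypothetical single-row $R$ on simulated data:

```python
# Illustrative check that the quadratic-form CSS equals the RSS difference.
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.2, 1.5, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
R, r = np.array([[0.0, 1.0, -1.0]]), np.array([0.0])    # hypothetical constraint

# Route 1: the boxed quadratic form in (R beta_hat - r).
gap = R @ beta_hat - r
css_quadratic = gap @ np.linalg.inv(R @ XtX_inv @ R.T) @ gap

# Route 2: RSS(constrained) - RSS(unconstrained), using the formula for beta_tilde.
beta_tilde = beta_hat - XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ gap
css_rss = np.sum((y - X @ beta_tilde) ** 2) - np.sum((y - X @ beta_hat) ** 2)

print(np.isclose(css_quadratic, css_rss))                # True
```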

Verification for Our Specific Case

For $H_0: c^\top \beta = 0$:

  • $R = c^\top$ (1 × p)
  • $r = 0$
  • $R\hat{\beta} - r = c^\top\hat{\beta} = \hat{L}$
  • $R(X^\top X)^{-1}R^\top = c^\top(X^\top X)^{-1}c$ (a scalar)

$$\text{CSS} = \hat{L}^\top [c^\top(X^\top X)^{-1}c]^{-1} \hat{L} = \frac{\hat{L}^2}{c^\top(X^\top X)^{-1}c}$$

Completing the F-Statistic Derivation

Step 3: Form the F-statistic

$$F = \frac{\text{CSS}/1}{s^2} = \frac{\text{CSS}}{s^2}$$

Substituting our expression for CSS:

$$F = \frac{(\hat{L})^2/(c^\top (X^\top X)^{-1} c)}{s^2}$$

Step 4: Simplify to standard form

$$\boxed{F = \frac{(\hat{L})^2}{s^2 c^\top (X^\top X)^{-1} c}}$$

with degrees of freedom (1, n-p).

The Main Result: t² = F

Algebraic Identity

Squaring the t-statistic:

$$\begin{align} t^2 &= \left(\frac{\hat{L}}{\sqrt{s^2 c^\top (X^\top X)^{-1} c}}\right)^2 \\ &= \frac{\hat{L}^2}{s^2 c^\top (X^\top X)^{-1} c} \\ &= F \end{align}$$
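
A short numerical check, on arbitrary simulated data and a hypothetical contrast, confirms the identity to machine precision:

```python
# Illustrative confirmation that t^2 equals F for a single contrast.
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, 0.8]) + rng.normal(size=n)
c = np.array([0.0, 1.0, -1.0])                 # hypothetical contrast, H0: c^T beta = 0

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)

L_hat = c @ beta_hat
t = L_hat / np.sqrt(s2 * c @ XtX_inv @ c)
F = L_hat**2 / (s2 * c @ XtX_inv @ c)
print(np.isclose(t**2, F))                     # True
```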

Distributional Relationship

Under H₀: c^⊤β = 0:

  • t ~ t_{n-p} (t-distribution with n-p degrees of freedom)
  • F ~ F_{1,n-p} (F-distribution with 1 and n-p degrees of freedom)
  • t² ~ F_{1,n-p} (the square of a t-distributed random variable)

Relationship Between the Forms

Comparing the denominators

t-statistic denominator:

$$\sqrt{s^2 c^\top (X^\top X)^{-1} c}$$

F-statistic denominator:

$$s^2 c^\top (X^\top X)^{-1} c$$

Notice that: $(\text{t-denominator})^2 = \text{F-denominator}$

Comparing the numerators

t-statistic numerator: $\hat{L}$

F-statistic numerator: $\hat{L}^2$

Notice that: $(\text{t-numerator})^2 = \text{F-numerator}$

Interpretation and Practical Implications

Statistical Interpretation

  1. σ² represents the population error variance (unknown, estimated by s² from RSS)
  2. t measures the signed standardized effect
  3. F measures the squared standardized effect (always non-negative)
  4. When the constraint involves one degree of freedom: t² = F

Practical Applications

This relationship is fundamental in:

  • Regression coefficient testing: Testing whether β_j = 0
  • Linear contrast testing: Testing linear combinations of parameters
  • Model comparison: Understanding equivalence between different test formats

Geometric Interpretation

The CSS formula represents:

  1. $(R\hat{\beta} - r)$: How far the unconstrained estimate violates the constraint
  2. $[R(X^\top X)^{-1}R^\top]^{-1}$: The “precision matrix” for the constraint violation
  3. The quadratic form: A weighted measure of constraint violation

This derivation shows that the CSS is not arbitrary—it’s the natural consequence of comparing constrained and unconstrained least squares solutions through Lagrange multiplier optimization.

Why This Form is Universal and Natural

Multiple Theoretical Foundations

The quadratic form CSS = $(R\hat{\beta} - r)^\top [R(X^\top X)^{-1}R^\top]^{-1} (R\hat{\beta} - r)$ emerges from multiple theoretical frameworks:

  1. Lagrange Multiplier Optimization: Direct consequence of constrained least squares
  2. Likelihood Ratio Testing: Comparing constrained vs unconstrained model likelihoods
  3. Wald Test Framework: Measuring parameter distance from null hypothesis
  4. Geometric Projection: Distance between parameter estimates in appropriate metric

The Intuitive Interpretation

The CSS can be understood as a “Standardized Squared Distance”:

$\text{CSS} = \frac{(\text{Distance from constraint})^2}{\text{Variability of that distance}}$

More specifically:

  • Numerator: the constraint gap $R\hat{\beta} - r$, entering as a quadratic form, measures how far our estimate deviates from the constraint
  • Denominator: $R(X^\top X)^{-1}R^\top$ accounts for the estimation precision
  • Result: A scale-free measure that follows proper statistical distributions

Key Properties

  1. Distributional: Under $H_0$, $\text{CSS}/\sigma^2 \sim \chi^2_q$, where q is the number of constraints
  2. Invariant: Unchanged under invertible linear reparametrizations of the model and equivalent rewritings of the constraint
  3. Additive: For independent constraints, CSS values add appropriately
  4. Generalizable: Works for single constraints (q=1) and multiple constraints (q>1)

Summary of Key Results

Final Test Statistic Forms

t-statistic for $H_0: c^\top \beta = 0$: $t = \frac{\hat{L}}{\sqrt{s^2 c^\top (X^\top X)^{-1} c}} \sim t_{n-p}$

F-statistic for the same hypothesis: $F = \frac{\hat{L}^2}{s^2 c^\top (X^\top X)^{-1} c} \sim F_{1,n-p}$

The fundamental relationship: $t^2 = F$

Critical Insights

  1. Unified Framework: Both statistics test the same hypothesis using different perspectives
  2. Sign Information: t-statistics preserve directional information; F-statistics measure magnitude only
  3. Degrees of Freedom: If $T \sim t_{n-p}$, then $T^2 \sim F_{1,n-p}$; this is a fundamental distributional identity
  4. p-value Equivalence: Software can report either form with identical p-values (a quick check follows below)
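
The p-value equivalence in point 4 can be verified directly from the two reference distributions, with no data at all; the t-value and degrees of freedom below are arbitrary examples:

```python
# Illustrative check: the two-sided t p-value equals the F_{1, n-p} p-value of t^2.
from scipy import stats

df_resid = 25                                   # hypothetical residual degrees of freedom
t_value = 2.3                                   # hypothetical t-statistic
p_from_t = 2 * stats.t.sf(abs(t_value), df=df_resid)
p_from_f = stats.f.sf(t_value**2, dfn=1, dfd=df_resid)
print(p_from_t, p_from_f)                       # identical up to floating-point rounding
```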

Extended Applications

Multiple Contrast Testing

For testing multiple constraints simultaneously $H_0: R\beta = r$ where R is q×p:

$F = \frac{(R\hat{\beta} - r)^\top [R(X^\top X)^{-1}R^\top]^{-1} (R\hat{\beta} - r)/q}{s^2} \sim F_{q,n-p}$

This generalizes our single constraint result (q=1) to arbitrary linear hypothesis testing.
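
The general case is easy to package as a small helper. The function below is my own sketch (not a library routine) that evaluates this F-statistic and its p-value from (X, y, R, r); the example data and constraints are hypothetical:

```python
# Sketch of a general F-test for H0: R beta = r (q constraints, q >= 1).
import numpy as np
from scipy import stats

def linear_hypothesis_F(X, y, R, r):
    n, p = X.shape
    q = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)       # residual mean square
    gap = R @ beta_hat - r
    css = gap @ np.linalg.inv(R @ XtX_inv @ R.T) @ gap   # constraint sum of squares
    F = (css / q) / s2
    return F, stats.f.sf(F, dfn=q, dfd=n - p)

# Hypothetical usage: jointly test beta_1 = 0 and beta_2 = 0 in a 4-parameter model.
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 0.0, 0.0, 2.0]) + rng.normal(size=100)
R = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
r = np.zeros(2)
print(linear_hypothesis_F(X, y, R, r))
```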

Model Comparison Framework

The t²=F identity extends to model comparison:

  • Nested models: F-tests compare full vs restricted models
  • Sequential testing: Adding parameters one at a time via t-tests
  • Overall model significance: Testing all parameters simultaneously

Practical Implementation Notes

Computational Considerations

  1. Numerical stability: Computing $(X^\top X)^{-1}$ directly can be unstable
  2. QR decomposition: A more stable route to the same test statistics (a brief sketch follows this list)
  3. Software implementation: Most packages use numerically stable algorithms automatically
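
As referenced in point 2 above, here is a brief sketch of the QR route, assuming $X$ has full column rank: with $X = QR$ we have $X^\top X = R^\top R$, so $\hat{\beta}$ solves $R\beta = Q^\top y$ and $(X^\top X)^{-1} = R^{-1}R^{-\top}$, without ever forming $X^\top X$ explicitly.

```python
# Sketch of computing a contrast t-statistic via the QR decomposition of X.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(8)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

Q, Rmat = np.linalg.qr(X)                       # reduced QR: Q is n x p, Rmat is p x p
beta_hat = solve_triangular(Rmat, Q.T @ y)      # solves Rmat @ beta = Q^T y
R_inv = solve_triangular(Rmat, np.eye(p))
XtX_inv = R_inv @ R_inv.T                       # equals (X^T X)^{-1} = R^{-1} R^{-T}

s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
c = np.array([0.0, 1.0, 0.0])                   # hypothetical contrast
t = (c @ beta_hat) / np.sqrt(s2 * c @ XtX_inv @ c)
print(t)
```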

Interpretation Guidelines

  1. Single parameters: Use t-tests for directional hypotheses
  2. Parameter magnitude: Use F-tests when direction doesn’t matter
  3. Multiple parameters: F-tests are the natural choice
  4. Model building: Both approaches provide identical inference for single-degree-of-freedom tests

Conclusion

The identity t² = F when testing single contrasts in linear regression represents far more than a mathematical curiosity. It reveals the fundamental unity underlying two seemingly different approaches to hypothesis testing.

The Bigger Picture

This proof illustrates how careful mathematical analysis reveals elegant connections in statistical theory. The step-by-step derivations—from variance properties through Lagrange multipliers to distributional relationships—show how fundamental principles combine to produce powerful and practical statistical tools.

The t² = F identity stands as a beautiful example of how mathematical rigor enhances both theoretical understanding and practical statistical application. Every applied statistician benefits from understanding these deep connections that unify our statistical toolkit.


This comprehensive derivation provides the complete mathematical foundation for understanding one of the most important relationships in linear regression analysis. From first principles through advanced applications, every step demonstrates how careful mathematical reasoning reveals the elegant unity underlying statistical inference.