Deriving Kendall's Tau from the Sample-Based Perspective (Discrete)
Kendall’s Tau ($\tau$) is a robust non-parametric measure of correlation that captures the strength of monotonic relationships between two variables. Unlike Pearson’s correlation, Kendall’s Tau is based on the relative ordering of data points rather than their exact values, making it particularly useful for ordinal data and resistant to outliers.
In this post, we’ll derive the complete formula for Kendall’s Tau step by step, starting from basic definitions and arriving at the elegant mathematical expression using the sign function.
Step 1: Foundation - Concordant and Discordant Pairs
For a dataset with $n$ paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we need to examine all possible pairs of observations.
Definitions
Consider any two distinct observations $(x_i, y_i)$ and $(x_j, y_j)$ where $i < j$. This pair can be classified as:
Concordant Pair: A pair where both variables change in the same direction
$$ (x_i - x_j)(y_i - y_j) > 0 $$
This occurs when:
- Both $x_i > x_j$ and $y_i > y_j$ (both increase together), or
- Both $x_i < x_j$ and $y_i < y_j$ (both decrease together)
Discordant Pair: A pair where the variables change in opposite directions
$$ (x_i - x_j)(y_i - y_j) < 0 $$
This occurs when:
- $x_i > x_j$ but $y_i < y_j$ (one increases, other decreases), or
- $x_i < x_j$ but $y_i > y_j$ (one decreases, other increases)
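The two definitions reduce to checking the sign of one product, which can be sketched in a few lines of Python. The function name `classify_pair` and the `"tied"` label for a zero product are illustrative assumptions; the definitions above only cover the strict inequalities.

```python
def classify_pair(p, q):
    """Classify two observations p = (x_i, y_i) and q = (x_j, y_j)."""
    (xi, yi), (xj, yj) = p, q
    product = (xi - xj) * (yi - yj)
    if product > 0:
        return "concordant"
    if product < 0:
        return "discordant"
    return "tied"  # assumption: a zero product means a tie in x or in y


print(classify_pair((1, 2), (3, 5)))  # concordant: x and y both differ in the same direction
print(classify_pair((1, 5), (3, 2)))  # discordant: x and y differ in opposite directions
print(classify_pair((1, 2), (1, 7)))  # tied: the x values are equal
```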
Counting the Pairs
Let:
- $N_C$ = number of concordant pairs
- $N_D$ = number of discordant pairs
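As a rough illustration, both counts can be obtained by looping over every unordered pair once. The helper name `count_pairs` and the five-point dataset are made up for this sketch.

```python
from itertools import combinations


def count_pairs(xs, ys):
    """Return (N_C, N_D) for paired samples xs and ys."""
    n_c = n_d = 0
    # combinations(..., 2) visits each unordered pair exactly once
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        product = (xi - xj) * (yi - yj)
        if product > 0:
            n_c += 1
        elif product < 0:
            n_d += 1
        # a zero product (tie) is counted in neither total
    return n_c, n_d


xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(count_pairs(xs, ys))  # (8, 2)
```

For this toy dataset only the pairs (1, 2)/(2, 1) and (3, 4)/(4, 3) are discordant; the other eight pairs are concordant.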
The total number of possible pairs from $n$ observations is:
$$ \text{Total pairs} = \binom{n}{2} = \frac{n(n-1)}{2} $$
Initial Definition of Kendall’s Tau
Kendall’s Tau measures the excess of concordant over discordant pairs, normalized by the total number of pairs:
$$ \tau = \frac{N_C - N_D}{\text{total pairs}} = \frac{N_C - N_D}{\binom{n}{2}} = \frac{2(N_C - N_D)}{n(n-1)} $$
Step 2: The Sign Function Representation
We can express the classification of pairs more elegantly using the sign function. The sign function is defined as:
$$ \text{sgn}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \end{cases} $$
Key Insight
For any pair $(i,j)$ with $i < j$, the product of signs captures the pair’s nature:
$$ \text{sgn}(x_i - x_j) \cdot \text{sgn}(y_i - y_j) = \begin{cases} +1 & \text{if concordant} \\ -1 & \text{if discordant} \\ 0 & \text{if tie in } x \text{ or } y \end{cases} $$
Why this works:
- Concordant: $(x_i - x_j)$ and $(y_i - y_j)$ have the same sign → $\text{sgn}(x_i - x_j) \cdot \text{sgn}(y_i - y_j) = (+1) \cdot (+1) = +1$ or $(-1) \cdot (-1) = +1$
- Discordant: $(x_i - x_j)$ and $(y_i - y_j)$ have opposite signs → $\text{sgn}(x_i - x_j) \cdot \text{sgn}(y_i - y_j) = (+1) \cdot (-1) = -1$ or $(-1) \cdot (+1) = -1$
- Tied: At least one difference is zero → product equals zero
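The three cases can be checked directly with a small sign helper. This is a minimal sketch; `sgn` and `pair_sign` are names chosen here for illustration.

```python
def sgn(v):
    """Sign function: +1 for positive, -1 for negative, 0 for zero."""
    # In Python, True and False behave as 1 and 0 under subtraction.
    return (v > 0) - (v < 0)


def pair_sign(p, q):
    """Product of signs for observations p = (x_i, y_i) and q = (x_j, y_j)."""
    (xi, yi), (xj, yj) = p, q
    return sgn(xi - xj) * sgn(yi - yj)


print(pair_sign((1, 2), (3, 5)))  # 1  (concordant)
print(pair_sign((1, 5), (3, 2)))  # -1 (discordant)
print(pair_sign((1, 2), (1, 7)))  # 0  (tie in x)
```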
Summation Formula
By summing over all pairs, we get:
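$$ \sum_{i<j} \text{sgn}(x_i - x_j) \cdot \text{sgn}(y_i - y_j) = N_C - N_D $$
This holds because the sum collects:
- $+1$ for each concordant pair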
- $-1$ for each discordant pair
- $0$ for each tied pair
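A quick numerical check of this bookkeeping: summing the sign products over all pairs should give the number of concordant pairs minus the number of discordant pairs. The function name `sign_sum` and both datasets are assumptions made for the sketch.

```python
from itertools import combinations


def sgn(v):
    """Sign function: +1, -1, or 0."""
    return (v > 0) - (v < 0)


def sign_sum(xs, ys):
    """Sum of sgn(x_i - x_j) * sgn(y_i - y_j) over all pairs with i < j."""
    return sum(
        sgn(xi - xj) * sgn(yi - yj)
        for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2)
    )


# 8 concordant and 2 discordant pairs, so the sum is 8 - 2
print(sign_sum([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 6

# One pair is tied in x and contributes 0; the other two are concordant
print(sign_sum([1, 1, 2], [1, 2, 3]))  # 2
```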
Step 3: The Final Formula
Combining our results from Steps 1 and 2:
$$ \boxed{\tau = \frac{2}{n(n-1)} \sum_{i<j} \text{sgn}(x_i - x_j) \cdot \text{sgn}(y_i - y_j)} $$
The formula has two parts.
Normalization Factor: $\frac{2}{n(n-1)}$
- Scales the result to the range $[-1, +1]$
- Accounts for the total number of possible pairs
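Sum: $\sum_{i<j} \text{sgn}(x_i - x_j) \cdot \text{sgn}(y_i - y_j)$
- Equals $N_C - N_D$, the excess of concordant over discordant pairs
- Tied pairs contribute zero
Interpretation
The value of Kendall’s Tau ranges from $-1$ to $+1$:
- $\tau = +1$: every pair is concordant (the two orderings agree perfectly)
- $\tau = -1$: every pair is discordant (the two orderings are exactly reversed)
- $\tau \approx 0$: concordant and discordant pairs roughly balance
Advantages of This Formulation
The sign-function form replaces two separate counts with a single sum over pairs, and tied pairs drop out automatically because their term is zero.
Conclusion
The derivation of Kendall’s Tau shows how a simple, intuitive idea, comparing the relative ordering of pairs, leads to a compact and mathematically elegant formula. The sign function representation is convenient to compute and makes the structure of rank-based correlation measures explicit. This formula serves as the foundation for many applications in non-parametric statistics, from hypothesis testing to robust correlation analysis in the presence of outliers or non-linear relationships.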
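As a closing illustration, the sign-function formula translates almost line for line into Python. This is a minimal sketch rather than a reference implementation: the name `kendall_tau` and the input checks are assumptions, and the loop visits every pair, so it takes time quadratic in $n$.

```python
from itertools import combinations


def sgn(v):
    """Sign function: +1, -1, or 0."""
    return (v > 0) - (v < 0)


def kendall_tau(xs, ys):
    """Kendall's Tau via the sign-function formula (no tie correction)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("at least two observations are required")
    total = sum(
        sgn(xi - xj) * sgn(yi - yj)
        for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2)
    )
    # Normalize by the number of pairs, n(n-1)/2
    return 2 * total / (n * (n - 1))


print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.6
print(kendall_tau([1, 2, 3], [10, 20, 30]))           # 1.0
print(kendall_tau([1, 2, 3], [30, 20, 10]))           # -1.0
```

Because tied pairs are neither removed from the denominator nor otherwise adjusted for, this is the form often called tau-a. Library routines such as `scipy.stats.kendalltau` default to the tie-adjusted tau-b, so results can differ when the data contain ties.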