Using Python (and R) to calculate Rank Correlations
You might also be interested in my pages on doing Linear Regressions with Python and/or R.
This page covers:
- Ranking data
- Rank based Correlations
- Spearman's Rho
- Kendall's Tau
Ranking data
Rank correlations are computed on the ranks of the data rather than on the raw values themselves. This can be very advantageous when dealing with data containing outliers.
For example, given two sets of data, say x = [5.05, 6.75, 3.21, 2.66] and y = [1.65, 26.5, -5.93, 7.96], with some ordering (here numerical) we can give them the ranks [3, 4, 2, 1] and [2, 4, 1, 3] respectively.
Tied ranks are usually assigned using the midrank method, whereby those entries receive the mean of the ranks they would have received had they not been tied. Thus z = [1.65, 2.64, 2.64, 6.95] would yield ranks [1, 2.5, 2.5, 4] using the midrank method.
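To make the midrank method concrete, here is a minimal pure-Python sketch (the function name midrank is my own, not from any of the libraries discussed below):-

```python
def midrank(values):
    """Rank values from 1, giving tied entries the mean of the
    ranks they would have received had they not been tied."""
    n = len(values)
    ranks = [0.0] * n
    # Indices of the data in sorted order.
    order = sorted(range(n), key=lambda i: values[i])
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j share the mean of ranks i+1 .. j+1.
        mean_rank = (i + j + 2) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

print(midrank([1.65, 2.64, 2.64, 6.95]))  # [1.0, 2.5, 2.5, 4.0]
```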
We can do this in Python using Gary Strangman's rankdata function from the stats module in SciPy:-
>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> print scipy.stats.stats.rankdata(x)
[ 3. 4. 2. 1.]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> print scipy.stats.stats.rankdata(y)
[ 2. 4. 1. 3.]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.rankdata(z)
[ 1. 2.5 2.5 4.]
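If you are on a more recent SciPy and Python 3, the same function is exposed directly as scipy.stats.rankdata (and print is a function); ties still get midranks by default:-

```python
from scipy.stats import rankdata

print(rankdata([5.05, 6.75, 3.21, 2.66]))  # [3. 4. 2. 1.]
print(rankdata([1.65, 2.64, 2.64, 6.95]))  # ties get the midrank, 2.5 here
```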
This functionality is built into the R language:-
> x <- c(5.05, 6.75, 3.21, 2.66)
> rank(x)
[1] 3 4 2 1
> y <- c(1.65, 26.5, -5.93, 7.96)
> rank(y)
[1] 2 4 1 3
> z <- c(1.65, 2.64, 2.64, 6.95)
> rank(z)
[1] 1.0 2.5 2.5 4.0
Rank based Correlations
The two main correlations used for comparing such ranked data are known as the Spearman Rank Correlation (Spearman's ρ or Spearman's Rho) and Kendall's Tau (τ).
Both have several variants (e.g. rs, rsa and rsb for Spearman's ρ) which deal with the situation of tied data in different ways.
Using Spearman's ρ as an example: there are no ties in x or y, so rs(x,y) and rsb(x,y) are both 0.40 (to 2 d.p.). However, z does have ties, so rs(x,z) = -0.55 (to 2 d.p., no tie correction) differs from rsb(x,z) = -0.63 (to 2 d.p., with the tie correction).
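These numbers are easy to check by hand with the classical no-tie-correction formula rs = 1 - 6Σd²/(n(n²-1)), where d is the difference between paired ranks. A short sketch (using scipy.stats.rankdata for the midranks; spearman_rs is my own helper name):-

```python
from scipy.stats import rankdata

def spearman_rs(a, b):
    """Spearman's rs with NO tie correction: 1 - 6*sum(d^2)/(n^3 - n)."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    d2 = sum((i - j) ** 2 for i, j in zip(ra, rb))
    return 1 - 6 * d2 / (n ** 3 - n)

x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(spearman_rs(x, y))  # ≈ 0.40
print(spearman_rs(x, z))  # ≈ -0.55 (midranks used, but no tie correction)
```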
The notation I am using is from the 5th edition (published 1990) of "Rank Correlation Methods", by Maurice Kendall and Jean Dickinson Gibbons (ISBN 0-85264-305-5, first published in 1948).
To date, I have found two existing Python libraries with support for these correlations (Spearman and Kendall):
- Gary Strangman's stats.py (last updated in 2003, includes Linear Regression). Travis Oliphant incorporated an earlier version of this into SciPy - Scientific tools for Python in 2002.
- Michiel de Hoon's PyCluster module (which is also included as Bio.Cluster in BioPython).
I have also used the R language (for statistical computing and graphics) from within Python using the package RPy (R from Python) to calculate these rank correlations.
Spearman's Rho
[Insert formula for rs, rsa and rsb here]
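In the meantime, the standard textbook forms (which should be checked against the exact notation in Kendall & Gibbons) are:

```latex
% Spearman's r_s with no tie correction, where d_i is the
% difference between the paired ranks and n the sample size:
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

% The tie-corrected r_{sb} is equivalent to a Pearson product-moment
% correlation computed on the (mid)ranks R_i and S_i:
r_{sb} = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}
              {\sqrt{\sum_i (R_i - \bar{R})^2 \sum_i (S_i - \bar{S})^2}}
```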
Gary Strangman's library in SciPy gives rs which has NO TIE CORRECTION included (plus it also calculates the two-tailed p-value):-
>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.spearmanr(x, y)[0]
0.4
>>> print scipy.stats.stats.spearmanr(x, z)[0]
-0.55
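A caution for anyone repeating this on a later SciPy: the bundled spearmanr was subsequently reworked to compute a Pearson correlation of the midranks, which does include the tie correction (i.e. it returns rsb), so the x, z example gives -0.63 rather than -0.55:-

```python
from scipy.stats import spearmanr

x = [5.05, 6.75, 3.21, 2.66]
z = [1.65, 2.64, 2.64, 6.95]
rho, p = spearmanr(x, z)
print(rho)  # ≈ -0.6325 in recent SciPy releases (tie corrected)
```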
On the other hand, Michiel de Hoon's library (available in BioPython or standalone as PyCluster) returns Spearman rsb which does include a tie correction:-
>>> import Bio.Cluster
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print 1 - Bio.Cluster.distancematrix((x,y), dist="s")[1][0]
0.4
>>> print 1 - Bio.Cluster.distancematrix((x,z), dist="s")[1][0]
-0.632455532034
The distancematrix function takes a "matrix" and returns the distances between each pair of rows (in this case, x and y, or x and z). This information could be stored as a symmetric matrix (with zeroes on the diagonal), but for efficiency it isn't - see help(Bio.Cluster.distancematrix) for more information.
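Since rsb is numerically the same as a Pearson correlation of the midranks, there is an easy cross-check without Bio.Cluster (a sketch using NumPy and scipy.stats.rankdata):-

```python
import numpy as np
from scipy.stats import rankdata

x = [5.05, 6.75, 3.21, 2.66]
z = [1.65, 2.64, 2.64, 6.95]
# Pearson correlation of the two rank vectors = Spearman's rsb.
r = np.corrcoef(rankdata(x), rankdata(z))[0, 1]
print(r)  # ≈ -0.6325, matching the tie corrected value above
```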
We can also access R's Spearman correlation from within Python, again this uses the Spearman rsb which does include a tie correction:-
>>> import rpy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print rpy.r.cor(x, y, method="spearman")
0.4
>>> print rpy.r.cor(x, z, method="spearman")
-0.632455532034
This could be done in R as follows:
> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> z <- c(1.65, 2.64, 2.64, 6.95)
> cor(x, y, method="spearman")
[1] 0.4
> cor(x, z, method="spearman")
[1] -0.6324555
Kendall's Tau
[Insert formula for ta and tb here]
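In the meantime, the standard forms (again, to be checked against Kendall & Gibbons' exact notation) are:

```latex
% tau_a: concordant (C) minus discordant (D) pairs over all
% n(n-1)/2 possible pairs, with no tie correction:
\tau_a = \frac{C - D}{n(n-1)/2}

% tau_b: the standard tie correction, where n_0 = n(n-1)/2 and
% T_1, T_2 count the pairs tied within the first and second variable:
\tau_b = \frac{C - D}{\sqrt{(n_0 - T_1)(n_0 - T_2)}}
```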
Gary Strangman's library in SciPy gives Kendall's tb which has the standard tie correction included (and it calculates the two-tailed p-value):-
>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.kendalltau(x, y)[0]
0.333333333333
>>> print scipy.stats.stats.kendalltau(x, z)[0]
-0.547722557505
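The tau-b values above can be reproduced directly from the definition - count concordant and discordant pairs, then normalise with the tie correction. A minimal pure-Python sketch (kendall_tau_b is my own helper name):-

```python
from math import sqrt
from itertools import combinations

def kendall_tau_b(a, b):
    """Kendall's tau-b: (concordant - discordant) pairs, normalised
    with the standard correction for ties in either variable."""
    n0 = t_a = t_b = 0
    conc_minus_disc = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        n0 += 1
        sa = (a1 > a2) - (a1 < a2)  # sign of the difference in a
        sb = (b1 > b2) - (b1 < b2)  # sign of the difference in b
        if sa == 0:
            t_a += 1                # pair tied within a
        if sb == 0:
            t_b += 1                # pair tied within b
        conc_minus_disc += sa * sb  # +1 concordant, -1 discordant, 0 tie
    return conc_minus_disc / sqrt((n0 - t_a) * (n0 - t_b))

x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(kendall_tau_b(x, y))  # ≈ 0.3333
print(kendall_tau_b(x, z))  # ≈ -0.5477
```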
Michiel de Hoon's library in BioPython is faster according to my tests (using large lists with multiple ties), and also gives Kendall's tb (standard tie correction included):-
>>> import Bio.Cluster
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print 1 - Bio.Cluster.distancematrix((x,y), dist="k")[1][0]
0.333333333333
>>> print 1 - Bio.Cluster.distancematrix((x,z), dist="k")[1][0]
-0.547722557505
We can also access R's Kendall correlation from within Python, again this returns Kendall's tb (standard tie correction included):-
>>> import rpy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print rpy.r.cor(x, y, method="kendall")
0.333333333333
>>> print rpy.r.cor(x, z, method="kendall")
-0.547722557505
The version in R would be simply:
> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> z <- c(1.65, 2.64, 2.64, 6.95)
> cor(x, y, method="kendall")
[1] 0.3333333
> cor(x, z, method="kendall")
[1] -0.5477226