Using Python (and R) to calculate Rank Correlations
You might also be interested in my pages on doing Linear Regressions with Python and/or R.
This page covers:
- Ranking data
- Rank based Correlations
- Spearman's Rho
- Kendall's Tau
Ranking data
Rank correlations are computed on the ranks of the data rather than on the raw values themselves. This can be very advantageous when dealing with data containing outliers.
For example, given two sets of data, say x = [5.05, 6.75, 3.21, 2.66] and y = [1.65, 26.5, -5.93, 7.96], with some ordering (here numerical) we can give them the ranks [3, 4, 2, 1] and [2, 4, 1, 3] respectively.
Tied ranks are usually assigned using the midrank method, whereby those entries receive the mean of the ranks they would have received had they not been tied. Thus z = [1.65, 2.64, 2.64, 6.95] would yield ranks [1, 2.5, 2.5, 4] using the midrank method.
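To make the midrank method concrete, here is a minimal pure-Python sketch (the function name midrank is my own, not from any of the libraries discussed below):-

```python
def midrank(values):
    """Rank values from 1, giving tied entries the mean of the
    ranks they would have received had they not been tied."""
    n = len(values)
    ranks = [0.0] * n
    # Indices of the data in sorted order.
    order = sorted(range(n), key=lambda i: values[i])
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Sorted positions i..j share the mean of ranks i+1 .. j+1.
        mean_rank = (i + j + 2) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

print(midrank([1.65, 2.64, 2.64, 6.95]))  # [1.0, 2.5, 2.5, 4.0]
```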
We can do this in Python using Gary Strangman's rankdata function from the stats module in SciPy:-
>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> print scipy.stats.stats.rankdata(x)
[ 3. 4. 2. 1.]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> print scipy.stats.stats.rankdata(y)
[ 2. 4. 1. 3.]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.rankdata(z)
[ 1. 2.5 2.5 4.]
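If you are on a more recent SciPy and Python 3, the same function is exposed directly as scipy.stats.rankdata (and print is a function); ties still get midranks by default:-

```python
from scipy.stats import rankdata

print(rankdata([5.05, 6.75, 3.21, 2.66]))  # [3. 4. 2. 1.]
print(rankdata([1.65, 2.64, 2.64, 6.95]))  # ties get the midrank, 2.5 here
```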
This functionality is built into the R language:-
> x <- c(5.05, 6.75, 3.21, 2.66)
> rank(x)
[1] 3 4 2 1
> y <- c(1.65, 26.5, -5.93, 7.96)
> rank(y)
[1] 2 4 1 3
> z <- c(1.65, 2.64, 2.64, 6.95)
> rank(z)
[1] 1.0 2.5 2.5 4.0
Rank based Correlations
The two main correlations used for comparing such ranked data are known as the Spearman Rank Correlation (Spearman's ρ or Spearman's Rho) and Kendall's Tau (τ).
Both have several variants (e.g. rs, rsa and rsb for Spearman's ρ) which deal with the situation of tied data in different ways.
Using Spearman's ρ as an example: there are no ties in x or y, so rs(x,y) and rsb(x,y) are both 0.40 (to 2 d.p.). However, z does have ties, so rs(x,z) = -0.55 (to 2 d.p., no tie correction) differs from rsb(x,z) = -0.63 (to 2 d.p., with the tie correction).
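These numbers are easy to check by hand with the classical no-tie-correction formula rs = 1 - 6Σd²/(n(n²-1)), where d is the difference between paired ranks. A short sketch (using scipy.stats.rankdata for the midranks; spearman_rs is my own helper name):-

```python
from scipy.stats import rankdata

def spearman_rs(a, b):
    """Spearman's rs with NO tie correction: 1 - 6*sum(d^2)/(n^3 - n)."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    d2 = sum((i - j) ** 2 for i, j in zip(ra, rb))
    return 1 - 6 * d2 / (n ** 3 - n)

x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(spearman_rs(x, y))  # ≈ 0.40
print(spearman_rs(x, z))  # ≈ -0.55 (midranks used, but no tie correction)
```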
The notation I am using is from the 5th edition (published 1990) of "Rank Correlation Methods", by Maurice Kendall and Jean Dickinson Gibbons (ISBN 0-85264-305-5, first published in 1948).
To date, I have found two existing Python libraries with support for these correlations (Spearman and Kendall):
- Gary Strangman's stats.py (last updated in 2003, includes Linear Regression). Travis Oliphant incorporated an earlier version of this into SciPy - Scientific tools for Python in 2002.
- Michiel de Hoon's PyCluster module (which is also included as Bio.Cluster in BioPython).
I have also used the R language (for statistical computing and graphics) from within Python using the package RPy (R from Python) to calculate these rank correlations.
Spearman's Rho
[Insert formula for rs, rsa and rsb here]
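In the meantime, the standard textbook forms (which should be checked against the exact notation in Kendall & Gibbons) are:

```latex
% Spearman's r_s with no tie correction, where d_i is the
% difference between the paired ranks and n the sample size:
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

% The tie-corrected r_{sb} is equivalent to a Pearson product-moment
% correlation computed on the (mid)ranks R_i and S_i:
r_{sb} = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}
              {\sqrt{\sum_i (R_i - \bar{R})^2 \sum_i (S_i - \bar{S})^2}}
```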
Gary Strangman's library in SciPy gives rs which has NO TIE CORRECTION included (plus it also calculates the two-tailed p-value):-
>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.spearmanr(x, y)[0]
0.4
>>> print scipy.stats.stats.spearmanr(x, z)[0]
-0.55
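A caution for anyone repeating this on a later SciPy: the bundled spearmanr was subsequently reworked to compute a Pearson correlation of the midranks, which does include the tie correction (i.e. it returns rsb), so the x, z example gives -0.63 rather than -0.55:-

```python
from scipy.stats import spearmanr

x = [5.05, 6.75, 3.21, 2.66]
z = [1.65, 2.64, 2.64, 6.95]
rho, p = spearmanr(x, z)
print(rho)  # ≈ -0.6325 in recent SciPy releases (tie corrected)
```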
On the other hand, Michiel de Hoon's library (available in BioPython or standalone as PyCluster) returns Spearman rsb which does include a tie correction:-
>>> import Bio.Cluster
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print 1 - Bio.Cluster.distancematrix((x,y), dist="s")[1][0]
0.4
>>> print 1 - Bio.Cluster.distancematrix((x,z), dist="s")[1][0]
-0.632455532034
The distancematrix function takes a "matrix" and returns the distances between each pair of rows (in this case, x and y, or x and z). This information could be stored as a symmetric matrix (with zeroes on the diagonal), but for efficiency it isn't - see help(Bio.Cluster.distancematrix) for more information.
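Since rsb is numerically the same as a Pearson correlation of the midranks, there is an easy cross-check without Bio.Cluster (a sketch using NumPy and scipy.stats.rankdata):-

```python
import numpy as np
from scipy.stats import rankdata

x = [5.05, 6.75, 3.21, 2.66]
z = [1.65, 2.64, 2.64, 6.95]
# Pearson correlation of the two rank vectors = Spearman's rsb.
r = np.corrcoef(rankdata(x), rankdata(z))[0, 1]
print(r)  # ≈ -0.6325, matching the tie corrected value above
```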
We can also access R's Spearman correlation from within Python, again this uses the Spearman rsb which does include a tie correction:-
>>> import rpy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print rpy.r.cor(x, y, method="spearman")
0.4
>>> print rpy.r.cor(x, z, method="spearman")
-0.632455532034
This could be done in R as follows:
> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> z <- c(1.65, 2.64, 2.64, 6.95)
> cor(x, y, method="spearman")
[1] 0.4
> cor(x, z, method="spearman")
[1] -0.6324555
Kendall's Tau
[Insert formula for ta and tb here]
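In the meantime, the standard forms (again, to be checked against Kendall & Gibbons' exact notation) are:

```latex
% tau_a: concordant (C) minus discordant (D) pairs over all
% n(n-1)/2 possible pairs, with no tie correction:
\tau_a = \frac{C - D}{n(n-1)/2}

% tau_b: the standard tie correction, where n_0 = n(n-1)/2 and
% T_1, T_2 count the pairs tied within the first and second variable:
\tau_b = \frac{C - D}{\sqrt{(n_0 - T_1)(n_0 - T_2)}}
```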
Gary Strangman's library in SciPy gives Kendall's tb which has the standard tie correction included (and it calculates the two-tailed p-value):-
>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.kendalltau(x, y)[0]
0.333333333333
>>> print scipy.stats.stats.kendalltau(x, z)[0]
-0.547722557505
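The tau-b values above can be reproduced directly from the definition - count concordant and discordant pairs, then normalise with the tie correction. A minimal pure-Python sketch (kendall_tau_b is my own helper name):-

```python
from math import sqrt
from itertools import combinations

def kendall_tau_b(a, b):
    """Kendall's tau-b: (concordant - discordant) pairs, normalised
    with the standard correction for ties in either variable."""
    n0 = t_a = t_b = 0
    conc_minus_disc = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        n0 += 1
        sa = (a1 > a2) - (a1 < a2)  # sign of the difference in a
        sb = (b1 > b2) - (b1 < b2)  # sign of the difference in b
        if sa == 0:
            t_a += 1                # pair tied within a
        if sb == 0:
            t_b += 1                # pair tied within b
        conc_minus_disc += sa * sb  # +1 concordant, -1 discordant, 0 tie
    return conc_minus_disc / sqrt((n0 - t_a) * (n0 - t_b))

x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(kendall_tau_b(x, y))  # ≈ 0.3333
print(kendall_tau_b(x, z))  # ≈ -0.5477
```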
Michiel de Hoon's library in BioPython is faster according to my tests (using large lists with multiple ties), and also gives Kendall's tb (standard tie correction included):-
>>> import Bio.Cluster
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print 1 - Bio.Cluster.distancematrix((x,y), dist="k")[1][0]
0.333333333333
>>> print 1 - Bio.Cluster.distancematrix((x,z), dist="k")[1][0]
-0.547722557505
We can also access R's Kendall correlation from within Python, again this returns Kendall's tb (standard tie correction included):-
>>> import rpy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print rpy.r.cor(x, y, method="kendall")
0.333333333333
>>> print rpy.r.cor(x, z, method="kendall")
-0.547722557505
The version in R would be simply:
> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> z <- c(1.65, 2.64, 2.64, 6.95)
> cor(x, y, method="kendall")
[1] 0.3333333
> cor(x, z, method="kendall")
[1] -0.5477226