Ensemble clustering for genetic variation in brain structure
Hierarchical clustering is a workhorse tool in many areas of data-driven science, including neuroscience and molecular biology. However, hierarchical clustering methods are generally algorithmic and lack any principled treatment of uncertainty, usually providing little or no notion of the robustness or reproducibility of the obtained clusters under stochastic variation. In recent years, so-called “ensemble” methods have grown in popularity. These use resampling approaches to generate multiple instances of clustering results, which are then aggregated to find reliable clusters.
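To make the aggregation idea concrete, the following is a minimal sketch of one common ensemble scheme, not a method prescribed by this project. It assumes the items to be clustered are the rows of a data matrix, that resampling is done by bootstrapping the columns (observations), and that results are aggregated through a co-association matrix recording how often each pair of items lands in the same cluster. The function names coassociation and consensus_labels, the choice of average linkage, and the fixed number of clusters are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def coassociation(X, n_clusters=3, n_resamples=200, seed=0):
    """Fraction of resamples in which each pair of rows of X shares a cluster.

    Assumption: rows are the items to cluster and columns are observations,
    which are resampled with replacement (a plain bootstrap).
    """
    rng = np.random.default_rng(seed)
    n_items, n_obs = X.shape
    co = np.zeros((n_items, n_items))
    for _ in range(n_resamples):
        cols = rng.integers(0, n_obs, size=n_obs)      # bootstrap the observations
        Z = linkage(X[:, cols], method="average")      # Euclidean distance by default
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        co += labels[:, None] == labels[None, :]       # 1 where a pair co-clusters
    return co / n_resamples


def consensus_labels(co, n_clusters=3):
    """Cluster the items again, using 1 - co-association as the distance."""
    dist = 1.0 - co
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Entries of the co-association matrix near 0 or 1 indicate pairs whose grouping is stable under resampling, while intermediate values flag pairs whose assignment depends on the particular sample.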
This project will focus on ensemble hierarchical clustering for the study of co-heritability measures of brain structure. With twin or family studies, one can estimate the heritability of a single phenotype, that is, the proportion of variability in the phenotype explained by genetic variation. Likewise, the correlation between a pair of phenotypes can be decomposed, and the genetic contribution (rho_g) to the correlation estimated. Our phenotype is gray matter thickness in 50 different brain regions, and we obtain both point estimates and standard errors of rho_g for each pair of regions. We seek to use rho_g as the (genetic) distance in hierarchical clustering, but wish to account for varying standard errors (i.e. some rho_g's may not be distinguishable from zero) while still having an interpretable distance measure.
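As an illustration of how the standard errors might enter an ensemble, the sketch below perturbs each rho_g estimate by Gaussian noise scaled by its standard error, clusters each perturbed matrix, and records how often pairs of regions co-cluster. This is only one possible approach under strong assumptions, not the project's method: the distance 1 - rho_g, the treatment of the pairwise estimates as independent and Gaussian, the use of average linkage, and the function name perturbed_coassociation are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def perturbed_coassociation(rho_g, se, n_clusters, n_draws=500, seed=0):
    """Co-clustering frequencies under perturbation of rho_g by its standard errors.

    rho_g, se : symmetric (n, n) arrays of point estimates and standard errors.
    Assumptions (illustrative only): errors are independent Gaussians, and
    1 - rho_g serves as the distance between regions.
    """
    rng = np.random.default_rng(seed)
    n = rho_g.shape[0]
    iu = np.triu_indices(n, k=1)       # upper triangle, same order as SciPy's condensed form
    co = np.zeros((n, n))
    for _ in range(n_draws):
        draw = rng.normal(rho_g[iu], se[iu])   # one perturbed value per pair of regions
        draw = np.clip(draw, -1.0, 1.0)        # keep draws in the valid correlation range
        dist = 1.0 - draw                      # condensed distances in [0, 2]
        Z = linkage(dist, method="average")
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        co += labels[:, None] == labels[None, :]
    return co / n_draws
```

Pairs whose rho_g has a large standard error will tend to switch clusters between draws, so their co-clustering frequency moves away from 0 or 1. Note that a perturbed matrix need not be a valid correlation matrix, which is one of the issues a more principled treatment would have to address.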
The project will be motivated by this specific application, but there is also ample scope for general, theoretical investigation of the properties of ensemble clustering methods. While increasingly popular, these methods remain inadequately understood, and there are many open questions concerning why and under what conditions aggregation helps. Equally, it may be possible to develop novel model-based clustering methods that utilize the standard errors of the distance measure, which are usually not available in clustering problems.
Supervisors: Tom Nichols (Zeeman D2.03), Sach Mukherjee (Zeeman D1.08)