Machine Learning Stratification for Oncology Patient Survival
2013 - 2017
(Viva passed with minor corrections Feb 2018)
My PhD focused on the use of statistical and machine learning methods to predict oncology patient survival using clinical and molecular data.
Three main aspects were considered: a systematic review of the existing literature on predicting ovarian cancer patient response to chemotherapy; the development of methods for applying Gaussian process models to survival data, including the integration of feature selection into these methods; and the clinical implementation of data analysis following qPCR-based mutation testing for melanoma, lung and colorectal cancer.
Systematic Review
A systematic review was carried out to investigate the literature on predicting ovarian cancer patient response to chemotherapy using gene expression measurements. The review examined the experimental and statistical methods applied, and compared the predictive ability of the models produced by the included studies. Additionally, using the gene signatures reported by the studies, gene set enrichment analysis was applied to investigate differences between studies involving different chemotherapy treatments.
This systematic review is published: Katherine L Lloyd, Ian A Cree, and Richard S Savage. Prediction of resistance to chemotherapy in ovarian cancer: a systematic review. BMC Cancer, 15(1):117, 2015.
Gaussian processes for survival data
Gaussian process regression is a popular, well-researched machine learning technique that is well suited to noisy, complex and high-dimensional biomedical data. Gaussian processes place a prior on the space of all functions relating features to outcome, with the prior encoding assumptions about properties such as smoothness and stationarity. Given the data, they provide a posterior distribution over the possible functions, allowing predictions to be made about the latent function underlying the data. In this way, both the mean and the variance may be predicted at any point.
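As a minimal illustration of the posterior mean and variance described above, the following Python sketch implements one-dimensional Gaussian process regression using only the standard library. The squared-exponential kernel, the hyperparameter values and all function names are assumptions chosen for illustration; this is not the code used in the thesis.

```python
import math

def rbf(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel: a smooth, stationary prior (assumed here)."""
    return signal_var * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def solve_lower_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(L)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise_var=0.1):
    """Posterior mean and variance of the latent function at x_new."""
    # Kernel matrix of the training inputs, with observation noise on the diagonal.
    K = [[rbf(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))  # K^{-1} y
    k_star = [rbf(a, x_new) for a in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_new, x_new) - sum(vi * vi for vi in v)
    return mean, var

# Toy data (made up for illustration).
x_train = [0.0, 1.0, 2.0]
y_train = [0.0, 0.8, 0.9]
mean_near, var_near = gp_predict(x_train, y_train, 1.0)
mean_far, var_far = gp_predict(x_train, y_train, 50.0)
```

Near the training points the posterior variance falls below the prior variance, while far from the data the prediction reverts to the prior (zero mean, variance equal to `signal_var`).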
However, the application of Gaussian processes to survival data has received much less attention, and is not widely implemented. For Gaussian process models to interpret survival data correctly, additional functionality must be developed to handle right-censored survival times. In the models developed during my PhD, the true survival times of right-censored samples are treated as missing, with each censoring time forming a lower bound on the true, unmeasured survival time. By inferring new survival times for the censored samples, the underlying, uncensored data set may be estimated, allowing predictions to be made as in Gaussian process regression. Three variations on this model have been developed.
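The idea of treating a censoring time as a lower bound can be sketched as follows. This is a simplified illustration only: it assumes that a normal predictive distribution (mean and standard deviation) is already available for each sample, and draws a replacement survival time from that distribution truncated below at the censoring time. The function names and the choice of truncated-normal sampling are assumptions made here, not a description of the three model variations developed in the thesis.

```python
import random
from statistics import NormalDist

def sample_truncated_normal(mean, sd, lower, rng):
    """Draw from Normal(mean, sd) conditioned on the value exceeding `lower`."""
    dist = NormalDist(mean, sd)
    p_lower = dist.cdf(lower)
    # Inverse-CDF sampling restricted to the upper tail (p_lower, 1).
    u = rng.uniform(p_lower, 1.0)
    # inv_cdf requires 0 < u < 1, so clamp away from the end points.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    # Guard against numerical issues in the extreme tail: never go below the bound.
    return max(dist.inv_cdf(u), lower)

def impute_censored(times, censored, pred_mean, pred_sd, seed=0):
    """Return a completed data set: observed times are kept, censored times
    are replaced by draws that respect the censoring lower bound."""
    rng = random.Random(seed)
    completed = []
    for t, is_censored, m, s in zip(times, censored, pred_mean, pred_sd):
        if is_censored:
            completed.append(sample_truncated_normal(m, s, t, rng))
        else:
            completed.append(t)
    return completed

# Toy example (made-up values): the second and third samples are right-censored.
times = [2.0, 3.5, 1.0]
censored = [False, True, True]
completed = impute_censored(times, censored,
                            pred_mean=[2.0, 3.0, 2.0],
                            pred_sd=[1.0, 1.0, 1.0])
```

In a full model, the imputed times would be fed back into Gaussian process regression and the two steps alternated or sampled jointly; that loop is omitted here.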
Additionally, when considering the application of Gaussian processes to biomedical data, feature selection is likely to be beneficial. Two feature selection methods are under development: Informed ARD and Random Subset Feature Selection. Informed ARD is a modification of the Automatic Relevance Determination kernel, and allows the grouping of features. Random Subset Feature Selection is a Bayesian model averaging technique with random feature subset selection.
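For reference, the standard Automatic Relevance Determination kernel assigns each feature its own length scale, so that a feature with a very large length scale has almost no influence on the kernel value. The sketch below shows this standard kernel, together with one hypothetical way of grouping features: features in the same group share a single length scale. The grouping scheme, group names and function names are assumptions for illustration and are not taken from the Informed ARD method itself.

```python
import math

def ard_kernel(x1, x2, length_scales, signal_var=1.0):
    """Standard ARD squared-exponential kernel: one length scale per feature."""
    sq_dist = sum(((a - b) / l) ** 2
                  for a, b, l in zip(x1, x2, length_scales))
    return signal_var * math.exp(-0.5 * sq_dist)

def grouped_length_scales(groups, group_scales):
    """Expand one length scale per group into one per feature.

    Hypothetical grouping scheme: every feature in a group shares that
    group's length scale.
    """
    return [group_scales[g] for g in groups]

# Three features: two belong to an assumed "pathway_a" group, one is "clinical".
groups = ["pathway_a", "pathway_a", "clinical"]
scales = grouped_length_scales(groups, {"pathway_a": 1.0, "clinical": 1e6})

# Changing only the effectively irrelevant feature barely changes the kernel value...
k_irrelevant = ard_kernel([0.0, 0.0, 0.0], [0.0, 0.0, 5.0], scales)
# ...whereas changing a relevant feature reduces it noticeably.
k_relevant = ard_kernel([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], scales)
```

Learning the length scales from data then acts as a soft form of feature selection, since features with large learned length scales are effectively switched off.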
qPCR mutation testing analysis program
A qPCR-based mutation test for common, actionable mutations in non-small cell lung cancer, melanoma and colorectal cancer was developed by the Cree group, as described in:
Hugh Kikuchi, Anne Reiman, Jenifer Nyoni, Katherine Lloyd, Richard Savage, Tina Wotherspoon, Lisa Berry, David Snead, and Ian A Cree. Development and validation of a TaqMan array for cancer mutation analysis. Pathogenesis, 3(1):1–8, 2016.
For this test to be implemented clinically, a program was needed to carry out the data analysis that follows testing. The data generated by the test must be analysed before the results are clinically usable, and this analysis has to be carried out by non-specialists. An analysis program was therefore developed to make data analysis and report preparation simple, fast and reliable, making the test clinically feasible.
Both the test and the analysis program are now in clinical use at University Hospital of Coventry and Warwickshire (UHCW).