
Prediction of bacterial host and niche specificity from genomic data with machine learning methods

Primary Supervisor: Dr Danesh Moradigaravand, Institute of Cancer and Genomic Sciences

Secondary supervisor: Professor Jean-Baptiste Cazier

PhD project title: Prediction of bacterial host and niche specificity from genomic data with machine learning methods

University of Registration: University of Birmingham

Project outline:

Escherichia coli is a major bacterial species that can reside in the gastrointestinal tracts of animals and humans, as well as in environmental sites. Commensal E. coli strains of zoonotic origin are recognized to underlie infections in humans and, as a result, impose substantial costs on healthcare systems. Antimicrobial-resistant (AMR) and virulent strains are a leading cause of diarrhoeal and bloodstream infections in humans worldwide. Despite this clinical importance, knowledge of the epidemiology of bacterial strains within food animals and of their transmission routes to human hosts is scarce. Furthermore, the risk factors that facilitate the spread of the bacterium across hosts are largely understudied.

Over the past decade, Whole Genome Sequencing (WGS) of bacterial strains has proven an effective means of determining, at fine resolution, the population structure of bacterial strains across human and non-human sources. In the proposed project, the student will use the wealth of publicly available genomic data to develop genomic epidemiology and predictive modelling frameworks based on machine learning, in order to address the following questions:

Q1- What is the rate of host switching for E. coli clones circulating across environments and hosts?

Q2- What are the biomarkers for host-specific E. coli strains?

Q3- What is the adaptive significance of these biomarkers?

Q4- Can host specificity of E. coli strains be predicted from genomic data?

Over the course of the project, the student will retrieve the genomic data and associated metadata from global collections, most of which are available in the Enterobase database. They will then curate a genomic dataset consisting of representative populations of E. coli strains from clinical and non-clinical settings, including domestic and wild animals and environmental sites. Next, they will reconstruct the history of the population with phylogenetic methods to identify both host-specific and between-host clones of E. coli. For lineages consisting of strains recovered from different hosts, Bayesian methods will be used to estimate the times of divergence and of host-switching events (Q1).
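As a rough illustration of what counting host switches on a phylogeny involves, the sketch below applies Fitch parsimony to a toy tree. This is a deliberately simpler stand-in, not the Bayesian dating approach proposed for the project: it gives only the minimum number of host changes and no timing. The tree shape, strain names and host labels are all hypothetical, and the tree is assumed to be strictly bifurcating.

```python
def fitch_switches(tree, host_of):
    """Return (candidate host set at this node, minimum host switches below it).

    `tree` is either a strain name (leaf) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):  # leaf: a sampled strain with a known host
        return {host_of[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_switches(left, host_of)
    right_set, right_cost = fitch_switches(right, host_of)
    shared = left_set & right_set
    if shared:
        # Both subtrees can agree on a host, so no extra switch is needed here.
        return shared, left_cost + right_cost
    # No common host: at least one switch must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical five-strain lineage sampled from two hosts.
toy_tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
toy_hosts = {"s1": "human", "s2": "human", "s3": "cattle",
             "s4": "cattle", "s5": "cattle"}

root_hosts, n_switches = fitch_switches(toy_tree, toy_hosts)
print(root_hosts, n_switches)
```

Dividing such counts by the total branch length of a dated tree would give a crude switching rate; a Bayesian analysis would additionally account for uncertainty in the tree and in the ancestral hosts.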

The clones will then be classified according to their host specificity. The student will develop a Genome Wide Association Study (GWAS) based framework to identify robust, indicative and predictive genetic markers of host specificity, and will examine the functional implications of these variants by conducting Gene Ontology analysis (Q2, Q3). This aims to establish whether adaptation to hosts has an underlying genetic basis.
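The core of such an association scan can be sketched as a per-variant test of whether carrying a variant is associated with the host of isolation. The example below assumes binary presence/absence calls and uses a two-sided Fisher exact test with a Bonferroni correction; the variant names, genotype calls and host labels are invented for illustration. It omits the correction for population structure that a bacterial GWAS would normally need, because clonally related strains are not independent samples.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def scan_variants(genotypes, hosts, target_host):
    """Test each variant for association with `target_host`; Bonferroni-adjust."""
    raw = []
    for name, calls in genotypes.items():
        a = sum(1 for g, h in zip(calls, hosts) if g and h == target_host)
        b = sum(1 for g, h in zip(calls, hosts) if not g and h == target_host)
        c = sum(1 for g, h in zip(calls, hosts) if g and h != target_host)
        d = sum(1 for g, h in zip(calls, hosts) if not g and h != target_host)
        raw.append((name, fisher_exact_two_sided(a, b, c, d)))
    n_tests = len(raw)
    adjusted = [(name, min(1.0, p * n_tests)) for name, p in raw]
    return sorted(adjusted, key=lambda item: item[1])


# Hypothetical data: ten strains, five from each host, two candidate variants.
hosts = ["human"] * 5 + ["animal"] * 5
genotypes = {
    "variant_A": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # tracks the host perfectly
    "variant_B": [1, 0, 1, 0, 1, 1, 0, 1, 0, 0],  # roughly evenly spread
}

results = scan_variants(genotypes, hosts, target_host="human")
for name, p_adj in results:
    print(name, round(p_adj, 4))
```

Variants that survive the multiple-testing correction would be the candidates carried forward to functional annotation.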

In the final step, the student will develop a machine learning model to predict host specificity from genomic data (Q4). The model will take the genomic variants as predictors and the labelled site of isolation, e.g. environment, human or non-human host, as the response variable. The student will then tune and optimize the model to improve its accuracy, and the model will be released as a publicly available package. Following model development, the student will conduct feature importance analysis to identify the most important features in the model. In particular, the machine learning methods will make it possible to capture higher-order interactions between genetic variants and so reconstruct the genotype-phenotype map of host specificity.
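The prediction and feature-importance steps can be illustrated with a very small classifier. The sketch below assumes binary variant presence/absence as predictors and the host of isolation as the label, and uses a Bernoulli naive Bayes model with permutation importance. That model is chosen here only because it fits in a few lines; the project does not specify a model, and naive Bayes cannot capture the higher-order interactions mentioned above. The data are invented, and importance is computed on the training set purely for brevity; held-out data should be used in practice.

```python
import math
import random
from collections import Counter


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit Bernoulli naive Bayes on rows of 0/1 variant calls."""
    n_features = len(X[0])
    class_counts = Counter(y)
    present = {c: [0] * n_features for c in class_counts}
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            present[label][j] += v
    model = {}
    for c, n_c in class_counts.items():
        # Laplace-smoothed probability that each variant is present in class c.
        probs = [(present[c][j] + alpha) / (n_c + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (math.log(n_c / len(y)), probs)
    return model


def predict(model, row):
    """Return the host label with the highest log-posterior score."""
    def score(item):
        log_prior, probs = item[1]
        return log_prior + sum(math.log(p if v else 1.0 - p)
                               for v, p in zip(row, probs))
    return max(model.items(), key=score)[0]


def accuracy(model, X, y):
    return sum(predict(model, r) == t for r, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one variant column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: six strains, three variants; variant 0 tracks the host.
X = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]]
y = ["human", "human", "human", "animal", "animal", "animal"]

model = train_bernoulli_nb(X, y)
importances = permutation_importance(model, X, y)
print(predict(model, [1, 1, 0]), importances)
```

A model able to represent interactions between variants, such as a tree ensemble, could be substituted for the naive Bayes model without changing the permutation-importance step.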

The output of the thesis will be two major articles on the genomic epidemiology and evolution of E. coli across hosts, and a methods paper presenting a tool that allows researchers to estimate, from its genomic context, the likelihood that a strain can spread across hosts. This will be particularly helpful for clinicians who need to determine, from the genomic variants of the bacterium, the risk that strains recovered from food animals will be transmitted to human hosts.

BBSRC Strategic Research Priority: Understanding the rules of life: Microbiology

Techniques that will be undertaken during the project:

  • Genomic Epidemiology
  • Genomic data analysis
  • Genome Wide Association Study (GWAS)
  • Statistical Machine Learning

Contact: Dr Danesh Moradigaravand, University of Birmingham