My PhD: Modelling Transcription Factor Binding Sites and their Combinatorial Effects

The following text describes how my PhD started. The work has drifted a little since then, although it remains in the same general area of study. The PhD title and this description need to be updated to reflect this.

-----------

Gene regulatory networks depend on the binding of gene regulatory proteins called transcription factors (TFs) to promoters and regulatory modules. ChIP-seq (Chromatin ImmunoPrecipitation sequencing), a new experimental technique, is making it possible to obtain genome-wide data on regions of TF binding.

The challenge is to find motif-finding techniques that pinpoint the exact binding positions within these regions, and to derive mathematical models of TF binding from them. These models may reveal combined binding sites of a TF and its co-factor, and can be used to identify binding sites in species for which experimental data is not available.

Data from the Illumina Solexa ChIP-seq technology that is being established at Warwick will provide the opportunity to develop generalised models of TF binding. These models will then be used to inform the reconstruction of gene regulatory networks, to extend ChIP-seq results to other species, and to propose the design of mutated or artificial promoter sequences for experimental testing.

In the PhD, algorithms will be developed and implemented to analyse the Solexa output data in order to extract TF binding sites. Initial work has already been carried out based on the cisGenome software (Ji, Jiang et al. 2008). These algorithms will allow the binding sites to be mapped back to the genome, and so identify the genes that are under the control of the TF. The resulting data will also allow significant extensions of the network models by adding genes under the TF’s regulatory control that cannot be determined by microarray analysis.
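
As an illustration of mapping binding sites back to genes, the following is a minimal sketch and not the cisGenome procedure. It assumes each binding site is reduced to a single chromosome coordinate and each gene to a single transcription start site (TSS); the names `build_tss_index`, `nearest_gene` and the `max_distance` cut-off are hypothetical, and strand and upstream/downstream orientation are ignored.

```python
from bisect import bisect_left
from collections import defaultdict


def build_tss_index(genes):
    """Group (chrom, tss, name) records by chromosome, sorted by TSS position."""
    index = defaultdict(list)
    for chrom, tss, name in genes:
        index[chrom].append((tss, name))
    for entries in index.values():
        entries.sort()
    return index


def nearest_gene(index, chrom, position, max_distance=1000):
    """Return the gene whose TSS is closest to position, or None if too far away."""
    entries = index.get(chrom, [])
    if not entries:
        return None
    # Locate the insertion point, then compare the neighbours on either side.
    i = bisect_left(entries, (position,))
    candidates = entries[max(i - 1, 0):i + 1]
    tss, name = min(candidates, key=lambda entry: abs(entry[0] - position))
    return name if abs(tss - position) <= max_distance else None


# Made-up example records, for illustration only.
genes = [("chr1", 1000, "geneA"), ("chr1", 5000, "geneB"), ("chr2", 300, "geneC")]
index = build_tss_index(genes)
print(nearest_gene(index, "chr1", 1200))  # site 200 bp from the geneA TSS
print(nearest_gene(index, "chr1", 3000))  # no TSS within max_distance
```

A real analysis would probably treat upstream and downstream sites differently and might assign one site to several genes; the single nearest-TSS rule is only the simplest starting point.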

It is proposed that existing algorithms for identifying TF binding sites will be extended to include the combinatorial action of binding sites using a combination of exhaustive searching and local optimisation.
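
One way to picture the exhaustive part of such a search is sketched below. It is an assumption about what a combinatorial search could look like, not the method used in the PhD: given predicted site positions for two TFs in a set of sequences, it enumerates every pairwise offset up to a hypothetical `max_gap` and counts how many sequences support each one. The function name `spacing_support` and the input layout are invented for the example, and the local optimisation step is not shown.

```python
from collections import Counter


def spacing_support(hits_a, hits_b, max_gap=50):
    """Count, for each offset between a site of A and a site of B,
    the number of sequences in which that offset occurs.

    hits_a, hits_b: dict mapping sequence id -> list of site start positions.
    Returns a list of (offset, sequence_count) pairs, most supported first.
    """
    support = Counter()
    for seq_id, starts_a in hits_a.items():
        starts_b = hits_b.get(seq_id, [])
        offsets_seen = set()  # count each offset once per sequence
        for a in starts_a:
            for b in starts_b:
                offset = b - a
                if 0 < abs(offset) <= max_gap:
                    offsets_seen.add(offset)
        support.update(offsets_seen)
    return support.most_common()


# Made-up site positions, for illustration only.
hits_a = {"s1": [10], "s2": [40], "s3": [5]}
hits_b = {"s1": [22], "s2": [52, 90], "s3": [30]}
ranked = spacing_support(hits_a, hits_b)
print(ranked[0])  # the offset supported by the most sequences
```

An offset that recurs across many bound regions more often than expected by chance would be a candidate for a combined binding site; assessing that significance is outside this sketch.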

It is hoped that other data, such as the way that DNA sequence influences DNA topography, which in turn influences binding (Parker, Hansen et al. 2009), can also be incorporated into the model.

Initial work has focussed on re-evaluating some of the existing scoring algorithms that measure the match between an arbitrary sequence and the consensus binding motif for a TF (e.g. Kel, Gossling et al. 2003), and on their ability to predict TF binding as found in published ChIP data. The choice of scoring mechanism is the bedrock of the rest of the PhD.
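
To make the idea of a scoring algorithm concrete, here is a minimal sketch of a generic log-odds position weight matrix scan. It is not the MATCH algorithm of Kel et al., which uses its own similarity scores. The count matrix is invented for illustration (a CACGTG-like motif), and the pseudocount and uniform background frequency are assumptions.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds scores."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(matrix)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


# Hypothetical counts: one dict per motif position.
counts = [
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
matrix = log_odds_matrix(counts)
hits = scan("TTACACGTGGCA", matrix)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Comparing scoring schemes then amounts to asking which one ranks windows inside ChIP-bound regions above windows elsewhere; the threshold used to call a site is a separate choice that this sketch leaves open.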

This work is being conducted within the wider context of the PRESTA project, which is studying stress responses in plants, but it will have potential application to the wider scientific community.

Jiang, H., F. Wang, N. P. Dyer and W. H. Wong (2010). "CisGenome Browser: A flexible tool for genomic data visualization." Bioinformatics.
Kel, A. E., E. Gossling, et al. (2003). "MATCH: A tool for searching transcription factor binding sites in DNA sequences." Nucleic Acids Res 31(13): 3576-9.
Parker, S. C., L. Hansen, et al. (2009). "Local DNA topography correlates with functional noncoding regions of the human genome." Science 324(5925): 389-92.

Link to PhD related presentations