Philipp Hermann, Institute of Applied Statistics, Johannes Kepler University Linz, Austria
LDJump: Estimating Variable Recombination Rates from Population Genetic Data
Recombination is a process during meiosis that starts with the formation of DNA double-strand breaks and results in an exchange of genetic material between homologous chromosomes. In many species, recombination is concentrated in narrow regions known as hotspots, flanked by large zones with low recombination. As recombination plays an important role in evolution, its estimation and the identification of hotspot positions are of considerable interest. In this talk we introduce LDJump, our method to estimate local population recombination rates using relevant summary statistics as explanatory variables in a regression model. More precisely, we divide the DNA sequence into small segments and estimate the recombination rate per segment via the regression model. To obtain change-points in recombination, we apply a frequentist segmentation method. This approach controls the type I error and provides confidence bands for the estimator. Overall, LDJump identifies hotspots with high accuracy across different levels of genetic diversity and demographic scenarios, and it is computationally fast even for genomic regions spanning many megabases. We will present a practical application of LDJump to a region of human chromosome 21 and compare our estimated population recombination rates with experimentally measured recombination events.
(joint work with Andreas Futschik, Irene Tiemann-Boege, and Angelika Heissl)
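The two-stage idea in the abstract above (per-segment estimates, then a search for change-points) can be illustrated with a small toy sketch. It is not the LDJump implementation: the per-segment rates are made-up numbers standing in for regression output, and the segmentation step is a plain greedy binary segmentation with a fixed threshold, which, unlike the frequentist method mentioned in the talk, neither controls the type I error nor yields confidence bands. All function names and the `min_gain` parameter are hypothetical.

```python
import numpy as np

def split_into_segments(n_sites, n_segments):
    """Partition n_sites positions into near-equal (start, end) index pairs."""
    edges = np.linspace(0, n_sites, n_segments + 1).astype(int)
    return [(int(a), int(b)) for a, b in zip(edges[:-1], edges[1:])]

def best_split(values):
    """Return the split index with the largest drop in squared error, and that drop."""
    total = np.sum((values - values.mean()) ** 2)
    best_idx, best_gain = None, 0.0
    for i in range(1, len(values)):
        left, right = values[:i], values[i:]
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if total - cost > best_gain:
            best_idx, best_gain = i, total - cost
    return best_idx, best_gain

def binary_segmentation(values, min_gain, offset=0):
    """Greedy stand-in for the segmentation step: split while the gain is at least min_gain."""
    if len(values) < 2:
        return []
    idx, gain = best_split(values)
    if idx is None or gain < min_gain:
        return []
    left = binary_segmentation(values[:idx], min_gain, offset)
    right = binary_segmentation(values[idx:], min_gain, offset + idx)
    return left + [offset + idx] + right

# Hypothetical per-segment rate estimates: a background rate with one hotspot.
segments = split_into_segments(n_sites=1000, n_segments=10)
rates = np.array([0.01] * 4 + [0.10] * 2 + [0.01] * 4)
change_points = binary_segmentation(rates, min_gain=0.001)
hotspot = (segments[change_points[0]][0], segments[change_points[1] - 1][1])
```

Under these assumptions the two detected change-points bracket the segments with the elevated rate, and `hotspot` maps them back to sequence positions.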
Professor Dr. Ingo Scholtes, Data Analytics Group, University of Zürich
Optimal Higher-Order Network Analytics for Time Series Data
Network-based data analysis techniques such as graph mining, social network analysis, link prediction and clustering are an important foundation for data science applications in computer science, computational social science, economics and bioinformatics. They help us to detect patterns in large corpora of data that capture relations between genes, brain regions, species, humans, documents, or financial institutions. While this potential of the network perspective is undisputed, advances in data sensing and collection increasingly provide us with high-dimensional, temporal, and noisy data on real systems. The complex characteristics of such data sources pose fundamental challenges for network analytics. They call into question the validity of network abstractions of complex systems and pose a threat to interdisciplinary applications of data analytics and machine learning.
To address these challenges, I introduce a graphical modelling framework that accounts for the complex characteristics of real-world data on complex systems. I demonstrate this approach on time series data from technical, biological, and social systems. Current methods to analyze the topology of such systems discard information on the timing and ordering of interactions, even though this information determines which elements of a system can influence each other via paths. To solve this issue, I introduce a modelling framework that (i) generalises standard network representations towards multi-order graphical models for causal paths, and (ii) uses statistical learning to achieve an optimal balance between explanatory power and model complexity. The framework advances the theoretical foundation of data science and sheds light on the important question of when network representations of time series data are justified. It is the basis for a new generation of data analytics and machine learning techniques that account for both temporal and topological characteristics in real-world data.
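Why the ordering of interactions matters can be seen in a minimal sketch. It assumes the time series has already been reduced to a list of observed paths; the toy data and the helper `kth_order_counts` are hypothetical and not part of the speaker's framework. The sketch covers only the contrast between a first-order and a second-order representation, not the statistical model selection in point (ii).

```python
from collections import Counter

def kth_order_counts(paths, k):
    """Count every observed sub-path with k edges (k + 1 nodes)."""
    counts = Counter()
    for path in paths:
        for i in range(len(path) - k):
            counts[tuple(path[i:i + k + 1])] += 1
    return counts

# Hypothetical observed paths: everything passes through node "c".
paths = [("a", "c", "d"), ("b", "c", "e")]

first_order = kth_order_counts(paths, 1)   # ordinary edge counts
second_order = kth_order_counts(paths, 2)  # which edge follows which

# The first-order network contains the edges a->c and c->e, so it suggests
# that "a" can reach "e" via "c" ...
reachable_in_first_order = ("a", "c") in first_order and ("c", "e") in first_order
# ... but the path a->c->e never occurs in the data.
observed_in_data = second_order[("a", "c", "e")] > 0
```

Here `reachable_in_first_order` is true while `observed_in_data` is false: the aggregated network implies a path that the ordered data do not support, which is the kind of discrepancy a higher-order model is meant to capture.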