Hello, my name is Edward. Prior to my PhD studies, I worked in Singapore for 11 years (1.5 years in software engineering, 2 years in teaching, 7.5 years in R&D). My former PhD supervisor is Dr. Arshad Jhumka. I successfully defended my thesis on 28 May 2020 and received the PhD in Computer Science on 7 July 2020. The title of my thesis is "Features Correlation-based Workflows for High-Performance Computing Systems Diagnosis".
With this, my studies at Warwick have ended. It was a pleasure working with my collaborators during my time here.
My current interests lie at the intersection of fault tolerance, distributed systems and data analytics. I also have a general interest in anomaly detection, causal inference, networking and security.
Modern data centres and HPC systems comprise complex combinations of networks, processors, storage systems and operating systems. Recent research has demonstrated the value of combining system failure logs with resource utilisation data for failure diagnosis (and error detection). However, the massive amount of data that large HPC systems generate presents a significant challenge in processing the data effectively for failure diagnosis.
My PhD research addressed this challenge by developing a new framework for HPC systems diagnosis. The framework uses resource use data and system logs in its analyses. I evaluated multiple feature extraction methods and correlation algorithms and implemented two diagnostics workflows, called CORRMEXT and EXERMEST. CORRMEXT diagnoses error cases that occur frequently, while EXERMEST diagnoses error cases that are rare. The outcome of my thesis is a set of recommendations on what systems administrators should look out for in the resource use data and system logs.
During my PhD studies, I collaborated with R&D teams at Intel, the Texas Advanced Computing Center at The University of Texas at Austin, and the Rutgers Discovery Informatics Institute at Rutgers University on fault tolerance for large distributed systems using statistics and data analytics.
My PhD research was supported by an Alan Turing Institute Doctoral studentship, The University of Warwick Department of Computer Science scholarship and the Data Science at Scale programme, in partnership with Intel.
- E. Chuah, A. Jhumka, S. Alt, D. Balouek-Thomert, J.C. Browne, M. Parashar, Towards Comprehensive Dependability-Driven Resource Use and Message Log-Analysis for HPC Systems Diagnosis, Journal of Parallel and Distributed Computing (JPDC), vol. 132c, pp. 95-112, October 2019.
- E. Chuah, A. Jhumka, S. Narasimhamurthy, J. Hammond, J.C. Browne, B. Barth, Linking Resource Usage Anomalies with System Failures from Cluster Log Data, in Proceedings of the 32nd IEEE International Symposium on Reliable Distributed Systems (SRDS), 2013.
- E. Chuah, S-h. Kuo, P. Hiew, W-C. Tjhi, G. Lee, J. Hammond, M.T. Michalewicz, T. Hung, J.C. Browne, Diagnosing the Root-Causes of Failures from Cluster Log Files, in Proceedings of the 16th IEEE International Conference on High-Performance Computing (HiPC), 2010.
My DBLP page.
Software tools developed:
- CORRMEXT - A dependability-driven resource use and message log analysis tool
- EXERMEST - A tool for diagnosing rare error cases in HPC systems using resource use data and system logs
Services to the community:
- 2020: Invited reviewer for IEEE Access.
- 2020: Invited reviewer for the 2nd International Conference on Machine Learning and Intelligent Systems (MLIS 2020).
- 2018: Invited reviewer for Software: Practice and Experience, Wiley. Recognised as a distinguished referee.
- 2018: Invited reviewer for ACM Computing Surveys.
- 2017 Term 1: Volunteer teaching associate for the Advanced Database course in the Computer Science department, University of Warwick.
edwardchuah at acm dot org