My name is Edward and I'm from Singapore. I worked in the public and private sectors in Singapore for 12 years and enrolled as a full-time PhD student in the Department of Computer Science at the University of Warwick in October 2016. My PhD research is on performance failures in cluster systems, and my supervisor is Dr. Arshad Jhumka.
The growth of large cluster systems and supercomputers has generated a massive and increasing amount of monitoring data. Recent research has demonstrated the value of combining message logs and resource usage data for error detection and failure diagnosis. However, this is a challenging problem: exascale systems are harder to manage because of their complexity, the amount of monitored data, and the interactions between the different system components and nodes, which very often result in a high failure frequency, the manifestation of faults, and diverse error and failure patterns. My PhD research will study the nature and characteristics of system performance failures, develop new data-processing methodologies that use the wealth of monitoring data generated by cluster systems, and implement tools for prototyping and deployment on actual production cluster systems.
I received the Alan Turing Institute Doctoral studentship and the University of Warwick Department of Computer Science scholarship to work on this research.
- E. Chuah, A. Jhumka, J.C. Browne, N. Gurumdimma, S. Narasimhamurthy, B. Barth, Using Message Logs and Resource Use Data for Cluster Failure Diagnosis, in Proceedings of the 23rd IEEE International Conference on High Performance Computing, Data and Analytics, 2016 (Forthcoming).
- N. Gurumdimma, A. Jhumka, M. Liakata, E. Chuah, J.C. Browne, CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems, in Proceedings of the 35th IEEE International Symposium on Reliable Distributed Systems, 2016.
- E. Chuah, A. Jhumka, J.C. Browne, B. Barth, S. Narasimhamurthy, Insights into the Diagnosis of System Failures from Cluster Log Files, in Proceedings of the 11th European Dependable Computing Conference, 2015.
- N. Gurumdimma, A. Jhumka, M. Liakata, E. Chuah, J.C. Browne, Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems, in Proceedings of the 20th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, 2015.
- N. Gurumdimma, A. Jhumka, M. Liakata, E. Chuah, J.C. Browne, Towards Increasing the Error Handling Time Window in Large-Scale Distributed Systems Using Console and Resource Usage Logs, in Proceedings of the 13th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2015.
- A. Pelaez, A. Quiroz, J.C. Browne, E. Chuah, M. Parashar, Online Failure Prediction for HPC Resources Using Decentralized Clustering, in Proceedings of the 20th IEEE International Conference on High Performance Computing, 2014.
- E. Chuah, A. Jhumka, S. Narasimhamurthy, J. Hammond, J.C. Browne, B. Barth, Linking Resource Usage Anomalies with System Failures from Cluster Log Data, in Proceedings of the 32nd International Symposium on Reliable Distributed Systems, 2013.
- E. Chuah, S-h. Kuo, P. Hiew, W-C. Tjhi, G.K.K. Lee, J. Hammond, M.T. Michalewicz, T. Hung, J.C. Browne, Diagnosing the Root-Causes of Failures from Cluster Log Files, in Proceedings of the 16th International Conference on High Performance Computing, 2010.