About Me

My name is Edward and I'm from Singapore. I worked in the public and private sectors in Singapore for 12 years and enrolled as a full-time PhD student in the Department of Computer Science at the University of Warwick in October 2016. My PhD research topic is performance failures in cluster systems, and my supervisor is Dr. Arshad Jhumka.

Research

The growth of large cluster systems and supercomputers has generated a massive and increasing amount of monitoring data. Recent research has demonstrated the value of combining message logs and resource usage data for error detection and failure diagnosis. However, this is a challenging problem: exascale systems are harder to manage because of their complexity, the volume of monitored data and the interactions between the different system components and nodes, which very often result in a high failure frequency, the manifestation of faults and diverse error and failure patterns. My PhD research will study the nature and characteristics of system performance failures, develop new data-processing methodologies that use the wealth of monitoring data generated by cluster systems, and implement tools for prototyping and deployment on actual production cluster systems.

I received the Alan Turing Institute Doctoral studentship and the University of Warwick Department of Computer Science scholarship to work on this research.

Related Publications

  1. E. Chuah, A. Jhumka, J.C. Browne, N. Gurumdimma, S. Narasimhamurthy, B. Barth, Using Message Logs and Resource Use Data for Cluster Failure Diagnosis, in Proceedings of 23rd IEEE International Conference on High Performance Computing, Data and Analytics, 2016 (Forthcoming).
  2. N. Gurumdimma, A. Jhumka, M. Liakata, E. Chuah, J.C. Browne, CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems, in Proceedings of 35th IEEE International Symposium on Reliable Distributed Systems, 2016.
  3. E. Chuah, A. Jhumka, J.C. Browne, B. Barth, S. Narasimhamurthy, Insights into the Diagnosis of System Failures from Cluster Log Files, in Proceedings of 11th European Dependable Computing Conference, 2015.
  4. N. Gurumdimma, A. Jhumka, M. Liakata, E. Chuah, J.C. Browne, Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems, in Proceedings of 20th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, 2015.
  5. N. Gurumdimma, A. Jhumka, M. Liakata, E. Chuah, J.C. Browne, Towards Increasing the Error Handling Time Window in Large-Scale Distributed Systems Using Console and Resource Usage Logs, in Proceedings of 13th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2015.
  6. A. Pelaez, A. Quiroz, J.C. Browne, E. Chuah, M. Parashar, Online failure prediction for HPC resources using decentralized clustering, in Proceedings of the 20th IEEE International Conference on High Performance Computing, 2014.
  7. E. Chuah, A. Jhumka, S. Narasimhamurthy, J. Hammond, J.C. Browne, B. Barth, Linking Resource Usage Anomalies with System Failures from Cluster Log Data, in Proceedings of 32nd International Symposium on Reliable Distributed Systems, 2013.
  8. E. Chuah, S-h. Kuo, P. Hiew, W-C. Tjhi, G.K.K. Lee, J. Hammond, M.T. Michalewicz, T. Hung, J.C. Browne, Diagnosing the Root-Causes of Failures from Cluster Log Files, in Proceedings of the 16th International Conference on High Performance Computing, 2010.

Contact details:

@ The University of Warwick E dot Chuah at warwick dot ac dot uk

Computer Science, University of Warwick, Coventry CV4 7AL, UK.

@ The Alan Turing Institute echuah at turing dot ac dot uk

The Alan Turing Institute, 96 Euston Road, London NW1 2DB, UK.