CS347 Fault-tolerant Systems

CS347 15 CATS (7.5 ECTS) Term 2

Availability

Option - CS, DM

Academic Aims

To provide students with a sound knowledge of distributed fault-tolerant systems design. To cover protocols for synchronous distributed systems. To investigate protocol complexity due to system asynchrony and how to monitor and test distributed applications.

Learning Outcomes

By the end of the module the student should be able to:

  • General: Understand dependability attributes, threats and means. Understand the differences between fault, error and failure. Discuss the process by which a fault eventually causes a system failure. Understand the link between a fault model and the corresponding dependability mechanisms. Be familiar with terms such as fail-safe, fail-operational and fail-stop, and with concepts such as fault trees, FMEA and FMECA.
  • HW/System: Calculate the reliability of a system. Use tools for reliability modelling. Design dependable HW.
  • Middleware: Understand critical functions such as clock synchronisation, consensus and FDIR protocols. Understand Byzantine failures and their impact on system complexity. Understand the basics of asynchronous message-passing distributed systems.
  • SW: Understand the various methods for SW fault tolerance, including N-version programming (NVP), recovery blocks and run-time checks, and the problem of predicate detection.
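
As an illustration of the "calculate the reliability of a system" outcome above, the following is a minimal sketch of the standard combinatorial formulas for series, parallel and triple modular redundancy (TMR) structures. It assumes independent component failures and, for TMR, a perfect voter. The function names are hypothetical and are not taken from the module materials.

```python
from math import prod


def series(reliabilities):
    """Series system: works only if every component works (assumes independence)."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Parallel system: fails only if every component fails (assumes independence)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def tmr(r):
    """TMR with a perfect voter: at least 2 of 3 identical modules must work."""
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    r = 0.9  # illustrative component reliability, not real data
    print("series  :", series([r, r]))    # approximately 0.81
    print("parallel:", parallel([r, r]))  # approximately 0.99
    print("TMR     :", tmr(r))            # approximately 0.972
```

With a component reliability of 0.9, the series arrangement is less reliable than a single component, while the parallel and TMR arrangements are more reliable, which is the usual motivation for redundancy.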

Content

  • General: Fault, error, failure, fault transformation process. Implications of coverage on dependability, specifications, methods to achieve dependability.
  • Middleware: Protocols for synchronous distributed systems (leader election, consensus, clock synchronisation, Byzantine agreement and FDIR).
  • Protocols and abstractions for asynchronous distributed systems, including logical and vector clocks, broadcast (best-effort, unordered reliable, ordered reliable), failure detectors, global predicate detection in fault-free and faulty systems.
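
To make the vector clocks topic above concrete, the following is a minimal sketch of the usual update rules: increment the local entry on each event, and take the component-wise maximum when a message is received. The helper names and the two-process scenario are hypothetical and chosen only for illustration.

```python
def new_clock(n):
    """Initial vector clock for a system of n processes."""
    return [0] * n


def tick(clock, i):
    """Return a copy of clock with process i's entry incremented (local or send event)."""
    updated = list(clock)
    updated[i] += 1
    return updated


def receive(local, received, i):
    """Receive event at process i: component-wise max, then increment own entry."""
    merged = [max(a, b) for a, b in zip(local, received)]
    return tick(merged, i)


def happened_before(a, b):
    """True if timestamp a causally precedes timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither timestamp causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


if __name__ == "__main__":
    send = tick(new_clock(2), 0)     # p0 sends a message: [1, 0]
    local = tick(new_clock(2), 1)    # p1 has an unrelated local event: [0, 1]
    recv = receive(local, send, 1)   # p1 receives p0's message: [1, 2]
    print(happened_before(send, recv))  # True: the send precedes the receive
    print(concurrent(send, local))      # True: no causal link between them
```

Unlike a scalar logical clock, comparing two vector timestamps distinguishes causally ordered events from concurrent ones, which is what makes vector clocks useful for global predicate detection.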

Books

  • Jalote P, Fault Tolerance in Distributed Systems, Prentice Hall, 1994.
  • Lynch N, Distributed Algorithms, Morgan Kaufmann, 1996.
  • Gouda M, Elements of Network Protocol Design, John Wiley, 1998.

Assessment

Two-hour examination (70%), programming assignment (30%)

Teaching

30 one-hour lectures
