CS347 Fault-tolerant Systems
CS347 15 CATS (7.5 ECTS) Term 2
Availability
Option - CS, DM
Academic Aims
To provide students with a sound knowledge of distributed fault-tolerant systems design. To cover protocols for synchronous distributed systems. To investigate the protocol complexity that arises from system asynchrony, and how to monitor and test distributed applications.
Learning Outcomes
By the end of the module the student should be able to:
- General: Understand dependability attributes, threats and means. Understand the differences between fault, error and failure. Discuss the process by which a fault eventually causes a system failure. Understand the link between a fault model and the corresponding dependability mechanisms. Be familiar with terms such as fail-safe, fail-operational and fail-stop, and with concepts such as fault trees, FMEA and FMECA.
- HW/System: Calculate reliability of a system. Use of tools for reliability modelling. Design of dependable HW.
- Middleware: Understand critical functions such as clock synchronisation, consensus, FDIR protocols, etc. Understand Byzantine failures and their impact on system complexity. Introduction to asynchronous message-passing distributed systems.
- SW: Understand the various methods for SW fault tolerance, including N-version programming (NVP), recovery blocks and run-time checks. Understand the problem of predicate detection.
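As an illustration of the "calculate reliability of a system" outcome above, the following is a minimal sketch of the standard series, parallel and triple modular redundancy (TMR) formulas. It assumes independent component failures and, for TMR, a perfect voter; the function names are hypothetical and are not taken from the module materials.

```python
from math import prod


def series_reliability(rs):
    """Series system: every component must work, so R = product of R_i."""
    return prod(rs)


def parallel_reliability(rs):
    """Parallel system: fails only if all components fail,
    so R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in rs)


def tmr_reliability(r):
    """TMR with a perfect voter: at least 2 of 3 identical modules work,
    so R = r^3 + 3 r^2 (1 - r) = 3 r^2 - 2 r^3."""
    return 3 * r**2 - 2 * r**3
```

For two components of reliability 0.9 each, the series arrangement gives 0.81 and the parallel arrangement gives 0.99, which shows how redundancy raises reliability under the independence assumption.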
Content
- General: Fault, error, failure, fault transformation process. Implications of coverage on dependability, specifications, methods to achieve dependability.
- Middleware: Protocols for synchronous distributed systems (leader election, consensus, clock synchronisation, Byzantine agreement and FDIR).
- Protocols and abstractions for asynchronous distributed systems, including logical and vector clocks, broadcast (best-effort, unordered reliable, ordered reliable), failure detectors, global predicate detection in fault-free and faulty systems.
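To give a flavour of the vector clocks listed above, here is a minimal sketch of the usual update rules for a fixed set of processes. The class and function names are hypothetical, and the sketch assumes process identifiers 0..n-1 and reliable delivery of timestamps; it is an illustration, not the module's reference implementation.

```python
class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, n, pid):
        self.pid = pid
        self.v = [0] * n

    def tick(self):
        """Local event: increment this process's own entry."""
        self.v[self.pid] += 1
        return list(self.v)

    def send(self):
        """Send event: tick, then attach the timestamp to the message."""
        return self.tick()

    def receive(self, ts):
        """Receive event: component-wise maximum with the message
        timestamp, then increment this process's own entry."""
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        self.v[self.pid] += 1
        return list(self.v)


def happened_before(a, b):
    """a -> b iff a <= b component-wise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """Two distinct timestamps are concurrent if neither precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

With two processes, a send on process 0 followed by its receipt on process 1 yields ordered timestamps, while an unrelated local event on process 1 is concurrent with the send.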
Books
- Jalote P, Fault Tolerance in Distributed Systems, Prentice Hall, 1994.
- Lynch N, Distributed Algorithms, Morgan Kaufmann, 1996.
- Gouda M, Elements of Network Protocol Design, John Wiley, 1998.
Assessment
Two-hour examination (70%), programming assignment (30%)
Teaching
30 one-hour lectures