Syllabus for EDA121 - Fault-tolerant computer systems

Owner: DCMAS
4,0 Credits (ECTS 6)
Grading: TH - Five, Four, Three, Not passed
Level: A
Department: 37 - COMPUTER SCIENCE AND ENGINEERING
Teaching language: English
Course module, credit distribution (Sp1, Sp2, Sp3, Sp4, No Sp) and examination dates
0103 Examination, 3,0 c, Grading: TH
Credit distribution: 3,0 c
Examination dates: 26 Oct 2006 am M, 13 Jan 2007 pm V, 27 Aug 2007 am M
0203 Laboratory, 1,0 c, Grading: UG
Credit distribution: 1,0 c
In programs
TELTA ELECTRICAL ENGINEERING, Year 4 (elective)
TITEA SOFTWARE ENGINEERING, Year 4 (elective)
TITEA SOFTWARE ENGINEERING, Year 3 (elective)
TTFYA ENGINEERING PHYSICS, Year 4 (elective)
DCMAS MSc PROGR IN DEPENDABLE COMPUTER SYSTEMS, Year 1 (compulsory)
TAUTA AUTOMATION AND MECHATRONICS ENGINEERING, Year 4 (elective)
TDATA COMPUTER SCIENCE AND ENGINEERING - Computer security, Year 4 (elective)
TDATA COMPUTER SCIENCE AND ENGINEERING - Engineering of Computer-Based Systems, Year 4 (elective)
TDATA COMPUTER SCIENCE AND ENGINEERING - Embedded computer systems engineering, Year 4 (elective)
Examiner:
Professor Johan Karlsson
Replaces
EDA120 Dependable distributed and embedded systems
Eligibility:
For single-subject courses within Chalmers programmes, the same eligibility requirements apply as for the programme(s) that the course is part of.
Course specific prerequisites
No formal requirements, but participants are expected to have basic knowledge of computer engineering, programming and probability theory.
Aim
Fault-tolerant systems are used in applications that require high dependability, such as safety-critical control systems in vehicles and airplanes, or business-critical systems for e-commerce, automatic teller machines and financial transactions. This is an introductory course that covers basic techniques for design and analysis of fault-tolerant systems, as well as project management and development processes for safety-critical systems.
Goal
After the course the student shall be able to:
- Formulate requirements for fault-tolerant computer systems used in business-critical, safety-critical and mission-critical applications.
- Design system architectures for fault-tolerant computer systems from a given requirements specification.
- Perform probabilistic dependability analysis of fault-tolerant computer systems using fault trees, reliability block diagrams and continuous-time Markov chains.
- Describe the principles and properties of techniques used for error detection, error recovery and error masking in computer systems.
- Master the terminology of dependable computing and describe the major elements of relevant standards.
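As an illustration of the probabilistic dependability analysis named in the goals, the following minimal Python sketch evaluates simple reliability block diagrams. It is not course material: the function names are hypothetical, and the sketch assumes independent components with constant failure rates and, unless a voter reliability is given, a perfect voter.

```python
import math


def component_reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def series(*reliabilities: float) -> float:
    """Series block: the system works only if every component works."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities: float) -> float:
    """Parallel block: the system fails only if every component fails."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail


def tmr(r: float, voter: float = 1.0) -> float:
    """Triple modular redundancy: at least 2 of 3 identical modules work.

    P(exactly 3 work) + P(exactly 2 work) = r^3 + 3 r^2 (1 - r)
                                          = 3 r^2 - 2 r^3.
    The voter is modelled as a block in series with the triplicated modules.
    """
    return voter * (3.0 * r**2 - 2.0 * r**3)


if __name__ == "__main__":
    # Hypothetical numbers: failure rate 1e-4 per hour, mission time 1000 hours.
    r = component_reliability(1e-4, 1000.0)
    print("single module:", r)
    print("TMR          :", tmr(r))
```

Under these assumptions TMR only helps while the module reliability is above 0.5: at exactly 0.5 the expression 3R^2 - 2R^3 equals 0.5, and below that a single module is more reliable than the triplicated system.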
Content
The course covers techniques for tolerating hardware and software faults, analysis of fault-tolerant systems, project management and development processes for safety-critical systems.
The content can be divided into five areas:
1. Terminology and definitions: Includes terms such as dependability, reliability, maintainability, availability and safety, taxonomies for dependable systems, fault models, etc.
2. Design techniques for error detection and fault-tolerance: Fault-tolerance is achieved by introducing redundancy in the design. Various redundancy configurations are described. Hardware redundancy: triple modular redundancy (TMR), active redundancy, hot and cold standby systems, hybrid redundancy, etc. Software redundancy: N-version programming, recovery blocks. Information redundancy: error-correcting codes and self-checking circuits. Time redundancy: Methods for detecting and tolerating transient and permanent faults. Fault-tolerance in distributed systems: time-triggered systems, forward recovery, backward recovery, checkpointing, domino effect, Byzantine failures, etc.
3. Analysis of fault-tolerant systems: Reliability block diagrams, fault trees, Markov chain models, failure mode and effects analysis (FMEA), failure rate prediction for integrated circuits, fault injection, etc. Includes a laboratory class in which Markov chain models are used to analyse a fault-tolerant system. The analysis is done using a dedicated computer program.
4. Project management and development processes: Competence models, process models, resource balancing, risk analysis, safety case, the IEC 61508 standard, etc.
5. System examples: Fault-tolerant systems from areas such as space, aviation, automotive, telecommunication and transaction processing are described, some by guest lectures from industry.
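To give a flavour of the Markov chain analysis mentioned in area 3, here is a minimal Python sketch of a continuous-time Markov model for a repairable duplex system. It is an illustration only, not the program used in the laboratory class. The model is an assumption made for this example: two identical units with failure rate lam each, a single repair crew with repair rate mu, and the system is up while at least one unit works. The function names are hypothetical.

```python
def duplex_steady_state(lam: float, mu: float) -> dict:
    """Steady-state probabilities of a three-state birth-death Markov chain.

    State 2: both units working   (leaves to state 1 at rate 2 * lam)
    State 1: one unit working     (to state 0 at rate lam, to state 2 at rate mu)
    State 0: both units failed    (to state 1 at rate mu, single repair crew)

    Balance equations: p2 * 2 * lam = p1 * mu  and  p1 * lam = p0 * mu.
    """
    w2 = 1.0                    # unnormalised weight of state 2
    w1 = w2 * 2.0 * lam / mu    # from the balance between states 2 and 1
    w0 = w1 * lam / mu          # from the balance between states 1 and 0
    total = w2 + w1 + w0
    return {2: w2 / total, 1: w1 / total, 0: w0 / total}


def duplex_availability(lam: float, mu: float) -> float:
    """Probability that at least one unit is working in steady state."""
    p = duplex_steady_state(lam, mu)
    return p[2] + p[1]


if __name__ == "__main__":
    # Hypothetical rates: one failure per 1000 hours, repairs take 10 hours on average.
    print(duplex_availability(lam=1e-3, mu=1e-1))
```

Because the chain is a birth-death process, the steady state follows directly from the two balance equations. Models with more general transition structures are usually solved numerically from the full transition rate matrix instead.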
Organisation
Lectures, exercises and one laboratory class.
Literature
Neil Storey, Safety-Critical Computer Systems, Prentice Hall, ISBN 0-201-42787-7. Compendium, reprints of articles, compendium of exercises.
Examination
Written exam. Compulsory laboratory class.