Kinesia Online Course
Advanced Operating Systems
Kinesia LLC, 2003

    1. Review and Overview
    2. Deadlocks
    3. Distributed Systems Architecture
    4. Theoretical Foundations
    5. Distributed Mutual Exclusions
    6. Agreement Protocols
    7. Distributed Resource Management
    8. Distributed Scheduling
    9. Security and Protection
    10. Recovery and Fault Tolerance
     

    
    A man who gives in to temptation after five minutes simply
    does not know what it would have been like an hour later.
    That is why bad people, in one sense, know very little about badness.
    They have lived a sheltered life by always giving in.
    
    						C.S. Lewis
    
    Recovery and Fault Tolerance
    1. Basic Concepts
    2. A system consists of a set of hardware and software components and is designed to provide a specified service.
    3. Failure of a system occurs when the system does not perform its services in the manner specified.
    4. An erroneous state of the system is a state which could lead to a system failure by a sequence of valid state transitions.
    5. A fault is an anomalous physical condition.
    6. An error is a manifestation of a fault in a system, which can lead to system failure.
    7. Failure recovery is a process that involves restoring an erroneous state to an error-free state.
    8. Failure Classification
    9. process failure
    10. system failure
    11. secondary storage failure
    12. communication medium failure
    13. System Model

      Assume stable storage:

    14. does not lose information in the event of system failure
    15. is used to keep logs & recovery points
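      The stable-storage assumption above can be approximated on an ordinary file system. The following is only a sketch: the function names and the JSON format are hypothetical, and a real stable store would normally also replicate the data on independent disks. It shows a recovery point being written so that a crash leaves either the old copy or the new copy, never a partial one.

```python
import json
import os
import tempfile


def save_recovery_point(path, state):
    """Atomically replace the file at `path` with a JSON dump of `state`.

    Hypothetical helper: approximates stable storage on one local disk.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that the final
    # rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to the device
        # os.replace is atomic: readers see the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_recovery_point(path):
    """Read back the most recently saved recovery point."""
    with open(path) as f:
        return json.load(f)
```

      After a restart, a process would call load_recovery_point on the same path to obtain the last state that was completely written.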

    16. Recovery

      forward recovery: remove the errors from the current state and let the system continue forward

      e.g. signals sent to a satellite

      needs an error-correcting code
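      As an illustration of the kind of error-correcting code that forward recovery relies on, here is a sketch of the Hamming(7,4) code. The notes do not name a particular code, so this choice is an assumption. The receiver repairs any single flipped bit without asking for a retransmission.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the bad bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

      The code corrects one error per 7-bit block; two or more flipped bits in the same block are decoded incorrectly, so the error model has to be known in advance.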

      backward recovery

    17. rollback recovery
    18. based on recovery points
    19. two approaches:
      1. operation-based recovery

        record all modifications in sufficient detail so that a previous state of the process can be restored by reversing all the changes
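      Operation-based recovery can be sketched with an undo log. The class and method names below are hypothetical, and the log is kept in memory for brevity; in the setting of these notes it would be written to stable storage before each modification is applied.

```python
class UndoLoggedStore:
    """Key-value store that records enough detail to reverse every write."""

    def __init__(self):
        self.data = {}
        # Each entry: (key, key_existed_before, previous_value)
        self.log = []

    def write(self, key, value):
        # Record the old value first, then apply the change.
        self.log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def mark(self):
        """Return a recovery point: the current length of the log."""
        return len(self.log)

    def rollback(self, mark):
        """Reverse all writes made after `mark`, newest first."""
        while len(self.log) > mark:
            key, existed, old = self.log.pop()
            if existed:
                self.data[key] = old
            else:
                del self.data[key]
```

      Undoing the entries in reverse order matters: if the same key was written twice after the recovery point, the oldest logged value must be the one restored last.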

      2. state-based recovery

        the complete state of a process is saved at various checkpoints

      A process takes a checkpoint from time to time by saving its state in stable storage
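      State-based recovery can be sketched as follows. The class is hypothetical, and a plain list stands in for stable storage; the point is only that the complete state is copied at each checkpoint and restored wholesale on rollback.

```python
import copy


class CheckpointedProcess:
    """Toy process whose whole state is saved at each checkpoint."""

    def __init__(self):
        self.state = {"counter": 0}
        # Stand-in for stable storage; starts with the initial state.
        self.checkpoints = [copy.deepcopy(self.state)]

    def step(self):
        """One unit of ordinary computation."""
        self.state["counter"] += 1

    def take_checkpoint(self):
        # Deep copy so later changes do not alter the saved state.
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        """Restore the most recent checkpoint."""
        self.state = copy.deepcopy(self.checkpoints[-1])
```

      Compared with the operation-based approach, nothing is logged between checkpoints, but all work done since the last checkpoint is lost on rollback.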

      need consistent global state

      the state of channels corresponding to a global state is the set of messages sent but not yet received

      A checkpoint is a saved local state of a process

      A set of checkpoints, one per process in the system, is consistent if the saved states form a consistent global state

      Rollback recovery from inconsistent checkpoints may cause message losses
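      The consistency condition can be checked mechanically. The sketch below assumes a representation that is not in the notes: each checkpoint records the identifiers of the messages the process had sent and received when it was taken, and identifiers are globally unique. A set of checkpoints is treated as consistent when no message is recorded as received without also being recorded as sent.

```python
def is_consistent(checkpoints):
    """checkpoints maps a process name to {"sent": set, "received": set}."""
    all_sent = set().union(*(cp["sent"] for cp in checkpoints.values()))
    all_received = set().union(*(cp["received"] for cp in checkpoints.values()))
    # Every recorded receive must have a matching recorded send.
    return all_received <= all_sent


def channel_state(checkpoints):
    """Messages sent but not yet received in this global state."""
    all_sent = set().union(*(cp["sent"] for cp in checkpoints.values()))
    all_received = set().union(*(cp["received"] for cp in checkpoints.values()))
    return all_sent - all_received
```

      For example, if X's checkpoint records m1 as sent and Y's checkpoint does not record it as received, the set is consistent and m1 belongs to the channel state. If Y's checkpoint records m1 as received but X's checkpoint was taken before the send, the set is inconsistent.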

      Two approaches to creating checkpoints:

      1. processes take checkpoints independently and save all checkpoints in stable storage (asynchronous)
      2. processes coordinate their checkpointing actions so that each process saves only its most recent checkpoint, and the set of checkpoints in the system is guaranteed to be consistent (synchronous)

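      Orphan messages and domino effect:

      an orphan message is one whose receipt is recorded in the restored state but whose sending has been undone by a rollback; undoing its receipt can force further rollbacks in other processes (the domino effect), which may lead to unacceptable delays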
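      The cascade can be simulated. In the sketch below, whose representation and names are assumptions, each process has a numbered sequence of events; a checkpoint at position c saves the effect of all events with index below c; and a message is a tuple (sender, send index, receiver, receive index). A message is an orphan when its send has been undone but its receive is still kept, and each orphan forces the receiver back to an earlier checkpoint.

```python
def recovery_line(checkpoints, messages, failed):
    """Return the position each process must roll back to.

    checkpoints: process -> list of checkpoint positions (must include 0)
    messages:    list of (sender, send_idx, receiver, recv_idx)
    failed:      the process that crashed
    float("inf") means the process keeps its current state.
    """
    line = {p: float("inf") for p in checkpoints}
    line[failed] = max(checkpoints[failed])
    changed = True
    while changed:
        changed = False
        for sender, send_idx, receiver, recv_idx in messages:
            send_undone = send_idx >= line[sender]
            receive_kept = recv_idx < line[receiver]
            if send_undone and receive_kept:
                # Orphan: move the receiver to its latest checkpoint
                # taken before the receive.
                line[receiver] = max(
                    c for c in checkpoints[receiver] if c <= recv_idx
                )
                changed = True
    return line


# Independent checkpoints interleaved with messages in a staircase pattern.
checkpoints = {"X": [0, 2, 4], "Y": [0, 2, 4]}
messages = [
    ("Y", 0, "X", 1),
    ("X", 2, "Y", 1),
    ("Y", 2, "X", 3),
    ("X", 5, "Y", 3),
]
line = recovery_line(checkpoints, messages, failed="X")
```

      In this example a failure of X alone drives both processes back to position 0, their initial states, even though each had taken two later checkpoints. Coordinated checkpointing is meant to rule out this outcome.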


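      Lost messages:

      messages whose sending is recorded in the restored state of the sender but whose receipt has been undone by a rollback of the receiver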

      Livelock:

      a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress.

      Strongly Consistent Set of Checkpoints