A man who gives in to temptation after five minutes simply does not know what it would have been like an hour later. That is why bad people, in a sense, know very little about badness. They have lived a sheltered life by always giving in. C.S. Lewis
Recovery and Fault Tolerance
Assume stable storage:
forward recovery
needs error-correcting codes
backward recovery
record all modifications in sufficient detail so that a previous state of the process can be restored by reversing all the changes
the complete state of a process is saved at various checkpoints
A process takes a checkpoint from time to time by saving its state in stable storage
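The two styles of backward recovery above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Process` class, its method names, and the use of a local JSON file as a stand-in for stable storage are all hypothetical and not part of the original notes. The undo log shows operation-based recovery (reverse every recorded change); `checkpoint` and `restore` show state-based recovery (save and reload the complete state).

```python
import json
import os

_MISSING = object()  # marks "key did not exist before this write"


class Process:
    """Toy process whose state is a dict (hypothetical, for illustration)."""

    def __init__(self):
        self.state = {}
        self.undo_log = []  # (key, previous value) for each modification

    def write(self, key, value):
        # Operation-based recovery: record enough detail to reverse the
        # change before applying it.
        self.undo_log.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def undo_all(self):
        # Reverse the logged changes, newest first, restoring the state
        # as it was when the log was last empty.
        while self.undo_log:
            key, old = self.undo_log.pop()
            if old is _MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old

    def checkpoint(self, path):
        # State-based recovery: save the complete state. Writing to a
        # temporary file, syncing, and renaming is an assumed way to
        # approximate stable storage on a local file system.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
        self.undo_log.clear()  # older changes no longer need undoing

    def restore(self, path):
        # Roll back to the last saved checkpoint.
        with open(path) as f:
            self.state = json.load(f)
        self.undo_log.clear()
```

In this sketch the undo log is cleared at each checkpoint, so `undo_all` returns the process to its most recent checkpoint. A real system would also have to deal with state that is not JSON-serializable and with storage that can itself fail.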
needs a consistent global state
the state of channels corresponding to a global state is the set of messages sent but not yet received
A checkpoint is saved as a local state of a process
A set of checkpoints, one per process in the system, is consistent
if the saved states form a consistent global state
rollback-recovery from inconsistent checkpoints may cause message losses
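The consistency condition above can be checked mechanically. The sketch below is a minimal illustration, assuming each checkpoint records the sets of messages its process had sent and received when the checkpoint was taken; the `Message` and `Checkpoint` types and all function names are hypothetical. A set of checkpoints is treated as consistent when no message is recorded as received without also being recorded as sent. The channel state is computed as the messages sent but not yet received.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    msg_id: int
    sender: str
    receiver: str


@dataclass
class Checkpoint:
    process: str
    sent: set = field(default_factory=set)      # messages sent so far
    received: set = field(default_factory=set)  # messages received so far


def _all_sent(checkpoints):
    return set().union(*(c.sent for c in checkpoints))


def _all_received(checkpoints):
    return set().union(*(c.received for c in checkpoints))


def orphan_messages(checkpoints):
    # Received according to one checkpoint, but never sent according
    # to any checkpoint in the set.
    return _all_received(checkpoints) - _all_sent(checkpoints)


def channel_state(checkpoints):
    # Messages sent but not yet received: the state of the channels.
    return _all_sent(checkpoints) - _all_received(checkpoints)


def is_consistent(checkpoints):
    return not orphan_messages(checkpoints)
```

A short usage example with two processes, P and Q:

```python
m1 = Message(1, "P", "Q")
m2 = Message(2, "Q", "P")

good = [Checkpoint("P", sent={m1}), Checkpoint("Q", sent={m2}, received={m1})]
bad = [Checkpoint("P"), Checkpoint("Q", received={m1})]

print(is_consistent(good))  # True
print(channel_state(good) == {m2})  # True: m2 is still in transit
print(is_consistent(bad))   # False: m1 was received but never sent
```

In the consistent case, the messages in the channel state are the ones that would need to be logged or resent, since a rollback to these checkpoints would otherwise lose them.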
Two approaches to creating checkpoints:
Orphan messages and Domino effect:
may lead to unacceptable delays
Lost messages:
Livelock:
Strongly Consistent Set of Checkpoints