Kinesia Online Course
Advanced Operating Systems
Kinesia LLC, 2003

    1. Review and Overview
    2. Deadlocks
    3. Distributed Systems Architecture
    4. Theoretical Foundations
    5. Distributed Mutual Exclusions
        6. Agreement Protocols
    7. Distributed Resource Management
    8. Distributed Scheduling
    9. Secutiry and Protection
    10. Recovery and Fault Tolerance
     

    
    In a few hundred years, when the history of our time
    is written from a long-term perspective, it is likely
    that the most important event those historians will see
    is not technology, not the Internet, not e-commerce. It
    is an unprecedented change in human condition.  For the
    first time, they will have to manage themselves.
    
    					Peter Drucker
    
    Theoretical Foundation
    1. Inherent Limitations of a Distributed System

    2. Absence of Global clock
      • difficult to make temporal order of events
      • difficult to collect up-to-date information on the state of the entire system

    3. Absence of Shared Memory
      • no up-to-date state of the entire system to any individual process as there's no shared memory
      • coherent view -- all observations of different processes ( computers ) are made at the same physical time we can obtain a coherent but partial view of the system or
        incoherent view of the system
      • complete view ( global state ) -- local views ( local states ) + messages in transit
        difficult to obtain a coherent global state

    4. Lamport's Logical Clock

      happened before →

    5. a → b , if a and b are events in the same process and a occurred before b
    6. a → b , if a is the event of sending a message m in a process and b is the event of receipt of the same message m by another process
    7. if a → b and b → c, then a → c ( transitive )

    8. event a causally affects b if a → b
    9. concurrent: a || b if !( a → b ) and !( b → a )
    10. for any two events in a system, either a → b or b → a or a || b
      e11 → e12   , e12 → e22
      e21 → e13   , e14 || e24

      To realize the relation → we need a clock Ci at each process Pi in the system

      Ci(a) -- timestamp of event a at Pi
      if a → b, then C(a) < C(b)

    11. Condition requirements:
      1. for any two events a and b in a process Pi, if a occurs before b, then Ci(a) < Ci(b)
      2. if a is the event of sending a message m in Pi and b is the event of receiving the same message m at process Pj, then Ci(a) < Cj(b)
      Implementation rules:
      1. two successive events in Pi Ci = Ci + d ( d > 0 ) if a and b are two successive events in Pi and a → b then Ci(b) = Ci(a) + d ( d > 0 )
      2. event a: sending of message m by process Pi,
        timestamp of message m : tm = Ci(a ) then Cj = max ( Cj, tm + d )    d > 0

        → is irreflixive, defines partial order among events

        Totally ordering relation ( => ) can be defined by ( on top of the above )

        a is any event in process Pi
        b is any event in process Pj a => b iff
          either Ci(a) < Cj(b)
          or Ci(a) = Cj(b) and Pi Pj ( e.g. Pi Pj if i ≤ j, to break ties )

    12. Limitation of Lamport's Clocks
      if a → b then C(a) < C(b)
      but C(a) < C(b) does not necessarily imply a → b
    13. Vector Clocks

      n = number of processes in a distributed system
      Each event in process Pi ~ vector clock Ci ( integer vector of length n )

      Ci = Ci[1]
      Ci[2]
      ..
      Ci[n]
    14. Ci[i] ~ Pi's own logical clock
    15. Ci[j] ~ Pi's best guess of logical time at Pj. More precisely, the time of occurrence of the last event at Pj which "happenned before" the current point in time at Pj
    16. Ci(a) is referred to as the timestamp of event a at Pi
    17. Implementation Rules:

      1. two successive events a, b in process Pi
        Ci(b)[i] = Ci(a)[i] + d    ( d > 0 )
      2. event a at Pi sending message m to process Pj with receiving event b; vector timestamp tm = Ci(a) is assigned to m
        all k, Cj(b)[k] = max(Cj(b)[k],tm[k])

      Assertion.
      At any instant

      Comparing two vector timestamps of events a and b

      Equal    ta = tb   iff   all i, ta[i] = tb[i]
      Not Equal    tatb   iff   some i,    ta[i] ≠ tb[i]
      Less Than or Equal    tatb   iff   all i,    ta[i] ≤ tb[i]
      Not Less Than or Equal To    ta tb   iff   some i,    ta[i] > tb[i]
      Less Than    ta < tb   iff   all i, some j,    ta[i] ≤ tb[i], ta[j] < tb[j]
      Not Less Than    ta tb   iff     !(ta < tb)
      Concurrent    ta || tb   iff      ta[i] tb[i] and tb[i] ta[i]
      Events are causally related if ta < tb or tb < ta

      Now, a → b   iff   ta < tb
      otherwise concurrent

    18. Causal Ordering of Messages

      if Send( M1 ) → Send( M2 )
      then the receipient should receive M1 before M2

      i.e. Send( M1 ) → Send( M2 ) requires Receive( M1 ) → Receive( M2 )


      Figure: Violation of causal ordering of messages

    19. Applications: database replication management, monitoring distributed computations, simplifying distributed algorithms,...

    20. Solution idea: upon arrival of a message at a process, buffer (delay delivery) the message until the message immediately preceding it is delivered

    21. Birman-Schiper-Stephenson Protocol: Enforcing Causal Ordering of Messages

      Assumes broadcast communication channels that do not loose or corrupt messages. ( i.e. everyone talks to everyone ). Use vector clocks to "count" number of messages ( i.e. set d = 1 ). n processes.

      1. Process Pi updates vector time Ci and broadcasts message m with timestamp tm = Ci.
      2. Process Pj ( j ≠ i ) upon receiving message m with timestamp tm, Pj buffers the message until
        • all messages sent by Pi preceding m have arrived i.e. Cj[i] = tm[i] - 1

          and
        • Pj has received all messages that Pi had received before sending m. i.e. Cj[k] ≥ tm[k]    k = 1, 2, .. n, k ≠ i
      3. When the message is finally delivered at Pj, vector time Cj is updated according to vector clock rule 2.

      Schiper-Eggli-Sandoz were able to solve the problem without broadcasting channels

    22. Global State
    23. no global clock, no global memory
    24. To determine a global system state, a process p must enlist the cooperation of other processes that must record their states and send the recorded local states to p

    25. processes cannot record their local states at precisely the same instant unless they have access to a common clock

    26. the global-state-detection algorithm is to be superimposed on the underlying computation; it must run concurrently with but not alter the underlying computation
      diagrams

    27. Distributed system finite set of processes
      finite set of channels

      process state, channel state

    28. Single-token conservation system
      ( system has one token )

      each process has two states S1, S0

      S1 -- process possesses token
      S0 -- process does not possess token

      Global state: token in P1

      Global state: token in C1
       

      Global state: token in C2

      Global state: token in P2

      four global states : in-P1, in-C1, in-P2, in-C2

      Suppose

    29. P1 records local state at global state in-P1,
    30. P2, channels record local state at global state in-C1, in-C1,
    31. composite global state recorded in this fashion would show two tokens in system, one in P1, and one in C1,

      inconsistency arises because P1 recorded its state before P sent a message along C and C recorded after P sent the message

    32. let n = # of messages sent along C1 before P1 recorded its state
      and
      let n' = # of messages sent along C1 before C1 recorded its state

      e.g. n = 0, n' = 1 in the above example
      we see that recorded global state may be inconsistent if n < n'

      Suppose

    33. C1 records local state at G.S. in-P1, ( n' = 0 )
    34. C2, P1, P2 record local state at G.S. in-P1, ( n = 1 )
    35. Here n > n' and again we have inconsistency

      Thus consistent global state requires n = n'

      Similarly, if

      let m = # of messages received along C1 before P2 recorded its state
      and
      let m' = # of messages received along C1 before C1 recorded its state
      consitent global state requires m = m'

      Also a consistent global state must satisfy

        n' ≥ m
        ( i.e. # of messages sent along a channel is always larger than or equal to the number of messages received )

      or from above we have

        n ≥ m
    36. Global-State-Detection Algorithm Send a special message called marker

      Chandy-Lamport Global State Recording Protocol

      The goal of this distributed algorithm is to capture a consistent global state. It assumes all communication channels are FIFO. It uses a distinguished message called a marker to start the algorithm.

    37. Pi sends marker
      1. Pi records its local state
      2. For each channel Cij on which Pi has not already sent a marker, Pi sends a marker before sending other messages.

    38. Pj receives marker from Pi
      1. If Pj has not recorded its state:
        • a) Records the state of Cij as empty
        • b) Sends the marker as described above ( Note: it records local state before sending out marker )

      2. If Pj has recorded its state local state LSj
        • a) Record the state of Cij to be the sequence of messages received between the computation of LSj and the marker from Cij.

      Example

      In this example, all processes are connected by communications channels Cij. Messages being sent over the channels are represented by arrows between the processes.

      Snapshot s1:

      • P1 records LS1, sends markers on C12 and C13
      • P2 receives marker from P1 on C12; it records its state LS2, records state of C12 as empty, and sends marker on C21 and C23
      • P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
      • P1 receives marker from P2 on C21; as LS1 is recorded, it records the state of C21 as empty.
      • P1 receives marker from P3 on C31; as LS1 is recorded, it records the state of C31 as empty.
      • P2 receives marker from P3 on C32; as LS2 is recorded, it records the state of C32 as empty.
      • P3 receives marker from P2 on C23; as LS3 is recorded, it records the state of C23 as empty.

      Snapshot s2: now a message is in transit on C12 and C21.

      • P1 records LS1, sends markers on C12 and C13
      • P2 receives marker from P1 on C12 after the message from P1 arrives; it records its state LS2, records state of C12 as empty, and sends marker on C21 and C23
      • P3 receives marker from P1 on C13; it records its state LS3, records state of C13 as empty, and sends markers on C31 and C32.
      • P1 receives marker from P2 on C21; as LS1 is recorded, and a message has arrived since LS1 was recorded, it records the state of C21 as containing that message.
      • P1 receives marker from P3 on C31; as LS1 is recorded, it records the state of C31 as empty.
      • P2 receives marker from P3 on C32; as LS2 is recorded, it records the state of C32 as empty.
      • P3 receives marker from P2 on C23; as LS3 is recorded, it records the state of C23 as empty.

      The recorded process states and channel states must be collected and assembled to form the global state. ( e.g. send G.S. to all processes in finite time )

      Termination
      each process must ensure that

      • no marker remains forever in an incident input channel
      • it records its state within finite time of initiation of the algorithm
    39. A remark ( state without proof here )

    40. Some definitions

    41. LSi -- local state of Si ( site )

    42. events -- send( mij ), recv( mij )

    43. time ( x ) -- time at which state x was recorded
      e.g. time ( LSi )

    44. send ( mij ) ∈ LSi iff time ( send ( mij ) ) < time ( LSi )

    45. recv ( mij ) ∈ LSj iff time ( recv ( mij ) ) < time ( LSj )

    46. transit ( LSi, LSj ) = { mij | send( mij ) ∈ LSi Λ recv( mij ) !∈ LSj }
      i.e. message in channel

    47. inconsistent ( LSi, LSj ) = { mij | send( mij) !∈ LSi Λ recv( mij ) ∈ LSj }

    48. Global State GS = { LS1, LS2, ..., LSn }
      i.e. collection of local states ( may be consistent or inconsistent )

    49. Consistent Global State: A global state GS = { LS1, LS2, ..., LSn } is consistent iff all i, all j: 1 ≤ i, j ≤ n :: inconsistent( LSi, LSj ) = Φ

    50. Transitless global state: A global state is transitless iff all i, all j: 1 ≤ i, j ≤ n :: transit( LSi, LSj ) = Φ

    51. Strongly consistent global state: A global state is strongly consistent if it is consistent and transitless.

    52. Cuts of a distributed Computation

      Graphical representation of GS

      C = { c1, c2, ... ,cn }
      ci -- cut event, local state of site ( or process ) Si at that instant

    53. Consistent Cut:

      all Si, all Sj, no ei, no ej such that

        ( ei → ej ) and ( ei → cj ) and ( ei ci )

    54. Theorem
        A cut C = { c1, c2, ... ,cn } is a consistent cut iff no two cut events are causally related. ( i.e. every pair of cut events are concurrent )

      Time of a cut

        C = { c1, c2, ... ,cn }

        Ci -- vector clock of ci

        TC = sup ( C1, C2, ... , Cn )

        TC[k] = max ( C1[k], C2[k], ... , Cn[k] )

      • Theorem
        if C = { c1, c2, ... ,cn } is a cut with vector time TC, then the cut is consistent iff
        TC = C1[1]
        C2[2]
        .
        .
        Cn[n]
          -------------- (1)
        Proof:
        If C is a consistent cut, then all its events are concurrent. Thus Ci[i] ≥ Cj[i] for all i, j and hence
        TC = sup ( C1, C2, ... , Cn ) = C1[1]
        C2[2]
        .
        .
        Cn[n]

        On the other hand if (1) is true
        we have Ci[i] ≥ Cj[i] for all i, j. This implies that the the events ci are concurrent and the cut is consistent.

    55. Termination Detection

      Huang's Termination Detection Protocol

      The goal of this protocol is to detect when a distributed computation terminates.

      Notation:

    56. n processes
    57. Pi process; without loss of generality, let P0 be the controlling agent
    58. Wi. weight of process Pi; initially, W0 = 1 and Wi = 0 for all other i.
    59. B(W) computation message with assigned weight W
    60. C(W) control message sent from process to controlling agent with assigned weight W


      Protocol

    61. an active process Pi sends a computation message to Pj
      1. Set Wi' and Wij to values such that Wi' + Wij = Wi,
        Wi' > 0, Wij > 0. (Wi' is the new weight of Pi.)
      2. Send B(Wij) to Pj
    62. Pj receives a computation message B(Wij) from Pi
      1. Wj = Wj + Wij
      2. If Pj is idle, Pj becomes active

    63. Pi becomes idle by:
      1. Send C(Wi) to P0
      2. Wi = 0
      3. Pi becomes idle

    64. Pi receives a control message C(W):
      1. Wi = Wi + W
      2. If Wi = 1, the computation has completed.

    65. Example
    66. The picture shows a process P0, designated the controlling agent, with W0 = 1. It asks P1 and P2 to do some computation. It sets
        W01 = 0.2
        W02 = 0.3
        W0 = 0.5
    67. P2 in turn asks P3 and P4 to do some computations. It sets
        W23 = 0.1
        W24 = 0.1

    68. When P3 terminates, it sends C(W3) = C(0.1) to P2, which changes W2 to 0.1 + 0.1 = 0.2.

    69. When P2 terminates, it sends C(W2) = C(0.2) to P0, which changes W0 to 0.5 + 0.2 = 0.7.

    70. When P4 terminates, it sends C(W4) = C(0.1) to P0, which changes W0 to 0.7 + 0.1 = 0.8.

    71. When P1 terminates, it sends C(W1) = C(0.2) to P0, which changes W0 to 0.8 + 0.2 = 1.

    72. P0 thereupon concludes that the computation is finished.

      Total number of messages passed: 8 (one to start each computation, one to return the weight).