#### **Leader Election**

#### **▲** Definition:

- each processor has a set of **elected** states and a set of **not-elected** states. Once an elected state is entered, the processor always is in an elected state; similarly for non-elected. I.e., irreversible decision.
- In every *admissible* execution,
  - every processor eventually enters either an elected or a not-elected state (liveness)
  - eactly one processor (the leader) enters an elected state (safety)

A leader can be used to coordinate future activities of the system. For instance:

- find a spanning tree using the leader as the root
- reconstruct a lost token for a token-ring

We will study leader election in rings.



In an *oriented ring*, processors have a consistent notion of left and right:



For example, if messages are always forwarded on incident channel 1, they will cycle clockwise around the ring.

#### Why study rings?

- <u>simple</u> starting point, easy to analyze
- abstraction of a token ring



lower bounds for ring topology also apply to arbitrary topologies

#### **Anonymous Rings**

Intuition is that processors do not have unique identifiers.

Related issue is whether an algorithm A relies on processors knowing the ring size. (NON UNIFORM)

• uniform algorithm — does not use the ring size (same algorithm for each size ring)

Formally, every processor in every size ring is modeled with the same state machine A.

• **non-uniform** algorithm — does use the ring size (different algorithm for each size ring; may be only trivially different)

Formally, for every value of n, there is a state machine  $A_n$  such that every processor in a ring of size n is modeled with  $A_n$ . Thus A is the collection of all the  $A_n$ 's.

#### **Leader Election in Anonymous Rings**

**Theorem 3.2:** There is <u>no</u> leader election algorithm for anonymous rings, even if the algorithm knows the ring size (i.e., is non-uniform) and the ring is synchronous.

#### Proof Sketch: (BY CONTRADICTION)

- Every processor begins in the <u>same</u> state with the <u>same</u> messages originally in transit.
- Every processor receives the same messages and thus makes the <u>same</u> state transition and sends the <u>same</u> messages in round 1.
- Every processor receives the same messages and thus makes the same state transition and sends the same messages in round 2.
- Etc.

Eventually some processor is supposed to enter an elected state. But then they all would, a contradiction.

Consequently, there is no uniform or asynchronous leader election algorithm.

#### **Rings with Identifiers**



Assume each processor has a <u>unique</u> identifier. Distinguish between indices and identifiers:

- indices are 0 through n-1 and are <u>unavailable</u> to the processors; used only for analysis
- identifiers are <u>arbitrary</u> nonnegative integers and are available to the processors via a special state component called id.

Specify a ring by starting with the <u>smallest</u> id and listing ids in clockwise order. E.g., 3, 37, 19, 4, 25.



**Uniform algorithm:** There is one state machine for every id, no matter what size ring.

**Non-uniform algorithm:** There is one state machine for every id and every different ring size.

#### Overview of Leader Election in Rings with Ids

In this case, there are algorithms. We will evaluate them according to their **message complexity**.

#### **Overview of Upcoming Results:**

- asynchronous ring:  $\Theta(n \log n)$  messages
- synchronous ring:
  - $-\Theta(n)$  messages under certain conditions
  - otherwise  $\Theta(n \log n)$  messages

All bounds are asymptotically tight.

# $O(n^2)$ Messages Leader Election Algorithm

#### Each processor follows these rules:

- Initially send your id to the left
- When you receive an id (from the right):
  - if it is greater than your id then forward it to the left (you will never be the leader)
  - if it is equal to your id then elect yourself leader
  - if it is smaller than your id then do nothing

**Correctness:** Elects processor with largest id. Message containing that id passes through every processor.

Message complexity: Depends how ids are arranged.

- Largest id travels all around the ring,  $\underline{n}$  messages
- Second largest id travels until reaching largest
- Third largest id travels until reaching largest or second largest
- Etc.

#### $O(n^2)$ Messages Algorithm (cont'd)

Worst way to arrange the ids is in decreasing order:



- Second largest id contributes n-1 messages.
- Third largest id contributes n-2 messages.
- Etc.

Total number of messages is

$$\sum_{i=1}^{n} i = \Theta(n^2).$$

APPROPONDIMENTO: QUANTO COSTA NEL CASO MIGLIORE? E NEL CASO MEDIO?

#### $O(n \log n)$ Messages Leader Election Algorithm

# • Each processor tries to probe successively larger neigh borhoods. Size of neighborhood doubles in each phase. | Sondare | Sondare | Sondare | Service | Processors within | Processors | Within | Wi

- A probe is initiated by sending a probe message containing the initiator's id.
- If a processor receives a probe message whose id is larger than its own id, the processor will either forward it on or send back a reply, as appropriate.
- If a processor receives a probe message whose id is smaller than its own id, it does nothing.
- If a processor receives a probe message with its own id, then it becomes the leader.
- If a processor receives a reply message not destined for itself, it forwards it.
- If a processor receives both reply messages destined for itself, it proceeds to the next phase.



#### $O(n \log n)$ Messages Algorithm (cont'd)

Correctness: Similar to previous algorithm.

#### **Message Complexity:**

- Each message belongs to a phase and is initiated by a particular processor.
- Probe distance in phase i is  $2^i$ .
- The number of messages initiated by a particular processor in phase  $\underline{i}$  is  $\underbrace{4 \cdot 2^{i}}$  (probes and replies in both directions).

#### $O(n \log n)$ Messages Algorithm (cont'd)

How many processors initiate probes in phase i?

- For i = 0, all n of them do.
- For i > 0, every processor that is a "winner" in phase i-1 does (has largest id in its  $2^{i-1}$  neighborhood)

Maximum number of phase i-1 winners occurs when they are packed as densely as possible:



Total number of phase i-1 winners is at most

$$\frac{n}{2^{i-1}+1} \qquad i \ge 1$$

How many phases are there? Phases continue until there is only one winner, so  $\log n$  phases suffice.

#### $O(n \log n)$ Messages Algorithm (cont'd)

#### Total number of messages is

$$\leq 4 \cdot n + \sum_{i=1}^{\log n} 4 \cdot 2^{i} \cdot \frac{n}{2^{i-1} + 1} + 2n$$

$$\leq 6n + 4n \sum_{i=1}^{\log n} \frac{2^{i}}{2^{i-1}}$$

$$= 6n + 8n \log n$$

$$= O(n \log n).$$





#### **Leader Election in Synchronous Rings**

First, a simple algorithm for the synchronous model:

- Group the <u>rounds</u> into **phases** so that each phase contains *n* rounds.
- In phase *i*, the processor with id *i*, if there is one, sends a message around the ring and is elected.

Example: n = 4 and 7 is smallest id.

- In phases 0 through 6 (corresponding to rounds 1 through 28), no message is ever sent.
- At beginning of phase 7 (round 29), processor with id 7 sends message which is forwarded around ring.

Note reliance on synchrony and knowledge of n!

Correctness: Convince yourself.

**Message Complexity:**  $\Theta(n)$ . Note that this is optimal.

**Time Complexity:**  $O(n \cdot \underline{m})$ , where  $\underline{m}$  is the smallest id in the ring. Not bounded by n.

#### **Another Synchronous LE Algorithm**

#### This algorithm

- works in a slightly weaker model: Processors might not all start at same round; a processor either wakes up spontaneously or when first gets a message.
- is uniform (does not rely on knowing n).

#### Idea:

- A processor that wakes up spontaneously is <u>active</u>; sends its id in a <u>fast</u> message — 1 edge/round.
- A processor that wakes up when receiving a message is <u>relay</u>; never in the competition.
- A <u>fast</u> message becomes <u>slow</u> if it reaches an active processor 1 edge/ $2^{\underline{m}}$  rounds ( $\underline{m}$  is msg id)
- Processors (active or relay) only forward a message whose id is smaller than any id this processor has seen so far (ignoring the id of relay processors).
- If a processor gets own id back, leader.

#### **Analysis of Synchronous LE Algorithm**

**Correctness:** Convince yourself that active processor with smallest id is elected.

Message Complexity: Winner's message is the fastest. While it traverses the ring, other messages are slower, so they are overtaken and stopped before too many messages are sent.

More carefully, divide messages into three kinds:

- 1) fast messages
- 2) slow messages sent while the leader's message is fast
- 3) slow messages sent while the leader's message is slow

#### Analysis of Synchronous LE Algorithm (cont'd)

Number of type 1 messages (fast):

Show that no processor forwards more than one fast message. (BY CONTRADICTION)



If  $p_i$  forwards  $p_j$ 's fast msg and  $p_k$ 's fast msg, then when  $p_k$ 's fast message arrives at  $p_j$ :

- 1. either  $p_j$  has already sent its fast message, so  $p_k$ 's message becomes slow, or
- 2. p<sub>j</sub> has not already sent its fast message, so it never will.
- Number of type 1 messages is at most **n**.

#### Analysis of Synchronous LE Algorithm (cont'd)

Number of type 2 messages (slow while leader's is fast):

- Leader's message is fast for at most n rounds.
- Slow message  $\underline{i}$  is forwarded  $n/2^{\underline{i}}$  times in n rounds.
- Worst case (largest number of messages) is when ids are as small as possible, 0 to n-1.

Number of type 2 messages is at most  $\mathbf{\epsilon}_{i=1}^{n-1} \frac{n}{2^i} \leq n$ .

Number of type 3 messages (slow while leader's is slow)

- Once leader's message  $\underline{x}$  becomes slow, it takes at most  $n \cdot 2^x$  rounds to return to leader.
- No messages are sent once leader's message has returned to leader.
- Slow message  $\underline{i}$  is forwarded  $n \cdot 2^{\mathbf{z}}/2^{\mathbf{i}}$  times in  $n \cdot 2^x$  rounds.
- Worst case is when ids are 0 to n-1 and  $\underline{x=0}$ .

Number of type 3 messages is at most  $\sum_{i=0}^{n-1} n \cdot \frac{2^0}{2^i} \le 2n$ .

#### Time Complexity of Synchronous LE Algorithms

Time Complexity:  $O(n \cdot 2^x)$ , where  $\underline{x}$  is the minimum id. Even worse than the previous algorithm.

Both these algorithms have two potentially undesirable properties:

- rely on the numeric values of the ids to count
- number of rounds bears no relationship to  $\underline{n}$ , but depends on the minimum id

The last the state of the day of the complexity

# Gallager-Humblet-Spira Minimal Spanning Tree

- Assumption:
  - unique weights w(e) for all edges
  - ASYNCHRONOUS LSYSTEM WITH AN ARBITRARY CONNECTED GRAPH G=(VE)



- Fragment F is a subtree of the MST of G
- Outgoing edge e from F if one of the edge's nodes is in F and the other is not in F.



GHS ALGORITHM REQUIRES O(1E1+1V/log |V|)
MESSAGES

# GHS ALBORITHM

ASSUMPTION

FURTHER: EDGES FOLLOW A FIFO POLICY

LEMMA: IF THE EDGES HAVE DISTINCT WEIGHTS => THE MST IS UNIQUE

PROOF: (BY CONTR) TI TO MOT OFG.

LET & BE THE MIN-WEIGHT EDGE BELONGING TO 11 12. W.L.O.G., LET EETI.

To U Se & CONTAINS A CYCLE, AND AT LEAST ONE EDGE & OF THIS CYCLE of TI.

week we' AND To Use's I see is A SPANNING TREE WITH WEIGHT < THAN TZ (CONTR!)

LEMMA: LET T BE THE UNIQUE MST OF G.

Y FRAGMENT F OF T, THE MINIMUM-WEIGHT OUTGOING EDGE OF FE TO T.

PROOF: (BY CONTR) LET e BE THE MWOE OF F, AND LET etT. TU Se ? CONTAINS A CYCLE, AND SUCH A CYCLE CONTAINS AT LEAST ONE ADDITIONAL OUTGOING EDGE OF F, SAY e'.



FORTED THANTS CONTR.

# GHS (continued)

Proposition: If F is a fragment and e is the least weight outgoing edge of F, then
 F ∪{e} is a fragment.

#### Algorithm MAIN IDEA

Start with single node fragments and incrementally enlarge them

# Global description of GHS

- Maintain for G = (V,E) a set of fragments such that  $\bigcup_i \text{ nodes}(F_i) = V \text{ and } i \neq j \Leftrightarrow F_i \cap F_j = \emptyset$
- start with one-node fragments
- nodes in a fragment cooperate to find the lowest weight outgoing edge
- when the edge is found, combine with the other fragment
- Terminate when only one fragment remains
- OPERATIONS ARE GOORDINATED BY CORE FLORES OF THE FRAGMENTS

# TRAGMENT: (w, L)

W: WEIGHT OF THE CORE EDGE

L: LEVEL OF THE FRAGMENT, WITH L=0

IF THE FRAGMENT CONTAINS A SINGLE NODE

FRAGMENT COMBINATION L1 = L2
UNION ABSORPTION L1 < L2

A COMBINATION OF TWO FRAGMENTS OF LEVEL L-1 PRODUCES A NEW FRAGMENT OF LEVEL L AND WHOSE CORE EDGE IS THE EDGE USED FOR THE UNION

AN ABSORPTION OF A FRAGMENT DOES NOT CHANGE THE IDENTITY OF THE ABSORBING FRAG.

AS SOON AS A UNION TAKES PLACE, THE IDENTITY OF THE RESULTING TRAGMENT IS SENT TO ALL ITS NODES.

# Local description of GHS

- Each node p stores
  - the state of its edges e,  $state_p[e] \in \{basic, branch, reject\}$
- DENTRY name of its fragment w
  - level of its fragment
    - best weight of outgoing edges from its fragment
    - father channel (i.e. route towards the core node)
    - its own state, state [P] = {sleeping, Find, Found}



## TYPE OF MESSAGES:

INMATE (w, L, s): SENT BY CORE NODES RIGHT AFTER CREATION

TEST (W,L): SENT BY A NODE IN FIND STATE

OVER ITS MINIMUM-WEIGHT BASIC EDGE

TO CHECK WHETHER IS AN OUTGOING EDGE

ACCEPT.

REPORT (W): USED TO FIND THE MIN-WEIGHT OUTGOING EDGE

CHANGE - CORF: SENT BY CORF NODES TO ACTIVATE UNION

CONNECT (W,L): REQUEST OF UNION

# GHS ALGORITHM

- · IDENTIFYING OUTGOING EDGES
- · FINDING THE MINIMUM OUTGOING EDGE
- FRAGMENT UNION

#### DENTIFYING OUTGOING EDGES

NODE P. IN FI (W1, L1) PICKS ITS MINIMUM-WEIGHT

BASIC EDGE, SAY & AND SENDS ON IT A

Test (W1, L1)

- 1) IF  $(\omega_1, L_1) = (\omega_2, L_2)$  e is NOT AN OUTGOING EDGE => Property Reject; e:= REJECTED
- 2) IF  $(\omega_1, L_1) \neq (\omega_2, L_2)$  AND  $L_2 \geq L_1$ Proposition of the pr
- 3) IF (ws, Ls) + (wz, Lz) AND L2 < L1
  - THE ABOVE CONDITIONS IS VERIFIED
    - THIS BLOCKS R (AND THE WHOLE TI!)

# FINDING THE MINIMUM OUTBOING EDGE

- PROCESS STARTED BY CORE MODES
  BY SENDING Initiate (w, L, Find) TO
  ALL MODES IN THE FRAGMENT, THROUGH
  EDGES OF THE FRAGMENT
- A NODE PI RECEIVING THE Initiate CHANGES ITS
  - 1) UPDATES ITS INFORMATION ABOUT FRAGMENT, THAT IS (W,L);
  - 2) RECORDS THE DIRECTION TOWARDS THE CORE
  - 3) IF PI HAS OUTWARDS EDGES OF THE FRAGMENT, FORWARDS THE INTTATE MESSAGE
  - 4) FIND ITS LOCAL MINIMUM OUTGOING EDGE
  - IF P is A LEAF OF THE FRAGMENT Report (W)
    AND ENTERS INTO A Found STATE (NOTE: WI CAN BE +00)
  - IF PI IS INTERNAL, WAITS FOR ALL THE REPORTS
    FROM ITS CHILDREN, AND CHOOSES THE MINIMUM;
    FINALLY SENDS A REPORT (WI) AND MARKS THE
    BEST EDGE, AND ENTERS INTO A FOUND STATE
  - BASED ON ALL THE REPORTS, CORE NODES SEND A Change-Core to THE CHOSEN NODE, THAT SENDS A Connect (W, L) OVER ITS MIN. OUTGOING EDGE, AND MARKS IT AS A Branch.

# CHAMBINARY TRAGMENTS UNION



1) IF L2=L1 AND B IS GOING TO SEND (OR ALPEADY SENT) A CONNect ON EDGE e, THEN COMBINATION TAKES PLACE

$$F = F_1 \cup F_2$$

$$F(\omega(e), L_1+1)$$

$$e \quad CORF \quad EDGE$$

2) IF L\_2>L1, THEN ABSORPTION OF \$\frac{1}{2} INTO \$\frac{1}{2}\$
TAKES PLACE

$$f_2 := f_2 \cup f_1$$

$$f_2 \left( \omega_2, L_2 \right)$$

PSENDS AN Initiate (U2, L2, S) TO PO WHERE SE SFIND, FOUNDS; PFORWARDS THE WHITE MESSAGE TO NODES OF E.

3 L1 > L2 IS IMPOSSIBLE (FI IS LOCKED)

Approfonolimento

ESEGUIRE GHS SUL SEGUENTE GRAFO:



HP 3 L'ALGORITMO INIZIA DA PI E PS

2) IL SISTEMA E PSEUDOSINCRONO: I MESSAGGI INVIATI DA PROCESSORI DISPARI VENGONO CONSEGNATI IN 1 UNITA DI TEMPO, QUELLI INVIATI DA PROCESSORI PARI, IN 2 UNITA DI TEMPO

# CORRECTNESS OF GHS

- · FULL PROOF IS VERY COMPLICATED
- WE FOCUS ON GENERAL PROPERTIES:
  - · TERMINATION
  - · SYNCHRONIZATION
  - · ABSORPTION WHILE SEARCHING FOR A MOE

# TERMINATION

RESPONSE TO test and Connect ARE SOMETIMES
DELAYED => DEADLOCK IS A PRIORI POSSIBLE

LEMMA FROM ANY CONFIGURATION WITH AT LEAST TWO FRAGMENTS, EVENTUALLY EITHER ABSORPTION OR COMBINATION TAKES PLACE

Proof: LET L BE THE MIN LEVEL IN THIS CONFIGURATION,

AND LET F BE THE LEVEL- L TRAGITENT WHOSE

MOE HAS MIN WEIGHT AMONG ALL LEVEL- L TRAGITENTS.

A Test MESSAGE FROM F EITHER REACHES

F' OF LEVEL L'>L OR A Sleeping Mode.

IN THE FIRST CASE, F GETS A RERY IMMEDIATELY.

IN THE SECOND CASE, THE AWAKENED NODE BECOMES

A TRAGITENT OF LEVEL L=0 CHOOSE A NEW

F AND APPLY RECORSIVELY THE ARGUMENT.

THE FIRST CASE APPLIES IS REACHED.

EVENTUALLY, FFINDS ITS MOE:



TWO CASES:

1 L(F') > L(F) + ABSORPTS F

2 L(F') = L(F) e is also the moe

OF F' (BY CONSTRUCTION), AND F' CANNOT

BE LOCKED FAND F' COMBINE.

GORGLARY: GHS TERMINATES.

Proof: IF GHS DOES NOT TERMINATE THERE

MUST BE AT LEAST 2 FRAGMENTS (SINCE

WITH JUST ONE IT WOULD TERMINATE)

THE ABOVE LEMMA GUARANTEES THAT

THE NUMBER OF FRAGMENTS WILL BE PROGRESSIVELY

REDUCED UP TO 1

THESIS

### SYNCRONIZATION

MESSAGE TRANSMISSION TIME IS UNBOUNDED

A HODE MIGHT HAVE INACCURATE INFO
ABOUT ITS OWN FRAGMENT.

EXAMPLE

Test

ABSORPTION

F

ABSORP

WE WILL SHOW THAT AN INACCURATE ANSWER DOES NOT AFFECT CORRECTNESS.

CLAIM 1 LET & BE THE CORE EDGE OF SOME FRAGMENT F. THEN, & IS NEVER THE CORE OF A FRAGMENT F' SUCH THAT F & F!

Claim 2 A NODE P: WHOSE FRAGMENT ID is CURRENTLY (w, L) PELONGS TO A FRAGMENT WITH LEVEL L' > L.

Proof IF THE INFO OF PE IS INACCURATE

PE IS PARTICIPATING IN COMBINATION OR

ABSORPTION IN BOTH CASES, LI > L

REMARKA IF POSENDS A LEST TO PS THE

FRAGMENT POSENDS TO IS NOT JOINING (NOT ABSORDED)

WITH OTHER FRAGMENTS THE ONLY INCORPET



REMARK 2 Reject MESSAGES ARE ALWAYS CORRECT

Claim 3 IF & SENDS AN Accept MESSAGE

PE AND B ARE NOT IN THE SAME FRAGMENT.

Froof Accept IS SENT IFF  $(w_3, L_3) \neq (w_2, L_2)$  AND  $L_2 \gg L_1$ .

IF L2>L1 THE REAL L2, BY CLAIM 2, CAN ONLY BE > L2 > L1 PLANDPJ ARE IN DIFFERENT FRAGMENTS.

IF  $L_2=L_1$  THE REAL  $L_2>L_2$ .

IF IT IS =, THEN  $W_2\neq W_1$  OK

IF IT IS >, THEN SEE ABOVE

NOTE THAT IF L2 < L1, THE FOLLOWING IS
POSSIBLE: F. MIGHT ABSORPT F2 ELSEWHERE,
AND P3 IS STILL NOT INMATED BY F2

PE AND PS ARE IN THE SAME FRAGMENT BUT
PS DOES NOT KNOW IT

NO PROBLEM, SINCE PJ IS NOT REPLYING
TO P:



## ABSORPTION WHILE SEARCHING FOR A MOE



F1: (w1, L1) F2: (w2, L2) L2<L1

AFTER Connect (W2, L2) FROM PJ TO Pi, Pi SENDS TO PJ AN Juitiate (W1, L1, state) Two CASES:

- 1 State = Find No PROBLEM, SINCE IN THIS CASE F\_2 WILL PARTICIPATE IN THE SEARCH OF THE MOE OF F\_ (=F\_1UF\_2).
- 2) State = Found Pi HAS ALREADY SENT A Report MESSAGE.

POTENTIALLY, THIS MIGHT CAUSE A PROBLEM! BUT:

CLAIM THE MOE OF F3 is ALSO THE MOE OF FT.

PROOF THE COMMET BUE CANNOT BE THE MIN-WEIGHT OUTGOING EDGE OF PE (OTHERWISE PE WOULD BE LOCKED FROM B NO REPORT).

w(e) < w(e) where w is a substitute of w in w is a substitute of w in w i

$$\omega$$
 (MOE(F<sub>2</sub>))  $\leq \omega$  (e')  $< \omega$ (e)  $= \omega$  (MOE(F<sub>2</sub>))  $\leq$ 

< W (ANY OUTGOING EDGE OF F2)



# MESSAGE COMPLEXITY

LEAST 2 NODES.

Proof: BY INDUCTION.

L=0 TRIVIAL

ASSUME TRUE UP TO FRAGMENTS OF LEVEL L-1.
LET F: (w, L)

T WAS CREATED

FITHER AFTER COMBINATION
OF F. AND F2 OF LEVEL L-1:  $|F| = |F_3| + |F_2| \ge 2^{\frac{1}{2}} + 2^{\frac{1}{2}} = 2^{\frac{1}{2}}$ 

OR AFTER ABSORPTION OF A LEVEL-L' < L PRAGMENT F

APPLY RECURSIVELY TO FIF

THEOREM: GHS REQUIRES ( W+ nlog n) MESSAGES.

ANY NODE SENDS: 1 Initiate

RECEIVES 1 Change Core

1 Report

EACH TIME THE LEVEL OF ITS FRAGRENT INCREASES

IT CAN GO THROUGH & logn LEVELS





# Shared Memory SYSTEM (ASYNCRONOUS)



Processors communicate via a set of shared variables, REGISTER. instead of passing messages.

Each shared variable has a type, defining a set of operations that can be performed atomically. 

READ/WRITE

ACCESS PATTERN: SINGLE /MULTIPLE

Changes to the model from the message-passing case:

- no inbuf and outbuf state components No MESSAGES!
- configuration includes values of shared variables
- only event type is a computation step by a processor. 4 NO DELIVERY! When  $p_i$  takes a step:
  - $-p_i$ 's state in old config specifies which shared variable is to be accessed and with which operation
  - operation is done; variable's value in the new config changes according to operation's semantics
  - $-p_i$ 's state in new config changes according to its old state and result of operation

SYSTEM: N PROCESSORS Pa, ..., Pn
M REGISTERS Ry, ..., Rum

CONFIGURATION: C = (95, -,9n, 75, ..., 7m)

EVENT: COMPUTATION STEP BY & PROCESSOR \$

EXECUTION

DEGMENT:  $C_0, \phi_0, C_1, \phi_1, C_2, \phi_2, \dots$ 

CK+1 is the result of APPLYING THE
TRANSITION FUNCTION OF PIX TO ITS
STATE IN CK, AND APPLYING PIX MEMORY
ACCESS OPERATIONS TO THE REGISTERS IN Ck.

CK OK CKHI STEP OF FOR

SCHEDOLE: 00, 01, 02, ....

EXECUTION: EXECUTION SEGMENT WITH CO AS INITIAL CONFIGURATION OF THE SYSTEM

ADMISSIBLE: IN AN INFINITE EXECUTION, EACH PROCESSOR TAKES INFINITE STEPS.

C: CONFIGURATION O'= i, i2, ... SCHEDULE - (C,O) OF O' TO C

#### **Mutual Exclusion Problem**

Each processor's code is divided into four sections:



- entry section: synchronize with others to ensure mutually exclusive access to the...
- <u>critical</u> section: use some resource; when done, enter the...
- exit section: clean up, and then enter the...
- remainder section: not interested in the critical section

#### **Mutual Exclusion Algorithms**

A mutual exclusion algorithm specifies code for entry and exit sections to ensure that:

- mutual exclusion: at most one processor is in its critical section at any point, and
- either **no deadlock:** if a processor is in its entry section at some point, then later some processor is in its critical section,
- or **no lockout:** if a processor is in its entry section at some point, then later the *same* processor is in its critical section,
- or **bounded waiting:** no lockout + while a processor is in its entry section, other processors enter the critical section no more than a certain number of times.

Algorithm is allowed to assume:

- no processor stays in its critical section forever
- variables used in the entry and exit sections are not accessed during the critical and remainder sections

#### **Overview of Mutual Exclusion Results**

The main complexity measure of interest for shared memory mutual exclusion algorithms is the amount of shared space necessary, which is affected by:

- how powerful the type of the shared variables
- how strong the liveness condition to be satisfied

For most powerful shared variables (read-modify-write), number of different states of the shared memory is:

|    | upper bound                      | lower bound                  |
|----|----------------------------------|------------------------------|
| ND | 2 (T&S alg.)                     | 2 (obvious)                  |
| NL | $\frac{n}{2} + c$ (Burns et al.) | $\frac{n}{2}$ (Burns et al.) |
| BW | $n^2$ (queue alg.)               | n (Burns & Lynch)            |

# Overview of Mutual Exclusion Results (cont'd)

If the shared variables are the weak read/write kind, we measure the number of <u>distinct</u> variables needed.

|    | upper bound       | lower bound     |
|----|-------------------|-----------------|
| ND |                   | n               |
|    |                   | (Burns & Lynch) |
| NL | 3n boolean        |                 |
|    | (tournament alg.) |                 |
| BW | 2n unbounded      |                 |
|    | (bakery alg.)     |                 |

# Overview of Mutual Exclusion Results (cont'd)

If the shared variables are the weak read/write kind, we measure the number of *distinct* variables needed.

|    | upper bound       | lower bound       |
|----|-------------------|-------------------|
| ND |                   | n (Burns & Lynch) |
|    |                   | (Bullis & Lylich) |
| NL | 3n boolean        |                   |
|    | (tournament alg.) |                   |
| BW | 2n unbounded      |                   |
|    | (bakery alg.)     |                   |

### **Analysis of Bakery Algorithm**

Mutual Exclusion: This follows by showing that processor in critical section has the unique smallest Number (breaking ties with ids) among all contending processors.

**No Lockout:** This follows by observing that the contending processor  $p_i$  with the smallest Number (breaking ties with ids) will be next — every other processor that picks a Number while  $p_i$  is waiting will get a larger one.

**Space Complexity:** Number of shared variables is 2n; Choosing variables are binary but Number variables are unbounded.

Lemma 1 | F Pi is IN THE CRITICAL SECTION, AND,

FOR SOME K & i number [K] \$ 0 - 1 (number [K], K) > (number [i], i)

Proof SINCE P. IS IN THE CS, IT PASSED THE SECOND WAIT STATEMENT FOR J=K. THERE ARE 2 CASES:

PREAD NUMBER [K] = 0. IN THIS CASE:

EITHER: PR WAS IN THE REMAINDER SECTION

OR DID NOT FINISH IN CHOOSING ITS HUMBER.

BUT, P: ALREADY PASSED THE FIRST WAIT STATEMENT WITH choosing [K] = FALSE AND HAD ALREADY CHOSEN ITS NUMBER.

PR STARTED READING homber AFTER Pi number [i] < number [k]

2) PI READ THAT (NUMBER[K], K) > (NUMBER[i], i). IN THIS
CASE, THIS REMAINS TRUE UNTIL PI EXITS THE CS
OR PR DOES NOT CHOOSE A NEW NUMBER (THAT
WILL BE, IN ANY CASE, LARGER THAN NUMBER[i]).

LEMMA 2 IF Pi is INTHE CS, THEN houser[i]>0.

FOR ANY PJ. SINCE Pi is IN THE CS, IN THE ENTRY

FOR ANY PJ. SINCE Pi is IN THE CS, IN THE ENTRY

ECTION CHOSE A NUMBER > Number [J], YJ, i.e., >0.

THEOREM: THE BA PROVIDES MUTUAL EXCLUSION.

Proof: By CONTRADICTION.

ASSUME Pi AND PJ ARE SIMULTANGOUSLY IN THE CS.

FROM LEMMA 2, NUMber [i], Mumber [J] & 0,

AND FROM LEMMA1, (Number [i], i.) > (Number [J], J)

AND (Number [J], J) > (Number [i], i), A CONTRAD.

Proof: By CONTRADICTION.

ASSUME THERE EXIST STARVED PROCESSORS, AND LET Pi DE THE STARVED PROCESSORS WITH HINIMUM (NUMBER[i],i). ALL PROCESSORS CHOOSING A NUMBER AFTER PI WILL RECEIVE A LARGER TICKET, AND THEREFORE WILL NOT ENTER THE CS BEFORE PI. ALL PROCESSORS WITH SHALLER TICKET WILL ENTER THE CS (SINCE NO STARVED) AND EXIT IT.

AND ENTER THE CS, A CONTRADICTION.

SPACE 2.11 SHARED UNBOUNDED VARIABLES.
IN FACT, NUMBER[i]=0 & I IFF ALL THE

PROCESSORS ARE IN THE REMAINDER SECTION.

Approfondimento: PROVARTO CONFUTARE LA SEGUENTE

AFFERMAZIONE: THE BA ALGORITHM IS BOUNDED WAITING.

# BOUNDED ME ALBORITHM



: Bounded ME Algorithm For 2 Processors WHICH ALLOWS LOCKOUT Po, Pa

, O OTHERWISE

2 BOOLEAN : W[i] <

1 IF P. IS INTERESTED IN ENTERING THE CS

THE ALGORITHM IS ASYMMETRIC: PRIORITY IS GIVEN TO PO: PO ENTERS THE CS IFF PO IS NOT INTERESTED IN IT AT ALL!

CODE FOR PO

< Entry>

CODE FOR P

<Entry>

W[0]:=1

WAIT UNTIL (W[1]=0)

< Critical Section >

< Exit >

W[0]:=0

L: W[1]:=0
WAIT UNTIL (W[0]=0) W[1] :=1 IF (W[0]=1) GOTO L

< Critical Section>

< Exit>

W[1]:= 0

Approfondimento: DIMOSTRARE CHE L'ALGORITMO GARANTISCE LA ME, L'ASSENSA DI STALLO MA NON L'ASSENSA DI Blocks.

## Bounded 2-Processor ME Algorithm - No LOCKOUT

#### Uses 3 binary shared variables:

- W[0]: written by  $p_0$  and read by  $p_1$ , initially 0
- W[1]: vice versa, initially 0
- Priority: written and read by both, initially 0

# Entry: CODE FOR Pi i=91

- 1. W[i] := 0
- 2. wait until W[1-i] = 0 or Priority = i
- 3. W[i] := 1
- 4. if (Priority = 1-i) then
- 5. if (W[1-i] = 1) then goto Line 1
- 6. else wait until (W[1-i] = 0)
- 7. < Critical Section >

#### < Exit:

- 9. Priority := 1-i
- 9. W[i] := 0

#### No Deadlock for 2-Processor Algorithm

(Useful for showing no lockout.)

If one processor (say  $p_1$ ) ever enters its remainder section for good, then the other processor (say  $p_0$ ) cannot be starved, since it will keep seeing W[1] = 0.

So any deadlock would starve both processors.

WLOG, suppose Priority gets stuck at 0 after both processors are stuck in their entry sections.

Thus  $p_0$  is not stuck in Line 2, skips Line 5, and is stuck in Line 6.

Thus  $p_1$  skips Line 6 and is stuck in Line 2 with W[1] stuck at 0.

But then  $p_0$  gets unstuck and enters the critical section.

#### No Lockout for 2-Processor Algorithm

Suppose in contradiction  $p_0$  (WLOG) is starved.



Since there is no deadlock,  $p_1$  subsequently goes critical infinitely often.

The first time that  $p_1$  executes Line 8 after  $p_0$  gets stuck in its entry section, Priority gets stuck at 0.

Then  $p_0$  is stuck in Line 6, waiting for W[1] to equal 0, with W[0] = 1.

But the next time  $p_1$  enters its entry section, it gets stuck in Line 2 with W[1] = 0. This contradicts no deadlock.

p0 in entry

p1 at Line 8; Priority = 0 forever after p0 stuck in Line **6** with W[0] = 1 forever

p1 enters entry, sets W[1] to 0, stuck in Line 2

### **Mutual Exclusion for 2-Processor Algorithm**

Mutual Exclusion: Suppose in contradiction  $p_0$  and  $p_1$  are simultaneously critical.



WLOG suppose (2) precedes (3). But then in (4),  $p_0$  reads 1, not 0, and thus  $p_0$  cannot be critical at (1).

#### Tournament Algorithm (cont'd)

Pseudocode in book is recursive: HP:  $n = 2^{k+1}$ 

- $p_i$  begins at node  $2^k + \lfloor \frac{i}{2} \rfloor$ , playing the role of  $p_{i \mod 2}$ , where  $k = \lceil \log n \rceil 1$ .
- After winning at node  $\mathbf{v}$ , "critical section" for node  $\mathbf{v}$  is competition for  $\mathbf{v}$ 's parent, node  $\lfloor \frac{\mathbf{v}}{2} \rfloor$ , playing role of  $p_{\mathbf{v} \mod 2}$  in 2-proc. algorithm.

<u>Correctness</u>: Based on the correctness of the 2-processor algorithm and the tournament structure.

- Projection of an admissible execution of tournament algorithm onto a particular node produces an admissible execution of 2-proc. algorithm.
- ME for tournament algorithm follows from ME for 2-proc. algorithm at the <u>root node</u>.
- NL for tournament algorithm follows from NL for the 2-proc. algorithms at all nodes of tree.

What about bounded waiting? No.

Space Complexity: 3n boolean read/write variables.

```
procedure NODE (v: inTeger; side: {0,1})
   L: WANT SIDE : = 0
       WAIT UNTIL (WANTY-SIDE = 0 OR PRIORITY = SIDE)
(entry>) WANT V := 1
for NODE V FRIORITY = 1-SIDE) THEN
            IF (WANTY-SIDE = 1) THEN GOTO L
      ELSE WAT UNTIL (WANT 1-SIDE = 0)
        IF (V=1) THEN AT THE ROOT
              < CRITICAL SECTION >
```

ELSC NODE ([1/2], V mod 2)

### < Exit>

WANT SIDE := 0 PRIORITY V := 1-SIDE

END PROCEDURE

#### **Tournament Algorithm**

No lockout mutual exclusion algorithm for  $\underline{n}$  processors using bounded size variables:

- based on a <u>tournament tree</u>, complete binary tree with n-1 nodes
- A copy of the 2-processor algorithm is associated with each node of the tree
- Each proc. begins at a specified leaf, two per leaf
- A proc proceeds to next level in tree by winning the 2-processor competition for current node:
  - on left side, play role of  $p_0$
  - on right side, play role of  $p_1$
- when processor wins at root, it enters critical section



# FAULT-TOLERANCE

FAILURES: BENIGNE (CRASH)

BYZANTINE (ARBITRARY BEHAVIOUR)

- COORDINATED ATTACK PROBLEM (CAP)
  - CONDENSOUS PROBLEM (CP)

IN SYNCRONOUS MPS WITH BENIGN FAILURES

CP IN SYNC MPS WITH BYZANTINE FAILURES

2 ALGORITHMS: OPTIMAL # OF ROUNDS BUT EXPONENTIAL MESSAGE COMPLEXITY

DOUBLE # OFROUNDS BUT POLYNOMIAL MESS COMMERTIN

HIRD

PART: IMPOSSIBILITY OF SOWING CP (EVEN IN

THE CASE OF ONLY ONE BENIGN FAILURE)

FOR ASYNCROMOUS SYSTEMS

BOTH MPS AND SMS

# THE COORDINATED ATTACK PROBLEM



#### SYNC MPS WITH BENIGN FAILURES





AGREEMENT: y= y2

VALIDITY: IF  $x_1 = x_2 = 0$  AND NO MESSAGES APPLIE AT P3 ANDP2 NON-TRIMALITY:

NON-TRIVIALITY: THERE IS AN EXECUTION IN WHICH  $y_1 = y_2 = 1$ 

# NO SOLUTION!

Det: LET & BE AN EXECUTION AND LET PI BE A PROCESSOR. THE VIEW OF Pi IN & , a/pi, IS THE SUBSEQUENCE OF COMPUTATION AND MESSAGE DELIVERY EVENTS THAT OCCUR IN P. SIMILARITY BETWEEN EXECUTIONS:

 $\alpha_1, \alpha_2$  EXECUTIONS:  $\alpha_1, \alpha_2 \iff \alpha_1 | p_i = \alpha_2 | p_i$ 

REMARK: IF  $\alpha_1^{p_i}\alpha_2$ , THEN P: MAKES THE SAME DECISIONS IN  $\alpha_1$  AND  $\alpha_2$ .

THERE IS NO ALGORITHM THAT SOLVES THE CAP. BY CONTR)

By EXECUTION S.T. y1=y=1. LET K BE THE # OF MESSAGES SENT IN P1.

W.L.O.G., ASSUME LAST MESSAGE MAPS -> P2.

LET OR BE AN EXECUTION IDENTICAL TO BY, EXCEPT Mx IS NOT RECEIVED BY P2.

Produce Produces 1 BOTH IN BY AND IN de DECIDES 1 AS WELL IN de.

THEN, LET 00 = IDENTICAL TO 00 , EXCEPT MK-1 IS NOT DELIVERED.

THE PROCESSOR SENDING MK-1. ALSO HERE, Y==9=1.

d<sub>1</sub> ≥ d<sub>2</sub> WHERE IN d<sub>1</sub> M<sub>2</sub> IS NOT DELIVERED. y<sub>1</sub>=y<sub>2</sub>1

BO IDENTICAL TO X3, WITH 26=0 A RB BY=4=1

DENTICAL TO B, WITH 22=0 B B BB A B= 4==1

#### **Fault-Tolerant Consensus**

#### Types of **processor failure**:

- *crash*: in middle of step, might only send a subset of messages
- Byzantine: take arbitrary actions

Consensus problem: Every processor has an input. 2;

- *Termination*: Eventually every nonfaulty processor must decide on a value.
- Agreement: All nonfaulty decisions must be the same.
   Validity: If all inputs are the same, then the non-
- Validity: If all inputs are the same, then the non-faulty decision must be that input.  $y \in \{x_1, \dots, x_n\}$

Validity ensures that outputs bear some relationship to inputs (but also rules out easy solutions!).

*Background:* Collection of armies, all on the same side. Each general begins with an opinion whether to attack. If all attack, they will win, otherwise they will lose. Some generals are *traitors* and will behave incorrectly.



#### **Overview of Consensus Results**

Let f be the maximum number of faulty processors.

Tight bounds for synchronous message passing:

|                       | crash failures | Byzantine failures  |
|-----------------------|----------------|---------------------|
| number of rounds      | f+1            | $\underline{f} + 1$ |
| total number of procs | $\geq f + 1$   | $\geq 3f + 1$       |
| message size          | polynomial     | polynomial          |

• Asynchronous case: *impossible* in both shared memory and message passing, even if only one crash failure is to be tolerated.

#### **Modeling Processor Failures**

For an execution to be admissible:

#### Crash Failures:

All but a set of at most f processors (the **faulty** ones) take an infinite number of steps.

- In synchronous case: once a faulty processor fails to take a step in a round, it takes no more steps.
- In message passing case: In a faulty processor's last step, an arbitrary subset of the processor's outgoing messages make it into the channels. (This is where the difficulties lie.)

#### Byzantine Failures:

A set of at most f processors (the **faulty** ones) can send messages with arbitrary content and change state arbitrarily (not according to their transition functions).

## Consensus Algorithm for Crash Failures

#### Code for $p_i$ :

```
v := my input = xi ∈ N
at each round 1 through f+1:
    fif I have not yet sent v then
        send v to all
v := minimum among all received
        values and current value of v
in round f+1, decide on v=5:
```

**Termination:** By the code.

Validity: Holds since processors do not introduce spurious messages: if all inputs are the same, then that is the only value ever in circulation.

#### **Analysis of Crash Consensus Algorithm**

BY CONTR. ,

**Agreement:** Suppose  $p_j$  decides on a smaller value,  $\boldsymbol{x}$ , than does  $p_i$ . Then x was hidden from  $p_i$  by a chain of faulty processors:



There are f + 1 faulty processors in this chain, a contradiction.

## Performance:

- number of processors n > f
- f + 1 rounds

 $O(n^2 \cdot |V|)$  messages each of size  $\log |V|$ , where V is the input set.

# MESSAGES 
$$\leq_{2}\underline{n \cdot (n-1)}$$
. min  $(|V|, f+1) \leq n^{2} |V| = 0$ 

EDGES
$$= O(n^{2} |V|) = 0$$

$$= O(n^{3})$$

# Approfondimento: K-CONSENSOUS

P= { P3, P2, ---, Pn} CLIQUE

NPUT: X = 5 xs, xz, ..., xn}

QUIPOT: Y= & yi, yi, yi, yi, EX, for h

AND THE NUMBER OF THE NUMBER OF

AND THE NUMBER OF DIFFERENT VALUES < K.

PRESENT A SYNCRONOUS ALGORITHM WITH # ROUND & +1. (ASSUME & DIVIDES f).

CODE FOR Pi

V := wy input

at each round 1 through fx+1

if I have not yet sent v then send it to all vi= uninimum { received volues, v} at last round, decide on v

MESSAGE
COMPLEXITY:  $\leq 2 \cdot n \cdot (n-1)$  . min  $(|V|, \frac{1}{2} + 1) =$  $O\left(\min\left(\frac{N^3}{K}, N^2 \cdot |V|\right)\right)$ 

#### **Byzantine Failures**



How many processors total are needed to solve consensus when f = 1?

- Suppose n = 2. If  $p_0$  starts with input 0 and  $p_1$  starts with input 1, then someone has to change, but not both. What if one processor is faulty? How can the other one know?
- Suppose n = 3. If  $p_0$  has input 0,  $p_1$  has input 1, and  $p_2$  is faulty, then a tie-breaker is needed, but  $p_2$  might be malicious.

**Theorem (5.8):** Any consensus algorithm for message passing that tolerates 1 Byzantine failure must have at least 4 processors total.

**Proof:** Suppose there is a consensus algorithm  $\mathcal{A} = (A, B, C)$  for 3 processors and 1 Byzantine failure.





# **Processor Lower Bound for Byzantine Case**

Now consider a ring of six processors running components of A in this fashion:



Give each processor the indicated input and let the ring execute. Call the resulting execution  $\beta$ .

- $\beta$  does not necessarily solve consensus: it doesn't have to, since the assumptions under which A is supposed to work do not hold.
- However, the processors do something. This behavior will be used to specify the behavior of the faulty processors in certain particularly adversarial executions of  $\mathcal{A}$  on the triangle.

#### **Processor Lower Bound (cont'd)**

Let  $\alpha_1$  be this execution, in which 1 is decided:



Let  $\alpha_2$  be this execution, in which 0 is decided:



#### **Processor Lower Bound (cont'd)**



What is decided in  $\alpha_3$ ?

- $p_0$ 's view in  $\alpha_3$  equals  $p_0$ 's view in  $\beta$ , which equals  $p_0$ 's view in  $\alpha_1$ . Thus  $p_0$  decides 1 in  $\alpha_3$ .
- $p_2$ 's view in  $\alpha_3$  equals  $p_2$ 's view in  $\beta$ , which equals  $p_2$ 's view in  $\alpha_2$ . Thus  $p_2$  decides 0 in  $\alpha_3$ .
- But this contradicts agreement.

Read reduction in textbook to show n=3f is impossible for f>1.

THEOREM: IN A SYSTEM WITH IN PROCESSORS, WITH WITH IT KOST & BYZANTINE THEOREM , THERE IS NO LIGORITHM WHICH SOLVES THE CONSENSUS PROBLEM IF 11 < 3 f.

PROOF: (BY CONTRADICTION)

FOR THE SAKE OF SIMPLICATY, LET 1 = 31.

PARTITION THE SET OF PROCESSORS PINTO 3 SETS P1, P2, P3 EACH CONTAINING EXACTLY

1 = f PROCESSORS.



IF I' IS FAULTY AT MOST

FROCESSORS ARE FAULTY IN THE SIMULATED

SYSTEM THE SIMULATED SYSTEM WORKS

SYSTEM THE SIMULATED SYSTEM WORKS
CORRECTLY PI, P2', P3' WORKS CORRECTLY CANTEADOR'S

#### Consensus Algorithms for Byzantine Failures

Minimum number of rounds is f + 1, since crash failures are a special case of Byzantine failures.

## Exponential Tree Algorithm HEIGHT: \$+1

Each processor maintains a tree data structure in its local state. Each node of the tree is labeled with a sequence of processor indices with no repeats:

- root's label is empty sequence  $\lambda$  (root has level 0)
- root has n children labeled 0 through n-1
- child node labeled i has n-1 children labeled i:0 through i:n-1, skipping i:i
- in general, node at level d with label v has n-d children labeled v: 0 through v: n-1, skipping any index appearing in v [LENGTH OF THE LABEL: d+1]
- nodes at level f + 1 are the leaves.

#### **Exponential Tree Algorithm**

Each processor fills in the tree nodes with values as the rounds go by:

- initially, store your input in the root (level 0)
- round 1: send level 0 of your tree (the root); store value received from  $p_j$  in node j (level 1) (default if none)
- round 2: send level 1 of your tree; store value received from  $p_j$  for node k in node k:j(level 2) ("the value that  $p_j$  told me that  $p_k$  told  $p_j$ ") (default if none)
- continue for f + 1 rounds

In the last round, each processor uses the values in its tree to compute its decision. The decision is  $\underline{\text{resolve}}(\lambda)$ , where  $\underline{\text{resolve}}(\pi)$  equals

- value in tree node labeled  $\pi$  if it is a leaf
- majority{ $\underline{resolve}(\pi')$  :  $\pi'$  is a child of  $\pi$ } otherwise (default if none).

#### **Example of Exponential Tree**

The tree when n = 4 and f = 1:



IN GENERAL, A NODE IN THE TREE (IS LABELLED WITH TO SEQUENCE:

" id SAYS THAT id-1 SAID THAT id-2 SAYD THAT...
THAT is SAID X"

### **Proof of Exponential Tree Algorithm**

Lemma (5.10): Nonfaulty processor  $p_i$ 's resolved value for node  $\pi = \pi' j$ , what  $p_j$  reports for  $\pi'$ , equals what  $p_j$  has stored for  $\pi'$ .

**Proof:** By induction on the height of  $\pi$ .

Basis:  $\pi$  is a leaf. Then  $p_i$  stores in node  $\pi$  what  $p_j$  sends it for  $\pi'$  in the last round. For leaves, the resolved value is the tree value.

*Induction:*  $\pi$  is not a leaf.

- By tree definition,  $\pi$  has at least n-f children. >2
- Since n > 3f,  $\pi$  has majority of nonfaulty children.
- Let  $\pi k$  be a child of  $\pi$  such that  $p_k$  is nonfaulty.
- Since  $p_j$  is nonfaulty,  $p_j$  correctly reports to  $p_k$  that it has some value  $\mathbf{v}$  in node  $\pi'$ ; thus  $p_k$  stores  $\mathbf{v}$  in node  $\pi = \pi' j$ .
- By induction,  $p_i$ 's resolved value for  $\pi k$  equals the value  $\mathbf{v}$  that  $p_k$  has in its tree node  $\pi$ .
- So all of  $\pi$ 's nonfaulty children resolve to  $\boldsymbol{v}$  in  $p_i$ 's tree, and thus  $\pi$  resolves to  $\boldsymbol{v}$  in  $p_i$ 's tree.

#### **Proof of Exponential Tree Algorithm**



**Validity:** Suppose all inputs are  $\boldsymbol{v}$ .

- Nonfaulty processor  $p_i$  decides on resolve( $\lambda$ ), which is the majority among resolve(j),  $0 \le j \le n-1$ .
- The previous lemma implies that for each nonfaulty  $\underline{p_i}$ , resolve(j) is the value stored at the root of  $\underline{p_j}$ 's tree, which is  $p_j$ 's input  $\boldsymbol{v}$ .
- Thus  $p_i$  decides  $\boldsymbol{v}$ .

#### **Proof of Exponential Tree Algorithm (cont'd)**

**Agreement:** Show that all nonfaulty processors resolve the same value for their tree roots.

- \*A node is **common** if all nonfaulty processors resolve the same value for it. We will show the root is common. Strategy:
  - 1. Show that every node with a certain property is common.
  - 2. Show that the root has the property.
  - **Lemma (5.11):** If every  $\pi$ -to-leaf path has a common node, then  $\pi$  is common.

**Proof:** By induction on the height of  $\pi$ .

Basis:  $\pi$  is a leaf. Then every  $\pi$ -to-leaf path consists solely of  $\pi$ , and since the path is assumed to contain a common node, that node is  $\pi$ .

### **Proof of Exponential Tree Algorithm (cont'd)**

Induction:  $\pi$  is not a leaf. Suppose in contradiction  $\pi$  is not common.

- Then every child  $\pi'$  of  $\pi$  has the property that every  $\pi'$ -to-leaf path has a common node. Assure the Non-Faculty of NE
- Since the height of  $\pi'$  is smaller than the height of  $\pi'$ , the inductive hypothesis implies that  $\pi'$  is common.
- Therefore all nonfaulty processors compute the same resolved value for  $\pi'$ , and thus  $\pi$  is common.
- THE ROOT IS COMMON:
- 2) Show every root-to-leaf path has a common node.
  - There are f + 2 nodes on a root-to-leaf path.
  - The label of each non-root node on a root-to-leaf path ends in a distinct processor index:  $i_1, i_2, \dots, i_{f+1}$
  - At least one of these indices is that of a nonfaulty processor, say  $i_k$ .
  - Lemma 5.10 implies that the node whose label ends in  $i_k$  is common.

#### **Proof of Exponential Tree Algorithm (cont'd)**

### **Complexity:**

- n > 3f processors
- f + 1 rounds
- Messages in round  $r^{7/2}$  contain  $n(n-1)(n-2)\cdots(n-(r-2))$  values. When r = f + 1, this is exponential if f is more

than constant relative to n.

## A Polynomial Algorithm for Byzantine Agreement

We can reduce the message size with a simple algorithm that increases the number of processors to n > 4f and number of rounds to 2(f + 1).

#### Phase King Algorithm

```
Uses f + 1 phases, each taking two rounds.

Code for p_i:

pref := my input \approx i

first round of phase k:

send pref to all

receive prefs of others

let maj be the value that occurs > n/2 times

among all prefs (0 if none)

let mult be number of times maj occurs

second round of phase k:
```

```
if i = k then send maj // I am the phase king
receive tie-breaker from pk (0 if none)
if mult > n/2 + f
    then pref := maj
    else pref := tie-breaker
if k = f+1 then decide pref
```

(log {max(V)}

#### Proof of Phase King Algorithm (cont'd)

#### Agreement:

- $\odot$  Since there are f+1 phases, at least one has a nonfaulty king.
- Lemma 5.14 implies that at the end of that phase, all nonfaulty processors have the same preference.
- Lemma 5.13 implies that from that phase onward, the nonfaulty preferences stay the same.
- Thus the decisions are the same.

#### Performance:

- $\circ$  number of processors n > 4f
- $\circ$  2(f+1) rounds
- o  $\mathcal{O}(n^2f)$  messages, each of size  $\mathfrak{m}$ .

## V: INPUT SET

Accordination: Mostrare un'esecusion per n=4, ==1 tale che l'algoritmo FHASE-KING fallisce.

#### Proof of Phase King Algorithm

#### VALIDITY

Lemma (5.13): If all nonfaulty processors prefer v at start of phase k then all do at end of phase k.

Proof:

• Each nonfaulty processor receives at least n - f preferences (including its own) for v in the first round of phase k.



## Proof of Phase King Algorithm (cont'd)

Lemma (5.14): If the king of phase k is nonfaulty, then all nonfaulty processors have the same preference at the end of phase k.

**Proof:** Consider two nonfaulty processors  $p_i$  and  $p_j$ . Case  $l: p_i$  and  $p_j$  both use  $p_k$ 's tie-breaker. Since  $p_k$  is nonfaulty, they agree.

Case 2:  $p_i$  uses its majority value and  $p_j$  uses the king's tie-breaker.

- $p_i$ 's majority value is v.
- $p_i$  receives more than n/2 + f preferences for v.
- $p_k$  receives more than n/2 preferences for v.
- $p_k$ 's tie-breaker is v.

Case 3:  $p_i$  and  $p_j$  both use their own majority values.

- $p_i$ 's majority value is v.
- $p_i$  receives more than n/2 + f preferences for v.
- $\circ$   $p_j$  receives more than n/2 preferences for v.
- $\circ$   $p_j$ 's majority value is also v.