Eventual and causal consistency
Weaker than linearizability
- linearizability
- ordering
- C2I (global)
- pros and cons
- easy to use
- hard to implement, low performance, not available during network partitions.
- e.g. eventual consistency (generally):
- accept write at any replica immediately
- respond to read at any replica immediately
- anomalies
- stale read
- diverging write
- loss of causality
- desired properties:
- eventual convergence of state
- preserving causality
A cloud album as an example
- mobile phones and cloud, album and photos
- operations: add album, upload photo to album, delete photo, get photo
- design #1: all operations have to be acknowledged by cloud
- strong consistency, no anomalies
- potentially bad user experience, long latency, freezes when connections are bad
- design #2: each device acts as a storage server
- a device caches photos/albums and modifies its local copy; read from local copy.
- data is available despite intermittent connectivity to cloud and other nodes.
- consistency issues if two devices have conflicting operations.
Straw man 1: merge storage
- write-write conflict, e.g., add two photos to the same album at the same time
- automatic conflict resolution
- idea: update functions. Have each update be a function, not a new value.
- e.g. add photo pid to album w/ aid.
- read current state of storage, decide best change
- function must be deterministic
- otherwise nodes will get different answers
- Challenge:
- A: add pid1 to album w/ aid
- B: add pid2 to album w/ aid
- X syncs w/ A, then B
- Y syncs w/ B, then A
- Will X show pid1 in front of pid2, and Y show pid2 in front of pid1?
- If the update function is commutative (e.g. album is an unordered set of photos), then it does not matter.
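A minimal sketch of the update-function idea above, assuming storage is a dict from album ID to a photo collection. The names `add_photo_set`, `add_photo_list`, and `apply_all` are hypothetical, chosen only for illustration. It contrasts a commutative update (unordered set) with an order-sensitive one (ordered list):

```python
from typing import Callable, Dict, List

Update = Callable[[Dict], None]

def add_photo_set(aid: str, pid: str) -> Update:
    """Update function for an album stored as an unordered set (commutative)."""
    def update(state: Dict) -> None:
        state.setdefault(aid, set()).add(pid)
    return update

def add_photo_list(aid: str, pid: str) -> Update:
    """Update function for an album stored as an ordered list (order-sensitive)."""
    def update(state: Dict) -> None:
        state.setdefault(aid, []).append(pid)
    return update

def apply_all(updates: List[Update]) -> Dict:
    """Apply deterministic update functions in the given order to empty storage."""
    state: Dict = {}
    for u in updates:
        u(state)
    return state

# X receives A's update first, then B's; Y receives them in the opposite order.
x_set = apply_all([add_photo_set("aid", "pid1"), add_photo_set("aid", "pid2")])
y_set = apply_all([add_photo_set("aid", "pid2"), add_photo_set("aid", "pid1")])
x_list = apply_all([add_photo_list("aid", "pid1"), add_photo_list("aid", "pid2")])
y_list = apply_all([add_photo_list("aid", "pid2"), add_photo_list("aid", "pid1")])

print(x_set == y_set)    # unordered sets converge regardless of sync order
print(x_list == y_list)  # ordered lists diverge
```

With the ordered list, X and Y end up disagreeing, which is the problem the ordered update log addresses.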
Goal: eventual state convergence
- Idea: ordered update log
- Ordered list of updates at each node
- Storage state is result of applying updates in order
- Syncing == ensure all nodes have same updates in log
- How can nodes agree on update order?
- Update ID: <time T, node ID>
- Assigned by node that creates the update
- Ordering updates a and b:
- a < b if a.T < b.T or (a.T = b.T and a.ID < b.ID)
- Example:
- <10,A>: add pid1 to album aid
- <20,B>: add pid2 to album aid
- What’s the final ordered list in aid?
- the result of executing update functions in timestamp order
- [..pid1, pid2] (not [..pid2, pid1])
- What’s the status before any syncs? i.e. content of each node’s storage state
- A: pid1 at 3rd position for album aid
- B: pid2 at 3rd position for album aid
- This is what A/B user will see before syncing.
- Now A and B sync with each other
- Both now know the full set of updates
- Can each just run the new update function against its storage state?
- A: pid1 at 3rd position, pid2 at 4th position
- B: pid2 at 3rd position, pid1 at 4th position
- That’s not the right answer!
- Roll back and replay
- Naive way: Re-run all update functions, starting from empty storage state
- Since A and B have same ordered set of updates
- they will arrive at same final state
- We will optimize this in a bit
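The naive roll-back-and-replay scheme can be sketched as follows. This is an illustration under assumptions, not Bayou's implementation: the `Update` and `Node` classes are hypothetical, the album is a plain list, and two pre-existing photos (`p0`, `p1`) are assumed so that the new photo lands at the 3rd position as in the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True, order=True)
class Update:
    t: int                            # timestamp assigned by the creating node
    node: str                         # creating node's ID, breaks timestamp ties
    pid: str = field(compare=False)   # payload: photo to append to the album

class Node:
    def __init__(self, name: str, log: List[Update]):
        self.name = name
        self.log = sorted(log)        # log is kept in <T, node ID> order

    def add_photo(self, t: int, pid: str) -> None:
        self.log = sorted(self.log + [Update(t, self.name, pid)])

    def sync_with(self, other: "Node") -> None:
        # After syncing, both nodes hold the same ordered set of updates.
        merged = sorted(set(self.log) | set(other.log))
        self.log = list(merged)
        other.log = list(merged)

    def album(self) -> List[str]:
        # Naive replay: start from empty state and re-run every update in order.
        photos: List[str] = []
        for u in self.log:
            photos.append(u.pid)
        return photos

base = [Update(1, "A", "p0"), Update(2, "A", "p1")]
a, b = Node("A", base), Node("B", base)
a.add_photo(10, "pid1")               # <10,A>
b.add_photo(20, "pid2")               # <20,B>
before_a, before_b = a.album(), b.album()
a.sync_with(b)
after_a, after_b = a.album(), b.album()
print(before_a, before_b)             # tentative views before syncing
print(after_a, after_b)               # identical after replay
```

Because both nodes replay the same sorted log, they reach the same state, at the cost of re-running every update on each sync.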
- Displayed photo positions are “tentative”
- B’s user saw photo pid2 at 3rd position, then it changed to 4th position
- You never know if there’s some other photo from nodes you haven’t yet synced with
- That would change pid2’s position yet again
- Will update order be consistent with wall-clock time?
- Maybe A went first (in wall-clock time) with timestamp <10,A>
- Node clocks are not perfectly synchronized
- So B could then generate <9,B>
- B’s photo gets ordered first, even though A added its photo first
- Will update order be consistent with causality?
- What if A adds a photo pid1,
- then B sees it,
- then B deletes pid1
- Perhaps
- <10,A> add
- <9,B> delete – B’s clock is slow
- Now delete will be ordered before add!
- Unlikely to work
- Differs from wall-clock time case b/c system knew B had seen the add
Lamport logical clocks
- Want to timestamp events s.t.
- if a node observes E1 and then generates E2, then TS(E2) > TS(E1)
- Thus other nodes will order E1 and E2 the same way.
- Each node keeps a clock T
- increments T as real time passes, one second per second
- T = max(T, T’+1) if it sees T’ from another node
- Note properties:
- E1 then E2 on same node => TS(E1) < TS(E2)
- BUT causality is only a partial order
- TS(E1) < TS(E2) does not imply E1 came before E2
- Logical clock solves add/delete causality example
- When B sees <10,A>,
- B will set its clock to 11, so
- B will generate <11,B> for its delete
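A small sketch of the logical clock rule applied to the add/delete example. The `LamportClock` class and its method names are hypothetical. One simplification is assumed: the clock advances by one after each local event instead of ticking with real time.

```python
from typing import Tuple

class LamportClock:
    def __init__(self, node: str, start: int = 0):
        self.node = node
        self.t = start

    def observe(self, t_other: int) -> None:
        # Rule from the notes: T = max(T, T' + 1) on seeing T' from another node.
        self.t = max(self.t, t_other + 1)

    def stamp(self) -> Tuple[int, str]:
        # Assumption: bump the clock after each local event so that later
        # events on the same node get strictly larger timestamps.
        ts = (self.t, self.node)
        self.t += 1
        return ts

# Without the observe step, B's slow clock orders the delete before the add.
skewed_ts = LamportClock("B", start=9).stamp()    # <9,B>

clock_a = LamportClock("A", start=10)
clock_b = LamportClock("B", start=9)              # B's clock is slow
add_ts = clock_a.stamp()                          # <10,A> add pid1
clock_b.observe(add_ts[0])                        # B sees <10,A>, jumps to 11
delete_ts = clock_b.stamp()                       # <11,B> delete pid1

print(skewed_ts < add_ts)    # delete would sort before add
print(add_ts < delete_ts)    # with the logical clock, add sorts first
```

Python compares the `(T, node)` tuples lexicographically, which matches the ordering rule for update IDs given earlier.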
- Irritating that there could always be a long-delayed update with lower TS
- That can cause the results of my update to change
- Would be nice if updates were eventually “stable”
- => no changes in update order up to that point
- => results can never again change – e.g. you know for sure pid1 is at position 3.
- => no need to re-run update function
- How about a fully decentralized “commit” scheme?
- You want to know if update <10,A> is stable
- Have sync always send in log order – “prefix property”
- If you have seen updates w/ TS > 10 from every node
- Then you’ll never again see one ordered before <10,A>
- So <10,A> is stable
- Why doesn’t Bayou do something like this? (Bayou commits updates through designated primary replica)
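The decentralized stability check described above can be sketched as a single predicate. The function `is_stable` and the `highest_seen` map are hypothetical names. The sketch assumes the prefix property holds and that the full set of nodes is known.

```python
from typing import Dict, Iterable, Tuple

def is_stable(update_id: Tuple[int, str],
              highest_seen: Dict[str, int],
              all_nodes: Iterable[str]) -> bool:
    """True if every node has been seen with a timestamp greater than the update's.

    highest_seen maps node ID -> largest timestamp received from that node.
    A node missing from the map has never been heard from.
    """
    t, _creator = update_id
    return all(n in highest_seen and highest_seen[n] > t for n in all_nodes)

nodes = ["A", "B", "C"]
seen = {"A": 12, "B": 11}                          # nothing received from C yet
stable_before = is_stable((10, "A"), seen, nodes)

seen["C"] = 15                                     # a sync brings in C's updates
stable_after = is_stable((10, "A"), seen, nodes)

print(stable_before, stable_after)
```

One consequence visible in the sketch: as long as any single node has not been heard from, no update can be declared stable.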
- How to sync?
- A sending to B
- Need a quick way for B to tell A what to send
- A has:
- <-,10,X>
- <-,20,Y>
- <-,30,X>
- <-,40,X>
- B has:
- <-,10,X>
- <-,20,Y>
- <-,30,X>
- At start of sync, B tells A “X 30, Y 20”
- Sync prefix property means B has all X updates before 30, all Y before 20
- A sends all X’s updates after <-,30,X>, all Y’s updates after <-,20,Y>, &c
- This is a version vector – it summarizes log content
- It’s the “F” vector in Figure 4
- A’s F: [X:40,Y:20]
- B’s F: [X:30,Y:20]
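A sketch of the version-vector exchange for the A-to-B sync above. The helper names `version_vector` and `updates_to_send` are hypothetical, and log entries are reduced to `(T, node)` pairs, leaving out the first field shown as `-` in the notes.

```python
from typing import Dict, List, Tuple

UpdateId = Tuple[int, str]   # (timestamp T, creating node ID)

def version_vector(log: List[UpdateId]) -> Dict[str, int]:
    """Summarize a log as the highest timestamp seen from each node."""
    vv: Dict[str, int] = {}
    for t, node in log:
        if node not in vv or t > vv[node]:
            vv[node] = t
    return vv

def updates_to_send(sender_log: List[UpdateId],
                    receiver_vv: Dict[str, int]) -> List[UpdateId]:
    """Pick, in log order, the updates the receiver's vector shows it lacks."""
    return [
        (t, node)
        for t, node in sorted(sender_log)
        if node not in receiver_vv or t > receiver_vv[node]
    ]

log_a = [(10, "X"), (20, "Y"), (30, "X"), (40, "X")]
log_b = [(10, "X"), (20, "Y"), (30, "X")]

vv_a = version_vector(log_a)              # A's F: X:40, Y:20
vv_b = version_vector(log_b)              # B's F: X:30, Y:20
to_send = updates_to_send(log_a, vv_b)    # what A sends to B

print(vv_a, vv_b, to_send)
```

The vector is only a valid summary because of the prefix property: having X:30 implies B holds every earlier update from X.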
- How did all this work out?
- Replicas, write any copy, and sync are good ideas
- Now used by both user apps and multi-site storage systems
- Requirement for p2p interaction is debatable
- clients (phones, iPads) can just (sporadically) contact the servers
- Bayou introduced some very influential design ideas
- Update functions
- Ordered update log
- Allowed general purpose conflict resolution
- Bayou made good use of existing ideas
- Logical clock
