- Introduction
- Past two weeks: Transactions (atomicity + isolation) on a single machine.
- Today: Distributed transactions.
- The Setup + The Problem
- Client + coordinator + two servers: One with accounts A–M, the other with accounts N–Z.
- Coordinator + servers all have logs.
- Coordinator passes messages from the client to the appropriate servers (see slides); responses from servers/coordinator indicate whether the action completed successfully, or whether we need to abort.
- New problems to deal with, besides server failure: Network loss/reordering, coordinator failure.
- The main problem, though: Multiple servers can experience different events.
- One commits while the other crashes, one commits while the other aborts, etc.
- Dealing with the Network
- Message loss and reordering are easy to handle: reliable transport.
- If messages are lost, they're retransmitted.
- If duplicates are received, they're ignored.
- If messages arrive out of order, they can be put back in order.
- These are the exactly-once semantics we discussed on the very first day of class, with RPCs! (A minimal sketch of the receiver-side bookkeeping follows below.)
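- A minimal sketch of that receiver-side bookkeeping; the `ReliableReceiver` class, request-ID scheme, and method names are illustrative assumptions, not part of the lecture's system.

```python
# Hypothetical sketch of duplicate suppression: the sender retransmits
# until it gets an ACK; the receiver remembers request IDs it has
# already handled, so duplicates are acknowledged without re-executing.

class ReliableReceiver:
    def __init__(self):
        self.completed = {}  # request_id -> saved response

    def handle(self, request_id, execute):
        if request_id in self.completed:
            # Duplicate (e.g., our earlier ACK was lost): return the
            # saved response, do not redo the work.
            return self.completed[request_id]
        response = execute()               # perform the action once
        self.completed[request_id] = response
        return response
```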
- Basic Two-phase Commit (2PC)
- Two-phase commit is the protocol that will help us here.
- Basic protocol:
- Coordinator sends tasks to servers (workers).
- Once all tasks are done, coordinator sends prepare messages to workers. Prepared = tentatively committed: a prepared worker guarantees it will still be able to commit the transaction if told to, even if it crashes in the meantime.
- Once all workers are confirmed to be prepared, coordinator will tell them to commit, and tell client that the transaction has committed.
- Two phases: Prepare phase, commit phase (see the sketch below).
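- A minimal in-memory sketch of those two phases, with no network and no failures yet; the `Worker`/`Coordinator` classes, method names, and log records are illustrative assumptions, not the lecture's exact design.

```python
# Sketch of basic 2PC: phase 1 collects prepare votes, phase 2 commits
# only if every worker voted YES. No failures are modeled here.

class Worker:
    def __init__(self):
        self.log = []        # each worker keeps its own log
        self.pending = {}    # txid -> tentative work, not yet committed

    def do_task(self, txid, task):
        self.pending[txid] = task

    def prepare(self, txid):
        # Voting YES is a promise: once prepared, the worker can commit
        # this transaction even if it crashes and recovers.
        self.log.append(("PREPARE", txid))
        return "YES"

    def commit(self, txid):
        self.log.append(("COMMIT", txid))
        self.pending.pop(txid, None)

    def abort(self, txid):
        self.log.append(("ABORT", txid))
        self.pending.pop(txid, None)


class Coordinator:
    def __init__(self, workers):
        self.workers = workers
        self.log = []

    def run(self, txid, tasks):
        for worker, task in zip(self.workers, tasks):
            worker.do_task(txid, task)

        # Phase 1: prepare.
        votes = [worker.prepare(txid) for worker in self.workers]

        if all(vote == "YES" for vote in votes):
            self.log.append(("COMMIT", txid))   # the commit point
            for worker in self.workers:         # Phase 2: commit.
                worker.commit(txid)
            return "committed"
        else:
            self.log.append(("ABORT", txid))
            for worker in self.workers:
                worker.abort(txid)
            return "aborted"
```

- For example, `Coordinator([Worker(), Worker()]).run("t1", ["debit A", "credit Z"])` returns `"committed"` once both (hypothetical) workers vote YES.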
- Worker/Network Failures Prior to the Commit Point
- Basic idea: It’s okay to abort.
- Lost prepare message: Coordinator times out and resends.
- If prepare messages experience persistent loss, the coordinator considers that worker to have failed (see the sketch after this list).
- If prepare messages make it to some workers but not others, coordinator continues resending to "missing" workers until either everyone is prepared, or it has deemed some workers to have failed.
- Lost ACK for prepare: Coordinator times out and resends. Reliable transport means that workers won't repeat the action; they'll just ACK the duplicate.
- Worker failure before prepare: Coordinator sends abort messages to all workers and the client, and writes an ABORT record on its log.
- Upon recovery, the worker will find that this transaction has aborted; see worker failure recovery in the next part.
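- A sketch of how the coordinator might handle these pre-commit-point failures; the retry limit and the `send_prepare`/`send_abort` callables are assumptions for illustration.

```python
# Sketch of the prepare phase with timeouts and retransmission: retry a
# lost PREPARE a bounded number of times, then deem the worker failed.
# Before the commit point it is always safe to abort.

MAX_RETRIES = 5

def prepare_or_abort(txid, workers, send_prepare, send_abort):
    votes = []
    for worker in workers:
        vote = "FAILED"                     # assume failed until we hear back
        for _ in range(MAX_RETRIES):
            try:
                # Reliable transport: an already-prepared worker simply
                # re-ACKs a duplicate PREPARE; it won't redo any work.
                vote = send_prepare(worker, txid)
                break
            except TimeoutError:
                continue                    # lost message: retransmit
        votes.append(vote)

    if all(vote == "YES" for vote in votes):
        return True                         # safe to pass the commit point

    # Some worker voted no or was deemed failed: abort everywhere.
    for worker in workers:
        send_abort(worker, txid)
    return False
```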
- Worker/Network Failures After the Commit Point
- Basic idea: It’s *not* okay to abort.
- Lost commit message: Coordinator times out and resends. Worker will also send a request for the status of this particular transaction.
- For this specific failure, that status request isn't strictly needed; the worker could simply wait for the retransmitted commit.
- Lost ACK for commit message: Coordinator times out and resends.
- Worker failure before receiving commit: Can't abort now!
- After receiving prepare messages, workers write PREPARE records into their logs.
- On recovery, they scan the log to determine what transactions are prepared but not yet committed or aborted.
- They then ask the coordinator for the status of that transaction. In this case, it has committed, so the coordinator sends back a commit message.
- This request is the same as the one above, after the commit message was lost. Whenever a worker has a prepared but not yet committed/aborted transaction, it periodically asks the coordinator for its status (see the recovery sketch below). This also takes care of the case where a worker has not crashed, but persistent network loss has led the coordinator to deem it crashed.
- Coordinator typically keeps a table mapping transaction ID to its state, for quick lookup here.
- Worker failure after commit received: No problem; transaction is committed. Duplicate commits may be received after recovery, but that's fine (hooray for reliable transport).
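- A sketch of the worker-side recovery scan just described; the log format and the `query_coordinator`/`commit`/`abort` callables are assumed for illustration.

```python
# Sketch of worker recovery: find transactions that are prepared but not
# yet committed or aborted, then ask the coordinator what happened. A
# prepared worker has given up the right to abort on its own, so in a
# real system this query would be retried periodically until answered.

def undecided_transactions(log):
    prepared, decided = set(), set()
    for record, txid in log:
        if record == "PREPARE":
            prepared.add(txid)
        elif record in ("COMMIT", "ABORT"):
            decided.add(txid)
    return prepared - decided

def recover_worker(log, query_coordinator, commit, abort):
    for txid in undecided_transactions(log):
        outcome = query_coordinator(txid)  # coordinator checks its txid -> state table
        if outcome == "COMMIT":
            commit(txid)
        else:
            abort(txid)
```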
- Coordinator Failures
- Basic idea: If before the commit point, can abort. If not, can’t!
- Once coordinator has heard that all workers are prepared, it writes COMMIT to its own log. This is the commit point.
- Once coordinator has heard that all workers are committed, it writes a DONE record to its own log. At that point, transaction is totally done.
- Coordinator failure before the commit point (e.g., before or during the prepare phase): On recovery, abort (send abort messages to workers + client).
- Why not try to continue on with the transaction? Likely the client has also timed out and assumed abort. Aborting everywhere is much cleaner.
- Coordinator failure after commit point, but before DONE: On recovery, resend commits. Duplicate commit messages to some workers are no problem.
- Coordinator failure after writing DONE: Transaction is complete; nothing to do.
- The DONE record keeps the coordinator from resending commit messages for every transaction it has ever committed each time it recovers (see the recovery sketch below).
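- A sketch of coordinator recovery from its own log, following the cases above; the record names match the notes, but the log format and helper callables are assumptions.

```python
# Sketch of coordinator recovery: DONE means nothing to do; COMMIT
# without DONE means keep pushing commits; anything else it logged but
# never committed gets aborted.

def recover_coordinator(log, workers, resend_commit, send_abort):
    committed, done, seen = set(), set(), set()
    for record, txid in log:
        seen.add(txid)
        if record == "COMMIT":
            committed.add(txid)
        elif record == "DONE":
            done.add(txid)

    for txid in seen:
        if txid in done:
            continue                          # fully finished: nothing to do
        if txid in committed:
            for worker in workers:            # past the commit point: must commit
                resend_commit(worker, txid)   # duplicates are harmless
        else:
            for worker in workers:            # never reached the commit point: abort
                send_abort(worker, txid)
```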
- Performance Issues
- Coordinator can forget state of a transaction after it is DONE (minus having the records in its logs, of course).
- Workers cannot forget the state of a transaction until after they hear commit/abort from coordinator.
- 2PC can be impractical. Sometimes we use compensating actions instead (e.g., banks let you cancel a transaction for free if you do so within X minutes of initiating the transaction).
- 2PC Summary
- 2PC provides a way for a set of distributed nodes to reach agreement (commit or abort).
- Does NOT guarantee that they agree at the same instant, nor that they even agree within bounded time.
- This is an instance of the Two-Generals Paradox.
- A Remaining Problem
- When the coordinator is down in our system, the whole thing is inaccessible. When a worker is down, part of our data is unavailable.
- Solution is replication. But how do we keep data consistent?
- Ideal property: Single-copy consistency.
- Property of the externally-visible behavior of a replicated system.
- Operations appear to execute as if there's only a single copy of the data.
- We will see a way to provide single-copy consistency on Wednesday.
- Tomorrow in recitation: PNUTS, which uses a more relaxed version of consistency.
- Why relax your constraints? Single-copy consistency will add a lot of overhead. If applications don't need it, they can often get better performance by relaxing their semantics.
- Another system that does not use single-copy consistency: DNS.