Two-phase commit protocol

Mar 5, 2024

The two-phase commit protocol (2PC) is an atomic commitment protocol (ACP) used in distributed systems to ensure that a transaction either commits on every participating node or on none of them.

Here’s a step-by-step breakdown of how it works:

Phase 1: Voting Phase

Begin Transaction

The transaction begins at a node known as the coordinator. This node is responsible for initiating the transaction and coordinating the commit process.

Query to Commit

The coordinator sends a “query to commit” message to all other nodes involved in the transaction (also known as cohorts). This message asks each node whether it is ready and able to commit the transaction.

Vote

Each cohort node then responds with a vote. This vote can either be a “Yes” vote, indicating that the node has recorded the transaction and is ready to commit, or a “No” vote, indicating that the node is not ready to commit (usually because of a failure during the transaction).

In a distributed system, communication between nodes is typically achieved through some sort of messaging system. This can be implemented in various ways, such as through APIs, events, or direct socket connections, among others. The specific implementation can vary greatly depending on the system’s requirements, the programming language being used, the network architecture, and other factors.

Here’s a general idea of how a node might send a “Yes” vote in the voting phase:

APIs: If the system is using a RESTful architecture, for example, the coordinator might send an HTTP request to an endpoint on the cohort, asking it to vote on the transaction. The cohort would then respond to this request with a “Yes” or “No”, indicating its vote.

Direct Connections: In a system using direct socket connections, the coordinator might send a message directly to the cohort over the socket, asking it to vote. The cohort would then send a message back over the socket to indicate its vote.

It’s important to note that in all these cases, the “Yes” vote would typically only be sent after the cohort has ensured that it can commit the transaction. This might involve, for example, checking that all necessary data is available and valid, that the transaction doesn’t conflict with any others, and that there are no other issues that would prevent the transaction from being committed.
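The checks described above can be sketched as a small in-memory cohort. This is only an illustration under assumptions: the `Cohort` class, the `prepare` method, the key-level locking, and the "values must be non-negative" rule are all hypothetical and not part of any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """Hypothetical in-memory cohort; every name here is illustrative."""
    name: str
    data: dict = field(default_factory=dict)
    locked: set = field(default_factory=set)      # keys held by prepared transactions
    prepared: dict = field(default_factory=dict)  # txn_id -> pending writes

    def prepare(self, txn_id: str, writes: dict) -> str:
        """Vote "YES" only if this cohort can guarantee a later commit."""
        # Conflict check: another prepared transaction already holds a key.
        if any(key in self.locked for key in writes):
            return "NO"
        # Validity check (assumed rule for this sketch: no negative values).
        if any(value < 0 for value in writes.values()):
            return "NO"
        # Record the pending writes and lock the keys before voting.
        self.prepared[txn_id] = dict(writes)
        self.locked.update(writes)
        return "YES"
```

Note that `prepare` does not touch `data`; in this sketch the pending writes would only be applied once the coordinator's commit message arrives.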

Phase 2: Commit Phase

Commit or Abort

If all cohorts have responded with a “Yes” vote, the coordinator sends a commit message to all cohorts. If any cohort has responded with a “No” vote, the coordinator sends an abort message to all cohorts.
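The decision rule is simple enough to write down directly. In this minimal sketch the function name `decide` and the vote strings are assumptions, and a missing reply (`None`, for example after a timeout) is treated the same as a “No” vote, which is a common convention rather than something the rule above spells out.

```python
from typing import Dict, Optional


def decide(votes: Dict[str, Optional[str]]) -> str:
    """Map cohort name -> "YES", "NO", or None (no reply) to a decision."""
    # Commit only when every cohort explicitly voted YES.
    if votes and all(vote == "YES" for vote in votes.values()):
        return "COMMIT"
    # Any NO vote, any missing vote, or an empty vote set means abort.
    return "ABORT"
```

For example, `decide({"a": "YES", "b": "NO"})` returns `"ABORT"`, so a single “No” is enough to abort the transaction everywhere.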

Acknowledge

Upon receiving the commit or abort message, each cohort carries out the command and then sends an acknowledgment back to the coordinator.

Completion

Once the coordinator has received acknowledgments from all cohorts, the transaction is considered complete.
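Putting both phases together, a coordinator could look roughly like the sketch below. It runs everything in one process with direct method calls instead of network messages, and it ignores failures and timeouts entirely. The `Participant` class and the `two_phase_commit` function are hypothetical names chosen for this example.

```python
class Participant:
    """Hypothetical cohort whose vote is fixed up front, for illustration."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "INIT"

    def prepare(self, txn_id):
        # Phase 1: vote. A cohort that votes No can abort locally right away.
        self.state = "PREPARED" if self.will_vote_yes else "ABORTED"
        return self.will_vote_yes

    def commit(self, txn_id):
        self.state = "COMMITTED"
        return True  # acknowledgment

    def abort(self, txn_id):
        self.state = "ABORTED"
        return True  # acknowledgment


def two_phase_commit(txn_id, participants):
    # Phase 1: ask every cohort to vote.
    votes = [p.prepare(txn_id) for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"

    # Phase 2: send the decision to every cohort and collect acknowledgments.
    acks = []
    for p in participants:
        if decision == "COMMIT":
            acks.append(p.commit(txn_id))
        else:
            acks.append(p.abort(txn_id))

    # The transaction is complete once every cohort has acknowledged.
    return decision, all(acks)
```

Calling `two_phase_commit("t1", [Participant("a"), Participant("b", will_vote_yes=False)])` yields an abort decision, and both participants end up in the `ABORTED` state.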

This protocol ensures that either all nodes commit the transaction or none do, maintaining the consistency of the distributed system. However, two-phase commit is a blocking protocol. A cohort that has voted “Yes” can no longer abort on its own, so if the coordinator fails before its decision arrives, that cohort is left waiting, typically while holding locks, until the coordinator recovers.

In real-world implementations, additional steps and safeguards are included to handle failures and ensure data consistency. For example, each node writes its protocol state (such as its vote or the coordinator’s decision) to a persistent log before sending the corresponding message, so that it can recover its state after a crash.
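As a rough sketch of that kind of logging, the helpers below append one JSON record per state change to a file and force it to disk before returning. The function names, the record layout, and the state strings are assumptions made for this example; real systems use their own log formats.

```python
import json
import os


def log_record(path, txn_id, state):
    """Append a protocol state record and flush it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"txn": txn_id, "state": state}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the record survives a crash


def last_state(path, txn_id):
    """Return the most recent logged state for txn_id, or None if there is none."""
    if not os.path.exists(path):
        return None
    state = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["txn"] == txn_id:
                state = record["state"]
    return state
```

On restart, a node could call `last_state` for each in-flight transaction to find out whether it had already voted or already learned the decision.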

If a cohort votes No, how should the other cohorts abort the transaction?

In a two-phase commit protocol, if any cohort votes “No” during the voting phase, the coordinator sends an “Abort” message to all cohorts. This means that none of the cohorts should commit the transaction.

Now, if a cohort has already written the data to its local database during the transaction, it will need to roll back those changes when it receives the “Abort” message. Here’s a general idea of how this might work:

Logging: When a cohort writes data to its local database as part of a transaction, it also logs this operation. This log includes enough information to undo the operation, such as the old value of any changed data. This is often done using a write-ahead log, where the log entry is written before the actual data is changed.
Abort: When the cohort receives an “Abort” message, it looks up the log entry for the transaction. It then uses this log entry to undo the changes it made during the transaction. This might involve, for example, restoring any changed data to its old value.
Acknowledgment: After the cohort has successfully rolled back the transaction, it sends an acknowledgment to the coordinator. This lets the coordinator know that the cohort has successfully aborted the transaction.
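The three steps above can be sketched with an in-memory undo log. This is a simplified illustration, not a real write-ahead log: nothing is persisted, and the `UndoLogStore` class and its methods are hypothetical.

```python
class UndoLogStore:
    """Hypothetical key-value store that keeps an undo log per transaction."""

    def __init__(self):
        self.data = {}
        self.undo_log = {}  # txn_id -> list of (key, old_value, key_existed)

    def write(self, txn_id, key, value):
        # Logging: record the old value before changing the data.
        existed = key in self.data
        entry = (key, self.data.get(key), existed)
        self.undo_log.setdefault(txn_id, []).append(entry)
        self.data[key] = value

    def abort(self, txn_id):
        # Abort: undo the changes in reverse order using the log entries.
        for key, old_value, existed in reversed(self.undo_log.pop(txn_id, [])):
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)
        # Acknowledgment: report the successful rollback.
        return "ACK"

    def commit(self, txn_id):
        # The changes are already applied; the undo entries are no longer needed.
        self.undo_log.pop(txn_id, None)
        return "ACK"
```

Because `abort` removes the log entries it has processed, calling it a second time for the same transaction changes nothing, which makes it safe for a coordinator to resend the abort message.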

This process ensures that even if a cohort has already applied a transaction’s changes locally, it can still undo them when the transaction is aborted. However, it does require careful management of logs and transaction state, and it can be complex to implement correctly.

What happens if the abort procedure fails on one of the cohorts?

In a two-phase commit protocol, a failed abort at a cohort is a serious problem because it can leave the distributed system inconsistent: the transaction’s changes remain on that node while the other nodes have rolled them back, which is exactly what the protocol is designed to avoid.

To handle such situations, two-phase commit protocol implementations often include recovery mechanisms. Here’s a general idea of what might happen if an abort operation fails:

Detection: The coordinator needs to be aware that the abort operation has failed at a cohort. This can be achieved through timeouts (if the coordinator doesn’t receive an acknowledgment from a cohort within a certain time, it assumes that the operation has failed) or through explicit error messages from the cohort.
Recovery: Once the coordinator knows that the abort operation has failed, it can trigger a recovery process. This might involve retrying the abort operation, possibly after a delay or after taking some action to resolve the issue that caused the failure.
Logging and Persistence: All actions in the two-phase commit protocol are usually logged persistently. So even if a node crashes during an abort operation, it can look at the log when it recovers and see that it was supposed to abort the transaction. It can then complete the abort operation.
Manual Intervention: In some cases, automatic recovery might not be possible, and manual intervention could be required. For example, a database administrator might need to manually undo the transaction on the cohort where the abort operation failed.
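The detection and recovery steps could be combined into a simple retry loop on the coordinator side, as in the sketch below. The `send_abort` callable is a hypothetical stand-in for whatever actually delivers the abort message; it is assumed to return `True` when the cohort acknowledges and to return `False` or raise `TimeoutError` / `ConnectionError` otherwise.

```python
import time


def abort_with_retry(send_abort, max_attempts=3, delay_seconds=0.0):
    """Retry an abort until it is acknowledged.

    Returns the attempt number that succeeded, or None if every attempt
    failed and the case needs to be escalated (e.g. manual intervention).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if send_abort():
                return attempt  # acknowledgment received
        except (TimeoutError, ConnectionError):
            pass  # treat a timeout or broken connection as a failed attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before trying again
    return None
```

Retrying like this is only safe if the cohort's abort handling is idempotent, so that receiving the same abort message twice has no additional effect.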

Handling failures is one of the main challenges of using the two-phase commit protocol, and it can be complex. That’s why some distributed systems use protocols designed to reduce blocking, such as the three-phase commit protocol, which adds an extra round of messages and makes its own assumptions about the network.