How does a database maintain resiliency?

Jun 25, 2024

A Write-Ahead Log (WAL) is a technique used in databases to ensure data integrity and durability. Here’s an easy way to understand it:

The Diary (Database): Think of the database as a diary where you keep all your important information.
The Notebook (WAL): Now, imagine you also have a small notebook where you jot down quick notes before writing them neatly in your diary.

Data Inconsistency Issue

When transferring a count from one location to another, a crash after updating the first location but before updating the second leaves the data inconsistent. For example, the count is reduced in one location but never increased in the other.

Write-Ahead Log (WAL) Solution

A WAL helps prevent this inconsistency by first writing the details of the update to a log file.
This log write is atomic, meaning it either completes fully or not at all, ensuring that the system has a record of the intended changes.
After writing to the log, the system acknowledges the request and proceeds to update the actual data files (e.g., bos.json and pnq.json).
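The log-then-apply order described above can be sketched as follows. This is a minimal illustration, not how any particular database implements it: the JSON-lines log format, the wal.log file name, the "count" field, and the function names are all assumptions made for the example. It also does not handle a partially written last log line, which a real WAL would need to detect.

```python
import json
import os
import tempfile


def append_to_wal(wal_path, entry):
    """Append one JSON entry as a single line and force it to disk."""
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(json.dumps(entry) + "\n")
        wal.flush()
        os.fsync(wal.fileno())  # ask the OS to push the entry to stable storage


def apply_delta(data_path, delta):
    """Add delta to the count stored in a small JSON data file."""
    with open(data_path, encoding="utf-8") as f:
        data = json.load(f)
    data["count"] += delta
    with open(data_path, "w", encoding="utf-8") as f:
        json.dump(data, f)


def transfer(directory, amount):
    """Move `amount` from bos.json to pnq.json, logging the intent first."""
    entry = {"from": "bos.json", "to": "pnq.json", "amount": amount}
    # 1. Record the intended change in the log before touching any data file.
    append_to_wal(os.path.join(directory, "wal.log"), entry)
    # 2. At this point the request could be acknowledged to the caller.
    # 3. Update the data files; a crash between these two calls is the
    #    situation the log exists to repair.
    apply_delta(os.path.join(directory, "bos.json"), -amount)
    apply_delta(os.path.join(directory, "pnq.json"), +amount)
```

The important design choice is the ordering: nothing in bos.json or pnq.json changes until the log entry is on disk.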

Crash Recovery

If the system (Neptune) crashes after updating one file but before updating the other, the log contains the information needed to restore consistency.
Upon restarting, the system reads the log to determine the last known state and the intended changes, allowing it to complete any incomplete updates.
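As a rough sketch of recovery, assume each log entry records the values the files should end up with (the "new_values" field is a hypothetical format chosen for this example, and the files are modelled as a plain dictionary). Redoing such an entry is harmless even if part of it was already applied before the crash.

```python
def recover(wal_entries, files):
    """Redo every logged update and return the repaired state.

    `files` maps a file name to its stored count. Because each entry holds
    final values rather than deltas, replaying it twice gives the same result.
    """
    repaired = dict(files)  # work on a copy so the input is left untouched
    for entry in wal_entries:
        for name, value in entry["new_values"].items():
            repaired[name] = value
    return repaired


# The crash happened after bos.json was updated but before pnq.json was.
wal = [{"new_values": {"bos.json": 7, "pnq.json": 3}}]
after_crash = {"bos.json": 7, "pnq.json": 0}
restored = recover(wal, after_crash)  # pnq.json is brought up to 3
```

If the log stored deltas instead, recovery would also need to know which parts of an entry had already been applied, which this sketch avoids.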

Resilience and Replication

The WAL provides resilience by ensuring that the system can recover to a consistent state after a crash.
For replication, if multiple nodes start with the same initial state and apply the same sequence of log entries, they will all reach the same final state, ensuring consistency across replicas.
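The replication property can be illustrated with a small sketch. The ("set" / "add", key, value) entry format and the apply_log name are assumptions for the example; the point is only that applying the log is deterministic.

```python
def apply_log(initial_state, log):
    """Apply log entries in order and return the resulting state."""
    state = dict(initial_state)
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "add":
            state[key] = state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


log = [("set", "count", 10), ("add", "count", -3), ("add", "count", 5)]
replica_a = apply_log({}, log)
replica_b = apply_log({}, log)  # same start, same log, same result
```

Both replicas end with a count of 12. Applying the entries in a different order could give a different result, which is why every replica must apply the same sequence.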

Concurrent updates

The Leader and Followers approach can help manage concurrent updates effectively.

In this architecture, one node is designated as the leader, and the other nodes are followers. The leader is responsible for handling all update requests and ensuring data consistency. Followers forward their requests to the leader and apply updates received from the leader.

When Alice wants to increase the count and Bob wants to reduce it, each sends the request to their respective node. These nodes forward the requests to the leader.
The leader receives both requests and has the authority to decide how to process them. It can process the requests in the order they are received or based on predetermined rules.
Before writing the update to the WAL, the leader checks the current state of the data to ensure that the update can be applied without causing inconsistencies or conflicts.
If the update can be applied, the leader writes the update details to its WAL.
After successfully writing the update to the WAL, the leader processes the update and modifies the data. The leader then broadcasts the update to all follower nodes, which write the update to their own WALs and apply the changes to their data.
If there is a conflict (e.g., insufficient count), the leader does not write the update to the WAL and instead rejects the request. The rejection is communicated back to the user (Alice or Bob) with an appropriate reason.
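The steps above can be sketched in memory as follows. The Leader and Follower classes, their method names, and the rule that the count may not go below zero are assumptions made for illustration; networking and failures are left out entirely.

```python
class Follower:
    def __init__(self, count=0):
        self.count = count
        self.wal = []

    def replicate(self, delta):
        self.wal.append(delta)  # write to the follower's own WAL first
        self.count += delta     # then apply the change to its data


class Leader:
    def __init__(self, followers, count=0):
        self.count = count
        self.wal = []
        self.followers = followers

    def handle(self, user, delta):
        """Process one request; return (accepted, message for the user)."""
        # Check the current state before anything is logged.
        if self.count + delta < 0:
            return (False, f"{user}: rejected, insufficient count")
        self.wal.append(delta)          # log the accepted update
        self.count += delta             # apply it locally
        for follower in self.followers:
            follower.replicate(delta)   # broadcast to every follower
        return (True, f"{user}: applied")


followers = [Follower(), Follower()]
leader = Leader(followers)
first = leader.handle("Bob", -2)    # rejected: the count would go negative
second = leader.handle("Alice", 5)  # accepted
third = leader.handle("Bob", -2)    # accepted now that the count is 5
```

Because every request passes through one handle method on one node, the updates are applied one at a time, in an order the leader controls.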

The leader node ensures that all updates are processed in a controlled and sequential manner, and any conflicts are resolved centrally. This approach helps in maintaining a consistent state across the cluster and ensures that all nodes have the same view of the data.

What if the leader fails?

Nodes in a cluster send regular heartbeat messages to each other to indicate they are alive and communicating. If a node does not receive a heartbeat from another node within a specified period, it marks that node as down.
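A heartbeat check of this kind might look like the sketch below. The HeartbeatMonitor class and its method names are hypothetical, and timestamps are passed in explicitly so the example is deterministic; a real node would more likely read them from a clock such as time.monotonic().

```python
class HeartbeatMonitor:
    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence before a node is marked down
        self.last_seen = {}     # node name -> time of its latest heartbeat

    def record(self, node, now):
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def down_nodes(self, now):
        """Return the nodes whose latest heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("leader", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-b", now=4.0)   # node-b keeps sending heartbeats
down = monitor.down_nodes(now=5.0)  # the leader has been silent for 5 seconds
```

The timeout is a trade-off: a short value detects failures quickly but may mark a slow node as down by mistake.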

Leader Election

When the leader is marked as down, the remaining nodes initiate a leader election process to choose a new leader. This ensures that the cluster can continue to operate and handle requests even if the original leader is unavailable.

If the leader crashes after replicating an update to only some of the nodes, the nodes will be in different states. To resolve this, the system relies on the Write-Ahead Log (WAL): the leader writes changes to its WAL and replicates these log entries to its followers.

When a new leader is elected, the nodes compare their log entries.

Nodes that received the updates from the old leader will have later log entries than the others. The lagging nodes can then copy and apply the missing log entries from those nodes to reach a consistent state.
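A simplified sketch of this catch-up step is shown below. It assumes every shorter log is a prefix of the longest one, so lagging nodes only need the missing tail; real consensus protocols also have to deal with logs that diverge, which this example ignores. The catch_up function and the node names are hypothetical.

```python
def catch_up(logs):
    """Bring every node's log up to date with the longest log.

    `logs` maps a node name to its list of log entries. Returns the name
    of the node that had the longest log.
    """
    source = max(logs, key=lambda node: len(logs[node]))
    longest = logs[source]
    for log in logs.values():
        # Copy only the entries this node has not seen yet.
        log.extend(longest[len(log):])
    return source


logs = {
    "node-a": ["add 5", "add -2", "add 4"],  # received everything
    "node-b": ["add 5"],                     # missed the last two updates
    "node-c": ["add 5", "add -2"],           # missed the last update
}
most_complete = catch_up(logs)
```

After the call, all three logs hold the same three entries, so applying them in order gives every node the same state.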

Quorum-Based Consistency

The system uses a quorum mechanism, where a majority of nodes must successfully replicate the log entries for the update to be considered committed. In a cluster with three nodes, a quorum is achieved if at least two nodes have the log entries. This allows the leader to confirm the update to the user once it has been replicated to a majority of the nodes, even if a few nodes have not yet received it.
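The majority rule is simple arithmetic, sketched here with two hypothetical helper functions. The count of acknowledgements is assumed to include the leader's own copy of the log entry.

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def is_committed(acks, cluster_size):
    """True once enough nodes hold the log entry to form a quorum."""
    return acks >= quorum_size(cluster_size)
```

For three nodes the quorum is two, and for five nodes it is three, so a five-node cluster can keep committing updates with two nodes unavailable.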

By using heartbeats, leader election, and log-based replication, the distributed system can handle failures and maintain consistency. The quorum mechanism ensures that updates are committed as long as a majority of nodes have the log entries, allowing the system to continue operating even if some nodes are temporarily unavailable.