I was catching up on the Mutual Exclusion chapter from The Art of Multiprocessor Programming, and while reading through the discussion thread, it became clear that there aren't many practical, real-world examples of livelocks being shared.
This reminded me of a messy situation in production: a flawed implementation of optimistic locking combined with multiple threads consuming Kafka batches. It became a perfect recipe for threads actively preventing each other from making progress, while still appearing to "do work."
At the time, it was clear that the retries were happening because of contention, but I didn't know that this scenario could be classified as a livelock.
Most engineers are familiar with deadlocks: threads get stuck waiting on each other, and nothing moves forward. Easy to detect, easy to understand. A livelock, however, is sneakier. The book defines it like this:
"Two or more threads actively prevent each other from making progress... When the system is livelocked rather than deadlocked, there is some way to schedule the threads so that the system can make progress (but also some way to schedule them so that there is no progress)."
The key insight is that a livelock isn't about threads being stuck waiting, it's about threads working so hard they keep undoing each other's progress. Everything looks busy and alive on the outside. But inside, things move very slowly.
The system processed messages from Kafka in parallel. Multiple threads consumed batches of messages and updated database records with optimistic locking:
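Here is a minimal sketch of roughly what that looked like, assuming a JDBC-backed table with a `version` column; the class, table, and column names below are illustrative, not the actual production code:

```java
// Illustrative per-thread consumer loop: poll a batch, apply each message
// with an optimistic version check, and retry immediately on conflict.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class BatchWorker implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    private final Connection db;

    BatchWorker(KafkaConsumer<String, String> consumer, Connection db) {
        this.consumer = consumer;
        this.db = db;
    }

    @Override
    public void run() {
        while (true) {
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> message : batch) {
                boolean applied = false;
                while (!applied) {                  // retry immediately on a version conflict
                    applied = tryOptimisticUpdate(message);
                }
            }
            consumer.commitSync();
        }
    }

    // Classic optimistic update: read the current version, then write back
    // only if nobody else has bumped it in the meantime.
    private boolean tryOptimisticUpdate(ConsumerRecord<String, String> message) {
        try (PreparedStatement read = db.prepareStatement(
                 "SELECT version FROM accounts WHERE id = ?");
             PreparedStatement write = db.prepareStatement(
                 "UPDATE accounts SET payload = ?, version = version + 1 "
               + "WHERE id = ? AND version = ?")) {
            read.setString(1, message.key());
            ResultSet rs = read.executeQuery();
            if (!rs.next()) return true;            // nothing to update, treat as done
            long version = rs.getLong(1);

            write.setString(1, message.value());
            write.setString(2, message.key());
            write.setLong(3, version);
            return write.executeUpdate() == 1;      // 0 rows => someone beat us, retry
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```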
This approach works fine under low contention. But a design change introduced a problem: the new Kafka partitioning key no longer matched the target database table's primary key.
Suddenly, multiple independent threads were consuming messages that targeted the same database records, causing a lot of contention and triggering repeated retries.
A simplified example illustrates what happened:
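The following toy reproduction captures the dynamic (the names and numbers are assumptions, not the real system): two threads hammer the same "record," each doing a read-modify-write with a version check and retrying immediately whenever the check fails.

```java
// Toy reproduction of the collision pattern, not the production code.
// A single AtomicLong stands in for one hot database row; its value doubles
// as the version that the optimistic check compares against.
import java.util.concurrent.atomic.AtomicLong;

public class LivelockDemo {
    static final AtomicLong record = new AtomicLong(0);
    static final AtomicLong wastedAttempts = new AtomicLong(0);

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 10_000; n++) {
                    while (true) {
                        long seen = record.get();     // "SELECT ... version"
                        Thread.yield();               // stand-in for the per-message work done
                                                      // between read and write in production
                        if (record.compareAndSet(seen, seen + 1)) {
                            break;                    // "UPDATE ... WHERE version = ?" succeeded
                        }
                        wastedAttempts.incrementAndGet(); // conflict: all that work is thrown away
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.printf("useful updates: %d, wasted attempts: %d%n",
                record.get(), wastedAttempts.get());
    }
}
```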
Now multiply this by dozens of threads and many records. The system didn't grind to a complete halt. Some transactions did succeed, but as the load increased, throughput collapsed. Adding more threads only made things worse.
Each thread was "working," but most of that work was wasted. Threads were canceling out each other's progress, exactly as described in the book:
"There is some way to schedule the threads so that the system can make progress (but also some way to schedule them so that there is no progress)."
The fix is counterintuitive: slowing things down enables faster overall progress. Two changes made the most significant difference:
Back-off with Jitter: Instead of retrying immediately, failed transactions waited for a randomized, exponential delay before retrying, giving "winning" threads time to finish cleanly before others piled back in (a sketch follows below).
Align Partitioning with the Database: Kafka consumption was reworked so that all messages related to the same database record were processed by the same thread, eliminating direct contention.
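For the back-off, the change was roughly shaped like this; the constants, class, and method names here are illustrative rather than the production values. Each failed attempt sleeps for a random interval drawn from an exponentially growing cap before trying again, so colliding threads spread out instead of re-colliding in lockstep:

```java
// Exponential back-off with "full jitter": sleep a random amount between 0
// and min(maxDelay, baseDelay * 2^attempt) after each failed attempt.
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class Backoff {
    private static final long BASE_DELAY_MS = 50;
    private static final long MAX_DELAY_MS  = 5_000;

    static <T> T retryWithJitter(Supplier<T> attempt,
                                 Predicate<T> succeeded,
                                 int maxAttempts) throws InterruptedException {
        for (int attemptNo = 0; attemptNo < maxAttempts; attemptNo++) {
            T result = attempt.get();
            if (succeeded.test(result)) return result;

            long cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attemptNo, 6));
            Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

With a helper like this, the earlier conflicted update becomes something along the lines of `Backoff.retryWithJitter(() -> tryOptimisticUpdate(message), applied -> applied, 10)` instead of a tight retry loop. For the partitioning fix, one common way to get the same-record-same-thread guarantee is to key each Kafka message by the record's database primary key on the producer side (for example, `new ProducerRecord<>(topic, dbKey, value)`), so the default partitioner routes every message for a given record to the same partition and therefore the same consumer thread.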
By deliberately reducing concurrency in these hotspots, the endless collision loop stopped.
When I shared this example online, someone replied with a perspective that perfectly sums up the broader lesson:
"It's a valid example. Generally, you have to degrade concurrency to escape the trap. For example, there's an old concept called an 'escalating lock manager' that tries to prevent this. A different approach I've used more recently is to always include both a priority indicator and a retry count on each transaction. These hints allow the transaction manager to automatically degrade concurrency when it detects this scenario: for example, delaying other commits in the presence of a serial offender."
This emphasizes the point: whether implementing back-off, introducing priority hints, or using more advanced transaction management techniques, the key to escaping a livelock is controlled degradation of concurrency. If every thread continues to fight at full speed, the system remains trapped in a cycle of unproductive work.
Before reading The Art of Multiprocessor Programming and engaging online, it wouldn't have been clear to call this a livelock. The retry loop wasn't infinite, and progress was happening, just very slowly.
Now it's clear: a livelock isn't about frozen threads. It's about a system stuck in a cycle of unproductive work, where the only way forward is to slow down and coordinate, rather than throwing more concurrency at the problem.