I was catching up on the Mutual Exclusion chapter from The Art of Multiprocessor Programming, and while reading through the discussion thread, it became clear to me that there weren't many practical, real-world examples of livelocks being shared.
This reminded me of a messy situation in production: a flawed implementation of optimistic locking combined with multiple threads consuming Kafka batches. It became a perfect scenario for threads actively preventing each other from making progress, while still appearing to "do work."
At the time, I knew the system was busy retrying because of contention, but I didn't realize that this was an example of a livelock.
Most engineers are familiar with deadlocks: threads get stuck waiting on each other, and nothing moves forward. Easy to detect, easy to understand. A livelock, however, is sneakier. The book defines it like this:
"Two or more threads actively prevent each other from making progress by taking steps that subvert steps taken by other threads... When the system is livelocked rather than deadlocked, there is some way to schedule the threads so that the system can make progress (but also some way to schedule them so that there is no progress)."
The key insight is that a livelock isn't about threads being stuck waiting; it's about threads working so hard that they keep undoing each other's progress. Everything looks busy and alive on the outside. But inside, things move very slowly.
The system processed messages from Kafka in parallel. Multiple threads consumed batches of messages and updated database records with optimistic locking.
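Roughly, each worker looked like the sketch below. I've made up the class, table, and column names here, so treat it as the shape of the pattern rather than the real code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import javax.sql.DataSource;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative worker: poll a batch, apply each message to its database row,
// and retry whenever the optimistic version check fails.
public class BatchWorker implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    private final DataSource db;

    public BatchWorker(KafkaConsumer<String, String> consumer, DataSource db) {
        this.consumer = consumer;
        this.db = db;
    }

    @Override
    public void run() {
        while (true) {
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> msg : batch) {
                // On a version conflict, retry immediately until the update wins.
                while (!tryUpdate(msg.key(), msg.value())) { /* retry */ }
            }
            consumer.commitSync();
        }
    }

    // Optimistic locking: the UPDATE only succeeds if the row's version is still
    // the one we read; zero affected rows means another thread got there first.
    private boolean tryUpdate(String recordId, String payload) {
        try (Connection conn = db.getConnection()) {
            long version;
            try (PreparedStatement read = conn.prepareStatement(
                    "SELECT version FROM records WHERE id = ?")) {
                read.setString(1, recordId);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) return true; // nothing to update
                    version = rs.getLong(1);
                }
            }
            try (PreparedStatement write = conn.prepareStatement(
                    "UPDATE records SET payload = ?, version = version + 1 "
                  + "WHERE id = ? AND version = ?")) {
                write.setString(1, payload);
                write.setString(2, recordId);
                write.setLong(3, version);
                return write.executeUpdate() == 1; // 0 rows => version conflict
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}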
This approach works fine under low contention, but a design change introduced a problem: new Kafka partitions were keyed differently from the target database table's primary key.
Suddenly, multiple independent threads were consuming messages that targeted the same database records, causing a lot of contention and triggering repeated retries.
Here's a simplified example of what happened.
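It's a stand-in rather than the production code: an AtomicLong plays the role of the row's version column, compareAndSet plays the role of the conditional UPDATE, and the thread count and per-update work are invented for illustration.

import java.util.concurrent.atomic.AtomicLong;

// Self-contained simulation of the conflict storm: many threads doing
// read-modify-write on the same "record", retrying immediately on conflict.
public class ContentionDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong version = new AtomicLong();        // stands in for the row's version column
        AtomicLong wastedAttempts = new AtomicLong(); // conflicts that produced no progress
        int threads = 16, updatesPerThread = 10_000;

        Runnable worker = () -> {
            for (int i = 0; i < updatesPerThread; i++) {
                while (true) {
                    long seen = version.get();          // "read the row"
                    simulateWork();                     // compute the new state
                    if (version.compareAndSet(seen, seen + 1)) break; // "UPDATE ... WHERE version = ?"
                    wastedAttempts.incrementAndGet();   // lost the race; retry immediately
                }
            }
        };

        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) (pool[i] = new Thread(worker)).start();
        for (Thread t : pool) t.join();

        System.out.printf("successful updates: %d, wasted attempts: %d%n",
                version.get(), wastedAttempts.get());
    }

    private static void simulateWork() {
        // Busy-wait a few microseconds so reads and writes overlap, like real processing.
        long end = System.nanoTime() + 10_000;
        while (System.nanoTime() < end) { /* spin */ }
    }
}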
Now multiply this by dozens of threads and many records, and you get a storm of constant retries. The system never fully stopped, but as the load increased, throughput tanked, and adding more threads only made the conflicts worse.
Each thread was "working," but most of that work was wasted. Threads were canceling out each other's progress, precisely as described in the book.
The fix is counterintuitive: slowing things down enables faster overall progress. Two changes made the most significant difference:
Back-off with Jitter: Instead of retrying immediately, failed transactions waited for a randomized, exponential delay before retrying, giving "winning" threads time to finish cleanly before others piled back in (a sketch follows below).
Align Partitioning with the Database: Kafka consumption was reworked so that all messages related to the same database record were processed by the same thread, eliminating direct contention.
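Here's a minimal sketch of the back-off change, assuming an attempt callback that wraps the optimistic update and returns true once it commits; the base delay, cap, and retry limit are made up for illustration, not the real production numbers:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Exponential back-off with full jitter around an optimistic-update attempt.
public final class Backoff {
    public static boolean retry(BooleanSupplier attempt) throws InterruptedException {
        final long baseMillis = 10, capMillis = 2_000;
        for (int tries = 0; tries < 10; tries++) {
            if (attempt.getAsBoolean()) return true;        // update committed cleanly
            long ceiling = Math.min(capMillis, baseMillis << tries);
            // Full jitter: sleep a random amount in [0, ceiling] so the "losers"
            // don't wake up in lockstep and collide with each other again.
            Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
        }
        return false; // give up; let the batch fail and be redelivered
    }
}

For the second change, the simplest way I know to get that alignment, assuming you control the producer, is to use the target record's primary key as the Kafka message key, so the default partitioner sends every update for a given record to the same partition and therefore the same consumer thread.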
By deliberately reducing concurrency in these hotspots, the endless collision loop stopped.
When I shared this example online, someone replied with a perspective that perfectly sums up the broader lesson:
"It's a valid example. Generally, you have to degrade concurrency to escape the trap. For example, there's an old concept called an 'escalating lock manager' that tries to prevent this. A different approach I've used more recently is to always include both a priority indicator and a retry count on each transaction. These hints allow the transaction manager to automatically degrade concurrency when it detects this scenario, for example by delaying other commits in the presence of a serial offender."
Whether it's back-off, priority hints, or more advanced transaction management techniques, the key to escaping a livelock is controlled degradation of concurrency. If every thread continues to fight at full speed, the system remains trapped in a cycle of unproductive work.
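As a rough sketch of that escalation idea (my own variant, not the commenter's actual transaction manager): track how many times a transaction has lost the optimistic race, and once it crosses a threshold, push it through a single-permit semaphore so it runs with effectively no competition.

import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;

// Hypothetical escalation policy: repeat losers get serialized so they are
// all but guaranteed to make progress, at the cost of temporarily reduced concurrency.
public final class EscalatingRetry {
    private static final int SERIALIZE_AFTER = 3;   // illustrative threshold
    private static final int MAX_ATTEMPTS = 10;     // illustrative cap
    private final Semaphore serialLane = new Semaphore(1);

    public boolean run(BooleanSupplier attempt) throws InterruptedException {
        for (int retries = 0; retries < MAX_ATTEMPTS; retries++) {
            boolean escalate = retries >= SERIALIZE_AFTER;
            if (escalate) serialLane.acquire();     // degrade concurrency for repeat offenders
            try {
                if (attempt.getAsBoolean()) return true;  // committed
            } finally {
                if (escalate) serialLane.release();
            }
        }
        return false; // hand the work off to whatever handles stuck batches
    }
}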
Before reading The Art of Multiprocessor Programming and engaging online, I wouldn't have thought to call this a livelock. The retry loop wasn't infinite, and progress was happening, just very slowly. Now it's clear to me that this example fits the livelock description, and that the only way to escape the trap is to degrade concurrency rather than throw more concurrency at the problem.