Kernel Lockup Debugging: Detecting Deadlocks and Softlockups in Custom Drivers in Linux

The Linux kernel is often celebrated for its robustness, its scalability across hardware—from the tiniest embedded boards to the largest supercomputers—and its rich ecosystem of drivers supporting an enormous range of devices. But for all its strengths, the kernel is still a living, breathing system of complex concurrency. It is full of threads, interrupts, preemption points, locks, and atomic operations that must coordinate perfectly in order for the system to remain responsive. This delicate balance can be disturbed by something as small as a misused spinlock or a misplaced blocking call in a driver’s interrupt handler. When that balance is broken, the kernel can encounter one of its most feared states: a lockup.

A lockup can take many forms, but two of the most common and problematic are deadlocks and softlockups. These terms might sound similar, but they refer to different failure modes, each with unique detection and debugging challenges. A deadlock is a logical stand-off, where two or more execution contexts are waiting on each other in a circular dependency that will never resolve. A softlockup, on the other hand, occurs when a CPU spends too much time executing in kernel space without yielding, preventing the scheduler from switching to other tasks. While deadlocks can leave parts of the system idle forever without consuming much CPU, softlockups keep the processor spinning in loops, causing high CPU usage and preventing other threads from running. Both can freeze your driver’s functionality and, depending on where they occur, can stall the entire system.

When working on custom drivers, the risk of these lockups increases. Unlike upstream-maintained kernel code that has been stress-tested across millions of machines, a custom driver often targets unique hardware, specialized features, or proprietary communication patterns. The locking mechanisms in such drivers are often designed specifically for the device’s needs. That means that if the locking scheme is flawed—say, by holding a mutex across an I/O operation that can block, or by acquiring locks in inconsistent order—lockups can emerge in subtle and unpredictable ways. The only solution is to learn not just how to fix these problems, but how to reliably detect and understand them in the first place.


Understanding the Core Concepts

Before diving into debugging techniques, it’s important to understand the distinction between the two main types of lockups.

A deadlock happens when threads are waiting on each other in such a way that none of them can proceed. Imagine Thread A holds Lock 1 and needs Lock 2, while Thread B holds Lock 2 and needs Lock 1. Neither can progress, and both are stuck forever. In the kernel, deadlocks can occur between process contexts, between process and interrupt contexts, or even within a single thread that attempts to re-acquire a non-reentrant lock it already holds. Deadlocks are often the result of inconsistent lock acquisition ordering or of taking a blocking lock in a context where it can never be released.

A softlockup, meanwhile, occurs when a CPU doesn’t return control to the scheduler within a defined time limit. The kernel’s watchdog subsystem monitors each CPU to make sure it periodically schedules. If a CPU spins in a loop inside kernel space without yielding, the watchdog will detect this and report a softlockup. Softlockups are common in poorly designed busy-wait loops in drivers, in infinite loops without a cond_resched() call, or when polling hardware registers with no timeout.
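In driver code, the usual fix is to give the scheduler a chance inside any long-running process-context loop. A kernel-style sketch of the pattern (the `mydev` structure and `mydev_process_item()` helper are hypothetical, shown only to illustrate where `cond_resched()` belongs):

```c
#include <linux/sched.h>

/* Long-running work in process context: without cond_resched(),
 * this loop can monopolize the CPU long enough to trip the
 * softlockup watchdog on a large workload. */
static void mydev_process_all(struct mydev *dev)
{
        size_t i;

        for (i = 0; i < dev->nr_items; i++) {
                mydev_process_item(dev, i);  /* hypothetical per-item work */
                cond_resched();              /* yield if the scheduler needs the CPU */
        }
}
```

`cond_resched()` is nearly free when no reschedule is pending, so sprinkling it through long loops costs little and removes an entire class of softlockups.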


Preparing the Kernel for Lockup Detection

Lockup debugging starts long before the bug actually manifests—it starts at kernel configuration. Many of the kernel’s lockup detection facilities are only available if specific configuration options are enabled at compile time. For example:

Kconfig
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_LOCK_STAT=y
CONFIG_DETECT_HUNG_TASK=y

When compiling your development kernel, enabling these will give you more visibility into the kernel’s internal locking state. CONFIG_PROVE_LOCKING enables the lock dependency validator (lockdep), which dynamically tracks lock acquisition order and can warn about potential deadlocks before they happen. CONFIG_LOCK_STAT collects runtime statistics about lock contention, which can be invaluable for identifying high-contention locks in your driver.


Boot Parameters and Runtime Settings

The Linux kernel also offers boot parameters and /proc sysctl tunables that can influence lockup detection:

  • nosoftlockup can disable softlockup detection if you need to run benchmarks without interference, but for debugging you’ll want it enabled.
  • softlockup_panic=1 will cause the system to panic when a softlockup is detected, preserving a crash dump for post-mortem analysis.
  • hung_task_panic=1 will do the same for hung tasks detected via CONFIG_DETECT_HUNG_TASK.
  • /proc/sys/kernel/watchdog_thresh sets the watchdog period in seconds; a softlockup is reported once a CPU fails to schedule for roughly twice this value.
  • /proc/sys/kernel/hung_task_timeout_secs sets how long a task can be blocked before being reported as hung.

For example, to make the kernel panic on a softlockup and tighten the watchdog threshold to 5 seconds:

Bash
echo 1 > /proc/sys/kernel/softlockup_panic
echo 5 > /proc/sys/kernel/watchdog_thresh

In driver debugging scenarios, lowering thresholds can make lockups appear more quickly, which is especially useful in reproducing intermittent problems.


Detecting Softlockups in Practice

Softlockups are generally easier to detect than deadlocks because they involve ongoing CPU activity that can be measured. When the watchdog detects that a CPU has been running in kernel space too long, it logs a message like:

Output
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mydriver:1234]

This message includes the CPU number, the time it has been stuck, and the process context if applicable. Following this, you often see a backtrace of the functions on that CPU’s stack, showing exactly where it was stuck.

If you suspect your driver is causing a softlockup, you can deliberately stress the code path with tools like stress-ng or by running hardware stress patterns. When the softlockup detector fires, capture the kernel log:

Bash
dmesg -w

Then, use tracing tools like ftrace to see the execution path:

Bash
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
# Trigger the suspected operation
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > /tmp/trace.txt

By reading trace.txt, you can see which functions consumed CPU time leading up to the softlockup.


Debugging Deadlocks with Lockdep

Deadlocks can be silent killers—they may not produce high CPU usage or obvious log spam. That’s where lockdep, the kernel’s lock dependency validator, comes in. When enabled (CONFIG_PROVE_LOCKING), lockdep records the order in which locks are acquired across the system. If it detects a circular dependency, it prints a warning with a lock graph showing the problematic sequence.

Lockdep output can be verbose, but it’s a goldmine for debugging. For example, if your driver takes a spinlock in process context without disabling interrupts while its interrupt handler takes the same lock, lockdep will warn about the inversion before it ever deadlocks on real hardware.
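That inversion is worth seeing in code. A kernel-style sketch (the `mydev_` names and the device state touched inside the critical sections are hypothetical) of both the pattern lockdep flags and its fix:

```c
#include <linux/spinlock.h>
#include <linux/interrupt.h>

/* One lock shared between hard-IRQ and process context. If process
 * context took it with plain spin_lock() and the IRQ fired on the
 * same CPU inside the critical section, the handler would spin on a
 * lock its own CPU holds -- a self-deadlock lockdep can predict. */
static DEFINE_SPINLOCK(mydev_lock);

static irqreturn_t mydev_irq(int irq, void *data)
{
        spin_lock(&mydev_lock);         /* IRQ-context user of the lock */
        /* ... update shared device state ... */
        spin_unlock(&mydev_lock);
        return IRQ_HANDLED;
}

static void mydev_update(void)          /* process context */
{
        unsigned long flags;

        /* Correct: disable local interrupts while holding a lock the
         * interrupt handler also takes. Plain spin_lock() here is
         * exactly the bug lockdep would report. */
        spin_lock_irqsave(&mydev_lock, flags);
        /* ... update shared device state ... */
        spin_unlock_irqrestore(&mydev_lock, flags);
}
```

The value of lockdep is that it reports this from the acquisition history alone; the interrupt does not actually have to fire at the fatal moment during testing.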

You can also force the kernel to dump blocked tasks and their wait states at any time:

Bash
echo w > /proc/sysrq-trigger

This prints a list of all tasks, their current state (such as D for uninterruptible sleep), and the stack trace for each. For deadlock debugging, look for tasks stuck in D state inside your driver’s code.


Common Driver Mistakes Leading to Lockups

Custom drivers often encounter lockups due to:

  1. Holding locks too long in interrupt context.
  2. Blocking inside spinlock-protected sections.
  3. Acquiring locks in inconsistent order across code paths.
  4. Polling hardware registers in infinite loops without timeouts.
  5. Using mutexes where spinlocks are required (and vice versa).
  6. Not calling cond_resched() in long-running loops in process context.

A classic example is an SPI driver that polls for a hardware ready bit without a timeout. If the hardware never sets the bit due to an error, the loop becomes infinite and causes a softlockup.


Advanced Lockup Reproduction and Analysis

In embedded or custom hardware environments, lockups can be intermittent and difficult to reproduce. To catch them, you can combine stress tools with tracing:

Bash
stress-ng --io 4 --timeout 60s

Run this while enabling lockstat:

Bash
echo 1 > /proc/sys/kernel/lock_stat

After the test, read the statistics:

Bash
cat /proc/lock_stat

This shows which locks were most contended and how long they took to acquire. If your driver’s locks appear at the top, investigate why they’re hot spots.

For reproducible softlockups, you can temporarily insert deliberate delays in your driver’s hot paths to force the watchdog to trigger, confirming your debugging setup is working.
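A throwaway, debug-build-only sketch of such a trigger (the `mydev_` name is illustrative) simply spins in kernel context past the watchdog window:

```c
#include <linux/jiffies.h>
#include <linux/delay.h>

/* Debug only: busy-wait ~30 seconds in kernel context without ever
 * scheduling, which should trip the softlockup watchdog at default
 * thresholds and prove the detection pipeline works end to end. */
static void mydev_force_softlockup(void)
{
        unsigned long end = jiffies + msecs_to_jiffies(30000);

        while (time_before(jiffies, end))
                cpu_relax();    /* spin; deliberately no cond_resched() */
}
```

Remove the trigger before any real testing; its only purpose is to validate that watchdog reports, stack traces, and crash dumps actually reach you on the target hardware.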


Real-World Debugging Workflow

A typical deadlock/softlockup debugging session in a custom driver might follow this sequence:

  1. Enable kernel debugging options in .config.
  2. Boot with aggressive watchdog thresholds to make stalls show up faster.
  3. Trigger the suspected workload repeatedly until the lockup happens.
  4. Capture kernel logs and stack traces using dmesg and sysrq-w.
  5. Analyze lock ordering with lockdep warnings.
  6. Trace execution paths with ftrace or perf.
  7. Fix locking order or introduce yielding points as needed.
  8. Retest under heavy load to confirm stability.

Conclusion

Kernel lockup debugging, especially in the realm of custom drivers, is a craft that combines deep knowledge of Linux’s locking mechanisms with disciplined investigative techniques. Deadlocks and softlockups, though different in nature, both threaten system stability, and both can emerge from subtle mistakes in driver design. By enabling the right kernel features, using watchdogs and tracing effectively, and adopting careful locking practices, developers can not only detect and fix these issues but also design drivers resilient enough to withstand extreme conditions.

In the end, mastering lockup debugging is not just about preventing system freezes—it is about building robust, predictable, and high-performance kernel modules that behave correctly under all workloads, a skill that defines the most capable Linux driver developers.