The Linux kernel is often celebrated for its robustness, its scalability across hardware—from the tiniest embedded boards to the largest supercomputers—and its rich ecosystem of drivers supporting an enormous range of devices. But for all its strengths, the kernel is still a living, breathing system of complex concurrency. It is full of threads, interrupts, preemption points, locks, and atomic operations that must coordinate perfectly in order for the system to remain responsive. This delicate balance can be disturbed by something as small as a misused spinlock or a misplaced blocking call in a driver’s interrupt handler. When that balance is broken, the kernel can encounter one of its most feared states: a lockup.
A lockup can take many forms, but two of the most common and problematic are deadlocks and softlockups. These terms might sound similar, but they refer to different failure modes, each with unique detection and debugging challenges. A deadlock is a logical stand-off, where two or more execution contexts are waiting on each other in a circular dependency that will never resolve. A softlockup, on the other hand, occurs when a CPU spends too much time executing in kernel space without yielding, preventing the scheduler from switching to other tasks. While deadlocks can leave parts of the system idle forever without consuming much CPU, softlockups keep the processor spinning in loops, causing high CPU usage and preventing other threads from running. Both can freeze your driver’s functionality and, depending on where they occur, can stall the entire system.
When working on custom drivers, the risk of these lockups increases. Unlike upstream-maintained kernel code that has been stress-tested across millions of machines, a custom driver often targets unique hardware, specialized features, or proprietary communication patterns. The locking mechanisms in such drivers are often designed specifically for the device’s needs. That means that if the locking scheme is flawed—say, by holding a mutex across an I/O operation that can block, or by acquiring locks in inconsistent order—lockups can emerge in subtle and unpredictable ways. The only solution is to learn not just how to fix these problems, but how to reliably detect and understand them in the first place.
Understanding the Core Concepts
Before diving into debugging techniques, it’s important to understand the distinction between the two main types of lockups.
A deadlock happens when threads are waiting on each other in a way that no one can proceed. Imagine Thread A holds Lock 1 and needs Lock 2, while Thread B holds Lock 2 and needs Lock 1. Neither can progress, and both are stuck forever. In the kernel, deadlocks can occur between process contexts, between process and interrupt contexts, or even within the same thread if lock reentrancy is not allowed but is attempted. Deadlocks are often the result of inconsistent lock acquisition ordering or trying to take a blocking lock in a context where it cannot be released.
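The ABBA scenario above maps directly onto driver code. Below is a minimal sketch of the pattern, using two hypothetical driver-private mutexes, dev_lock and queue_lock:

```c
/* Two hypothetical driver-private locks, for illustration only. */
static DEFINE_MUTEX(dev_lock);
static DEFINE_MUTEX(queue_lock);

/* Path 1: acquires dev_lock, then queue_lock (A -> B). */
static void submit_request(void)
{
	mutex_lock(&dev_lock);
	mutex_lock(&queue_lock);
	/* ... queue the request ... */
	mutex_unlock(&queue_lock);
	mutex_unlock(&dev_lock);
}

/*
 * Path 2 (BUGGY): acquires the same locks in the opposite order
 * (B -> A). If it runs concurrently with submit_request(), each
 * thread can end up holding one lock while waiting forever for
 * the other -- the classic ABBA deadlock.
 */
static void flush_queue(void)
{
	mutex_lock(&queue_lock);
	mutex_lock(&dev_lock);
	/* ... drain the queue ... */
	mutex_unlock(&dev_lock);
	mutex_unlock(&queue_lock);
}
```

The fix is to choose one global ordering (say, dev_lock before queue_lock) and enforce it on every path; with CONFIG_PROVE_LOCKING enabled, lockdep reports exactly this inversion once both paths have run.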
A softlockup, meanwhile, occurs when a CPU doesn’t return control to the scheduler within a defined time limit. The kernel’s watchdog subsystem monitors each CPU to make sure it periodically schedules. If a CPU spins in a loop inside kernel space without yielding, the watchdog will detect this and report a softlockup. Softlockups are common in poorly designed busy-wait loops in drivers, in infinite loops without a cond_resched() call, or when polling hardware registers with no timeout.
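In driver code, the defensive pattern is a poll loop that both yields and gives up after a deadline. The sketch below assumes a hypothetical register accessor my_read_status() and ready flag STATUS_READY:

```c
/*
 * Poll for a hardware ready bit from process context without
 * starving the scheduler. Returns 0 on success, -ETIMEDOUT if
 * the hardware never becomes ready.
 */
static int wait_for_ready(void)
{
	unsigned long deadline = jiffies + msecs_to_jiffies(500);

	while (!(my_read_status() & STATUS_READY)) {
		if (time_after(jiffies, deadline))
			return -ETIMEDOUT;  /* bounded: no infinite spin */
		cond_resched();             /* let the scheduler run other tasks */
	}
	return 0;
}
```

Each safeguard addresses a different failure: the deadline prevents an infinite loop when the hardware misbehaves, and cond_resched() lets the CPU schedule, which keeps the softlockup watchdog satisfied while the loop runs.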
Preparing the Kernel for Lockup Detection
Lockup debugging starts long before the bug actually manifests—it starts at kernel configuration. Many of the kernel’s lockup detection facilities are only available if specific configuration options are enabled at compile time. For example:
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_LOCK_STAT=y
CONFIG_DETECT_HUNG_TASK=y
When compiling your development kernel, enabling these will give you more visibility into the kernel’s internal locking state. CONFIG_PROVE_LOCKING enables the lock dependency validator (lockdep), which dynamically tracks lock acquisition order and can warn about potential deadlocks before they happen. CONFIG_LOCK_STAT collects runtime statistics about lock contention, which can be invaluable for identifying high-contention locks in your driver.
Boot Parameters and Runtime Settings
The Linux kernel also offers boot parameters and /proc sysctl tunables that can influence lockup detection:
- nosoftlockup disables softlockup detection if you need to run benchmarks without interference, but for debugging you’ll want it enabled.
- softlockup_panic=1 will cause the system to panic when a softlockup is detected, preserving a crash dump for post-mortem analysis.
- hung_task_panic=1 will do the same for hung tasks detected via CONFIG_DETECT_HUNG_TASK.
- /proc/sys/kernel/watchdog_thresh controls the watchdog threshold in seconds; a CPU that runs without scheduling for twice this value is reported as soft-locked.
- /proc/sys/kernel/hung_task_timeout_secs sets how long a task can be blocked before being reported as hung.
For example, to make the kernel panic on a softlockup and lower the watchdog threshold to 5 seconds:
echo 1 > /proc/sys/kernel/softlockup_panic
echo 5 > /proc/sys/kernel/watchdog_thresh
In driver debugging scenarios, lowering thresholds can make lockups appear more quickly, which is especially useful in reproducing intermittent problems.
Detecting Softlockups in Practice
Softlockups are generally easier to detect than deadlocks because they involve ongoing CPU activity that can be measured. When the watchdog detects that a CPU has been running in kernel space too long, it logs a message like:
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mydriver:1234]
This message includes the CPU number, the time it has been stuck, and the process context if applicable. Following this, you often see a backtrace of the functions on that CPU’s stack, showing exactly where it was stuck.
If you suspect your driver is causing a softlockup, you can deliberately stress the code path with tools like stress-ng or by running hardware stress patterns. When the softlockup detector triggers, capture the kernel log as it streams:
dmesg -w
Then, use tracing tools like ftrace to see the execution path:
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
# Trigger the suspected operation
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > /tmp/trace.txt
By reading trace.txt, you can see which functions consumed CPU time leading up to the softlockup.
Debugging Deadlocks with Lockdep
Deadlocks can be silent killers—they may not produce high CPU usage or obvious log spam. That’s where lockdep, the kernel’s lock dependency validator, comes in. When enabled (CONFIG_PROVE_LOCKING), lockdep records the order in which locks are acquired across the system. If it detects a circular dependency, it prints a warning with a lock graph showing the problematic sequence.
Lockdep output can be verbose, but it’s a goldmine for debugging. For example, if your driver takes a lock in process context with interrupts enabled and also takes the same lock from its interrupt handler, lockdep will warn that an interrupt arriving while the lock is held can deadlock that CPU.
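The safe version of that scenario uses a spinlock with local interrupts disabled in process context, so the handler can never interrupt a critical section on the same CPU. A sketch, with hypothetical names:

```c
static DEFINE_SPINLOCK(state_lock);	/* protects shared device state */

static irqreturn_t my_irq_handler(int irq, void *data)
{
	/* Hard-IRQ context: local interrupts are already disabled here. */
	spin_lock(&state_lock);
	/* ... update shared state ... */
	spin_unlock(&state_lock);
	return IRQ_HANDLED;
}

static void update_from_process_context(void)
{
	unsigned long flags;

	/*
	 * Disable local interrupts while holding the lock; otherwise
	 * my_irq_handler() could fire on this CPU and spin forever on
	 * a lock this CPU already holds -- a self-deadlock.
	 */
	spin_lock_irqsave(&state_lock, flags);
	/* ... update shared state ... */
	spin_unlock_irqrestore(&state_lock, flags);
}
```

A mutex is never an option in the handler itself: mutex_lock() can sleep, and sleeping is forbidden in interrupt context.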
You can also force the kernel to dump blocked tasks and their wait states at any time:
echo w > /proc/sysrq-trigger
This dumps the tasks currently blocked in uninterruptible sleep (D state), along with a stack trace for each (echo t dumps all tasks regardless of state). For deadlock debugging, look for tasks stuck in D state inside your driver’s code.
Common Driver Mistakes Leading to Lockups
Custom drivers often encounter lockups due to:
- Holding locks too long in interrupt context.
- Blocking inside spinlock-protected sections.
- Acquiring locks in inconsistent order across code paths.
- Polling hardware registers in infinite loops without timeouts.
- Using mutexes where spinlocks are required (and vice versa).
- Not calling cond_resched() in long-running loops in process context.
A classic example is an SPI driver that polls for a hardware ready bit without a timeout. If the hardware never sets the bit due to an error, the loop becomes infinite and causes a softlockup.
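A bounded version of that loop might look like the following sketch, where spi_status_read() and SPI_READY stand in for the driver's real register accessor and status bit:

```c
/* Wait for the SPI ready bit, but never longer than 100 ms. */
static int spi_wait_ready(void)
{
	unsigned long deadline = jiffies + msecs_to_jiffies(100);

	while (!(spi_status_read() & SPI_READY)) {
		if (time_after(jiffies, deadline)) {
			pr_warn("spi: ready bit never set, aborting\n");
			return -ETIMEDOUT;
		}
		usleep_range(10, 50);	/* sleep briefly instead of spinning */
	}
	return 0;
}
```

For memory-mapped status registers, the helpers in linux/iopoll.h (for example readx_poll_timeout()) implement this poll-with-timeout pattern so drivers don’t have to open-code it.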
Advanced Lockup Reproduction and Analysis
In embedded or custom hardware environments, lockups can be intermittent and difficult to reproduce. To catch them, you can combine stress tools with tracing:
stress-ng --io 4 --timeout 60s
Run this while lock statistics collection is enabled:
echo 1 > /proc/sys/kernel/lock_stat
cat /proc/lock_stat
This shows which locks were most contended and how long they took to acquire. If your driver’s locks appear at the top, investigate why they’re hot spots.
For reproducible softlockups, you can temporarily insert deliberate delays in your driver’s hot paths to force the watchdog to trigger, confirming your debugging setup is working.
Real-World Debugging Workflow
A typical deadlock/softlockup debugging session in a custom driver might follow this sequence:
- Enable kernel debugging options in .config.
- Boot with aggressive watchdog thresholds to make stalls show up faster.
- Trigger the suspected workload repeatedly until the lockup happens.
- Capture kernel logs and stack traces using dmesg and sysrq-w.
- Analyze lock ordering with lockdep warnings.
- Trace execution paths with ftrace or perf.
- Fix locking order or introduce yielding points as needed.
- Retest under heavy load to confirm stability.
Conclusion
Kernel lockup debugging, especially in the realm of custom drivers, is a craft that combines deep knowledge of Linux’s locking mechanisms with disciplined investigative techniques. Deadlocks and softlockups, though different in nature, both threaten system stability, and both can emerge from subtle mistakes in driver design. By enabling the right kernel features, using watchdogs and tracing effectively, and adopting careful locking practices, developers can not only detect and fix these issues but also design drivers resilient enough to withstand extreme conditions.
In the end, mastering lockup debugging is not just about preventing system freezes—it is about building robust, predictable, and high-performance kernel modules that behave correctly under all workloads, a skill that defines the most capable Linux driver developers.
