Benchmarking the GPU hardware acceleration performance on ARM and RISC-V systems in Linux has become one of the most meaningful ways to understand how modern low-power architectures are reshaping the landscape of graphics, compute, and embedded workloads. Over the past decade, both ARM and RISC-V have undergone an extraordinary transformation, moving from microcontroller-driven beginnings toward increasingly sophisticated SoCs that deliver desktop-class multimedia capabilities and high-efficiency computing at remarkably low power budgets. As Linux continues to evolve with improved kernel subsystems, better device drivers, and more mature user-space GPU stacks, the need to precisely measure, compare, and understand real-world hardware acceleration performance has never been more relevant. The process of benchmarking such systems demands attention not only to raw GPU throughput but also to the underlying architectural differences, driver maturity levels, device tree configurations, kernel parameters, compositing pipelines, available EGL/OpenGL/Vulkan implementations, and the capabilities of toolchains that support each architecture. When these elements come together, they shape a rich environment in which benchmarking becomes both a scientific exercise and a practical guide for developers trying to optimize workloads ranging from machine-learning inference to embedded desktop environments and media playback pipelines.
When beginning the exploration of GPU acceleration on ARM systems, one quickly discovers how deeply the Linux graphics ecosystem has been optimized around hardware produced by vendors like ARM Mali, Qualcomm Adreno, NVIDIA Tegra, and Broadcom VideoCore. The ARM architecture benefits significantly from years of open-source driver contributions, especially in the Mesa ecosystem where drivers such as Panfrost, Freedreno, and the open-source VideoCore V3D/V3DV stacks enable real GPU acceleration without relying on proprietary binaries. These drivers provide consistent access to OpenGL ES, desktop OpenGL, and in some cases Vulkan acceleration. For example, Panfrost allows Linux users to run accelerated desktop environments such as GNOME, KDE Plasma, or Sway on ARM SBCs with surprisingly smooth rendering performance. It also exposes the full shading and compute capabilities needed to perform GPU-bound workloads and benchmark them with real-world tools instead of synthetic or partial implementations. This strong driver foundation means that benchmarking GPU performance on ARM systems tends to yield stable, repeatable results, which is essential when measuring how optimizations at the kernel, driver, or user-space level influence frame rates or compute throughput.
In contrast, RISC-V systems are still at a much earlier stage of GPU maturity. While ARM SoCs ship with well-established GPUs, RISC-V boards are only beginning to support serious hardware acceleration as new SoCs emerge with integrated GPUs or support for external PCIe GPUs. The open ecosystem around RISC-V encourages experimentation, and the Linux kernel has rapidly added support for new hardware blocks, but GPU driver availability remains a major differentiating factor. Some RISC-V SoCs rely on soft-GPU implementations, while others integrate designs like Imagination’s IMG B-series architecture or rely on GPU IP expected to be supported through open-source Mesa drivers in the coming years. Even with these constraints, developers have begun running experimental workloads involving hardware-accelerated interfaces, especially through DRM, KMS, and software fallback drivers that mimic the early years of ARM Linux graphics development. This makes RISC-V benchmarking both challenging and exciting, because measuring performance on evolving drivers provides insight into how quickly the ecosystem is catching up to ARM’s decade-long head start.
Benchmarking GPU acceleration on both architectures requires an understanding of the kernel subsystems involved in graphics rendering. The Direct Rendering Manager (DRM) subsystem in Linux plays a critical role, providing kernel-space memory management, buffer allocation, and synchronization mechanisms between the GPU and the display controller. Benchmarkers often need to interact with DRM directly or indirectly, either to collect timing information, validate buffer allocation performance, or verify that hardware acceleration is being used rather than software rendering. It is not uncommon to discover, during early benchmarking attempts, that misconfigured systems quietly fall back to llvmpipe—a CPU-based OpenGL renderer that significantly skews results and hides the GPU’s true capabilities. For this reason, many developers begin their benchmarking workflow by querying the active renderer through commands like:
glxinfo | grep "OpenGL renderer"
or, on systems using Wayland with EGL-based rendering:
eglinfo | grep -i device
This ensures that the system is indeed using the Mali, Adreno, VideoCore, or other GPU rather than a CPU software rasterizer. Early verification of the correct driver is critical because benchmarking GPU acceleration without confirming the active renderer can produce misleading numbers that may waste hours of optimization effort.
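This check is easy to automate so that a benchmark script aborts before collecting useless numbers. The sketch below matches the renderer string against the common software rasterizer names; the sample string stands in for real output, and the commented-out line shows where the live glxinfo query would go.

```shell
#!/bin/sh
# Guard against silent llvmpipe fallback before benchmarking.
# The renderer string here is a sample; on a real board it would
# come from the commented-out glxinfo query below.
renderer=$(printf 'OpenGL renderer string: Mali-G610 (Panfrost)\n' \
           | grep "OpenGL renderer" | cut -d: -f2-)
# renderer=$(glxinfo -B | grep "OpenGL renderer" | cut -d: -f2-)

case "$renderer" in
  *llvmpipe*|*softpipe*|*swrast*)
    echo "WARNING: software rasterizer active; aborting benchmark" >&2
    exit 1 ;;
  *)
    echo "hardware renderer detected:$renderer" ;;
esac
```

Wiring this into the start of every benchmark run costs nothing and prevents an entire session of misleading measurements.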
On ARM SoCs such as those built on Mali GPUs, GPU benchmarking often focuses on evaluating performance across OpenGL ES, Vulkan where supported, and compute workloads that utilize shader cores for general-purpose tasks. Tools like glmark2-es2 provide an accessible entry point for synthetic benchmarking, allowing developers to measure frame rates across different scenes that stress various GPU pipeline stages. Running it on an ARM board often looks like this:
glmark2-es2 --off-screen
The off-screen flag is particularly helpful when testing systems that lack a display environment or are running benchmarks remotely over SSH. ARM Vulkan performance, when available, can be benchmarked using tools such as vkcube, vkmark, or custom Vulkan applications compiled directly on the board. As Vulkan drivers for ARM GPUs continue to mature, these metrics become increasingly relevant when evaluating workloads involving compute shaders, machine learning operators, or advanced rendering techniques used in embedded graphics engines.
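Because single runs are noisy (thermal state, background daemons), it helps to average several runs. glmark2 prints a final "glmark2 Score:" line that can be collected in a loop; in the sketch below, the sample scores stand in for repeated real runs.

```shell
#!/bin/sh
# Average several glmark2 scores to smooth out run-to-run noise.
# sample_scores stands in for repeated real runs, each extracted with:
#   glmark2-es2 --off-screen | awk '/glmark2 Score/ {print $NF}'
sample_scores="612
598
605"

total=0 runs=0
for s in $sample_scores; do
  total=$((total + s))
  runs=$((runs + 1))
done
avg=$((total / runs))
echo "average score over $runs runs: $avg"
```

Three to five runs with a cool-down pause between them is usually enough to separate genuine regressions from noise.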
On RISC-V, benchmarking workflows may require more creativity depending on the maturity of the driver. Some early RISC-V boards rely on software rendering exclusively, which means performance must be interpreted within that limitation. However, emerging SoCs have begun supporting hardware-accelerated GPUs with Mesa drivers either in active development or in early test stages. In such cases, developers may test experimental versions of Mesa, sometimes compiled from source, using commands like:
meson setup builddir -Dplatforms=x11,wayland -Dgallium-drivers=swrast,kmsro -Dvulkan-drivers=...
ninja -C builddir
Compiling the graphics stack directly on the system not only reveals performance differences due to compiler optimizations targeted at either ARM or RISC-V microarchitectures, but also allows developers to explore how tuning compiler flags affects shader performance. Because RISC-V systems often lack binary driver releases, building drivers manually becomes essential for evaluating GPU acceleration potential and revealing bottlenecks that can be optimized at the source level.
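After installing such a local build, benchmarks must be pointed at the new libraries rather than the distro copies. A hedged sketch, assuming an install prefix under $HOME/mesa/install; the library directory layout and the ICD filename are illustrative and depend on the actual build configuration:

```shell
#!/bin/sh
# Point the GL and Vulkan loaders at a locally built Mesa tree.
# MESA_PREFIX and the ICD filename are assumptions for illustration;
# the real paths depend on the meson -Dprefix and target architecture.
MESA_PREFIX="$HOME/mesa/install"

export LD_LIBRARY_PATH="$MESA_PREFIX/lib:$LD_LIBRARY_PATH"
export LIBGL_DRIVERS_PATH="$MESA_PREFIX/lib/dri"
export VK_ICD_FILENAMES="$MESA_PREFIX/share/vulkan/icd.d/experimental_icd.json"
echo "benchmarking against Mesa from $MESA_PREFIX"
```

Exporting these variables in the benchmark shell keeps the system Mesa untouched, so experimental and stock drivers can be compared back to back.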
A critical part of GPU benchmarking on ARM and RISC-V systems involves analyzing the memory subsystem. GPU memory bandwidth, L2 cache behavior, and shared memory structures between the CPU and GPU have a significant impact on overall performance. Embedded GPUs typically rely on unified memory architectures, meaning the CPU and GPU share RAM. This simplifies design and reduces power consumption but also introduces potential contention under memory-heavy workloads. Benchmarkers frequently use tools like perf and kernel tracepoints to measure memory throughput and latency or to capture the timing of GPU driver events. Running perf alongside GPU workloads can reveal surprising details about how the GPU interacts with system memory:
perf stat -e bus-cycles,cache-misses,cache-references ./benchmark_app
Outputs from such commands help developers identify memory bottlenecks that might not be obvious from frame rate measurements alone. On RISC-V systems, such analysis is especially important because memory subsystems and cache coherency models may differ significantly between SoC vendors. Understanding how these architectural characteristics influence GPU compute performance offers valuable insight when tuning the system for real-world applications like video decoding pipelines, 3D rendering, and ML inference workloads.
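The raw counters become more useful once reduced to a ratio. A small sketch that derives a cache-miss rate from perf stat style output; the here-string mimics perf's comma-grouped formatting and stands in for a real capture:

```shell
#!/bin/sh
# Derive a cache-miss rate from perf stat style output. The sample
# text stands in for a real capture, e.g.:
#   perf stat -e cache-misses,cache-references ./benchmark_app 2>&1
perf_out="     1,204,531      cache-misses
    48,310,990      cache-references"

misses=$(echo "$perf_out" | awk '/cache-misses/     {gsub(",","",$1); print $1}')
refs=$(echo "$perf_out"   | awk '/cache-references/ {gsub(",","",$1); print $1}')
rate=$((misses * 100 / refs))
echo "cache miss rate: ${rate}%"
```

Tracking this ratio across driver or kernel updates makes memory-side regressions visible even when frame rates appear unchanged.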
Another essential component of benchmarking GPU acceleration on ARM and RISC-V systems is the evaluation of display pipelines, particularly when running Wayland or Xorg. While many developers focus exclusively on EGL and OpenGL performance, the underlying compositor has a significant impact on GPU load. ARM systems often perform better on Wayland, which is designed to minimize round-trips and avoid legacy overhead. Compositors like Weston, Sway, or GNOME Shell are frequently used to test rendering smoothness, frame pacing, and latency. Developers sometimes measure frame timing through utilities such as weston-simple-egl, which produces continuous animated output that stresses both the GPU and the display server. Running it with performance tools in parallel reveals how the compositing pipeline interacts with GPU drivers:
weston-simple-egl &
perf top
This combination helps identify whether the GPU driver, compositor, or kernel display subsystem is responsible for any rendering bottlenecks. RISC-V systems, being earlier in their graphical ecosystem development, still experience cases where Wayland acceleration is limited or unavailable. This means that benchmarking often takes place in framebuffer mode or under Xorg with partial acceleration. Even so, studying the behavior of early display drivers under load helps reveal which parts of the stack require further optimization.
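Frame-pacing clients typically print periodic "N frames in M seconds" reports, which can be parsed into an FPS figure for logging. The line format below is an assumption for illustration; the same parsing approach applies to whatever report format the client actually emits:

```shell
#!/bin/sh
# Extract the FPS figure from a periodic frame report. The line format
# is assumed for illustration; adapt the field positions to the actual
# output of the client being measured.
sample_line="301 frames in 5 seconds: 60.200 fps"
fps=$(echo "$sample_line" | awk '{print $(NF-1)}')
frames=$(echo "$sample_line" | awk '{print $1}')
echo "measured $fps fps over $frames frames"
```

Logging these figures alongside perf samples makes it straightforward to correlate FPS dips with specific hot functions in the compositor or driver.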
Beyond synthetic benchmarks, real-world GPU performance evaluation relies heavily on media playback tests. Many ARM SoCs include specialized multimedia engines capable of hardware-accelerated video decoding for formats like H.264, H.265, VP9, and AV1. Benchmarking these subsystems involves using tools like ffmpeg to test whether hardware video acceleration APIs such as VA-API, V4L2 M2M, or vendor-specific interfaces are correctly utilized. Running media decoding pipelines while monitoring performance helps determine how well the GPU and multimedia engines offload work from the CPU:
ffmpeg -hwaccel auto -i input.mp4 -f null -
Developers often monitor CPU usage during such tests to differentiate software decoding from GPU-assisted playback. On RISC-V, hardware video acceleration support is still emerging, and results vary significantly between boards. Measuring decoding speed under software fallback provides a baseline that can later be compared to future hardware-accelerated versions, making such early benchmarks valuable even when GPU acceleration is not fully available.
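ffmpeg's -benchmark flag reports the CPU time a decode consumed, which turns the software-versus-hardware comparison into a number. In the sketch below, the utime values are placeholders for two real runs (one forced to software decode, one with hwaccel):

```shell
#!/bin/sh
# Compare CPU time for software vs. hardware-assisted decode. The
# utime values are sample placeholders; on hardware they come from:
#   ffmpeg -benchmark -hwaccel none -i input.mp4 -f null - 2>&1 | grep bench:
#   ffmpeg -benchmark -hwaccel auto -i input.mp4 -f null - 2>&1 | grep bench:
sw_utime=18.42   # seconds of CPU time, software decode (sample)
hw_utime=2.97    # seconds of CPU time, hwaccel decode (sample)

reduction=$(awk -v sw="$sw_utime" -v hw="$hw_utime" \
            'BEGIN { printf "%.0f", (1 - hw/sw) * 100 }')
echo "hardware decode cut CPU time by ${reduction}%"
```

A large gap in user CPU time is the clearest sign that decoding really ran on the multimedia engine instead of silently falling back to software.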
An often-overlooked aspect of GPU benchmarking is thermal performance. Both ARM and RISC-V SoCs rely heavily on passive cooling, especially in IoT and embedded environments where adding active cooling increases cost and complexity. Thermal throttling can significantly distort benchmarking results, particularly in sustained workloads involving compute shaders or high-resolution rendering. Monitoring temperatures using commands like:
cat /sys/class/thermal/thermal_zone0/temp
helps ensure that results reflect actual GPU performance rather than the effects of clock throttling. ARM boards like the Raspberry Pi, Rockchip RK3399, and various MediaTek platforms exhibit different thermal behavior, depending on their GPU architecture and cooling design. RISC-V boards, still in early development, often exhibit wider thermal variability due to experimental board layouts. Developers must keep these differences in mind while running long-duration benchmarks to avoid misinterpreting temperature-induced slowdowns as driver regressions.
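The sysfs file reports millidegrees Celsius, so a small wrapper is convenient for sampling before and after a run. A minimal sketch, assuming an 80 C warning threshold (real trip points vary per board and are listed under the same thermal zone directory); the 55000 below is a stand-in for a live read:

```shell
#!/bin/sh
# Sample the SoC temperature and flag likely throttling. thermal_zone0
# reports millidegrees Celsius; 55000 is a stand-in for a real read of
# /sys/class/thermal/thermal_zone0/temp. The 80C threshold is an
# assumption; check the board's actual trip points.
read_temp() {
  # cat /sys/class/thermal/thermal_zone0/temp   # real source
  echo 55000                                    # sample value
}

millideg=$(read_temp)
deg=$((millideg / 1000))
echo "SoC temperature: ${deg}C"
if [ "$deg" -ge 80 ]; then
  echo "WARNING: near throttle range; benchmark results may be suppressed"
fi
```

Recording a temperature sample with every benchmark iteration makes it easy to discard runs taken while the SoC was still hot from a previous test.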
Benchmarking GPU acceleration also benefits from understanding the compiler toolchains used for building userspace graphics stacks. ARM systems typically rely on GCC or Clang with well-tuned optimization profiles that generate efficient code for ARMv8 or ARMv7 instruction sets. In contrast, RISC-V systems benefit greatly from compiler improvements, especially with vector extensions. Compiling Mesa or graphics-heavy applications with architecture-specific flags can influence shader compilation speed and runtime performance. For example, building GPU workloads on ARM with flags like:
CFLAGS="-O3 -mcpu=cortex-a72"can yield measurable improvements. On RISC-V systems, flags such as -march=rv64gc or future vector-enabled flags may significantly impact shader performance once GPUs become more tightly integrated with RISC-V vector extensions. Developers benchmarking these systems often compile both Mesa and GPU workloads themselves to ensure consistent testing across architectures.
A full analysis of GPU benchmark results requires deep reflection on architectural differences. ARM, with its advanced NEON SIMD engine, often complements GPU workloads by accelerating CPU-side tasks such as texture preparation or video demuxing. This tight integration between CPU and GPU subsystems can improve overall pipeline performance. RISC-V, with its modular extension-based design, is evolving toward similar capabilities through the vector extension (RVV). Once RISC-V SoCs begin shipping with native RVV support and GPU drivers optimized for it, benchmarkers will be able to measure the interaction between vectorized CPU code and GPU compute tasks in ways similar to ARM’s NEON pipeline. For now, comparing ARM and RISC-V GPU acceleration remains an analysis of maturity versus potential: ARM offers consistent and robust GPU performance today, while RISC-V offers a rapidly developing ecosystem whose performance characteristics evolve alongside the hardware and drivers.
When interpreting benchmark results, developers often collect logs from the GPU kernel drivers, captured through:
dmesg | grep -i gpu
These logs reveal issues such as GPU resets, memory allocation failures, or driver timeouts. On early RISC-V GPU drivers, such logs are especially useful because they help identify which features are stable enough for benchmarking and which require further development. For ARM platforms, logs provide reassurance that the GPU is operating at its expected clock rates and using proper power management states. Developers sometimes adjust kernel command-line parameters to test how GPU performance responds to modified frequency governors or memory settings. Editing parameters in /boot/cmdline.txt or similar files allows experimentation with performance modes that may improve benchmarks.
In many benchmarking workflows, capturing the power consumption of the system becomes essential, especially for embedded or battery-powered use cases. ARM boards often expose power measurement interfaces through sysfs, enabling developers to correlate GPU load with power draw. RISC-V systems, depending on the vendor, may expose similar interfaces or require external measurement equipment. Understanding how power consumption scales with GPU workload helps determine the efficiency of the GPU architecture. A GPU that achieves high frame rates but consumes excessive power may not be desirable for low-power embedded environments, while a more efficient GPU with moderate performance might be more suitable for workloads such as mobile robotics, digital signage, or portable media devices.
Testing compute workloads on ARM and RISC-V GPUs represents another layer of benchmarking beyond traditional graphics rendering. Many modern GPUs include OpenCL or Vulkan compute support. On ARM systems, Mali GPUs have varying degrees of OpenCL support depending on the model and driver. Developers sometimes benchmark GPU compute workloads by compiling OpenCL kernels or Vulkan compute shaders and testing them using tools like clpeak or custom compute benchmarks. Running compute workloads helps evaluate shader compilation time, memory throughput under compute pressure, and the stability of drivers under high compute loads. On RISC-V systems with early GPU support, compute acceleration might not yet be available, making these tests a way to gauge how future hardware acceleration will change the performance landscape.
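Compute benchmark output is usually a table of per-type throughput figures, which scripts can reduce to a single tracked metric. The line format in this sketch is an assumed illustration, not the exact output of any particular tool; the parsing approach carries over to whatever the tool actually prints:

```shell
#!/bin/sh
# Pull a compute throughput figure out of a clpeak-style report line.
# The sample line is an assumed format for illustration; adjust the
# field handling to the tool's real output.
sample="    float   : 231.45 GFLOPS"
gflops=$(echo "$sample" | awk -F: '{print $2}' | awk '{print $1}')
echo "single-precision compute: $gflops GFLOPS"
```

Storing one such figure per run, per driver version, builds the historical record needed to spot compute regressions as drivers evolve.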
Another relevant aspect of benchmarking GPU acceleration is how different desktop environments impact GPU load on ARM and RISC-V systems. Heavy desktop environments like GNOME or KDE Plasma require robust GPU compositing performance to maintain smooth animations, while lightweight environments like XFCE or LXDE impose minimal overhead. Developers often test rendering smoothness by observing frame pacing during window movement, desktop effects, or video playback. Tools like wayland-info or xrandr provide insights into display modes and refresh rates, while monitoring tools like htop or powertop reveal how the GPU load affects CPU utilization. By comparing different environments on ARM and RISC-V systems, developers can determine the optimal combination of hardware and software for specific workloads.
Kernel configuration plays a substantial role in determining GPU performance. On embedded Linux systems, GPUs rely on kernel modules that must be enabled to expose DRM capabilities, memory management units, and hardware synchronization primitives. Developers building custom kernels for ARM or RISC-V platforms often enable specific options that affect GPU driver stability. Commands like:
zcat /proc/config.gz | grep DRM
help confirm which DRM components are active. Ensuring that CMA (Contiguous Memory Allocator) is configured correctly is essential for GPU drivers that require large contiguous memory buffers. This is particularly important for high-resolution displays, video decoding workloads, and compute shaders that operate on large datasets. RISC-V kernels may require additional experimental options to support early GPU drivers, making kernel configuration a significant part of the benchmarking process.
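This verification can be scripted as a preflight check. The sketch below tests a minimal illustrative subset of options against a sample config; on a real system the here-string would be replaced by the output of zcat /proc/config.gz:

```shell
#!/bin/sh
# Check that the kernel options GPU drivers depend on are enabled.
# The here-string stands in for `zcat /proc/config.gz` on a real
# system; the option list is a minimal illustrative subset.
config="CONFIG_DRM=y
CONFIG_DRM_PANFROST=m
CONFIG_CMA=y
CONFIG_DMA_CMA=y"

missing=0
for opt in CONFIG_DRM CONFIG_CMA CONFIG_DMA_CMA; do
  if ! echo "$config" | grep -q "^${opt}=[ym]"; then
    echo "missing: $opt"
    missing=$((missing + 1))
  fi
done
echo "$missing required options missing"
```

Anchoring the pattern with `^` and matching `=[ym]` avoids false positives from similarly named options such as CONFIG_DRM_PANFROST.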
One of the most interesting areas of benchmarking GPU acceleration on ARM and RISC-V is the measurement of shader compilation performance. GPUs rely heavily on runtime compilation for both graphics and compute shaders, especially in Mesa-based open-source drivers. Measuring shader compilation time helps developers understand how well the GPU driver optimizes workloads for the underlying architecture. Some testers instrument shader compilation using environment variables such as:
export MESA_SHADER_CACHE_DISABLE=false
export MESA_SHADER_CACHE_DIR=/path/to/cache
By clearing the shader cache and timing applications, developers analyze the cold-start performance of graphical workloads. ARM systems, with more established shader compilers in Mesa, generally show consistent compilation times, while RISC-V systems may experience longer compilation phases as drivers are still optimized. These results become excellent indicators of driver readiness and overall GPU usability.
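A minimal cold-start timing harness can point the cache at a freshly created directory so every measured run compiles shaders from scratch. In this sketch the sleep is a placeholder for the real graphical workload:

```shell
#!/bin/sh
# Time a cold shader-compile start against a fresh cache directory.
# The sleep stands in for a real workload such as
# `glmark2-es2 --off-screen`.
CACHE_DIR=$(mktemp -d)
export MESA_SHADER_CACHE_DIR="$CACHE_DIR"

start=$(date +%s)
sleep 1          # placeholder for the graphical workload
end=$(date +%s)
elapsed=$((end - start))
echo "cold-start run took ${elapsed}s"
rm -rf "$CACHE_DIR"
```

Running the same workload a second time against the now-warm cache, and comparing the two timings, isolates shader compilation cost from steady-state rendering cost.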
Because GPU benchmarking is often used to guide system optimization, developers frequently tweak CPU governors, GPU frequency scaling parameters, and thermal trip points to determine how performance changes under different conditions. Adjusting CPU frequency governors using:
cpupower frequency-set -g performance
can significantly improve GPU-bound workloads that rely on fast CPU-side preparation. Similarly, adjusting GPU frequencies on supported ARM systems using sysfs interfaces reveals how the GPU behaves when operating at maximum versus scaled frequencies. For RISC-V, such controls may not yet be available on early SoCs, but they will become important as GPU support matures.
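For one-off tests it is worth pinning the governor and restoring it afterwards, so a benchmark session does not leave the board in performance mode. A sketch using the sysfs interface directly (in case cpupower is not packaged); run_benchmarks.sh is a hypothetical benchmark driver:

```shell
#!/bin/sh
# Pin the CPU governor for a benchmark run and restore it afterwards.
# run_benchmarks.sh is a hypothetical benchmark driver; the script
# degrades gracefully where cpufreq is absent or not writable.
GOV_FILE=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

if [ -w "$GOV_FILE" ]; then
  old=$(cat "$GOV_FILE")
  echo performance > "$GOV_FILE"
  # ./run_benchmarks.sh
  echo "$old" > "$GOV_FILE"      # restore the previous governor
else
  echo "no writable cpufreq interface; skipping governor pinning"
fi
```

On multi-cluster big.LITTLE boards each policy directory under /sys/devices/system/cpu/cpufreq/ would need the same treatment, not just cpu0.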
Another important aspect involves benchmarking Vulkan performance when available. Vulkan is particularly relevant because it provides a lower-level API with reduced driver overhead, often showing performance differences between architectures more starkly than OpenGL ES. Tools like vkmark allow developers to test various Vulkan rendering pipelines, measure synchronization overhead, and analyze draw call throughput. ARM systems with Vulkan-capable GPUs typically perform well in these scenarios, demonstrating the benefits of mature driver implementations. On RISC-V systems with experimental Vulkan drivers, early tests provide valuable feedback to upstream Mesa developers working to stabilize these features.
Over time, benchmark results collected on ARM and RISC-V systems reveal trends that reflect not only hardware capabilities but also the health of the Linux graphics ecosystem supporting them. ARM boards tend to deliver predictable GPU acceleration, with performance scaling upward as newer SoCs introduce more powerful Mali or Adreno GPUs. RISC-V boards show a curve of rapid improvement, with early GPUs functioning as proofs of concept and newer SoCs gradually enabling full hardware acceleration. This comparison highlights a divergence between ARM’s mature ecosystem and RISC-V’s fast-moving development environment, making benchmarking essential for anyone working with cutting-edge open hardware.
The importance of benchmarking GPU acceleration on these architectures goes beyond raw performance numbers. It informs developers about driver stability, identifies regressions across kernel or Mesa updates, reveals thermal performance characteristics, and guides optimizations for applications ranging from embedded GUIs to machine-learning inference engines. Detailed performance measurements also help hardware vendors refine their future SoC designs, ensuring that the next generation of ARM and RISC-V processors deliver even better GPU capabilities.
As the Linux ecosystem continues to evolve, GPU benchmarking on ARM and RISC-V systems will remain an indispensable part of understanding how open-source graphics, low-power computing, and emerging hardware architectures converge. ARM systems today provide a polished, reliable experience with strong driver support, while RISC-V systems offer an exciting glimpse into the future of open hardware acceleration. The benchmarking process bridges the gap between these two worlds, offering a comprehensive view of their capabilities and guiding developers toward optimal system configurations for their applications.
In conclusion, conducting GPU hardware acceleration benchmarking on ARM and RISC-V Linux systems is not merely a technical exercise—it is a deep exploration of architectural philosophy, driver maturity, kernel integration, and the open-source graphics stack itself. Through the careful measurement of rendering throughput, compute shader performance, thermal stability, memory behavior, and real-world workloads, developers can gain a holistic understanding of how these architectures perform today and how they are likely to evolve in the coming years. Such insights will continue to shape the future of embedded systems, mobile computing, open hardware innovation, and the broader Linux graphics ecosystem.
