AraOS: Analyzing the Impact of Virtual Memory Management on Vector Unit Performance
Abstract
Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML and AI), reducing instruction count and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm's SVE/SVE2 and RISC-V's RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly encouraged the development of numerous vector processor designs across industry and academia. However, despite the growing number of open-source RVV processors, there is a lack of published data on their performance in a complex application environment hosted by a full-fledged operating system such as Linux. In this work, we add OS support to the open-source, bare-metal Ara2 vector processor by sharing the MMU of CVA6, the scalar core that dispatches instructions to Ara2; we call the resulting design AraOS and integrate it into the open-source Cheshire SoC platform. We evaluate the performance overhead of virtual-to-physical address translation by benchmarking matrix multiplication kernels across several problem sizes and translation lookaside buffer (TLB) configurations of CVA6's shared MMU, providing insights into vector performance in a full-system environment with virtual memory. With at least 16 TLB entries, the virtual memory overhead remains below 3.5%. Finally, we benchmark a 2-lane AraOS instance with the open-source RiVEC benchmark suite for RVV architectures, achieving average speedups of up to 3.2x over scalar-only execution.