New UAV technologies and the NewSpace era are transforming Earth Observation missions and data acquisition. Numerous small platforms generate large data volumes, straining downlink bandwidth and requiring onboard decision-making to transmit high-quality information in time. Machine Learning enables real-time autonomous processing, and FPGAs balance performance with adaptability to mission-specific requirements, making them well suited for onboard deployment. This review systematically analyzes 66 experiments deploying ML models on FPGAs for Remote Sensing applications. We introduce two distinct taxonomies to capture both efficient model architectures and FPGA implementation strategies. For transparency and reproducibility, we follow PRISMA 2020 guidelines and share all data and code at https://github.com/CedricLeon/Survey_RS-ML-FPGA.
Simulation-based design space exploration (DSE) aims to efficiently optimize high-dimensional structured designs under complex constraints and expensive evaluation costs. Existing approaches, including heuristic and multi-step reinforcement learning (RL) methods, struggle to balance sampling efficiency and constraint satisfaction due to sparse, delayed feedback and large hybrid action spaces. In this paper, we introduce CORE, a constraint-aware, one-step RL method for simulation-guided DSE. In CORE, the policy agent learns to sample design configurations by defining a structured distribution over them, incorporating dependencies via a scaling-graph-based decoder, and shaping rewards to penalize invalid designs based on feedback obtained from simulation. CORE updates the policy using a surrogate objective that compares the rewards of designs within a sampled batch, without learning a value function. This critic-free formulation enables efficient learning by encouraging the selection of higher-reward designs. We instantiate CORE for hardware-mapping co-design of neural network accelerators, demonstrating that it significantly improves sample efficiency and achieves better accelerator configurations compared to state-of-the-art baselines. Our approach is general and applicable to a broad class of discrete-continuous constrained design problems.
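To make the critic-free, one-step update concrete, here is a minimal Python sketch. The design space, reward function, and hyperparameters are invented for illustration; CORE's actual policy is a structured distribution with a scaling-graph-based decoder over hybrid discrete-continuous choices, for which a flat categorical stands in here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy design space: pick one of K discrete configurations.
K = 8
logits = np.zeros(K)

def reward(k):
    # Invented objective with a validity constraint: designs with index
    # above 5 violate a budget and receive a shaped penalty.
    return -1.0 if k > 5 else 1.0 - 0.2 * abs(k - 3)

for _ in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    batch = rng.choice(K, size=16, p=probs)          # one-step episodes
    r = np.array([reward(k) for k in batch])
    adv = r - r.mean()   # critic-free: compare within the sampled batch
    grad = np.zeros(K)
    for k, a in zip(batch, adv):
        grad += a * (np.eye(K)[k] - probs)  # grad of log pi(k) w.r.t. logits
    logits += 0.1 * grad / len(batch)

print("most likely design:", int(np.argmax(logits)))  # converges to 3
```

Because each episode is a single sampling step, the batch mean serves as the baseline that a learned value function would otherwise provide.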
Computer System Architecture serves as a crucial bridge between software applications and the underlying hardware, encompassing components like compilers, CPUs, coprocessors, and RTL designs. Its development, from early mainframes to modern domain-specific architectures, has been driven by rising computational demands and advancements in semiconductor technology. However, traditional paradigms in computer system architecture design are confronting significant challenges, including a reliance on manual expertise, fragmented optimization across software and hardware layers, and high costs associated with exploring expansive design spaces. While automated methods leveraging optimization algorithms and machine learning have improved efficiency, they remain constrained by a single-stage focus, limited data availability, and a lack of comprehensive human domain knowledge. The emergence of large language models (LLMs) offers transformative opportunities for the design of computer system architecture. By leveraging the capabilities of LLMs in areas such as code generation, data analysis, and performance modeling, the traditional manual design process can be transitioned to a machine-based automated design approach. To harness this potential, we present the Large Processor Chip Model (LPCM), an LLM-driven framework aimed at achieving end-to-end automated computer architecture design. The LPCM is structured into three levels: Human-Centric, Agent-Orchestrated, and Model-Governed. This paper utilizes 3D Gaussian Splatting as a representative workload and employs the concept of software-hardware collaborative design to examine the implementation of the LPCM at Level 1, demonstrating the effectiveness of the proposed approach. Furthermore, this paper provides an in-depth discussion of the pathway to implementing Level 2 and Level 3 of the LPCM, along with an analysis of the existing challenges.
Deploying large language models (LLMs) on edge devices is crucial for delivering fast responses and ensuring data privacy. However, the limited storage, weight, and power budgets of edge devices make it difficult to deploy LLM-powered applications. These devices must balance latency requirements against energy consumption and model accuracy. In this paper, we first quantify the challenges of deploying LLMs on off-the-shelf edge devices and then present CLONE, an in-depth algorithm-hardware co-design at both the model and system level that intelligently integrates real-time energy optimization while maintaining robust generality. To maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 28nm scalable hardware accelerator system. We implement and extensively evaluate CLONE on two off-the-shelf edge platforms. Experiments show that CLONE accelerates inference by up to 11.92x and saves up to 7.36x energy, while maintaining high generation quality.
Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, improves the efficiency of large language models by projecting query, key, and value tensors into a compact latent space. This architectural change reduces the KV-cache size and significantly lowers memory bandwidth demands, particularly in the autoregressive decode phase. This letter presents the first hardware-centric analysis of MLA, comparing it to conventional Multi-Head Attention (MHA) and evaluating its implications for accelerator performance. We identify two alternative execution schemes for MLA, one reusing and one recomputing the latent projection matrices, which offer distinct trade-offs between compute and memory access. Using the Stream design space exploration framework, we model their throughput and energy cost across a range of hardware platforms and find that MLA can shift attention workloads toward the compute-bound regime. Our results show that MLA not only reduces bandwidth usage but also enables adaptable execution strategies aligned with hardware constraints. Compared to MHA, it provides more stable and efficient performance, particularly on bandwidth-limited hardware platforms. These findings emphasize MLA's relevance as a co-design opportunity for future AI accelerators.
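As a rough illustration of why MLA eases decode-phase bandwidth pressure, the back-of-envelope sketch below compares KV-cache footprints. All shapes are assumed (loosely DeepSeek-V2-like) rather than taken from the letter.

```python
# Back-of-envelope KV-cache footprints (fp16). MHA caches per-head keys
# and values for every layer and token; MLA caches one compact latent
# vector per layer and token.
def kv_bytes_mha(layers, heads, head_dim, seq_len, bytes_per_el=2):
    return layers * seq_len * 2 * heads * head_dim * bytes_per_el  # K and V

def kv_bytes_mla(layers, latent_dim, seq_len, bytes_per_el=2):
    return layers * seq_len * latent_dim * bytes_per_el  # one latent

print(kv_bytes_mha(60, 128, 128, 4096) / 1e9)  # ~16.1 GB
print(kv_bytes_mla(60, 512, 4096) / 1e9)       # ~0.25 GB
```

The recompute scheme trades this smaller cache for extra matrix multiplies at decode time, which is exactly the compute-for-bandwidth shift the letter analyzes.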
Accurate performance projection of large-scale benchmarks is essential for CPU architects to evaluate and optimize future processor designs. SimPoint sampling, which uses Basic Block Vectors (BBVs), is a widely adopted technique to reduce simulation time by selecting representative program phases. However, BBVs often fail to capture the behavior of applications with extensive array-indirect memory accesses, leading to inaccurate projections. In particular, the 523.xalancbmk_r benchmark exhibits complex data movement patterns that challenge traditional SimPoint methods. To address this, we propose enhancing SimPoint's BBV methodology by incorporating Memory Access Vectors (MAV), a microarchitecture-independent technique that tracks functional memory access patterns. This combined approach significantly improves the projection accuracy of 523.xalancbmk_r on a 192-core system-on-chip, increasing it from 80% to 98%.
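A minimal sketch of the combined-feature idea, assuming BBVs and MAVs are simply normalized and concatenated before standard SimPoint-style k-means clustering (the exact weighting and clustering settings in the paper may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def combined_vectors(bbv, mav, alpha=0.5):
    # Normalize each interval's BBV and MAV to unit L1 mass so neither
    # feature family dominates, then concatenate with a mixing weight
    # (alpha is an assumed knob).
    bbv = bbv / (bbv.sum(axis=1, keepdims=True) + 1e-12)
    mav = mav / (mav.sum(axis=1, keepdims=True) + 1e-12)
    return np.hstack([alpha * bbv, (1.0 - alpha) * mav])

# Classic SimPoint flow: cluster intervals, then simulate only the
# interval nearest each centroid, weighted by cluster population.
bbv = np.random.rand(200, 32)   # 200 intervals, 32 basic-block features
mav = np.random.rand(200, 16)   # 16 memory-access-pattern features
km = KMeans(n_clusters=8, n_init=10).fit(combined_vectors(bbv, mav))
```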
Spiking Neural Networks have earned increased recognition in recent years owing to their biological plausibility and event-driven computation. Spiking neurons are the fundamental building blocks of Spiking Neural Networks. These neurons act as computational units that decide when to fire an action potential. This work presents a methodology to implement biologically plausible yet scalable spiking neurons in hardware. We show that it is more efficient to design neurons that mimic the $I_{Na,p}+I_{K}$ model rather than the more complicated Hodgkin-Huxley model. We demonstrate our methodology by presenting eleven novel minimal spiking neuron circuits in Parts I and II of the paper. We categorize the presented neuron circuits into two types: resonators and integrators. We discuss the methodology employed in designing neurons of the resonator type in Part I, and neurons of the integrator type in Part II. In Part I, we postulate that sodium channels exhibit type-N negative differential resistance. Consequently, we present three novel minimal neuron circuits that use type-N negative differential resistance circuits or devices as the sodium channel. Nevertheless, the aim of the paper is not to present a set of minimal neuron circuits but rather the methodology used to construct those circuits.
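For readers unfamiliar with the $I_{Na,p}+I_{K}$ model, the sketch below simulates its two coupled equations with forward Euler. Parameter values are typical textbook settings, assumed here rather than taken from the paper's circuits; the point is that two state variables, versus Hodgkin-Huxley's four, are what make the hardware mapping cheaper.

```python
import numpy as np

# Persistent-sodium-plus-potassium model: membrane voltage V and K+
# activation n. Na+ activation is instantaneous, so it needs no state.
C, g_L, E_L = 1.0, 8.0, -80.0
g_Na, E_Na = 20.0, 60.0
g_K, E_K = 10.0, -90.0
tau_n = 1.0

def m_inf(V):  # instantaneous Na+ activation
    return 1.0 / (1.0 + np.exp((-20.0 - V) / 15.0))

def n_inf(V):  # steady-state K+ activation
    return 1.0 / (1.0 + np.exp((-25.0 - V) / 5.0))

dt, I_ext = 0.01, 10.0  # ms step, step current (assumed drive)
V, n, spikes = -65.0, n_inf(-65.0), 0
for _ in range(int(100 / dt)):  # 100 ms
    dV = (I_ext - g_L*(V - E_L) - g_Na*m_inf(V)*(V - E_Na)
          - g_K*n*(V - E_K)) / C
    prev, V = V, V + dt * dV
    n += dt * (n_inf(V) - n) / tau_n
    spikes += prev < 0.0 <= V  # count upward zero-crossings
print("spikes in 100 ms:", spikes)
```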
Compute-in-memory (CIM) architecture has been widely explored to address the von Neumann bottleneck in accelerating deep neural networks (DNNs). However, its reliability remains largely understudied, particularly in the emerging domain of floating-point (FP) CIM, which is crucial for speeding up high-precision inference and on-device training. This paper introduces Unicorn-CIM, a framework to uncover the vulnerability and improve the resilience of high-precision CIM, built on static random-access memory (SRAM)-based FP CIM architecture. Through the development of fault injection and extensive characterizations across multiple DNNs, Unicorn-CIM reveals how soft errors manifest in FP operations and impact overall model performance. Specifically, we find that high-precision DNNs are extremely sensitive to errors in the exponent part of FP numbers. Building on this insight, Unicorn-CIM develops an efficient algorithm-hardware co-design method that optimizes model exponent distribution through fine-tuning and incorporates a lightweight Error Correcting Code (ECC) scheme to safeguard high-precision DNNs on FP CIM. Comprehensive experiments show that our approach introduces a minimal logic overhead of just 8.98% on the exponent processing path while providing robust error protection and maintaining model accuracy. This work paves the way for developing more reliable and efficient CIM hardware.
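The exponent sensitivity is easy to reproduce in software: flipping a single exponent bit of an IEEE-754 float changes its magnitude by a large power of two, while a mantissa flip barely perturbs it. A minimal standalone fault-injection sketch (not the Unicorn-CIM framework itself):

```python
import struct

# Single-bit fault injection on an IEEE-754 float32 (bit 31 = sign,
# bits 30-23 = exponent, bits 22-0 = mantissa).
def flip_bit(x: float, bit: int) -> float:
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    u ^= 1 << bit
    (y,) = struct.unpack("<f", struct.pack("<I", u))
    return y

w = 0.0375
print(flip_bit(w, 3))   # mantissa flip: value barely perturbed
print(flip_bit(w, 28))  # exponent flip: value scales by 2**32
```

Protecting only the eight exponent bits with a lightweight ECC therefore covers the failure mode that actually destroys accuracy.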
In modern computing systems, compilation employs numerous optimization techniques to enhance code performance. Source-to-source code transformations, which include control-flow and datapath transformations, have been widely used in High-Level Synthesis (HLS) and compiler optimization. While researchers actively investigate methods to improve performance with source-to-source transformations, they often overlook the significance of verifying their correctness, and current tools cannot provide holistic verification of these transformations. This paper introduces HEC, an equivalence-checking framework that leverages the e-graph data structure to comprehensively verify functional equivalence between programs. HEC uses MLIR as its frontend and integrates MLIR into the e-graph framework. Through a combination of dynamic and static e-graph rewriting, HEC validates comprehensive code transformations. We demonstrate the effectiveness of HEC on the PolyBenchC benchmarks, successfully verifying loop unrolling, tiling, and fusion transformations. HEC processes over 100,000 lines of MLIR code in 40 minutes with predictable runtime scaling. Importantly, HEC identified two critical compilation errors in mlir-opt: loop boundary-check errors causing unintended executions during unrolling, and memory read-after-write violations in loop fusion that alter program semantics. These findings demonstrate HEC's practical value in detecting real-world compiler bugs and highlight the importance of formal verification in optimization pipelines.
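HEC establishes equivalence symbolically via e-graph rewriting; the sketch below is only a differential-testing stand-in that illustrates the property being verified, checking a loop against a by-4 unrolled version. Note the epilogue loop, which is exactly where the loop-boundary bug HEC found in mlir-opt would bite.

```python
import random

def original(xs):
    s = 0
    for x in xs:
        s += 2 * x
    return s

def unrolled(xs):
    s, i, n = 0, 0, len(xs)
    while i + 4 <= n:  # main loop, unrolled by 4
        s += 2*xs[i] + 2*xs[i+1] + 2*xs[i+2] + 2*xs[i+3]
        i += 4
    while i < n:       # epilogue for leftover iterations
        s += 2 * xs[i]
        i += 1
    return s

# Differential testing over random inputs, including lengths that are
# not multiples of the unroll factor.
for _ in range(1000):
    xs = [random.randint(-99, 99) for _ in range(random.randint(0, 33))]
    assert original(xs) == unrolled(xs)
```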
As machine learning models prove increasingly valuable, demand for access to them has grown accordingly. Oftentimes, it is infeasible to run inference with larger models without an accelerator, which may be unavailable in environments constrained by energy consumption, security, or cost. To increase the availability of these models, we aim to improve LLM inference speed in a CPU-only environment by modifying the cache architecture. To determine what improvements could be made, we conducted two experiments using Llama.cpp and the Qwen model: running various cache configurations and evaluating their performance, and outputting a trace of the memory footprint. Using these experiments, we investigate the memory access patterns and performance characteristics to identify potential optimizations.
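One standard way to turn such a trace into cache-configuration insight is LRU stack-distance (reuse-distance) analysis, sketched below. The trace format (one byte-address per access) is an assumption; Llama.cpp does not emit such traces natively.

```python
# A histogram of these distances predicts the hit rate of an LRU cache
# of any capacity, which is one way to compare cache configurations.
def reuse_distances(addresses, line_bytes=64):
    stack, dists = [], []
    for addr in addresses:
        line = addr // line_bytes
        if line in stack:
            i = stack.index(line)
            # Number of distinct lines touched since the last access.
            dists.append(len(stack) - 1 - i)
            stack.pop(i)
        else:
            dists.append(None)  # cold miss
        stack.append(line)
    return dists

# Example: a 2-line ping-pong pattern has reuse distance 1 after warm-up.
print(reuse_distances([0, 64, 0, 64, 0]))  # [None, None, 1, 1, 1]
```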
Artificial Intelligence (AI) algorithms, such as Deep Neural Networks (DNNs), have become an important tool for a wide range of applications, from computer vision to natural language processing. However, the computational complexity of DNN inference poses a significant challenge, particularly for processing on resource-constrained edge devices. One promising approach to address this challenge is the exploitation of sparsity in DNN operator weights. In this work, we present FlexiSAGA, an architecturally configurable and dataflow-flexible AI hardware accelerator for the sparse and dense processing of general matrix multiplications (GEMMs). FlexiSAGA supports seven different sparse and dense dataflows, enabling efficient processing of resource-intensive DNN operators. Additionally, we propose a DNN pruning method specifically tailored to the FlexiSAGA architecture, allowing for near-optimal processing of dense and sparse convolution and fully-connected operators and facilitating a DNN/HW co-design flow. Our results show whole-DNN sparse-over-dense inference speedups ranging from 1.41x to 4.28x, outperforming commercial and literature-reported accelerator platforms.
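A software sketch of one possible sparse dataflow, sparse weight rows in CSR form times a dense activation matrix, shown only to make explicit the multiply-accumulates that zero weights let an accelerator skip. FlexiSAGA's seven hardware dataflows and their scheduling are not modeled here.

```python
import numpy as np

def csr_gemm(indptr, indices, data, B):
    # Row-stationary sparse-times-dense GEMM over a CSR weight matrix.
    M, N = len(indptr) - 1, B.shape[1]
    C = np.zeros((M, N))
    for m in range(M):                       # for each sparse weight row
        for p in range(indptr[m], indptr[m + 1]):
            C[m] += data[p] * B[indices[p]]  # only nonzero weights touched
    return C

# A = [[0, 2, 0], [1, 0, 3]] in CSR form:
indptr, indices, data = [0, 1, 3], [1, 0, 2], [2.0, 1.0, 3.0]
B = np.arange(6.0).reshape(3, 2)
assert np.allclose(csr_gemm(indptr, indices, data, B),
                   np.array([[0, 2, 0], [1, 0, 3]]) @ B)
```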
This paper introduces SpiceMixer, a genetic algorithm developed to synthesize novel analog circuits by evolving SPICE netlists. Unlike conventional methods, SpiceMixer operates directly on netlist lines, enabling compatibility with any component or subcircuit type and supporting general-purpose genetic operations. By using a normalized netlist format, the algorithm enhances the effectiveness of its genetic operators: crossover, mutation, and pruning. We show that SpiceMixer achieves superior performance in synthesizing standard cells (inverter, two-input NAND, and latch) and in designing an analog classifier circuit for the Iris dataset, reaching an accuracy of 89% on the test set. Across all evaluated tasks, SpiceMixer consistently outperforms existing synthesis methods.
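A toy rendition of line-level genetic operators on a normalized netlist. The component lines and node names below are invented, and the fitness evaluation that SpiceMixer performs via SPICE simulation is omitted.

```python
import random

parent_a = ["M1 out in vdd vdd pmos", "M2 out in 0 0 nmos"]
parent_b = ["R1 out mid 10k", "C1 mid 0 1p", "M3 mid in 0 0 nmos"]

def crossover(a, b, rng=random):
    # Splice a random non-empty prefix of one netlist onto a random
    # suffix of the other; the normalized format keeps lines compatible.
    return a[: rng.randint(1, len(a))] + b[rng.randint(0, len(b) - 1) :]

def mutate(netlist, rng=random):
    # Rewrite one token of one line, e.g. reconnect a terminal to a
    # different node; broken circuits are weeded out by fitness testing.
    lines = netlist[:]
    i = rng.randrange(len(lines))
    tokens = lines[i].split()
    tokens[rng.randrange(1, len(tokens))] = rng.choice(
        ["0", "vdd", "in", "out", "mid"])
    lines[i] = " ".join(tokens)
    return lines

def prune(netlist, rng=random):
    # Drop a random line to keep evolved circuits minimal.
    i = rng.randrange(len(netlist))
    return netlist[:i] + netlist[i + 1 :]

child = prune(mutate(crossover(parent_a, parent_b)))
```

Operating on whole netlist lines is what makes the operators agnostic to component and subcircuit types, as the abstract notes.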
In recent years, the development of specialized edge computing devices has significantly increased, driven by the growing demand for AI models. These devices, such as the NVIDIA Jetson series, must efficiently handle increased data processing and storage requirements. However, despite these advancements, there remains a lack of frameworks that automate the optimal execution of deep neural networks (DNNs). Therefore, efforts have been made to create schedulers that can manage complex data processing needs while ensuring the efficient utilization of all available accelerators within these devices, including the CPU, GPU, deep learning accelerator (DLA), programmable vision accelerator (PVA), and video image compositor (VIC). Such schedulers would maximize the performance of edge computing systems, which is crucial in resource-constrained environments. This paper comprehensively reviews the various DNN schedulers implemented on NVIDIA Jetson devices, examining their methodologies, performance, and effectiveness in addressing the demands of modern AI workloads. By analyzing these schedulers, this review highlights the current state of research in the field and identifies areas for future research and development to further enhance the capabilities of edge computing devices.
Leveraging high degrees of unstructured sparsity is a promising approach to enhance the efficiency of deep neural network (DNN) accelerators, which is particularly important for emerging Edge-AI applications. We introduce VUSA, a systolic-array architecture that virtually grows based on the sparsity present in the workload, performing larger matrix multiplications with the same number of physical multiply-accumulate (MAC) units. The proposed architecture improves area and power efficiency by 37% and 68%, respectively, at the same peak performance, compared to a baseline systolic-array architecture in a commercial 16-nm technology. Moreover, the architecture supports acceleration for any DNN with any degree of sparsity, even none at all. The proposed architecture is therefore application-independent, making it viable for general-purpose AI acceleration.
Ternary large language models (LLMs), which use ternary-precision weights and 8-bit activations, have demonstrated competitive performance while significantly reducing the high computational and memory requirements of full-precision LLMs. The energy efficiency and performance of ternary LLMs can be further improved by deploying them on ternary computing-in-memory (TCiM) accelerators, thereby alleviating the von Neumann bottleneck. However, TCiM accelerators are prone to memory stuck-at faults (SAFs), which degrade model accuracy. This is particularly severe for LLMs due to their low weight sparsity. To boost the SAF tolerance of TCiM accelerators, we propose ReTern, which is based on (i) fault-aware sign transformations (FAST) and (ii) TCiM bit-cell reprogramming that exploits the cells' natural redundancy. The key idea is to use FAST to minimize computation errors due to SAFs in +1/-1 weights, while the natural bit-cell redundancy is exploited to target SAFs in 0 weights (zero-fix). Our experiments on the BitNet b1.58 700M and 3B ternary LLMs show that our technique furnishes significant fault tolerance, notably a 35% reduction in perplexity on the Wikitext dataset in the presence of faults. These benefits come at the cost of < 3%, < 7%, and < 1% energy, latency, and area overheads, respectively.
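The following toy sketch illustrates the flavor of a fault-aware sign transformation: negating a weight column (and compensating by negating its partial sum at runtime) can make more +1/-1 weights agree with their stuck cells. The sizes, fault rates, and greedy pass are all invented; note that sign flips cannot help 0 weights, which the paper's zero-fix handles separately via bit-cell redundancy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ternary weights stored in bit-cells, a fraction of which are stuck.
W = rng.integers(-1, 2, size=(8, 4))            # weights in {-1, 0, +1}
stuck_mask = rng.random(W.shape) < 0.2          # which cells are faulty
stuck_val = rng.integers(-1, 2, size=W.shape)   # the value they hold

def mismatches(W):
    # Cells whose intended weight disagrees with their stuck-at value.
    return int((stuck_mask & (W != stuck_val)).sum())

for c in range(W.shape[1]):
    flipped = W.copy()
    flipped[:, c] *= -1
    if mismatches(flipped) < mismatches(W):
        W = flipped  # and negate column c's partial sum during inference

print("residual faulty cells:", mismatches(W))
```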
Low-cost, high-throughput DNA and RNA sequencing (HTS) data is the main workhorse of the life sciences. Genome sequencing is now becoming part of Predictive, Preventive, Personalized, and Participatory (termed 'P4') medicine. All genomic data are currently processed in energy-hungry computer clusters and centers, necessitating data transfer, consuming substantial energy, and wasting valuable time. Therefore, there is a need for fast, energy-efficient, and cost-efficient technologies that enable genomics research without requiring data centers and cloud platforms. We recently started the BioPIM Project to leverage emerging processing-in-memory (PIM) technologies for energy- and cost-efficient analysis of bioinformatics workloads. The BioPIM Project focuses on co-designing algorithms and data structures commonly used in genomics with several PIM architectures for the greatest savings in cost, energy, and time.
As hardware design complexity increases, hardware fuzzing emerges as a promising tool for automating the verification process. However, a significant gap still exists before it can be applied in industry. This paper summarizes the current progress of hardware fuzzing from an industry-use perspective and proposes solutions to bridge the gap between hardware fuzzing and industrial verification. First, we review recent hardware fuzzing methods and analyze their compatibility with industrial verification, establishing criteria to assess whether a hardware fuzzing approach is compatible. Second, we examine whether current verification tools can efficiently support hardware fuzzing, identifying the bottlenecks in hardware fuzzing performance caused by insufficient support from the industrial environment. To overcome these bottlenecks, we propose a prototype, HwFuzzEnv, that provides the necessary support for hardware fuzzing. With this prototype, a previous hardware fuzzing method achieves a several-hundred-fold speedup in industrial settings. Our work can serve as a reference for EDA companies, encouraging them to enhance their tools to support hardware fuzzing efficiently in industrial verification.
Embedded edge devices are often used as a computing platform for real-world point cloud applications, but recent deep learning-based methods may not fit on such devices due to limited resources. In this paper, we aim to fill this gap by introducing PointODE, a parameter-efficient, ResNet-like architecture for point cloud feature extraction based on a stack of MLP blocks with residual connections. We leverage Neural ODEs (Ordinary Differential Equations), a continuous-depth counterpart of ResNets originally developed for modeling the dynamics of continuous-time systems, to compress PointODE by reusing the same parameters across MLP blocks. A point-wise normalization scheme is proposed for PointODE to handle the non-uniform distribution of feature points. We introduce PointODE-Elite as a lightweight version with 0.58M trainable parameters and design its dedicated accelerator for embedded FPGAs. The accelerator consists of a four-stage pipeline that parallelizes feature extraction across multiple points and stores all parameters on-chip to eliminate most off-chip data transfers. Compared to an ARM Cortex-A53 CPU, the accelerator implemented on a Xilinx ZCU104 board speeds up feature extraction by 4.9x, leading to 3.7x faster inference and 3.5x better energy efficiency. Despite its simple architecture, PointODE-Elite shows accuracy competitive with state-of-the-art models on both synthetic and real-world classification datasets, greatly improving the trade-off between accuracy and inference cost.
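The parameter-reuse idea is easy to see in a few lines: a Neural ODE block integrates one shared residual MLP over "depth", so K Euler steps reuse a single weight set where a plain ResNet would need K distinct blocks. Shapes and step counts below are illustrative, not PointODE-Elite's.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64
W1 = rng.normal(0.0, 0.1, (D, D))
W2 = rng.normal(0.0, 0.1, (D, D))

def f(h):  # the shared residual MLP
    return np.tanh(h @ W1) @ W2

def ode_block(h, steps=4, dt=0.25):
    for _ in range(steps):   # h' = f(h), discretized with forward Euler
        h = h + dt * f(h)    # same W1, W2 reused at every step
    return h

x = rng.normal(size=(1024, D))  # features for 1024 points
y = ode_block(x)
print(y.shape)  # (1024, 64): four blocks' depth, one block's weights
```

Keeping one weight set is also what lets the FPGA accelerator hold all parameters on-chip.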