This work presents a multi-layer DNN scheduling framework as an extension of OctopuScheduler, providing an end-to-end flow from PyTorch models to inference on a single SpiNNaker2 chip. Together with a front-end comprising quantization and lowering steps, the proposed framework enables edge-based execution of large and complex DNNs, up to transformer scale, on the neuromorphic platform SpiNNaker2.
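The abstract does not detail the quantization scheme; as an illustrative sketch, a per-tensor symmetric int8 quantization step of the kind such a PyTorch front-end might apply could look like the following (all names and parameters are hypothetical, not the authors' actual flow):

```python
import torch

def quantize_symmetric_int8(w: torch.Tensor):
    """Per-tensor symmetric int8 quantization: w ~= scale * q with q in [-127, 127]."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(64, 64)              # a weight tensor taken from a PyTorch model
q, scale = quantize_symmetric_int8(w)
w_hat = scale * q.float()            # dequantized view, used to check quantization error
print((w - w_hat).abs().max())       # bounded by roughly scale / 2
```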
Computation-in-Memory (CiM) is attracting attention as a technology that can perform the multiply-accumulate (MAC) operations required by AI accelerators at high speed and with low power consumption. However, both power consumption and device-induced errors increase as row parallelism grows. In this paper, a 4T2R ReRAM cell and an 8T SRAM cell suitable for CiM are proposed. It is shown that adopting the proposed 4T2R ReRAM cell reduces the errors due to variation in ReRAM devices compared to conventional 4T4R ReRAM cells.
This work-in-progress paper presents a case study in which counterfeit TL074 operational amplifiers, discovered in a junior-level electronics course, became the basis for a hands-on learning experience. Counterfeit integrated circuits (ICs) are increasingly common, posing a significant threat to the integrity of undergraduate electronics laboratories. Instead of simply replacing the counterfeit components, we turned the issue into a teaching moment. Students engaged in hands-on diagnostics: measuring currents, analyzing waveforms, and troubleshooting. By working with the counterfeit parts, they gained deeper insight into analog circuits, supply-chain security, and practical engineering.
An increasing number of applications exploit sampling-based algorithms for planning, optimization, and inference. Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, their high computational cost limits feasibility for large-scale problems and real-world applications, and existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces \textbf{MC$^2$A}, an algorithm-hardware co-design framework enabling efficient and flexible optimization for MCMC acceleration. First, \textbf{MC$^2$A} analyzes MCMC workload diversity through an extension of the processor performance roofline model with a third dimension, deriving the optimal balance among compute, sampling, and memory parameters. Second, \textbf{MC$^2$A} proposes a parameterized hardware accelerator architecture that supports MCMC kernels flexibly and efficiently via a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers, and a crossbar interconnect for irregular access. Third, the core of \textbf{MC$^2$A} is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In an end-to-end case study, \textbf{MC$^2$A} achieves overall speedups of $307.6\times$, $1.4\times$, $2.0\times$, and $84.2\times$ over a CPU, a GPU, a TPU, and a state-of-the-art MCMC accelerator, respectively. Evaluated on various representative MCMC workloads, this work demonstrates the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.
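The hardware realization of the sampler is not given in the abstract, but the Gumbel-max trick that such samplers are typically built on, sampling a category from unnormalized log-probabilities using only additions and an argmax, with no exponentiation or normalization at sample time, can be sketched as follows (illustrative, not the MC$^2$A implementation):

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """Sample a category from unnormalized log-probabilities via the Gumbel-max trick.

    argmax(logits + Gumbel noise) is distributed as softmax(logits), so no exp()
    or normalization is needed when drawing a sample."""
    g = rng.gumbel(size=len(logits))
    return int(np.argmax(logits + g))

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])
counts = np.bincount([gumbel_max_sample(logits, rng) for _ in range(10000)], minlength=3)
print(counts / counts.sum())         # empirical frequencies, close to softmax(logits)
```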
Transformers have revolutionized deep learning with applications in natural language processing, computer vision, and beyond. However, their computational demands make it challenging to deploy them on low-power edge devices. This paper introduces an ultra-low-power, Coarse-Grained Reconfigurable Array (CGRA) architecture specifically designed to accelerate General Matrix Multiplication (GEMM) operations in transformer models, tailored to the energy and resource constraints of edge applications. The proposed architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) for optimized LOAD/STORE operations, reducing memory bandwidth demands and enhancing data reuse. A switchless mesh torus interconnect network further minimizes power and latency by enabling direct communication between PEs and MOBs, eliminating the need for centralized switching. Through its heterogeneous array design and efficient dataflow, this CGRA architecture addresses the unique computational needs of transformers, offering a scalable pathway to deploy sophisticated machine learning models on edge devices.
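As a software point of reference for the dataflow a 4 x 4 PE array targets, a blocked GEMM over 4 x 4 output tiles (a sketch only, not the CGRA's actual mapping or MOB behavior) looks like:

```python
import numpy as np

def blocked_gemm(A, B, T=4):
    """Blocked GEMM: each TxT output tile mirrors the work one pass of a TxT PE array performs."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):      # MOB-style loads would stream these operand tiles
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C

A, B = np.random.rand(8, 12), np.random.rand(12, 16)
assert np.allclose(blocked_gemm(A, B), A @ B)
```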
Reducing latency in the Internet of Things (IoT) is a critical concern. While cloud computing facilitates communication, it falls short of meeting real-time requirements reliably. Edge and fog computing have emerged as viable solutions by positioning computing nodes closer to end users, offering lower latency and increased processing power. An edge-fog framework comprises various components, including edge and fog nodes, whose strategic placement is crucial as it directly impacts latency and system cost. This paper presents an effective and tunable node placement strategy based on a genetic algorithm to address the optimization problem of deploying edge and fog nodes. The main objective is to minimize latency and cost through optimal node placement. Simulation results demonstrate that the proposed framework achieves up to a 2.77% reduction in latency and a 31.15% reduction in cost.
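The abstract does not specify the chromosome encoding or objective weights; the toy sketch below only illustrates the general shape of such a genetic algorithm, selecting fog-node sites under a weighted latency/cost fitness (all numbers and parameters hypothetical):

```python
import random

# Hypothetical toy model: choose which candidate sites receive a fog node.
SITES, NODES = 12, 4
random.seed(0)
latency = [[random.uniform(1, 10) for _ in range(SITES)] for _ in range(30)]  # user -> site latency
site_cost = [random.uniform(5, 20) for _ in range(SITES)]
ALPHA = 0.7  # tunable latency/cost trade-off

def fitness(placement):
    """Weighted sum of average user latency (to the nearest placed node) and deployment cost."""
    avg_lat = sum(min(row[s] for s in placement) for row in latency) / len(latency)
    cost = sum(site_cost[s] for s in placement)
    return ALPHA * avg_lat + (1 - ALPHA) * cost

def mutate(placement):
    child = set(placement)
    child.discard(random.choice(list(child)))
    while len(child) < NODES:
        child.add(random.randrange(SITES))
    return tuple(sorted(child))

pop = [tuple(sorted(random.sample(range(SITES), NODES))) for _ in range(20)]
for _ in range(100):
    pop = sorted(pop, key=fitness)[:10]                      # selection
    pop += [mutate(random.choice(pop)) for _ in range(10)]   # mutation
print(sorted(pop, key=fitness)[0])                           # best placement found
```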
The demand for machine intelligence capable of processing continuous, long-context inputs on local devices is growing rapidly. However, the quadratic complexity and memory requirements of traditional Transformer architectures make them inefficient and often unusable for these tasks. This has spurred a paradigm shift towards new architectures like State Space Models (SSMs) and hybrids, which promise near-linear scaling. While most current research focuses on the accuracy and theoretical throughput of these models, a systematic performance characterization on practical consumer hardware is critically needed to guide system-level optimization and unlock new applications. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformer, SSM, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis reveals that SSMs are not only viable but superior for this domain, capable of processing sequences up to 220K tokens on a 24 GB consumer GPU, approximately 4x longer than comparable Transformers. While Transformers may be up to 1.8x faster at short sequences, SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens). Our operator-level analysis reveals that custom, hardware-aware SSM kernels dominate the inference runtime, accounting for over 55% of latency on edge platforms, identifying them as a primary target for future hardware acceleration. We also provide detailed, device-specific characterization results to guide system co-design for the edge. To foster further research, we will open-source our characterization framework.
The Number Theoretic Transform (NTT) is a fundamental operation in privacy-preserving technologies, particularly within fully homomorphic encryption (FHE). The efficiency of NTT computation directly impacts the overall performance of FHE, making hardware acceleration a critical technology that will enable realistic FHE applications. Custom accelerators, in FPGAs or ASICs, offer significant performance advantages due to their ability to exploit massive parallelism and specialized optimizations. However, the operation of NTT over large moduli requires large word-length modulo arithmetic that limits achievable clock frequencies in hardware and increases hardware area costs. To overcome such deficits, digit-serial arithmetic has been explored for modular multiplication and addition independently. The goal of this work is to leverage digit-serial modulo arithmetic combined with appropriate redundant data representation to design modular pipelined NTT accelerators that operate uniformly on arbitrarily small digits, without the need for intermediate (de)serialization. The proposed architecture enables high clock frequencies through regular pipelining while maintaining parallelism. Experimental results demonstrate that the proposed approach outperforms state-of-the-art implementations and reduces hardware complexity under equal performance and input-output bandwidth constraints.
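For reference, the computation being accelerated is the NTT itself, a DFT over $\mathbb{Z}_q$; a minimal radix-2 software version with toy parameters (far smaller than the FHE moduli the paper targets) is:

```python
def ntt(a, omega, q):
    """Radix-2 NTT of a length-2^k sequence modulo q; omega is a primitive len(a)-th root of unity."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out, w = [0] * n, 1
    for i in range(n // 2):                  # butterfly: the unit a hardware pipeline replicates
        t = w * odd[i] % q
        out[i] = (even[i] + t) % q
        out[i + n // 2] = (even[i] - t) % q
        w = w * omega % q
    return out

# q = 17, omega = 2 is a primitive 8th root of unity (2^4 = 16 = -1 mod 17, 2^8 = 1 mod 17)
print(ntt([1, 2, 3, 4, 0, 0, 0, 0], 2, 17))
```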
Large Language Models (LLMs) have become widely used across diverse NLP tasks and domains, demonstrating their adaptability and effectiveness. In the realm of Electronic Design Automation (EDA), LLMs show promise for tasks like Register-Transfer Level (RTL) code generation and summarization. However, despite the proliferation of LLMs for general code-related tasks, there's a dearth of research focused on evaluating and refining these models for hardware description languages (HDLs), notably VHDL. In this study, we evaluate the performance of existing code LLMs for VHDL code generation and summarization using various metrics and two datasets -- VHDL-Eval and VHDL-Xform. The latter, an in-house dataset, aims to gauge LLMs' understanding of functionally equivalent code. Our findings reveal consistent underperformance of these models across different metrics, underscoring a significant gap in their suitability for this domain. To address this challenge, we propose Chain-of-Descriptions (CoDes), a novel approach to enhance the performance of LLMs for VHDL code generation and summarization tasks. CoDes involves generating a series of intermediate descriptive steps based on: (i) the problem statement for code generation, and (ii) the VHDL code for summarization. These steps are then integrated with the original input prompt (problem statement or code) and provided as input to the LLMs to generate the final output. Our experiments demonstrate that the CoDes approach significantly surpasses the standard prompting strategy across various metrics on both datasets. This method not only improves the quality of VHDL code generation and summarization but also serves as a framework for future research aimed at enhancing code LLMs for VHDL.
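The two-stage prompting structure described above can be sketched as follows; `llm` is a placeholder for whatever code-LLM endpoint is used, and the prompt wording is illustrative rather than the paper's exact templates:

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to a code LLM endpoint."""
    raise NotImplementedError

def codes_generate_vhdl(problem_statement: str) -> str:
    # Stage 1: ask for intermediate descriptive steps derived from the problem statement.
    steps = llm(
        "List the intermediate design steps needed to implement the following "
        f"VHDL module. Do not write code yet.\n\n{problem_statement}"
    )
    # Stage 2: feed the original prompt plus the generated steps back to the model.
    return llm(
        f"{problem_statement}\n\nFollow these intermediate steps when writing the VHDL:\n{steps}"
    )
```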
Task offloading in three-layer fog computing environments presents a critical challenge due to user equipment (UE) mobility, which frequently triggers costly service migrations and degrades overall system performance. This paper addresses this problem by proposing MOFCO, a novel Mobility- and Migration-aware Task Offloading algorithm for Fog Computing environments. The proposed method formulates task offloading and resource allocation as a Mixed-Integer Nonlinear Programming (MINLP) problem and employs a heuristic-aided evolutionary game theory approach to solve it efficiently. To evaluate MOFCO, we simulate mobile users using SUMO, providing realistic mobility patterns. Experimental results show that MOFCO reduces system cost, defined as a combination of latency and energy consumption, by an average of 19% and up to 43% in certain scenarios compared to state-of-the-art methods.
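A toy version of the kind of cost model described, latency and energy combined with a one-off penalty when a moving user's service migrates, might look like this (weights and numbers are hypothetical, not MOFCO's actual formulation):

```python
# Hypothetical weights for the combined system cost and the migration penalty.
W_LATENCY, W_ENERGY, MIGRATION_PENALTY = 0.6, 0.4, 1.5

def offload_cost(latency_s: float, energy_j: float, migrated: bool) -> float:
    """System cost as a weighted combination of latency and energy, plus a migration penalty."""
    return W_LATENCY * latency_s + W_ENERGY * energy_j + (MIGRATION_PENALTY if migrated else 0.0)

# A mobile UE compares staying local, keeping its current fog node, or migrating to a nearer one.
options = {
    "local":       offload_cost(latency_s=0.9, energy_j=2.0, migrated=False),
    "current_fog": offload_cost(latency_s=0.6, energy_j=0.8, migrated=False),
    "nearer_fog":  offload_cost(latency_s=0.3, energy_j=0.7, migrated=True),
}
print(min(options, key=options.get))
```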
Flexibility and customization are key strengths of Field-Programmable Gate Arrays (FPGAs) when compared to other computing devices. For instance, FPGAs can efficiently implement arbitrary-precision arithmetic operations, and can perform aggressive synthesis optimizations to eliminate ineffectual operations. Motivated by sparsity and mixed-precision in deep neural networks (DNNs), we investigate how to optimize the current logic block architecture to increase its arithmetic density. We find that modern FPGA logic block architectures prevent the independent use of adder chains, and instead only allow adder chain inputs to be fed by look-up table (LUT) outputs. This only allows one of the two primitives -- either adders or LUTs -- to be used independently in one logic element and prevents their concurrent use, hampering area optimizations. In this work, we propose the Double Duty logic block architecture to enable the concurrent use of the adders and LUTs within a logic element. Without adding expensive logic cluster inputs, we use 4 of the existing inputs to bypass the LUTs and connect directly to the adder chain inputs. We accurately model our changes at both the circuit and CAD levels using open-source FPGA development tools. Our experimental evaluation on a Stratix-10-like architecture demonstrates area reductions of 21.6% on adder-intensive circuits from the Kratos benchmarks, and 9.3% and 8.2% on the more general Koios and VTR benchmarks respectively. These area improvements come without an impact to critical path delay, demonstrating that higher density is feasible on modern FPGA architectures by adding more flexibility in how the adder chain is used. Averaged across all circuits from our three evaluated benchmark set, our Double Duty FPGA architecture improves area-delay product by 9.7%.
To meet the increasing demand of deep learning (DL) models, AI chips are employing both off-chip memory (e.g., HBM) and high-bandwidth low-latency interconnect for direct inter-core data exchange. However, it is not easy to explore the efficiency of these inter-core connected AI (ICCA) chips, due to a fundamental tussle among compute (per-core execution), communication (inter-core data exchange), and I/O (off-chip data access). In this paper, we develop Elk, a DL compiler framework to maximize the efficiency of ICCA chips by jointly trading off all three performance factors discussed above. Elk structures these performance factors into configurable parameters and forms a global trade-off space in the DL compiler. To systematically explore this space and maximize overall efficiency, Elk employs a new inductive operator scheduling policy and a cost-aware on-chip memory allocation algorithm. It generates globally optimized execution plans that best overlap off-chip data loading and on-chip execution. To examine the efficiency of Elk, we build a full-fledged emulator based on a real ICCA chip, the IPU-POD4, and an ICCA chip simulator for sensitivity analysis with different interconnect network topologies. Elk achieves 94% of the ideal roofline performance of ICCA chips on average, showing the benefits of supporting large DL models on ICCA chips. We also show Elk's capability of enabling architecture design space exploration for new ICCA chip development.
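The benefit of overlapping off-chip loading with on-chip execution can be illustrated with a toy double-buffering model (numbers hypothetical; Elk's actual cost model and scheduler are far more detailed):

```python
# Per-layer off-chip load time vs. on-chip compute time, in ms (hypothetical values).
load    = [4.0, 1.5, 6.0, 2.0, 3.5]
compute = [2.5, 3.0, 5.0, 4.0, 1.0]

sequential = sum(l + c for l, c in zip(load, compute))
# With double-buffering, layer i+1's load overlaps layer i's compute.
overlapped = load[0] + sum(max(l, c) for l, c in zip(load[1:] + [0.0], compute))
print(sequential, overlapped)   # the overlapped schedule hides much of the I/O time
```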
Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention. Systolic arrays can only achieve high utilization for consecutive and large matrix multiplications. In contrast, FlashAttention requires frequently interleaved matrix multiplications and softmax operations. The frequent data swaps between the systolic array and external vector units result in low systolic array utilization. This is further exacerbated by the fact that softmax involves numerous non-matrix operations, which are not well-suited for systolic arrays. Moreover, the concurrent execution of matrix multiplication on systolic arrays and softmax on vector units leads to register file and SRAM port contention, further degrading performance. To overcome these limitations, we propose FSA, an enhanced systolic array architecture that enables the entire FlashAttention algorithm to run within a single systolic array, eliminating the need for external vector units. At the core of FSA is SystolicAttention, a novel scheduling algorithm that maps FlashAttention operations onto systolic arrays with fine-grained, element-wise overlap. This significantly improves array utilization while preserving the original floating-point operation order to maintain numerical stability. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art commercial accelerators. Our results show that FSA achieves 1.77x and 4.83x higher attention FLOPs/s utilization compared to AWS NeuronCore-v2 and Google TPUv5e, respectively, with only about 10% area overhead.
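The interleaving the abstract refers to stems from FlashAttention's online-softmax recurrence, sketched below as a NumPy reference (functional only, not the FSA mapping): every K/V tile requires a matmul, a running max/sum update, and a second matmul before the next tile can proceed.

```python
import numpy as np

def flash_attention_reference(Q, K, V, tile=32):
    """Tiled attention using the online-softmax recurrence that FlashAttention is built on."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)        # running row-wise max of the scores
    l = np.zeros(n)                # running softmax denominator
    for s in range(0, K.shape[0], tile):
        Kb, Vb = K[s:s+tile], V[s:s+tile]
        S = Q @ Kb.T / np.sqrt(d)                 # matmul (systolic-array friendly)
        m_new = np.maximum(m, S.max(axis=1))      # softmax bookkeeping (non-matrix ops)
        p = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)                 # rescale previously accumulated results
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ Vb       # second matmul
        m = m_new
    return out / l[:, None]

Q, K, V = (np.random.rand(64, 16) for _ in range(3))
scores = Q @ K.T / np.sqrt(16)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_reference(Q, K, V), ref)
```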
The growing demand for edge computing and AI drives research into analog in-memory computing using memristors, which overcome data movement bottlenecks by computing directly within memory. However, device failures and variations critically limit analog systems' precision and reliability. Existing fault-tolerance techniques, such as redundancy and retraining, are often inadequate for high-precision applications or scenarios requiring fixed matrices and privacy preservation. Here, we introduce and experimentally demonstrate a fault-free matrix representation where target matrices are decomposed into products of two adjustable sub-matrices programmed onto analog hardware. This indirect, adaptive representation enables mathematical optimization to bypass faulty devices and eliminate differential pairs, significantly enhancing computational density. Our memristor-based system achieved >99.999% cosine similarity for a Discrete Fourier Transform matrix despite a 39% device fault rate, a fidelity unattainable with conventional direct representation, which fails with single device faults (0.01% rate). We demonstrated a 56-fold bit-error-rate reduction in wireless communication, together with >196% density and 179% energy-efficiency improvements over state-of-the-art techniques. This method, validated on memristors, applies broadly to emerging memories and non-electrical computing substrates, showing that device yield is no longer the primary bottleneck in analog computing hardware.
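The decomposition idea can be sketched numerically: hold the target matrix fixed, represent it as a product of two programmable sub-matrices, and optimize only the entries that land on healthy devices. The fault pattern, sizes, and the plain gradient loop below are illustrative, not the paper's optimization procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))           # fixed target matrix (e.g., a transform operator)
A = rng.standard_normal((16, 24)) * 0.1     # programmable sub-matrices, W ~= A @ B
B = rng.standard_normal((24, 16)) * 0.1

# Hypothetical fault model: some cells are stuck and cannot be reprogrammed.
stuck_A = rng.random(A.shape) < 0.2
stuck_B = rng.random(B.shape) < 0.2

lr = 1e-2
for _ in range(5000):
    E = A @ B - W                           # reconstruction error
    gA, gB = E @ B.T, A.T @ E               # gradients of 0.5 * ||A @ B - W||_F^2
    gA[stuck_A] = 0.0                       # faulty devices stay at their stuck values
    gB[stuck_B] = 0.0
    A -= lr * gA
    B -= lr * gB

err = np.linalg.norm(A @ B - W) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.4f}")
```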
Designing secure architectures for system-on-chip (SoC) platforms is a highly intricate and time-intensive task, often requiring months of development and meticulous verification. Even minor architectural oversights can lead to critical vulnerabilities that undermine the security of the entire chip. In response to this challenge, we introduce CITADEL, a modular security framework aimed at streamlining the creation of robust security architectures for SoCs. CITADEL offers a configurable, plug-and-play subsystem composed of custom intellectual property (IP) blocks, enabling the construction of diverse security mechanisms tailored to specific threats. As a concrete demonstration, we instantiate CITADEL to defend against supply-chain threats, illustrating how the framework adapts to one of the most pressing concerns in hardware security. This paper explores the range of obstacles encountered when building a unified security architecture capable of addressing multiple attack vectors and presents CITADEL's strategies for overcoming them. Through several real-world case studies, we showcase the practical implementation of CITADEL and present a thorough evaluation of its impact on silicon area and power consumption across various ASIC technologies. Results indicate that CITADEL introduces only minimal resource overhead, making it a practical solution for enhancing SoC security.
LUT (Look-Up Table) mapping is a critical step in FPGA logic synthesis, where a logic network is transformed into a form that can be directly implemented using the FPGA's LUTs. An FPGA LUT is a flexible digital memory structure that can implement any logic function of a limited number of inputs, typically 4 to 6, depending on the FPGA architecture. The goal of LUT mapping is therefore to cover the Boolean network with LUTs of a fixed input size. In parallel with FPGA technology mapping, ASIC technology mapping maps the Boolean network to user-defined standard cells and has traditionally been developed separately from LUT mapping algorithms. In this work, however, our motivating examples demonstrate that ASIC technology mappers can improve the performance of LUT mappers when standard-cell mapping and LUT mapping are applied incrementally. We therefore propose the FuseMap framework, which exploits this opportunity to improve LUT mapping in the FPGA design flow by using reinforcement learning to make design-specific choices during cell selection. The effectiveness of FuseMap is evaluated on a wide range of benchmarks, technology libraries, and technology mappers. The experimental results demonstrate that FuseMap achieves higher mapping accuracy while reducing delay and area across diverse circuit designs collected from the ISCAS 85/89, ITC/ISCAS 99, VTR 8.0, and EPFL benchmarks.
Gain Cell memory (GCRAM) offers higher density and lower power than SRAM, making it a promising candidate for on-chip memory in domain-specific accelerators. To support workloads with varying traffic and lifetime metrics, GCRAM also offers high bandwidth, ultra-low leakage power, and a wide range of retention times, which can be adjusted through transistor design (such as threshold voltage and channel material) and on the fly by changing the operating voltage. However, designing and optimizing GCRAM sub-systems can be time-consuming. In this paper, we present OpenGCRAM, an open-source GCRAM compiler capable of generating GCRAM bank circuit designs and DRC- and LVS-clean layouts for commercially available foundry CMOS, while also providing area, delay, and power simulations based on user-specified configurations (e.g., word size and number of words). OpenGCRAM enables fast, accurate, customizable, and optimized GCRAM block generation; reduces design time; ensures process compliance; and delivers performance-tailored memory blocks that meet diverse application requirements.
Neuromorphic systems using in-memory or event-driven computing are motivated by the need for more energy-efficient processing of artificial intelligence workloads. Emerging neuromorphic architectures aim to combine traditional digital designs with the computational efficiency of analog computing and novel device technologies. A crucial problem in the rapid exploration and co-design of such architectures is the lack of tools for fast and accurate modeling and simulation. Typical mixed-signal design tools integrate a digital simulator with an analog solver like SPICE, which is prohibitively slow for large systems. By contrast, behavioral modeling of analog components is faster, but existing approaches are fixed to specific architectures with limited energy and performance modeling. In this paper, we propose LASANA, a novel approach that leverages machine learning to derive data-driven surrogate models of analog sub-blocks in a digital backend architecture. LASANA uses SPICE-level simulations of a circuit to train ML models that predict circuit energy, performance, and behavior at analog/digital interfaces. Such models can provide energy and performance annotation on top of existing behavioral models or function as replacements to analog simulation. We apply LASANA to an analog crossbar array and a spiking neuron circuit. Running MNIST and spiking MNIST, LASANA surrogates demonstrate up to three orders of magnitude speedup over SPICE, with energy, latency, and behavioral error less than 7%, 8%, and 2%, respectively.
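The surrogate-modeling idea can be sketched with a generic regressor trained on (operating-condition, energy/latency) pairs of the kind a SPICE sweep would produce; the synthetic data and MLP below merely stand in for the paper's actual circuits and models:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for a SPICE characterization sweep: inputs are analog-block operating
# conditions (e.g., input magnitude, bias); targets are per-event energy and latency.
X = rng.uniform(0.0, 1.0, size=(5000, 4))
energy = 0.8 * X[:, 0] ** 2 + 0.3 * X[:, 1] + 0.05 * rng.standard_normal(5000)
latency = 1.2 - 0.5 * X[:, 2] + 0.1 * X[:, 3] ** 2 + 0.05 * rng.standard_normal(5000)
y = np.stack([energy, latency], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X_tr, y_tr)                    # replaces per-event analog simulation at run time
print("R^2 on held-out sweeps:", surrogate.score(X_te, y_te))
```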