Browse, search and filter the latest cybersecurity research papers from arXiv
Escalating artificial intelligence (AI) demands expose a critical "compute crisis" characterized by unsustainable energy consumption, prohibitive training costs, and the approaching limits of conventional CMOS scaling. Physics-based Application-Specific Integrated Circuits (ASICs) present a transformative paradigm by directly harnessing intrinsic physical dynamics for computation rather than expending resources to enforce idealized digital abstractions. By relaxing constraints that traditional ASICs must enforce, such as statelessness, unidirectionality, determinism, and synchronization, these devices aim to operate as exact realizations of physical processes, offering substantial gains in energy efficiency and computational throughput. This approach enables novel co-design strategies, aligning algorithmic requirements with the inherent computational primitives of physical systems. Physics-based ASICs could accelerate critical AI applications such as diffusion models, sampling, optimization, and neural network inference, as well as traditional computational workloads such as scientific simulation of materials and molecules. Ultimately, this vision points toward a future of heterogeneous, highly specialized computing platforms capable of overcoming current scaling bottlenecks and unlocking new frontiers in computational power and efficiency.
Assertion-Based Verification (ABV) is critical for ensuring functional correctness in modern hardware systems. However, manually writing high-quality SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. To bridge this gap, we propose AssertCoder, a novel unified framework that automatically generates high-quality SVAs directly from multimodal hardware design specifications. AssertCoder employs modality-sensitive preprocessing to parse heterogeneous specification formats (text, tables, diagrams, and formulas), followed by a set of dedicated semantic analyzers that extract structured representations aligned with signal-level semantics. These representations drive assertion synthesis via multi-step chain-of-thought (CoT) prompting. The framework incorporates a mutation-based evaluation approach to assess assertion quality via model checking and further refine the generated assertions. Experimental evaluation across three real-world Register-Transfer Level (RTL) designs demonstrates AssertCoder's superior performance, achieving an average increase of 8.4% in functional correctness and 5.8% in mutation detection compared to existing state-of-the-art approaches.
Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inference. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and state updates in post-transformers. Further analyses suggest two additional insights: (1) state update operations, unlike attention, incur high hardware cost, making per-bank PIM acceleration inefficient, and (2) different low-precision arithmetic methods offer varying accuracy-area tradeoffs, among which we identify Microsoft's MX as the Pareto-optimal choice. Building on these insights, we design Pimba as an array of State-update Processing Units (SPUs), each shared between two banks to enable interleaved access to PIM. Each SPU includes a State-update Processing Engine (SPE) that comprises element-wise multipliers and adders using MX-based quantized arithmetic, enabling efficient execution of state update and attention operations. Our evaluation shows that, compared to LLM-optimized GPU and GPU+PIM systems, Pimba achieves up to 3.2x and 2.1x higher token generation throughput, respectively.
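To see why the state updates in post-transformers are memory-bandwidth-bound, the toy NumPy sketch below (an illustration, not Pimba's hardware or the paper's exact recurrence) models a generic elementwise linear-recurrence decode step: every state element is touched once per generated token but receives only two FLOPs. The shapes are assumed for illustration.

```python
import numpy as np

def state_update(state, a, b_x):
    """One decode step of a generic elementwise linear recurrence:
        state <- a * state + b_x        (all operations elementwise)

    Each state element is read and written once per generated token while
    receiving only one multiply and one add, so batched decoding is bound
    by memory bandwidth rather than arithmetic throughput.
    """
    return a * state + b_x

# Illustrative shapes (not taken from the paper): 32 sequences in a batch,
# 16 heads, a 64x64 state per head for one layer.
rng = np.random.default_rng(0)
shape = (32, 16, 64, 64)
state = rng.standard_normal(shape).astype(np.float32)
a = rng.uniform(0.9, 1.0, shape).astype(np.float32)    # per-element decay
b_x = rng.standard_normal(shape).astype(np.float32)    # gated current input

state = state_update(state, a, b_x)
bytes_moved = 4 * state.nbytes   # read state, a, b_x and write state back
print(f"~{bytes_moved / 1e6:.0f} MB moved for only 2 FLOPs per element")
```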
Recent advancements have demonstrated the significant potential of large language models (LLMs) in analog circuit design. Nevertheless, testbench construction for analog circuits remains manual, creating a critical bottleneck on the path to fully automated design. Particularly when replicating circuit designs from academic papers, manual testbench construction demands time-intensive implementation and frequent adjustments, and cannot meet the diversity and flexibility requirements of automation. AnalogTester tackles these challenges through an LLM-powered pipeline: a) domain-knowledge integration, b) paper information extraction, c) simulation scheme synthesis, and d) testbench code generation with Tsinghua Electronic Design (TED). AnalogTester has demonstrated automated testbench generation for three fundamental analog circuit types: operational amplifiers (op-amps), bandgap references (BGRs), and low-dropout regulators (LDOs), while maintaining a scalable framework that can be adapted to broader circuit topologies. Furthermore, AnalogTester can generate circuit knowledge data and a TED code corpus, establishing fundamental training datasets for LLM specialization in analog circuit design automation.
Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by $86.4\%$ when adapted to six real-world applications with few-shot examples and achieves $2.47\times$ and $1.12\times$ better offline design space exploration (DSE) performance when adapting to two different test datasets. Our open-source code is available at \href{https://github.com/UCLA-VAST/iceberg}{https://github.com/UCLA-VAST/iceberg}.
Bit-level sparsity in quantized deep neural networks (DNNs) offers significant potential for optimizing Multiply-Accumulate (MAC) operations. However, two key challenges still limit its practical exploitation. First, conventional bit-serial approaches cannot simultaneously leverage the sparsity of both factors, leading to a complete waste of one factor's sparsity. Methods designed to exploit dual-factor sparsity are still in the early stages of exploration, facing the challenge of partial product explosion. Second, the fluctuation of bit-level sparsity leads to variable cycle counts for MAC operations. Existing synchronous scheduling schemes that are suitable for dual-factor sparsity exhibit poor flexibility and still result in significant underutilization of MAC units. To address the first challenge, this study proposes a MAC unit that leverages dual-factor sparsity through the emerging particlization-based approach. The proposed design addresses the issue of partial product explosion through simple control logic, resulting in a more area- and energy-efficient MAC unit. In addition, by discarding less significant intermediate results, the design allows for further hardware simplification at the cost of minor accuracy loss. To address the second challenge, a quasi-synchronous scheme is introduced that adds cycle-level elasticity to the MAC array, reducing pipeline stalls and thereby improving MAC unit utilization. Evaluation results show that the exact version of the proposed MAC array architecture achieves a 29.2% improvement in area efficiency compared to the state-of-the-art bit-sparsity-driven architecture, while maintaining comparable energy efficiency. The approximate variant further improves energy efficiency by 7.5%, compared to the exact version. Index Terms: DNN acceleration, bit-level sparsity, MAC unit
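The sketch below is a minimal software model of dual-factor bit-level sparsity (not the paper's particlization-based circuit): a multiply is decomposed into one partial product per pair of set bits, so the work scales with popcount(a) * popcount(b) instead of the full bit-width.

```python
def set_bit_positions(v):
    """Positions of the 1-bits of a non-negative integer."""
    pos, i = [], 0
    while v:
        if v & 1:
            pos.append(i)
        v >>= 1
        i += 1
    return pos

def dual_sparse_mac(acc, a, b):
    """Accumulate a*b by pairing only the set bits of BOTH factors.

    Each pair of set bits (i, j) contributes one partial product 2**(i+j),
    so the partial-product count is popcount(a) * popcount(b) -- the
    dual-factor bit-level sparsity such a MAC unit exploits.
    """
    for i in set_bit_positions(a):
        for j in set_bit_positions(b):
            acc += 1 << (i + j)
    return acc

# 8-bit unsigned example: activation 0b00010010 (18), weight 0b01000001 (65).
acc = dual_sparse_mac(0, 0b00010010, 0b01000001)
assert acc == 18 * 65
print(acc)  # 1170, produced from only 2 * 2 = 4 partial products
```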
Analog in-memory computing (AIMC) is an energy-efficient alternative to digital architectures for accelerating machine learning and signal processing workloads. However, its energy efficiency is limited by the high energy cost of the column analog-to-digital converters (ADCs). Reducing the ADC precision is an effective approach to lowering its energy cost. However, doing so also reduces the AIMC's computational accuracy, thereby making it critical to identify the minimum precision required to meet a target accuracy. Prior works overestimate the ADC precision requirements by modeling quantization error as input-independent noise, maximizing the signal-to-quantization-noise ratio (SQNR), and ignoring the discrete nature of the ideal pre-ADC signal. We address these limitations by developing analytical expressions for estimating the compute signal-to-noise ratio (CSNR), a true metric of accuracy for AIMCs, and propose CACTUS, an algorithm to obtain CSNR-optimal ADC parameters. Using a circuit-aware behavioral model of an SRAM-based AIMC in a 28nm CMOS process, we show that for a 256-dimensional binary dot product, CACTUS reduces the ADC precision requirements by 3b while achieving 6dB higher CSNR over prior methods. We also delineate operating conditions under which our proposed CSNR-optimal ADCs outperform conventional SQNR-optimal ADCs.
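This is not CACTUS itself, but a Monte Carlo sketch of the metric it optimizes: for a 256-dimensional binary dot product, the ideal pre-ADC value is a discrete integer sum, and quantizing it with a B-bit uniform ADC and comparing against the ideal result yields an empirical compute SNR. The clipping range and input statistics below are assumptions for illustration.

```python
import numpy as np

def empirical_csnr_db(dim=256, adc_bits=5, clip=3.0, trials=20000, seed=0):
    """Estimate the compute SNR (dB) of a binary dot product read out
    through a uniform B-bit ADC that clips at +/- clip * std around the mean."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(trials, dim))
    w = rng.integers(0, 2, size=(trials, dim))
    ideal = (x * w).sum(axis=1).astype(float)   # discrete pre-ADC values

    mu, sigma = ideal.mean(), ideal.std()
    lo, hi = mu - clip * sigma, mu + clip * sigma
    levels = 2 ** adc_bits
    step = (hi - lo) / levels
    # Uniform quantizer with clipping, reconstructed at bin centers.
    q = np.clip(np.floor((ideal - lo) / step), 0, levels - 1)
    recon = lo + (q + 0.5) * step

    noise = recon - ideal
    return 10 * np.log10(np.var(ideal) / np.mean(noise ** 2))

for b in (4, 5, 6, 7):
    print(f"{b}-bit ADC: CSNR ~= {empirical_csnr_db(adc_bits=b):.1f} dB")
```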
This paper presents the design and development of a low-cost fuel dispensing system prototype based on the STM32 microcontroller and L298N motor driver. The system aims to provide an affordable and scalable solution for fuel delivery in remote or small-scale environments where conventional, high-cost systems are not feasible. The core control unit is built using an STM32 microcontroller, which manages user input through a 4x4 matrix keypad and displays operational data on a 16x4 LCD screen via I2C communication. A 12V DC pump motor is used to simulate the fuel dispensing mechanism, precisely controlled via the dual H-bridge L298N motor driver. The system is powered by an 11.1V battery and is designed for ease of deployment and portability. The keypad allows users to input the desired fuel amount, while the system ensures accurate motor runtime corresponding to the volume to be dispensed. This project demonstrates how embedded systems can be leveraged to build cost-effective, user-friendly, and energy-efficient solutions. The proposed design can be further enhanced with flow sensors, GSM connectivity, RFID cards, and payment integration for real-world applications in fuel stations or agricultural use.
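The volume-to-runtime conversion the abstract describes reduces to scaling the requested litres by a calibrated pump flow rate. The Python sketch below models that calculation (the actual firmware runs on the STM32, typically in C); the flow-rate constant is an assumed calibration value, not taken from the paper.

```python
def dispense_runtime_ms(volume_l: float, flow_rate_l_per_min: float = 1.2) -> int:
    """Milliseconds the 12V pump must run to dispense `volume_l` litres.

    flow_rate_l_per_min is a per-pump calibration constant; 1.2 L/min here
    is only an assumed example value.
    """
    if volume_l <= 0:
        raise ValueError("requested volume must be positive")
    minutes = volume_l / flow_rate_l_per_min
    return round(minutes * 60_000)

# User keys in 0.5 L on the 4x4 keypad -> pump runs for this many ms.
print(dispense_runtime_ms(0.5))  # 25000 ms at the assumed 1.2 L/min
```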
The accuracy of floating-random-walk (FRW) based capacitance extraction holds only when the recursive FRW transitions are sampled unbiasedly according to the surrounding dielectrics. Advanced technology profiles, featuring complicated non-stratified dielectrics, challenge the accuracy of existing FRW transition schemes that approximate dielectrics with stratified or eight-octant patterns. In this work, we propose an algorithm named MicroWalk, enabling accurate FRW transitions for arbitrary dielectrics while keeping high efficiency. It is provably unbiased and equivalent to using transition probabilities solved by the finite difference method, but at orders of magnitude lower cost (802$\times$ faster). An enhanced 3-D capacitance solver is developed with a hybrid strategy for complicated dielectrics, combining MicroWalk with special treatment of the first transition cube and an analytical algorithm for stratified cubes. Experiments on real-world structures show that our solver achieves a significant accuracy advantage over existing FRW solvers, while preserving high efficiency.
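MicroWalk's transition scheme is not reproduced here, but the floating-random-walk principle it builds on can be illustrated with a classic walk-on-spheres estimator for Laplace's equation in a single homogeneous dielectric: each walk hops to a random point on the largest circle that fits inside the domain until it reaches the boundary, and averaging the boundary potentials gives an unbiased estimate of the potential at the start point. The geometry and boundary values below are illustrative assumptions.

```python
import math
import random

def walk_on_spheres(x, y, eps=1e-3, max_steps=10_000):
    """One floating-random-walk sample of the potential at (x, y) inside the
    unit square, with the boundary held at 1 V on the edge y = 1 and 0 V on
    the other three edges (homogeneous dielectric, i.e. Laplace's equation)."""
    for _ in range(max_steps):
        # Radius of the largest circle centred at (x, y) inside the square.
        r = min(x, 1 - x, y, 1 - y)
        if r < eps:
            # Close enough to the boundary: return the nearest edge's potential.
            return 1.0 if (1 - y) == r else 0.0
        theta = random.uniform(0.0, 2.0 * math.pi)
        x += r * math.cos(theta)
        y += r * math.sin(theta)
    return 0.0  # walk did not terminate (practically never happens)

def potential(x, y, walks=20_000):
    return sum(walk_on_spheres(x, y) for _ in range(walks)) / walks

random.seed(1)
print(f"phi(0.5, 0.5) ~= {potential(0.5, 0.5):.3f}")  # analytic value is 0.25
```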
System-level design, once the province of board designers, has now become a central concern for chip designers. Because chip design is a less forgiving design medium -- design cycles are longer and mistakes are harder to correct -- system-on-chip designers need a more extensive tool suite than board designers typically use, and a variety of tools and methodologies have been developed for system-level design of systems-on-chips (SoCs). System-level design is less amenable to synthesis than are logic or physical design. As a result, system-level tools concentrate on modeling, simulation, design space exploration, and design verification. The goal of modeling is to correctly capture the system's operational semantics, which helps with both implementation and verification. The study of models of computation provides a framework for the description of digital systems. Not only do we need to understand a particular style of computation, such as dataflow, but we also need to understand how different models of computation can reliably communicate with each other. Design space exploration tools, such as hardware/software co-design, develop candidate designs to understand trade-offs. Simulation can be used not only to verify functional correctness but also to supply performance and power/energy information for design analysis. This chapter employs two applications -- video and neural networks -- as examples. Both are leading-edge applications that illustrate many important aspects of system-level design.
Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and memory-intensive operations in feedforward network (FFN) and multi-head attention (MHA) layers. While existing accelerators offload LLM inference to expensive heterogeneous computing systems, they fail to exploit the significant sparsity inherent in LLM operations, leaving hardware resources underutilized. We propose SLIM, an algorithm-hardware co-design optimized for sparse LLM serving on edge devices. SLIM exploits LLM sparsity through an adaptive thresholding algorithm that enables runtime-configurable sparsity with negligible accuracy loss, fetching only activated neurons to dramatically reduce data movement. Our heterogeneous hardware architecture strategically combines near-storage processing (NSP) and processing-in-memory (PIM): FFN weights are stored in high-density 3D NAND and computed using NSP units, while memory-intensive MHA operations are processed in PIM modules. This design significantly reduces memory footprint, data movement, and energy consumption. Our comprehensive evaluation demonstrates SLIM's effectiveness, achieving 13-18x throughput improvements over SSD-GPU systems and 9-10x better energy efficiency over DRAM-GPU systems while maintaining low latency, making cost-effective LLM deployment viable for edge computing environments.
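A minimal NumPy sketch, not SLIM's hardware, of the thresholding idea the abstract describes for an FFN layer: gate activations below a runtime-configurable threshold are treated as inactive, and only the corresponding rows of the FFN weights need to be fetched from storage, which is where the data-movement savings come from. The threshold value and layer sizes are assumptions.

```python
import numpy as np

def sparse_ffn(x, w_up, w_down, threshold=0.5):
    """Gated FFN forward pass that fetches only activated neurons.

    x:         (d_model,) input token
    w_up:      (d_model, d_ffn) up projection
    w_down:    (d_ffn, d_model) down projection
    threshold: runtime-configurable activation threshold
    Returns the output and the fraction of FFN neurons actually used.
    """
    h = np.maximum(x @ w_up, 0.0)            # ReLU-gated hidden activations
    active = np.flatnonzero(h > threshold)   # indices of activated neurons
    # Only these rows of w_down (and these activations) need to be read;
    # the remaining weights stay in storage untouched.
    y = h[active] @ w_down[active, :]
    return y, active.size / h.size

rng = np.random.default_rng(0)
d_model, d_ffn = 256, 1024
x = rng.standard_normal(d_model)
w_up = rng.standard_normal((d_model, d_ffn)) / np.sqrt(d_model)
w_down = rng.standard_normal((d_ffn, d_model)) / np.sqrt(d_ffn)

y, density = sparse_ffn(x, w_up, w_down)
print(f"fraction of FFN neurons fetched: {density:.2f}")
```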
Edge inference for large language models (LLMs) offers secure, low-latency, and cost-effective inference solutions. We emphasize that an edge accelerator should achieve high area efficiency and minimize external memory access (EMA) during the memory-bound decode stage, while maintaining high energy efficiency during the compute-intensive prefill stage. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array (HSA) architecture that optimizes inference efficiency in both stages. To further reduce EMA, we adopt MXINT4 weight quantization and propose an optimized dataflow tailored for HSA, ensuring negligible dequantization overhead and achieving 100% hardware utilization with minimal accuracy loss under edge DRAM bandwidth constraints. For non-linear operations, we incorporate optimized root mean square normalization (RMSNorm) and rotary position embedding (RoPE) units, reducing their latency, area, and memory access overhead while enabling end-to-end inference on our accelerator. Our solution achieves 247/117 token/s/mm2 when running a 1.3B LLM in long-input/long-output scenarios, providing a >2.45x/13.5x improvement over existing approaches, while maintaining superior energy efficiency in token generation.
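A simplified NumPy sketch of MX-style block weight quantization as the abstract uses it: weights are grouped into small blocks that share one power-of-two scale, and each element is stored as a 4-bit signed integer, so dequantization is just a shift that the dataflow can fold into the MAC datapath. This follows the spirit of the OCP MX formats but is not a bit-exact MXINT4 implementation.

```python
import numpy as np

def mxint4_quantize(w, block=32):
    """Quantize a 1-D weight vector into blocks of `block` elements that share
    a power-of-two scale, with 4-bit signed integer elements in [-8, 7].
    Returns (int4 codes, per-block exponents). Simplified MX-style scheme."""
    w = w.reshape(-1, block)
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1e-12, max_abs)
    # Choose the shared exponent so the largest element maps to at most +/-7.
    exp = np.ceil(np.log2(max_abs / 7.0))
    scale = 2.0 ** exp
    codes = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return codes, exp.astype(np.int8)

def mxint4_dequantize(codes, exp):
    return (codes.astype(np.float32) * (2.0 ** exp.astype(np.float32))).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32) * 0.05
codes, exp = mxint4_quantize(w)
w_hat = mxint4_dequantize(codes, exp)
err = np.abs(w - w_hat).max() / np.abs(w).max()
print(f"4-bit codes + one exponent per 32 weights, max rel. error {err:.3f}")
```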
The rapid expansion of data centers (DCs) to support large-scale AI and scientific workloads is driving unsustainable growth in energy consumption and greenhouse gas emissions. While successive generations of hardware platforms have improved performance and energy efficiency, the question remains whether new, more efficient platforms can realistically offset the rising emissions associated with increasing demand. Prior studies often overlook the complex trade-offs in such transitions by failing to account for both the economic incentives and the projected compute demand growth over the operational lifetime of the devices. In response, we present CEO-DC, an integrated model and decision-making methodology for Carbon and Economy Optimization in Data Centers. CEO-DC models the competing forces of cost, carbon, and compute demand to guide optimal platform procurement and replacement strategies. We propose metrics to steer procurement, platform design, and policy decisions toward sustainable DC technologies. Given current platform trends, our AI case study using CEO-DC shows that upgrading legacy devices on a 4-year cycle reduces total emissions. However, these upgrades fail to scale with DC demand growth trends without increasing total emissions in over 44% of cases, and require economic incentives for adoption in over 72% of cases. Furthermore, current carbon prices are insufficient to motivate upgrades in 9 out of the 14 countries with the highest number of DCs globally. We also find that optimizing platforms for energy efficiency at the expense of latency can increase the carbon price required to justify their adoption. In summary, CEO-DC provides actionable insights for DC architects, platform designers, and policymakers by timing legacy platform upgrades, constraining DC growth to sustainable levels, optimizing platform performance-to-cost ratios, and increasing incentives.
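The full CEO-DC model is not reproduced here, but the core trade-off it reasons about can be sketched in a few lines: replacing a legacy platform pays off in carbon terms only when the operational emissions saved by the more efficient platform over its lifetime exceed the embodied emissions of manufacturing it, and a carbon price converts that gap into an economic signal. All numbers below are placeholder assumptions, not the paper's data.

```python
def upgrade_break_even(
    legacy_kwh_per_year: float,
    new_kwh_per_year: float,
    embodied_kgco2_new: float,
    grid_kgco2_per_kwh: float = 0.4,        # assumed grid carbon intensity
    lifetime_years: float = 4.0,            # 4-year cycle, as in the case study
    carbon_price_per_tonne: float = 80.0,   # assumed carbon price
):
    """Compare keeping a legacy platform vs. buying a more efficient one.

    Returns the lifetime carbon saved (kgCO2; negative means the upgrade
    emits more) and the value of that saving at the given carbon price."""
    saved_energy = (legacy_kwh_per_year - new_kwh_per_year) * lifetime_years
    carbon_saved = saved_energy * grid_kgco2_per_kwh - embodied_kgco2_new
    carbon_value = carbon_saved / 1000.0 * carbon_price_per_tonne
    return carbon_saved, carbon_value

# Placeholder example: a platform that halves annual energy use but costs
# 1500 kgCO2 of embodied emissions to manufacture.
saved, value = upgrade_break_even(8000.0, 4000.0, 1500.0)
print(f"carbon saved over lifetime: {saved:.0f} kgCO2 "
      f"(worth ~{value:.0f} at the assumed carbon price)")
```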
A new set of hardware merge sort devices is introduced here, which merge multiple sorted input lists into a single sorted output list in a fast and efficient manner. In each merge sorter, the values from the sorted input lists are arranged in an input 2-D setup array, but with the order of each sorted input list offset from the order of each of the other sorted input lists. In these new devices, called List Offset Merge Sorters (LOMS), a minimal set of column sort stages alternating with row sort stages process the input setup array into a final output array, now in the defined sorted order. LOMS 2-way sorters, which merge 2 sorted input lists, require only 2 merge stages and are significantly faster than Kenneth Batcher's previous state-of-the-art 2-way merge devices, Bitonic Merge Sorters and Odd-Even Merge Sorters. LOMS 2-way sorters utilize the recently-introduced Single-Stage 2-way Merge Sorters (S2MS) in their first stage. Both LOMS and S2MS devices can merge any mixture of input list sizes, while Batcher's merge sorters are difficult to design unless the 2 input lists are of equal, power-of-2 length. By themselves, S2MS devices are the fastest 2-way merge sorters when implemented in this study's target FPGA devices, but they tend to use a large number of LUT resources. LOMS 2-way devices use fewer resources than comparable S2MS devices, enabling some large LOMS devices to be implemented in a given FPGA when comparable S2MS devices cannot fit in that FPGA. A List Offset 2-way sorter merges 2 lists, each with 32 values, into a sorted output list of those 64 values in 2.24 ns, a speedup of 2.63 versus a comparable Batcher device. A LOMS 3-way merge sorter, merging 3 sorted input lists with 7 values, fully merges the 21 values in 3.4 ns, a speedup of 1.36 versus the comparable state-of-the-art 3-way merge device.
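The LOMS column/row sort stages themselves are not reproduced here, but the baseline they are compared against, Batcher's odd-even merge, is a well-known comparator network. The sketch below generates and applies that network in software for an array whose two sorted halves have equal, power-of-two length, which also makes the restriction mentioned above concrete.

```python
def odd_even_merge_comparators(lo, hi, r=1):
    """Yield the compare-exchange pairs of Batcher's odd-even merge network
    for a[lo..hi] (inclusive) whose two halves are already sorted.
    Requires hi - lo + 1 to be a power of two."""
    step = r * 2
    if step < hi - lo:
        yield from odd_even_merge_comparators(lo, hi, step)      # even chain
        yield from odd_even_merge_comparators(lo + r, hi, step)  # odd chain
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def batcher_merge(a):
    """Merge the two sorted halves of `a` in place using the network above."""
    for i, j in odd_even_merge_comparators(0, len(a) - 1):
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

# Two sorted 8-value input lists, concatenated.
data = [1, 4, 6, 9, 12, 15, 20, 31] + [2, 3, 7, 8, 10, 22, 25, 30]
print(batcher_merge(data))  # fully sorted 16-value output list
```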
As transistor counts in a single chip exceed tens of billions, the complexity of RTL-level simulation and verification has grown exponentially, often extending simulation campaigns to several months. In industry practice, RTL simulation is divided into two phases: functional debug and system validation. While system validation demands high simulation speed and is typically accelerated using FPGAs, functional debug relies on rapid compilation, rendering multi-core CPUs the primary choice. However, the limited simulation speed of CPUs has become a major bottleneck. To address this challenge, we propose CCSS, a scalable multi-core RTL simulation platform that achieves both fast compilation and high simulation throughput. CCSS accelerates combinational logic computation and sequential logic synchronization through specialized architecture and compilation strategies. It employs a balanced DAG partitioning method and efficient Boolean computation cores for combinational logic, and adopts a low-latency network-on-chip (NoC) design to synchronize sequential states across cores efficiently. Experimental results show that CCSS delivers up to 12.9x speedup over state-of-the-art multi-core simulators.
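CCSS's partitioning and NoC are hardware features that cannot be shown in a few lines, but the combinational-logic workload it accelerates is easy to model: a compiled simulator levelizes the gate DAG once (a topological order) and then evaluates each gate once per cycle. The toy netlist and two-input gate set below are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Toy gate-level netlist: each gate maps to (operation, input nets).
GATES = {
    "n1": ("AND", ("a", "b")),
    "n2": ("XOR", ("a", "b")),
    "n3": ("OR",  ("n1", "c")),
    "out": ("XOR", ("n2", "n3")),
}
OPS = {"AND": lambda x, y: x & y, "OR": lambda x, y: x | y, "XOR": lambda x, y: x ^ y}

def levelize(gates):
    """Topologically order the combinational DAG so every gate is evaluated
    after its fan-in -- the levelization a compiled simulator does once."""
    ts = TopologicalSorter({name: [d for d in deps if d in gates]
                            for name, (_, deps) in gates.items()})
    return list(ts.static_order())

def evaluate(gates, order, primary_inputs):
    """Evaluate one combinational cycle given primary input values."""
    nets = dict(primary_inputs)
    for name in order:
        op, deps = gates[name]
        nets[name] = OPS[op](*(nets[d] for d in deps))
    return nets

order = levelize(GATES)                                     # "compile" once
nets = evaluate(GATES, order, {"a": 1, "b": 1, "c": 0})     # run per cycle
print(order, nets["out"])                                   # out = 1 here
```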
Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV suffers from complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x in GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
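The accelerator itself is hardware, but the MatMul-plus-col2IM formulation it implements can be shown directly: a single GEMM produces all K x K output patches at once, and col2im scatter-adds them into the output feature map, which is exactly where the overlapping sums appear. Below is a NumPy sketch under assumed tensor layouts (PyTorch-style ConvTranspose2d weights), not the accelerator's dataflow.

```python
import numpy as np

def tconv_gemm_col2im(x, w, stride=2, pad=1):
    """Transposed convolution computed as one MatMul followed by col2im.

    x: (C_in, H, W) input feature map
    w: (C_in, C_out, K, K) weights (ConvTranspose2d layout)
    """
    c_in, h, w_in = x.shape
    _, c_out, k, _ = w.shape
    h_out = (h - 1) * stride - 2 * pad + k
    w_out = (w_in - 1) * stride - 2 * pad + k

    # 1) MatMul: every input pixel produces a full (C_out, K, K) patch.
    cols = w.reshape(c_in, c_out * k * k).T @ x.reshape(c_in, h * w_in)

    # 2) col2im: scatter-add each patch into the output (overlapping sums).
    out = np.zeros((c_out, h_out, w_out), dtype=cols.dtype)
    patches = cols.reshape(c_out, k, k, h, w_in)
    for i in range(h):
        for j in range(w_in):
            for ki in range(k):
                for kj in range(k):
                    oy, ox = i * stride - pad + ki, j * stride - pad + kj
                    if 0 <= oy < h_out and 0 <= ox < w_out:
                        out[:, oy, ox] += patches[:, ki, kj, i, j]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16)).astype(np.float32)
w = rng.standard_normal((8, 4, 4, 4)).astype(np.float32)
y = tconv_gemm_col2im(x, w, stride=2, pad=1)
print(y.shape)  # (4, 32, 32): 2x upscaling, as in GAN generators
```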
Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale due to growing inter-GPU communication overheads. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects, collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion), and introduce a hybrid CXL-over-XLink design to reduce long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.
Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose OptoViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. OptoViT features a hybrid electronic-photonic architecture, where the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored for photonic constraints. Experiments spanning device fabrication, circuit- and architecture-level co-design, and classification, detection, and video tasks demonstrate that OptoViT achieves 100.4 KFPS/W with up to 84% energy savings and less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
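MGNet itself is a learned network, but the effect of its mask on the ViT workload is easy to sketch: only patches flagged as regions of interest (plus the class token) are kept before encoding, so the matrices handed to the attention and MLP blocks shrink with the masked-out fraction. The shapes, scores, and threshold below are assumptions for illustration.

```python
import numpy as np

def prune_patches(patch_tokens, cls_token, roi_scores, keep_threshold=0.5):
    """Drop patch tokens whose region-of-interest score is below threshold.

    patch_tokens: (N, D) embedded patches of the current frame
    cls_token:    (1, D) class token, always kept
    roi_scores:   (N,) mask-network scores in [0, 1]
    Returns the reduced token sequence and the indices that were kept."""
    keep = np.flatnonzero(roi_scores >= keep_threshold)
    tokens = np.concatenate([cls_token, patch_tokens[keep]], axis=0)
    return tokens, keep

rng = np.random.default_rng(0)
n_patches, dim = 196, 384                  # 14x14 patches, ViT-S width (assumed)
patches = rng.standard_normal((n_patches, dim)).astype(np.float32)
cls = rng.standard_normal((1, dim)).astype(np.float32)
scores = rng.uniform(size=n_patches)       # stand-in for the mask network's output

tokens, keep = prune_patches(patches, cls, scores)
print(f"{keep.size}/{n_patches} patches kept -> attention cost scales by "
      f"roughly {(keep.size / n_patches) ** 2:.2f}")
```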