Browse, search and filter the latest cybersecurity research papers from arXiv
Write-ahead logs (WALs) are a fundamental fault-tolerance technique found in many areas of computer science. A WAL must be reliable while maintaining high performance, because every operation is written to the WAL to ensure its durability. Without reliability a WAL is useless, since its entire utility lies in its ability to recover data after a failure. In this paper we describe our experience creating a prototype user-space WAL in Rust. We found Rust easy to use, compact, and equipped with a very rich set of libraries. More importantly, we found the overhead to be minimal, with the WAL prototype operating essentially at the expected performance of the underlying stable-memory device.
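To make the mechanism concrete, here is a minimal sketch of the write-ahead-log pattern in Python (illustrative only; the paper's prototype is written in Rust and its record format is not described here): each record is framed, checksummed, and fsynced before the operation is acknowledged, and recovery replays valid records and stops at the first torn or corrupted one.

```python
# Minimal sketch of the write-ahead-log idea (not the paper's Rust prototype):
# every record is made durable before it is applied, so a crash can be
# recovered by replaying the log. Framing and names are illustrative only.
import os, struct, zlib

class TinyWAL:
    def __init__(self, path):
        self.f = open(path, "ab+")

    def append(self, payload: bytes):
        record = struct.pack("<I", len(payload)) + payload
        record += struct.pack("<I", zlib.crc32(record))
        self.f.write(record)
        self.f.flush()
        os.fsync(self.f.fileno())   # durability before acknowledging the operation

    def replay(self):
        self.f.seek(0)
        data = self.f.read()
        off = 0
        while off + 4 <= len(data):
            (n,) = struct.unpack_from("<I", data, off)
            end = off + 4 + n
            if end + 4 > len(data):
                break                      # torn tail from a crash: stop here
            crc = struct.unpack_from("<I", data, end)[0]
            if zlib.crc32(data[off:end]) != crc:
                break                      # corrupted record: discard the rest
            yield data[off + 4:end]
            off = end + 4
```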
Securing sensitive cloud workloads requires composing confidential virtual machines (CVMs) with nested enclaves or sandboxes. Unfortunately, each new isolation boundary adds ad-hoc access control mechanisms, hardware extensions, and trusted software. This escalating complexity bloats the TCB, complicates end-to-end attestation, and leads to fragmentation across platforms and cloud service providers (CSPs). We introduce a unified isolation model that delegates enforceable, composable, and attestable isolation to a single trusted security monitor: Tyche. Tyche provides an API for partitioning, sharing, attesting, and reclaiming resources through its core abstraction, trust domains (TDs). To provide fine-grain isolation, TDs can recursively create and manage sub-TDs. Tyche captures these relationships in attestations, allowing cloud tenants to reason about end-to-end security. TDs serve as the building blocks for constructing composable enclaves, sandboxes, and CVMs. Tyche runs on commodity x86_64 without hardware security extensions and can maintain backward compatibility with existing software. We provide an SDK to run and compose unmodified workloads as sandboxes, enclaves, and CVMs with minimal overhead compared to native Linux execution. Tyche supports complex cloud scenarios, such as confidential inference with mutually distrustful users, model owners, and CSPs. An additional RISC-V prototype demonstrates Tyche's portability across platforms.
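As an illustration of the trust-domain abstraction, the hypothetical Python sketch below (the names and API are assumptions, not Tyche's actual interface) shows how a domain might recursively carve sub-domains out of its own resources and report the nesting chain in an attestation.

```python
# Hypothetical illustration (not Tyche's actual API) of the trust-domain idea:
# a domain can carve sub-domains out of its own resources, and the nesting
# relationship is captured so a remote party can reason about it end to end.
from dataclasses import dataclass, field

@dataclass
class TrustDomain:
    name: str
    memory_pages: int
    parent: "TrustDomain | None" = None
    children: list = field(default_factory=list)

    def create_sub_td(self, name: str, pages: int) -> "TrustDomain":
        assert pages <= self.memory_pages, "cannot grant more than we own"
        self.memory_pages -= pages                 # partition, do not share by default
        child = TrustDomain(name, pages, parent=self)
        self.children.append(child)
        return child

    def attestation(self) -> list:
        # Walk up to the root so the report reflects the whole nesting chain.
        chain, td = [], self
        while td is not None:
            chain.append(f"{td.name}:{td.memory_pages}p")
            td = td.parent
        return list(reversed(chain))

cvm = TrustDomain("cvm", memory_pages=1024)
enclave = cvm.create_sub_td("enclave", pages=64)
print(enclave.attestation())   # ['cvm:960p', 'enclave:64p']
```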
KV cache accelerates LLM inference by avoiding redundant computation, at the expense of memory. To support larger KV caches, prior work extends GPU memory with CPU memory via CPU offloading, which involves swapping KV cache between GPU and CPU memory. However, because the cache updates dynamically, such swapping incurs high CPU memory traffic. We make a key observation that model parameters remain constant during runtime, unlike the dynamically updated KV cache. Building on this, we introduce MIRAGE, which avoids KV cache swapping by remapping, and thereby repurposing, the memory allocated to model parameters for the KV cache. This parameter remapping is especially beneficial in multi-tenant environments, where the memory used for the parameters of inactive models can be reclaimed more aggressively. Exploiting the high CPU-GPU bandwidth offered by modern hardware, such as the NVIDIA Grace Hopper Superchip, we show that MIRAGE significantly outperforms state-of-the-art solutions, reducing tail time-between-token latency by 44.8%-82.5% and tail time-to-first-token latency by 20.7%-99.3%, while delivering 6.6%-86.7% higher throughput than vLLM.
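A conceptual sketch of the remapping idea (an assumption-laden illustration, not MIRAGE's implementation): pages backing an inactive tenant's parameters are handed to the KV cache allocator instead of swapping KV blocks out to CPU memory, since constant weights can always be restored from CPU later.

```python
# Conceptual sketch (not MIRAGE's implementation) of repurposing parameter
# memory for KV cache: pages backing an inactive model's weights are handed
# to the KV allocator instead of swapping KV blocks to CPU memory.
class GpuPagePool:
    def __init__(self, total_pages):
        self.free = set(range(total_pages))
        self.owner = {}                      # page -> "params:<model>" or "kv"

    def pin_params(self, model, pages):
        grant = [self.free.pop() for _ in range(pages)]
        for p in grant:
            self.owner[p] = f"params:{model}"
        return grant

    def reclaim_inactive_params(self, model):
        # Remap: weights are constant and can be restored from CPU later,
        # so their pages can back the dynamically growing KV cache now.
        reclaimed = [p for p, o in self.owner.items() if o == f"params:{model}"]
        for p in reclaimed:
            self.owner[p] = "kv"
        return reclaimed

pool = GpuPagePool(total_pages=16)
pool.pin_params("model_b", pages=8)          # tenant B is currently idle
kv_pages = pool.reclaim_inactive_params("model_b")
print(f"{len(kv_pages)} pages now serve the KV cache")
```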
Access control is a cornerstone of computer security that prevents unauthorised access to resources. In this paper, we study access control in quantum computer systems. We present the first explicit scenario of a security breach when a classically secure access control system is straightforwardly adapted to the quantum setting. The breach ultimately stems from the fact that quantum mechanics allows entanglement and thereby violates the Mermin inequality, a multi-party variant of the celebrated Bell inequality. This reveals a threat from quantum entanglement to access control if existing computer systems are integrated with quantum computing. To protect against such threats, we propose several new models of quantum access control and rigorously analyse their security, flexibility and efficiency.
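For reference, the three-party form of the Mermin inequality reads as follows: any local hidden-variable (classical) model obeys the bound of 2, while measurements on a GHZ state reach the algebraic maximum of 4.

```latex
% Three-party Mermin inequality: local hidden-variable correlations obey the
% bound below, while a GHZ state attains the value 4.
\[
  \bigl|\langle A_1 B_2 C_2\rangle + \langle A_2 B_1 C_2\rangle
      + \langle A_2 B_2 C_1\rangle - \langle A_1 B_1 C_1\rangle\bigr| \le 2
\]
```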
Large Language Models (LLMs) are increasingly being integrated into applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices is challenging because of their high demand for computation, memory, and ultimately energy. Current mobile LLM frameworks exercise three power-hungry components (CPU, GPU, and memory) even when running primarily-GPU LLM models, yet the optimized DVFS governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious to each other. Motivated by this observation, we first measure the energy efficiency of a state-of-the-art LLM framework running various LLM models on mobile phones, and find that the triplet of mobile governors results in up to 40.4% longer prefilling and decoding latency compared to optimal combinations of CPU, GPU, and memory frequencies at the same energy consumption for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes this inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation on a ShareGPT dataset shows that FUSE reduces time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average, respectively, at the same energy per token across various mobile LLM models.
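The kind of decision such a unified governor has to make can be sketched as a small search over profiled frequency combinations; the code below is an illustration under made-up profiling data, not FUSE's actual algorithm.

```python
# Illustrative sketch (not FUSE itself): given profiled latency/energy for
# CPU/GPU/memory frequency combinations, pick the combination that minimizes
# inference latency without exceeding the energy of the default governors.
from itertools import product

def pick_frequencies(profile, energy_budget_j):
    """profile: dict mapping (cpu_f, gpu_f, mem_f) -> (latency_s, energy_j)."""
    feasible = {k: v for k, v in profile.items() if v[1] <= energy_budget_j}
    if not feasible:
        return None
    return min(feasible, key=lambda k: feasible[k][0])

# Toy profile for one (prefill_len, decode_len) bucket; the numbers are made up.
profile = {
    (c, g, m): (1.0 / (0.4 * c + 0.4 * g + 0.2 * m), 0.9 * c + 0.8 * g + 0.3 * m)
    for c, g, m in product([0.6, 1.0, 1.4], repeat=3)
}
best = pick_frequencies(profile, energy_budget_j=2.5)
print("chosen (cpu, gpu, mem) frequencies:", best)
```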
This analysis focuses on a single Azure-hosted virtual machine at 52.230.23.114 that the adversary converted into an all-in-one delivery, staging, and Command-and-Control node. The host advertises an out-of-date Apache 2.4.52 instance whose open directory exposes phishing lures, PowerShell loaders, reflective shellcode, compiled Havoc Demon implants, and a toolbox of lateral-movement binaries; the same server also answers on ports 8443 and 80 for encrypted beacon traffic. The web tier is riddled with publicly documented critical vulnerabilities that would have allowed initial code execution had the attackers not already owned the device. Initial access is delivered through an HTML file that, once de-obfuscated, closely mimics Google's "Unusual sign-in attempt" notification and funnels victims toward credential collection. A PowerShell command follows: it disables AMSI in memory, downloads a Base64-encoded stub, allocates RWX pages, and starts the shellcode without ever touching disk. That stub reconstructs a DLL in memory using the reflective-loader technique and hands control to the Havoc Demon implant. Every Demon variant, 32- and 64-bit alike, talks to the same backend, resolves Windows APIs with hashed look-ups, and hides its activity behind indirect syscalls. Runtime telemetry shows interest in the registry under Image File Execution Options, deliberate queries to Software Restriction Policy keys, and heavy use of crypto DLLs to protect payloads and C2 traffic. The attacker toolkit further contains Chisel, PsExec, Doppelganger, and Whisker, some of them re-compiled under user directories that leak the developer personas tonzking123 and thobt. Collectively, the findings paint a picture of a technically adept actor who values rapid re-tooling over deep operational security, leaning on Havoc's modularity and on legitimate cloud services to blend malicious flows into ordinary enterprise traffic.
Operating systems provide many sandboxing mechanisms to limit what resources applications can access; however, using these mechanisms sometimes requires developers to refactor their code to fit the sandboxing model. In this work, we investigate what makes existing sandboxing mechanisms challenging to apply to certain types of applications, and propose Threadbox, a sandboxing mechanism that enables modular and independent sandboxes and can be applied to individual threads and to sandboxing specific functions. We present case studies to illustrate the applicability of the idea and discuss its limitations.
The cost of communication between the operating system kernel and user applications has long limited improvements in software performance. Traditionally, operating systems encourage software developers to use the system call interface to transfer (or initiate the transfer of) data between user applications and the kernel. This approach not only hurts performance at the software level, due to memory copies between user-space and kernel-space address spaces, but also hurts system performance at the microarchitectural level by flushing processor pipelines and other microarchitectural state. In this paper, we propose a new communication interface between user applications and the kernel based on a shared memory region mapped into both the user application's and the kernel's address space. We acknowledge the danger of breaking the golden rule of user-kernel address-space isolation, so we couple a uBPF VM (user-space BPF virtual machine) with the shared memory to control access to the kernel's memory from the user's application. User-space programs can then access the shared memory under the supervision of the uBPF VM (and with the kernel's blessing of its shared library) to gain non-blocking data transfer to and from the kernel's memory space. We test our implementation in several use cases and find that this mechanism brings speedups over traditional user-kernel information-passing mechanisms.
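The access pattern can be illustrated with a small Python sketch (conceptual only; the plain validator below stands in for the checks a uBPF program would enforce, and Python's shared_memory module stands in for the user-kernel mapping): the application touches the shared region only through a wrapper that validates each access.

```python
# Conceptual sketch of the shared-region idea: a validating wrapper mediates
# every access, in place of copying data across the user/kernel boundary
# with syscalls. The check below is a stand-in for a uBPF-enforced policy.
from multiprocessing import shared_memory

REGION_SIZE = 4096

def checked_write(shm, offset, payload: bytes):
    # Stand-in for the policy a uBPF program would enforce before the
    # application touches kernel-shared memory.
    if offset < 0 or offset + len(payload) > REGION_SIZE:
        raise PermissionError("write outside the granted window")
    shm.buf[offset:offset + len(payload)] = payload

shm = shared_memory.SharedMemory(create=True, size=REGION_SIZE)
try:
    checked_write(shm, 0, b"request: stats\n")
    print(bytes(shm.buf[:15]))
finally:
    shm.close()
    shm.unlink()
```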
A sophisticated malspam campaign was recently uncovered targeting Latin American countries, with a particular focus on Brazil. This operation utilizes a highly deceptive phishing email to trick users into executing a malicious MSI file, initiating a multi-stage infection. The core of the attack leverages DLL side-loading, where a legitimate executable from Valve Corporation is used to load a trojanized DLL, thereby bypassing standard security defenses. Once active, the malware, a variant of QuasarRAT known as BlotchyQuasar, is capable of a wide range of malicious activities. It is designed to steal sensitive browser-stored credentials and banking information, the latter through fake login windows mimicking well-known Brazilian banks. The threat establishes persistence by modifying the Windows registry, captures user keystrokes through keylogging, and exfiltrates stolen data to a Command-and-Control (C2) server using encrypted payloads. Despite its advanced capabilities, the malware code exhibits signs of rushed development, with inefficiencies and poor error handling that suggest the threat actors prioritized rapid deployment over meticulous design. Nonetheless, the campaign's extensive reach and sophisticated mechanisms pose a serious and immediate threat to the targeted regions, underscoring the need for robust cybersecurity defenses.
Advanced Large Language Models (LLMs) have achieved impressive performance across a wide range of complex and long-context natural language tasks. However, performing long-context LLM inference locally on a commodity GPU (a PC) under privacy constraints remains challenging due to the increasing memory demands of the key-value (KV) cache. Existing systems typically identify important tokens and selectively offload their KV data between GPU and CPU memory. On a commodity GPU, the limited memory forces KV data to be offloaded to disk as well, but the process is bottlenecked by token-importance evaluation overhead and the disk's low bandwidth. In this paper, we present LeoAM, the first efficient importance-aware long-context LLM inference system for a single commodity GPU with adaptive hierarchical GPU-CPU-Disk KV management. Our system employs an adaptive KV management strategy that partitions KV data into variable-sized chunks based on the skewed distribution of attention weights across different layers, reducing computational and additional transmission overheads. Moreover, we propose a lightweight KV abstract method, which minimizes transmission latency by storing and extracting a small KV abstract of each chunk on disk instead of the full KV data. LeoAM also leverages dynamic compression and pipelining to further accelerate inference. Experimental results demonstrate that LeoAM achieves an average inference latency speedup of 3.46x while maintaining comparable LLM response quality. In scenarios with larger batch sizes, it achieves up to a 5.47x speedup.
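The per-chunk abstract idea can be sketched as follows (the summary function and scoring are assumptions for illustration, not LeoAM's exact method): a tiny summary of each chunk's keys stays resident, the current query is scored against the summaries, and only the top-ranked chunks' full KV data is fetched from disk.

```python
# Illustrative sketch of a per-chunk "KV abstract": keep one small summary
# vector per chunk resident, rank chunks against the current query with the
# summaries, and fetch the full KV of just the top-ranked chunks from disk.
import numpy as np

def make_abstract(keys: np.ndarray) -> np.ndarray:
    # One representative vector per chunk, e.g. the mean key.
    return keys.mean(axis=0)

def select_chunks(query: np.ndarray, abstracts, top_k: int):
    scores = [float(query @ a) for a in abstracts]
    return sorted(range(len(abstracts)), key=lambda i: scores[i], reverse=True)[:top_k]

rng = np.random.default_rng(0)
chunks = [rng.standard_normal((64, 128)) for _ in range(32)]   # keys per chunk
abstracts = [make_abstract(k) for k in chunks]                 # tiny, stays in memory
query = rng.standard_normal(128)
print("chunks to load from disk:", select_chunks(query, abstracts, top_k=4))
```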
As the demand for on-device Large Language Model (LLM) inference grows, energy efficiency has become a major concern, especially for battery-limited mobile devices. Our analysis shows that the memory-bound LLM decode phase dominates energy use, yet most existing work focuses on accelerating the prefill phase and neglects energy. We introduce Adaptive Energy-Centric Core Selection (AECS) and integrate it into MNN to create an energy-efficient version, MNN-AECS, the first engine-level solution for energy-efficient LLM decoding that requires neither root access nor OS modifications. MNN-AECS is designed to reduce LLM decoding energy while keeping decode speed within an acceptable slowdown threshold by dynamically selecting low-power CPU cores. MNN-AECS is evaluated across 5 Android and 2 iOS devices on 5 popular LLMs of various sizes. Compared to the original MNN, MNN-AECS cuts energy use by 23% without slowdown, averaged over all 7 devices and 4 datasets. Against other engines, including llama.cpp, executorch, mllm, and MediaPipe, MNN-AECS delivers 39% to 78% energy savings and 12% to 363% speedups on average.
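The core-selection logic can be sketched roughly as below (an illustration under toy measurements, not MNN-AECS's implementation): among candidate core sets, pick the lowest-power set whose measured decode speed stays within the allowed slowdown of the fastest set.

```python
# Simplified sketch of energy-centric core selection: among candidate CPU
# core sets, pick the lowest-power set whose measured decode speed stays
# within the allowed slowdown of the fastest set.
def select_cores(candidates, measure_tokens_per_s, slowdown_threshold=0.1):
    """candidates: list of (core_set, relative_power); measure_* is a callback."""
    speeds = {cores: measure_tokens_per_s(cores) for cores, _ in candidates}
    best_speed = max(speeds.values())
    feasible = [(cores, power) for cores, power in candidates
                if speeds[cores] >= (1 - slowdown_threshold) * best_speed]
    return min(feasible, key=lambda cp: cp[1])[0]   # lowest power among feasible

# Toy measurements: big cores are fast and power-hungry, little cores the opposite.
fake_speed = {("big",): 20.0, ("little",): 14.0, ("big", "little"): 21.0}
chosen = select_cores(
    [(("big",), 3.0), (("little",), 1.0), (("big", "little"), 4.0)],
    measure_tokens_per_s=lambda cores: fake_speed[cores],
    slowdown_threshold=0.35,
)
print("decode on cores:", chosen)   # ('little',): within the slowdown budget
```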
Networked mission-critical applications (e.g., avionic control and industrial automation systems) require deterministic packet transmissions to support a range of sensing and control tasks with stringent timing constraints. While specialized network infrastructure (e.g., time-sensitive networking (TSN) switches) provides deterministic data transport across the network, achieving strict end-to-end timing guarantees requires equally capable end devices to support deterministic traffic. These end devices, however, often employ general-purpose computing platforms like standard PCs, which lack native support for deterministic traffic and suffer from unpredictable delays introduced by their software stack and system architecture. Although specialized NICs with hardware scheduling offload can mitigate this problem, their limited compatibility hinders widespread adoption, particularly for cost-sensitive applications or in legacy devices. To fill this gap, this paper proposes a novel software-based driver model, namely KeepON, to enable deterministic packet transmissions on end devices equipped with standard NICs. The key idea of KeepON is to have the NIC keep on transmitting fixed-size data chunks as placeholders, thereby maintaining a predictable temporal transmission pattern. The real-time packets generated by the mission-critical application(s) are then precisely inserted into this stream by replacing placeholders at their designated positions to ensure accurate transmission times. We implement and evaluate KeepON by modifying the network driver on a Raspberry Pi using its standard NIC. Our experiments demonstrate that KeepON achieves 162x higher scheduling accuracy than the default driver and 2.6x higher than a hardware-based solution, thus enabling precise timing control on standard commodity hardware.
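The placeholder-stream idea can be sketched as follows (slot timing and framing are made up for illustration, not KeepON's actual parameters): the NIC transmits fixed-size chunks on a constant cadence, and a real-time packet simply replaces the placeholder in its designated slot, inheriting that slot's transmission time.

```python
# Toy illustration of the placeholder-stream idea: the NIC is kept busy with
# fixed-size placeholder chunks at a constant cadence, and a real-time packet
# replaces the placeholder occupying its designated slot.
CHUNK_SLOT_US = 125            # one placeholder chunk every 125 microseconds (assumed)

def build_stream(num_slots, realtime_packets):
    """realtime_packets: dict mapping slot index -> payload."""
    stream = []
    for slot in range(num_slots):
        payload = realtime_packets.get(slot, b"\x00" * 64)   # placeholder chunk
        stream.append((slot * CHUNK_SLOT_US, payload))
    return stream

# A control packet must leave at slot 3 (t = 375 us) and another at slot 7.
stream = build_stream(10, {3: b"control-update", 7: b"sensor-poll"})
for t_us, payload in stream:
    tag = "REAL-TIME" if payload[0] != 0 else "placeholder"
    print(f"t={t_us:4d}us  {tag}")
```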
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization, and a verifier-guided regeneration loop to ensure correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
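The verifier-guided regeneration loop is a generic pattern; the sketch below (function names are placeholders, not VeriLocc's API) samples a candidate, checks it, and feeds the verifier's diagnostic back into the next attempt.

```python
# Generic sketch of a verifier-guided regeneration loop: sample a candidate
# register assignment from the model, check it with a verifier, and resample
# with the failure as feedback until one passes or the budget is exhausted.
def generate_verified(generate, verify, max_attempts=100):
    """generate(feedback) -> candidate; verify(candidate) -> (ok, diagnostic)."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, diagnostic = verify(candidate)
        if ok:
            return candidate, attempt
        feedback = diagnostic          # e.g. "r3 live range overlaps r7"
    raise RuntimeError("no verified allocation within the attempt budget")
```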
Robot applications, comprising independent components that mutually publish/subscribe messages, are built on inter-process communication (IPC) middleware such as Robot Operating System 2 (ROS 2). In large-scale ROS 2 systems like autonomous driving platforms, true zero-copy communication -- eliminating serialization and deserialization -- is crucial for efficiency and real-time performance. However, existing true zero-copy middleware solutions lack widespread adoption as they fail to meet three essential requirements: 1) Support for all ROS 2 message types including unsized ones; 2) Minimal modifications to existing application code; 3) Selective implementation of zero-copy communication between specific nodes while maintaining conventional communication mechanisms for other inter-node communications including inter-host node communications. This first requirement is critical, as production-grade ROS 2 projects like Autoware rely heavily on unsized message types throughout their codebase to handle diverse use cases (e.g., various sensors), and depend on the broader ROS 2 ecosystem, where unsized message types are pervasive in libraries. The remaining requirements facilitate seamless integration with existing projects. While IceOryx middleware, a practical true zero-copy solution, meets all but the first requirement, other studies achieving the first requirement fail to satisfy the remaining criteria. This paper presents Agnocast, a true zero-copy IPC framework applicable to ROS 2 C++ on Linux that fulfills all these requirements. Our evaluation demonstrates that Agnocast maintains constant IPC overhead regardless of message size, even for unsized message types. In Autoware PointCloud Preprocessing, Agnocast achieves a 16% improvement in average response time and a 25% improvement in worst-case response time.
Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the causes behind this and to guide future development of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning and reflection account for the majority of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the highest-scoring agents on OSWorld take 1.4-2.7x more steps than necessary.
We explore how a shell that uses an LLM to accept natural language input might be designed differently from the shells of today. As LLMs may produce unintended or unexplainable outputs, we argue that a natural language shell should provide guardrails that empower users to recover from such errors. We concretize some ideas for doing so by designing a new shell called NaSh, identify remaining open problems in this space, and discuss research directions to address them.
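One possible guardrail along these lines (purely an assumption for illustration, not NaSh's design) is to surface the command the LLM produced, flag patterns that are hard to undo, and require explicit confirmation before anything runs.

```python
# Hypothetical guardrail sketch for a natural-language shell: show the
# translated command, warn on hard-to-undo patterns, and require explicit
# confirmation before executing it.
import re, subprocess

DESTRUCTIVE = [r"\brm\b", r"\bdd\b", r"\bmkfs\b", r">\s*/dev/", r"--force\b"]

def run_natural_language(request: str, llm_translate):
    command = llm_translate(request)          # e.g. a call out to a model
    risky = any(re.search(p, command) for p in DESTRUCTIVE)
    warning = "   [WARNING: hard to undo]" if risky else ""
    print(f"proposed command: {command}{warning}")
    if input("run it? [y/N] ").strip().lower() != "y":
        print("aborted")
        return None
    return subprocess.run(command, shell=True)
```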
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often fail to prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for requests to large language models (LLMs), where the semantics of the process guide scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
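A toy sketch of the idea (the ordering rule here is illustrative, not the paper's algorithm): order requests by the priority their semantics imply and, within a priority class, by shortest estimated service time, which minimizes total waiting time inside each class.

```python
# Toy sketch of semantics-aware prompt scheduling: urgent requests (as judged
# from their semantics) go first; ties are broken by shortest estimated
# service time to keep overall waiting time low within each priority class.
def schedule(requests):
    """requests: list of dicts with 'prompt', 'priority' (lower = more urgent),
    and 'est_seconds' (predicted decode time)."""
    return sorted(requests, key=lambda r: (r["priority"], r["est_seconds"]))

queue = [
    {"prompt": "summarize quarterly report", "priority": 2, "est_seconds": 30},
    {"prompt": "triage: chest pain, 54-year-old", "priority": 0, "est_seconds": 12},
    {"prompt": "rewrite marketing email", "priority": 2, "est_seconds": 8},
    {"prompt": "dispatch ambulance routing update", "priority": 0, "est_seconds": 5},
]
for r in schedule(queue):
    print(r["priority"], r["prompt"])
```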