As an enabling architecture of Large Models (LMs), Mixture of Experts (MoE) has become prevalent thanks to its sparsely-gated mechanism, which lowers computational overhead while maintaining learning performance comparable to dense LMs. The essence of MoE lies in utilizing a group of neural networks (called experts), each specializing in different types of tasks, along with a trainable gating network that selectively activates a subset of these experts to handle specific tasks. Traditional cloud-based MoE encounters challenges such as prolonged response latency, high bandwidth consumption, and data privacy leakage. To address these issues, researchers have proposed deploying MoE over distributed edge networks. However, a key concern with distributed MoE frameworks is the lack of trust in data interactions among distributed experts operating without the oversight of any trusted authority, which leaves them prone to attacks such as data manipulation. In response to the security issues of traditional distributed MoE, we propose a blockchain-aided trustworthy MoE (B-MoE) framework that consists of three layers: the edge layer, the blockchain layer, and the storage layer. In this framework, the edge layer employs the activated experts downloaded from the storage layer to process the learning tasks, while the blockchain layer functions as a decentralized trustworthy network to trace, verify, and record the computational results of the experts from the edge layer. The experimental results demonstrate that B-MoE is more robust to data manipulation attacks than traditional distributed MoE during both the training and inference processes.
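The sparsely-gated selection described above is straightforward to sketch: a trainable gating network scores every expert for an input and only the top-k experts are activated and combined. The snippet below is a minimal NumPy illustration of that idea, not the B-MoE implementation; the expert count, the choice of k, and the softmax-over-selected-experts rule are assumptions for the example.

```python
import numpy as np

def topk_gate(x, W_gate, k=2):
    """Score all experts for input x and keep only the top-k (sparse gating)."""
    logits = x @ W_gate                         # one score per expert
    top = np.argsort(logits)[-k:]               # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                    # softmax over the selected experts only
    return top, weights

def moe_forward(x, experts, W_gate, k=2):
    """Combine the outputs of only the activated experts."""
    top, weights = topk_gate(x, W_gate, k)
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Toy linear "experts"; each captures its own weight matrix as a default argument.
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
W_gate = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=d), experts, W_gate, k=2)
print(y.shape)
```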
3D Gaussian Splatting (3D-GS) has recently emerged as a powerful technique for real-time, photorealistic rendering by optimizing anisotropic Gaussian primitives from view-dependent images. While 3D-GS has been extended to scientific visualization, prior work remains limited to single-GPU settings, restricting scalability for large datasets on high-performance computing (HPC) systems. We present a distributed 3D-GS pipeline tailored for HPC. Our approach partitions data across nodes, trains Gaussian splats in parallel across multiple nodes and GPUs, and merges splats for global rendering. To eliminate artifacts, we add ghost cells at partition boundaries and apply background masks to remove irrelevant pixels. Benchmarks on the Richtmyer-Meshkov datasets (about 106.7M Gaussians) show up to 3X speedup across 8 nodes on Polaris while preserving image quality. These results demonstrate that distributed 3D-GS enables scalable visualization of large-scale scientific data and provide a foundation for future in situ applications.
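The partitioning step mentioned above (splitting the data across nodes and padding each block with ghost cells so splats near a boundary still see their neighbours) can be illustrated with a small sketch. The code below is a simplified 1D-slab partitioner over a point cloud plus a trivial background mask, under assumed names and thresholds; the actual pipeline partitions full 3D volumes and launches one training job per block.

```python
import numpy as np

def partition_with_ghosts(points, n_blocks, ghost_width, axis=0):
    """Split points into slabs along one axis, padding each slab with a ghost region."""
    lo, hi = points[:, axis].min(), points[:, axis].max()
    edges = np.linspace(lo, hi, n_blocks + 1)
    blocks = []
    for i in range(n_blocks):
        a, b = edges[i] - ghost_width, edges[i + 1] + ghost_width
        mask = (points[:, axis] >= a) & (points[:, axis] <= b)
        blocks.append(points[mask])             # block plus its ghost cells
    return blocks

def background_mask(image, threshold=1e-3):
    """Keep only pixels above an intensity threshold (drop irrelevant background)."""
    return image > threshold

pts = np.random.default_rng(1).uniform(0, 1, size=(10_000, 3))
blocks = partition_with_ghosts(pts, n_blocks=8, ghost_width=0.02)
print([len(b) for b in blocks])
```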
Translating programs between various parallel programming languages is an important problem in the high-performance computing (HPC) community. Existing tools for this problem are too narrow in scope, outdated, or both. Recent explosive growth in the popularity of large language models (LLMs) and their ability to generate and translate code offers a potential alternative approach. Toward that end, we first need to systematically evaluate the ability of LLMs to translate between parallel languages. In this work, we introduce UniPar, a systematic evaluation framework for LLM-based parallel code translation. Specifically, in this work, we target translations between serial code, CUDA, and OpenMP. Our goal is to assess how well current instruction-tuned LLMs -- specifically GPT-4o-mini and LLaMA-3.3-70B-Instruct -- can be used out of the box or enhanced through known strategies. We evaluated four major usage modes: hyperparameter optimization for decoding, zero- and few-shot prompting, supervised fine-tuning, and iterative feedback through compiler-based repair. As part of the evaluation, we construct a new dataset called PARATRANS, covering both serial-to-parallel translation and cross-paradigm transformations. Our findings reveal that while off-the-shelf models struggle under the default settings (e.g., GPT-4o-mini achieves only 46% compilation and 15% functional correctness), our UniPar methodology -- combining fine-tuning, hyperparameter tuning, and compiler-guided repair -- improves performance by up to 2X (69% compilation and 33% correctness). We believe that our findings will provide useful insights for researchers to further improve LLMs for the parallel language translation problem. UniPar source code and PARATRANS dataset are available at our GitHub repository https://github.com/Scientific-Computing-Lab/UniPar_AI.
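The most mechanical of the four usage modes, iterative feedback through compiler-based repair, follows a simple loop: translate, try to compile, and feed the diagnostics back into the prompt until the candidate builds or a retry budget runs out. The sketch below shows that loop in generic Python with a hypothetical translate() callable standing in for the LLM; it is not the UniPar driver and assumes gcc with OpenMP is available.

```python
import pathlib
import subprocess
import tempfile

def compiles(source: str, compiler=("gcc", "-fopenmp", "-c")) -> tuple[bool, str]:
    """Try to compile a translated OpenMP candidate; return success flag and diagnostics."""
    with tempfile.TemporaryDirectory() as d:
        src = pathlib.Path(d) / "candidate.c"
        src.write_text(source)
        proc = subprocess.run([*compiler, str(src), "-o", str(src.with_suffix(".o"))],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stderr

def translate_with_repair(serial_code: str, translate, max_rounds: int = 3) -> str:
    """translate(code, feedback) is a hypothetical LLM call returning a candidate."""
    feedback, candidate = "", ""
    for _ in range(max_rounds):
        candidate = translate(serial_code, feedback)
        ok, diagnostics = compiles(candidate)
        if ok:
            return candidate
        feedback = diagnostics            # feed compiler errors back into the next prompt
    return candidate                      # best effort once the retry budget is spent
```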
We introduce a learning-augmented peer-to-peer (P2P) network design that leverages predictions of traffic patterns to optimize the network's topology. While keeping formal guarantees on the standard P2P metrics (routing path length, maximum degree), we optimize the network in a demand-aware manner and minimize the path lengths weighted by the peer-to-peer communication demands. Our protocol is learning-augmented, meaning that each node receives an individual, possibly inaccurate prediction about the future traffic patterns, with the goal of improving the network's performance. We strike a trade-off between significantly improved performance when the predictions are correct (consistency) and polylogarithmic performance when the predictions are arbitrary (robustness). We have two main contributions. First, we consider the centralized setting and show that the problem of constructing an optimum static skip list network (SLN) is solvable in polynomial time and can be computed via dynamic programming. This problem is the natural demand-aware extension of the optimal skip list problem. Second, we introduce the Uniform P2P protocol, which generalizes skip list networks (SLN) by relaxing node heights from discrete to continuous. We show that Uniform achieves state-of-the-art performance: logarithmic routing and maximum degree, both with high probability. We then use Uniform to build a learning-augmented P2P protocol in order to incorporate demand-awareness, leading to our main contribution, LASLiN. We prove that the performance of LASLiN is consistent with that of an optimum static SLN with correct predictions (given via our dynamic programming approach), and is at most a logarithmic factor off state-of-the-art P2P protocols if the predictions are arbitrarily wrong. For the special case of highly sparse demands, we show that LASLiN achieves improved performance.
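The key relaxation behind Uniform, continuous rather than discrete node heights, can be pictured with a small sketch: a classical skip list draws a geometric number of levels per node, whereas a continuous analogue draws a real-valued height and links each node to the nearest peer on either side whose height is at least its own. The code below is only a conceptual illustration of that height assignment and neighbour rule, under assumed distributions; it is neither the Uniform protocol nor LASLiN.

```python
import random

def discrete_level(p=0.5):
    """Classical skip list: a geometric number of levels per node."""
    level = 1
    while random.random() < p:
        level += 1
    return level

def continuous_height():
    """Continuous relaxation: a real-valued height, here an exponential draw."""
    return random.expovariate(1.0)

def neighbours(nodes):
    """nodes: list of (key, height) sorted by key. Each node links to the closest
    node on either side whose height is at least its own (a continuous analogue
    of skip-list towers)."""
    links = {}
    for i, (key, h) in enumerate(nodes):
        right = next((k for k, g in nodes[i + 1:] if g >= h), None)
        left = next((k for k, g in reversed(nodes[:i]) if g >= h), None)
        links[key] = (left, right)
    return links

nodes = sorted((k, continuous_height()) for k in range(16))
print("sample discrete level:", discrete_level())
print(neighbours(nodes))
```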
Foundational models of computation often abstract away physical hardware limitations. However, in extreme environments like In-Network Computing (INC), these limitations become inviolable laws, creating an acute trilemma among communication efficiency, bounded memory, and robust scalability. Prevailing distributed paradigms, while powerful in their intended domains, were not designed for this stringent regime and thus face fundamental challenges. This paper demonstrates that resolving this trilemma requires a shift in perspective - from seeking engineering trade-offs to deriving solutions from logical necessity. We establish a rigorous axiomatic system that formalizes these physical constraints and prove that for the broad class of computations admitting an idempotent merge operator, there exists a unique, optimal paradigm. Any system satisfying these axioms must converge to a single normal form: Self-Describing Parallel Flows (SDPF), a purely data-centric model where stateless executors process flows that carry their own control logic. We further prove this unique paradigm is convergent, Turing-complete, and minimal. In the same way that the CAP theorem established a boundary for what is impossible in distributed state management, our work provides a constructive dual: a uniqueness theorem that reveals what is inevitable for distributed computation flows under physical law.
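Two of the ingredients above have a simple concrete reading: an idempotent merge operator (merging a partial result with itself changes nothing, e.g., set union or max) and self-describing flows that carry their own control logic so executors can remain stateless. The toy sketch below, under assumed names and not drawn from the paper, only illustrates those two properties.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Flow:
    """A self-describing flow: the payload travels with its own merge logic."""
    payload: frozenset
    merge: Callable[[frozenset, frozenset], frozenset]

def stateless_executor(a: Flow, b: Flow) -> Flow:
    """An in-network executor keeps no state of its own; it only applies the
    merge operator carried by the flows."""
    return Flow(a.merge(a.payload, b.payload), a.merge)

union = lambda x, y: x | y                    # idempotent: union(x, x) == x
f1 = Flow(frozenset({1, 2}), union)
f2 = Flow(frozenset({2, 3}), union)
merged = stateless_executor(f1, f2)
assert stateless_executor(merged, merged).payload == merged.payload   # idempotence
print(sorted(merged.payload))
```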
To support real-time interaction with LLMs and instant search or recommendation on social media, building a k-NN graph or indexing graph over massive amounts of vectorized multimedia data has become a pressing problem. In such scenarios, the scale of the data or the scale of the graph may exceed the processing capacity of a single machine. This paper aims to address the graph construction problem at such scale via efficient graph merge. For graph construction on a single node, two generic and highly parallelizable algorithms, namely Two-way Merge and Multi-way Merge, are proposed to merge subgraphs into one. For graph construction across multiple nodes, a multi-node procedure based on Two-way Merge is presented. The procedure makes it feasible to construct a large-scale k-NN graph/indexing graph on either a single node or multiple nodes when the data size exceeds the memory capacity of one node. Extensive experiments are conducted on both large-scale k-NN graph and indexing graph construction. For k-NN graph construction, large-scale, high-quality k-NN graphs are constructed by graph merge in parallel. Typically, a billion-scale k-NN graph can be built in approximately 17h when only three nodes are employed. For indexing graph construction, the merged indexing graphs achieve NN search performance similar to the original indexing graph while requiring much less construction time.
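The core idea of merging two k-NN subgraphs is that every point gathers neighbour candidates from both sides and keeps only its k closest. The sketch below is a deliberately simplified, serial stand-in for Two-way Merge that uses a brute-force cross step for illustration; the paper's algorithms are parallel and avoid exhaustive cross comparisons.

```python
import numpy as np

def knn_subgraph(points, ids, k):
    """Exact k-NN graph over one partition: {id: [(dist, neighbour_id), ...]}."""
    graph = {}
    for i in ids:
        dists = {j: float(np.linalg.norm(points[i] - points[j])) for j in ids if j != i}
        graph[i] = sorted((d, j) for j, d in dists.items())[:k]
    return graph

def two_way_merge(points, g1, g2, k):
    """Merge two subgraphs: each node also considers nodes from the other side
    and keeps its k closest candidates overall (brute-force cross step, for
    illustration only)."""
    merged = {}
    for own, other in ((g1, g2), (g2, g1)):
        for i, nbrs in own.items():
            cross = [(float(np.linalg.norm(points[i] - points[j])), j) for j in other]
            merged[i] = sorted(nbrs + cross)[:k]
    return merged

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 16))
left, right = list(range(100)), list(range(100, 200))
g = two_way_merge(pts, knn_subgraph(pts, left, 8), knn_subgraph(pts, right, 8), k=8)
print(len(g), len(g[0]))
```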
The collaborative efforts of large communities in science experiments, often comprising thousands of global members, reflect a monumental commitment to exploration and discovery. Recently, advanced and complex data processing has gained increasing importance in science experiments. Data processing workflows typically consist of multiple intricate steps, and the precise specification of resource requirements is crucial for each step to allocate optimal resources for effective processing. Estimating resource requirements in advance is challenging due to a wide range of analysis scenarios, varying skill levels among community members, and the continuously increasing spectrum of computing options. One practical approach to mitigate these challenges involves initially processing a subset of each step to measure precise resource utilization from actual processing profiles before completing the entire step. While this two-stage approach enables processing on optimal resources for most of the workflow, it has drawbacks: initial inaccuracies can lead to failures and suboptimal resource usage, and waiting for the initial processing to complete adds overhead, which is critical for fast-turnaround analyses. In this context, our study introduces a novel pipeline of machine learning models within a comprehensive workflow management system, the Production and Distributed Analysis (PanDA) system. These models employ advanced machine learning techniques to predict key resource requirements, overcoming challenges posed by limited upfront knowledge of characteristics at each step. Accurate forecasts of resource requirements enable informed and proactive decision-making in workflow management, enhancing the efficiency of handling diverse, complex workflows across heterogeneous resources.
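The pipeline described above replaces the scout-processing stage with predictions of key resource requirements (for example, peak memory) made from task metadata that is available before execution. The snippet below is only a minimal sketch of that idea, fitting ordinary least squares on a few assumed toy features; it is not the PanDA model pipeline.

```python
import numpy as np

# Assumed toy features per task: (input size in GB, number of events, cores requested)
X_train = np.array([[1.0, 1e5, 8], [2.5, 2e5, 8], [4.0, 5e5, 16], [8.0, 1e6, 16]])
y_train = np.array([3.2, 5.1, 9.8, 17.5])        # observed peak memory (GB)

# Fit a linear predictor with a bias term via least squares.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict_memory(size_gb, n_events, cores):
    """Predict peak memory for a step before any part of it has run."""
    return float(np.array([size_gb, n_events, cores, 1.0]) @ coef)

print(round(predict_memory(3.0, 3e5, 16), 2))
```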
The paper deals with makespan minimization in the hybrid flow shop scheduling problem with multiprocessor tasks. The hybrid flow shop (HFS) generalizes the classical flow shop processor configuration by replacing each processor (processing stage) by some number of identical parallel processors. Similarly, multiprocessor tasks generalize the classical assumption by allowing a task to require more than one processor simultaneously for its processing. In this work we present an algorithm for solving the problem based on the tabu search technique. The proposed algorithm uses parallel and distributed mechanisms for neighborhood evaluation and balances the load well in a heterogeneous network environment.
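The part of tabu search that parallelizes naturally is neighborhood evaluation: candidate moves are independent, so their makespans can be computed concurrently and only the best non-tabu move is applied. The sketch below shows that generic pattern with a process pool and a placeholder objective; it is not the paper's algorithm and does not model its load balancing across a heterogeneous network.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def makespan(permutation):
    """Placeholder objective; a real version would simulate the HFS schedule."""
    return sum(i * task for i, task in enumerate(permutation))

def swap_neighbourhood(perm):
    """All solutions reachable by swapping two positions."""
    for i, j in combinations(range(len(perm)), 2):
        cand = list(perm)
        cand[i], cand[j] = cand[j], cand[i]
        yield (i, j), tuple(cand)

def tabu_step(perm, tabu, workers=4):
    moves = list(swap_neighbourhood(perm))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        costs = list(pool.map(makespan, [cand for _, cand in moves]))
    best = min((c, m, cand) for c, (m, cand) in zip(costs, moves) if m not in tabu)
    return best[2], best[1]               # new solution and the move to declare tabu

if __name__ == "__main__":
    perm, tabu = tuple(range(8)), set()
    for _ in range(10):
        perm, move = tabu_step(perm, tabu)
        tabu.add(move)
    print(perm, makespan(perm))
```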
This paper addresses the deadline-constrained task offloading and resource allocation problem in multi-access edge computing. We aim to determine where each task is offloaded and processed, as well as the corresponding communication and computation resource allocations, to maximize the total saved energy for IoT devices, while considering task deadline and system resource constraints. Notably, our system allows each task to be offloaded to one of its accessible access points (APs) and processed on a server that is not co-located with its offloading AP. We formulate this problem as an Integer Nonlinear Programming problem and show it is NP-Hard. To address this problem, we propose a Graph-Matching-based Approximation Algorithm ($\mathtt{GMA}$), the first approximation algorithm of its kind. $\mathtt{GMA}$ leverages linear relaxation, tripartite graph construction, and a Linear Programming rounding technique. We prove that $\mathtt{GMA}$ is a $\frac{1-\alpha}{2+\epsilon}$-approximation algorithm, where $\epsilon$ is a small positive value, and $\alpha$ ($0 \le \alpha < 1$) is a system parameter that ensures the resource allocated to any task by an AP or a server cannot exceed $\alpha$ times its resource capacity. Experiments show that, in practice, $\mathtt{GMA}$'s energy saving achieves $97\%$ of the optimal value on average.
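A worked reading of the $\alpha$ parameter: with $\alpha = 0.3$, no single task may be allocated more than 30% of any AP's communication capacity or any server's computing capacity. The check below is a small sketch of that per-task cap together with the aggregate capacity and deadline constraints, over an assumed allocation layout; it is not the GMA algorithm itself.

```python
def allocation_feasible(allocs, ap_capacity, server_capacity, alpha, deadlines):
    """allocs: {task: (ap, bandwidth, server, cpu, finish_time)} -- hypothetical layout.
    Verify the per-task alpha cap, aggregate capacities, and task deadlines."""
    ap_used, srv_used = {}, {}
    for task, (ap, bw, srv, cpu, finish) in allocs.items():
        if bw > alpha * ap_capacity[ap] or cpu > alpha * server_capacity[srv]:
            return False                      # violates the alpha-fraction cap
        if finish > deadlines[task]:
            return False                      # misses its deadline
        ap_used[ap] = ap_used.get(ap, 0) + bw
        srv_used[srv] = srv_used.get(srv, 0) + cpu
    return (all(ap_used[a] <= ap_capacity[a] for a in ap_used)
            and all(srv_used[s] <= server_capacity[s] for s in srv_used))

allocs = {"t1": ("ap1", 20, "s1", 4, 90.0), "t2": ("ap1", 25, "s2", 6, 120.0)}
print(allocation_feasible(allocs, {"ap1": 100}, {"s1": 32, "s2": 32},
                          alpha=0.3, deadlines={"t1": 100.0, "t2": 150.0}))
```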
Recent advances in data analytics have enabled the accurate prediction of user access patterns, giving rise to the idea of packed caching: delivering multiple co-accessed data items together as a bundle. This improves caching efficiency, as accessing one item often implies the need for others. Prior work has explored only 2-item (pairwise) packing. In this paper, we extend the concept to general K-packing, allowing variable-size bundles for improved flexibility and performance. We formulate the K-PackCache problem from a content delivery network (CDN) operator's perspective, aiming to minimize total cost comprising two components: transfer cost, modeled as a base cost plus a term that increases linearly with the number of items packed, and memory rental cost for caching, which depends on how long and how much is stored. Overpacking increases cost due to low utility; underpacking leads to missed sharing opportunities. We propose an online algorithm, Adaptive K-PackCache (AKPC), which dynamically forms, merges, and splits data cliques based on user access patterns and content correlation. Our approach supports batch requests, enables approximate clique merging, and offers a formal competitive guarantee. Through extensive evaluation on the Netflix and Spotify datasets, AKPC reduces total cost by up to 63% and 55% over online baselines, respectively, and achieves performance within 15% and 13% of the optimal. This demonstrates its scalability and effectiveness for real-world caching systems.
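The trade-off AKPC navigates can be made concrete with the cost model above: shipping a bundle of size K costs a base fee plus a per-item increment, while keeping it cached accrues memory rent proportional to its size and residence time. The sketch below encodes that model and a naive merge test under assumed parameter names and an illustrative decision rule; the actual clique formation, merging, and splitting logic in AKPC is more involved.

```python
def transfer_cost(k, base=1.0, per_item=0.2):
    """Cost of shipping one bundle of k items: base fee plus a linear term."""
    return base + per_item * k

def memory_cost(k, seconds, rent_per_item_second=0.001):
    """Rent for keeping a k-item bundle cached for a given duration."""
    return rent_per_item_second * k * seconds

def worth_merging(k1, k2, co_access_prob, cache_seconds):
    """Merge two bundles only if the expected cost of serving them separately
    (weighted by how often they are co-accessed) exceeds the merged bundle's
    cost -- an illustrative rule, not AKPC's."""
    separate = (transfer_cost(k1) + transfer_cost(k2)
                + memory_cost(k1, cache_seconds) + memory_cost(k2, cache_seconds))
    merged = transfer_cost(k1 + k2) + memory_cost(k1 + k2, cache_seconds)
    return co_access_prob * separate > merged

print(worth_merging(3, 2, co_access_prob=0.9, cache_seconds=600))
```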
We present factorization and solution phases for a new linear complexity direct solver designed for concurrent batch operations on fine-grained parallel architectures, for matrices amenable to hierarchical representation. We focus on the strong-admissibility-based $\mathcal{H}^2$ format, where strong recursive skeletonization factorization compresses remote interactions. We build upon previous implementations of $\mathcal{H}^2$ matrix construction for efficient factorization and solution algorithm design, which are illustrated graphically in stepwise detail. The algorithms are "black-box" in the sense that the only inputs are the matrix and right-hand side, without analytical or geometrical information about the origin of the system. We demonstrate linear complexity scaling in both time and memory on four representative families of dense matrices up to one million in size. Parallel scaling up to 16 threads is enabled by a multi-level matrix graph coloring and avoidance of dynamic memory allocations thanks to prefix-sum memory management. An experimental backward error analysis is included. We break down the timings of different phases, identify phases that are memory-bandwidth limited, and discuss alternatives for phases that may be sensitive to the trend to employ lower precisions for performance.
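One implementation detail called out above, avoiding dynamic allocation through prefix-sum memory management, is easy to show in isolation: when per-block buffer sizes are known up front, an exclusive prefix sum turns them into offsets into one preallocated arena, so workers only ever take views. The sketch below illustrates just that bookkeeping with assumed block sizes; it says nothing about the $\mathcal{H}^2$ factorization itself.

```python
import numpy as np

block_sizes = np.array([64, 128, 32, 256, 96])          # entries needed per block

offsets = np.zeros(len(block_sizes) + 1, dtype=np.int64)
np.cumsum(block_sizes, out=offsets[1:])                  # exclusive prefix sum
arena = np.zeros(offsets[-1], dtype=np.float64)          # one up-front allocation

def block_view(i):
    """A writable view into block i; no allocation happens during the computation."""
    return arena[offsets[i]:offsets[i + 1]]

block_view(3)[:] = 1.0                                   # workers write into their slices
print(offsets.tolist(), arena.sum())
```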
The surge in large language models (LLMs) has fundamentally reshaped the landscape of GPU usage patterns, creating an urgent need for more efficient management strategies. While cloud providers employ spot instances to reduce costs for low-priority (LP) tasks, existing schedulers still grapple with high eviction rates and lengthy queuing times. To address these limitations, we present GFS, a novel preemptive scheduling framework that enhances service-level objective (SLO) compliance for high-priority (HP) tasks while minimizing preemptions to LP tasks. Firstly, GFS utilizes a lightweight forecasting model that predicts GPU demand among different tenants, enabling proactive resource management. Secondly, GFS employs a dynamic allocation mechanism to adjust the spot quota for LP tasks with guaranteed durations. Lastly, GFS incorporates a preemptive scheduling policy that prioritizes HP tasks while minimizing the impact on LP tasks. We demonstrate the effectiveness of GFS through both real-world implementation and simulations. The results show that GFS reduces eviction rates by 33.0%, and cuts queuing delays by 44.1% for LP tasks. Furthermore, GFS enhances the GPU allocation rate by up to 22.8% in real production clusters. In a production cluster of more than 10,000 GPUs, GFS yields roughly $459,715 in monthly benefits.
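The policy layer can be summarized as: admit an HP task immediately, preempting just enough LP spot tasks (preferring those whose guaranteed duration has already expired) to free the GPUs it needs. The sketch below captures that priority rule with assumed fields; the real scheduler also folds in the demand forecast and the dynamic spot quota.

```python
from dataclasses import dataclass

@dataclass
class LPTask:
    name: str
    gpus: int
    guaranteed_until: float       # LP spot tasks keep a guaranteed duration

def pick_victims(lp_tasks, gpus_needed, now):
    """Choose LP tasks to preempt for an HP arrival: prefer tasks whose guarantee
    has already expired, then those closest to expiry, and stop as soon as
    enough GPUs are freed. Returns None if the HP task must be queued instead."""
    candidates = sorted(lp_tasks,
                        key=lambda t: (t.guaranteed_until > now, t.guaranteed_until))
    victims, freed = [], 0
    for task in candidates:
        if freed >= gpus_needed:
            break
        victims.append(task)
        freed += task.gpus
    return victims if freed >= gpus_needed else None

lp = [LPTask("a", 4, 90.0), LPTask("b", 8, 200.0), LPTask("c", 2, 50.0)]
print(pick_victims(lp, gpus_needed=6, now=100.0))
```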
The increasing size of large language models (LLMs) has led to a surge in memory requirements during training, often exceeding the capacity of high-bandwidth memory (HBM). Swap-based memory optimization incurs neither accuracy loss nor additional end-to-end overhead when effectively overlapped, thus being an attractive solution. However, existing swap methods assume consistent operator sequences, which is impractical in Eager Mode, where operator sequences can change across iterations. We propose Chameleon, which redesigns the end-to-end process of swap-based memory optimization and is the first work to consider varying operator sequences in Eager Mode. Chameleon (i) introduces a lightweight online profiler to enable continuous profiling for monitoring operator sequences, (ii) generates effective swap policies with limited operator information, and (iii) optimizes the policy execution module for accurate policy application and better performance. Experimental results demonstrate that Chameleon reduces profiling overhead by 84.25%, enables training models up to 4x larger than hardware memory while adapting to changes in operator sequences, and improves performance by up to 38.94% compared to recomputation or high-degree parallelism.
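The per-tensor decision a swap policy has to make, whether to evict a tensor to host memory after its producer and prefetch it before its next use, can be sketched with a simple rule: swap only when the compute between producer and consumer can hide both transfers. The planner below works over an assumed profiled operator list and an illustrative slack rule; it is a schematic, not Chameleon's policy generator.

```python
def plan_swaps(ops, pcie_ms_per_mb=0.06, min_slack_ms=2.0):
    """ops: assumed profiler output, e.g.
    [{"name": "fwd_linear", "ms": 1.0, "produces": {"act": 256}, "uses": []}, ...]
    Swap a tensor out after its producer if the compute gap before its next
    consumer can hide swap-out plus prefetch, with some slack to spare."""
    uses = {}
    for idx, op in enumerate(ops):
        for t in op["uses"]:
            uses.setdefault(t, []).append(idx)

    plan = []
    for idx, op in enumerate(ops):
        for tensor, mb in op["produces"].items():
            later = [u for u in uses.get(tensor, []) if u > idx]
            if not later:
                continue
            gap_ms = sum(ops[j]["ms"] for j in range(idx + 1, later[0]))
            transfer_ms = 2 * mb * pcie_ms_per_mb           # swap out + prefetch back
            if gap_ms - transfer_ms >= min_slack_ms:
                plan.append((tensor, idx, later[0]))         # (tensor, swap_after, prefetch_before)
    return plan

demo = [
    {"name": "fwd_linear", "ms": 1.0, "produces": {"act": 256}, "uses": []},
    {"name": "fwd_rest", "ms": 40.0, "produces": {}, "uses": []},
    {"name": "bwd_linear", "ms": 1.2, "produces": {}, "uses": ["act"]},
]
print(plan_swaps(demo))
```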
Blockchain technology offers decentralization and security but struggles with scalability, particularly in enterprise settings where efficiency and controlled access are paramount. Sharding is a promising solution for private blockchains, yet existing approaches face challenges in coordinating shards, ensuring fault tolerance with limited nodes, and minimizing the high overhead of consensus mechanisms like PBFT. This paper proposes the Range-Based Sharding (RBS) Protocol, a novel sharding mechanism tailored for enterprise blockchains, implemented on Quorum. Unlike traditional sharding models such as OmniLedger and the non-sharded Corda framework, RBS employs a commit-reveal scheme for secure and unbiased shard allocation, ensuring fair validator distribution while reducing cross-shard transaction delays. Our approach enhances scalability by balancing computational loads across shards, reducing consensus overhead, and improving parallel transaction execution. Experimental evaluations demonstrate that RBS achieves significantly higher throughput and lower latency compared to existing enterprise sharding frameworks, making it a viable and efficient solution for large-scale blockchain deployments.
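The commit-reveal step used for unbiased shard allocation follows a standard pattern: each validator first publishes a hash of a locally generated secret, then reveals the secret, and the combined reveals seed a deterministic assignment that no single party could steer. The sketch below shows that generic pattern with Python's hashlib; the ranges, epochs, and Quorum integration of RBS are not modeled.

```python
import hashlib
import os

def commit(secret: bytes) -> str:
    """Phase 1: publish only the hash of a locally generated secret."""
    return hashlib.sha256(secret).hexdigest()

def assign_shards(reveals: dict, commitments: dict, n_shards: int) -> dict:
    """Phase 2: check every reveal against its commitment, mix all secrets into
    one seed, and derive each validator's shard deterministically from it."""
    mix = hashlib.sha256()
    for validator in sorted(reveals):
        assert hashlib.sha256(reveals[validator]).hexdigest() == commitments[validator]
        mix.update(reveals[validator])
    seed = mix.digest()
    return {v: int.from_bytes(hashlib.sha256(seed + v.encode()).digest(), "big") % n_shards
            for v in sorted(reveals)}

secrets = {f"validator-{i}": os.urandom(32) for i in range(6)}
commitments = {v: commit(s) for v, s in secrets.items()}
print(assign_shards(secrets, commitments, n_shards=3))
```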
Cross-chain bridges and oracle DAOs represent some of the most vulnerable components of decentralized systems, with more than $2.8 billion lost due to trust failures, opaque validation behavior, and weak incentives. Current oracle designs are based on multisigs, optimistic assumptions, or centralized aggregation, exposing them to attacks and delays. Moreover, predictable committee selection enables manipulation, which threatens data integrity across chains. We propose V-ZOR, a verifiable oracle relay that integrates zero-knowledge proofs, quantum-grade randomness, and cross-chain restaking to mitigate these risks. Each oracle packet includes a Halo 2 proof verifying that the reported data was correctly aggregated using a deterministic median. To prevent committee manipulation, V-ZOR reseeds its VRF using auditable quantum entropy, ensuring unpredictable and secure selection of reporters. Reporters stake once on a shared restaking hub; any connected chain can submit a fraud proof to trigger slashing, removing the need for multisigs or optimistic assumptions. A prototype deployed on Sepolia and Scroll achieves sub-300k gas verification, one-block latency, and a 10x increase in collusion cost. V-ZOR demonstrates that combining ZK attestation with quantum-randomized restaking enables a trust-minimized, high-performance oracle layer for cross-chain DeFi.
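Two pieces of the design are easy to illustrate off-chain: the deterministic median that the proof attests to, and committee selection driven by an externally supplied entropy seed so the reporter set cannot be predicted before the entropy is published. The sketch below shows both in plain Python under assumed names; it contains no zero-knowledge proof, and the fixed byte string stands in for the auditable quantum entropy source.

```python
import hashlib

def deterministic_median(reports):
    """The aggregation rule being attested: the lower median of the sorted reports."""
    ordered = sorted(reports)
    return ordered[(len(ordered) - 1) // 2]      # deterministic even for even counts

def select_committee(reporters, entropy: bytes, size: int):
    """Rank reporters by a hash keyed with the published entropy; the committee
    is unpredictable until that entropy is revealed."""
    scored = sorted(reporters,
                    key=lambda r: hashlib.sha256(entropy + r.encode()).digest())
    return scored[:size]

reports = [101_250, 101_310, 99_990, 101_275, 101_260]
print(deterministic_median(reports))
print(select_committee([f"reporter-{i}" for i in range(10)], b"\x07" * 32, size=4))
```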
The Open Network (TON) is a high-performance blockchain platform designed for scalability and efficiency, leveraging an asynchronous execution model and a multi-layered architecture. While TON's design offers significant advantages, it also introduces unique challenges for smart contract development and security. This paper introduces a comprehensive audit checklist for TON smart contracts, based on an analysis of 34 professional audit reports containing 233 real-world vulnerabilities. The checklist addresses TON-specific challenges, such as asynchronous message handling, and provides actionable insights for developers and auditors. We also present detailed case studies of vulnerabilities in TON smart contracts, highlighting their implications and offering lessons learned. By adopting this checklist, developers and auditors can systematically identify and mitigate vulnerabilities, enhancing the security and reliability of TON-based projects. Our work bridges the gap between Ethereum's mature audit methodologies and the emerging needs of the TON ecosystem, fostering a more secure and robust blockchain environment.
The Message Passing Interface (MPI) is a fundamental tool for building high-performance computing (HPC) applications, enabling efficient communication across distributed systems. Despite its widespread adoption, MPI's low-level interface and lack of built-in type safety make it prone to runtime errors, undefined behavior, and debugging challenges, especially in large-scale applications. Rust, a modern systems programming language, offers a compelling solution with its strong type system, which enforces memory and type safety at compile time without compromising performance. This paper introduces a type-safe communication framework for MPI, built on the RSMPI library, to address the limitations of traditional MPI programming. At its core is the TypedCommunicator, an abstraction that enforces static type safety in point-to-point communication operations. By leveraging Rust's Equivalence trait, our framework guarantees that only compatible types can participate in communication, catching mismatches either at compile time or through runtime validation. The framework supports both single-value and slice-based communication, providing an intuitive API for diverse data structures. Our implementation demonstrates that this approach eliminates common MPI errors, improves developer productivity, and maintains performance, adhering to Rust's principle of zero-cost abstractions. This work lays the foundation for extending type safety to collective operations, advancing the robustness of parallel computing in Rust.
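The guarantee the TypedCommunicator provides, that both endpoints of a point-to-point exchange agree on the element type, comes from Rust's type system at compile time. A rough runtime analogue in plain Python, with a hypothetical in-process channel standing in for an MPI communicator, conveys the shape of the API; it is not RSMPI, offers no compile-time checking, and the class and method names below are assumptions for illustration.

```python
from queue import Queue

class TypedChannel:
    """A channel that only transports one declared element type, loosely echoing
    the TypedCommunicator idea (checked here at runtime, not at compile time)."""

    def __init__(self, element_type: type):
        self.element_type = element_type
        self._queue: Queue = Queue()

    def send(self, value):
        if not isinstance(value, self.element_type):
            raise TypeError(f"channel carries {self.element_type.__name__}, "
                            f"got {type(value).__name__}")
        self._queue.put(value)

    def send_slice(self, values):
        for v in values:              # slice-based communication: every element checked
            self.send(v)

    def recv(self):
        return self._queue.get()

chan = TypedChannel(float)
chan.send(3.14)
chan.send_slice([1.0, 2.0])
# chan.send("oops")  # would raise TypeError instead of corrupting the receiver
print(chan.recv())
```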
The tracking module of a visual-inertial SLAM system processes incoming image frames and IMU data to estimate the position of the frame in relation to the map. It is important for tracking to complete in a timely manner for each frame to avoid poor localization or tracking loss. We therefore present a new approach that leverages GPU computing power to accelerate time-consuming components of tracking in order to improve its performance. These components include stereo feature matching and local map tracking. We implement our design inside the ORB-SLAM3 tracking process using CUDA. Our evaluation demonstrates an overall improvement in tracking performance of up to 2.8x on a desktop and a Jetson Xavier NX board in stereo-inertial mode, using the well-known SLAM datasets EuRoC and TUM-VI.
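The stereo feature matching stage being offloaded is, at its core, a comparison of binary ORB descriptors by Hamming distance, which is exactly the kind of data-parallel workload that maps well to a GPU. The NumPy sketch below shows the serial computation being accelerated, with assumed array shapes and a brute-force matching strategy for clarity; it is not the authors' CUDA kernel or ORB-SLAM3's matcher.

```python
import numpy as np

def hamming_matrix(desc_left, desc_right):
    """Pairwise Hamming distances between two sets of 256-bit ORB descriptors,
    stored as (N, 32) uint8 arrays."""
    xor = desc_left[:, None, :] ^ desc_right[None, :, :]          # (NL, NR, 32)
    return np.unpackbits(xor, axis=2).sum(axis=2)                 # popcount per pair

def match(desc_left, desc_right, max_distance=50):
    """Best right-image match for every left-image descriptor, with a distance cap."""
    dist = hamming_matrix(desc_left, desc_right)
    best = dist.argmin(axis=1)
    return [(i, j) for i, j in enumerate(best) if dist[i, j] <= max_distance]

rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(500, 32), dtype=np.uint8)
right = rng.integers(0, 256, size=(480, 32), dtype=np.uint8)
print(len(match(left, right)))
```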