We analyse the problem of combining linearity, effects, and exceptions in abstract models of programming languages as the issue of providing a suitable notion of strength for a monad $T(- \oplus E)$ in a linear setting. We consider in particular for $T$ the allocation monad, which we introduce to model and study resource-safety properties. We apply these results to two linear effectful calculi, for which we establish resource-safety properties. The first calculus is a linear call-by-push-value language with two allocation effects, $\mathit{new}$ and $\mathit{delete}$; its resource-safety properties follow from the linear (and even ordered) character of the typing rules. We then explain how to integrate exceptions on top of linearity and effects by adjoining default destruction actions to types, inspired by C++/Rust destructors. We view destructors as objects $\delta : A\rightarrow TI$ in the slice category over $TI$. This construction gives rise to a second calculus, an affine ordered call-by-push-value language with exceptions and destructors, in which the weakening rule performs a side-effect. As in C++/Rust, a ``move'' operation is necessary to allow resources to be released in arbitrary order rather than in last-in-first-out order. Moving resources is modelled as an exchange rule that performs a side-effect.
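For orientation, here is a minimal sketch (ours, not taken from the paper) of the canonical strength one would ask of the exception-augmented monad $T(- \oplus E)$ in a monoidal setting:
$$\mathrm{st}_{A,B} \;:\; A \otimes T(B \oplus E) \;\longrightarrow\; T\bigl((A \otimes B) \oplus E\bigr).$$
In a linear setting the difficulty is that on the exception branch the resource $A$ is no longer available in the result and must still be disposed of, which is where destruction actions $\delta : A \rightarrow TI$ come into play.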
Directed greybox fuzzing (DGF) aims to efficiently trigger bugs at specific target locations by prioritizing seeds whose execution paths are more likely to mutate into inputs that trigger the target bugs. However, existing DGF approaches suffer from imprecise probability calculations due to their reliance on complex distance metrics derived from static analysis. The over-approximation inherent in static analysis causes a large number of irrelevant execution paths to be mistakenly considered capable of mutating into inputs that trigger the target bugs, significantly reducing fuzzing efficiency. We propose to replace static-analysis-based distance metrics with precise call-stack representations. Call stacks represent precise control flow, thereby avoiding the false information introduced by static analysis. We leverage large language models (LLMs) to predict vulnerability-triggering call stacks that guide seed prioritization. Our approach constructs call graphs through static analysis to identify methods that can potentially reach target locations, then uses LLMs to predict the call-stack sequence most likely to trigger the vulnerability. Seeds whose execution paths overlap more with the predicted call stack are prioritized for mutation. This is the first work to integrate LLMs into the core seed-prioritization mechanism of DGF. We implement our approach and evaluate it against several state-of-the-art fuzzers. On a suite of real-world programs, our approach triggers vulnerabilities $1.86\times$ to $3.09\times$ faster than the baselines. In addition, through directed patch testing, our approach identifies 10 new vulnerabilities and 2 incomplete fixes in the latest versions of the programs used in our controlled experiments, with 10 assigned CVE IDs.
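As an illustration of the prioritization step, the following is a minimal Python sketch (ours, with hypothetical field names, not the paper's implementation) that scores seeds by the overlap between their observed call stacks and the LLM-predicted vulnerability-triggering call stack:

# Minimal sketch (not the paper's implementation): rank seeds by how much of
# the LLM-predicted vulnerability-triggering call stack their execution covers.

def stack_overlap(predicted, observed):
    """Fraction of predicted call-stack frames that appear in the observed stack."""
    observed_set = set(observed)
    hits = sum(1 for frame in predicted if frame in observed_set)
    return hits / len(predicted) if predicted else 0.0

def prioritize(seeds, predicted_stack):
    """Sort seeds so those closest to the predicted call stack are mutated first."""
    return sorted(seeds, key=lambda s: stack_overlap(predicted_stack, s["call_stack"]), reverse=True)

# Example: a hypothetical predicted stack for a heap-overflow target.
predicted = ["main", "parse_file", "decode_chunk", "memcpy_unchecked"]
seeds = [
    {"id": 1, "call_stack": ["main", "parse_file", "skip_chunk"]},
    {"id": 2, "call_stack": ["main", "parse_file", "decode_chunk"]},
]
print([s["id"] for s in prioritize(seeds, predicted)])  # -> [2, 1]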
Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language Server Protocol (LSP) server for coding agents and CI, exposing deterministic, replayable workflows. Our position is that language servers provide not only structural information (definitions, references, types, diagnostics) but also an actionable process reward: machine-checked, step-wise signals that align an agent's planning loop with program reality. In this work, Lanser-CLI contributes: (i) a robust addressing scheme beyond brittle "file:line:col" via a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a principled relocation algorithm; (ii) deterministic Analysis Bundles that normalize Language Server responses and capture environment/capability metadata with stable content hashes; (iii) a safety envelope for mutating operations (rename, code actions) with preview, workspace jails, and Git-aware, transactional apply; and (iv) a process-reward functional derived from Language Server facts (diagnostic deltas, disambiguation confidence, and safe-apply checks) that is computable online and replayable offline. We formalize determinism under frozen snapshots and establish a monotonicity property for the process reward, making it suitable for process supervision and counterfactual analysis. Project Page: https://github.com/yifanzhang-pro/lanser-cli
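As a rough illustration of what a deterministic Analysis Bundle implies, here is a minimal Python sketch (ours; the field names are assumptions, not Lanser-CLI's actual schema) that normalizes a language-server response and derives a stable content hash:

# Sketch only: normalize a language-server response and derive a stable content hash,
# in the spirit of the deterministic Analysis Bundles described above.
import hashlib
import json

def bundle_hash(response: dict, env: dict) -> str:
    """Hash a canonical JSON form so identical facts yield identical digests."""
    bundle = {"response": response, "environment": env}
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

diag = {"uri": "file:///src/main.py", "diagnostics": [{"message": "unused import", "severity": 2}]}
env = {"server": "pyright", "version": "1.1.400"}  # hypothetical capability metadata
print(bundle_hash(diag, env))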
Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and a lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4.1. With further gains from iterative self-debug, it reaches an 82.4% overall execution pass rate at the 32B scale, with especially large gains in symbolic or compiler-dependent languages.
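For concreteness, here is a minimal Python sketch (ours, not the VisPlotBench protocol itself) of a multi-round self-debug loop, where generate_plot_code is a stand-in for the model under evaluation:

# Sketch of a multi-round self-debug loop: run the generated visualization code,
# feed any traceback back to the model, and retry up to a fixed budget.
import subprocess
import tempfile

def self_debug(generate_plot_code, task, rounds=3):
    """Run generated code; on failure, feed the error output back to the model and retry."""
    feedback = None
    code = ""
    for _ in range(rounds):
        code = generate_plot_code(task, feedback)            # model call (assumed interface)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        try:
            result = subprocess.run(["python", f.name], capture_output=True, text=True, timeout=60)
        except subprocess.TimeoutExpired:
            feedback = "execution timed out"
            continue
        if result.returncode == 0:
            return code, True                                # executed successfully
        feedback = result.stderr                             # error trace for the next round
    return code, False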
Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competitions, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1\% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
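As a rough illustration of the idea, the following Python sketch (ours, much simpler than the TracePile pipeline) records line-by-line variable values during a concrete execution, the kind of explicit trace a Chain of Execution rationale is built from:

# Sketch: record line-by-line variable values while executing a small function,
# yielding an explicit execution trace in the spirit of a Chain of Execution rationale.
import sys

def trace_execution(fn, *args):
    """Run fn(*args) and record (line number, local variables) at each executed line."""
    steps = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, steps = trace_execution(gcd, 12, 18)
for lineno, local_vars in steps:
    print(lineno, local_vars)   # each step shows the variable values at that line
print("result:", result)        # 6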
Unsafe Rust code is necessary for interoperability with C/C++ libraries and implementing low-level data structures, but it can cause memory safety violations in otherwise memory-safe Rust programs. Sanitizers can catch such memory errors at runtime, but introduce many unnecessary checks even for memory accesses guaranteed safe by the Rust type system. We introduce SafeFFI, a system for optimizing memory safety instrumentation in Rust binaries such that checks occur at the boundary between unsafe and safe code, handing over the enforcement of memory safety from the sanitizer to the Rust type system. Unlike previous approaches, our design avoids expensive whole-program analysis and adds much less compile-time overhead (2.64x compared to over 8.83x). On a collection of popular Rust crates and known vulnerable Rust code, SafeFFI achieves superior performance compared to state-of-the-art systems, reducing sanitizer checks by up to 98%, while maintaining correctness and flagging all spatial and temporal memory safety violations.
This paper introduces a compilation scheme for programs written in the Mimosa programming language, which builds upon the MIMOS model of computation. Mimosa describes embedded systems software as a collection of time-triggered processes which communicate through FIFO queues. We formally describe an adaptation of the Lustre compilation scheme to the semantics of Mimosa and show how the coordination layer can be mapped to real-time operating system primitives.
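As a rough Python illustration of the underlying execution model (ours; Mimosa itself is a different language), two time-triggered processes communicating through a FIFO queue can be mimicked as follows:

# Sketch (not Mimosa syntax): two periodically released processes exchanging values
# over a FIFO queue, the execution model that the compilation scheme maps to RTOS primitives.
import queue
import threading
import time

fifo = queue.Queue()

def producer(period_s=0.1, steps=5):
    for i in range(steps):
        fifo.put(i)              # emit one value per period
        time.sleep(period_s)

def consumer(period_s=0.1, steps=5):
    for _ in range(steps):
        value = fifo.get()       # blocking read from the FIFO
        print("consumed", value)
        time.sleep(period_s)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()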
Type-and-effect systems help the programmer organize data and computational effects in a program. While traditional type systems have expressive variants with sophisticated inference algorithms that are widely used in programming languages, type-and-effect systems have not yet gained widespread adoption. One reason is that type-and-effect systems are more complex, and the existing inference algorithms compromise between expressiveness, intuitiveness, and decidability. In this work, we present an effect inference algorithm for a type-and-effect system with subtyping, expressive higher-rank polymorphism, and an intuitive set-like semantics of effects. To deal with the scoping issues of higher-rank polymorphism, we delay the solving of effect constraints by transforming them into formulae of propositional logic. We prove soundness and completeness of our algorithm with respect to a declarative type-and-effect system. All the presented results have been formalized in the Rocq proof assistant, and the algorithm has been successfully implemented in a realistic programming language.
The Proto-Quipper family of programming languages aims to provide a formal foundation for the Quipper quantum programming language. Unfortunately, Proto-Quipper languages have complex operational semantics: they are inherently effectful, and they rely on set-theoretic operations and fresh name generation to manipulate quantum circuits. This makes them difficult to reason about using standard programming language techniques and, ultimately, to mechanize. We introduce Proto-Quipper-A, a rational reconstruction of Proto-Quipper languages for static circuit generation. It uses a linear $\lambda$-calculus to describe quantum circuits with normal forms that closely correspond to box-and-wire circuit diagrams. Adjoint-logical foundations integrate this circuit language with a linear/non-linear functional language and let us reconstruct Proto-Quipper's circuit programming abstractions using more primitive adjoint-logical operations. Proto-Quipper-A enjoys a simple call-by-value reduction semantics, and to illustrate its tractability as a foundation for Proto-Quipper languages, we show that it is normalizing. We show how to use standard logical relations to prove normalization of linear and substructural systems, thereby avoiding the inherent complexity of existing linear logical relations.
Memory tiering in datacenters does not achieve its full potential due to hotness fragmentation -- the intermingling of hot and cold objects within memory pages. This fragmentation prevents page-based reclamation systems from distinguishing truly hot pages from pages containing mostly cold objects, fundamentally limiting memory efficiency despite highly skewed accesses. We introduce address-space engineering: dynamically reorganizing application virtual address spaces to create uniformly hot and cold regions that any page-level tiering backend can manage effectively. HADES demonstrates this frontend/backend approach through a compiler-runtime system that tracks and migrates objects based on access patterns, requiring minimal developer intervention. Evaluations across ten data structures achieve up to 70% memory reduction with 3% performance overhead, showing that address space engineering enables existing reclamation systems to reclaim memory aggressively without performance degradation.
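As a toy illustration of the frontend's decision problem, the following Python sketch (ours, not HADES itself) splits objects into hot and cold groups from per-object access counts, the grouping a page-level backend could then manage:

# Sketch: classify objects into hot and cold groups from access counts, the kind of
# decision a compiler-runtime frontend must make before regrouping them in the address space.
def split_hot_cold(access_counts, hot_fraction=0.1):
    """access_counts: {object_id: number_of_accesses}. Returns (hot_ids, cold_ids)."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    cutoff = max(1, int(len(ranked) * hot_fraction))
    return set(ranked[:cutoff]), set(ranked[cutoff:])

counts = {"a": 900, "b": 3, "c": 850, "d": 1, "e": 2}
hot, cold = split_hot_cold(counts, hot_fraction=0.4)
print(hot, cold)   # 'a' and 'c' land in the hot region; the rest can be reclaimed aggressively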
Despite the significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific domain tasks; however, our experiments show that their limited reasoning abilities lead to suboptimal performance on complex CUDA generation. To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration. We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity to evaluate models more comprehensively. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average $2.33\times$ speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.
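To make the search ingredient concrete, here is a heavily simplified Python sketch (ours, not ReGraphT's implementation) of UCB-style selection over a graph of optimization states, one building block of a Monte Carlo graph search:

# Sketch: UCB-style selection over a graph of CUDA-optimization states, the selection
# step of a Monte Carlo graph search (greatly simplified relative to ReGraphT).
import math

def ucb_select(children, visits, values, total_visits, c=1.4):
    """Pick the child state balancing exploitation (average value) and exploration (visit count)."""
    def score(s):
        n = visits.get(s, 0)
        if n == 0:
            return float("inf")                 # always try unvisited states first
        return values[s] / n + c * math.sqrt(math.log(total_visits) / n)
    return max(children, key=score)

children = ["coalesce_loads", "use_shared_memory", "unroll_loops"]   # hypothetical states
visits = {"coalesce_loads": 4, "use_shared_memory": 2}
values = {"coalesce_loads": 3.1, "use_shared_memory": 2.4}
print(ucb_select(children, visits, values, total_visits=6))          # -> unroll_loops (unvisited)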
The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation, which is significantly important for automated circuit design. However, the lack of meaningful functional rewards hinders reinforcement learning (RL)-based preference optimization for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV), which leverages code segments associated with functionally correct output signals to optimize RL training. Because Verilog code specifies the structural interconnection of hardware gates and wires, different output signals are independent; the key insight of QiMeng-SALV is therefore to extract verified signal-aware implementations from partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Concretely, we verify the functional correctness of individual signals in a generated module by comparing them against the reference module in the training data. An abstract syntax tree (AST) is then employed to identify signal-aware code segments that can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO, which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. QiMeng-SALV thus shifts the paradigm from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B-parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.
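As a toy illustration of the signal-level filtering idea, the following Python sketch (ours; it uses a regular expression where QiMeng-SALV uses a real Verilog AST) keeps only the assignments that drive output signals verified as correct:

# Sketch (simplified: regex instead of a real Verilog AST): keep only the assignments
# that drive output signals verified as correct, discarding segments for wrong signals.
import re

def signal_correct_segments(verilog_lines, signal_ok):
    """signal_ok: {signal_name: bool} from comparing generated vs. reference outputs."""
    kept = []
    for line in verilog_lines:
        m = re.match(r"\s*assign\s+(\w+)\s*=", line)
        if m and signal_ok.get(m.group(1), False):
            kept.append(line)
    return kept

module = ["assign sum = a ^ b;", "assign carry = a & a;"]   # carry is implemented incorrectly
print(signal_correct_segments(module, {"sum": True, "carry": False}))  # keeps only the sum segment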
Objectives: This study investigates the readability and understandability of bitwise operators in programming, with the main hypothesis that performance metrics (response time and error rate) will differ between participants exposed to various questions involving bitwise operators and those who are not. Participants: Participants in this human research study include people without a programming background, novice programmers, and university students with varying programming experience (from freshmen to PhD level). There were 23 participants in this study. Study Methods: This study uses a within-subjects experimental design to assess how people with diverse programming backgrounds understand and use bitwise operators. Participants complete tasks in a JavaScript program, and their task completion times and task accuracy are recorded for analysis. Findings: The results indicate that operators can be one of the factors predicting response time, showing a small but significant effect (R-squared = 0.032, F(1, 494) = 16.5, p < .001). Additionally, operators such as OR, NOT, and Left Shift showed statistically significant differences in task completion times compared to other operators. Conclusions: While the complexity of bitwise operators did not generally result in longer task completion times, certain operators were found to be less intuitive, suggesting the need for further investigation and potential redesign for improved understandability.
As computational analysis becomes increasingly complex in health research, transparent sharing of analytical code is vital for reproducibility and trust. This practical guide, aligned with open science practices, outlines actionable recommendations for code sharing in healthcare research. Emphasising the FAIR (Findable, Accessible, Interoperable, Reusable) principles, the authors address common barriers and provide clear guidance to help make code more robust, reusable, and open to scrutiny as part of the scientific record. This supports better science and more reliable evidence for computationally driven practice, and helps researchers adhere to the new code-sharing standards and guidelines mandated by publishers and funding bodies.
We present TLLC, which extends the Two-Level Linear dependent type theory (TLL) with session-based concurrency. Equipped with Martin-L\"{o}f-style dependency, the session types of TLLC allow protocols to specify properties of communicated messages. When used in conjunction with the dependent type machinery already present in TLL, dependent session types facilitate a form of relational verification by relating concurrent programs with their idealized sequential counterparts. Correctness properties proven for sequential programs can be easily lifted to their corresponding concurrent implementations. TLLC makes session types a powerful tool for intrinsically verifying the correctness of data structures such as queues and of concurrent algorithms such as map-reduce. To extend TLL with session types, we develop a novel formulation of intuitionistic session types which we believe to be widely applicable for integrating session types into other type systems beyond the context of TLLC. We study the meta-theory of our language, proving its soundness as both a term calculus and a process calculus. To demonstrate the practicality of TLLC, we have implemented a prototype compiler that translates TLLC programs into concurrent C code, and have evaluated it extensively.
This study presents a comparative analysis of processing large datasets with the Apache Spark platform in the Java, Python, and Scala programming languages. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations, including loading data from CSV files, transforming it, and loading it into an Apache Iceberg analytical table. It was found that the performance of Spark varies significantly depending on the amount of data and the programming language used. When processing a 5-megabyte CSV file, the best result was achieved in Python: 6.71 seconds, ahead of Scala's 9.13 seconds and Java's 9.62 seconds. For processing a large CSV file of 1.6 gigabytes, all programming languages demonstrated similar results: Python was fastest at 46.34 seconds, while Scala and Java took 47.72 and 50.56 seconds, respectively. When performing a more complex operation that involved combining two CSV files into a single dataset for loading into an Apache Iceberg table, Scala demonstrated the highest performance, at 374.42 seconds. Java completed the processing in 379.8 seconds, while Python was the least efficient, with a runtime of 398.32 seconds. It follows that the programming language significantly affects the efficiency of data processing with Apache Spark: Scala and Java are more productive for processing large amounts of data and complex operations, while Python has an advantage when working with small amounts of data. These results can be useful for optimizing data-handling processes depending on specific performance requirements and the amount of information being processed.
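For reference, the measured workflow corresponds roughly to the following PySpark sketch (ours; the catalog, table, and file names are placeholders, and an Iceberg-enabled Spark session is assumed):

# Sketch of the benchmarked ETL path in PySpark: read a CSV, apply a simple
# transformation, and load the result into an Apache Iceberg table.
# Assumes a Spark session already configured with an Iceberg catalog named "local".
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-iceberg").getOrCreate()

df = (spark.read.option("header", "true").option("inferSchema", "true")
      .csv("data/input.csv"))                                  # placeholder path
df = df.withColumn("loaded_at", F.current_timestamp())         # simple transformation step

df.writeTo("local.db.events").using("iceberg").createOrReplace()
spark.stop()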
An algorithm specification in natural language or pseudocode is expected to be clear and explicit enough to enable mechanical execution. In this position paper we contribute an initial characterization of the knowledge that an executing agent, human or machine, should possess in order to carry out the instructions of a given algorithm specification as a stand-alone entity, independent of any system implementation. We argue that, for that algorithm specification, such prerequisite knowledge, whether unique to it or shared with other specifications, can be summarized in a document of practical size. We term this document the realm of the algorithm specification. The generation of such a realm is itself a systematic analytical process, significant parts of which can be automated with the help of large language models and the reuse of existing documents. The algorithm specification's realm would consist of specification-language syntax and semantics, domain knowledge restricted to the referenced entities, inter-entity relationships, relevant underlying cause-and-effect rules, and detailed instructions and means for carrying out certain operations. Such a characterization of the realm can contribute to methodological implementation of the algorithm specification in diverse systems and to its formalization for mechanical verification. The paper also touches upon the question of assessing execution faithfulness, which is distinct from correctness: in the absence of a reference interpretation of a natural-language or pseudocode specification with a given vocabulary, how can we determine whether an observed agent's execution indeed complies with the input specification?
Large Language Models (LLMs) are central to reasoning, writing, and decision-support workflows, yet users lack consistent control over how they reason and express outputs. Conventional prompt engineering relies on verbose natural-language instructions, limiting reproducibility, modularity, and interpretability. This paper introduces Prompt Decorators, a declarative, composable syntax that governs LLM behavior through compact control tokens such as +++Reasoning, +++Tone(style=formal), and +++Import(topic="Systems Thinking"). Each decorator modifies a behavioral dimension, such as reasoning style, structure, or tone, without changing task content. The framework formalizes twenty core decorators organized into two functional families (Cognitive & Generative and Expressive & Systemic), each further decomposed into subcategories that govern reasoning, interaction, expression, and session-control. It defines a unified syntax, scoping model, and deterministic processing pipeline enabling predictable and auditable behavior composition. By decoupling task intent from execution behavior, Prompt Decorators create a reusable and interpretable interface for prompt design. Illustrative use cases demonstrate improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains. The paper concludes with implications for interoperability, behavioral consistency, and the development of declarative interfaces for scalable AI systems.
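As a small illustration, the following Python sketch (ours; the actual framework defines a richer grammar, scoping model, and processing pipeline) splits a prompt into its decorator prefixes and task content:

# Sketch: separate leading +++Decorator lines from the task content of a prompt.
import re

DECORATOR = re.compile(r'^\+\+\+(\w+)(?:\(([^)]*)\))?\s*$')

def parse_prompt(text):
    decorators, body = [], []
    for line in text.splitlines():
        m = DECORATOR.match(line.strip())
        if m and not body:                      # decorators are only recognized before task content
            name, args = m.group(1), m.group(2) or ""
            params = dict(p.split("=", 1) for p in args.split(",") if "=" in p)
            decorators.append((name, params))
        else:
            body.append(line)
    return decorators, "\n".join(body).strip()

prompt = '+++Reasoning\n+++Tone(style=formal)\nSummarize the attached incident report.'
print(parse_prompt(prompt))
# -> ([('Reasoning', {}), ('Tone', {'style': 'formal'})], 'Summarize the attached incident report.')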