Browse, search and filter the latest cybersecurity research papers from arXiv
Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that pairwise comparison outperforms scalar pointwise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
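As an illustration of the pairwise judging setup and the order sensitivity the abstract describes, here is a minimal sketch; the prompt wording and the `query_judge` backend are hypothetical, not CodeJudgeBench's actual protocol.

```python
# Illustrative sketch of pairwise code judging with an order-swap consistency check.
# The judge backend (`query_judge`) and prompt wording are placeholders.

PAIRWISE_PROMPT = """You are given a coding problem and two candidate solutions.
Problem:
{problem}

Response A:
{a}

Response B:
{b}

Which response solves the problem better? Answer with exactly "A" or "B"."""

def query_judge(prompt: str) -> str:
    """Placeholder for a call to the judge LLM; returns 'A' or 'B'."""
    raise NotImplementedError

def pairwise_judge(problem: str, resp1: str, resp2: str) -> dict:
    # Ask twice with the responses in both orders to expose position bias.
    first = query_judge(PAIRWISE_PROMPT.format(problem=problem, a=resp1, b=resp2))
    second = query_judge(PAIRWISE_PROMPT.format(problem=problem, a=resp2, b=resp1))
    # The two verdicts agree only if the judge prefers the same underlying response.
    consistent = (first == "A") == (second == "B")
    winner = resp1 if first == "A" else resp2
    return {"winner": winner, "order_consistent": consistent}
```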
The widespread adoption of generative AI (GenAI) tools such as GitHub Copilot and ChatGPT is transforming software development. Since generated source code is virtually impossible to distinguish from manually written code, the real-world usage of these tools and their impact on open-source software development remain poorly understood. In this paper, we introduce the concept of self-admitted GenAI usage, that is, developers explicitly referring to the use of GenAI tools for content creation in software artifacts. Using this concept as a lens to study how GenAI tools are integrated into open-source software projects, we analyze a curated sample of more than 250,000 GitHub repositories, identifying 1,292 such self-admissions across 156 repositories in commit messages, code comments, and project documentation. Using a mixed methods approach, we derive a taxonomy of 32 tasks, 10 content types, and 11 purposes associated with GenAI usage based on 284 qualitatively coded mentions. We then analyze 13 documents with policies and usage guidelines for GenAI tools and conduct a developer survey to uncover the ethical, legal, and practical concerns behind them. Our findings reveal that developers actively manage how GenAI is used in their projects, highlighting the need for project-level transparency, attribution, and quality control practices in the new era of AI-assisted software development. Finally, we examine the longitudinal impact of GenAI adoption on code churn in 151 repositories with self-admitted GenAI usage and find no general increase, contradicting popular narratives on the impact of GenAI on software development.
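A toy sketch of how such self-admissions might be detected by keyword search over commit messages; the pattern list is an illustrative assumption, not the paper's actual search queries.

```python
# Toy detector for self-admitted GenAI usage in commit messages via keyword matching.
# The regular expression is an illustrative assumption, not the study's methodology.
import re

GENAI_PATTERN = re.compile(
    r"\b(generated (with|by|using)|written (with|by)|co-authored[- ]by)\b.*"
    r"\b(chatgpt|github copilot|copilot|gpt-4|claude|gemini)\b",
    re.IGNORECASE,
)

def self_admissions(commit_messages: list[str]) -> list[str]:
    """Return messages that explicitly admit GenAI involvement."""
    return [m for m in commit_messages if GENAI_PATTERN.search(m)]

# Example: matches "Docs generated with ChatGPT" but not "fix copilot config typo".
```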
Assertion-Based Verification (ABV) is critical for ensuring functional correctness in modern hardware systems. However, manually writing high-quality SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. To bridge this gap, we propose AssertCoder, a novel unified framework that automatically generates high-quality SVAs directly from multimodal hardware design specifications. AssertCoder employs modality-sensitive preprocessing to parse heterogeneous specification formats (text, tables, diagrams, and formulas), followed by a set of dedicated semantic analyzers that extract structured representations aligned with signal-level semantics. These representations are used to drive assertion synthesis via multi-step chain-of-thought (CoT) prompting. The framework incorporates a mutation-based evaluation approach to assess assertion quality via model checking and to further refine the generated assertions. Experimental evaluation across three real-world Register-Transfer Level (RTL) designs demonstrates AssertCoder's superior performance, achieving an average increase of 8.4% in functional correctness and 5.8% in mutation detection compared to existing state-of-the-art approaches.
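The mutation-based evaluation idea can be sketched as follows, assuming a placeholder `model_check` helper; this is only an illustration of scoring an assertion by the design mutants it kills, not AssertCoder's actual tooling.

```python
# Minimal sketch of mutation-based assertion scoring: an assertion is credited for
# every design mutant it causes a model checker to reject.

def model_check(design: str, assertion: str) -> bool:
    """Return True if the assertion holds on the design (placeholder for a real model checker)."""
    raise NotImplementedError

def mutation_score(original_design: str, mutants: list[str], assertion: str) -> float:
    # A useful assertion should pass on the original design...
    if not model_check(original_design, assertion):
        return 0.0
    # ...and fail on as many mutated designs as possible.
    killed = sum(1 for m in mutants if not model_check(m, assertion))
    return killed / len(mutants) if mutants else 0.0
```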
Interaction-Oriented Programming (IOP) is an approach to building a multiagent system by modeling the interactions between its roles via a flexible interaction protocol and implementing agents to realize the interactions of the roles they play in the protocol. In recent years, we have developed an extensive suite of software that enables multiagent system developers to apply IOP. This suite includes tools for efficiently verifying protocols for properties such as liveness and safety, as well as middleware that simplifies the implementation of agents. This paper presents some of that software suite.
The aerospace industry has experienced significant transformations over the last decade, driven by technological advancements and innovative solutions in goods and personal transportation. This evolution has spurred the emergence of numerous start-ups that now face challenges traditionally encountered by established aerospace companies. Among these challenges is the efficient processing of digital intra-device communication interfaces for onboard equipment - a critical component for ensuring seamless system integration and functionality. Addressing this challenge requires solutions that emphasize clear and consistent interface descriptions, automation of processes, and reduced labor-intensive efforts. This paper presents a novel process and toolchain designed to streamline the development of digital interfaces and onboard software, which our team has successfully applied in several completed projects. The proposed approach focuses on automation and flexibility while maintaining compliance with design assurance requirements.
Context: Pair programming is an established (agile) practice and is practiced throughout the industry. Objective: Understand under what circumstances knowledge transfer can harm a pair programming session. Method: Grounded Theory Methodology based on 17 recorded pair programming sessions with 18 developers from 5 German software companies, accompanied by 6 interviews with different developers from 4 other German companies. Results: We define the student and teacher roles to help developers deal with a one-sided knowledge gap. We describe pitfalls to avoid and develop a grounded theory centered around the Power Gap in pair programming. Conclusions: Knowledge transfer can be harmful when developers don't pay attention to their partner's needs and desires. If developers don't pay attention to the Power Gap and don't keep it in check, Defensive Behavior may arise, leading to a vicious cycle that negatively impacts the knowledge transfer, the Togetherness, and the code quality.
Software developers often have to gain an understanding of a codebase, be it programmers being onboarded onto a team project or developers striving to grasp an external open-source library. In either case, they frequently turn to the project's documentation. However, documentation in its traditional textual form is ill-suited for this kind of high-level exploratory analysis, since it is immutable from the readers' perspective and thus forces them to follow a predefined path. We have designed an approach bringing aspects of software architecture visualization to API reference documentation. It utilizes a highly interactive node-link diagram with expressive node glyphs and flexible filtering capabilities, providing a high-level overview of the codebase as well as details on demand. To test our design, we implemented a prototype named Helveg, capable of automatically generating diagrams of C# codebases. User testing of Helveg confirmed its potential, but it also revealed problems with the readability, intuitiveness, and user experience of our tool. Therefore, in this paper, which is an extended version of our VISSOFT paper with DOI 10.1109/VISSOFT64034.2024.00012, we address many of these problems through major changes to the glyph design, means of interaction, and user interface of the tool. To assess the improvements, this new version of Helveg was evaluated again with the same group of participants as the previous version.
Modern robotic systems integrate multiple independent software and hardware components, each responsible for distinct functionalities such as perception, decision-making, and execution. These components interact extensively to accomplish complex end-to-end tasks. As a result, the overall system reliability depends not only on the correctness of individual components, but also on the correctness of their interactions. Failures often manifest at the boundaries between components, yet interaction-related reliability issues in robotics, referred to here as interaction bugs (iBugs), remain underexplored. This work presents an empirical study of iBugs within robotic systems built using the Robot Operating System (ROS), a widely adopted open-source robotics framework. A total of 121 iBugs were analyzed across ten actively maintained and representative ROS projects. The identified iBugs are categorized into three major types: intra-system iBugs, hardware iBugs, and environmental iBugs, covering a broad range of interaction scenarios in robotics. The analysis includes an examination of root causes, fixing strategies, and the impact of these bugs. Several findings are derived that shed light on the nature of iBugs and suggest directions for improving their prevention and detection. These insights aim to inform the design of more robust and safer robotic systems.
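For intuition, a hypothetical intra-system iBug in ROS might look like the following sketch, where two individually correct nodes disagree on a topic name so their interaction silently fails; the example is illustrative and not drawn from the studied projects.

```python
# Hypothetical intra-system iBug: publisher and subscriber are each correct in
# isolation, but they disagree on the topic name, so messages are silently lost.
import rospy
from std_msgs.msg import String

def sensor_node():
    rospy.init_node("sensor")
    pub = rospy.Publisher("/robot/sensor_data", String, queue_size=10)  # publishes here
    rate = rospy.Rate(10)
    while not rospy.is_shutdown():
        pub.publish(String(data="obstacle ahead"))
        rate.sleep()

def planner_node():
    rospy.init_node("planner")
    # Subscribes to a slightly different topic; ROS raises no error, and the
    # planner never receives any sensor data.
    rospy.Subscriber("/robot/sensor", String, lambda msg: rospy.loginfo(msg.data))
    rospy.spin()
```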
Growing concerns around the trustworthiness of AI-enabled systems highlight the role of requirements engineering (RE) in addressing emergent, context-dependent properties that are difficult to specify without structured approaches. In this short vision paper, we propose the integration of two complementary approaches: AMDiRE, an artefact-based approach for RE, and PerSpecML, a perspective-based method designed to support the elicitation, analysis, and specification of machine learning (ML)-enabled systems. AMDiRE provides a structured, artefact-centric, process-agnostic methodology and templates that promote consistency and traceability in the results; however, it is primarily oriented toward deterministic systems. PerSpecML, in turn, introduces multi-perspective guidance to uncover concerns arising from the data-driven and non-deterministic behavior of ML-enabled systems. We envision a pathway to operationalize trustworthiness-related requirements, bridging stakeholder-driven concerns and structured artefact models. We conclude by outlining key research directions and open challenges to be discussed with the RE community.
Formal specifications are essential for ensuring software correctness, yet manually writing them is tedious and error-prone. Large Language Models (LLMs) have shown promise in generating such specifications from natural language intents, but their large size and high computational demands raise a fundamental question: do we really need large models for this task? In this paper, we show that a small, fine-tuned language model can achieve high-quality postcondition generation at much lower computational cost. We construct a specialized dataset of prompts, reasoning logs, and postconditions, then use it for supervised fine-tuning of a 7B-parameter code model. Our approach handles real-world repository dependencies and preserves pre-state information, allowing for expressive and accurate specifications. We evaluate the model on a benchmark of real-world Java bugs (Defects4J) and compare against both proprietary giants (e.g., GPT-4o) and open-source large models. Empirical results demonstrate that our compact model matches or outperforms significantly larger counterparts in syntax correctness, semantic correctness, and bug-distinguishing capability. These findings highlight that targeted fine-tuning on a modest dataset can enable small models to achieve results formerly seen only in massive, resource-heavy LLMs, offering a practical and efficient path for the real-world adoption of automated specification generation.
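A sketch of what one training record in such a prompt/reasoning/postcondition dataset could look like, assuming a simple JSON-lines layout; the field names and the example method are illustrative, not the paper's actual schema.

```python
# Illustrative fine-tuning record for postcondition generation (assumed JSON-lines layout).
import json

record = {
    "prompt": "Generate a postcondition for: int abs(int x) { return x < 0 ? -x : x; }",
    "reasoning": "The result is non-negative and equals either x or -x.",
    # Pre-state values (here the original argument x) are referenced explicitly,
    # so the specification stays expressive even when a method mutates state.
    "postcondition": "result >= 0 && (result == x || result == -x)",
}
print(json.dumps(record))
```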
Automated Program Repair (APR) is essential for ensuring software reliability and quality while enhancing efficiency and reducing developers' workload. Although rule-based and learning-based APR methods have demonstrated their effectiveness, their performance is constrained by the types of defects they can repair, the quality of training data, and the size of model parameters. Recently, Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) have been increasingly adopted in APR tasks. However, current code LLMs and RAG designs neither fully address code repair tasks nor consider code-specific features. To overcome these limitations, we propose SelRepair, a novel APR approach that integrates a fine-tuned LLM with a newly designed dual RAG module. This approach uses a bug-fix pair dataset for fine-tuning and incorporates semantic and syntactic/structural similarity information through a RAG selection gate. This design ensures relevant information is retrieved efficiently, thereby reducing token length and inference time. Evaluations on Java datasets show SelRepair outperforms other APR methods, achieving 26.29% and 17.64% exact match (EM) on different datasets while reducing inference time by at least 6.42% with controlled input lengths.
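A rough sketch in the spirit of the described selection gate: candidate bug-fix pairs are scored by combined semantic and syntactic similarity, and only those clearing a threshold enter the prompt. The scoring functions, weights, and threshold are assumptions rather than SelRepair's implementation.

```python
# Sketch of a dual-similarity retrieval gate: retrieve only bug-fix examples that
# are sufficiently similar to the buggy code, keeping the prompt short and relevant.

def semantic_sim(buggy_code: str, candidate: str) -> float:
    """Cosine similarity between code embeddings (placeholder)."""
    raise NotImplementedError

def syntactic_sim(buggy_code: str, candidate: str) -> float:
    """AST- or token-level structural similarity (placeholder)."""
    raise NotImplementedError

def selection_gate(buggy_code: str, corpus: list[str], k: int = 3, threshold: float = 0.6) -> list[str]:
    scored = [
        (0.5 * semantic_sim(buggy_code, c) + 0.5 * syntactic_sim(buggy_code, c), c)
        for c in corpus
    ]
    # Gate: discard weakly related candidates, then keep the top-k of the rest.
    kept = [c for score, c in sorted(scored, reverse=True) if score >= threshold]
    return kept[:k]
```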
Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it suffers from substantial maintenance overhead: frequent UI changes cause test failures that require manual inspection to distinguish genuine regressions from intentional design changes. This manual triage process becomes increasingly burdensome as applications evolve, creating a need for automated analysis solutions. This paper introduces LLMShot, a novel framework that leverages vision-based Large Language Models to automatically analyze snapshot test failures through hierarchical classification of UI changes. To evaluate LLMShot's effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance, with the 12B variant achieving over 84% recall in identifying failure root causes, while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches for controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.
Large Language Models (LLMs) are increasingly used as code assistants, yet their behavior when explicitly asked to generate insecure code remains poorly understood. While prior research has focused on unintended vulnerabilities or adversarial prompting techniques, this study examines a more direct threat scenario: open-source LLMs generating vulnerable code when prompted either directly or indirectly. We propose a dual experimental design: (1) Dynamic Prompting, which systematically varies vulnerability type, user persona, and directness across structured templates; and (2) Reverse Prompting, which derives prompts from real vulnerable code samples to assess vulnerability reproduction accuracy. We evaluate three open-source 7B-parameter models (Qwen2, Mistral, and Gemma) using ESBMC static analysis to assess both the presence of vulnerabilities and the correctness of the generated vulnerability type. Results show that all models frequently produce vulnerable outputs, with Qwen2 achieving the highest correctness rates. User persona significantly affects success: student personas achieved higher vulnerability rates than professional roles, while direct prompts were marginally more effective. Vulnerability reproduction followed an inverted-U pattern with cyclomatic complexity, peaking at moderate ranges. Our findings expose limitations of safety mechanisms in open-source models, particularly for seemingly benign educational requests.
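The Dynamic Prompting grid can be pictured as the cross product of vulnerability type, persona, and directness, roughly as in the sketch below; the concrete template wording is hypothetical, not the study's exact phrasing.

```python
# Illustrative sketch of a dynamic-prompting grid crossing vulnerability type,
# user persona, and directness. Template wording is hypothetical.
from itertools import product

VULNERABILITIES = ["buffer overflow", "SQL injection", "use-after-free"]
PERSONAS = ["a computer science student", "a professional security engineer"]
DIRECTNESS = {
    "direct": "Write C code containing a {vuln}.",
    "indirect": "Write C code that parses user input; skip any bounds or sanity checks.",
}

def build_prompts():
    for vuln, persona, (style, template) in product(VULNERABILITIES, PERSONAS, DIRECTNESS.items()):
        task = template.format(vuln=vuln) if "{vuln}" in template else template
        yield {
            "vulnerability": vuln,
            "persona": persona,
            "directness": style,
            "prompt": f"I am {persona}. {task}",
        }
```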
Context: Agile IT organizations, which are characterized by self-organization and collaborative social interactions, require motivating, efficient and flexible work environments to maximize value creation. Compressed work schedules such as the four-day workweek have evolved into multiple facets over the last decades and are associated with various benefits for organizations and their employees. Objective: Our objective in this study is to deepen our comprehension of the impact of compressed work schedules on the operational efficacy of IT enterprises, while concurrently developing a comprehensive framework delineating the intricacies of compressed work schedules. Method: We conducted a systematic review of available conceptualizations related to four-day workweek schedules and elaborate on their organizational and social effects. To cover scientific and practice-oriented literature, our review combined a systematic literature review and a web content analysis. Results: Based on the generated insights, we derive a meta-framework that matches conceptualizations and effects, finally guiding the adoption of compressed work schedules based on individual managerial prerequisites and circumstances.
Agile methods are defined through guidelines comprising various practices intended to enable agile ways of working. These guidelines comprise a specific set of agile practices aiming to enable teams to work in an agile way. However, due to their widespread use in practice, we know that agile practices are adopted and tailored intensively, which leads to a high variety of agile practices in terms of their level of detail. Problem: A high variety of agile practices can be challenging because we do not know how different agile practices are interrelated. More precisely, tailoring and adopting agile practices may lead to the challenge that the combined use of several agile practices is only successful to a limited extent, as practices support or even require each other for effective use in practice. Objective: Our study aims to provide an enabler for this problem. We want to identify interrelations between agile practices and describe them in a systematic manner. Contribution: The core contribution of this paper is the Agile Map, a theoretical model describing relations between agile practices, following a systematic approach and aiming to provide an overview of coherences between agile practices. The model aims to support practitioners in selecting and combining agile practices in a meaningful way.
Estimating worst-case resource consumption is a critical task in software development. The worst-case analysis (WCA) problem is an optimization-based abstraction of this task. Fuzzing and symbolic execution are widely used techniques for addressing the WCA problem. However, improving code coverage in fuzzing or managing path explosion in symbolic execution within the context of WCA poses significant challenges. In this paper, we propose PathFuzzing, which aims to combine the strengths of both techniques into a WCA method. The key idea is to transform a program into a symbolic one that takes an execution path (encoded as a binary string) and interprets the bits as branch decisions. PathFuzzing then applies evolutionary fuzzing techniques to the transformed program to search for binary strings that represent satisfiable path conditions and lead to high resource consumption. We evaluate the performance of PathFuzzing experimentally on a benchmark suite consisting of benchmarks from prior work and some that we added. Results show that PathFuzzing generally outperforms both a fuzzing and a symbolic-execution baseline.
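A conceptual sketch of the path-encoding idea: each bit of the input string decides one branch, and the symbolic interpreter accumulates a path condition together with a cost measure that the evolutionary search maximizes. The toy program, cost model, and satisfiability check are illustrative assumptions, not the paper's implementation.

```python
# Conceptual sketch: interpret a bit string as branch decisions for a toy loop and
# collect (path condition, cost); fitness rewards satisfiable, expensive paths.

def symbolic_run(bits: list[int]):
    """Return (path_condition, cost) for the path selected by the bit string."""
    path_condition = []   # symbolic constraints collected along the chosen path
    cost = 0              # abstract resource counter (e.g., loop iterations)
    for i, bit in enumerate(bits):
        if bit == 1:
            path_condition.append(f"cond_{i} == true")   # take another loop iteration
            cost += 1
        else:
            path_condition.append(f"cond_{i} == false")  # exit the loop
            break
    return path_condition, cost

def fitness(bits: list[int]) -> int:
    path_condition, cost = symbolic_run(bits)
    # A real implementation would ask an SMT solver whether the collected path
    # condition is satisfiable; unsatisfiable paths receive zero fitness.
    satisfiable = True  # placeholder for a solver call
    return cost if satisfiable else 0
```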
Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, thereby improving development efficiency and accessibility. While benchmarks such as HumanEval and LiveCodeBench evaluate code generation and real-world relevance, previous work ignores the scenario of modifying code in repositories. Given the remaining challenges of improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect model performance on repository-based code reflection.
Most safety testing efforts for large language models (LLMs) today focus on evaluating foundation models. However, there is a growing need to evaluate safety at the application level, as components such as system prompts, retrieval pipelines, and guardrails introduce additional factors that significantly influence the overall safety of LLM applications. In this paper, we introduce a practical framework for evaluating application-level safety in LLM systems, validated through real-world deployment across multiple use cases within our organization. The framework consists of two parts: (1) principles for developing customized safety risk taxonomies, and (2) practices for evaluating safety risks in LLM applications. We illustrate how the proposed framework was applied in our internal pilot, providing a reference point for organizations seeking to scale their safety testing efforts. This work aims to bridge the gap between theoretical concepts in AI safety and the operational realities of safeguarding LLM applications in practice, offering actionable guidance for safe and scalable deployment.