Browse, search and filter the latest cybersecurity research papers from arXiv
The Maven Central ecosystem forms the backbone of Java dependency management, hosting artifacts that vary significantly in their adoption, security, and ecosystem roles. Artifact reuse is fundamental in software development, and ecosystems like Maven Central facilitate this process. However, prior studies predominantly analyzed popular artifacts with numerous dependencies, leaving those without incoming dependencies (independent artifacts) unexplored. In this study, we analyzed 658,078 artifacts, of which 635,003 had at least one release. Among these, 93,101 artifacts (15.4%) were identified as independent (in-degree = 0), while the rest were classified as dependent. We assessed the influence of independent artifacts using PageRank and out-degree centrality and found that they play a substantial role in the ecosystem. Further analysis across 18 metrics showed that independent artifacts hold several advantages over, or are comparable to, dependent artifacts: comparable popularity (25.58 vs. 7.30), fewer vulnerabilities (60 CVEs vs. 179 CVEs), and zero propagated vulnerabilities. These results indicate that independent artifacts contribute substantially to the ecosystem and offer developers a safe, self-contained alternative to traditional dependencies, though with some maintainability concerns. Therefore, developers should incorporate independent artifacts into their projects with care, and artifact maintainers should prioritize this group of artifacts to mitigate the risk of transitive vulnerability propagation and improve software sustainability.
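As a minimal illustration of the centrality measures referenced above, the sketch below computes PageRank and out-degree centrality with networkx on a toy dependency graph; the artifact names and the edge orientation (an edge u -> v meaning "v depends on u", so in-degree 0 marks an artifact that declares no dependencies) are illustrative assumptions, not the study's actual data or code.

```python
import networkx as nx

# Toy dependency graph. An edge u -> v is assumed to mean "v depends on u".
edges = [
    ("commons-util", "app-service"),   # app-service depends on commons-util
    ("commons-util", "batch-tool"),
    ("json-parser", "app-service"),
]
G = nx.DiGraph(edges)

independent = [n for n, d in G.in_degree() if d == 0]  # no declared dependencies
pagerank = nx.pagerank(G)                              # global influence score
out_centrality = nx.out_degree_centrality(G)           # share of direct dependents

for a in independent:
    print(f"{a}: pagerank={pagerank[a]:.3f}, out-degree centrality={out_centrality[a]:.3f}")
```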
The scarcity of high-quality public log datasets has become a critical bottleneck in advancing log-based anomaly detection techniques. Current datasets exhibit three fundamental limitations: (1) incomplete event coverage, (2) artificial patterns introduced by static analysis-based generation frameworks, and (3) insufficient semantic awareness. To address these challenges, we present AnomalyGen, the first automated log synthesis framework specifically designed for anomaly detection. Our framework introduces a novel four-phase architecture that integrates enhanced program analysis with Chain-of-Thought (CoT) reasoning, enabling iterative log generation and anomaly annotation without requiring physical system execution. Evaluations on Hadoop and HDFS distributed systems demonstrate that AnomalyGen achieves substantially broader log event coverage (a 38-95x improvement over existing datasets) while producing more operationally realistic log sequences than static analysis-based approaches. When augmenting benchmark datasets with synthesized logs, we observe maximum F1-score improvements of 3.7% (average 1.8% improvement across three state-of-the-art anomaly detection models). This work not only establishes a high-quality benchmarking resource for automated log analysis but also pioneers a new paradigm for applying large language models (LLMs) in software engineering workflows.
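The abstract does not give implementation details, so the following is only a hypothetical, heavily simplified sketch of a generate-and-annotate loop of the kind described: for each log-relevant execution path surfaced by program analysis, an LLM is prompted with chain-of-thought instructions to synthesize a log sequence and label it. `extract_log_paths` and `ask_llm` are placeholder interfaces, not AnomalyGen's API.

```python
import json

def synthesize_logs(source_tree, extract_log_paths, ask_llm):
    """Synthesize labeled log sequences from statically extracted execution paths."""
    dataset = []
    for path in extract_log_paths(source_tree):    # log statements along one execution path
        prompt = (
            "Think step by step about the execution path below, produce the log "
            "sequence it would emit, then label the sequence as 'normal' or "
            "'anomalous' and explain why. Respond as JSON with keys "
            "'logs', 'label', and 'reason'.\n" + "\n".join(path)
        )
        dataset.append(json.loads(ask_llm(prompt)))  # no physical system execution required
    return dataset
```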
Smart contract vulnerabilities pose significant security risks to blockchain systems, potentially leading to severe financial losses. Existing methods face several limitations: (1) Program analysis-based approaches rely on predefined patterns, lacking flexibility for new vulnerability types; (2) Deep learning-based methods lack explanations; (3) Large language model-based approaches suffer from high false positive rates. We propose MOS, a smart contract vulnerability detection framework based on mixture-of-experts tuning (MOE-Tuning) of large language models. First, we conduct continual pre-training on a large-scale smart contract dataset to provide domain-enhanced initialization. Second, we construct a high-quality MOE-Tuning dataset through a multi-stage pipeline combining LLM generation and expert verification for reliable explanations. Third, we design a vulnerability-aware routing mechanism that activates the most relevant expert networks by analyzing code features and their matching degree with experts. Finally, we extend the feed-forward layers into multiple parallel expert networks, each specializing in specific vulnerability patterns. We employ a dual-objective loss function: one for optimizing detection and explanation performance, and another for ensuring a reasonable distribution of vulnerability types to experts through entropy calculation. Experiments show that MOS significantly outperforms existing methods with average improvements of 6.32% in F1 score and 4.80% in accuracy. The vulnerability explanations achieve positive ratings (scores of 3-4 on a 4-point scale) of 82.96%, 85.21% and 94.58% for correctness, completeness, and conciseness in human and LLM evaluation.
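A hedged sketch (not the MOS implementation) of the two ideas described above: a routing layer that weights parallel expert feed-forward networks per input, and a dual-objective loss that combines the detection objective with an entropy term over expert assignments. Layer sizes, names, and the exact loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Feed-forward layer extended into parallel experts with a learned router."""

    def __init__(self, d_model=256, d_ff=512, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # routing scores per input
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (batch, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        out = (gate.unsqueeze(-1) * expert_out).sum(dim=1)  # weighted mixture of experts
        # Entropy of the average routing distribution; a higher value means
        # inputs are spread more evenly across experts.
        avg_gate = gate.mean(dim=0)
        routing_entropy = -(avg_gate * torch.log(avg_gate + 1e-9)).sum()
        return out, routing_entropy

def dual_objective_loss(logits, labels, routing_entropy, lam=0.01):
    """Detection loss plus a weighted entropy bonus for balanced expert use."""
    return F.cross_entropy(logits, labels) - lam * routing_entropy
```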
In the AI community, benchmarks to evaluate model quality are well established, but an equivalent approach to benchmarking products built upon generative AI models is still missing. This has had two consequences. First, it has made teams focus on model quality over the developer experience, while successful products combine both. Second, product teams have struggled to answer questions about their products in relation to their competitors. In this case study, we share: (1) our process to create robust, enterprise-grade and modular components to support the benchmarking of the developer experience (DX) dimensions of our team's AI for code offerings, and (2) the components we have created to do so, including demographics and attitudes-towards-AI surveys, a benchmarkable task, and task and feature surveys. By doing so, we hope to lower the barrier to the DX benchmarking of genAI-enhanced code products.
To support junior and senior architects, I propose developing a new architecture creation method that leverages LLMs' evolving capabilities. This method involves the architect's close collaboration with LLM-fueled tooling throughout the whole process. The architect is guided through Domain Model creation, Use Case specification, architectural decisions, and architecture evaluation. While the architect can take complete control of the process and the results, using the tooling as a set of building blocks, they can also follow the intended process for maximum tooling support. The preliminary results suggest the feasibility of this process and indicate major time savings for the architect.
Context. Microservice-based systems have gained significant attention over the past years. A critical factor for understanding and analyzing the behavior of these systems is the collection of monitoring data such as logs, metrics, and traces. These data modalities can be used for anomaly detection and root cause analysis of failures. In particular, multi-modal methods utilizing several types of this data at once have gained traction in the research community since these three modalities capture different dimensions of system behavior. Aim. We provide a dataset that supports research on anomaly detection and architectural degradation in microservice systems. We generate a comprehensive dataset of logs, metrics, and traces from a production microservice system to enable the exploration of multi-modal fusion methods that integrate multiple data modalities. Method. We dynamically tested the various APIs of the microservice-based system, which implements the OAuth 2.0 protocol, using the Locust load-testing tool. For each execution of the prepared test suite, we collected logs and performance metrics for correct and erroneous calls, with the data labeled according to the error triggered during the call. Contributions. We collected approximately 657,000 individual log files, totaling over two billion log lines. In addition, we collected more than 45 million individual metric files that contain 485 unique metrics. We provide an initial analysis of the logs, identify key metrics through PCA, and discuss the challenges of collecting traces for this system. Moreover, we highlight possibilities for creating a more fine-grained version of the dataset. This work advances anomaly detection in microservice systems using multiple data sources.
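For illustration, a minimal Locust user class of the kind such a test suite might contain: one well-formed OAuth 2.0 token request and one deliberately erroneous request whose failure is expected and can be labeled accordingly. The endpoint path and credentials are placeholders, not the authors' actual test suite.

```python
from locust import HttpUser, task, between

class OAuthApiUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def valid_token_request(self):
        # Well-formed client-credentials request (expected to succeed).
        self.client.post("/oauth/token", data={
            "grant_type": "client_credentials",
            "client_id": "demo-client",
            "client_secret": "demo-secret",
        })

    @task
    def invalid_client_request(self):
        # Deliberately erroneous call: wrong secret; the resulting logs and
        # metrics can be labeled with the error that was triggered.
        with self.client.post("/oauth/token", data={
            "grant_type": "client_credentials",
            "client_id": "demo-client",
            "client_secret": "wrong-secret",
        }, catch_response=True) as resp:
            if resp.status_code == 401:
                resp.success()  # the failure is intentional, not a test error
            else:
                resp.failure(f"unexpected status {resp.status_code}")
```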
Just in time defect prediction (JIT DP) leverages ML to identify defect-prone code commits, enabling quality assurance (QA) teams to allocate resources more efficiently by focusing on commits that are most likely to contain defects. Although JIT DP techniques have improved predictive accuracy, they are still susceptible to misclassification errors such as false positives and false negatives. This can lead to wasted resources or undetected defects, a particularly critical concern when QA resources are limited. To mitigate these challenges and preserve the practical utility of JIT DP tools, it becomes essential to estimate the reliability of the predictions, i.e., to compute confidence scores. Such scores can help practitioners determine the trustworthiness of predictions and thus prioritize them efficiently. A simple approach to computing confidence scores is to extract, alongside each prediction, the corresponding prediction probabilities and use them as indicators of confidence. However, for these probabilities to reliably serve as confidence scores, the predictive model must be well-calibrated. This means that the prediction probabilities must accurately represent the true likelihood of each prediction being correct. Miscalibration, common in modern ML models, distorts probability scores such that they do not align with the actual probability of correctness. In this study, we evaluate the calibration of three JIT DP techniques to determine whether and to what extent they exhibit poor calibration. Furthermore, we assess whether post-calibration methods can improve the calibration of existing JIT defect prediction models. Our results reveal that all evaluated JIT DP models exhibit some level of miscalibration, with expected calibration error (ECE) ranging from 2% to 35%. Furthermore, post-calibration methods do not consistently improve calibration.
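As a concrete reference for the calibration notion discussed above, the sketch below computes a binned expected calibration error (ECE) on synthetic predictions from an overconfident model; the binning scheme and data are illustrative, not the study's evaluation code.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between predicted confidence and observed defect rate per bin."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            confidence = probs[mask].mean()  # mean predicted defect probability in the bin
            observed = labels[mask].mean()   # fraction of commits that were actually defective
            ece += mask.mean() * abs(confidence - observed)
    return ece

# An overconfident model: predicts 0.9 defect probability, but only 60% are defective.
probs = np.full(100, 0.9)
labels = np.r_[np.ones(60), np.zeros(40)]
print(f"ECE = {expected_calibration_error(probs, labels):.2f}")  # ~0.30
```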
As Ethereum continues to thrive, the Ethereum Virtual Machine (EVM) has become the cornerstone powering tens of millions of active smart contracts. Intuitively, security issues in EVMs could lead to inconsistent behaviors among smart contracts or even denial of service of the entire blockchain network. However, to the best of our knowledge, only a limited number of studies focus on the security of EVMs. Moreover, they suffer from 1) insufficient test input diversity and invalid semantics; and 2) the inability to automatically identify bugs and locate root causes. To bridge this gap, we propose OpDiffer, a differential testing framework for EVMs, which takes advantage of LLMs and static analysis methods to address the above two limitations. We conducted the largest-scale evaluation to date, covering nine EVMs and uncovering 26 previously unknown bugs, 22 of which have been confirmed by developers and three of which have been assigned CNVD IDs. Compared to state-of-the-art baselines, OpDiffer improves code coverage by up to 71.06%, 148.40%, and 655.56%, respectively. Through an analysis of real-world deployed Ethereum contracts, we estimate that 7.21% of the contracts could trigger our identified EVM bugs under certain environmental settings, potentially resulting in a severe negative impact on the Ethereum ecosystem.
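A simplified, hypothetical sketch of the differential-testing idea behind such a framework: execute the same bytecode and calldata on several EVM implementations and flag any divergence as a candidate bug. The `run` callables are placeholders, not real client APIs.

```python
from typing import Callable, Dict

def differential_test(bytecode: bytes, calldata: bytes,
                      evms: Dict[str, Callable[[bytes, bytes], dict]]) -> Dict[str, dict]:
    """Run the same input through every EVM implementation and report divergences."""
    results = {name: run(bytecode, calldata) for name, run in evms.items()}
    baseline_name, baseline = next(iter(results.items()))
    for name, outcome in results.items():
        if outcome != baseline:
            # Any disagreement in status, return data, or gas usage is a candidate bug.
            print(f"divergence between {name} and {baseline_name}: {outcome} != {baseline}")
    return results
```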
Zero-knowledge (ZK) circuits enable privacy-preserving computations and are central to many cryptographic protocols. Systems like Circom simplify ZK development by combining witness computation and circuit constraints in one program. However, even small errors can compromise the security of ZK programs: under-constrained circuits may accept invalid witnesses, while over-constrained ones may reject valid ones. Static analyzers are often imprecise, with high false positive rates, and formal tools struggle with real-world circuit scale. Additionally, existing tools overlook several critical behaviors, such as intermediate computations and program aborts, and thus miss many vulnerabilities. Our theoretical contribution is the Trace-Constraint Consistency Test (TCCT), a foundational language-independent formulation of ZK circuit bugs that defines bugs as discrepancies between the execution traces of the computation and the circuit constraints. TCCT captures both intermediate computations and program aborts, detecting bugs that elude prior tools. Our systems contribution is zkFuzz, a novel program mutation-based fuzzing framework for detecting TCCT violations. zkFuzz systematically mutates the computational logic of ZK programs guided by a novel fitness function, and injects carefully crafted inputs using tailored heuristics to expose bugs. We evaluated zkFuzz on 354 real-world ZK circuits written in Circom, a leading programming system for ZK development. zkFuzz successfully identified 66 bugs, including 38 zero-days, 18 of which were confirmed by developers and 6 fixed, earning bug bounties.
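A high-level, hypothetical sketch of a TCCT-style check as described above (not the zkFuzz implementation): run the original witness computation and mutated variants on crafted inputs and compare each trace against the circuit constraints; constraints accepting a wrong trace indicate under-constraining, and constraints rejecting the correct trace indicate over-constraining. All callables are placeholders.

```python
def fuzz_circuit(program, mutate_program, constraints_hold, inputs, n_mutants=100):
    """Report trace-constraint inconsistencies (TCCT violations) on the given inputs."""
    findings = []
    for x in inputs:
        reference = program(x)                     # correct witness trace (may record aborts)
        if not constraints_hold(reference):
            findings.append(("over-constrained", x, reference))   # valid witness rejected
        for _ in range(n_mutants):
            mutant = mutate_program(program)       # altered computational logic
            trace = mutant(x)
            if trace != reference and constraints_hold(trace):
                findings.append(("under-constrained", x, trace))  # invalid witness accepted
    return findings
```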
Continuous Integration and Continuous Deployment (CI/CD) pipelines automate software development to speed up engineering and improve its efficiency. These workflows consist of various jobs, such as code validation and testing, which must complete before developers receive feedback. Jobs can fail, leading to unnecessary delays in build times, decreased developer productivity, and increased costs for companies. To explore how companies adopt CI/CD workflows and balance cost with quality assurance during optimization, we studied 4 companies, reporting industry experiences with CI/CD practices. Our findings reveal that organizations often blur the distinction between CI and CD, and that code merge and product release serve as more effective milestones for process optimization and risk control. While numerous tools and research efforts target the post-merge phase to enhance productivity, limited attention has been given to the pre-merge phase, where early failure prevention has greater impact and carries less risk.
Performance benchmarking is a common practice in software engineering, particularly when building large-scale, distributed, and data-intensive systems. While cloud environments offer several advantages for running benchmarks, it is often reported that benchmark results can vary significantly between repetitions -- making it difficult to draw reliable conclusions about real-world performance. In this paper, we empirically quantify the impact of cloud performance variability on benchmarking results, focusing on stream processing applications as a representative type of data-intensive, performance-critical system. In a longitudinal study spanning more than three months, we repeatedly executed an application benchmark used in research and development at Dynatrace. This allows us to assess various aspects of performance variability, particularly concerning temporal effects. With approximately 591 hours of experiments, deploying 789 Kubernetes clusters on AWS and executing 2366 benchmarks, this is likely the largest study of its kind and the only one addressing performance from an end-to-end, i.e., application benchmark perspective. Our study confirms that performance variability exists, but it is less pronounced than often assumed (coefficient of variation of < 3.7%). Unlike related studies, we find that performance does exhibit a daily and weekly pattern, although with only small variability (<= 2.5%). Re-using benchmarking infrastructure across multiple repetitions introduces only a slight reduction in result accuracy (<= 2.5 percentage points). These key observations hold consistently across different cloud regions and machine types with different processor architectures. We conclude that for engineers and researchers focused on detecting substantial performance differences (e.g., > 5%) in...
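For reference, the variability measure reported above, the coefficient of variation (CV = standard deviation / mean), can be computed across repeated benchmark executions as in the small sketch below; the throughput numbers are made up for illustration.

```python
import numpy as np

# Throughput (e.g., records/s) measured in five benchmark repetitions.
throughputs = np.array([98_400, 101_200, 99_800, 100_500, 97_900])
cv = throughputs.std(ddof=1) / throughputs.mean()
print(f"coefficient of variation = {cv:.2%}")  # small CV => low run-to-run variability
```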
In Agile/Scrum software development, the idea of retrospective meetings (retros) is one of the core elements of the project process. In this paper, we present our work in progress focusing on two aspects: analysis of potential usage of generative AI for information interaction within retrospective meetings, and visualisation of retros' information to software development teams. We also present our prototype tool RetroAI++, focusing on retros-related functionalities.
This study explores the dynamic landscape of Technical Debt (TD) topics in software engineering by examining their evolution across time, programming languages, and repositories. Despite the extensive research on identifying and quantifying TD, there remains a significant gap in understanding the diversity of TD topics and their temporal development. To address this, we conducted an explorative analysis of TD data extracted from GitHub issues spanning from 2015 to September 2023, employing BERTopic for topic modelling. This study categorises the TD topics and tracks their progression over time. Furthermore, we incorporated sentiment analysis for each identified topic, providing deeper insight into the perceptions and attitudes associated with these topics. This offers a more nuanced understanding of the trends and shifts in TD topics across time, programming languages, and repositories.
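A minimal sketch of the BERTopic step described above, using the library's public API; the data-loading side (issue texts and their creation dates) is assumed and not shown, and this is not the authors' pipeline.

```python
from bertopic import BERTopic

def model_td_topics(issue_texts, issue_dates):
    """Fit BERTopic on TD-related issue texts and track topic development over time.

    issue_texts: list[str], one entry per GitHub issue; issue_dates: matching datetimes.
    """
    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(issue_texts)
    over_time = topic_model.topics_over_time(issue_texts, issue_dates)
    return topic_model.get_topic_info(), over_time
```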
Static analysis is a cornerstone for software vulnerability detection, yet it often struggles with the classic precision-scalability trade-off. In practice, such tools often produce high false positive rates, particularly in large codebases like the Linux kernel. This imprecision can arise from simplified vulnerability modeling and over-approximation of path and data constraints. While large language models (LLMs) show promise in code understanding, their naive application to program analysis yields unreliable results due to inherent reasoning limitations. We introduce BugLens, a post-refinement framework that significantly improves static analysis precision. BugLens guides an LLM to follow traditional analysis steps by assessing buggy code patterns for security impact and validating the constraints associated with static warnings. Evaluated on real-world Linux kernel bugs, BugLens raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives and revealing four previously unreported vulnerabilities. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools.
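A hedged, simplified sketch of a post-refinement workflow of the kind described above (not BugLens itself): each static-analysis warning passes through two LLM-guided checks, security impact of the flagged pattern and satisfiability of the associated constraints, and only warnings passing both are kept. The `ask_llm` callable and the warning fields are placeholders.

```python
from typing import Callable, Dict, List

def refine_warnings(warnings: List[Dict], ask_llm: Callable[[str], str]) -> List[Dict]:
    """Keep only static-analysis warnings that pass both LLM-guided checks."""
    kept = []
    for w in warnings:
        impact = ask_llm(
            "Does the flagged code pattern below have a plausible security impact? "
            "Answer 'yes' or 'no' with a one-line justification.\n" + w["snippet"]
        )
        if not impact.strip().lower().startswith("yes"):
            continue  # likely false positive: no security impact
        feasible = ask_llm(
            "Given the code and the path/data constraints behind this warning, can the "
            "constraints actually hold at runtime? Answer 'yes' or 'no'.\n"
            f"Code:\n{w['snippet']}\nConstraints: {w['constraints']}"
        )
        if feasible.strip().lower().startswith("yes"):
            kept.append(w)  # retained as a likely true positive
    return kept
```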
Testing Android apps effectively requires a systematic exploration of the app's possible states by simulating user interactions and system events. While existing approaches have proposed several fuzzing techniques to generate various text inputs and trigger user and system events for UI state exploration, achieving high code coverage remains a significant challenge in Android app testing. The main challenges are (1) reasoning about the complex and dynamic layout of UI screens; (2) generating required inputs/events to deal with certain widgets like pop-ups; and (3) coordinating current test inputs with previous inputs to avoid getting stuck in the same UI screen without improving test coverage. To address these problems, we propose a novel, automated fuzzing approach called VLM-Fuzz for effective UI testing of Android apps. We present a novel heuristic-based depth-first search (DFS) exploration algorithm, assisted by a vision language model (VLM), to effectively explore the UI states of the app. We use static analysis to analyze the Android Manifest file and the runtime UI hierarchy XML to extract the list of components, intent-filters and interactive UI widgets. The VLM is used to reason about complex UI layouts and widgets on demand. Based on the inputs from static analysis, the VLM, and the current UI state, we apply heuristics to address the above-mentioned challenges. We evaluated VLM-Fuzz on a benchmark containing 59 apps obtained from a recent work and compared it against two state-of-the-art approaches: APE and DeepGUI. VLM-Fuzz outperforms the best baseline by 9.0%, 3.7%, and 2.1% in terms of class coverage, method coverage, and line coverage, respectively. We also ran VLM-Fuzz on 80 recent Google Play apps (i.e., updated in 2024). VLM-Fuzz detected 208 unique crashes in 24 apps, which have been reported to the respective developers.
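A simplified, hypothetical sketch of a heuristic DFS exploration loop of the kind described above (not the VLM-Fuzz implementation): visit each UI state once, fall back to a vision-language model when heuristics cannot handle a complex screen, and backtrack when a state offers no new actions. The `device` and `ask_vlm` interfaces are placeholders.

```python
def dfs_explore(device, ask_vlm, max_depth=20):
    """Depth-first UI exploration with a VLM fallback for complex screens."""
    visited = set()

    def explore(depth):
        if depth >= max_depth:
            return
        state = device.ui_state_signature()         # e.g., hash of the runtime UI hierarchy XML
        if state in visited:
            return                                  # avoid revisiting the same screen
        visited.add(state)
        actions = device.interactive_widgets()      # from static analysis + UI hierarchy
        if not actions:
            actions = ask_vlm(device.screenshot())  # VLM reasons about pop-ups / complex layouts
        for action in actions:
            device.perform(action)
            explore(depth + 1)
            device.go_back()                        # backtrack to continue the DFS

    explore(0)
```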
This paper introduces FlowUnits, a novel programming and deployment model that extends the traditional dataflow paradigm to address the unique challenges of edge-to-cloud computing environments. While conventional dataflow systems offer significant advantages for large-scale data processing in homogeneous cloud settings, they fall short when deployed across distributed, heterogeneous infrastructures. FlowUnits addresses three critical limitations of current approaches: lack of locality awareness, insufficient resource adaptation, and absence of dynamic update mechanisms. FlowUnits organize processing operators into cohesive, independently manageable components that can be transparently replicated across different regions, efficiently allocated on nodes with appropriate hardware capabilities, and dynamically updated without disrupting ongoing computations. We implement and evaluate the FlowUnits model within Renoir, an existing dataflow system, demonstrating significant improvements in deployment flexibility and resource utilization across the computing continuum. Our approach maintains the simplicity of dataflow while enabling seamless integration of edge and cloud resources into unified data processing pipelines.
This study investigates AI-driven modernization of legacy COBOL code into Java, addressing a critical challenge in aging software systems. Leveraging the Legacy COBOL 2024 Corpus -- 50,000 COBOL files from public and enterprise sources -- the approach parses the code in Java, uses AI to suggest upgrades, and visualizes the gains in a React front end. It achieves 93% accuracy and reduces complexity by 35% (from 18 to 11.7) and coupling by 33% (from 8 to 5.4), surpassing manual efforts (75%) and rule-based tools (82%). The approach offers a scalable path to rejuvenating COBOL systems, vital for industries like banking and insurance.
Recognizing that technical debt is a persistent and significant challenge requiring sophisticated management tools, TD-Suite offers a comprehensive software framework specifically engineered to automate the complex task of its classification within software projects. It leverages the advanced natural language understanding of state-of-the-art transformer models to analyze textual artifacts, such as developer discussions in issue reports, where subtle indicators of debt often lie hidden. TD-Suite provides a seamless end-to-end pipeline, managing everything from initial data ingestion and rigorous preprocessing to model training, thorough evaluation, and final inference. This allows it to support both straightforward binary classification (debt or no debt) and more valuable fine-grained classification, identifying specific categories like code, design, or documentation debt, thus enabling more targeted management strategies. To ensure the generated models are robust and perform reliably on real-world, often imbalanced, datasets, TD-Suite incorporates critical training methodologies: k-fold cross-validation assesses generalization capability, early stopping prevents overfitting to the training data, and class weighting strategies address skewed data distributions. Beyond core functionality, and acknowledging the growing importance of sustainability, the framework integrates tracking and reporting of the carbon emissions associated with the computationally intensive model training process. It also features a user-friendly Gradio web interface in a Docker container setup, simplifying model interaction, evaluation, and inference.
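As an illustration of two of the training safeguards mentioned above, class weighting for imbalanced TD labels and stratified k-fold evaluation, the sketch below uses scikit-learn utilities on made-up labels; TD-Suite's actual transformer training loop is not reproduced.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

# Illustrative, heavily imbalanced labels: 0 = no debt, 1 = debt.
labels = np.array([0] * 90 + [1] * 10)
texts = np.array([f"issue text {i}" for i in range(len(labels))])

weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
print(dict(zip(np.unique(labels), weights)))  # minority class receives a larger weight

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    # Train the classifier on texts[train_idx] using the weights above and
    # validate on texts[val_idx]; early stopping would monitor the validation loss.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```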