Browse, search and filter the latest cybersecurity research papers from arXiv
The automotive industry is undergoing a software-as-a-service transformation that enables software-defined functions and post-sale updates via cloud and vehicle-to-everything communication. Connectivity in cars introduces significant security challenges, as remote attacks on vehicles have become increasingly prevalent. Current automotive designs call for security solutions that address the entire lifetime of a vehicle. In this paper, we propose to authenticate and authorize in-vehicle services by integrating DNSSEC, DANE, and DANCE with automotive middleware. Our approach decouples the cryptographic authentication of the service from that of the service deployment with the help of DNSSEC and thereby largely simplifies key management. We propose to authenticate in-vehicle services by certificates that are solely generated by the service suppliers but published on deployment via DNSSEC TLSA records solely signed by the OEM. Building on well-established Internet standards ensures interoperability with various current and future protocols as well as scalable management of credentials for millions of connected vehicles at well-established security levels. We back our design proposal by a security analysis using the STRIDE threat model and by evaluations in a realistic in-vehicle setup that demonstrate its effectiveness.
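As a rough illustration of the DANE-style matching this design builds on, the sketch below checks a presented service certificate against the association data of a TLSA record. The record contents, field choices (DANE-EE usage, full-certificate selector, SHA-256 matching), and the helper names are illustrative assumptions, not the paper's implementation.

```python
import hashlib

# Minimal sketch of DANE-style certificate matching: the in-vehicle client
# hashes the service certificate presented during the TLS handshake and
# compares it with the association data of an OEM-signed TLSA record.
# Field values (usage=3 "DANE-EE", selector=0 "full cert", mtype=1 "SHA-256")
# are illustrative; the paper's deployment parameters may differ.

def tlsa_matches(cert_der: bytes, tlsa: dict) -> bool:
    """Check a presented certificate against one TLSA record."""
    if tlsa["selector"] != 0:           # 0 = full certificate; 1 (SPKI) omitted in this sketch
        raise NotImplementedError("sketch only handles selector 0")
    if tlsa["mtype"] == 1:              # 1 = SHA-256 of the selected content
        digest = hashlib.sha256(cert_der).hexdigest()
    elif tlsa["mtype"] == 2:            # 2 = SHA-512
        digest = hashlib.sha512(cert_der).hexdigest()
    else:                               # 0 = exact match of the DER bytes
        digest = cert_der.hex()
    return digest == tlsa["cert_assoc_data"].lower()

# Hypothetical record for an in-vehicle service, as it might be published
# under a service-specific name in the OEM's DNSSEC-signed zone.
example_tlsa = {
    "usage": 3,                         # DANE-EE: trust this end-entity certificate directly
    "selector": 0,
    "mtype": 1,
    "cert_assoc_data": hashlib.sha256(b"example-der-bytes").hexdigest(),
}

print(tlsa_matches(b"example-der-bytes", example_tlsa))   # True
```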
Large Language Models (LLMs) are increasingly used in software security, but their trustworthiness in generating accurate vulnerability advisories remains uncertain. This study investigates the ability of ChatGPT to (1) generate plausible security advisories from CVE-IDs, (2) differentiate real from fake CVE-IDs, and (3) extract CVE-IDs from advisory descriptions. Using a curated dataset of 100 real and 100 fake CVE-IDs, we manually analyzed the credibility and consistency of the model's outputs. The results show that ChatGPT generated plausible security advisories for 96% of the real CVE-IDs and 97% of the fake CVE-IDs given as input, demonstrating a limitation in differentiating between real and fake IDs. Furthermore, when these generated advisories were reintroduced to ChatGPT to identify their original CVE-ID, the model produced a fake CVE-ID for 6% of the advisories generated from real CVE-IDs. These findings highlight both the strengths and limitations of ChatGPT in cybersecurity applications. While the model demonstrates potential for automating advisory generation, its inability to reliably authenticate CVE-IDs or maintain consistency upon re-evaluation underscores the risks associated with its deployment in critical security tasks. Our study emphasizes the importance of using LLMs with caution in cybersecurity workflows and suggests the need for further improvements in their design to improve reliability and applicability in security advisory generation.
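To make the real-versus-fake distinction concrete, here is a minimal sketch separating syntactic validation of a CVE-ID from existence checking against an authoritative catalogue; the `KNOWN_CVES` set is a stand-in for querying NVD or the MITRE CVE list, and the whole example is illustrative rather than part of the study.

```python
import re

# A CVE identifier is syntactically just "CVE-<year>-<4+ digits>", so fake IDs
# can be perfectly format-valid. This sketch separates the two checks an
# advisory assistant would need: cheap local format validation versus
# existence validation against an authoritative catalogue (here a stand-in set).

CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")
KNOWN_CVES = {"CVE-2021-44228", "CVE-2014-0160"}   # placeholder for a real catalogue

def classify_cve_id(cve_id: str) -> str:
    if not CVE_PATTERN.match(cve_id):
        return "malformed"
    return "known" if cve_id in KNOWN_CVES else "well-formed but unverified"

for candidate in ["CVE-2021-44228", "CVE-2099-123456", "CVE-21-1"]:
    print(candidate, "->", classify_cve_id(candidate))
```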
Software developers frequently hard-code credentials such as passwords, generic secrets, private keys, and generic tokens in software repositories, even though doing so is strictly advised against due to the severe threat it poses to software security. These credentials create attack surfaces exploitable by a potential adversary to conduct malicious exploits such as backdoor attacks. Recent detection efforts utilize embedding models to vectorize textual credentials before passing them to classifiers for predictions. However, these models struggle to discriminate credentials in contextual and complex sequences, resulting in high false-positive predictions. Context-dependent Pre-trained Language Models (PLMs) or Large Language Models (LLMs) such as Generative Pre-trained Transformers (GPT) tackle this drawback by leveraging the transformer architecture's capacity for self-attention to capture contextual dependencies between words in input sequences. As a result, GPT has achieved wide success in several natural language understanding endeavors. Hence, we assess LLMs for representing textual credentials and feed the extracted embedding vectors to a deep learning classifier to detect hard-coded credentials. Our model outperforms the current state-of-the-art by 13% in F1 measure on the benchmark dataset. We have made all source code and data publicly available to facilitate the reproduction of all results presented in this paper.
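A structural sketch of the described pipeline follows: embed candidate strings, then classify the vectors. The `embed` function is a placeholder for an LLM embedding backend and returns random vectors purely so the example runs; the snippets and labels are toy data, not the benchmark dataset.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Sketch of the pipeline: candidate strings mined from source code are turned
# into contextual embedding vectors by an LLM, and a downstream classifier
# labels them as hard-coded credentials or not.

def embed(snippets: list, dim: int = 256) -> np.ndarray:
    # Placeholder for an LLM embedding backend; returns random vectors so the
    # example is runnable end to end.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(snippets), dim))

snippets = ['password = "hunter2"', 'timeout = 30', 'api_key = "sk_live_example"', 'retries = 5']
labels   = [1, 0, 1, 0]                 # 1 = hard-coded credential

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(embed(snippets), labels)
print(clf.predict(embed(['token = "abc123"'])))   # prediction on a new snippet
```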
Open-source software (OSS) has become increasingly popular across different domains. However, this rapid development and widespread adoption come with a security cost. The growing complexity and openness of OSS ecosystems have led to increased exposure to vulnerabilities and attack surfaces. This paper investigates the trends and patterns of reported vulnerabilities within OSS platforms, focusing on the implications of these findings for security practices. To understand the dynamics of OSS vulnerabilities, we analyze a comprehensive dataset comprising 31,267 unique vulnerability reports from GitHub's advisory database and Snyk.io, belonging to 14,675 packages across 10 programming languages. Our analysis reveals a significant surge in reported vulnerabilities, increasing at an annual rate of 98%, far outpacing the 25% average annual growth in the number of OSS packages. Additionally, we observe an 85% increase in the average lifespan of vulnerabilities across ecosystems during the studied period, indicating a potential decline in security. We identify the most prevalent Common Weakness Enumerations (CWEs) across programming languages and find that, on average, just seven CWEs are responsible for over 50% of all reported vulnerabilities. We further examine these commonly observed CWEs and highlight ecosystem-specific trends. Notably, we find that vulnerabilities associated with intentionally malicious packages comprise 49% of reports in the NPM ecosystem and 14% in PyPI, an alarming indication of targeted attacks within package repositories. We conclude with an in-depth discussion of the characteristics and attack vectors associated with these malicious packages.
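The two headline measurements can be sketched in a few lines of pandas over a report table; the column names and toy rows below are assumptions for illustration, and the lifespan definition is simplified to disclosure date minus introduction date.

```python
import pandas as pd

# Toy report table standing in for the 31,267-report dataset.
reports = pd.DataFrame({
    "cwe":        ["CWE-79", "CWE-79", "CWE-89", "CWE-22", "CWE-79", "CWE-89"],
    "introduced": pd.to_datetime(["2019-01-01"] * 6),
    "disclosed":  pd.to_datetime(["2021-06-01", "2020-03-01", "2022-01-01",
                                  "2019-07-01", "2023-02-01", "2020-12-01"]),
})

# How few CWEs account for at least half of all reports?
share = reports["cwe"].value_counts(normalize=True)
cum = share.cumsum()
n_needed = int((cum >= 0.5).values.argmax()) + 1
print(f"{n_needed} CWE(s) cover >=50% of reports:", list(share.index[:n_needed]))

# Average vulnerability lifespan (simplified: disclosure minus introduction).
lifespan_days = (reports["disclosed"] - reports["introduced"]).dt.days
print("average lifespan (days):", lifespan_days.mean())
```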
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps: they achieve at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
Permission systems, which restrict access to system resources, are a well-established technology in operating systems, especially for smartphones. However, as such systems are implemented in the operating system, they can at most manage access at the process level. Since modern software often (re)uses code from third-party libraries, a permission system for libraries can be desirable to enhance security. In this short paper, we adapt concepts from capability systems, building a novel theoretical foundation for permission systems at the level of the programming language. This leads to PermRust, a token-based permission system for the Rust programming language, realized as a zero-cost abstraction on top of its type system. With it, access to system resources can be managed per library.
Indicators of Compromise (IOCs) such as IP addresses, file hashes, and domain names are commonly used for threat detection and attribution. However, IOCs tend to be short-lived as they are easy to change. As a result, the cybersecurity community is shifting focus towards more persistent behavioral profiles, such as the Tactics, Techniques, and Procedures (TTPs) and the software used by a threat group. However, the distinctiveness and completeness of such behavioral profiles remain largely unexplored. In this work, we systematically analyze threat group profiles built from two open cyber threat intelligence (CTI) knowledge bases: MITRE ATT&CK and Malpedia. We first investigate what fraction of threat groups have group-specific behaviors, i.e., behaviors used exclusively by a single group. We find that only 34% of threat groups in ATT&CK have group-specific techniques. The software used by a threat group proves to be more distinctive, with 73% of ATT&CK groups using group-specific software. However, this percentage drops to 24% in the broader Malpedia dataset. Next, we evaluate how group profiles improve when data from both sources are combined. While coverage improves modestly, the proportion of groups with group-specific behaviors remains under 30%. We then enhance profiles by adding exploited vulnerabilities and additional techniques extracted from more threat reports. Despite the additional information, 64% of groups still lack any group-specific behavior. Our findings raise concerns about the belief that behavioral profiles can replace IOCs in threat group attribution.
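The exclusivity measure behind figures such as the 34% can be sketched with simple set arithmetic: a technique is group-specific if exactly one group uses it, and a group has a distinctive profile if at least one of its techniques is group-specific. The mapping below is toy data, not ATT&CK or Malpedia content.

```python
from collections import Counter

# Toy group-to-technique mapping; real profiles would come from ATT&CK/Malpedia.
group_techniques = {
    "GroupA": {"T1059", "T1566", "T1003"},
    "GroupB": {"T1059", "T1566"},
    "GroupC": {"T1059", "T1190"},
}

# A technique is group-specific if exactly one group uses it.
usage = Counter(t for techniques in group_techniques.values() for t in techniques)
exclusive = {t for t, n in usage.items() if n == 1}

# A group is distinguishable if it has at least one group-specific technique.
groups_with_specific = [g for g, ts in group_techniques.items() if ts & exclusive]
fraction = len(groups_with_specific) / len(group_techniques)
print(f"groups with group-specific techniques: {fraction:.0%}", groups_with_specific)
```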
Recent advancements in LLMs indicate potential for novel applications, e.g., through reasoning capabilities in the latest OpenAI and DeepSeek models. For applying these models in specific domains beyond text generation, LLM-based multi-agent approaches can be utilized that solve complex tasks by combining reasoning techniques, code generation, and software execution. Applications might utilize these capabilities and the knowledge of specialized LLM agents. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application are not well explored. Defined specifications for multi-agent LLM systems are required to explore their potential and their suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research to specify and evaluate these aspects through a multi-agent system. The system architecture and prototype are extended from previous research and a specification is introduced for multi-agent systems. Test cases involving cybersecurity tasks indicate the feasibility of the architecture and evaluation approach. In particular, the results show the evaluation of question answering, server security, and network security tasks that were completed correctly by agents with LLMs from OpenAI and DeepSeek.
Generation-based fuzzing produces appropriate test cases according to specifications of input grammars and semantic constraints to test systems and software. However, these specifications require significant manual efforts to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance. Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes -- up to 1,791,104 lines of code in our evaluation -- and 2) synthesize efficient fuzzers that catch interesting grammatical structures and semantic constraints in a human-understandable way. Our evaluation compared ELFuzz with specifications manually written by domain experts and synthesized by state-of-the-art approaches. It shows that ELFuzz achieves up to 434.8% more coverage and triggers up to 174.0% more artificially injected bugs. We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three of which are exploitable). Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz. Further analysis of the fuzzers synthesized by ELFuzz confirms that they catch interesting grammatical structures and semantic constraints in a human-understandable way. The results present the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing.
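A minimal skeleton of coverage-guided evolution over a fuzzer space, as we read the abstract, might look like the following; both helper functions are stubs (a real setup would have the LLM rewrite fuzzer source code and would measure coverage on the SUT), so this is an interpretation, not the authors' implementation.

```python
import random

# Skeleton of coverage-guided evolution over candidate fuzzers: start from
# minimal seed fuzzers, generate LLM-mutated variants, and keep the ones with
# the best coverage signal. Both helpers are stubs so the loop is runnable.

def llm_mutate(fuzzer_src: str) -> str:
    # Stub: a real implementation would prompt an LLM to rewrite the fuzzer.
    return fuzzer_src + f"  # variant {random.randint(0, 999)}"

def coverage_of(fuzzer_src: str) -> int:
    # Stub metric: a real implementation would run the fuzzer against the SUT
    # with coverage instrumentation.
    return len(fuzzer_src) % 100

def evolve(seed_fuzzers: list, generations: int = 5, population: int = 4) -> str:
    pool = list(seed_fuzzers)
    for _ in range(generations):
        candidates = pool + [llm_mutate(random.choice(pool)) for _ in range(population)]
        pool = sorted(candidates, key=coverage_of, reverse=True)[:population]
    return pool[0]

best = evolve(["def fuzz(rng): return rng.randbytes(16)"])
print("best fuzzer candidate:\n", best)
```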
Software vulnerabilities in source code pose serious cybersecurity risks, prompting a shift from traditional detection methods (e.g., static analysis, rule-based matching) to AI-driven approaches. This study presents a systematic review of software vulnerability detection (SVD) research from 2018 to 2023, offering a comprehensive taxonomy of techniques, feature representations, and embedding methods. Our analysis reveals that 91% of studies use AI-based methods, with graph-based models being the most prevalent. We identify key limitations, including dataset quality, reproducibility, and interpretability, and highlight emerging opportunities in underexplored techniques such as federated learning and quantum neural networks, providing a roadmap for future research.
As cyber threats become more sophisticated, rapid and accurate vulnerability detection is essential for maintaining secure systems. This study explores the use of Large Language Models (LLMs) in software vulnerability assessment by simulating the identification of Python code with known Common Weakness Enumerations (CWEs), comparing zero-shot, few-shot cross-domain, and few-shot in-domain prompting strategies. Our results indicate that while zero-shot prompting performs poorly, few-shot prompting significantly enhances classification performance, particularly when integrated with confidence-based routing strategies, which improve efficiency by directing human experts to the cases where model uncertainty is high and thereby optimize the balance between automation and expert oversight. We find that LLMs can effectively generalize across vulnerability categories with minimal examples, suggesting their potential as scalable, adaptable cybersecurity tools in simulated environments. However, challenges such as model reliability, interpretability, and adversarial robustness remain critical areas for future research. By integrating AI-driven approaches with expert-in-the-loop (EITL) decision-making, this work highlights a pathway toward more efficient and responsive cybersecurity workflows. Our findings provide a foundation for deploying AI-assisted vulnerability detection systems in both real and simulated environments that enhance operational resilience while reducing the burden on human analysts.
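A minimal sketch of confidence-based routing is shown below; `llm_classify` is a hypothetical stand-in for a few-shot prompt to a real model, and the threshold value is arbitrary.

```python
# Confidence-based routing: accept the model's CWE label automatically only
# when its confidence clears a threshold; otherwise send the case to a human.

CONFIDENCE_THRESHOLD = 0.8   # arbitrary cut-off for illustration

def llm_classify(code_snippet: str) -> tuple:
    # Placeholder: a real implementation would build a few-shot prompt with
    # labelled CWE examples and parse the model's answer plus a confidence score.
    return ("CWE-89", 0.62) if "execute(" in code_snippet else ("benign", 0.95)

def route(code_snippet: str) -> str:
    label, confidence = llm_classify(code_snippet)
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-label: {label} ({confidence:.2f})"
    return f"route to human expert (model suggested {label} at {confidence:.2f})"

print(route('cursor.execute("SELECT * FROM users WHERE id=" + user_id)'))
print(route('total = sum(values)'))
```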
Off-the-shelf software for Command and Control is often used by attackers and legitimate pentesters looking for discretion. Among other functionalities, these tools facilitate the customization of their network traffic so it can mimic popular websites, thereby increasing their secrecy. Cobalt Strike is one of the most famous solutions in this category, used by known advanced attacker groups such as "Mustang Panda" or "Nobelium". In response to these threats, Security Operation Centers and other defense actors struggle to detect Command and Control traffic, which often uses encryption protocols such as TLS. Network traffic metadata-based machine learning approaches have been proposed to detect encrypted malware communications or fingerprint websites over the Tor network. This paper presents a machine learning-based method to detect Cobalt Strike Command and Control activity based only on widely used network traffic metadata. The proposed method is, to the best of our knowledge, the first of its kind that is able to adapt the model it uses to the observed traffic to optimize its performance. This specificity permits our method to perform as well as or better than the state of the art while using standard features. Our method is thus easier to use in a production environment and more explainable.
Fuzzing is a widely used technique for discovering software vulnerabilities, but identifying hot bytes that influence program behavior remains challenging. Traditional taint analysis can track such bytes in a white-box manner, but suffers from scalability issues. Fuzzing-Driven Taint Inference (FTI) offers a black-box alternative, yet typically incurs significant runtime overhead due to extra program executions. We observe that the commonly used havoc mutation scheme in fuzzing can be adapted for lightweight FTI with zero extra executions. We present a computational model of havoc mode, demonstrating that it can perform FTI while generating new test cases. Building on this, we propose ZTaint-Havoc, a novel, efficient FTI with minimal overhead (3.84% on UniBench, 12.58% on FuzzBench). We further design an effective mutation algorithm utilizing the identified hot bytes. Our comprehensive evaluation shows that ZTaint-Havoc, implemented in AFL++, improves edge coverage by up to 33.71% on FuzzBench and 51.12% on UniBench over vanilla AFL++, with average gains of 2.97% and 6.12% in 24-hour fuzzing campaigns.
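The core idea of piggy-backing taint inference on havoc mutations can be illustrated with a toy example: each mutation already touches a few byte positions and yields one execution, so comparing feedback before and after the mutation attributes behavioral change to those positions at no extra execution cost. The stand-in program and the attribution threshold below are illustrative, not ZTaint-Havoc's actual algorithm.

```python
import random
from collections import Counter

# Stand-in program whose feedback depends only on bytes 4..7.
def program_feedback(data: bytes) -> int:
    return sum(data[4:8]) % 7          # pretend coverage signal

def havoc_fti(seed: bytes, rounds: int = 2000) -> set:
    base = program_feedback(seed)
    tried, changed = Counter(), Counter()
    for _ in range(rounds):
        data = bytearray(seed)
        positions = random.sample(range(len(data)), k=3)   # one havoc mutation
        for p in positions:
            data[p] = random.randrange(256)
        tried.update(positions)
        if program_feedback(bytes(data)) != base:          # feedback changed
            changed.update(positions)
    # Positions whose mutations disproportionately flip the feedback are "hot".
    return {p for p in tried if changed[p] / tried[p] > 0.5}

print("inferred hot bytes:", sorted(havoc_fti(bytes(16))))   # roughly {4, 5, 6, 7}
```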
Containerisation is a popular deployment process for application-level virtualisation using a layer-based approach. Docker is a leading provider of containerisation, and through the Docker Hub, users can supply Docker images for sharing and re-purposing popular software application containers. Using a combination of in-built inspection commands, publicly displayed image layer content, and static image scanning, Docker images are designed to ensure end users can clearly assess the content of the image before running them. In this paper we present gh0stEdit, a vulnerability that fundamentally undermines the integrity of Docker images and subverts the assumed trust and transparency they utilise. The use of gh0stEdit allows an attacker to maliciously edit Docker images, in a way that is not shown within the image history, hierarchy or commands. This attack can also be carried out against signed images (Docker Content Trust) without invalidating the image signature. We present two use case studies for this vulnerability, and showcase how gh0stEdit is able to poison an image in a way that is not picked up through static or dynamic scanning tools. Our attack case studies highlight the issues in the current approach to Docker image security and trust, and expose an attack method which could potentially be exploited in the wild without being detected. To the best of our knowledge we are the first to provide detailed discussion on the exploit of this vulnerability.
Homomorphic Encryption (HE) enables secure computation on encrypted data without decryption, allowing a great opportunity for privacy-preserving computation. In particular, domains such as healthcare, finance, and government, where data privacy and security are of utmost importance, can benefit from HE by enabling third-party computation and services on sensitive data. In other words, HE constitutes the "Holy Grail" of cryptography: data remains encrypted all the time, being protected while in use. HE's security guarantees rely on noise added to data to make relatively simple problems computationally intractable. This error-centric intrinsic HE mechanism generates new challenges related to the fault tolerance and robustness of HE itself: hardware- and software-induced errors during HE operation can easily evade traditional error detection and correction mechanisms, resulting in silent data corruption (SDC). In this work, we motivate a thorough discussion regarding the sensitivity of HE applications to bit faults and provide a detailed error characterization study of CKKS (Cheon-Kim-Kim-Song). This is one of the most popular HE schemes due to its fixed-point arithmetic support for AI and machine learning applications. We also delve into the impact of the residue number system (RNS) and the number theoretic transform (NTT), two widely adopted HE optimization techniques, on CKKS' error sensitivity. To the best of our knowledge, this is the first work that looks into the robustness and error sensitivity of homomorphic encryption and, as such, it can pave the way for critical future work in this area.
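As a purely arithmetic illustration (not a CKKS implementation) of why RNS amplifies bit faults, the sketch below flips a single bit in one residue and reconstructs the value via the Chinese Remainder Theorem; the moduli and the encoded value are arbitrary toy choices.

```python
# A single bit fault in one RNS residue is consistent with a completely
# different integer modulo q1*q2, so the error after CRT reconstruction is not
# localized to one bit of the result. Toy moduli chosen for readability.

q1, q2 = 1_000_003, 1_000_033          # coprime "RNS" moduli (toy sizes)
Q = q1 * q2

def to_rns(x: int) -> tuple:
    return x % q1, x % q2

def from_rns(r1: int, r2: int) -> int:
    # CRT reconstruction: x = r1 + q1 * ((r2 - r1) * q1^{-1} mod q2)
    inv = pow(q1, -1, q2)
    return (r1 + q1 * (((r2 - r1) * inv) % q2)) % Q

x = 123_456_789
r1, r2 = to_rns(x)
assert from_rns(r1, r2) == x

faulty_r1 = r1 ^ (1 << 5)              # flip a single low-order bit of one residue
x_faulty = from_rns(faulty_r1, r2)
print("original:", x, "reconstructed after 1-bit fault:", x_faulty)
print("absolute error:", abs(x_faulty - x))
```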
The dual-use nature of Large Language Models (LLMs) presents a growing challenge in cybersecurity. While LLMs enhance automation and reasoning for defenders, they also introduce new risks, particularly their potential to be misused for generating evasive, AI-crafted malware. Despite this emerging threat, the research community currently lacks controlled and extensible tools that can simulate such behavior for testing and defense preparation. We present MalGEN, a multi-agent framework that simulates coordinated adversarial behavior to generate diverse, activity-driven malware samples. The agents work collaboratively to emulate attacker workflows, including payload planning, capability selection, and evasion strategies, within a controlled environment built for ethical and defensive research. Using MalGEN, we synthesized ten novel malware samples and evaluated them against leading antivirus and behavioral detection engines. Several samples exhibited stealthy and evasive characteristics that bypassed current defenses, validating MalGEN's ability to model sophisticated and new threats. By transforming the threat of LLM misuse into an opportunity for proactive defense, MalGEN offers a valuable framework for evaluating and strengthening cybersecurity systems. The framework addresses data scarcity, enables rigorous testing, and supports the development of resilient and future-ready detection strategies.
As cloud environments become widespread, cybersecurity has emerged as a top priority across areas such as networks, communication, data privacy, response times, and availability. Various sectors, including industries, healthcare, and government, have recently faced cyberattacks targeting their computing systems. Ensuring secure app deployment in cloud environments requires substantial effort. With the growing interest in cloud security, conducting a systematic literature review (SLR) is critical to identifying research gaps. Continuous Software Engineering, which includes continuous integration (CI), delivery (CDE), and deployment (CD), is essential for software development and deployment. In our SLR, we reviewed 66 papers, summarising tools, approaches, and challenges related to the security of CI/CD in the cloud. We addressed key aspects of cloud security and CI/CD and reported on tools such as Harbor, SonarQube, and GitHub Actions. Challenges such as image manipulation, unauthorised access, and weak authentication were highlighted. The review also uncovered research gaps in how tools and practices address these security issues in CI/CD pipelines, revealing a need for further study to improve cloud-based security solutions.
Static analysis, a cornerstone technique in cybersecurity, offers a noninvasive method for detecting malware by analyzing dormant software without executing potentially harmful code. However, traditional static analysis often relies on biased or outdated datasets, leading to gaps in detection capabilities against emerging malware threats. To address this, our study focuses on the binary content of files as key features for malware detection. These binary contents are transformed and represented as images, which then serve as inputs to deep learning models. This method takes into account the visual patterns within the binary data, allowing the model to analyze potential malware effectively. This paper introduces the application of the CBiGAN in the domain of malware anomaly detection. Our approach leverages the CBiGAN for its superior latent space mapping capabilities, critical for modeling complex malware patterns by utilizing a reconstruction error-based anomaly detection method. We utilized several datasets including both portable executable (PE) and Object Linking and Embedding (OLE) files. We then evaluated our model against a diverse set of both PE and OLE files, including self-collected malicious executables from 214 malware families. Our findings demonstrate the robustness of this innovative approach, with the CBiGAN achieving high Area Under the Curve (AUC) results with good generalizability, thereby confirming its capability to distinguish between benign and diverse malicious files with reasonably high accuracy.
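The byte-to-image representation underlying this line of work can be sketched as follows; the image size, padding, and truncation choices are illustrative assumptions, and the CBiGAN itself is not reproduced here.

```python
import numpy as np

# Lay out the raw bytes of a PE or OLE file as a fixed-size grayscale image so
# that a convolutional model can learn visual byte-level patterns. The 64x64
# size and the pad/truncate strategy are arbitrary choices for illustration.

def bytes_to_image(raw: bytes, side: int = 64) -> np.ndarray:
    buf = np.frombuffer(raw, dtype=np.uint8)
    needed = side * side
    if buf.size < needed:                       # pad short files with zeros
        buf = np.pad(buf, (0, needed - buf.size))
    else:                                       # truncate long files (simplest choice)
        buf = buf[:needed]
    return buf.reshape(side, side).astype(np.float32) / 255.0   # normalise to [0, 1]

with open(__file__, "rb") as f:                 # any binary content works for illustration
    img = bytes_to_image(f.read())
print(img.shape, img.min(), img.max())
```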