Browse, search and filter the latest cybersecurity research papers from arXiv
Sequence-based deep learning models (e.g., RNNs) can detect malware by analyzing its behavioral sequences. However, these models are susceptible to adversarial attacks: attackers can craft adversarial samples that alter the characteristics of behavior sequences to deceive malware classifiers. Existing methods for generating adversarial samples typically delete or replace crucial behaviors in the original sequences, or insert benign behaviors that may violate behavioral constraints. Because these methods manipulate sequences directly, the resulting adversarial samples are difficult to implement or apply in practice. In this paper, we propose an adversarial attack approach based on a Deep Q-Network and a heuristic backtracking search strategy, which generates perturbation sequences that satisfy practical conditions for successful attacks. We then use a novel transformation approach that maps the modifications back to the source code, avoiding the need to modify the behavior log sequences directly. Our evaluation confirms that the approach generates adversarial samples from real-world malware behavior sequences with a high success rate in evading anomaly detection models. Furthermore, the approach is practical and generates adversarial samples while preserving the functionality of the modified software.
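A toy sketch of the underlying idea: benign, functionality-preserving calls are inserted into a behavior sequence until a detector's score drops. The greedy loop below stands in for the paper's Deep Q-Network plus heuristic backtracking search, and the scoring stub stands in for a real RNN classifier; the API names and threshold are illustrative assumptions.

```python
# Toy evasion-by-insertion sketch; not the paper's DQN/backtracking method.
BENIGN_CANDIDATES = ["RegQueryValue", "GetTickCount", "Sleep"]
SUSPICIOUS = {"CreateRemoteThread", "WriteProcessMemory"}

def detector_score(sequence):
    """Stand-in detector: fraction of suspicious calls in the sequence."""
    return sum(call in SUSPICIOUS for call in sequence) / len(sequence)

def perturb(sequence, threshold=0.2, max_insertions=10):
    seq = list(sequence)
    for _ in range(max_insertions):
        if detector_score(seq) < threshold:
            return seq                                  # evasion succeeded
        # try every (position, benign call) insertion and keep the best one
        best_score, best_seq = detector_score(seq), None
        for pos in range(len(seq) + 1):
            for call in BENIGN_CANDIDATES:
                candidate = seq[:pos] + [call] + seq[pos:]
                if detector_score(candidate) < best_score:
                    best_score, best_seq = detector_score(candidate), candidate
        if best_seq is None:
            break                                       # no insertion helps
        seq = best_seq
    return seq

trace = ["OpenProcess", "WriteProcessMemory", "CreateRemoteThread"]
print(perturb(trace))   # original calls preserved, benign calls interleaved
```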
The rapid proliferation of network applications has led to a significant increase in network attacks. According to the OWASP Top 10 report released in 2021, injection attacks rank among the top three vulnerability categories in software projects. This growing threat landscape has increased the complexity and workload of software testing, necessitating advanced tools to support agile development cycles. This paper introduces a novel test prioritization method for SQL injection vulnerabilities that enhances testing efficiency. By leveraging previous test outcomes, our method adjusts defense strength vectors for subsequent tests, optimizing the testing workflow and tailoring defense mechanisms to specific software needs. This approach aims to improve the effectiveness and efficiency of vulnerability detection and mitigation through a flexible framework that incorporates dynamic adjustments and considers the temporal aspects of vulnerability exposure.
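A minimal sketch of outcome-driven prioritization under the "defense strength vector" framing: strengths are nudged after each test and the weakest-looking defenses are tested first. The category names, update rule, and learning rate are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical per-category defense-strength tracking for SQLi test prioritization.
CATEGORIES = ["tautology", "union", "piggyback", "blind", "time-based"]

def update_strength(strengths, category, blocked, lr=0.3):
    """Nudge the estimated defense strength for one payload category."""
    observed = 1.0 if blocked else 0.0
    strengths[category] += lr * (observed - strengths[category])
    return strengths

def prioritize(strengths):
    """Schedule the apparently weakest defenses first."""
    return sorted(CATEGORIES, key=lambda c: strengths[c])

strengths = {c: 0.5 for c in CATEGORIES}          # uninformative prior
update_strength(strengths, "union", blocked=True)
update_strength(strengths, "blind", blocked=False)
print(prioritize(strengths))                      # 'blind' comes first
```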
The prevalence of cryptographic API misuse (CAM) is compromising the effectiveness of cryptography and, in turn, the security of modern systems and applications. Despite extensive efforts to develop CAM detection tools, these tools typically rely on a limited set of predefined rules drawn from human-curated knowledge. This rigid, rule-based approach hinders adaptation to the CAM patterns that evolve in real practice. To address this limitation, we propose leveraging large language models (LLMs), which are trained on publicly available cryptography-related data, to automatically detect and classify CAMs in real-world code. Our method enables the development and continuous expansion of a CAM taxonomy, supporting developers and detection tools in tracking and understanding emerging CAM patterns. Specifically, we develop an LLM-agnostic prompt engineering method to guide LLMs in detecting CAM instances in C/C++, Java, Python, and Go code and classifying them into a hierarchical taxonomy. Using a dataset of 3,492 real-world software programs, we demonstrate the effectiveness of our approach with mainstream LLMs, including GPT, Llama, Gemini, and Claude. It also allows us to quantitatively measure and compare the performance of these LLMs in analyzing CAMs in realistic code. Our evaluation produced a taxonomy with 279 base CAM categories, 36 of which are not addressed by existing taxonomies. To validate its practical value, we encode 11 newly identified CAM types into detection rules and integrate them into existing tools. Experiments show that this integration expands the tools' detection capabilities.
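A hedged sketch of what an LLM-agnostic detection prompt might look like; the wording, the taxonomy hints, and the JSON output schema are illustrative assumptions rather than the paper's actual prompt templates.

```python
# Illustrative prompt builder for CAM detection and taxonomy classification.
import textwrap

TAXONOMY_HINTS = ["weak algorithm", "hard-coded key", "static IV",
                  "insecure randomness", "improper certificate validation"]

def build_cam_prompt(snippet: str, language: str) -> str:
    return textwrap.dedent(f"""\
        You are a cryptography security auditor.
        Task: identify every cryptographic API misuse (CAM) in the {language}
        code below. For each finding, return a JSON object with the fields
        "api", "misuse", and "category" (one of: {', '.join(TAXONOMY_HINTS)};
        propose a new category if none fits). Answer with a JSON list only.

        {snippet}
        """)

snippet = 'Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");'
print(build_cam_prompt(snippet, "Java"))
# The same prompt string can be sent to any chat-completion model
# (GPT, Llama, Gemini, Claude, ...) and the reply parsed as JSON.
```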
The success and wide adoption of generative AI (GenAI), particularly large language models (LLMs), have attracted the attention of cybercriminals seeking to abuse models, steal sensitive data, or disrupt services. Moreover, securing LLM-based systems is a significant challenge, as both traditional threats to software applications and threats targeting LLMs and their integration must be mitigated. In this survey, we shed light on the security and privacy concerns of such LLM-based systems through a systematic review and comprehensive categorization of threats and defensive strategies that considers the entire software and LLM life cycles. We analyze real-world scenarios with distinct characteristics of LLM usage, spanning development to operation. In addition, threats are classified according to their severity level and the scenarios to which they pertain, facilitating identification of the most relevant threats. Recommended defense strategies are systematically categorized and mapped to the corresponding life cycle phases and the attack strategies they mitigate. This work paves the way for consumers and vendors to understand and efficiently mitigate risks when integrating LLMs into their respective solutions or organizations. It also enables the research community to benefit from the discussion of open challenges and edge cases that may hinder the secure and privacy-preserving adoption of LLM-based systems.
Software-based memory-erasure protocols are two-party communication protocols where a verifier instructs a computational device to erase its memory and send a proof of erasure. They aim at guaranteeing that low-cost IoT devices are free of malware by putting them back into a safe state without requiring secure hardware or physical manipulation of the device. Several software-based memory-erasure protocols have been introduced and theoretically analysed. Yet, many of them have not been tested for their feasibility, performance and security on real devices, which hinders their industry adoption. This article reports on the first empirical analysis of software-based memory-erasure protocols with respect to their security, erasure guarantees, and performance. The experimental setup consists of 3 modern IoT devices with different computational capabilities, 7 protocols, 6 hash-function implementations, and various performance and security criteria. Our results indicate that existing software-based memory-erasure protocols are feasible, although slow devices may take several seconds to erase their memory and generate a proof of erasure. We found that no protocol dominates across all empirical settings, defined by the computational power and memory size of the device, the network speed, and the required level of security. Interestingly, network speed and hidden constants within the protocol specification played a more prominent role in the performance of these protocols than anticipated based on the related literature. We provide an evaluation framework that, given a desired level of security, determines which protocols offer the best trade-off between performance and erasure guarantees.
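For intuition, one common software-based construction has the verifier send a random seed, the device overwrite its free memory with data expanded from that seed, and the device return a hash of the result for verification. The sketch below illustrates that general pattern; it is not any specific protocol evaluated in the study, and the memory size and hash choice are placeholders.

```python
# Toy verifier/prover exchange for a proof of memory erasure.
import hashlib
import os

MEMORY_SIZE = 64 * 1024  # toy "device memory" size in bytes

def expand(seed: bytes, length: int) -> bytes:
    """Deterministically expand a seed into `length` pseudorandom bytes."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def prover_erase_and_prove(seed: bytes) -> bytes:
    """Device overwrites its memory with the expanded seed and hashes it."""
    memory = expand(seed, MEMORY_SIZE)      # old contents are overwritten
    return hashlib.sha256(memory).digest()  # proof of erasure

def verifier_check(seed: bytes, proof: bytes) -> bool:
    expected = hashlib.sha256(expand(seed, MEMORY_SIZE)).digest()
    return proof == expected

seed = os.urandom(16)
assert verifier_check(seed, prover_erase_and_prove(seed))
```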
Modern software development increasingly depends on open-source libraries and third-party components, which are often encapsulated into containerized environments. While improving the development and deployment of applications, this approach introduces security risks, particularly when outdated or vulnerable components are inadvertently included in production environments. Software Composition Analysis (SCA) is a critical process that helps identify and manage packages and dependencies inside a container. However, unintentional modifications to the container filesystem can lead to incomplete container images, which compromise the reliability of SCA tools. In this paper, we examine the limitations of both cloud-based and open-source SCA tools when faced with such obscure images. An analysis of 600 popular containers revealed that obscure containers exist in well-known registries and trusted images and that many tools fail to analyze such containers. To mitigate these issues, we propose an obscuration-resilient methodology for container analysis and introduce ORCA (Obscuration-Resilient Container Analyzer), its open-source implementation. We reported our findings to all vendors using their appropriate channels. Our results demonstrate that ORCA effectively detects the content of obscure containers and achieves a median 40% improvement in file coverage compared to Docker Scout and Syft.
WebAssembly (Wasm) is rapidly gaining popularity as a distribution format for software components embedded in various security-critical domains. Unfortunately, despite its prudent design, WebAssembly's primary use case as a compilation target for memory-unsafe languages leaves some possibilities for memory corruption. Independently of that, Wasm is an inherently interesting target for information flow analysis due to its interfacing role. Both the information flows between a Wasm module and its embedding context, as well as the memory integrity within a module, can be described by the hyperproperty noninterference. So far, no sound, fully static noninterference analysis for Wasm has been presented, although sound reachability analyses have been. This work presents a novel and general approach to lift reachability analyses to noninterference by tracking taints on values and using value-sensitive, relational reasoning to remove them when appropriate. We implement this approach in Wanilla, the first automatic, sound, and fully static noninterference analysis for WebAssembly, and demonstrate its performance and precision by verifying memory integrity and other noninterference properties with several synthetic and real-world benchmarks.
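A toy illustration of the lifting idea: taints are attached to values and removed when relational reasoning shows the result cannot depend on the tainted input (e.g., x - x is always 0). This is a drastic simplification of Wanilla's static analysis and is only meant to convey the value-sensitive taint-removal concept.

```python
# Toy value-carrying taint propagation with one relational removal rule.
from dataclasses import dataclass

@dataclass(frozen=True)
class TVal:
    value: int
    tainted: bool

def sub(a: TVal, b: TVal) -> TVal:
    # Relational rule: x - x is 0 for every value of x, so the result reveals
    # nothing about a tainted x and the taint can soundly be dropped.
    if a is b:
        return TVal(0, tainted=False)
    return TVal(a.value - b.value, a.tainted or b.tainted)

secret = TVal(42, tainted=True)
public = TVal(7, tainted=False)
print(sub(secret, secret))   # TVal(value=0, tainted=False)
print(sub(secret, public))   # TVal(value=35, tainted=True)
```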
Authors of cryptographic software are well aware that their code should not leak secrets through its timing behavior, and, until 2018, they believed that following industry-standard constant-time coding guidelines was sufficient. However, the revelation of the Spectre family of speculative execution attacks injected new complexities. To block speculative attacks, prior work has proposed annotating the program's source code to mark secret data, with hardware using this information to decide when to speculate (i.e., when only public values are involved) or not (when secrets are in play). While these solutions are able to track secret information stored on the heap, their limitations prevent them from correctly tracking secrets on the stack, which comes at a cost in performance. This paper introduces SecSep, a transformation framework that rewrites assembly programs so that secret and public data are partitioned on the stack. By moving from the source-code level to assembly rewriting, SecSep is able to address the limitations of prior work. The key challenge in this assembly rewriting stems from the loss of semantic information during the lengthy compilation process. The key innovation of our methodology is a new variant of typed assembly language (TAL), Octal, which allows us to address this challenge. Assembly rewriting is driven by compile-time inference within Octal. We apply our technique to cryptographic programs and demonstrate that it enables secure speculation efficiently, incurring a low average overhead of 1.2%.
Symbolic execution is a powerful technique for software testing, but it suffers from limitations when encountering external functions, such as native methods or third-party libraries. Existing solutions often require additional context, expensive SMT solvers, or manual intervention to approximate these functions through symbolic stubs. In this work, we propose AutoStub, a novel approach that leverages Genetic Programming to automatically generate symbolic stubs for external functions during symbolic execution. When the symbolic executor encounters an external function, AutoStub generates training data by executing the function on randomly generated inputs and collecting the outputs. Genetic Programming then derives expressions that approximate the behavior of the function and serve as symbolic stubs. These automatically generated stubs allow the symbolic executor to continue the analysis without manual intervention, enabling the exploration of program paths that were previously intractable. We demonstrate that AutoStub can approximate external functions with over 90% accuracy for 55% of the functions evaluated and can infer language-specific behaviors that reveal edge cases crucial for software testing.
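A minimal sketch of the stub-generation loop: sample the opaque external function on random inputs, then find a simple expression that reproduces its outputs. Here a brute-force search over a tiny candidate grammar stands in for the Genetic Programming step, and the external function and candidate set are illustrative.

```python
# Toy approximation of an opaque external function by expression search.
import random

def external_abs(x):            # stands in for an opaque native function
    return x if x >= 0 else -x

CANDIDATES = {
    "x":      lambda x: x,
    "-x":     lambda x: -x,
    "x*x":    lambda x: x * x,
    "abs(x)": lambda x: abs(x),
    "x+1":    lambda x: x + 1,
}

def infer_stub(func, samples=200):
    """Return the candidate expression that best matches observed behavior."""
    inputs = [random.randint(-1000, 1000) for _ in range(samples)]
    observations = [(x, func(x)) for x in inputs]   # training data
    best_expr, best_hits = None, -1
    for expr, cand in CANDIDATES.items():
        hits = sum(1 for x, y in observations if cand(x) == y)
        if hits > best_hits:
            best_expr, best_hits = expr, hits
    return best_expr, best_hits / samples

expr, accuracy = infer_stub(external_abs)
print(expr, accuracy)   # "abs(x)" with accuracy 1.0
```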
Software ecosystems like Maven Central play a crucial role in modern software supply chains by providing repositories for libraries and build plugins. However, the separation between binaries and their corresponding source code in Maven Central presents a significant challenge, particularly when it comes to linking binaries back to their original build environment. This lack of transparency poses security risks, as approximately 84% of the top 1200 commonly used artifacts are not built using a transparent CI/CD pipeline. Consequently, users must place a significant amount of trust not only in the source code but also in the environment in which these artifacts are built. Rebuilding software artifacts from source provides a robust solution to improve supply chain security. This approach allows for a deeper review of code, verification of binary-source equivalence, and control over dependencies. However, challenges arise due to variations in build environments, such as JDK versions and build commands, which can lead to build failures. Additionally, ensuring that all dependencies are rebuilt from source across large and complex dependency graphs further complicates the process. In this paper, we introduce an extension to Macaron, an industry-grade open-source supply chain security framework, to automate the rebuilding of Maven artifacts from source. Our approach improves upon existing tools by offering better performance in source code detection and by automating the extraction of build specifications from GitHub Actions workflows. We also present a comprehensive root cause analysis of build failures in Java projects and propose a scalable solution to automate the rebuilding of artifacts, ultimately enhancing security and transparency in the open-source supply chain.
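A minimal sketch of extracting a build specification (JDK version and build command) from a GitHub Actions workflow file, one ingredient of rebuilding an artifact from source. The assumed workflow layout (an actions/setup-java step plus an mvn step) is a common pattern, not Macaron's actual extraction logic.

```python
# Hypothetical build-spec extraction from a GitHub Actions workflow (PyYAML).
import yaml

def extract_build_spec(workflow_path: str) -> dict:
    with open(workflow_path, encoding="utf-8") as f:
        workflow = yaml.safe_load(f)
    spec = {"jdk": None, "build_command": None}
    for job in workflow.get("jobs", {}).values():
        for step in job.get("steps", []):
            uses = step.get("uses", "")
            if uses.startswith("actions/setup-java"):
                jdk = step.get("with", {}).get("java-version")
                if jdk is not None:
                    spec["jdk"] = str(jdk)          # e.g. "17"
            run = step.get("run", "")
            if run and ("mvn" in run or "./mvnw" in run):
                spec["build_command"] = run.strip()  # e.g. "mvn -B package"
    return spec

# Example (path is a placeholder):
# extract_build_spec(".github/workflows/build.yml")
```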
Software supply chain attacks have increased exponentially since 2020. The primary attack vectors for supply chain attacks are: (1) software components; (2) the build infrastructure; and (3) humans (a.k.a. software practitioners). Software supply chain risk management frameworks provide a list of tasks that an organization can adopt to reduce software supply chain risk. Exhaustively adopting all the tasks of these frameworks is infeasible, necessitating prioritized adoption. Software organizations can benefit from guidance in this prioritization by learning what tasks other teams have adopted. The goal of this study is to aid software development organizations in understanding the adoption of security tasks that reduce software supply chain risk through an interview study of software practitioners engaged in software supply chain risk management efforts. Interviews were conducted with 61 practitioners at nine software development organizations that have focused efforts on reducing software supply chain risk. The results indicate that the most widely adopted tasks had been implemented before organizations began focusing on software supply chain security, and their implementation is therefore more mature. The tasks that mitigate the novel attack vectors through software components and the build infrastructure are in the early stages of adoption. Adoption of these tasks should be prioritized.
Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM's reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. In comparison, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments.
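A minimal sketch of how a deterministic task tree can constrain an agent's next step: the agent may only choose children of the current node, so off-tree (hallucinated) suggestions are filtered out. The tree contents and selection logic below are illustrative assumptions, not the paper's pipeline.

```python
# Toy task-tree filter over an LLM agent's ranked action suggestions.
TASK_TREE = {
    "reconnaissance":   ["active scanning", "search open websites"],
    "active scanning":  ["scanning ip blocks", "vulnerability scanning"],
    "initial access":   ["exploit public-facing application", "valid accounts"],
}

def allowed_actions(current_task):
    """Only children of the current node may be proposed by the agent."""
    return TASK_TREE.get(current_task, [])

def choose_next(current_task, llm_ranking):
    """Pick the highest-ranked LLM suggestion that the task tree permits."""
    permitted = set(allowed_actions(current_task))
    for suggestion in llm_ranking:
        if suggestion in permitted:
            return suggestion
    return None   # force the agent to re-plan instead of hallucinating a step

# An off-tree suggestion is discarded in favor of a permitted one.
print(choose_next("reconnaissance",
                  ["exploit unused library", "active scanning"]))
```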
We use browsers daily to access all sorts of information. Because browsers routinely process scripts, media, and executable code from unknown sources, they form a critical security boundary between users and adversaries. A common attack vector is JavaScript, which exposes a large attack surface due to the sheer complexity of modern JavaScript engines. To mitigate these threats, modern engines increasingly adopt software-based fault isolation (SFI). A prominent example is Google's V8 heap sandbox, which represents the most widely deployed SFI mechanism, protecting billions of users across all Chromium-based browsers and countless applications built on Node.js and Electron. The heap sandbox splits the address space into two parts: one part containing trusted, security-sensitive metadata, and a sandboxed heap containing memory accessible to untrusted code. On a technical level, the sandbox enforces isolation by removing raw pointers and using translation tables to resolve references to trusted objects. Consequently, an attacker cannot corrupt trusted data even with full control of the sandboxed data, unless there is a bug in how code handles data from the sandboxed heap. Despite their widespread use, such SFI mechanisms have seen little security testing. In this work, we propose a new testing technique that models the security boundary of modern SFI implementations. Following the SFI threat model, we assume a powerful attacker who fully controls the sandbox's memory. We implement this by instrumenting memory loads originating in the trusted domain and accessing untrusted, attacker-controlled sandbox memory. We then inject faults into the loaded data, aiming to trigger memory corruption in the trusted domain. In a comprehensive evaluation, we identify 19 security bugs in V8 that enable an attacker to bypass the sandbox.
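A toy harness conveying the testing idea: values that trusted code loads from attacker-controlled sandbox memory are intercepted and mutated, and resulting violations in the trusted domain are flagged. The real work instruments V8's C++ memory loads; the byte-level stand-in below, its class names, and the assertion-as-sanitizer are illustrative assumptions.

```python
# Toy fault injection on loads from an untrusted "sandbox" region.
import random

class SandboxMemory:
    """Untrusted region; every load may return attacker-chosen data."""
    def __init__(self, data: bytes, mutate_prob: float = 0.5):
        self.data = bytearray(data)
        self.mutate_prob = mutate_prob

    def load_u32(self, offset: int) -> int:
        value = int.from_bytes(self.data[offset:offset + 4], "little")
        if random.random() < self.mutate_prob:
            value ^= 1 << random.randrange(32)      # inject a bit-flip fault
        return value

def trusted_copy(sandbox: SandboxMemory, out: bytearray):
    length = sandbox.load_u32(0)          # length field lives in the sandbox
    # A missing bounds check here is exactly the class of bug being hunted;
    # the assertion stands in for a crash or sanitizer report.
    assert length <= len(out), "memory corruption in trusted domain"
    out[:length] = sandbox.data[4:4 + length]

sandbox = SandboxMemory((8).to_bytes(4, "little") + bytes(8))
for _ in range(20):
    try:
        trusted_copy(sandbox, bytearray(8))
    except AssertionError as err:
        print("bug candidate:", err)
```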
Software vulnerabilities pose serious risks to modern software ecosystems. While the National Vulnerability Database (NVD) is the authoritative source for cataloging these vulnerabilities, it often lacks explicit links to the corresponding Vulnerability-Fixing Commits (VFCs). VFCs encode precise code changes, enabling vulnerability localization, patch analysis, and dataset construction. Automatically mapping NVD records to their true VFCs is therefore critical. Existing approaches have limitations as they rely on sparse, often noisy commit messages and fail to capture the deep semantics in the vulnerability descriptions. To address this gap, we introduce PatchSeeker, a novel method that leverages large language models to create rich semantic links between vulnerability descriptions and their VFCs. PatchSeeker generates embeddings from NVD descriptions and enhances commit messages by synthesizing detailed summaries for those that are short or uninformative. These generated messages act as a semantic bridge, effectively closing the information gap between natural language reports and low-level code changes. PatchSeeker achieves 59.3% higher MRR and 27.9% higher Recall@10 than the best-performing baseline, Prospector, on the benchmark dataset. The extended evaluation on recent CVEs further confirms PatchSeeker's effectiveness. An ablation study shows that both the commit message generation method and the choice of backbone LLMs contribute positively to PatchSeeker. We also discuss limitations and open challenges to guide future work.
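A minimal sketch of the retrieval step: embed an NVD description and candidate commit messages, then rank commits by cosine similarity. TF-IDF vectors stand in here for PatchSeeker's LLM embeddings and LLM-expanded commit messages; the example texts are fabricated for illustration.

```python
# Toy description-to-commit ranking by embedding similarity (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nvd_description = ("Buffer overflow in the image parser allows remote "
                   "attackers to execute arbitrary code via a crafted PNG.")
commit_messages = [
    "fix: bounds-check chunk length before copying PNG palette data",
    "docs: update README badges",
    "refactor: rename internal helper functions in the HTTP client",
]

vectorizer = TfidfVectorizer().fit([nvd_description] + commit_messages)
query = vectorizer.transform([nvd_description])
candidates = vectorizer.transform(commit_messages)

scores = cosine_similarity(query, candidates)[0]
for score, message in sorted(zip(scores, commit_messages), reverse=True):
    print(f"{score:.3f}  {message}")   # the bounds-check fix ranks first
```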
As on-device LLMs (e.g., Apple on-device Intelligence) are widely adopted to reduce network dependency, improve privacy, and enhance responsiveness, verifying the legitimacy of models running on local devices becomes critical. Existing attestation techniques are not suitable for billion-parameter Large Language Models (LLMs), as they struggle to remain both time- and memory-efficient while addressing emerging threats in the LLM era. In this paper, we present AttestLLM, the first-of-its-kind attestation framework that protects the hardware-level intellectual property (IP) of device vendors by ensuring that only authorized LLMs can execute on target platforms. AttestLLM leverages an algorithm/software/hardware co-design approach to embed robust watermarking signatures into the activation distributions of LLM building blocks. It also optimizes the attestation protocol within the Trusted Execution Environment (TEE), providing efficient verification without compromising inference throughput. Extensive proof-of-concept evaluations on LLMs from the Llama, Qwen, and Phi families for on-device use cases demonstrate AttestLLM's attestation reliability, fidelity, and efficiency. Furthermore, AttestLLM enforces model legitimacy and exhibits resilience against model replacement and forgery attacks.
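A toy sketch of distribution-based watermark checking: the verifier compares summary statistics of a block's activations on fixed inputs against a registered signature. The statistics, tolerance, and verification flow are illustrative assumptions; AttestLLM's co-designed watermarking and TEE attestation protocol are more involved.

```python
# Toy activation-distribution signature check.
import numpy as np

def activation_signature(activations: np.ndarray) -> np.ndarray:
    """Per-channel mean/std summary of a block's activations."""
    return np.concatenate([activations.mean(axis=0), activations.std(axis=0)])

def attest_block(activations: np.ndarray, registered: np.ndarray,
                 tolerance: float = 0.05) -> bool:
    observed = activation_signature(activations)
    relative_error = np.linalg.norm(observed - registered) / np.linalg.norm(registered)
    return bool(relative_error < tolerance)

rng = np.random.default_rng(0)
authorized = rng.normal(0.0, 1.0, size=(128, 16))   # activations on trigger inputs
registered = activation_signature(authorized)       # enrolled watermark signature

print(attest_block(authorized, registered))                              # True
print(attest_block(rng.normal(0.5, 1.2, size=(128, 16)), registered))    # False
```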
Security patch detection (SPD) is crucial for maintaining software security, as unpatched vulnerabilities can lead to severe security risks. In recent years, numerous learning-based SPD approaches have demonstrated promising results on source code. However, these approaches typically cannot be applied to the closed-source applications and proprietary systems that constitute a significant portion of real-world software, because such software releases patches only as binary files and the source code is inaccessible. Given the impressive performance of code large language models (LLMs) in code intelligence and binary analysis tasks such as decompilation and compilation optimization, their potential for detecting binary security patches remains unexplored, exposing a significant research gap between their demonstrated low-level code understanding capabilities and this critical security task. To address this gap, we construct a large-scale binary patch dataset containing 19,448 samples with two levels of representation, assembly code and pseudo-code, and systematically evaluate 19 code LLMs of varying scales to investigate their capability on binary SPD tasks. Our initial exploration shows that directly prompting vanilla code LLMs struggles to accurately identify security patches among binary patches, and that even state-of-the-art prompting techniques fail to compensate for the lack of binary SPD domain knowledge in vanilla models. Drawing on these initial findings, we further investigate fine-tuning strategies for injecting binary SPD domain knowledge into code LLMs through the two levels of representation. Experimental results demonstrate that fine-tuned LLMs achieve outstanding performance, with the best results obtained on the pseudo-code representation.
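A minimal sketch of how a binary patch might be turned into instruction-tuning records at the two representation levels mentioned above (assembly and pseudo-code). The field names, prompt wording, and example diff are illustrative assumptions, not the paper's dataset format.

```python
# Hypothetical construction of fine-tuning records for binary SPD.
import json

def make_record(representation: str, before: str, after: str, is_security: bool):
    return {
        "instruction": (f"Given the {representation} of a binary function "
                        "before and after a patch, answer 'security' if the "
                        "patch fixes a vulnerability, otherwise 'non-security'."),
        "input": f"--- before ---\n{before}\n--- after ---\n{after}",
        "output": "security" if is_security else "non-security",
    }

sample = make_record(
    "pseudo-code",
    before="memcpy(dst, src, len);",
    after="if (len <= sizeof(dst)) memcpy(dst, src, len);",
    is_security=True,
)

with open("binary_spd_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")   # one JSONL record per patch
```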
In response to the prevalent threat of TCP SYN flood attacks in the Software-Defined Internet of Vehicles (SD-IoV), this study addresses the significant challenge of network security in rapidly evolving vehicular communication systems. The research focuses on optimizing a Random Forest Classifier to achieve maximum accuracy and minimal detection time, thereby enhancing vehicular network security. The methodology involves preprocessing a dataset containing SYN attack instances, employing feature scaling and label encoding, and applying Stratified K-Fold cross-validation to evaluate key metrics such as accuracy, precision, recall, and F1-score. The research achieved an average value of 0.999998 across all metrics with a SYN DoS attack detection time of 0.24 seconds. Results show that the fine-tuned Random Forest model, configured with 20 estimators and a maximum depth of 10, effectively differentiates between normal and malicious traffic with high accuracy and minimal detection time, which is crucial for SD-IoV networks. This approach marks a significant advancement and introduces a state-of-the-art algorithm for detecting SYN flood attacks, combining high accuracy with minimal detection time. It contributes to vehicular network security by providing a robust solution against TCP SYN flood attacks while maintaining network efficiency and reliability.
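A minimal sketch of the described pipeline in scikit-learn, using the stated hyperparameters (20 estimators, depth 10) with scaling, label encoding, and Stratified K-Fold evaluation. The CSV path and column names are placeholders, not the actual SYN-flood dataset.

```python
# Sketch of scaling + label encoding + Random Forest + stratified CV.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("syn_traffic.csv")                 # placeholder dataset path
X = df.drop(columns=["label"])
y = LabelEncoder().fit_transform(df["label"])       # e.g. normal / syn_flood -> 0 / 1

model = make_pipeline(
    StandardScaler(),                               # feature scaling
    RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall", "f1"])
for metric in ["test_accuracy", "test_precision", "test_recall", "test_f1"]:
    print(metric, scores[metric].mean())
```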
LLM-based agentic systems leverage large language models to handle user queries, make decisions, and execute external tools for complex tasks across domains such as chatbots, customer service, and software engineering. A critical component of these systems is the Tool Invocation Prompt (TIP), which defines tool interaction protocols and guides the LLM to ensure the security and correctness of tool usage. Despite its importance, TIP security has been largely overlooked. This work investigates TIP-related security risks, revealing that major LLM-based systems such as Cursor and Claude Code, among others, are vulnerable to attacks such as remote code execution (RCE) and denial of service (DoS). Through a systematic TIP exploitation workflow (TEW), we demonstrate external tool behavior hijacking via manipulated tool invocations. We also propose defense mechanisms to enhance TIP security in LLM-based agentic systems.