Formal specifications have numerous benefits for both designers and users of network protocols. They provide clear, unambiguous representations, which are useful as documentation and for testing. They can help reveal disagreements about what a protocol "is" and identify areas where further work is needed to resolve ambiguities or internal inconsistencies. They also provide a foundation for formal reasoning, making it possible to establish important security and correctness guarantees on all inputs and in every environment. Despite these advantages, formal methods are not widely used to design, implement, and validate network protocols today. Instead, Internet protocols are usually described in informal documents, such as IETF Requests for Comments (RFCs) or IEEE standards. These documents primarily consist of lengthy prose descriptions, accompanied by pseudocode, header descriptions, state machine diagrams, and reference implementations which are used for interoperability testing. So, while RFCs and reference implementations were only intended to help guide the social process used by protocol designers, they have evolved into the closest things to formal specifications the Internet community has. In this paper, we discuss the different roles that specifications play in the networking and formal methods communities. We then illustrate the potential benefits of specifying protocols formally, presenting highlights from several recent success stories. Finally, we identify key differences between how formal specifications are understood by the two communities and suggest possible strategies to bridge the gaps.
We find, motivated by real-world applications, that the well-known request-response specification comes with multiple variations, and that these variations should be distinguished. As the first main contribution, we introduce a classification of those variations into six types, and present it as a decision tree, where a user is led to the type that is suited for their application by answering a couple of questions. Our second main contribution is the formalization of those six types in various formalisms such as temporal logics, grammars, and automata; here, two types out of the six are non-regular specifications and their formalization requires extended formalisms. We also survey tools for monitoring these specifications to cater for practitioners' needs.
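As a concrete reference point, the simplest of these variations, often written G(req → F resp) in temporal logic, can be monitored over a finite trace in a single pass. A minimal Python sketch (the event names `"req"` and `"resp"` are hypothetical placeholders, not from the paper):

```python
def satisfies_request_response(trace):
    """Check G(req -> F resp) over a finite trace: every "req"
    event must eventually be followed by a "resp" event.
    A single later "resp" discharges all earlier requests."""
    pending = False  # is there a request still awaiting a response?
    for event in trace:
        if event == "req":
            pending = True
        elif event == "resp":
            pending = False
    return not pending
```

The other variations in the classification (e.g., counting or matching individual requests to individual responses) require richer monitor state, which is where the non-regular types come in.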
We prove that $S(5) = 47,176,870$ using the Coq proof assistant. The Busy Beaver value $S(n)$ is the maximum number of steps that an $n$-state 2-symbol Turing machine can perform from the all-zero tape before halting, and $S$ was historically introduced by Tibor Rad\'o in 1962 as one of the simplest examples of an uncomputable function. The proof enumerates $181,385,789$ Turing machines with 5 states and, for each machine, decides whether it halts or not. Our result marks the first determination of a new Busy Beaver value in over 40 years and the first Busy Beaver value ever to be formally verified, attesting to the effectiveness of massively collaborative online research (bbchallenge$.$org).
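The enumeration underlying the proof rests on simulating candidate machines step by step from the all-zero tape. As a toy illustration (not the Coq development itself), here is a minimal simulator in Python, run on the well-known 2-state champion, for which S(2) = 6:

```python
def run_turing_machine(delta, max_steps=10_000):
    """Simulate a 2-symbol Turing machine from the all-zero tape.

    delta maps (state, symbol) -> (write, move, next_state), where
    move is +1/-1 and next_state 'H' means halt after the transition.
    Returns the number of steps taken, or None if the machine does
    not halt within max_steps."""
    tape = {}  # sparse tape: position -> symbol, default 0
    pos, state, steps = 0, 'A', 0
    while steps < max_steps:
        write, move, next_state = delta[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        steps += 1
        if next_state == 'H':
            return steps
        state = next_state
    return None

# The 2-state Busy Beaver champion: halts after exactly S(2) = 6 steps.
bb2 = {
    ('A', 0): (1, +1, 'B'),
    ('A', 1): (1, -1, 'B'),
    ('B', 0): (1, -1, 'A'),
    ('B', 1): (1, +1, 'H'),
}
```

The hard part of the actual proof is of course not simulation but deciding non-halting for the machines that never stop, which is what the formally verified deciders handle.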
A directed acyclic graph (DAG) can represent a two-dimensional string or picture. We propose recognizing picture languages using DAG automata by encoding 2D inputs into DAGs. An encoding can be input-agnostic (based on input size only) or input-driven (depending on symbols). Three distinct input-agnostic encodings characterize classes of picture languages accepted by returning finite automata, boustrophedon automata, and online tessellation automata. Encoding a string as a simple directed path limits recognition to regular languages. However, input-driven encodings allow DAG automata to recognize some context-sensitive string languages and outperform online tessellation automata in two dimensions.
This paper presents boldsea, Boldachev's semantic-event approach -- an architecture for modeling complex dynamic systems using executable ontologies -- semantic models that act as dynamic structures, directly controlling process execution. We demonstrate that integrating event semantics with a dataflow architecture addresses the limitations of traditional Business Process Management (BPM) systems and object-oriented semantic technologies. The paper presents the formal BSL (boldsea Semantic Language), including its BNF grammar, and outlines the boldsea-engine's architecture, which directly interprets semantic models as executable algorithms without compilation. It enables the modification of event models at runtime, ensures temporal transparency, and seamlessly merges data and business logic within a unified semantic framework.
A binary word is Sturmian if the occurrences of each letter are balanced, in the sense that in any two factors of the same length, the numbers of occurrences of each letter differ by at most 1. In digital geometry, Sturmian words correspond to discrete approximations of straight line segments in the Euclidean plane. The Dorst-Smeulders coding, introduced in 1984, is a 4-tuple of integers that uniquely represents a Sturmian word $w$, enabling its reconstruction using $|w|$ modular operations, making it highly efficient in practice. In this paper, we present a linear-time algorithm that, given a binary input word $w$, computes the Dorst-Smeulders coding of its longest Sturmian prefix. This forms the basis for computing the Dorst-Smeulders coding of an arbitrary binary word $w$, which is a minimal decomposition (in terms of the number of factors) of $w$ into Sturmian words, each represented by its Dorst-Smeulders coding. This coding could be leveraged in compression schemes where the input is transformed into a binary word composed of long Sturmian segments. Although the algorithm is conceptually simple and can be implemented in just a few lines of code, it is grounded in a deep analysis of the structural properties of Sturmian words.
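The balance condition in the definition can be checked directly, if naively. A brute-force Python sketch (cubic time, nothing like the paper's linear-time algorithm):

```python
def is_balanced(w):
    """Check the balance property of a finite binary word w (a string
    over {'0','1'}): for every factor length n, the number of '1's in
    any two length-n factors differs by at most 1."""
    for n in range(1, len(w) + 1):
        # counts of '1' across all factors of length n
        counts = {w[i:i + n].count('1') for i in range(len(w) - n + 1)}
        if max(counts) - min(counts) > 1:
            return False
    return True
```

For example, the Fibonacci-word prefix `"0100101"` is balanced, while `"11000"` is not (its length-2 factors `"11"` and `"00"` differ by 2 in their count of ones).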
The article focuses on word (or string) attractors, which are sets of positions related to the text compression efficiency of the underlying word. The article presents two combinatorial algorithms based on suffix automata, also known as directed acyclic word graphs (DAWGs). The first algorithm decides in linear time whether a set of positions on the word is an attractor of the word. The second algorithm generates an attractor for a given word in a greedy manner. Although computing a smallest attractor is NP-hard, the algorithm is efficient and produces very small attractors for several well-known families of words.
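For reference, the attractor property itself can be decided by brute force: every factor of the word must have at least one occurrence that contains a position of the candidate set. A naive Python sketch (far from the linear-time algorithm in the article; positions are 0-based):

```python
def is_attractor(w, gamma):
    """Decide by brute force whether the position set gamma (0-based)
    is a string attractor of w: every factor of w must have at least
    one occurrence [i, i+len) that contains a position of gamma."""
    gamma = set(gamma)
    n = len(w)
    for length in range(1, n + 1):
        for f in {w[i:i + length] for i in range(n - length + 1)}:
            # search for an occurrence of f crossing some p in gamma
            ok = any(
                w[i:i + length] == f
                and any(i <= p < i + length for p in gamma)
                for i in range(n - length + 1)
            )
            if not ok:
                return False
    return True
```

For instance, `{0, 1}` is an attractor of `"aba"`, but `{2}` alone is not, since the factor `"b"` occurs only at position 1.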
We investigate the verification power of rational-valued affine automata within Arthur--Merlin proof systems. For one-way verifiers, we give real-time protocols with perfect completeness and tunable bounded error for two benchmark nonregular languages, the balanced-middle language and the centered-palindrome language, illustrating a concrete advantage over probabilistic and quantum finite-state verifiers. For two-way verifiers, we first design a weak protocol that verifies every Turing-recognizable language by streaming and checking a configuration history. We then strengthen it with a probabilistic continuation check that bounds the prover's transcript length and ensures halting with high probability, yielding strong verification with expected running time proportional to the product of the simulated machine's space and time (up to input length and a factor polynomial in the inverse error parameter). Combining these constructions with standard alternation--space correspondences, we place alternating exponential time, equivalently deterministic exponential space, inside affine Arthur--Merlin with two-way affine automata. We also present a reduction-based route with perfect completeness via a Knapsack-game verifier, which, together with linear-space reductions, yields that the class PSPACE admits affine Arthur--Merlin verification by two-way affine automata. Two simple primitives drive our protocols: a probabilistic continuation check to control expected time and a restart-on-accept affine register that converts exact algebraic checks into eventually halting bounded-error procedures.
Timed regular expressions serve as a formalism for specifying real-time behaviors of Cyber-Physical Systems. In this paper, we consider the synthesis of timed regular expressions, focusing on generating a timed regular expression consistent with a given set of system behaviors including positive and negative examples, i.e., accepting all positive examples and rejecting all negative examples. We first prove the decidability of the synthesis problem through an exploration of simple timed regular expressions. Subsequently, we propose our method of generating a consistent timed regular expression with minimal length, which unfolds in two steps. The first step is to enumerate and prune candidate parametric timed regular expressions. In the second step, we encode the requirement that a candidate generated by the first step is consistent with the given set into a Satisfiability Modulo Theories (SMT) formula, which is consequently solved to determine a solution to parametric time constraints. Finally, we evaluate our approach on benchmarks, including randomly generated behaviors from target timed models and a case study.
We propose One-counter Positive Negative Inference (OPNI), a passive learning algorithm for deterministic real-time one-counter automata (DROCA). Inspired by the RPNI algorithm for regular languages, OPNI constructs a DROCA consistent with any given valid sample set. We further present a method for combining OPNI with active learning of DROCA, and provide an implementation of the approach. Our experimental results demonstrate that this approach scales more effectively than existing state-of-the-art algorithms. We also evaluate the performance of the proposed approach for learning visibly one-counter automata.
In the game-theoretic approach to controller synthesis, we model the interaction between a system to be controlled and its environment as a game between these entities, and we seek an appropriate (e.g., winning or optimal) strategy for the system. This strategy then serves as a formal blueprint for a real-world controller. A common belief is that simple (e.g., using limited memory) strategies are better: corresponding controllers are easier to conceive and understand, and cheaper to produce and maintain. This invited contribution focuses on the complexity of strategies in a variety of synthesis contexts. We discuss recent results concerning memory and randomness, and take a brief look at what lies beyond our traditional notions of complexity for strategies.
To make sense of the world around us, we develop models, constructed to enable us to replicate, describe, and explain the behaviours we see. Focusing on the broad case of sequences of correlated random variables, i.e., classical stochastic processes, we tackle the question of determining whether or not two different models produce the same observable behavior. This is the problem of identifiability. Curiously, the physics of the model need not correspond to the physics of the observations; recent work has shown that it is even advantageous -- in terms of memory and thermal efficiency -- to employ quantum models to generate classical stochastic processes. We resolve the identifiability problem in this regime, providing a means to compare any two models of a classical process, be the models classical, quantum, or `post-quantum', by mapping them to a canonical `generalized' hidden Markov model. Further, this enables us to place (sometimes tight) bounds on the minimal dimension required of a quantum model to generate a given classical stochastic process.
The store language of an automaton is the set of store configurations (state and store contents, but not the input) that can appear as an intermediate step in an accepting computation. A one-way nondeterministic finite-visit Turing machine (fvNTM) is a Turing machine with a one-way read-only input tape, and a single worktape, where there is some number $k$ such that in every accepting computation, each worktape cell is visited at most $k$ times. We show that the store language of every fvNTM is a regular language. Furthermore, we show that the store language of every fvNTM augmented by reversal-bounded counters can be accepted by a machine with only reversal-bounded counters and no worktape. Several applications are given to problems in the areas of verification and fault tolerance, and to the study of right quotients. We also continue the investigation of the store languages of one-way and two-way machine models where we present some conditions under which their store languages are recursive or non-recursive.
B\"uchi automata (BAs) recognize $\omega$-regular languages defined by formal specifications like linear temporal logic (LTL) and are commonly used in the verification of reactive systems. However, BAs face scalability challenges when handling and manipulating complex system behaviors. As neural networks are increasingly used to address these scalability challenges in areas like model checking, investigating their ability to generalize beyond training data becomes necessary. This work presents the first study investigating whether recurrent neural networks (RNNs) can generalize to $\omega$-regular languages derived from LTL formulas. We train RNNs on ultimately periodic $\omega$-word sequences to replicate target BA behavior and evaluate how well they generalize to out-of-distribution sequences. Through experiments on LTL formulas corresponding to deterministic automata of varying structural complexity, from 3 to over 100 states, we show that RNNs achieve high accuracy on their target $\omega$-regular languages when evaluated on sequences up to $8 \times$ longer than training examples, with $92.6\%$ of tasks achieving perfect or near-perfect generalization. These results establish the feasibility of neural approaches for learning complex $\omega$-regular languages, suggesting their potential as components in neurosymbolic verification methods.
Computing the probability of reaching a set of goal states G in a discrete-time Markov chain (DTMC) is a core task of probabilistic model checking. We can do so by directly computing the probability mass of the set of all finite paths from the initial state to G; however, when refining counterexamples, it is also interesting to compute the probability mass of subsets of paths. This can be achieved by splitting the computation into path abstractions that calculate "local" reachability probabilities as shown by \'Abrah\'am et al. in 2010. In this paper, we complete and extend their work: We prove that splitting the computation into path abstractions indeed yields the same result as the direct approach, and that the splitting does not need to follow the SCC structure. In particular, we prove that path abstraction can be performed along any finite sequence of sets of non-goal states. Our proofs proceed in a novel way by interpreting the DTMC as a structure on the free monoid on its state space, which makes them clean and concise. Additionally, we provide a compact reference implementation of path abstraction in PARI/GP.
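For context, the direct computation of reachability probabilities in a finite DTMC can be sketched as a least-fixed-point value iteration. This is the textbook approach, not the path-abstraction method of the paper:

```python
def reach_probabilities(P, goal, iters=10_000, tol=1e-12):
    """For each state s of a DTMC, compute the probability of
    eventually reaching a state in `goal`, by value iteration on
        x_s = 1                        if s in goal,
        x_s = sum_t P[s][t] * x_t      otherwise.
    P is a dict: state -> dict mapping successor -> probability.
    Starting from 0 converges to the least fixed point, which is
    the correct reachability probability."""
    x = {s: (1.0 if s in goal else 0.0) for s in P}
    for _ in range(iters):
        x_new = {
            s: 1.0 if s in goal else sum(p * x[t] for t, p in P[s].items())
            for s in P
        }
        if max(abs(x_new[s] - x[s]) for s in P) < tol:
            return x_new
        x = x_new
    return x
```

Path abstraction instead decomposes this global computation into "local" reachability probabilities over sequences of non-goal state sets, which is what the paper proves equivalent to the direct approach.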
We present a new criterion for proving that a language is not multiple context-free, which we call a Substitution Lemma. We apply it to show a sample selection of languages are not multiple context-free, including the word problem of $F_2\times F_2$. Our result is in contrast to Kanazawa et al. [2014, Theory Comput. Syst.] who proved that it was not possible to generalise the standard pumping lemma for context-free languages to multiple context-free languages, and Kanazawa [2019, Inform. and Comput.] who showed a weak variant of generalised Ogden's lemma does not apply to multiple context-free languages. We also show that groups with multiple context-free word problem have rational subset membership and intersection problems.
We consider a multi-player non-zero-sum turn-based game (abbreviated as multi-player game) on a finite directed graph. A secure equilibrium (SE) is a strategy profile in which no player has an incentive to deviate from her strategy, because no deviation can increase her own payoff, nor lower the payoff of another player while preserving her own. SE is a promising refinement of Nash equilibrium, in which a player does not care about the payoffs of the other players. In this paper, we discuss the decidability and complexity of the problem of deciding whether a secure equilibrium with constraints (a payoff profile specifying which players must win) exists for a given multi-player game.
The theory of sequences, supported by many SMT solvers, can model program data types including bounded arrays and lists. Sequences are parameterized by the element data type and provide operations such as accessing elements, concatenation, forming sub-sequences and updating elements. Strings and sequences are intimately related; many operations, e.g., matching a string according to a regular expression, splitting strings, or joining strings in a sequence, are frequently used in string-manipulating programs. Nevertheless, these operations are typically not directly supported by existing SMT solvers, which instead only consider the generic theory of sequences. In this paper, we propose a theory of string sequences and study its satisfiability. We show that, while it is undecidable in general, the decidability can be recovered by restricting to the straight-line fragment. This is shown by encoding each string sequence as a string, and each string sequence operation as a corresponding string operation. We provide pre-image computation for the resulting string operations with respect to automata, effectively casting it into the generic OSTRICH string constraint solving framework. We implement the new decision procedure as a tool $\ostrichseq$, and carry out experiments on benchmark constraints generated from real-world JavaScript programs, hand-crafted templates and unit tests. The experiments confirm the efficacy of our approach.