Loading...
Loading...
Browse, search and filter the latest cybersecurity research papers from arXiv
Learning to solve complex tasks with signal temporal logic (STL) specifications is crucial to many real-world applications. However, most previous works only consider fixed or parametrized STL specifications due to the lack of a diverse STL dataset and encoders to effectively extract temporal logic information for downstream tasks. In this paper, we propose TeLoGraF, Temporal Logic Graph-encoded Flow, which utilizes Graph Neural Networks (GNN) encoder and flow-matching to learn solutions for general STL specifications. We identify four commonly used STL templates and collect a total of 200K specifications with paired demonstrations. We conduct extensive experiments in five simulation environments ranging from simple dynamical models in the 2D space to high-dimensional 7DoF Franka Panda robot arm and Ant quadruped navigation. Results show that our method outperforms other baselines in the STL satisfaction rate. Compared to classical STL planning algorithms, our approach is 10-100X faster in inference and can work on any system dynamics. Besides, we show our graph-encoding method's capability to solve complex STLs and robustness to out-distribution STL specifications. Code is available at https://github.com/mengyuest/TeLoGraF
Neural network-based policies have demonstrated success in many robotic applications, but often lack human-explanability, which poses challenges in safety-critical deployments. To address this, we propose a neuro-symbolic explanation framework that generates a weighted signal temporal logic (wSTL) specification to describe a robot policy in a interpretable form. Existing methods typically produce explanations that are verbose and inconsistent, which hinders explainability, and loose, which do not give meaningful insights into the underlying policy. We address these issues by introducing a simplification process consisting of predicate filtering, regularization, and iterative pruning. We also introduce three novel explainability evaluation metrics -- conciseness, consistency, and strictness -- to assess explanation quality beyond conventional classification metrics. Our method is validated in three simulated robotic environments, where it outperforms baselines in generating concise, consistent, and strict wSTL explanations without sacrificing classification accuracy. This work bridges policy learning with formal methods, contributing to safer and more transparent decision-making in robotics.
Given a string $w$, another string $v$ is said to be a subsequence of $w$ if $v$ can be obtained from $w$ by removing some of its letters; on the other hand, $v$ is called an absent subsequence of $w$ if $v$ is not a subsequence of $w$. The existing literature on absent subsequences focused on understanding, for a string $w$, the set of its shortest absent subsequences (i.e., the shortest strings which are absent subsequences of $w$) and that of its minimal absent subsequences (i.e., those strings which are absent subsequences of $w$ but whose every proper subsequence occurs in $w$). Our contributions to this area of research are the following. Firstly, we present optimal algorithms (with linear time preprocessing and output-linear delay) for the enumeration of the shortest and, respectively, minimal absent subsequences. Secondly, we present optimal algorithms for the incremental enumeration of these strings with linear time preprocessing and constant delay; in this setting, we only output short edit-scripts showing how the currently enumerated string differs from the previous one. Finally, we provide an efficient algorithm for identifying a longest minimal absent subsequence of a string. All our algorithms improve the state-of-the-art results for the aforementioned problems.
We give a new proof of a result from well quasi-order theory on the computability of bases for upwards-closed sets of words. This new proof is based on Angluin's $L^*$ algorithm, that learns an automaton from a minimally adequate teacher. This relates in particular two results from the 1980s: Angluin's $L^*$ algorithm, and a result from Valk and Jantzen on the computability of bases for upwards-closed sets of tuples of integers. Along the way, we describe an algorithm for learning quasi-ordered automata from a minimally adequate teacher, and extend a generalization of Valk and Jantzen's result, encompassing both words and integers, to finitely generated monoids.
Stochastic process discovery is concerned with deriving a model capable of reproducing the stochastic character of observed executions of a given process, stored in a log. This leads to an optimisation problem in which the model's parameter space is searched for, driven by the resemblance between the log's and the model's stochastic languages. The bottleneck of such optimisation problem lay in the determination of the model's stochastic language which existing approaches deal with through, hardly scalable, exact computation approaches. In this paper we introduce a novel framework in which we combine a simulation-based Bayesian parameter inference scheme, used to search for the ``optimal'' instance of a stochastic model, with an expressive statistical model checking engine, used (during inference) to approximate the language of the considered model's instance. Because of its simulation-based nature, the payoff is that, the runtime for discovering of the optimal instance of a model can be easily traded in for accuracy, hence allowing to treat large models which would result in a prohibitive runtime with non-simulation based alternatives. We validate our approach on several popular event logs concerning real-life systems.
It is proved that every regular expression of alphabetic width $n$, that is, with $n$ occurrences of symbols of the alphabet, can be transformed into a deterministic finite automaton (DFA) with $2^{\frac{n}{2}+(\frac{\log_2 e}{2\sqrt{2}}+o(1))\sqrt{n\ln n}}$ states recognizing the same language (the best upper bound up to date is $2^n$). At the same time, it is also shown that this bound is close to optimal, namely, that there exist regular expressions of alphabetic width $n$ over a two-symbol alphabet, such that every DFA for the same language has at least $2^{\frac{n}{2}+(\sqrt{2} + o(1))\sqrt{\frac{n}{\ln n}}}$ states (the previously known lower bound is $\frac{5}{4}2^{\frac{n}{2}}$). The same bounds are obtained for an intermediate problem of determinizing nondetermistic finite automata (NFA) with each state having all incoming transitions by the same symbol.
A central question in the theory of automata is which classes of automata can be minimized in polynomial time. We close the remaining gaps for deterministic and history-deterministic automata over infinite words by proving that deterministic co-B\"uchi automata with transition-based acceptance are NP-hard to minimize, as are history-deterministic B\"uchi automata with transition-based acceptance.
We introduce a novel framework for simulating finite automata using representation-theoretic semidirect products and Fourier modules, achieving more efficient Transformer-based implementations.
We propose Weighted Guarded Kleene Algebra with Tests (wGKAT), an uninterpreted weighted programming language equipped with branching, conditionals, and loops. We provide an operational semantics for wGKAT using a variant of weighted automata and introduce a sound and complete axiomatization. We also provide a polynomial time decision procedure for bisimulation equivalence.
Message sequence charts (MSCs) visually represent interactions in distributed systems that communicate through FIFO channels. High-level MSCs (HMSCs) extend MSCs with choice, concatenation, and iteration, allowing for the specification of complex behaviors. This paper revisits two classical problems for HMSCs: satisfiability and realizability. Satisfiability (also known as reachability or nonemptiness) asks whether there exists a path in the HMSC that gives rise to a valid behavior. Realizability concerns translating HMSCs into communicating finite-state machines to ensure correct system implementations. While most positive results assume bounded channels, we introduce a class of HMSCs that allows for unbounded channels while maintaining effective implementations. On the other hand, we show that the corresponding satisfiability problem is still undecidable.
The notion of Wheeler languages is rooted in the Burrows-Wheeler transform (BWT), one of the most central concepts in data compression and indexing. The BWT has been generalized to finite automata, the so-called Wheeler automata, by Gagie et al. [Theor. Comput. Sci. 2017]. Wheeler languages have subsequently been defined as the class of regular languages for which there exists a Wheeler automaton accepting them. Besides their advantages in data indexing, these Wheelerlanguages also satisfy many interesting properties from a language theoretic point of view [Alanko et al., Inf. Comput. 2021]. A characteristic yet unsatisfying feature of Wheeler languages however is that their definition depends on a fixed order of the alphabet. In this paper we introduce the Universally Wheeler languages UW, i.e., the regular languages that are Wheeler with respect to all orders of a given alphabet. Our first main contribution is to relate UW to some very well known regular language classes. We first show that the Striclty Locally Testable languages are strictly included in UW. After noticing that UW is not closed under taking the complement, we prove that the class of languages for which both the language and its complement are in UW exactly coincides with those languages that are Definite or Reverse Definite. Secondly, we prove that deciding if a regular language given by a DFA is in UW can be done in quadratic time. We also show that this is optimal unless the Strong Exponential Time Hypothesis (SETH) fails.
A word over an ordered alphabet is said to be clustering if identical letters appear adjacently in its Burrows-Wheeler transform. Such words are strictly related to (discrete) interval exchange transformations. We use an extended version of the well-known Rauzy induction to show that every return word in the language generated by a regular interval exchange transformation is clustering, partially answering a question of Lapointe (2021).
Regular expression matching is of practical importance due to its widespread use in real-world applications. In practical use, regular expressions are often used with real-world extensions. Accordingly, the matching problem of regular expressions with real-world extensions has been actively studied in recent years, yielding steady progress. However, backreference, a popular extension supported by most modern programming languages such as Java, Python, JavaScript and others in their standard libraries for string processing, is an exception to this positive trend. In fact, it is known that the matching problem of regular expressions with backreferences (rewbs) is theoretically hard and the existence of an asymptotically fast matching algorithm for arbitrary rewbs seems unlikely. Even among currently known partial solutions, the balance between efficiency and generality remains unsatisfactory. To bridge this gap, we present an efficient matching algorithm for rewbs of the form $e_0 (e)_1 e_1 \backslash 1 e_2$ where $e_0, e, e_1, e_2$ are pure regular expressions, which are fundamental and frequently used in practical applications. It runs in quadratic time with respect to the input string length, substantially improving the best-known cubic time complexity for these rewbs. Our algorithm combines ideas from both stringology and automata theory in a novel way. We leverage two techniques from automata theory, injection and summarization, to simultaneously examine matches whose backreferenced substrings are either a fixed right-maximal repeat or its extendable prefixes, which are concepts from stringology. By further utilizing a subtle property of extendable prefixes, our algorithm correctly decides the matching problem while achieving the quadratic-time complexity.
We study the dynamic membership problem for regular tree languages under relabeling updates: we fix an alphabet ${\Sigma}$ and a regular tree language $L$ over ${\Sigma}$ (expressed, e.g., as a tree automaton), we are given a tree $T$ with labels in ${\Sigma}$, and we must maintain the information of whether the tree $T$ belongs to $L$ while handling relabeling updates that change the labels of individual nodes in $T$. (The shape and size of the tree remain the same throughout.) Our first contribution is to show that this problem admits an $O(\log n / \log \log n)$ algorithm for any fixed regular tree language, improving over known algorithms that achieve $O(\log n)$. This generalizes the known $O(\log n / \log \log n)$ upper bound over words, and it matches the lower bound of ${\Omega}(\log n / \log \log n)$ from dynamic membership to some word languages and from the existential marked ancestor problem. Our second contribution is to introduce a class of regular languages, dubbed almost-commutative tree languages, and show that dynamic membership to such languages under relabeling updates can be done in constant time per update. Almost-commutative languages generalize both commutative languages and finite languages, and they are the analogue for trees of the ZG languages enjoying constant-time dynamic membership over words. Our main technical contribution is to show that this class is conditionally optimal when we assume that the alphabet features a neutral letter, i.e., a letter that has no effect on membership to the language. More precisely, we show that any regular tree language with a neutral letter which is not almost-commutative cannot be maintained in constant time under the assumption that prefix-U1 problem from (Amarilli, Jachiet, Paperman, ICALP'21) also does not admit a constant-time algorithm.
We study how the application of injective morphisms affects the number $r$ of equal-letter runs in the Burrows-Wheeler Transform (BWT). This parameter has emerged as a key repetitiveness measure in compressed indexing. We focus on the notion of BWT-run sensitivity after application of an injective morphism. For binary alphabets, we characterize the class of morphisms that preserve the number of BWT-runs up to a bounded additive increase, by showing that it coincides with the known class of primitivity-preserving morphisms, which are those that map primitive words to primitive words. We further prove that deciding whether a given binary morphism has bounded BWT-run sensitivity is possible in polynomial time with respect to the total length of the images of the two letters. Additionally, we explore new structural and combinatorial properties of synchronizing and recognizable morphisms. These results establish new connections between BWT-based compressibility, code theory, and symbolic dynamics.
Finite (word) state transducers extend finite state automata by defining a binary relation over finite words, called rational relation. If the rational relation is the graph of a function, this function is said to be rational. The class of sequential functions is a strict subclass of rational functions, defined as the functions recognised by input-deterministic finite state transducers. The class membership problems between those classes are known to be decidable. We consider approximate versions of these problems and show they are decidable as well. This includes the approximate functionality problem, which asks whether given a rational relation (by a transducer), is it close to a rational function, and the approximate determinisation problem, which asks whether a given rational function is close to a sequential function. We prove decidability results for several classical distances, including Hamming and Levenshtein edit distance. Finally, we investigate the approximate uniformisation problem, which asks, given a rational relation $R$, whether there exists a sequential function that is close to some function uniformising $R$. As for its exact version, we prove that this problem is undecidable.
Formally verifying properties of software code has been a highly desirable task, especially with the emergence of LLM-generated code. In the same vein, they provide an interesting avenue for the exploration of formal verification and mechanistic interpretability. Since the introduction of code-specific models, despite their successes in generating code in Lean4 and Isabelle, the task of generalized theorem proving still remains far from being fully solved and will be a benchmark for reasoning capability in LLMs. In this work, we introduce a framework that generates whole proofs in a formal language to be used within systems that utilize the power of built-in tactics and off-the-shelf automated theorem provers. Our framework includes 3 components: generating natural language statements of the code to be verified, an LLM that generates formal proofs for the given statement, and a module employing heuristics for building the final proof. To train the LLM, we employ a 2-stage fine-tuning process, where we first use SFT-based training to enable the model to generate syntactically correct Isabelle code and then RL-based training that encourages the model to generate proofs verified by a theorem prover. We validate our framework using the miniF2F-test benchmark and the Isabelle proof assistant and design a use case to verify the correctness of the AWS S3 bucket access policy code. We also curate a dataset based on the FVEL\textsubscript{\textnormal{ER}} dataset for future training tasks.
The parity index problem of tree automata asks, given a regular tree language $L$ and a set of priorities $J$, is $L$ $J$-feasible, that is, recognised by a nondeterministic parity automaton with priorities $J$? This is a long-standing open problem, of which only a few sub-cases and variations are known to be decidable. In a significant but technically difficult step, Colcombet and L\"oding reduced the problem to the uniform universality of distance-parity automata. In this article, we revisit the index problem using tools from the parity game literature. We add some counters to Lehtinen's register game, originally used to solve parity games in quasipolynomial time, and use this novel game to characterise $J$-feasibility. This provides a alternative proof to Colcombet and L\"oding's reduction. We then provide a second characterisation, based on the notion of attractor decompositions and the complexity of their structure, as measured by a parameterised version of their Strahler number, which we call $n$-Strahler number. Finally, we rephrase this result using the notion of universal tree extended to automata: a guidable automaton recognises a $[1,2j]$-feasible language if and only if it admits a universal tree with $n$-Strahler number $j$, for some $n$. In particular, a language recognised by a guidable automaton $A$ is B\"uchi-feasible if and only if there is a uniform bound $n\in \mathbb{N}$ such that all trees in the language admit an accepting run with an attractor decomposition of width bounded by $n$, or, equivalently, if and only $A$ admits a \textit{finite} universal tree. While we do not solve the decidability of the index problem, our work makes the state-of-the-art more accessible and brings to light the deep relationships between the $J$-feasibility of a language and attractor decompositions, universal trees and Lehtinen's register game.