Conformal prediction is a distribution-free uncertainty quantification method that has gained popularity in the machine learning community due to its finite-sample guarantees and ease of use. Its most common variant, dubbed split conformal prediction, is also computationally efficient as it boils down to collecting statistics of the model predictions on some calibration data not yet seen by the model. Nonetheless, these guarantees only hold if the calibration and test data are exchangeable, a condition that is difficult to verify and often violated in practice due to so-called distribution shifts. The literature is rife with methods to mitigate the loss in coverage in this non-exchangeable setting, but these methods require some prior information on the type of distribution shift to be expected at test time. In this work, we study this problem from a new perspective, through the lens of optimal transport, and show that it is possible to estimate the loss in coverage and mitigate it under distribution shift.
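As a point of reference for the split conformal procedure the abstract builds on, the following is a minimal sketch of split conformal regression with absolute-residual scores; the function name and the choice of score are illustrative, not taken from the paper.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_labels, test_preds, alpha=0.1):
    """Split conformal prediction with absolute-residual scores.

    cal_preds / cal_labels: model predictions and true labels on held-out
    calibration data; test_preds: predictions at new points.
    Returns lower/upper bounds with roughly (1 - alpha) marginal coverage,
    valid when calibration and test data are exchangeable.
    """
    n = len(cal_labels)
    scores = np.abs(cal_labels - cal_preds)              # nonconformity scores
    # Finite-sample corrected quantile level.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(scores, min(q_level, 1.0), method="higher")
    return test_preds - q_hat, test_preds + q_hat
```

Under a distribution shift between calibration and test data, the coverage of these intervals degrades, which is precisely the loss the paper sets out to estimate and mitigate.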
We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) with the Winner-Takes-All (WTA) loss to handle ambiguity, implemented efficiently through Low-Rank Adaptation (LoRA). We provide a theoretical interpretation of applying Multiple Choice Learning to language modeling, assuming the data is generated from a mixture of distributions. To illustrate the proposed approach, we use data sampled from mixtures of Markov chains. We then demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
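To make the MCL ingredient concrete, below is a minimal sketch of a relaxed Winner-Takes-All loss over K hypotheses, where each hypothesis could correspond to one LoRA adapter's per-sample loss; the function name, the relaxation weight eps, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def winner_takes_all_loss(hypothesis_losses, eps=0.05):
    """Relaxed Winner-Takes-All loss over K hypotheses.

    hypothesis_losses: tensor of shape (batch, K) holding, e.g., the
    next-token cross-entropy of each of K hypotheses (here, K LoRA
    adapters) on the same target sequence.
    The best hypothesis per sample receives almost all of the gradient;
    the remaining hypotheses get a small weight eps so that unused
    hypotheses do not collapse.
    """
    batch, K = hypothesis_losses.shape
    winner = hypothesis_losses.argmin(dim=1)                    # best hypothesis per sample
    weights = torch.full_like(hypothesis_losses, eps / (K - 1))
    weights.scatter_(1, winner.unsqueeze(1), 1.0 - eps)
    return (weights * hypothesis_losses).sum(dim=1).mean()
```

Because only the winning hypothesis is strongly rewarded on each sample, different hypotheses specialize on different plausible continuations, which is what yields diverse decodes at inference time.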
Stochastic simulators exhibit intrinsic stochasticity due to unobservable, uncontrollable, or unmodeled input variables, resulting in random outputs even at fixed input conditions. Such simulators are common across various scientific disciplines; however, emulating their entire conditional probability distribution is challenging, as it is a task that traditional deterministic surrogate modeling techniques are not designed for. Additionally, accurately characterizing the response distribution can require prohibitively large datasets, especially for computationally expensive high-fidelity (HF) simulators. When lower-fidelity (LF) stochastic simulators are available, they can enhance limited HF information within a multifidelity surrogate modeling (MFSM) framework. While MFSM techniques are well-established for deterministic settings, constructing multifidelity emulators to predict the full conditional response distribution of stochastic simulators remains a challenge. In this paper, we propose multifidelity generalized lambda models (MF-GLaMs) to efficiently emulate the conditional response distribution of HF stochastic simulators by exploiting data from LF stochastic simulators. Our approach builds upon the generalized lambda model (GLaM), which represents the conditional distribution at each input by a flexible, four-parameter generalized lambda distribution. MF-GLaMs are non-intrusive, requiring neither access to the internal stochasticity of the simulators nor multiple replications of the same input values. We demonstrate the efficacy of MF-GLaM through synthetic examples of increasing complexity and a realistic earthquake application. Results show that MF-GLaMs can achieve improved accuracy at the same cost as single-fidelity GLaMs, or comparable performance at significantly reduced cost.
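For reference, the four-parameter generalized lambda distribution underlying GLaMs is most easily handled through its quantile function; the sketch below assumes the FKML parametrization and illustrative parameter values, and omits the input-dependent fitting that GLaMs and MF-GLaMs perform.

```python
import numpy as np

def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of the four-parameter generalized lambda
    distribution (FKML parametrization), the building block of a GLaM.

    lam1: location; lam2 > 0: scale; lam3 / lam4: left / right tail shape.
    A GLaM makes (lam1, ..., lam4) functions of the simulator input x,
    so the conditional response distribution varies with x.
    """
    u = np.asarray(u, dtype=float)
    left = (u**lam3 - 1.0) / lam3 if lam3 != 0 else np.log(u)
    right = ((1.0 - u)**lam4 - 1.0) / lam4 if lam4 != 0 else np.log(1.0 - u)
    return lam1 + (left - right) / lam2

# Sampling the response at a fixed input: draw u ~ Uniform(0, 1) and
# push it through the quantile function.
samples = gld_quantile(np.random.rand(10_000), lam1=0.0, lam2=1.0, lam3=0.1, lam4=0.1)
```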
Predictive models often reinforce, through skewed decisions, biases that were originally embedded in their training data. In such cases, mitigation methods are critical to ensure that, regardless of the prevailing disparities, model outcomes are adjusted to be fair. To assess this, datasets can be systematically generated with specific biases and used to train machine learning classifiers, whose predictions then aid in understanding how bias becomes embedded. Hence, an agent-based model (ABM), depicting a loan application process that represents various systemic biases across two demographic groups, was developed to produce synthetic datasets. Then, by applying classifiers trained on them to predict loan outcomes, we can assess how biased data leads to unfairness. This highlights a main contribution of this work: a framework for synthetic dataset generation with controllable bias injection. We also contribute a novel explainability technique, which shows how mitigations affect the way classifiers leverage data features, via second-order Shapley values. In experiments, both offline and online learning approaches are employed. Mitigations are applied at different stages of the modelling pipeline, such as during pre-processing and in-processing.
Accurate forecasting of energy demand and supply is critical for optimizing sustainable energy systems, yet it is challenged by the variability of renewable sources and dynamic consumption patterns. This paper introduces a neural framework that integrates continuous-time Neural Ordinary Differential Equations (Neural ODEs), graph attention, multi-resolution wavelet transformations, and adaptive frequency learning to address the challenges of time series prediction. The model employs a robust ODE solver, using the Runge-Kutta method, paired with graph-based attention and residual connections to better capture both structural and temporal patterns. Through wavelet-based feature extraction and adaptive frequency modulation, it adeptly captures and models diverse, multi-scale temporal dynamics. When evaluated across seven diverse datasets (ETTh1, ETTh2, ETTm1, and ETTm2 for electricity transformer temperature, and Waste, Solar, and Hydro for renewable energy), this architecture consistently outperforms state-of-the-art baselines on various forecasting metrics, proving its robustness in capturing complex temporal dependencies. Furthermore, the model enhances interpretability through SHAP analysis, making it suitable for sustainable energy applications.
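As a minimal illustration of the Runge-Kutta ingredient, the sketch below integrates a learned vector field with fixed-step RK4, the standard Neural ODE building block; the graph attention, wavelet, and frequency-modulation components of the full architecture are not reproduced, and all names are illustrative.

```python
import torch

def rk4_step(f, t, h, dt):
    """One classical Runge-Kutta (RK4) step for a Neural ODE.

    f: learned vector field, e.g. a small network mapping (t, h) -> dh/dt.
    h: hidden state at time t; dt: step size.
    """
    k1 = f(t, h)
    k2 = f(t + dt / 2, h + dt * k1 / 2)
    k3 = f(t + dt / 2, h + dt * k2 / 2)
    k4 = f(t + dt, h + dt * k3)
    return h + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def integrate(f, h0, t0, t1, n_steps=32):
    """Integrate the hidden state from t0 to t1 with fixed-step RK4."""
    h, t = h0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = rk4_step(f, t, h, dt)
        t = t + dt
    return h
```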
Tabular data synthesis for supervised learning ('SL') model training is gaining popularity in industries such as healthcare, finance, and retail. Despite the progress made in tabular data generators, models trained with synthetic data often underperform compared to those trained with original data. This low SL utility of synthetic data stems from exaggerated class imbalance and from SL-relevant data relationships being overlooked by tabular generators. To address these challenges, we draw inspiration from techniques in emerging data-centric artificial intelligence and introduce Pruning and ReOrdering ('PRRO'), a novel pipeline that integrates data-centric techniques into tabular data synthesis. PRRO incorporates data pruning to guide the table generator towards observations with a high signal-to-noise ratio, ensuring that the class distribution of synthetic data closely matches that of the original data. In addition, PRRO employs a column reordering algorithm to align the data modeling structure of generators with that of SL models. These two modules enable PRRO to optimize the SL utility of synthetic data. Empirical experiments on 22 public datasets show that synthetic data generated using PRRO enhances predictive performance compared to data generated without PRRO. Specifically, replacing original data with synthetic data yields an average improvement of 26.74% and up to 871.46% when using PRRO, while appending PRRO-generated synthetic data to the original data results in an average improvement of 6.13% and up to 200.32%. Furthermore, experiments on six highly imbalanced datasets show that PRRO enables the generator to produce synthetic data with a class distribution that resembles the original data more closely, achieving a similarity improvement of 43%. Through PRRO, we foster a seamless integration of data synthesis with subsequent SL prediction, promoting high-quality and accessible data analysis.
This paper addresses the problem of estimating the containment and similarity between two sets using only random samples from each set, without relying on sketches or full data access. The study introduces a binomial model for predicting the overlap between samples, demonstrating that it is both accurate and practical when sample sizes are small compared to the original sets. The paper compares this model to previous approaches and shows that it provides better estimates under the considered conditions. It also analyzes the statistical properties of the estimator, including error bounds and sample size requirements needed to achieve a desired level of accuracy and confidence. The framework is extended to estimate set similarity, and the paper provides guidance for applying these methods in large-scale data systems where only partial or sampled data is available.
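One simple way to turn the binomial-overlap idea into an estimator is sketched below; the paper's exact estimator, error bounds, and sample-size requirements are developed in the text, so this is only an illustrative moment-style estimate with hypothetical function names.

```python
def estimate_containment(sample_a, sample_b, size_a, size_b):
    """Estimate containment C = |A ∩ B| / |A| from uniform random samples.

    sample_a, sample_b: Python sets of elements sampled uniformly (without
    replacement) from A and B; size_a, size_b: the full set sizes.
    Each element of sample_a lies in A ∩ B with probability C and is then
    seen in sample_b with probability len(sample_b) / size_b, so the
    observed sample overlap is approximately binomial with mean
    len(sample_a) * C * len(sample_b) / size_b.  Inverting that mean
    gives a simple moment estimator of C.
    """
    overlap = len(sample_a & sample_b)
    expected_scale = len(sample_a) * len(sample_b) / size_b
    return min(1.0, overlap / expected_scale) if expected_scale > 0 else 0.0

def estimate_jaccard(sample_a, sample_b, size_a, size_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| derived from the containment estimate."""
    c = estimate_containment(sample_a, sample_b, size_a, size_b)
    inter = c * size_a
    return inter / (size_a + size_b - inter)
```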
In this paper, we tackle the dynamic mean-variance portfolio selection problem in a {\it model-free} manner, based on (generative) diffusion models. We propose using limited data sampled from the real model $\mathcal P$ (which is unknown) to train a generative model $\mathcal Q$ (from which we can easily and adequately sample). With adaptive training and sampling methods that are tailor-made for time series data, we obtain quantification bounds between $\mathcal P$ and $\mathcal Q$ in terms of the adapted Wasserstein metric $\mathcal{AW}_2$. Importantly, the proposed adapted sampling method also facilitates {\it conditional sampling}. In the second part of this paper, we establish the stability of the mean-variance portfolio optimization problem in $\mathcal{AW}_2$. Then, combined with the error bounds and the stability result, we propose a policy gradient algorithm based on the generative environment, in which our innovative adapted sampling method provides approximate scenario generators. We illustrate the performance of our algorithm on both simulated and real data. For real data, the algorithm based on the generative environment produces portfolios that beat several important baselines, including the Markowitz portfolio, the equal-weight (naive) portfolio, and the S\&P 500.
In multi-source learning with discrete labels, distributional heterogeneity across domains poses a central challenge to developing predictive models that transfer reliably to unseen domains. We study multi-source unsupervised domain adaptation, where labeled data are drawn from multiple source domains while only unlabeled data are available from the target domain. To address potential distribution shifts, we propose a novel Conditional Group Distributionally Robust Optimization (CG-DRO) framework that learns a classifier by minimizing the worst-case cross-entropy loss over the convex combinations of the conditional outcome distributions from the sources. To solve the resulting minimax problem, we develop an efficient Mirror Prox algorithm, where we employ a double machine learning procedure to estimate the risk function. This ensures that the errors of the machine learning estimators for the nuisance models enter only at higher-order rates, thereby preserving statistical efficiency under covariate shift. We establish fast statistical convergence rates for the estimator by constructing two surrogate minimax optimization problems that serve as theoretical bridges. A distinguishing challenge for CG-DRO is the emergence of nonstandard asymptotics: the empirical estimator may fail to converge to a standard limiting distribution due to boundary effects and system instability. To address this, we introduce a perturbation-based inference procedure that enables uniformly valid inference, including confidence interval construction and hypothesis testing.
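For intuition, the sketch below shows a simplified worst-case (group DRO) update over source domains using exponentiated-gradient weights on the simplex; it stands in for, but does not reproduce, the paper's Mirror Prox algorithm with double machine learning, and all names are illustrative.

```python
import torch

def group_dro_step(model, optimizer, group_batches, q, eta_q=0.01):
    """One simplified worst-case update over K source domains.

    group_batches: list of (x_k, y_k) batches, one per source domain.
    q: current simplex weights over domains (sums to 1).
    The adversary raises the weight of the worst-performing domain via an
    exponentiated-gradient step; the model then descends the q-weighted
    cross-entropy.  This is a stand-in for the paper's Mirror Prox with
    double machine learning, not the authors' implementation.
    """
    losses = torch.stack([
        torch.nn.functional.cross_entropy(model(x_k), y_k)
        for x_k, y_k in group_batches
    ])
    # Adversarial (ascent) step on the domain weights over the simplex.
    with torch.no_grad():
        q = q * torch.exp(eta_q * losses.detach())
        q = q / q.sum()
    # Descent step on the model parameters for the weighted loss.
    optimizer.zero_grad()
    (q * losses).sum().backward()
    optimizer.step()
    return q
```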
Time series forecasting is a fundamental task with broad applications, yet conventional methods often treat data as discrete sequences, overlooking their origin as noisy samples of continuous processes. Crucially, discrete noisy observations cannot uniquely determine a continuous function; instead, they correspond to a family of plausible functions. Mathematically, time series can be viewed as noisy observations of a continuous function family governed by a shared probability measure. Thus, the forecasting task can be framed as learning the transition from the historical function family to the future function family. This reframing introduces two key challenges: (1) How can we leverage discrete historical and future observations to learn the relationships between their underlying continuous functions? (2) How can we model the transition path in function space from the historical function family to the future function family? To address these challenges, we propose NeuTSFlow, a novel framework that leverages Neural Operators to facilitate flow matching for learning the path of measures between historical and future function families. By parameterizing the velocity field of the flow in infinite-dimensional function spaces, NeuTSFlow moves beyond traditional methods that focus on dependencies at discrete points, directly modeling function-level features instead. Experiments on diverse forecasting tasks demonstrate NeuTSFlow's superior accuracy and robustness, validating the effectiveness of the function-family perspective.
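To make the flow-matching component concrete, here is a minimal conditional flow-matching loss on discretized curves; the straight-line path and the plain velocity network are illustrative simplifications of NeuTSFlow's neural-operator parameterization.

```python
import torch

def flow_matching_loss(velocity_net, x_hist, x_future):
    """Conditional flow-matching loss between paired curve samples.

    x_hist, x_future: batches of discretized historical / future curves
    (shape: batch x grid points), standing in for samples from the
    historical and future function families.
    The network velocity_net(x_t, t, x_hist) is trained to match the
    constant velocity of the straight-line path from x_hist to x_future;
    NeuTSFlow parameterizes this velocity field with a neural operator
    rather than the plain network assumed here.
    """
    batch = x_hist.shape[0]
    t = torch.rand(batch, 1)                          # random time in [0, 1]
    x_t = (1 - t) * x_hist + t * x_future             # point on the linear path
    target_velocity = x_future - x_hist               # derivative of that path
    pred_velocity = velocity_net(x_t, t, x_hist)
    return ((pred_velocity - target_velocity) ** 2).mean()
```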
As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets, such as cosine learning rate schedules, are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.
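For reference, the core Schedule-Free update maintains a base iterate z and an averaged iterate x and evaluates gradients at an interpolation of the two; the sketch below follows the plain SGD form of Defazio et al. [2024] rather than the AdamW variant analyzed in the paper, with illustrative names.

```python
def schedule_free_sgd_step(params, grads_at_y, lr, t, beta=0.9):
    """One Schedule-Free SGD step (Defazio et al., 2024), plain-Python sketch.

    params: dict with arrays 'z' (base iterate) and 'x' (running average).
    grads_at_y: gradient evaluated at y = (1 - beta) * z + beta * x, the
    interpolation point actually used for the forward/backward pass.
    The averaged iterate x is what is evaluated and deployed; no decay
    phase or extra averaging buffer is needed.  The SF-AdamW variant
    studied in the paper replaces the plain gradient step on z with an
    Adam-style update.
    """
    z, x = params["z"], params["x"]
    z = z - lr * grads_at_y                 # gradient step on the base iterate
    c = 1.0 / (t + 1)                       # averaging weight at step t
    x = (1.0 - c) * x + c * z               # implicit weight averaging
    params["z"], params["x"] = z, x
    return (1.0 - beta) * z + beta * x      # next evaluation point y
```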
Bayesian optimization is a powerful tool for optimizing an expensive-to-evaluate black-box function. In particular, the effectiveness of expected improvement (EI) has been demonstrated in a wide range of applications. However, theoretical analyses of EI are limited compared with other theoretically established algorithms. This paper analyzes a randomized variant of EI, which evaluates EI with respect to the maximum of a posterior sample path. We show that this posterior sampling-based random EI achieves sublinear Bayesian cumulative regret bounds under the assumption that the black-box function follows a Gaussian process. Finally, we demonstrate the effectiveness of the proposed method through numerical experiments.
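A minimal sketch of the acquisition rule described above: expected improvement evaluated against the maximum of a posterior sample path rather than the best observation. The helper below assumes a Gaussian posterior at each candidate point; all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def randomized_ei(mu, sigma, sample_max):
    """Expected improvement computed against a posterior-sample maximum.

    mu, sigma: GP posterior mean and standard deviation at the candidates.
    sample_max: the maximum of one sample path drawn from the posterior
    (e.g. the max of a posterior draw on a dense candidate grid), used as
    the improvement threshold instead of the best observed value.
    """
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - sample_max) / sigma
    return (mu - sample_max) * norm.cdf(z) + sigma * norm.pdf(z)

# One acquisition round (sketch): draw a posterior path on candidates X,
# take its maximum, then pick the candidate maximizing this EI.
#   path = gp.sample_y(X, n_samples=1).ravel()   # e.g. a scikit-learn GP draw
#   x_next = X[np.argmax(randomized_ei(mu, sigma, path.max()))]
```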
In the study of complex dynamical systems, understanding and accurately modeling the underlying physical processes is crucial for predicting system behavior and designing effective interventions. Yet real-world systems exhibit pronounced input (or system) variability and are observed under noisy, limited-data conditions that confound traditional discovery methods assuming fixed-coefficient deterministic models. In this work, we theorize that accounting for system variability together with measurement noise is the key to consistently discovering the governing equations underlying dynamical systems. As such, we introduce a stochastic inverse physics-discovery (SIP) framework that treats the unknown coefficients as random variables and infers their posterior distribution by minimizing the Kullback-Leibler divergence between the push-forward of the posterior samples and the empirical data distribution. Benchmarks on four canonical problems -- the Lotka-Volterra predator-prey system (multi- and single-trajectory), the historical Hudson Bay lynx-hare data, the chaotic Lorenz attractor, and fluid infiltration in porous media using low- and high-viscosity liquids -- show that SIP consistently identifies the correct equations and lowers coefficient root-mean-square error by an average of 82\% relative to the Sparse Identification of Nonlinear Dynamics (SINDy) approach and its Bayesian variant. The resulting posterior distributions yield 95\% credible intervals that closely track the observed trajectories, providing interpretable models with quantified uncertainty. SIP thus provides a robust, data-efficient approach for consistent physics discovery in noisy, variable, and data-limited settings.
Real-world data is often represented through the relationships between data samples, forming a graph structure. In many applications, it is necessary to learn this graph structure from the observed data. Current graph learning research has primarily focused on unsigned graphs, which consist only of positive edges. However, many biological and social systems are better described by signed graphs that account for both positive and negative interactions, capturing similarity and dissimilarity between samples. In this paper, we develop a method for learning signed graphs from a set of smooth signed graph signals. Specifically, we employ the net Laplacian as a graph shift operator (GSO) and define smooth signed graph signals as the outputs of a low-pass signed graph filter based on this operator. The signed graph is then learned by formulating a non-convex optimization problem where the total variation of the observed signals is minimized with respect to the net Laplacian. The proposed problem is solved using the alternating direction method of multipliers (ADMM), and a fast algorithm that reduces the per-iteration complexity of ADMM from quadratic to linear in the number of nodes is introduced. Furthermore, theoretical proofs of convergence for the algorithm and a bound on the estimation error of the learned net Laplacian as a function of sample size, number of nodes, and graph topology are provided. Finally, the proposed method is evaluated on simulated data and a gene regulatory network inference problem and compared to existing signed graph learning methods.
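For concreteness, the net Laplacian GSO and the smoothness (total variation) measure appearing in the learning objective can be written as below; the ADMM solver and the fast linear-time variant are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def net_laplacian(W):
    """Net Laplacian L = D_net - W of a signed graph.

    W: symmetric weight matrix with positive entries for similarity edges
    and negative entries for dissimilarity edges (zero diagonal).
    D_net is diagonal with the *net* degrees, i.e. the signed row sums,
    which is what makes this the graph shift operator used to define
    smooth signed graph signals.
    """
    d_net = W.sum(axis=1)
    return np.diag(d_net) - W

def total_variation(X, L):
    """Total variation of signals X (columns are graph signals) w.r.t. L:
    sum_i x_i^T L x_i, the quantity minimized when learning the graph."""
    return np.trace(X.T @ L @ X)
```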
The matrix scaling problem, particularly the Sinkhorn-Knopp algorithm, has been studied for over 60 years. In practice, the algorithm often yields high-quality approximations within just a few iterations. Theoretically, however, the best-known upper bound places it in the class of pseudopolynomial-time approximation algorithms. Meanwhile, the lower-bound landscape remains largely unexplored. Two fundamental questions persist: what accounts for the algorithm's strong empirical performance, and can a tight bound on its iteration count be established? For an $n\times n$ matrix, its normalized version is obtained by dividing each entry by its largest entry. We say that a normalized matrix has a density $\gamma$ if there exists a constant $\rho > 0$ such that one row or column has exactly $\lceil \gamma n \rceil$ entries with values at least $\rho$, and every other row and column has at least $\lceil \gamma n \rceil$ such entries. For the upper bound, we show that the Sinkhorn-Knopp algorithm produces a nearly doubly stochastic matrix in $O(\log n - \log \varepsilon)$ iterations and $\widetilde{O}(n^2)$ time for all nonnegative square matrices whose normalized version has a density $\gamma > 1/2$. Such matrices cover both the algorithm's principal practical inputs and its typical theoretical regime, and the $\widetilde{O}(n^2)$ runtime is optimal. For the lower bound, we establish a tight bound of $\widetilde{\Omega}\left(n^{1/2}/\varepsilon\right)$ iterations for positive matrices under the $\ell_2$-norm error measure. Moreover, for every $\gamma < 1/2$, there exists a matrix with density $\gamma$ for which the algorithm requires $\Omega\left(n^{1/2}/\varepsilon\right)$ iterations. In summary, our results reveal a sharp phase transition in the Sinkhorn-Knopp algorithm at the density threshold $\gamma = 1/2$.
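For reference, the Sinkhorn-Knopp iteration analyzed above alternates row and column normalizations; a minimal implementation is sketched below. The stopping rule shown uses the maximum row/column-sum deviation, whereas the paper's lower bound is stated under the $\ell_2$-norm error measure.

```python
import numpy as np

def sinkhorn_knopp(A, eps=1e-6, max_iter=10_000):
    """Sinkhorn-Knopp iteration: alternately normalize rows and columns.

    A: nonnegative square matrix (with total support).  Returns positive
    scaling vectors r, c such that diag(r) @ A @ diag(c) is (nearly)
    doubly stochastic, stopping when every row and column sum is within
    eps of 1.
    """
    n = A.shape[0]
    r = np.ones(n)
    c = np.ones(n)
    for _ in range(max_iter):
        r = 1.0 / (A @ c)                  # fix the row sums
        c = 1.0 / (A.T @ r)                # fix the column sums
        S = (A * r[:, None]) * c[None, :]
        if (np.abs(S.sum(axis=1) - 1).max() < eps
                and np.abs(S.sum(axis=0) - 1).max() < eps):
            break
    return r, c
```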
This paper introduces a comprehensive open-source framework for developing correlation kernels, with a particular focus on user-defined kernels and kernel composition for surrogate modeling. By advancing kernel-based modeling techniques, we incorporate frequency-aware elements that effectively capture complex mechanical behaviors and time-frequency dynamics intrinsic to aircraft systems. Traditional kernel functions, often limited to exponential-based methods, are extended to include a wider range of kernels, such as exponential squared sine and rational quadratic kernels, along with their respective first- and second-order derivatives. The proposed methodologies are first validated on a cardinal sine (sinc) test case and then applied to forecasting Mauna Loa carbon dioxide (CO2) concentrations and airline passenger traffic. All these advancements are integrated into the open-source Surrogate Modeling Toolbox (SMT 2.0), providing a versatile platform for both standard and customizable kernel configurations. Furthermore, the framework enables the combination of various kernels to leverage their unique strengths in composite models tailored to specific problems. The resulting framework offers a flexible toolset for engineers and researchers, paving the way for numerous future applications in metamodeling for complex, frequency-sensitive domains.
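SMT 2.0's own kernel API is not reproduced here; as a generic illustration of how kernel composition serves a CO2-style forecasting task, the sketch below combines periodic, smooth, and noise kernels using scikit-learn's kernel algebra, with illustrative hyperparameter values.

```python
# Generic illustration of composing a smooth trend, a seasonal component,
# medium-term variations, and noise; this uses scikit-learn's kernel
# algebra, not SMT 2.0's API.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel,
)

kernel = (
    50.0**2 * RBF(length_scale=50.0)                            # long-term trend
    + 2.0**2 * RBF(length_scale=100.0)
      * ExpSineSquared(length_scale=1.0, periodicity=1.0)       # seasonal component
    + 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)   # medium-term variations
    + WhiteKernel(noise_level=0.1)                              # observation noise
)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gp.fit(X_train, y_train); mean, std = gp.predict(X_grid, return_std=True)
```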
We introduce an algorithm for identifying interpretable subgroups with elevated treatment effects, given an estimate of individual or conditional average treatment effects (CATE). Subgroups are characterized by ``rule sets'' -- easy-to-understand statements of the form (Condition A AND Condition B) OR (Condition C) -- which can capture high-order interactions while retaining interpretability. Our method complements existing approaches for estimating the CATE, which often produce high dimensional and uninterpretable results, by summarizing and extracting critical information from fitted models to aid decision making, policy implementation, and scientific understanding. We propose an objective function that trades off subgroup size and effect size, and varying the hyperparameter that controls this trade-off results in a ``frontier'' of Pareto-optimal rule sets, none of which dominates the others across all criteria. Valid inference is achievable through sample splitting. We demonstrate the utility and limitations of our method using simulated and empirical examples.
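One plausible form of the size-versus-effect objective is sketched below; the paper's exact objective, rule-set search, and sample-splitting inference are not reproduced, and the functional form and names are illustrative assumptions.

```python
import numpy as np

def subgroup_objective(cate, in_subgroup, lam):
    """Trade-off between subgroup size and average effect for one rule set.

    cate: estimated conditional average treatment effects for all units.
    in_subgroup: boolean mask of units satisfying the rule set, e.g.
    (Condition A AND Condition B) OR (Condition C).
    lam: hyperparameter; sweeping it traces out a Pareto frontier of rule
    sets.  The exact functional form used in the paper may differ; this
    only illustrates the size-vs-effect trade-off.
    """
    if not in_subgroup.any():
        return -np.inf
    effect = cate[in_subgroup].mean()      # average effect inside the subgroup
    coverage = in_subgroup.mean()          # fraction of the population covered
    return effect + lam * coverage

# Small lam favors small, high-effect subgroups; large lam favors broad
# subgroups with a more modest average effect.
```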
Motivated by applications such as cloud platforms allocating GPUs to users or governments deploying mobile health units across competing regions, we study the dynamic allocation of a reusable resource to strategic agents with private valuations. Our objective is to simultaneously (i) maximize social welfare, (ii) satisfy multi-dimensional long-term cost constraints, and (iii) incentivize truthful reporting. We begin by numerically evaluating primal-dual methods widely used in constrained online optimization and find them to be highly fragile in strategic settings -- agents can easily manipulate their reports to distort future dual updates for their own gain. To address this vulnerability, we develop an incentive-aware framework that makes primal-dual methods robust to strategic behavior. Our design combines epoch-based lazy updates -- where dual variables remain fixed within each epoch -- with randomized exploration rounds that extract approximately truthful signals for learning. Leveraging carefully designed online learning subroutines for the dual updates, which may be of independent interest, our mechanism achieves $\tilde{\mathcal{O}}(\sqrt{T})$ social welfare regret, satisfies all cost constraints, and ensures incentive alignment. This matches the performance of non-strategic allocation approaches while being robust to strategic agents.
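To illustrate the epoch-based lazy dual updates and the exploration rounds, here is a heavily simplified sketch; the allocation rule, exploration scheme, and dual step are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

def lazy_primal_dual_epoch(reports, costs, budgets, lam, eta,
                           explore_prob=0.1, rng=np.random.default_rng()):
    """One epoch of an epoch-based lazy primal-dual allocation (sketch).

    reports: reported values per round (rounds x agents); costs: per-agent
    resource cost vectors (agents x dims); budgets: per-round cost budgets.
    The dual prices `lam` stay fixed for the whole epoch, so an agent's
    report in one round cannot shift the prices it faces later in the
    epoch; a small fraction of rounds is randomized exploration used to
    gather approximately truthful signals.  `lam` is updated only at the
    epoch boundary via projected gradient ascent on the constraints.
    """
    spent = np.zeros_like(budgets, dtype=float)
    for v_t in reports:
        if rng.random() < explore_prob:
            winner = rng.integers(len(v_t))             # exploration round
        else:
            winner = int(np.argmax(v_t - costs @ lam))  # price-adjusted value
        spent += costs[winner]
    # Lazy dual update at the end of the epoch only.
    avg_violation = spent / len(reports) - budgets
    return np.maximum(0.0, lam + eta * avg_violation)
```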