Loading...
Loading...
Browse, search and filter the latest cybersecurity research papers from arXiv
ROCFTP is a perfect sampling algorithm that employs various random operations, and requiring a specific Markov chain construction for each target. To overcome this requirement, the Metropolis algorithm is incorporated as a random operation within ROCFTP. While the Metropolis sampler functions as a random operation, it isn't a coupler. However, by employing normal multishift coupler as a symmetric proposal for Metropolis, we obtain ROCFTP with Metropolis-multishift. Initially designed for bounded state spaces, ROCFTP's applicability to targets with unbounded state spaces is extended through the introduction of the Most Interest Range (MIR) for practical use. It was demonstrated that selecting MIR decreases the likelihood of ROCFTP hitting $MIR^C$ by a factor of (1 - {\epsilon}), which is beneficial for practical implementation. The algorithm exhibits a convergence rate characterized by exponential decay. Its performance is rigorously evaluated across various targets, and tests ensure its goodness of fit. Lastly, an R package is provided for generating exact samples using ROCFTP Metropolis-multishift.
We introduce a scalable framework for regressing multivariate distributions onto multivariate distributions, motivated by the application of inferring cell-cell communication from population-scale single-cell data. The observed data consist of pairs of multivariate distributions for ligands from one cell type and corresponding receptors from another. For each ordered pair $e=(l,r)$ of cell types $(l \neq r)$ and each sample $i = 1, \ldots, n$, we observe a pair of distributions $(F_{ei}, G_{ei})$ of gene expressions for ligands and receptors of cell types $l$ and $r$, respectively. The aim is to set up a regression of receptor distributions $G_{ei}$ given ligand distributions $F_{ei}$. A key challenge is that these distributions reside in distinct spaces of differing dimensions. We formulate the regression of multivariate densities on multivariate densities using a generalized Bayes framework with the sliced Wasserstein distance between fitted and observed distributions. Finally, we use inference under such regressions to define a directed graph for cell-cell communications.
Agent-based simulation provides a powerful tool for in silico system modeling. However, these simulations do not provide built-in methods for uncertainty quantification (UQ). Within these types of models a typical approach to UQ is to run multiple realizations of the model then compute aggregate statistics. This approach is limited due to the compute time required for a solution. When faced with an emerging biothreat, public health decisions need to be made quickly and solutions for integrating near real-time data with analytic tools are needed. We propose an integrated Bayesian UQ framework for agent-based models based on sequential Monte Carlo sampling. Given streaming or static data about the evolution of an emerging pathogen, this Bayesian framework provides a distribution over the parameters governing the spread of a disease through a population. These estimates of the spread of a disease may be provided to public health agencies seeking to abate the spread. By coupling agent-based simulations with Bayesian modeling in a data assimilation, our proposed framework provides a powerful tool for modeling dynamical systems in silico. We propose a method which reduces model error and provides a range of realistic possible outcomes. Moreover, our method addresses two primary limitations of ABMs: the lack of UQ and an inability to assimilate data. Our proposed framework combines the flexibility of an agent-based model with UQ provided by the Bayesian paradigm in a workflow which scales well to HPC systems. We provide algorithmic details and results on a simulated outbreak with both static and streaming data.
Markov chain sampling methods form the backbone of modern computational statistics. However, many popular methods are prone to random walk behavior, i.e., diffusion-like exploration of the sample space, leading to slow mixing that requires intricate tuning to alleviate. Non-reversible samplers can resolve some of these issues. We introduce a device that turns jump processes that satisfy a skew-detailed balance condition for a reference measure into a process that samples a target measure that is absolutely continuous with respect to the reference measure. The resulting sampler is rejection-free, non-reversible, and continuous-time. As an example, we apply the device to Hamiltonian dynamics discretized by the leapfrog integrator, resulting in a rejection-free non-reversible continuous-time version of Hamiltonian Monte Carlo (HMC). We prove the geometric ergodicity of the resulting sampler under certain convexity conditions, and demonstrate its qualitatively different behavior to HMC through numerical examples.
Simulation-Based Inference (SBI) deals with statistical inference in problems where the data are generated from a system that is described by a complex stochastic simulator. The challenge for inference in these problems is that the likelihood is intractable; SBI proceeds by using the simulator to sample from the likelihood. In many real world applications, simulator calls are expensive, limiting the associated sample size. Our goal in this work is to extend SBI to exploit two proposals for reducing simulator calls: to draw likelihood samples from a Neural Density Estimator (NDE) surrogate rather than from the stochastic simulator; and use of Support Points rather than simple random sampling to generate evaluation sites. We embed these methods in the Sequential Neural Posterior Estimator (SNPE) algorithm. Across a suite of test cases, we find that the NDE surrogate improves the quality of the inference; support points worked well in some examples, but not in others.
Ordinary differential equations (ODEs) are fundamental tools for modeling complex dynamic systems across scientific disciplines. However, parameter estimation in ODE models is challenging due to the multimodal nature of the likelihood function, which can lead to local optima and unstable inference. In this paper, we propose particle data cloning (PDC), a novel approach that enhances global optimization by leveraging data cloning and annealed sequential Monte Carlo (ASMC). PDC mitigates multimodality by refining the likelihood through data clones and progressively extracting information from the sharpened posterior. Compared to standard data cloning, PDC provides more reliable frequentist inference and demonstrates superior global optimization performance. We offer practical guidelines for efficient implementation and illustrate the method through simulation studies and an application to a prey-predator ODE model. Our implementation is available at https://github.com/SONDONGHUI/PDC.
This study introduces a computationally efficient algorithm, delayed acceptance Markov chain Monte Carlo (DA-MCMC), designed to improve posterior simulation in quasi-Bayesian inference. Quasi-Bayesian methods, which do not require fully specifying a probabilistic model, are often computationally expensive owing to the need to evaluate the inverse and determinant of large covariance matrices. DA-MCMC addresses this challenge by employing a two-stage process: In the first stage, proposals are screened using an approximate posterior, whereas a final acceptance or rejection decision is made in the second stage based on the exact target posterior. This reduces the need for costly matrix computations, thereby improving efficiency without sacrificing accuracy. We demonstrate the effectiveness of DA-MCMC through applications to both synthetic and real data. The results demonstrate that, although DA-MCMC slightly reduces the effective sample size per iteration compared with the standard MCMC, it achieves substantial improvement in terms of effective sample size per second, approximately doubling the efficiency. This makes DA-MCMC particularly useful for cases where posterior simulation is computationally intensive. Thus, the DA-MCMC algorithm offers a significant advancement in computational efficiency for quasi-Bayesian inference, making it a valuable tool for robust Bayesian analysis.
We revisit extending the Kolmogorov-Smirnov distance between probability distributions to the multidimensional setting and make new arguments about the proper way to approach this generalization. Our proposed formulation maximizes the difference over orthogonal dominating rectangular ranges (d-sided rectangles in R^d), and is an integral probability metric. We also prove that the distance between a distribution and a sample from the distribution converges to 0 as the sample size grows, and bound this rate. Moreover, we show that one can, up to this same approximation error, compute the distance efficiently in 4 or fewer dimensions; specifically the runtime is near-linear in the size of the sample needed for that error. With this, we derive a delta-precision two-sample hypothesis test using this distance. Finally, we show these metric and approximation properties do not hold for other popular variants.
The analysis of data from multiple experiments, such as observations of several individuals, is commonly approached using mixed-effects models, which account for variation between individuals through hierarchical representations. This makes mixed-effects models widely applied in fields such as biology, pharmacokinetics, and sociology. In this work, we propose a novel methodology for scalable Bayesian inference in hierarchical mixed-effects models. Our framework first constructs amortized approximations of the likelihood and the posterior distribution, which are then rapidly refined for each individual dataset, to ultimately approximate the parameters posterior across many individuals. The framework is easily trainable, as it uses mixtures of experts but without neural networks, leading to parsimonious yet expressive surrogate models of the likelihood and the posterior. We demonstrate the effectiveness of our methodology using challenging stochastic models, such as mixed-effects stochastic differential equations emerging in systems biology-driven problems. However, the approach is broadly applicable and can accommodate both stochastic and deterministic models. We show that our approach can seamlessly handle inference for many parameters. Additionally, we applied our method to a real-data case study of mRNA transfection. When compared to exact pseudomarginal Bayesian inference, our approach proved to be both fast and competitive in terms of statistical accuracy.
Spontaneous reporting system databases are key resources for post-marketing surveillance, providing real-world evidence (RWE) on the adverse events (AEs) of regulated drugs or other medical products. Various statistical methods have been proposed for AE signal detection in these databases, flagging drug-specific AEs with disproportionately high observed counts compared to expected counts under independence. However, signal detection remains challenging for rare AEs or newer drugs, which receive small observed and expected counts and thus suffer from reduced statistical power. Principled information sharing on signal strengths across drugs/AEs is crucial in such cases to enhance signal detection. However, existing methods typically ignore complex between-drug associations on AE signal strengths, limiting their ability to detect signals. We propose novel local-global mixture Dirichlet process (DP) prior-based nonparametric Bayesian models to capture these associations, enabling principled information sharing between drugs while balancing flexibility and shrinkage for each drug, thereby enhancing statistical power. We develop efficient Markov chain Monte Carlo algorithms for implementation and employ a false discovery rate (FDR)-controlled, false negative rate (FNR)-optimized hypothesis testing framework for AE signal detection. Extensive simulations demonstrate our methods' superior sensitivity -- often surpassing existing approaches by a twofold or greater margin -- while strictly controlling the FDR. An application to FDA FAERS data on statin drugs further highlights our methods' effectiveness in real-world AE signal detection. Software implementing our methods is provided as supplementary material.
In this article, we explore Bayesian extensions of the tensor normal model through a geometric expansion of the multi-way covariance's Cholesky factor inspired by the Fr\'echet mean under the log-Cholesky metric. Specifically, within a tensor normal framework, we identify three structural components in the covariance of the vectorized data. By parameterizing vector normal covariances through such a Cholesky factor representation, analogous to a finite average of multiway Cholesky factors, we eliminate one of these structural components without compromising the analytical tractability of the likelihood, in which the multiway covariance is a special case. Furthermore, we demonstrate that a specific class of structured Cholesky factors can be precisely represented under this parameterization, serving as an analogue to the Pitsianis-Van Loan decomposition. We apply this model using Hamiltonian Monte Carlo in a fixed-mean setting for two-way covariance relevancy detection of components, where efficient analytical gradient updates are available, as well as in a seasonally-varying covariance process regime.
Existing distribution compression methods, like Kernel Herding (KH), were originally developed for unlabelled data. However, no existing approach directly compresses the conditional distribution of labelled data. To address this gap, we first introduce the Average Maximum Conditional Mean Discrepancy (AMCMD), a natural metric for comparing conditional distributions. We then derive a consistent estimator for the AMCMD and establish its rate of convergence. Next, we make a key observation: in the context of distribution compression, the cost of constructing a compressed set targeting the AMCMD can be reduced from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$. Building on this, we extend the idea of KH to develop Average Conditional Kernel Herding (ACKH), a linear-time greedy algorithm that constructs a compressed set targeting the AMCMD. To better understand the advantages of directly compressing the conditional distribution rather than doing so via the joint distribution, we introduce Joint Kernel Herding (JKH), a straightforward adaptation of KH designed to compress the joint distribution of labelled data. While herding methods provide a simple and interpretable selection process, they rely on a greedy heuristic. To explore alternative optimisation strategies, we propose Joint Kernel Inducing Points (JKIP) and Average Conditional Kernel Inducing Points (ACKIP), which jointly optimise the compressed set while maintaining linear complexity. Experiments show that directly preserving conditional distributions with ACKIP outperforms both joint distribution compression (via JKH and JKIP) and the greedy selection used in ACKH. Moreover, we see that JKIP consistently outperforms JKH.
Bayesian optimal experimental design (OED) provides a principled framework for selecting the most informative observational settings in experiments. With rapid advances in computational power, Bayesian OED has become increasingly feasible for inference problems involving large-scale simulations, attracting growing interest in fields such as inverse problems. In this paper, we introduce a novel design criterion based on the expected Wasserstein-$p$ distance between the prior and posterior distributions. Especially, for $p=2$, this criterion shares key parallels with the widely used expected information gain (EIG), which relies on the Kullback--Leibler divergence instead. First, the Wasserstein-2 criterion admits a closed-form solution for Gaussian regression, a property which can be also leveraged for approximative schemes. Second, it can be interpreted as maximizing the information gain measured by the transport cost incurred when updating the prior to the posterior. Our main contribution is a stability analysis of the Wasserstein-1 criterion, where we provide a rigorous error analysis under perturbations of the prior or likelihood. We partially extend this study also to the Wasserstein-2 criterion. In particular, these results yield error rates when empirical approximations of priors are used. Finally, we demonstrate the computability of the Wasserstein-2 criterion and demonstrate our approximation rates through simulations.
In Bayesian inference, Hamiltonian Monte Carlo (HMC) is a popular Markov Chain Monte Carlo (MCMC) algorithm known for its efficiency in sampling from complex probability distributions. However, its application to models with latent variables, such as state-space models, poses significant challenges. These challenges arise from the need to compute gradients of the log-posterior of the latent variables, and the likelihood may be intractable due to the complexity of the underlying model. In this paper, we propose Particle Hamiltonian Monte Carlo (PHMC), an algorithm specifically designed for state-space models. PHMC leverages Sequential Monte Carlo (SMC) methods to estimate the marginal likelihood, infer latent variables (as in particle Metropolis-Hastings), and compute gradients of the log-posterior of model parameters. Importantly, PHMC avoids the need to calculate gradients of the log-posterior for latent variables, which addresses a major limitation of traditional HMC approaches. We assess the performance of Particle HMC on both simulated datasets and a real-world dataset involving crowdsourced cycling activities data. The results demonstrate that Particle HMC outperforms particle marginal Metropolis-Hastings with a Gaussian random walk, particularly in scenarios involving a large number of parameters.
The prompt online detection of abrupt changes in image data is essential for timely decision-making in broad applications, from video surveillance to manufacturing quality control. Existing methods, however, face three key challenges. First, the high-dimensional nature of image data introduces computational bottlenecks for efficient real-time monitoring. Second, changes often involve structural image features, e.g., edges, blurs and/or shapes, and ignoring such structure can lead to delayed change detection. Third, existing methods are largely non-Bayesian and thus do not provide a quantification of monitoring uncertainty for confident detection. We address this via a novel Bayesian onLine Structure-Aware change deTection (BLAST) method. BLAST first leverages a deep Gaussian Markov random field prior to elicit desirable image structure from offline reference data. With this prior elicited, BLAST employs a new Bayesian online change-point procedure for image monitoring via its so-called posterior run length distribution. This posterior run length distribution can be computed in an online fashion using $\mathcal{O}(p^2)$ work at each time-step, where $p$ is the number of image pixels; this facilitates scalable Bayesian online monitoring of large images. We demonstrate the effectiveness of BLAST over existing methods in a suite of numerical experiments and in two applications, the first on street scene monitoring and the second on real-time process monitoring for metal additive manufacturing.
In this paper, we study the problem of robust subspace recovery (RSR) in the presence of both strong adversarial corruptions and Gaussian noise. Specifically, given a limited number of noisy samples -- some of which are tampered by an adaptive and strong adversary -- we aim to recover a low-dimensional subspace that approximately contains a significant fraction of the uncorrupted samples, up to an error that scales with the Gaussian noise. Existing approaches to this problem often suffer from high computational costs or rely on restrictive distributional assumptions, limiting their applicability in truly adversarial settings. To address these challenges, we revisit the classical random sample consensus (RANSAC) algorithm, which offers strong robustness to adversarial outliers, but sacrifices efficiency and robustness against Gaussian noise and model misspecification in the process. We propose a two-stage algorithm, RANSAC+, that precisely pinpoints and remedies the failure modes of standard RANSAC. Our method is provably robust to both Gaussian and adversarial corruptions, achieves near-optimal sample complexity without requiring prior knowledge of the subspace dimension, and is more efficient than existing RANSAC-type methods.
Bayesian simulation-based inference (SBI) methods are used in statistical models where simulation is feasible but the likelihood is intractable. Standard SBI methods can perform poorly in cases of model misspecification, and there has been much recent work on modified SBI approaches which are robust to misspecified likelihoods. However, less attention has been given to the issue of inappropriate prior specification, which is the focus of this work. In conventional Bayesian modelling, there will often be a wide range of prior distributions consistent with limited prior knowledge expressed by an expert. Choosing a single prior can lead to an inappropriate choice, possibly conflicting with the likelihood information. Robust Bayesian methods, where a class of priors is considered instead of a single prior, can address this issue. For each density in the prior class, a posterior can be computed, and the range of the resulting inferences is informative about posterior sensitivity to the prior imprecision. We consider density ratio classes for the prior and implement robust Bayesian SBI using amortized neural methods developed recently in the literature. We also discuss methods for checking for conflict between a density ratio class of priors and the likelihood, and sequential updating methods for examining conflict between different groups of summary statistics. The methods are illustrated for several simulated and real examples.
Johnson and Lindenstrauss (Contemporary Mathematics, 1984) showed that for $n > m$, a scaled random projection $\mathbf{A}$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is an approximate isometry on any set $S$ of size at most exponential in $m$. If $S$ is larger, however, its points can contract arbitrarily under $\mathbf{A}$. In particular, the hypergrid $([-B, B] \cap \mathbb{Z})^n$ is expected to contain a point that is contracted by a factor of $\kappa_{\mathsf{stat}} = \Theta(B)^{-1/\alpha}$, where $\alpha = m/n$. We give evidence that finding such a point exhibits a statistical-computational gap precisely up to $\kappa_{\mathsf{comp}} = \widetilde{\Theta}(\sqrt{\alpha}/B)$. On the algorithmic side, we design an online algorithm achieving $\kappa_{\mathsf{comp}}$, inspired by a discrepancy minimization algorithm of Bansal and Spencer (Random Structures & Algorithms, 2020). On the hardness side, we show evidence via a multiple overlap gap property (mOGP), which in particular captures online algorithms; and a reduction-based lower bound, which shows hardness under standard worst-case lattice assumptions. As a cryptographic application, we show that the rounded Johnson-Lindenstrauss embedding is a robust property-preserving hash function (Boyle, Lavigne and Vaikuntanathan, TCC 2019) on the hypergrid for the Euclidean metric in the computationally hard regime. Such hash functions compress data while preserving $\ell_2$ distances between inputs up to some distortion factor, with the guarantee that even knowing the hash function, no computationally bounded adversary can find any pair of points that violates the distortion bound.