We consider the branch-length estimation problem on a bifurcating tree: a character evolves along the edges of a binary tree according to a two-state symmetric Markov process, and we seek to recover the edge transition probabilities from repeated observations at the leaves. This problem arises in phylogenetics and is related to latent tree graphical model inference. In general, the log-likelihood function is non-concave and may admit many critical points. Nevertheless, simple coordinate maximization has been known to perform well in practice, defying the complexity of the likelihood landscape. In this work, we provide the first theoretical guarantee as to why this might be the case. We show that, deep inside the Kesten-Stigum reconstruction regime and provided with polynomially many samples $m$ (assuming the tree is balanced), there exists a universal parameter regime (independent of the size of the tree) in which the log-likelihood function is strongly concave and smooth with high probability. On this high-probability likelihood landscape event, we show that the standard coordinate maximization algorithm converges exponentially fast to the maximum likelihood estimator, which lies within $O(1/\sqrt{m})$ of the true parameter, provided the initial point is sufficiently close.
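For orientation, a minimal sketch of the two-state symmetric model assumed above (our notation): along each edge $e$ with transition probability $p_e$, the character flips state with probability $p_e$ and is retained otherwise, so the per-edge transition matrix is
\[
P_e \;=\; \begin{pmatrix} 1 - p_e & p_e \\ p_e & 1 - p_e \end{pmatrix}, \qquad 0 < p_e < \tfrac{1}{2},
\]
and coordinate maximization updates one edge parameter $p_e$ at a time in the leaf-pattern log-likelihood $\ell(\mathbf{p}) = \sum_{i=1}^m \log P_{\mathbf{p}}(\text{leaf pattern } i)$, holding the remaining edge parameters fixed.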
Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.
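A minimal, generic sketch of the deep-unfolding idea described above, not the LargeMvC-Net architecture itself: the module names, the soft-thresholding noise step, and the softmax anchor assignment below are illustrative assumptions standing in for the paper's RepresentModule, NoiseModule, and AnchorModule.

```python
import numpy as np

def soft_threshold(z, tau):
    """A common noise-suppression (proximal) step used in unfolded networks."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def unfolded_anchor_clustering(views, anchors, weights, n_layers=3, tau=0.1):
    """Illustrative unfolded loop: representation update, noise suppression,
    and soft anchor-indicator estimation, repeated for a fixed number of layers.

    views   : list of (n_samples, d_v) arrays, one per view
    anchors : (n_anchors, d_latent) array of anchor representations
    weights : list of (d_v, d_latent) projection matrices (learned in practice)
    """
    indicators = []
    for X, W in zip(views, weights):
        Z = X @ W                                    # representation step
        C = None
        for _ in range(n_layers):
            Z = soft_threshold(Z, tau)               # noise-suppression step
            S = Z @ anchors.T                        # similarity to anchors
            C = np.exp(S - S.max(axis=1, keepdims=True))
            C /= C.sum(axis=1, keepdims=True)        # soft anchor indicator
            Z = C @ anchors                          # project back through the anchors
        indicators.append(C)
    return indicators

# Toy usage with two random views and ten anchors (all quantities synthetic)
rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 20)), rng.normal(size=(100, 30))]
anchors = rng.normal(size=(10, 8))
weights = [rng.normal(size=(20, 8)), rng.normal(size=(30, 8))]
print([C.shape for C in unfolded_anchor_clustering(views, anchors, weights)])
```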
We consider the problem of testing independence in mixed-type data that combine count variables with positive, absolutely continuous variables. We first introduce two distinct classes of test statistics in the bivariate setting, designed to test independence between the components of a bivariate mixed-type vector. These statistics are then extended to the multivariate context to accommodate: (i) testing independence between vectors of different types and possibly different dimensions, and (ii) testing total independence among all components of vectors with different types. The construction is based on the recently introduced Baringhaus-Gaigall transformation, which characterizes the joint distribution of such data. We establish the asymptotic properties of the resulting tests and, through an extensive power study, demonstrate that the proposed approach is both competitive and flexible.
Handling missing data is a major challenge in model-based clustering, especially when the data exhibit skewness and heavy tails. We address this by extending the finite mixture of scale mixtures of multivariate skew-normal (FMSMSN) family to accommodate incomplete data under a missing at random (MAR) mechanism. Unlike previous work that is limited to one special case of the FMSMSN family, our method offers a cluster analysis methodology for the entire family that accounts for skewness and excess kurtosis in the presence of missing values. The multivariate skew-normal distribution, as parameterised by \cite{azzalini1996} and \cite{arnoldbeaver}, includes the normal distribution as a special case, which ensures that our method remains compatible with existing symmetric model-based clustering techniques that assume normality. We derive the distributional properties of the missing components of the data and propose an augmented EM-type algorithm tailored to incomplete observations. The modified E-step yields closed-form expressions for the conditional expectations of the missing values. Simulation experiments showcase the flexibility of the FMSMSN family in both clustering performance and parameter recovery for varying percentages of missing values, while incorporating the effects of sample size and cluster proximity. Finally, we illustrate the practical utility of the proposed method by applying special cases of the FMSMSN family to global CO2 emissions data.
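For orientation, a minimal sketch of the closed-form E-step imputation in the multivariate normal special case (the skew-normal case involves additional terms; the function and variable names below are ours, not the paper's):

```python
import numpy as np

def conditional_mean_missing(x, mu, Sigma):
    """E[x_miss | x_obs] for a multivariate normal component:
    mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o).
    `x` contains np.nan at missing positions."""
    miss = np.isnan(x)
    obs = ~miss
    if not miss.any():
        return x.copy()
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    cond = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
    x_filled = x.copy()
    x_filled[miss] = cond
    return x_filled

# Example: one partially observed 3-dimensional observation
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
x = np.array([0.7, np.nan, -0.4])
print(conditional_mean_missing(x, mu, Sigma))
```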
To identify the optimal Type-II progressive censoring scheme for a life-testing experiment, the experimenter must search the entire set of feasible censoring schemes. Current recommendations are limited to small sample sizes, and exhaustive search strategies are not practically feasible for large sample sizes. This paper proposes a meta-heuristic procedure based on a genetic algorithm that scales to large sample sizes. The algorithm is found to provide optimal or near-optimal solutions for both small and large sample sizes. Our proposed optimality criterion is based on a cost function and is scale-invariant for both location-scale and log-location-scale families of distributions. To investigate how inaccurate parameter values or cost coefficients may affect the optimal solution, a sensitivity analysis is also carried out.
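A minimal sketch of a genetic algorithm over Type-II progressive censoring schemes, i.e. vectors $(R_1,\dots,R_m)$ of nonnegative integers with $\sum_i R_i = n - m$. The cost-based fitness used below is a placeholder, not the criterion proposed in the paper, and the operator choices (multinomial initialization, one-point crossover with repair, unit-transfer mutation) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_scheme(n, m):
    """Random feasible scheme: nonnegative integers R_1..R_m summing to n - m."""
    return rng.multinomial(n - m, np.ones(m) / m)

def repair(R, n, m):
    """Project an offspring back onto the feasible set (sum equals n - m)."""
    R = np.maximum(R, 0)
    diff = (n - m) - R.sum()
    while diff != 0:
        i = rng.integers(m)
        step = np.sign(diff)
        if R[i] + step >= 0:
            R[i] += step
            diff -= step
    return R

def crossover(a, b, n, m):
    cut = rng.integers(1, m)
    return repair(np.concatenate([a[:cut], b[cut:]]), n, m)

def mutate(R, n, m):
    R = R.copy()
    i, j = rng.integers(m, size=2)
    if R[i] > 0:
        R[i] -= 1
        R[j] += 1
    return R

def genetic_search(n, m, fitness, pop_size=40, n_gen=200):
    pop = [random_scheme(n, m) for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=fitness)                 # smaller cost is better
        elite = pop[: pop_size // 4]          # keep the best quarter
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.choice(len(elite), size=2, replace=True)
            children.append(mutate(crossover(elite[a], elite[b], n, m), n, m))
        pop = elite + children
    return min(pop, key=fitness)

# Placeholder cost: penalize censoring early in the experiment (illustrative only)
cost = lambda R: float(np.dot(np.arange(len(R), 0, -1), R))
print(genetic_search(n=30, m=10, fitness=cost))
```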
We put forward a new Bayesian modeling strategy for spatiotemporal count data that enables efficient posterior sampling. Most previous models for such data decompose the logarithms of the response Poisson rates into fixed effects and spatial random effects, where the latter are typically assumed to follow a latent Gaussian process, a conditional autoregressive model, or an intrinsic conditional autoregressive model. Since the log-Gaussian prior is not conjugate to the Poisson likelihood, such implementations must resort to approximation methods like INLA or to Metropolis moves on latent states within MCMC algorithms, and they face several approximation and posterior sampling challenges. Instead of modeling the logarithms of the spatiotemporal frailties jointly as a Gaussian process, we construct a spatiotemporal autoregressive gamma process that is guaranteed to be stationary across the time dimension. We decompose the latent Poisson variables to permit fully conjugate Gibbs sampling of the spatiotemporal frailties and design a sparse spatial dependence structure to achieve linear computational complexity, facilitating efficient posterior computation. Our model permits convenient Bayesian predictive machinery based on posterior samples that delivers satisfactory performance in predicting at new spatial locations and time intervals. Extensive simulation experiments and real data analyses corroborate the model's accurate parameter estimation, model fitting, and out-of-sample prediction capabilities.
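The conjugacy that the construction exploits is the standard gamma-Poisson update; as a minimal illustration in generic notation (not the paper's full spatiotemporal decomposition):
\[
\lambda \sim \mathrm{Gamma}(a, b), \quad y \mid \lambda \sim \mathrm{Poisson}(\lambda) \;\;\Longrightarrow\;\; \lambda \mid y \sim \mathrm{Gamma}(a + y,\; b + 1),
\]
so, unlike a log-Gaussian prior, the full conditional of each gamma frailty remains available in closed form for Gibbs sampling.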
We consider a bivariate, possibly non-homogeneous, finite-state Markov chain $(X,U)=\{(X_t,U_t)\}_{t=1}^n$. We are interested in the marginal process $X$, which typically is not a Markov chain. The goal is to find a realization (path) $x=(x_1,\ldots,x_n)$ with maximal probability $P(X=x)$. If $X$ is a Markov chain, then such a path can be found efficiently using the celebrated Viterbi algorithm. However, when $X$ is not Markovian, identifying the most probable path -- hereafter referred to as the Viterbi path -- becomes computationally expensive. In this paper, we explore the branch-and-bound method for finding Viterbi paths. The method relies on lower and upper bounds on the maximum probability $\max_x P(X=x)$, and the objective of the paper is to exploit the joint Markov property of $(X,U)$ to compute bounds that are as tight as possible at as low a computational cost as possible. This research is motivated by the decoding, or segmentation, problem in triplet Markov models. A triplet Markov model is a trivariate homogeneous Markov process $(X,U,Y)$. In decoding, a realization of one marginal process $Y$ is observed (representing the data), while $X$ and $U$ are latent processes. The process $U$ serves as a nuisance variable, whereas $X$ is the process of primary interest. Decoding refers to estimating the hidden sequence $X$ based solely on the observation $Y$. Conditional on $Y$, the latent processes $(X, U)$ form a non-homogeneous Markov chain. In this context, the Viterbi path corresponds to the maximum a posteriori (MAP) estimate of $X$, making it a natural choice for signal reconstruction.
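For reference, a minimal sketch of the dynamic-programming (Viterbi-type) recursion that finds the most probable path when the process itself is a Markov chain, i.e. the efficient baseline mentioned above (log-probabilities are used for numerical stability; variable names are ours):

```python
import numpy as np

def most_probable_path(log_init, log_trans):
    """Most probable realization of a (possibly non-homogeneous) Markov chain.

    log_init  : (K,) array, log P(X_1 = k)
    log_trans : list of (K, K) arrays, log P(X_{t+1} = j | X_t = i) for t = 1..n-1
    Returns the path maximizing log P(X = x).
    """
    n = len(log_trans) + 1
    K = len(log_init)
    delta = np.empty((n, K))
    back = np.empty((n, K), dtype=int)
    delta[0] = log_init
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_trans[t - 1]   # (K, K): prev state x next state
        back[t] = scores.argmax(axis=0)                     # best predecessor for each state
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):                           # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example with two states and five time points
log_init = np.log([0.6, 0.4])
log_trans = [np.log([[0.7, 0.3], [0.2, 0.8]])] * 4
print(most_probable_path(log_init, log_trans))
```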
This study introduces the SH-MBS-GARCH model, a hysteretic multivariate Bayesian structural GARCH framework that integrates hard and soft information to capture the joint dynamics of multiple financial time series while addressing conditional heteroscedasticity through GARCH components. Different model specifications may use soft information to define the regime indicator in distinct ways; we propose a flexible, straightforward method for embedding soft information into the regime component that is applicable across all SH-MBS-GARCH model variants. We further propose a generally applicable Bayesian estimation approach that combines adaptive MCMC, spike-and-slab regression, and a simulation smoother, ensuring accurate parameter estimation, which we validate through extensive simulations. Empirical analysis of the Dow Jones Industrial Average, NASDAQ Composite, and PHLX Semiconductor indices from January 2016 to December 2020 demonstrates that the SH-MBS-GARCH model outperforms competing models in fitting and prediction accuracy, effectively capturing regime-switching dynamics.
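For orientation, the univariate GARCH(1,1) recursion that underlies the conditional-heteroscedasticity component (standard notation; the paper's multivariate, hysteretic specification adds regime-dependent structure on top of such components):
\[
\varepsilon_t = \sigma_t z_t, \quad z_t \overset{\text{iid}}{\sim} \mathcal{N}(0,1), \qquad \sigma_t^2 = \omega + \alpha\, \varepsilon_{t-1}^2 + \beta\, \sigma_{t-1}^2, \quad \omega > 0,\; \alpha, \beta \ge 0.
\]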
This paper studies the formulation, well-posedness, and numerical solution of Bayesian inverse problems on metric graphs, in which the edges represent one-dimensional wires connecting vertices. We focus on the inverse problem of recovering the diffusion coefficient of a (fractional) elliptic equation on a metric graph from noisy measurements of the solution. Well-posedness hinges on both stability of the forward model and an appropriate choice of prior. We establish the stability of elliptic and fractional elliptic forward models using recent regularity theory for differential equations on metric graphs. For the prior, we leverage modern Gaussian Whittle--Mat\'ern process models on metric graphs with sufficiently smooth sample paths. Numerical results demonstrate accurate reconstruction and effective uncertainty quantification.
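As a point of reference, Gaussian Whittle--Mat\'ern fields on a metric graph $\Gamma$ are commonly specified through a fractional elliptic SPDE of the form (our notation)
\[
(\kappa^2 - \Delta_\Gamma)^{\alpha/2}\,(\tau u) = \mathcal{W} \quad \text{on } \Gamma,
\]
where $\Delta_\Gamma$ is the graph (Kirchhoff) Laplacian and $\mathcal{W}$ is Gaussian white noise; the exponent $\alpha$ controls the sample-path smoothness that the prior choice above relies on.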
gsaot is an R package for Optimal Transport-based global sensitivity analysis. It provides a simple interface for estimating sensitivity indices using a variety of state-of-the-art Optimal Transport solvers, such as the network simplex and Sinkhorn-Knopp. The package is model-agnostic, allowing analysts to perform the sensitivity analysis as a post-processing step. Moreover, gsaot provides functions for visualizing the indices and related statistics. In this work, we give an overview of the theoretical background and the implemented algorithms, and show how to use the package through several examples.
Improving the efficiency of Markov chain Monte Carlo algorithms is essential for enhancing computational speed and inferential accuracy in Bayesian analysis. One effective means of achieving such gains is the ancillarity-sufficiency interweaving strategy (ASIS). Herein, we provide the first rigorous theoretical justification for applying ASIS in Bayesian hierarchical panel data models. Asymptotic analysis demonstrates that when the product of the prior variance of the unobserved heterogeneity and the cross-sectional sample size N is sufficiently large, the latent individual effects can be sampled almost independently of their global mean. This near-independence accounts for ASIS's rapid mixing behavior and highlights its suitability for modern "tall" panel datasets. We derive simple inequalities that predict which conventional data augmentation scheme, sufficient augmentation (SA) or ancillary augmentation (AA), yields faster convergence. By interweaving SA and AA, ASIS achieves an optimal geometric rate of convergence and renders the Markov chain for the global mean parameter asymptotically independent and identically distributed. Monte Carlo experiments confirm that this theoretical efficiency ordering holds even for small panels (e.g., N=10). These findings corroborate the empirical success of ASIS across finance, marketing, and sports applications, laying the groundwork for its extension to models with more complex covariate structures and non-Gaussian specifications.
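A minimal sketch of the two augmentation schemes for a stripped-down panel model with individual effects $\alpha_i$ and global mean $\mu$ (our simplified notation, omitting covariates): sufficient augmentation (SA) keeps the effects centered at $\mu$,
\[
y_{it} \mid \alpha_i \sim \mathcal{N}(\alpha_i, \sigma^2), \qquad \alpha_i \sim \mathcal{N}(\mu, \tau^2),
\]
while ancillary augmentation (AA) reparameterizes them as deviations $\tilde{\alpha}_i = \alpha_i - \mu$,
\[
y_{it} \mid \tilde{\alpha}_i, \mu \sim \mathcal{N}(\mu + \tilde{\alpha}_i, \sigma^2), \qquad \tilde{\alpha}_i \sim \mathcal{N}(0, \tau^2).
\]
ASIS interweaves the two: within one sweep, $\mu$ is updated given $\{\alpha_i\}$ under SA, the effects are transformed to $\{\tilde{\alpha}_i\}$, and $\mu$ is updated again under AA.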
We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely that the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
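For orientation, the Ewens--Pitman two-parameter predictive rule with discount $\sigma \in [0,1)$ and strength $\theta$: given $n$ items partitioned into $k$ clusters of sizes $n_1, \ldots, n_k$, item $n+1$ is assigned according to
\[
\Pr(\text{join cluster } j) = \frac{n_j - \sigma}{n + \theta}, \qquad \Pr(\text{start a new cluster}) = \frac{\theta + \sigma k}{n + \theta}.
\]
Scaling the strength linearly with the sample size, $\theta \propto n$, keeps the probability of opening a new cluster bounded away from zero as $n$ grows, consistent with the linear growth in the number of clusters stated above (the paper's exact parameterization may differ in details).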
We consider the problem of variable selection in Bayesian multivariate linear regression models, involving multiple response and predictor variables, under multivariate normal errors. In the absence of a known covariance structure, specifying a model with a non-diagonal covariance matrix is appealing. Modeling dependency in the random errors through a non-diagonal covariance matrix is generally expected to lead to improved estimation of the regression coefficients. In this article, we highlight an interesting exception: modeling the dependency in errors can significantly worsen both estimation and prediction. We demonstrate that Bayesian multi-outcome regression models using several popular variable selection priors can suffer from poor estimation properties in low-information settings--such as scenarios with weak signals, high correlation among predictors and responses, and small sample sizes. In such cases, the simultaneous estimation of all unknown parameters in the model becomes difficult when using a non-diagonal covariance matrix. Through simulation studies and a dataset with measurements from NIR spectroscopy, we illustrate that a two-step procedure--estimating the mean and the covariance matrix separately--can provide more accurate estimates in such cases. Thus, a potential solution to avoid the problem altogether is to routinely perform an additional analysis with a diagonal covariance matrix, even if the errors are expected to be correlated.
Canonical correlation analysis (CCA) is a technique for finding correlated sets of features between two datasets. In this paper, we propose a novel extension of CCA to the online, streaming-data setting: Sliding Window Informative Canonical Correlation Analysis (SWICCA). Our method uses a streaming principal component analysis (PCA) algorithm as a backend and combines its outputs with a small sliding window of samples to estimate the CCA components in real time. We motivate and describe our algorithm, provide numerical simulations to characterize its performance, and establish a theoretical performance guarantee. The SWICCA method is applicable and scalable to extremely high dimensions, and we provide a real-data example that demonstrates this capability.
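A minimal batch sketch of the core idea, projecting each dataset onto leading principal components and computing CCA on a sliding window of the projected samples. This is an illustrative simplification, not the SWICCA algorithm; in the streaming setting the PCA bases would be maintained incrementally, and all names below are ours:

```python
import numpy as np

def cca(X, Y):
    """Classical CCA via whitening + SVD; returns canonical correlations
    and canonical directions in the coordinates of the inputs X and Y."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx, Cyy, Cxy = Xc.T @ Xc / (n - 1), Yc.T @ Yc / (n - 1), Xc.T @ Yc / (n - 1)
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)                       # symmetric inverse square root
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return s, Wx @ U, Wy @ Vt.T

def windowed_pca_cca(X, Y, window=50, n_components=5):
    """CCA on the most recent `window` samples after PCA reduction of each block."""
    Xw, Yw = X[-window:], Y[-window:]
    def pca_project(Z, k):
        Zc = Z - Z.mean(0)
        _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
        return Zc @ Vt[:k].T
    return cca(pca_project(Xw, n_components), pca_project(Yw, n_components))

# Toy usage: two high-dimensional streams driven by a shared 3-dimensional signal
rng = np.random.default_rng(1)
shared = rng.normal(size=(200, 3))
X = shared @ rng.normal(size=(3, 40)) + 0.1 * rng.normal(size=(200, 40))
Y = shared @ rng.normal(size=(3, 60)) + 0.1 * rng.normal(size=(200, 60))
corrs, _, _ = windowed_pca_cca(X, Y, window=100, n_components=5)
print(np.round(corrs, 3))
```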
Bayesian inference for spatial point patterns is often hindered computationally by intractable likelihoods. In the frequentist literature, estimating equations based on pseudolikelihoods have long been used for simulation-free parameter estimation. One such pseudolikelihood, based on the process of differences, is known as the Palm likelihood. Utilizing notions of Bayesian composite likelihoods and generalized Bayesian inference, we develop a framework for the use of Palm likelihoods in a Bayesian context. Naive implementation of the Palm likelihood results in posterior undercoverage of model parameters. We propose two approaches to remedy this issue and calibrate the resulting posterior. Numerical simulations illustrate both the efficacy of the method in terms of statistical properties and its superior computational efficiency compared to classical Markov chain Monte Carlo. The method is then applied to the popular \textit{Beilschmiedia pendula} (Lauraceae) dataset.
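For orientation, the generalized-Bayes construction alluded to replaces the intractable likelihood with the Palm (composite) likelihood $L_P$, tempered by a calibration weight $w > 0$; this generic form is an assumption on our part, and the paper's two calibration approaches need not reduce exactly to choosing $w$:
\[
\pi_w(\theta \mid \mathbf{x}) \;\propto\; \pi(\theta)\, L_P(\theta; \mathbf{x})^{\,w},
\]
with $w = 1$ giving the naive plug-in posterior whose undercoverage motivates the calibration.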
Penalized likelihood and quasi-likelihood methods dominate inference in high-dimensional linear mixed-effects models. Sampling-based Bayesian inference is less explored due to the computational bottlenecks introduced by the random effects covariance matrix. To address this gap, we propose the compressed mixed-effects (CME) model, which defines a quasi-likelihood using low-dimensional covariance parameters obtained via random projections of the random effects covariance. This dimension reduction, combined with a global-local shrinkage prior on the fixed effects, yields an efficient collapsed Gibbs sampler for prediction and fixed effects selection. Theoretically, when the compression dimension grows slowly relative to the number of fixed effects and observations, the Bayes risk for prediction is asymptotically negligible, ensuring accurate prediction using the CME model. Empirically, the CME model outperforms existing approaches in terms of predictive accuracy, interval coverage, and fixed-effects selection across varied simulation settings and a real-world dataset.
Inference for continuous-time Markov chains (CTMCs) becomes challenging when the process is only observed at discrete time points. The exact likelihood is intractable, and existing methods often struggle even in medium-dimensional state spaces. We propose a scalable Bayesian framework for CTMC inference based on a pseudo-likelihood that bypasses the need for the full intractable likelihood. Our approach jointly estimates the probability transition matrix and a biorthogonal spectral decomposition of the generator, enabling an efficient Gibbs sampling procedure that respects embeddability. Existing methods typically integrate out the unobserved transitions, which becomes computationally burdensome as the number of observations or the dimension increases. The computational cost of our method is nearly invariant to the number of observations and scales well to medium-high dimensions. We justify our pseudo-likelihood approach by establishing theoretical guarantees, including a Bernstein-von Mises theorem for the probability transition matrix and posterior consistency for the spectral parameters of the generator. Through simulations and applications, we showcase the flexibility and robustness of our approach, offering a tractable and scalable route to Bayesian inference for CTMCs.
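As a point of reference, the relationship the spectral parameterization exploits, in standard notation: if the generator admits a complete biorthogonal decomposition $Q = \sum_k \lambda_k u_k v_k^{\top}$ with $v_j^{\top} u_k = \delta_{jk}$, then
\[
P(t) = e^{tQ} = \sum_k e^{\lambda_k t}\, u_k v_k^{\top},
\]
so the transition matrix at observation spacing $\Delta$ is embeddable precisely when it can be written as $e^{\Delta Q}$ for a valid generator $Q$.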
We propose a general procedure for estimating the variance-covariance matrix of two-step estimates of structural parameters in latent variable models. The method is partially simulation-based: it draws simulated values of the measurement parameters of the model from their sampling distribution obtained in the first step of the two-step estimation, and uses them to quantify part of the variability in the parameter estimates from the second step. The resulting estimate is asymptotically equal to the standard closed-form estimate of the variance-covariance matrix, but it avoids the need to evaluate a cross-derivative matrix, which is the most inconvenient element of the standard estimate. The method can be applied to any type of latent variable model. We present it in more detail in the context of two common models where the measurement items are categorical: latent class models with categorical latent variables and latent trait models with continuous latent variables. The good performance of the proposed procedure is demonstrated with simulation studies and illustrated with two applied examples.