Dice control involves "setting" the dice and then throwing them in a careful way, in the hope of influencing the outcomes and gaining an advantage at craps. How does one test for this ability? To specify the alternative hypothesis, we need a statistical model of dice control. Two have been suggested in the gambling literature, namely the Smith-Scott model and the Wong-Shackleford model. Both models are parameterized by $\theta\in[0,1]$, which measures the shooter's level of control. We propose and compare four test statistics: (a) the sample proportion of 7s; (b) the sample proportion of pass-line wins; (c) the sample mean of hand-length observations; and (d) the likelihood ratio statistic for a hand-length sample. We want to test $H_0:\theta = 0$ (no control) versus $H_1:\theta > 0$ (some control). We also want to test $H_0:\theta\le\theta_0$ versus $H_1:\theta>\theta_0$, where $\theta_0$ is the "break-even point." For the tests considered we estimate the power, either by normal approximation or by simulation.
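A minimal sketch of how test (a) might be run in practice, assuming a hypothetical alternative in which the controlled shooter shifts the probability of a 7 from the fair value 1/6 to some p1 (here set to 0.160 purely for illustration; the actual mapping from $\theta$ to this probability depends on the Smith-Scott or Wong-Shackleford model and is not reproduced here). Power is estimated by simulation, as in the abstract.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 10_000          # number of recorded throws
p0 = 1 / 6          # probability of a 7 with fair, uncontrolled dice
p1 = 0.160          # hypothetical alternative (slightly fewer 7s); NOT from the paper
alpha = 0.05

# One-sided test of H0: p = 1/6 against the alternative of fewer 7s,
# using the normal approximation to the sample proportion of 7s.
z_crit = stats.norm.ppf(alpha)              # reject when z <= z_crit

def z_stat(num_sevens, n, p0):
    phat = num_sevens / n
    return (phat - p0) / np.sqrt(p0 * (1 - p0) / n)

# Estimate power by simulating throw records under the alternative.
reps = 5_000
rejections = 0
for _ in range(reps):
    sevens = rng.binomial(n, p1)
    if z_stat(sevens, n, p0) <= z_crit:
        rejections += 1
print(f"estimated power at p1={p1}: {rejections / reps:.3f}")
```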
Identifying areas where the signal is prominent is an important task in image analysis, with particular applications in brain mapping. In this work, we develop confidence regions for spatial excursion sets above and below a given level. We achieve this by treating the confidence procedure as a testing problem at the given level, allowing control of the False Discovery Rate (FDR). Methods are developed to control the FDR, separately for positive and negative excursions, as well as jointly over both. Furthermore, power is increased by incorporating a two-stage adaptive procedure. Simulation results with various signals show that our confidence regions successfully control the FDR at or below the nominal level. We showcase our methods with an application to functional magnetic resonance imaging (fMRI) data from the Human Connectome Project, illustrating the improvement in statistical power over existing approaches.
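The confidence-region construction itself cannot be reconstructed from the abstract, but the underlying idea of treating excursions above a level $c$ as a multiple-testing problem with FDR control can be illustrated with a plain Benjamini-Hochberg step applied location-wise; the 1D signal, noise level, and excursion level below are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 1D "image": a smooth bump plus noise, observed at 500 locations.
x = np.linspace(0, 1, 500)
mu = 2.0 * np.exp(-((x - 0.5) / 0.08) ** 2)   # true signal
y = mu + rng.normal(scale=1.0, size=x.size)

c = 1.0   # excursion level of interest
# Test H0: mu(s) <= c at every location (one-sided z-test, known unit variance).
pvals = stats.norm.sf(y - c)

# Benjamini-Hochberg at level q to control the FDR of the declared excursion set.
def benjamini_hochberg(p, q=0.05):
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

excursion_estimate = benjamini_hochberg(pvals, q=0.05)
print("locations declared above level c:", excursion_estimate.sum())
```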
Bovine Viral Diarrhoea (BVD) has been the focus of a successful eradication programme in Ireland, with the herd-level prevalence declining from 11.3% in 2013 to just 0.2% in 2023. As the country moves toward BVD freedom, the development of predictive models for targeted surveillance becomes increasingly important to mitigate the risk of disease re-emergence. In this study, we evaluate the performance of a range of machine learning algorithms, including binary classification and anomaly detection techniques, for predicting BVD-positive herds using highly imbalanced herd-level data. We conduct an extensive simulation study to assess model performance across varying sample sizes and class imbalance ratios, incorporating resampling, class weighting, and appropriate evaluation metrics (sensitivity, positive predictive value, F1-score and AUC values). Random forests and XGBoost models consistently outperformed other methods, with the random forest model achieving the highest sensitivity and AUC across scenarios, including real-world prediction of 2023 herd status, correctly identifying 219 of 250 positive herds while halving the number of herds that require testing compared to a blanket-testing strategy.
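The herd-level data are not public here, but the modeling recipe described above (class weighting on highly imbalanced data, evaluated with sensitivity, PPV, F1 and AUC) can be sketched on synthetic data roughly as follows; the imbalance ratio and all tuning choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score, f1_score, roc_auc_score

# Synthetic stand-in for highly imbalanced herd-level data (~1% positives).
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

# Class weighting is one of the imbalance remedies mentioned in the abstract;
# resampling the minority class would be an alternative.
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            random_state=0)
rf.fit(X_tr, y_tr)

prob = rf.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)
print("sensitivity:", recall_score(y_te, pred))
print("PPV:        ", precision_score(y_te, pred, zero_division=0))
print("F1:         ", f1_score(y_te, pred))
print("AUC:        ", roc_auc_score(y_te, prob))
```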
In causal inference, remarkable progress has been made in difference-in-differences (DID) approaches to estimating the average effect of treatment on the treated (ATT). Among these, the semiparametric DID (SDID) approach incorporates a propensity score analysis into the DID setup. Supposing that the ATT is a function of covariates, we estimate it by weighting with the inverse of the propensity score. To make the estimation robust to misspecification of the propensity score model, we incorporate covariate balancing. By carefully constructing the moment conditions used in the covariate balancing, we show that the proposed estimator is doubly robust. In addition to estimation, model selection is also addressed. In practice, covariate selection is an essential task in statistical analysis, but even in the basic setting of the SDID approach there are no reasonable information criteria. We therefore derive a model selection criterion as an asymptotically bias-corrected estimator of the risk based on the loss function used in the SDID estimation. The resulting penalty term differs considerably from the roughly twice-the-number-of-parameters penalty that appears in AIC-type information criteria. Numerical experiments show that the proposed method estimates the ATT more robustly than the method using propensity scores obtained by maximum likelihood estimation (MLE), and that the proposed criterion clearly reduces the risk targeted in the SDID approach compared with an intuitive generalization of the existing information criterion. In addition, real data analysis confirms that there is a large difference between the results of the proposed method and the existing method.
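The exact SDID moment conditions are not given in the abstract, but covariate balancing itself can be illustrated with a simple entropy-balancing step that reweights control units so their covariate means match those of the treated group; the data-generating process and optimizer choice below are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Hypothetical data: treated (D=1) and control (D=0) units with covariates X.
n, p = 2_000, 3
X = rng.normal(size=(n, p))
D = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.3 * X[:, 1]))))

X1, X0 = X[D == 1], X[D == 0]
target = X1.mean(axis=0)            # treated covariate means to be matched

# Entropy balancing: control weights w_i proportional to exp(X0_i @ lam), with
# lam chosen so the weighted control means reproduce the treated means (a simple
# instance of covariate balancing; the SDID moment conditions are more elaborate).
def dual(lam):
    return np.log(np.exp(X0 @ lam).sum()) - target @ lam

res = minimize(dual, np.zeros(p), method="BFGS")
w = np.exp(X0 @ res.x)
w /= w.sum()

print("balanced control means:", X0.T @ w)
print("treated means:         ", target)
```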
This paper proposes a robust high-dimensional sparse canonical correlation analysis (CCA) method for investigating linear relationships between two high-dimensional random vectors, focusing on elliptically symmetric distributions. Traditional CCA methods, based on sample covariance matrices, struggle in high-dimensional settings, particularly when data exhibit heavy-tailed distributions. To address this, we introduce the spatial-sign covariance matrix as a robust estimator, combined with a sparsity-inducing penalty to efficiently estimate canonical correlations. Theoretical analysis shows that our method is consistent and robust under mild conditions, converging at an optimal rate even in the presence of heavy tails. Simulation studies demonstrate that our approach outperforms existing sparse CCA methods, particularly under heavy-tailed distributions. A real-world application further confirms the method's robustness and efficiency in practice. Our work provides a novel solution for high-dimensional canonical correlation analysis, offering significant advantages over traditional methods in terms of both stability and performance.
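A small sketch of the spatial-sign covariance matrix, the robust estimator named in the abstract: each centered observation is projected onto the unit sphere before averaging outer products. The sparsity-inducing penalty is not included, and centering at the coordinate-wise median is a simplification (the spatial median is the more standard choice).

```python
import numpy as np

def spatial_sign_covariance(X, center=None):
    """Spatial-sign covariance: average outer product of the unit-norm
    spatial signs (x_i - center) / ||x_i - center||.  For simplicity the
    center defaults to the coordinate-wise median; the spatial median is
    the more standard choice."""
    if center is None:
        center = np.median(X, axis=0)
    U = X - center
    norms = np.clip(np.linalg.norm(U, axis=1, keepdims=True), 1e-12, None)
    U = U / norms
    return U.T @ U / X.shape[0]

rng = np.random.default_rng(3)
X = rng.standard_t(df=3, size=(500, 10))   # heavy-tailed sample
S = spatial_sign_covariance(X)
print(np.round(S[:3, :3], 3))
```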
ROCFTP is a perfect sampling algorithm that employs various random operations and requires a specific Markov chain construction for each target. To overcome this requirement, the Metropolis algorithm is incorporated as a random operation within ROCFTP. While the Metropolis sampler functions as a random operation, it is not a coupler. However, by employing the normal multishift coupler as a symmetric proposal for Metropolis, we obtain ROCFTP with Metropolis-multishift. Although ROCFTP was initially designed for bounded state spaces, its applicability to targets with unbounded state spaces is extended through the introduction of the Most Interest Range (MIR) for practical use. We demonstrate that selecting the MIR decreases the likelihood of ROCFTP hitting $MIR^C$ by a factor of $(1 - \epsilon)$, which is beneficial for practical implementation. The algorithm exhibits a convergence rate characterized by exponential decay. Its performance is rigorously evaluated across various targets, and goodness-of-fit tests confirm the quality of the generated samples. Lastly, an R package is provided for generating exact samples using ROCFTP Metropolis-multishift.
Augmented inverse probability weighting (AIPW) and G-computation with canonical generalized linear models have become increasingly popular for estimating the average treatment effect in randomized experiments. These estimators leverage outcome prediction models to adjust for imbalances in baseline covariates across treatment arms, improving statistical power compared to unadjusted analyses, while maintaining control over Type I error rates, even when the models are misspecified. Practical application of such estimators often overlooks the clustering present in multi-center clinical trials. Even when prediction models account for center effects, this neglect can degrade the coverage of confidence intervals, reduce the efficiency of the estimators, and complicate the interpretation of the corresponding estimands. These issues are particularly pronounced for estimators of counterfactual means, though less severe for those of the average treatment effect, as demonstrated through Monte Carlo simulations and supported by theoretical insights. To address these challenges, we develop efficient estimators of counterfactual means and the average treatment effect in a random center. These extract information from baseline covariates by relying on outcome prediction models, but remain unbiased in large samples when these models are misspecified. We also introduce an accompanying inference framework inspired by random-effects meta-analysis and relevant for settings where data from many small centers are being analyzed. Adjusting for center effects yields substantial gains in efficiency, especially when treatment effect heterogeneity across centers is large. Monte Carlo simulations and application to the WASH Benefits Bangladesh study demonstrate adequate performance of the proposed methods.
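A minimal sketch of AIPW estimators of the two counterfactual means and the average treatment effect in a single randomized trial with known randomization probability; the center effects and random-center estimand developed in the paper are not modeled here, and the data-generating process is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Hypothetical randomized trial: covariates X, treatment A ~ Bernoulli(pi), outcome Y.
n, pi = 5_000, 0.5
X = rng.normal(size=(n, 2))
A = rng.binomial(1, pi, size=n)
Y = 1.0 + 0.8 * A + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Outcome prediction models fit separately per arm (working models; AIPW remains
# consistent for the counterfactual means even if they are misspecified).
m1 = LinearRegression().fit(X[A == 1], Y[A == 1])
m0 = LinearRegression().fit(X[A == 0], Y[A == 0])
mu1_hat = m1.predict(X)
mu0_hat = m0.predict(X)

# AIPW estimators of the counterfactual means and the average treatment effect.
psi1 = mu1_hat + A * (Y - mu1_hat) / pi
psi0 = mu0_hat + (1 - A) * (Y - mu0_hat) / (1 - pi)
print("E[Y(1)] estimate:", psi1.mean())
print("E[Y(0)] estimate:", psi0.mean())
print("ATE estimate:    ", (psi1 - psi0).mean())
```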
Accurate prediction of spatially dependent functional data is critical for various engineering and scientific applications. In this study, a spatial functional deep neural network model is developed within a novel non-linear modeling framework that seamlessly integrates spatial dependencies and functional predictors using deep learning techniques. The proposed model extends classical scalar-on-function regression by incorporating a spatial autoregressive component while leveraging functional deep neural networks to capture complex non-linear relationships. To ensure robust estimation, the methodology employs an adaptive estimation approach, where the spatial dependence parameter is first inferred via maximum likelihood estimation, followed by non-linear functional regression using deep learning. The effectiveness of the proposed model is evaluated through extensive Monte Carlo simulations and an application to Brazilian COVID-19 data, where the goal was to predict the average daily number of deaths. Comparative analysis with maximum likelihood-based spatial functional linear regression and functional deep neural network models demonstrates that the proposed algorithm significantly improves predictive performance. The results for the Brazilian COVID-19 data showed that while all models achieved similar mean squared error values during the training phase, the proposed model achieved the lowest mean squared prediction error in the testing phase, indicating superior generalization ability.
We propose a clustering method, funWeightClustSkew, based on mixtures of functional linear regression models and three skewed multivariate distributions: the variance-gamma distribution, the skew-t distribution, and the normal-inverse Gaussian distribution. Our approach follows the framework of the functional high dimensional data clustering (funHDDC) method, and we extend to functional data the cluster weighted models based on skewed distributions used for finite dimensional multivariate data. We consider several parsimonious models, and to estimate the parameters we construct an expectation maximization (EM) algorithm. We illustrate the performance of funWeightClustSkew for simulated data and for the Air Quality dataset.
We introduce a scalable framework for regressing multivariate distributions onto multivariate distributions, motivated by the application of inferring cell-cell communication from population-scale single-cell data. The observed data consist of pairs of multivariate distributions for ligands from one cell type and corresponding receptors from another. For each ordered pair $e=(l,r)$ of cell types $(l \neq r)$ and each sample $i = 1, \ldots, n$, we observe a pair of distributions $(F_{ei}, G_{ei})$ of gene expressions for ligands and receptors of cell types $l$ and $r$, respectively. The aim is to set up a regression of receptor distributions $G_{ei}$ given ligand distributions $F_{ei}$. A key challenge is that these distributions reside in distinct spaces of differing dimensions. We formulate the regression of multivariate densities on multivariate densities using a generalized Bayes framework with the sliced Wasserstein distance between fitted and observed distributions. Finally, we use inference under such regressions to define a directed graph for cell-cell communications.
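The regression framework itself is not reproducible from the abstract, but its central ingredient, the sliced Wasserstein distance between two multivariate empirical distributions, can be sketched as follows: project both samples onto random directions, compute the one-dimensional Wasserstein distance via matched quantiles, and average. The samples and the number of projections below are hypothetical.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, rng=None):
    """Monte Carlo sliced 2-Wasserstein distance between two empirical samples
    X (n x d) and Y (m x d): average 1D Wasserstein distance over random
    projection directions drawn uniformly on the unit sphere."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        # 1D Wasserstein-2 between projected samples, approximated on a quantile grid.
        qs = np.linspace(0, 1, 100)
        xq = np.quantile(X @ theta, qs)
        yq = np.quantile(Y @ theta, qs)
        total += np.mean((xq - yq) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(5)
F = rng.normal(size=(300, 4))           # e.g. a ligand expression sample
G = rng.normal(loc=0.5, size=(250, 4))  # e.g. a receptor expression sample
print(sliced_wasserstein(F, G, rng=6))
```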
Conformal prediction provides a distribution-free framework for uncertainty quantification. This study explores the application of conformal prediction in scenarios where covariates are missing, which introduces significant challenges for uncertainty quantification. We establish that marginal validity holds for imputed datasets across various mechanisms of missing data and most imputation methods. Building on the framework of nonexchangeable conformal prediction, we demonstrate that coverage guarantees depend on the mask. To address this, we propose a nonexchangeable conformal prediction method for missing covariates that satisfies both marginal and mask-conditional validity. However, as this method does not ensure asymptotic conditional validity, we further introduce a localized conformal prediction approach that employs a novel score function based on kernel smoothing. This method achieves marginal, mask-conditional, and asymptotic conditional validity under certain assumptions. Extensive simulation studies and real-data analysis demonstrate the advantages of these proposed methods.
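A minimal sketch of marginal split conformal prediction on a singly imputed dataset, the baseline setting the abstract starts from; the mask-conditional and localized kernel-smoothing methods proposed in the paper are not shown, and the missingness rate, imputation strategy, and working model are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Hypothetical data with covariates missing completely at random.
n, p, alpha = 3_000, 5, 0.1
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.8]) + rng.normal(size=n)
X_obs = np.where(rng.random((n, p)) < 0.2, np.nan, X)

# Split into proper training, calibration, and test sets, then impute.
idx = rng.permutation(n)
train, calib, test = idx[:1500], idx[1500:2500], idx[2500:]
imputer = SimpleImputer(strategy="mean").fit(X_obs[train])
model = LinearRegression().fit(imputer.transform(X_obs[train]), y[train])

# Calibration: absolute residual scores and the conformal quantile.
scores = np.abs(y[calib] - model.predict(imputer.transform(X_obs[calib])))
k = int(np.ceil((len(calib) + 1) * (1 - alpha)))
qhat = np.sort(scores)[k - 1]

# Marginal (1 - alpha) prediction intervals on the test set.
pred = model.predict(imputer.transform(X_obs[test]))
print("empirical coverage:", np.mean(np.abs(y[test] - pred) <= qhat))
```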
We propose a robust and scalable framework for variational Bayes (VB) that effectively handles outliers and contamination of arbitrary nature in large datasets. Our approach divides the dataset into disjoint subsets, computes the posterior for each subset, and applies VB approximation independently to these posteriors. The resulting variational posteriors with respect to the subsets are then aggregated using the geometric median of probability measures, computed with respect to the Wasserstein distance. This novel aggregation method yields the Variational Median Posterior (VM-Posterior) distribution. We rigorously demonstrate that the VM-Posterior preserves contraction properties akin to those of the true posterior, while accounting for approximation errors or the variational gap inherent in VB methods. We also provide a provable robustness guarantee for the VM-Posterior. Furthermore, we establish a variational Bernstein-von Mises theorem for both multivariate Gaussian distributions with general covariance structures and the mean-field variational family. To facilitate practical implementation, we adapt existing algorithms for computing the VM-Posterior and evaluate its performance through extensive numerical experiments. The results highlight its robustness and scalability, making it a reliable tool for Bayesian inference in the presence of complex, contaminated datasets.
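A toy illustration of the aggregation step, assuming each subset posterior is summarized by its mean vector and the geometric median is taken in Euclidean space via the Weiszfeld algorithm; the VM-Posterior itself takes the geometric median of full variational posteriors under the Wasserstein distance, which this sketch does not attempt.

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-10):
    """Weiszfeld algorithm for the geometric median of a set of vectors."""
    m = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.clip(np.linalg.norm(points - m, axis=1), 1e-12, None)
        m_new = (points / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m

rng = np.random.default_rng(8)

# Hypothetical subset posterior means: most subsets agree, one is contaminated.
subset_means = rng.normal(loc=1.0, scale=0.05, size=(10, 3))
subset_means[0] += 10.0                 # outlying subset driven by contamination
print("plain average:    ", subset_means.mean(axis=0))
print("geometric median: ", geometric_median(subset_means))
```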
In this article, we introduce mean independent component analysis for multivariate time series to reduce the parameter space. In particular, we seek a contemporaneous linear transformation that detects univariate mean independent components so that each component can be modeled separately. Mean independent component analysis is flexible in the sense that no parametric model or distributional assumptions are made. We propose a unified framework to estimate the mean independent components from data with a fixed dimension or a diverging dimension. We estimate the mean independent components using the martingale difference divergence so that the mean dependence across components and across time is minimized. The approach is extended to group mean independent component analysis by imposing a group structure on the mean independent components. We further introduce a method to identify the group structure when it is unknown. The consistency of both proposed methods is established. Extensive simulations and a real data illustration for community mobility are provided to demonstrate the efficacy of our method.
The population size ("abundance") of wildlife species is of central interest in ecological research and management. Distance sampling is a dominant approach to the estimation of wildlife abundance for many vertebrate animal species. One perceived advantage of distance sampling over the well-known alternative approach of capture-recapture is that distance sampling is thought to be robust to unmodelled heterogeneity in animal detection probability, via a conjecture known as "pooling robustness". Although distance sampling has been successfully applied and developed for decades, its statistical foundation is not complete: there are published proofs and arguments highlighting deficiencies of the methodology. This work provides a design-based statistical foundation for distance sampling with attainable assumptions. In addition, because the identification and consistency of the developed distance sampling abundance estimator are unaffected by detection heterogeneity, the pooling robustness conjecture is resolved.
Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure to extract distinct aspects, called archetypes, from observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data, with wide applications throughout the sciences. However, AA also faces challenges, particularly because the associated optimization problem is non-convex. This survey provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA has to offer, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data using AA and its limitations. The survey concludes by outlining important future research directions concerning AA.
The first step in evaluating a potential diagnostic biomarker is to examine the variation in its values across different disease groups. In a three-class disease setting, the volume under the receiver operating characteristic surface and the three-class Youden index are commonly used summary measures of a biomarker's discriminatory ability. However, these measures rely on a stochastic ordering assumption for the distributions of biomarker outcomes across the three groups. This assumption can be restrictive, particularly when covariates are involved, and its violation may lead to incorrect conclusions about a biomarker's ability to distinguish between the three disease classes. Even when a stochastic ordering exists, the order may vary across different biomarkers in discovery studies involving dozens or even thousands of candidate biomarkers, complicating automated ranking. To address these challenges and complement existing measures, we propose the underlap coefficient, a novel summary index of a biomarker's ability to distinguish between three (or more) disease groups, and study its properties. Additionally, we introduce Bayesian nonparametric estimators for both the unconditional underlap coefficient and its covariate-specific counterpart. These estimators are broadly applicable to a wide range of biomarkers and populations. A simulation study reveals good performance of the proposed estimators across a range of conceivable scenarios. We illustrate the proposed approach through an application to an Alzheimer's disease (AD) dataset aimed at assessing how four potential AD biomarkers distinguish between individuals with normal cognition, mild impairment, and dementia, and whether and how age and gender impact this discriminatory ability.
Ocean microbes are critical to both ocean ecosystems and the global climate. Flow cytometry, which measures cell optical properties in fluid samples, is routinely used in oceanographic research. Despite decades of accumulated data, identifying key microbial populations (a process known as ``gating'') remains a significant analytical challenge. To address this, we focus on gating multidimensional, high-frequency flow cytometry data collected {\it continuously} on board oceanographic research vessels, capturing time- and space-wise variations in the dynamic ocean. Our paper proposes a novel mixture-of-experts model in which both the gating function and the experts are given by trend filtering. The model leverages two key assumptions: (1) Each snapshot of flow cytometry data is a mixture of multivariate Gaussians and (2) the parameters of these Gaussians vary smoothly over time. Our method uses regularization and a constraint to ensure smoothness and that cluster means match biologically distinct microbe types. We demonstrate, using flow cytometry data from the North Pacific Ocean, that our proposed model accurately matches human-annotated gating and corrects significant errors.
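The mixture-of-experts model with trend filtering is beyond a short sketch, but assumption (1) from the abstract, that each snapshot is a mixture of multivariate Gaussians, suggests a simple baseline: fit an independent Gaussian mixture per snapshot, which ignores the smoothness over time that the paper's model enforces. All data below are synthetic stand-ins for flow cytometry snapshots.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)

# Hypothetical flow cytometry snapshots: 3 optical channels, 2 cell populations
# whose means drift slowly over time (the smoothness the paper exploits).
T, n_per, k = 20, 400, 2
snapshots = []
for t in range(T):
    drift = 0.05 * t
    pop1 = rng.normal(loc=[0 + drift, 0, 0], scale=0.3, size=(n_per, 3))
    pop2 = rng.normal(loc=[2, 2 - drift, 1], scale=0.3, size=(n_per, 3))
    snapshots.append(np.vstack([pop1, pop2]))

# Baseline: fit an independent Gaussian mixture per snapshot.  Unlike the
# mixture-of-experts model in the paper, nothing ties the fitted parameters
# together over time, so cluster labels and means can jump between snapshots.
means_over_time = []
for Y in snapshots:
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         random_state=0).fit(Y)
    means_over_time.append(gm.means_)
print(np.array(means_over_time).shape)   # (T, k, 3)
```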
To evaluate intervention effects on rare events, meta-analysis techniques are commonly applied in order to assess the accumulated evidence. When it comes to adverse effects in clinical trials, these are often most adequately handled using survival methods. A common-effect model that is able to process data in commonly quoted formats in terms of hazard ratios has been proposed for this purpose by Holzhauer (Stat. Med. 2017; 36(5):723-737). In order to accommodate potential heterogeneity between studies, we have extended the model by Holzhauer to a random-effects approach. The Bayesian model is described in detail, and applications to realistic data sets are discussed along with sensitivity analyses and Monte Carlo simulations to support the conclusions.
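The Bayesian random-effects model is described in the paper rather than the abstract, but a minimal frequentist analogue, DerSimonian-Laird random-effects pooling of study-level log hazard ratios, conveys the basic structure; the hazard ratios and standard errors below are hypothetical.

```python
import numpy as np

def dersimonian_laird(log_hr, se):
    """Random-effects pooling of study-level log hazard ratios (DerSimonian-Laird),
    a frequentist stand-in for the Bayesian random-effects model in the paper."""
    w = 1.0 / se**2                                   # fixed-effect weights
    theta_fe = np.sum(w * log_hr) / np.sum(w)
    Q = np.sum(w * (log_hr - theta_fe) ** 2)          # heterogeneity statistic
    df = len(log_hr) - 1
    tau2 = max(0.0, (Q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (se**2 + tau2)                       # random-effects weights
    theta_re = np.sum(w_re * log_hr) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return theta_re, se_re, tau2

# Hypothetical study-level hazard ratios and standard errors of log(HR).
hr = np.array([1.8, 1.2, 2.5, 0.9, 1.5])
se = np.array([0.40, 0.35, 0.60, 0.50, 0.30])
est, se_pooled, tau2 = dersimonian_laird(np.log(hr), se)
print("pooled HR:", np.exp(est),
      "95% CI:", np.exp(est - 1.96 * se_pooled), np.exp(est + 1.96 * se_pooled),
      "tau^2:", tau2)
```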