Controlling the False Discovery Rate (FDR) in variable selection is crucial for reproducibility and preventing over-selection, particularly with the increasing prevalence of predictive modeling. The Split Knockoff method, a recent extension of the canonical Knockoffs framework, offers finite-sample FDR control for selecting sparse transformations, finding applications across signal processing, economics, information technology, and the life sciences. However, its current formulation is limited to fixed design settings, restricting its use to linear models. The question of whether it can be generalized to random designs, thereby accommodating a broader range of models beyond the linear case -- similar to the Model-X Knockoff framework -- remains unanswered. A major challenge in addressing transformational sparsity within random design settings lies in reconciling the combination of a random design with a deterministic transformation. To overcome this limitation, we propose the Model-X Split Knockoff method. Our method achieves FDR control for transformation selection in random designs, bridging the gap between existing approaches. This is accomplished by introducing an auxiliary randomized design that interacts with both the existing random design and the deterministic transformation, enabling the construction of Model-X Split Knockoffs. Like the classical Model-X framework, our method provides provable finite-sample FDR control under known or accurately estimated covariate distributions, regardless of the conditional distribution of the response. Importantly, it guarantees at least the same selection power as Model-X Knockoffs when both are applicable. Empirical studies, including simulations and real-world applications to Alzheimer's disease imaging and university ranking analysis, demonstrate robust FDR control and suggest improved selection power over the original Model-X approach.
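For readers less familiar with the knockoff machinery, the selection step shared by the classical and split variants can be summarised in a few lines. The Python sketch below applies the standard knockoff+ thresholding rule to a vector of feature statistics W at target FDR level q; how W is computed (and the Split Knockoff construction itself) is assumed to happen elsewhere, so the inputs here are placeholders rather than the paper's method.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Knockoff+ data-dependent threshold for feature statistics W at FDR level q."""
    ts = np.sort(np.unique(np.abs(W[W != 0])))  # candidate thresholds
    for t in ts:
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target level: select nothing

def knockoff_select(W, q=0.1):
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.where(W >= t)[0]

# toy usage with synthetic feature statistics
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(3, 1, 10), rng.normal(0, 1, 90) * rng.choice([-1, 1], 90)])
print(knockoff_select(W, q=0.2))
```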
We introduce a novel Bayesian method that can detect multiple structural breaks in the mean and variance of a length-$T$ time series. Our method quantifies the uncertainty by returning $\alpha$-level credible sets around the estimated locations of the breaks. In the case of a single change in the mean and/or the variance of an independent sub-Gaussian sequence, we prove that our method attains a localization rate that is minimax optimal up to a $\log T$ factor. For an $\alpha$-mixing sequence with dependence, we prove this optimality holds up to a $\log^2 T$ factor. For $d$-dimensional mean changes, we show that if $d \gtrsim \log T$ and the mean signal is dense, then our method exactly recovers the location of the change at the optimal rate. We show that we can modularly combine single change-point models to detect multiple change-points. This approach enables efficient inference using a variational approximation of the posterior distribution for the change-points. The proposal is applicable to both continuous and count data. Extensive simulation studies demonstrate that our method is competitive with the state of the art and returns credible sets that are an order of magnitude smaller than those returned by competitors, without sacrificing nominal coverage guarantees. We test our method on real data by detecting i) gating of the ion channels in the outer membrane of a bacterial cell, and ii) changes in the lithological structure of an oil well.
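As a minimal illustration of credible sets for a change-point location (not the authors' variational procedure), the Python toy below computes a discrete posterior over a single mean-change location for a Gaussian sequence with known variance, profiling out the segment means, and then reads off the smallest set of locations carrying at least 1 - alpha posterior mass.

```python
import numpy as np

def changepoint_posterior(y, sigma=1.0):
    """Discrete posterior over a single mean-change location for a Gaussian
    sequence with known variance and a flat prior on the change location.
    Segment means are profiled out (plug-in MLE), a crude stand-in for
    marginalisation that keeps the toy self-contained."""
    T = len(y)
    loglik = np.full(T, -np.inf)
    for tau in range(1, T):                 # change between index tau-1 and tau
        left, right = y[:tau], y[tau:]
        rss = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        loglik[tau] = -rss / (2 * sigma**2)
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    return post

def credible_set(post, alpha=0.05):
    """Smallest set of locations whose posterior mass is at least 1 - alpha."""
    order = np.argsort(post)[::-1]
    cum = np.cumsum(post[order])
    k = np.searchsorted(cum, 1 - alpha) + 1
    return np.sort(order[:k])

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 60), rng.normal(1.5, 1, 40)])
post = changepoint_posterior(y)
print(credible_set(post, alpha=0.05))
```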
Markov-switching models are a powerful tool for modelling time series data that are driven by underlying latent states. As such, they are widely used in behavioural ecology, where discrete states can serve as proxies for behavioural modes and enable inference on the latent behaviour driving, e.g., observed movement. To understand the drivers of behavioural changes, it is common to link model parameters to covariates. Over the last decade, nonparametric approaches have gained traction in this context to avoid unrealistic parametric assumptions. Nonetheless, existing methods are largely limited to univariate smooth functions of covariates, based on penalised splines, while real processes are typically complex, requiring consideration of interaction effects. We address this gap by incorporating tensor-product interactions into Markov-switching models, enabling flexible modelling of multidimensional effects in a computationally efficient manner. Based on the extended Fellner-Schall method, we develop an efficient automatic smoothness selection procedure that is robust and scales well with the number of smooth functions in the model. The method builds on a random effects view of the spline coefficients and yields a recursive penalised likelihood procedure. As special cases, this general framework accommodates bivariate smoothing, function-valued random effects, and space-time interactions. We demonstrate its practical utility through three ecological case studies of an African elephant, common fruitflies, and Arctic muskoxen. The methodology is implemented in the LaMa R package, providing applied ecologists with an accessible and flexible tool for semiparametric inference in hidden-state models. The approach has the potential to drastically improve the level of detail in inference, allowing HMMs with hundreds of parameters and 10-20 (potentially bivariate) smooths to be fitted to thousands of observations.
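The core construction behind tensor-product interactions is the row-wise Kronecker product of two marginal bases. The hedged Python sketch below uses simple Gaussian radial bases as stand-ins for penalised spline bases; it is not the LaMa implementation, and in practice the associated penalty would be built from Kronecker sums of the marginal penalty matrices.

```python
import numpy as np

def radial_basis(x, n_basis=8, scale=None):
    """Simple Gaussian radial basis as a stand-in for a marginal spline basis."""
    knots = np.linspace(x.min(), x.max(), n_basis)
    scale = scale or (knots[1] - knots[0])
    return np.exp(-0.5 * ((x[:, None] - knots[None, :]) / scale) ** 2)

def tensor_product_basis(B1, B2):
    """Row-wise Kronecker product: each observation's interaction basis is the
    outer product of its two marginal basis vectors."""
    n = B1.shape[0]
    return (B1[:, :, None] * B2[:, None, :]).reshape(n, -1)

rng = np.random.default_rng(2)
x1, x2 = rng.uniform(size=500), rng.uniform(size=500)
B1, B2 = radial_basis(x1), radial_basis(x2)
B = tensor_product_basis(B1, B2)        # 500 x 64 interaction design matrix
print(B.shape)
```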
Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the overparameterization of their covariance matrices in high-dimensional spaces, spherical GMMs (with isotropic covariance matrices) certainly lack flexibility to fit certain anisotropic distributions. Connecting these two extremes, we introduce a new family of parsimonious GMMs with piecewise-constant covariance eigenvalue profiles. These extend several low-rank models like the celebrated mixtures of probabilistic principal component analyzers (MPPCA), by enabling any possible sequence of eigenvalue multiplicities. If the latter are prespecified, then we can naturally derive an expectation-maximization (EM) algorithm to learn the mixture parameters. Otherwise, to address the notoriously-challenging issue of jointly learning the mixture parameters and hyperparameters, we propose a componentwise penalized EM algorithm, whose monotonicity is proven. We show the superior likelihood-parsimony tradeoffs achieved by our models on a variety of unsupervised experiments: density fitting, clustering and single-image denoising.
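To make the covariance structure concrete, the following Python sketch projects a covariance estimate onto the family with a prespecified eigenvalue-multiplicity profile by averaging eigenvalues within each block; a spherical component corresponds to a single block, and an MPPCA-like component to leading blocks plus a tied noise block. This is only the structural ingredient, shown under assumed names, not the paper's (penalized) EM algorithm.

```python
import numpy as np

def blockwise_eigenvalue_covariance(S, multiplicities):
    """Project a covariance estimate S onto the set of matrices whose eigenvalue
    profile is piecewise constant with the given block multiplicities
    (eigenvalues sorted in decreasing order, then averaged within each block)."""
    assert sum(multiplicities) == S.shape[0]
    eigval, eigvec = np.linalg.eigh(S)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]      # decreasing order
    averaged = np.empty_like(eigval)
    start = 0
    for m in multiplicities:
        averaged[start:start + m] = eigval[start:start + m].mean()
        start += m
    return eigvec @ np.diag(averaged) @ eigvec.T

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
S = A @ A.T / 5
# e.g. one leading eigenvalue kept, the remaining four tied (an MPPCA-like profile)
print(np.linalg.eigvalsh(blockwise_eigenvalue_covariance(S, [1, 4])))
```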
The use of metrics underpins the quantification, communication and, ultimately, the functioning of a wide range of disciplines as diverse as labour recruitment, institutional management, economics and science. In applying metrics, customised scores are widely employed to optimise the monitoring of progress towards a goal, to contribute to decision-making, and to quantify situations under evaluation. However, the development of such metrics in complex and rigorous settings intrinsically relies on mathematical processes which are not always readily accessible. Here, we propose a framework for the construction of metrics suitable for a wide range of disciplines, following a specified workflow that combines existing decision analysis and utility theory concepts to create a customisable performance metric (with corresponding uncertainty) that can be used to quantitatively evaluate goal achievement. It involves dividing criteria into two groups (root and additional) and utilising a newly proposed alternative form of utility function designed to build such customised metrics. Once metrics are produced by this approach, they can be used in a varied set of contexts, including subsequent statistical analysis with the metric values as a response variable, or informing a decision-making process. The flexibility of the metric construction makes it suitable for a wide range of fields and applications, and could provide a valuable first step for monitoring and comparison in many different settings.
Graphical models are widely used for capturing conditional independence structure in multivariate data, but they are typically built upon independent and identically distributed observations, which limits their applicability to complex datasets such as network-linked data. This paper proposes a nonparametric graphical model that addresses these limitations by accommodating heterogeneous graph structures without imposing any specific distributional assumptions. The proposed estimation method effectively integrates network embedding with nonparametric graphical model estimation. It further transforms the graph learning task into solving a finite-dimensional linear equation system by leveraging the properties of a vector-valued reproducing kernel Hilbert space. Moreover, theoretical guarantees are established for the proposed method in terms of estimation consistency and exact recovery of the heterogeneous graph structures. Its effectiveness is also demonstrated through a variety of simulated examples and a real application to the statistician coauthorship dataset.
We present a novel tuning procedure for random forests (RFs) that improves the accuracy of estimated quantiles and produces valid, relatively narrow prediction intervals. While RFs are typically used to estimate mean responses (conditional on covariates), they can also be used to estimate quantiles by estimating the full distribution of the response. However, standard approaches for building RFs often result in excessively biased quantile estimates. To reduce this bias, our proposed tuning procedure minimizes "quantile coverage loss" (QCL), which we define as the estimated bias of the marginal quantile coverage probability estimate based on the out-of-bag sample. We adapt QCL tuning to handle censored data and demonstrate its use with random survival forests. We show that QCL tuning results in quantile estimates with more accurate coverage probabilities than those achieved using default parameter values or traditional tuning (using MSPE for uncensored data and C-index for censored data), while also reducing the estimated MSE of these coverage probabilities. We discuss how the superior performance of QCL tuning is linked to its alignment with the estimation goal. Finally, we explore the validity and width of prediction intervals created using this method.
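A hedged sketch of the coverage-loss idea for uncensored data: given out-of-bag tau-quantile predictions from candidate forest configurations, compare the empirical marginal coverage with the nominal level and keep the configuration with the smallest discrepancy. The function names and the synthetic predictions below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def qcl(y, oob_quantile_pred, tau):
    """Quantile coverage loss: absolute difference between the empirical
    proportion of responses falling at or below their out-of-bag tau-quantile
    prediction and the nominal level tau (uncensored case)."""
    coverage = np.mean(y <= oob_quantile_pred)
    return abs(coverage - tau)

def tune_by_qcl(y, oob_preds_by_setting, tau):
    """Pick the tuning-parameter setting whose OOB quantile predictions give
    the smallest quantile coverage loss."""
    losses = {setting: qcl(y, pred, tau) for setting, pred in oob_preds_by_setting.items()}
    return min(losses, key=losses.get), losses

# toy usage: OOB predictions from two hypothetical forest configurations
rng = np.random.default_rng(4)
y = rng.normal(size=1000)
preds = {"mtry=2": rng.normal(1.1, 0.1, 1000),   # roughly the 0.9 quantile
         "mtry=5": rng.normal(1.6, 0.1, 1000)}   # too high: over-covers
print(tune_by_qcl(y, preds, tau=0.9))
```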
Flow cytometry is a valuable technique that measures the optical properties of particles at single-cell resolution. When deployed in the ocean, flow cytometry allows oceanographers to study different types of photosynthetic microbes called phytoplankton. It is of great interest to study how phytoplankton properties change in response to environmental conditions. In our work, we develop a nonlinear mixture of experts model for simultaneous clustering and regression utilizing random-weight neural networks. Our model allows one to flexibly estimate how cell properties and relative abundances depend on environmental covariates, without the computational burden of backpropagation. We show that the proposed model provides superior predictive performance in simulated examples compared to a mixture of linear experts. Applying our model to real data, we also show that it (1) has comparable out-of-sample prediction performance and (2) provides more realistic estimates of phytoplankton behavior.
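The random-weight ingredient can be illustrated with an extreme-learning-machine-style regressor: a fixed random hidden layer followed by a ridge-fitted linear output layer, so no backpropagation is needed. The Python sketch below shows only this regression backbone under assumed settings (tanh activations, Gaussian random weights); the mixture-of-experts clustering layer is not reproduced.

```python
import numpy as np

class RandomWeightRegressor:
    """Single hidden layer with fixed random weights; only the linear output
    layer is fitted, by ridge regression (no backpropagation)."""
    def __init__(self, n_hidden=100, ridge=1e-2, seed=0):
        self.n_hidden, self.ridge = n_hidden, ridge
        self.rng = np.random.default_rng(seed)

    def _features(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.normal(scale=1.0 / np.sqrt(d), size=(d, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._features(X)
        self.beta = np.linalg.solve(H.T @ H + self.ridge * np.eye(self.n_hidden), H.T @ y)
        return self

    def predict(self, X):
        return self._features(X) @ self.beta

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=500)
model = RandomWeightRegressor().fit(X, y)
print(np.mean((model.predict(X) - y) ** 2))
```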
Extremile regression, as a least squares analog of quantile regression, is a potentially useful tool for modeling and understanding the extreme tails of a distribution. However, existing extremile regression methods, being nonparametric approaches, may face challenges in high-dimensional settings due to data sparsity, computational inefficiency, and the risk of overfitting. Since linear regression serves as the foundation for many other statistical and machine learning models due to its simplicity, interpretability, and relatively easy implementation, particularly in high-dimensional settings, this paper introduces a novel definition of linear extremile regression along with an accompanying estimation methodology. The regression coefficient estimators of this method achieve $\sqrt{n}$-consistency, which nonparametric extremile regression may not provide. Moreover, because semi-supervised learning can leverage unlabeled data to make more accurate predictions and avoid overfitting to small labeled datasets in high-dimensional spaces, we propose a semi-supervised learning approach to enhance estimation efficiency, even when the specified linear extremile regression model may be misspecified. Both simulation studies and real data analyses demonstrate the finite-sample performance of our proposed methods.
The present paper answers the following questions related to high-dimensional MANOVA: (i) is it possible to develop a likelihood ratio test for high-dimensional MANOVA? (ii) would such a test perform well? (iii) would it be able to outperform existing tests? (iv) would it be applicable to extremely small samples? (v) would it be successfully applicable to non-normal random variables, such as uniform, extremely skewed, or even heavy-tailed distributions? (vi) would it have a nice, rather simple to compute, and well performing asymptotic distribution? And what if the answer to all the above questions were a clear 'Yes'? Surprisingly enough, that is exactly the case. Extensive simulations, with both normal and non-normal distributions, some of which are heavy tailed and/or highly skewed, and even discrete distributions, are carried out in order to evaluate the performance of the proposed test and to compare it with that of other tests. Two real data applications are presented.
When using observational causal models, practitioners often want to disentangle the effects of many related, partially-overlapping treatments. Examples include estimating treatment effects of different marketing touchpoints, ordering different types of products, or signing up for different services. Common approaches that estimate separate treatment coefficients are too noisy for practical decision-making. We propose a computationally light model that uses a customized ridge regression to move between a heterogeneous and a homogeneous model: it substantially reduces MSE for the effects of each individual sub-treatment while allowing us to easily reconstruct the effects of an aggregated treatment. We demonstrate the properties of this estimator in theory and simulation, and illustrate how it has unlocked targeted decision-making at Wayfair.
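One simple way to realise this idea, sketched below in Python under assumed names and a toy design, is to reparameterise each sub-treatment coefficient as a common effect plus a deviation and apply an L2 penalty only to the deviations: with a small penalty the model is essentially heterogeneous, and as the penalty grows it collapses to the aggregated (homogeneous) treatment effect. This is a generic version of the construction, not necessarily the customized ridge used in the paper.

```python
import numpy as np

def shrunken_subtreatment_effects(D, y, lam):
    """D: n x K matrix of sub-treatment exposures, y: outcome.
    Fit y ~ mu * rowsum(D) + D @ delta with an L2 penalty lam * ||delta||^2,
    so beta_k = mu + delta_k shrinks toward the common effect mu as lam grows."""
    n, K = D.shape
    X = np.column_stack([D.sum(axis=1), D])           # [aggregate exposure | sub-treatments]
    penalty = np.diag(np.concatenate([[0.0], np.full(K, lam)]))  # do not penalise mu
    coef = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    mu, delta = coef[0], coef[1:]
    return mu + delta                                  # individual sub-treatment effects

rng = np.random.default_rng(6)
n, K = 2000, 8
D = rng.binomial(1, 0.3, size=(n, K)).astype(float)
true_beta = 1.0 + rng.normal(scale=0.2, size=K)        # effects clustered around 1
y = D @ true_beta + rng.normal(size=n)
for lam in [0.1, 10.0, 1000.0]:
    print(lam, np.round(shrunken_subtreatment_effects(D, y, lam), 2))
```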
Networks, composed of nodes and their connections, are widely used to model complex relationships across various fields. Centrality metrics often inform decisions such as identifying key nodes or prioritizing resources. However, networks frequently suffer from missing or incorrect edges, which can systematically bias centrality-based decisions and distort the representation of certain protected groups. To address this issue, we introduce a formal definition of minority representation, measured as the proportion of minority nodes among the top-ranked nodes. We model systematic bias against minority groups by using group-dependent missing edge errors. We propose methods to estimate and detect systematic bias. Asymptotic limits of minority representation statistics are derived under canonical network models and used to correct the representation of minority groups in node rankings. Simulation results demonstrate the effectiveness of our estimation, testing, and ranking correction procedures, and we apply our methods to a contact network, showcasing their practical applicability.
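The representation measure itself is straightforward to compute. The Python toy below uses degree centrality as an example metric, computes the proportion of minority nodes among the top-k ranked nodes, and then mimics group-dependent missing edges to show how the measure can be distorted; the edge-dropping scheme and parameter values are illustrative assumptions, not the paper's model.

```python
import numpy as np

def minority_representation(adj, is_minority, k):
    """Proportion of minority nodes among the top-k nodes ranked by degree
    centrality (ties broken by node index)."""
    degree = np.asarray(adj).sum(axis=1)
    top_k = np.argsort(-degree, kind="stable")[:k]
    return np.mean(np.asarray(is_minority)[top_k])

rng = np.random.default_rng(7)
n, p_minority = 200, 0.3
is_minority = rng.random(n) < p_minority
adj = (rng.random((n, n)) < 0.05).astype(int)
adj = np.triu(adj, 1); adj = adj + adj.T               # undirected, no self-loops

# drop ~40% of minority-to-majority edges to mimic group-dependent missingness
drop = rng.random((n, n)) < 0.4
biased = adj.copy()
block = np.ix_(is_minority, ~is_minority)
biased[block] *= (~drop[block]).astype(int)
biased = np.minimum(biased, biased.T)                   # keep the graph symmetric

print(minority_representation(adj, is_minority, 20),
      minority_representation(biased, is_minority, 20))
```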
The Kaplan-Meier estimate, also known as the product-limit method (PLM), is a widely used non-parametric maximum likelihood estimator (MLE) in survival analysis. In the context of highway engineering, it has been repeatedly applied to estimate stochastic traffic flow capacity. However, this paper demonstrates that the PLM is fundamentally unsuitable for this purpose. The method implicitly assumes continuous exposure to failure risk over time, a premise that is invalid for traffic flow, where intensity does not increase linearly and capacity is not even directly observable. Although a parametric MLE approach offers a viable alternative, an earlier derivation suffers from a flawed likelihood formulation, likely due to an attempt to preserve consistency with the PLM. This study derives a corrected likelihood formula for the stochastic capacity MLE and validates it using two empirical datasets. The proposed method is then applied in a case study examining the effect of a variable speed limit (VSL) system used for traffic flow speed harmonisation at a 2-to-1 lane drop. Results show that the VSL improved capacity by approximately 10% or reduced breakdown probability at the same flow intensity by up to 50%. The findings underscore the methodological importance of correct model formulation and highlight the practical relevance of stochastic capacity estimation for evaluating traffic control strategies.
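For orientation, parametric capacity estimation is usually built on a censored likelihood: flow observations at which a breakdown occurred contribute the density of the assumed capacity distribution (here Weibull), while observations without a breakdown contribute the survival function. The Python sketch below implements this generic textbook form on synthetic data; it is not necessarily the corrected likelihood formula derived in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, q, breakdown):
    """Right-censored Weibull log-likelihood for capacity: flows q at which a
    breakdown occurred enter through the density, censored observations
    (no breakdown at that flow) through the survival function."""
    log_scale, log_shape = params
    scale, shape = np.exp(log_scale), np.exp(log_shape)
    z = q / scale
    log_f = np.log(shape / scale) + (shape - 1) * np.log(z) - z ** shape
    log_S = -z ** shape
    return -np.sum(np.where(breakdown, log_f, log_S))

rng = np.random.default_rng(8)
capacity = rng.weibull(10.0, size=2000) * 2000.0        # latent capacities (veh/h)
q = rng.uniform(1000, 2400, size=2000)                  # observed flow intensities
breakdown = q >= capacity                               # breakdown iff flow exceeds capacity
res = minimize(neg_log_likelihood, x0=[np.log(2000.0), np.log(5.0)],
               args=(q, breakdown), method="Nelder-Mead")
scale_hat, shape_hat = np.exp(res.x)
print(scale_hat, shape_hat)
```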
Although appealing, randomization inference for treatment effects can suffer from severe size distortion due to sample attrition. We propose new, computationally efficient methods for randomization inference that remain valid under a range of potentially informative missingness mechanisms. We begin by constructing valid p-values for testing sharp null hypotheses, using the worst-case p-value from the Fisher randomization test over all possible imputations of missing outcomes. Leveraging distribution-free test statistics, this worst-case p-value admits a closed-form solution, connecting naturally to bounds in the partial identification literature. Our test statistics incorporate both potential outcomes and missingness indicators, allowing us to exploit structural assumptions, such as monotone missingness, for increased power. We further extend our framework to test non-sharp null hypotheses concerning quantiles of individual treatment effects. The methods are illustrated through simulations and an empirical application.
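To make the construction concrete, the sketch below implements a Monte Carlo Fisher randomization test for the sharp null with a difference-in-means statistic and then approximates the worst case over imputations by brute force, imputing each missing outcome to the minimum or maximum observed value and keeping the largest p-value found. This only bounds the worst case from below and does not use the closed-form solution available for distribution-free statistics; it is an illustration of the idea, not the paper's method.

```python
import numpy as np

def frt_pvalue(y, z, n_draws=1000, rng=None):
    """Monte Carlo Fisher randomization p-value for the sharp null of no effect,
    using the absolute difference in means and re-randomization of z."""
    rng = rng or np.random.default_rng(0)
    obs = abs(y[z == 1].mean() - y[z == 0].mean())
    draws = np.empty(n_draws)
    for b in range(n_draws):
        zb = rng.permutation(z)
        draws[b] = abs(y[zb == 1].mean() - y[zb == 0].mean())
    return (1 + np.sum(draws >= obs)) / (n_draws + 1)

def approx_worst_case_pvalue(y, z, missing, n_corners=100, rng=None):
    """Brute-force approximation of the worst-case p-value over imputations of
    the missing outcomes: each missing value is imputed to the minimum or the
    maximum observed outcome and the largest p-value found is kept."""
    rng = rng or np.random.default_rng(1)
    lo, hi = y[~missing].min(), y[~missing].max()
    worst = 0.0
    for _ in range(n_corners):
        y_imp = y.copy()
        y_imp[missing] = rng.choice([lo, hi], size=missing.sum())
        worst = max(worst, frt_pvalue(y_imp, z, rng=rng))
    return worst

rng = np.random.default_rng(9)
n = 60
z = rng.permutation(np.repeat([0, 1], n // 2))
y = rng.normal(size=n) + 0.8 * z
missing = rng.random(n) < 0.15
y[missing] = np.nan                                     # attrition: outcomes unobserved
print(approx_worst_case_pvalue(y, z, missing))
```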
In recent years, Variational Bayes (VB) has emerged as a widely used method for addressing statistical inference in the context of massive data. This study focuses on misspecified models and examines the risk functions associated with predictive distributions derived from variational posterior distributions. These risk functions, defined as the expectation of the Kullback-Leibler (KL) divergence between the true data-generating density and the variational predictive distributions, provide a framework for assessing predictive performance. We propose two novel information criteria for predictive model comparison based on these risk functions. Under certain regularity conditions, we demonstrate that the proposed information criteria are asymptotically unbiased estimators of their respective risk functions. Through comprehensive numerical simulations and empirical applications in economics and finance, we demonstrate the effectiveness of these information criteria in comparing misspecified models in the context of massive data.
The statistical analysis of clinical trials is often complicated by missing data. Patients sometimes experience intercurrent events (ICEs), which usually (although not always) lead to missing subsequent outcome measurements for such individuals. Reference-based imputation methods were proposed by Carpenter et al. (2013) and have been commonly adopted for handling missing data due to ICEs when estimating treatment policy strategy estimands. Conventionally, the variance for reference-based estimators was obtained using Rubin's rules. However, the Rubin's rules variance estimator is biased, compared to the repeated sampling variance of the point estimator, due to uncongeniality. Repeated sampling variance estimators were proposed as an alternative for variance estimation of reference-based estimators. However, these have the property that they decrease as the proportion of ICEs increases. White et al. (2019) introduced a causal model incorporating the concept of a 'maintained treatment effect' following the occurrence of ICEs and showed that this causal model includes common reference-based estimators as special cases. Building on this framework, we propose introducing a prior distribution for the maintained-effect parameter to account for uncertainty in this assumption. Our approach provides inference for reference-based estimators that explicitly reflects our uncertainty about how much treatment effects are maintained after the occurrence of ICEs. In trials where little or no post-ICE data are observed, our proposed Bayesian reference-based causal model approach can be used to estimate the treatment policy estimand, incorporating uncertainty about the reference-based assumption. We compare the frequentist properties of this approach with those of existing reference-based methods through simulations and by application to an antidepressant trial.
Traditional machine learning approaches in physics rely on global optimization, limiting interpretability and enforcing physical constraints externally. We introduce the Hebbian Physics Network (HPN), a self-organizing computational framework in which learning emerges from local Hebbian updates driven by violations of conservation laws. Grounded in non-equilibrium thermodynamics and inspired by Prigogine's theory of dissipative structures, HPNs eliminate the need for global loss functions by encoding physical laws directly into the system's local dynamics. Residuals (quantified imbalances in continuity, momentum, or energy) serve as thermodynamic signals that drive weight adaptation through generalized Hebbian plasticity. We demonstrate this approach on incompressible fluid flow and continuum diffusion, where physically consistent structures emerge from random initial conditions without supervision. HPNs reframe computation as a residual-driven thermodynamic process, offering an interpretable, scalable, and physically grounded alternative for modeling complex dynamical systems.
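A toy illustration of the residual-driven, purely local update principle (not the authors' HPN, which adapts network weights): each interior cell of a 1D steady-state diffusion problem is nudged in proportion to its local flux imbalance, with fixed boundaries and no global loss function, and the conserved steady profile emerges from a random initial state.

```python
import numpy as np

def local_residual_updates(n=50, lr=0.2, n_steps=5000, left=1.0, right=0.0):
    """Each interior cell of a 1D steady diffusion problem is updated in
    proportion to its local continuity residual (net flux imbalance), with
    fixed boundary values and no global objective. This is essentially a
    relaxation scheme, shown only to illustrate learning driven by local
    conservation-law violations."""
    u = np.random.default_rng(10).uniform(size=n)       # random initial state
    u[0], u[-1] = left, right
    for _ in range(n_steps):
        residual = u[:-2] - 2 * u[1:-1] + u[2:]          # discrete flux imbalance
        u[1:-1] += lr * residual                         # local, residual-driven update
    return u

u = local_residual_updates()
print(np.round(u[::10], 3))                              # approximately linear profile
```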
In this paper, we propose a novel coefficient, named differential distance correlation, to measure the strength of dependence between a random variable $ Y \in \mathbb {R} $ and a random vector $ X \in \mathbb {R}^{p} $. The coefficient has a concise expression and is invariant to arbitrary orthogonal transformations of the random vector. Moreover, the coefficient is a strongly consistent estimator of a simple and interpretable dependence measure, which equals 0 if and only if $ X $ and $ Y $ are independent and equals 1 if and only if $ Y $ determines $ X $ almost surely. Furthermore, the coefficient exhibits asymptotic normality with a simple variance under the independence hypothesis, facilitating fast and accurate estimation of the p-value for testing independence. Two simulated experiments demonstrate that our proposed coefficient outperforms some existing dependence measures in identifying relationships with higher oscillatory behavior. We also apply our method to analyze a real data example.