This note introduces a unified theory for causal inference that integrates Riesz regression, covariate balancing, density-ratio estimation (DRE), targeted maximum likelihood estimation (TMLE), and the matching estimator in average treatment effect (ATE) estimation. In ATE estimation, the balancing weights and the regression functions of the outcome play important roles, where the balancing weights are referred to as the Riesz representer, the bias-correction term, or the clever covariates, depending on the context. Riesz regression, covariate balancing, DRE, and the matching estimator are methods for estimating the balancing weights: Riesz regression is essentially equivalent to DRE in the ATE context, the matching estimator is a special case of DRE, and DRE is in a dual relationship with covariate balancing. TMLE is a method for constructing regression-function estimators such that the leading bias term becomes zero. In particular, nearest neighbor matching is equivalent to least-squares density-ratio estimation and to Riesz regression.
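For reference, the roles of the two nuisance components discussed above can be fixed with the standard doubly robust (one-step) display for the ATE, which is textbook material rather than this note's contribution:

$$
\hat\tau = \frac{1}{n}\sum_{i=1}^{n}\Big[\hat m(X_i,1) - \hat m(X_i,0) + \hat\alpha(A_i,X_i)\big(Y_i - \hat m(X_i,A_i)\big)\Big],
\qquad
\alpha(a,x) = \frac{a}{e(x)} - \frac{1-a}{1-e(x)},
$$

where $m(x,a) = E[Y \mid X = x, A = a]$ is the outcome regression, $e(x) = P(A = 1 \mid X = x)$ is the propensity score, and $\alpha$ is the Riesz representer (balancing weight) of the ATE functional. Riesz regression, covariate balancing, DRE, and matching differ in how they estimate $\alpha$, while TMLE adjusts $\hat m$ so that the empirical average of the correction term vanishes.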
We develop a KL-divergence-based procedure for testing elliptical distributions. The procedure simultaneously takes into account the two defining properties of an elliptically distributed random vector: independence between length and direction, and uniform distribution of the direction. The test statistic is constructed using the $k$ nearest neighbors ($k$NN) method, and two cases are considered in which the mean vector and covariance matrix are either known or unknown. First-order asymptotic properties of the test statistic are rigorously established by creatively utilizing sample splitting, truncation, and transformation between Euclidean space and the unit sphere, while avoiding assuming Fr\'echet differentiability of any functionals. Debiasing and variance inflation are further proposed to treat degeneration of the influence function. Numerical experiments suggest better size and power performance than state-of-the-art procedures.
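For concreteness, the two defining properties being tested can be exposed by the usual whitening decomposition; the sketch below (a standard construction, not the paper's $k$NN test statistic) extracts the length and direction components whose independence and uniformity an elliptical law implies.

```python
import numpy as np

def length_and_direction(X, mu=None, Sigma=None):
    """Decompose each row of X into a radius R and a unit direction U after whitening.
    If X is elliptically distributed with location mu and scatter Sigma, then R and U
    are independent and U is uniform on the unit sphere. mu/Sigma are estimated if omitted."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0) if mu is None else mu
    Sigma = np.cov(X, rowvar=False) if Sigma is None else Sigma
    L = np.linalg.cholesky(Sigma)
    Z = np.linalg.solve(L, (X - mu).T).T   # whitened data (spherical under ellipticity)
    R = np.linalg.norm(Z, axis=1)          # lengths
    U = Z / R[:, None]                     # directions on the unit sphere
    return R, U
```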
The goal of policy learning is to train a policy function that recommends a treatment given covariates to maximize population welfare. There are two major approaches in policy learning: the empirical welfare maximization (EWM) approach and the plug-in approach. The EWM approach is analogous to a classification problem, where one first builds an estimator of the population welfare, which is a functional of policy functions, and then trains a policy by maximizing the estimated welfare. In contrast, the plug-in approach is based on regression, where one first estimates the conditional average treatment effect (CATE) and then recommends the treatment with the highest estimated outcome. This study bridges the gap between the two approaches by showing that both are based on essentially the same optimization problem. In particular, we prove an exact equivalence between EWM and least squares over a reparameterization of the policy class. As a consequence, the two approaches are interchangeable in several respects and share the same theoretical guarantees under common conditions. Leveraging this equivalence, we propose a novel regularization method for policy learning. Our findings yield a convex and computationally efficient training procedure that avoids the NP-hard combinatorial step typically required in EWM.
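To fix notation for the two approaches, a common display from the policy learning literature (background, not the paper's equivalence result itself) is

$$
\hat\pi_{\mathrm{EWM}} \in \arg\max_{\pi \in \Pi} \frac{1}{n}\sum_{i=1}^{n}\pi(X_i)\,\hat\Gamma_i,
\qquad
\hat\pi_{\mathrm{plugin}}(x) = \mathbf{1}\{\hat\tau(x) > 0\},
$$

where $\hat\Gamma_i$ is an estimated treatment-effect score (e.g., an AIPW/doubly robust score) and $\hat\tau(x)$ is an estimate of the CATE. The paper shows that, after a reparameterization of the policy class $\Pi$, the EWM problem coincides with a least-squares problem, which is what makes the convex training procedure possible.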
Causal discovery is the subfield of causal inference concerned with estimating the structure of cause-and-effect relationships in a system of interrelated variables, as opposed to quantifying the strength of causal effects. As interest in causal discovery builds in fields such as ecology, public health, and environmental sciences where data is regularly collected with spatial and temporal structures, approaches must evolve to manage autocorrelation and complex confounding. As it stands, the few proposed causal discovery algorithms for spatiotemporal data require summarizing across locations, ignore spatial autocorrelation, and/or scale poorly to high dimensions. Here, we introduce our developing framework that extends time-series causal discovery to systems with spatial structure, building upon work on causal discovery across contexts and methods for handling spatial confounding in causal effect estimation. We close by outlining remaining gaps in the literature and directions for future research.
This paper studies decision-making and statistical inference for two-sided matching markets via matrix completion. In contrast to the independent sampling assumed in classical matrix completion literature, the observed entries, which arise from past matching data, are constrained by matching capacity. This matching-induced dependence poses new challenges for both estimation and inference in the matrix completion framework. We propose a non-convex algorithm based on Grassmannian gradient descent and establish near-optimal entrywise convergence rates for three canonical mechanisms, i.e., one-to-one matching, one-to-many matching with one-sided random arrival, and two-sided random arrival. To facilitate valid uncertainty quantification and hypothesis testing on matching decisions, we further develop a general debiasing and projection framework for arbitrary linear forms of the reward matrix, deriving asymptotic normality with finite-sample guarantees under matching-induced dependent sampling. Our empirical experiments demonstrate that the proposed approach provides accurate estimation, valid confidence intervals, and efficient evaluation of matching policies.
The difference-in-differences (DID) research design is a key identification strategy which allows researchers to estimate causal effects under the parallel trends assumption. While the parallel trends assumption is counterfactual and cannot be tested directly, researchers often examine pre-treatment periods to check whether the time trends are parallel before treatment is administered. Recently, researchers have been cautioned against using preliminary tests which aim to detect violations of parallel trends in the pre-treatment period. In this paper, we argue that preliminary testing can -- and should -- play an important role within the DID research design. We propose a new and more substantively appropriate conditional extrapolation assumption, which requires an analyst to conduct a preliminary test to determine whether the severity of pre-treatment parallel trend violations falls below an acceptable level before extrapolation to the post-treatment period is justified. This stands in contrast to prior work which can be interpreted as either setting the acceptable level to be exactly zero (in which case preliminary tests lack power) or assuming that extrapolation is always justified (in which case preliminary tests are not required). Under mild assumptions on how close the actual violation is to the acceptable level, we provide a consistent preliminary test as well as confidence intervals which are valid when conditioned on the result of the test. The conditional coverage of these intervals overcomes a common critique made against the use of preliminary testing within the DID research design. We use real data as well as numerical simulations to illustrate the performance of the proposed methods.
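The sketch below illustrates, in a deliberately simplified two-group, multi-period setting, what a preliminary test of the form "is the pre-treatment violation below an acceptable level $\delta$?" could look like; it is a hypothetical noninferiority-style check for illustration only, not the paper's procedure.

```python
import numpy as np
from scipy import stats

def pretrend_below_threshold(y_treat, y_ctrl, delta, alpha=0.05):
    """Hypothetical pre-trend check: test H0 |slope gap| >= delta vs H1 |slope gap| < delta.
    y_treat, y_ctrl: arrays of shape (n_units, T_pre) of pre-treatment outcomes."""
    t = np.arange(y_treat.shape[1], dtype=float)
    t_c = t - t.mean()
    # per-unit OLS slopes of outcome on (centered) time
    s1 = y_treat @ t_c / (t_c @ t_c)
    s0 = y_ctrl @ t_c / (t_c @ t_c)
    gap = s1.mean() - s0.mean()
    se = np.sqrt(s1.var(ddof=1) / len(s1) + s0.var(ddof=1) / len(s0))
    # two one-sided tests: accept extrapolation if the CI for the gap lies inside (-delta, delta)
    z = stats.norm.ppf(1 - alpha)
    return (gap + z * se < delta) and (gap - z * se > -delta)
```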
This paper introduces a unified family of smoothed quantile estimators that continuously interpolate between classical empirical quantiles and the sample mean. The estimators $q(z, h)$ are defined as minimizers of a regularized objective function depending on two parameters: a smoothing parameter $h \ge 0$ and a location parameter $z \in \mathbb{R}$. When $h = 0$ and $z \in (-1, 1)$, the estimator reduces to the empirical quantile of order $\tau = (1-z)/2$; as $h \rightarrow \infty$, it converges to the sample mean for any fixed $z$. We establish consistency, asymptotic normality, and an explicit variance expression characterizing the efficiency-robustness trade-off induced by $h$. A key geometric insight shows that for each fixed quantile level $\tau$, the admissible parameter pairs $(z, h)$ lie on a straight line in the parameter space, along which the population quantile remains constant while asymptotic efficiency varies. The analysis reveals two regimes: under light-tailed distributions (e.g., Gaussian), smoothing yields a monotonic reduction of the asymptotic variance with no finite optimum; under heavy-tailed distributions (e.g., Laplace), a finite smoothing level $h^*(\tau) > 0$ achieves strict efficiency improvement over the classical empirical quantile. Numerical illustrations confirm these theoretical predictions and highlight how smoothing balances robustness and efficiency across quantile levels.
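The abstract does not spell out the objective function, so the sketch below uses one plausible surrogate, a check loss at level $\tau = (1-z)/2$ plus an $h$-weighted squared-error term, purely to illustrate the interpolation: at $h = 0$ the minimizer is an empirical quantile, and as $h$ grows it is pulled toward the sample mean. This is an illustrative stand-in, not the authors' estimator.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def smoothed_quantile(x, z, h):
    """Illustrative smoothed quantile: minimize a check loss at tau = (1 - z)/2
    plus h times a squared-error term (a surrogate for the paper's objective)."""
    x = np.asarray(x, dtype=float)
    tau = (1.0 - z) / 2.0
    def objective(q):
        u = x - q
        check = np.mean(np.where(u >= 0, tau * u, (tau - 1.0) * u))
        return check + h * np.mean(u ** 2)
    return minimize_scalar(objective, bounds=(x.min(), x.max()), method="bounded").x

# x = np.random.default_rng(0).laplace(size=1000)
# smoothed_quantile(x, z=0.0, h=0.0)   # close to the sample median
# smoothed_quantile(x, z=0.0, h=10.0)  # pulled toward the sample mean
```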
Characterizing the genetic basis of survival traits, such as age at disease onset, is critical for risk stratification, early intervention, and elucidating biological mechanisms that can inform therapeutic development. However, time-to-event outcomes in human cohorts are frequently right-censored, complicating both the estimation and partitioning of total heritability. Modern biobanks linked to electronic health records offer the unprecedented power to dissect the genetic basis of age-at-diagnosis traits at large scale. Yet, few methods exist for estimating and partitioning the total heritability of censored survival traits. Existing methods impose restrictive distributional assumptions on genetic and environmental effects and are not scalable to large biobanks with a million subjects. We introduce a censored multiple variance component model to robustly estimate the total heritability of survival traits under right-censoring. We demonstrate through extensive simulations that the method provides accurate total heritability estimates of right-censored traits at censoring rates up to 80% given sufficient sample size. The method is computationally efficient in estimating one hundred genetic variance components of a survival trait using large-scale biobank genotype data consisting of a million subjects and a million SNPs in under nine hours, including uncertainty quantification. We apply our method to estimate the total heritability of four age-at-diagnosis traits from the UK Biobank study. Our results establish a scalable and robust framework for heritability analysis of right-censored survival traits in large-scale genetic studies.
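For orientation, a fully observed (non-censored) multiple variance component model of the kind being generalized writes, for a trait vector $y$,

$$
y = X\beta + \sum_{k=1}^{K} g_k + \varepsilon,
\qquad g_k \sim \mathcal N(0, \sigma_k^2 K_k), \quad \varepsilon \sim \mathcal N(0, \sigma_e^2 I),
\qquad h^2 = \frac{\sum_k \sigma_k^2}{\sum_k \sigma_k^2 + \sigma_e^2},
$$

where the $K_k$ are genetic relationship matrices built from disjoint sets of SNPs (e.g., the one hundred components mentioned above). The paper's contribution is estimating the $\sigma_k^2$ and $h^2$ when $y$ is a right-censored age-at-diagnosis trait, which this uncensored display does not capture.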
Spatial regression models have a variety of applications in several fields ranging from economics to public health. Typically, it is of interest to select important exogenous predictors of the spatially autocorrelated response variable. In this paper, we propose variable selection in linear spatial lag models by means of the focussed information criterion (FIC). The FIC-based variable selection involves the minimization of the asymptotic risk in the estimation of a certain parametric focus function of interest under potential model misspecification. We systematically investigate the key asymptotics of the maximum likelihood estimators under the sequence of locally perturbed mutually contiguous probability models. Using these results, we obtain the expressions for the bias and the variance of the estimated focus leading to the desired FIC formula. We provide practically useful focus functions that account for various spatial characteristics such as mean response, variability in the estimation and spatial spillover effects. Furthermore, we develop an averaged version of the FIC that incorporates varying covariate levels while evaluating the models. The empirical performance of the proposed methodology is demonstrated through simulations and real data analysis.
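At a high level, and abstracting from the spatial-lag-specific derivations in the paper, the FIC for a candidate model $S$ estimates the mean squared error of the estimated focus $\hat\mu_S$:

$$
\mathrm{FIC}(S) \;=\; \widehat{\mathrm{bias}}^{\,2}(\hat\mu_S) \;+\; \widehat{\mathrm{var}}(\hat\mu_S),
$$

with the bias term arising from the local (contiguous) misspecification of the omitted parameters; the model minimizing $\mathrm{FIC}(S)$ is selected for the chosen focus $\mu$, e.g., a mean response, an estimation variance, or a spatial spillover effect at given covariate values.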
Mutational signatures are powerful summaries of the mutational processes altering the DNA of cancer cells and are increasingly relevant as biomarkers in personalized treatments. The widespread approach to mutational signature analysis consists of decomposing the matrix of mutation counts from a sample of patients via non-negative matrix factorization (NMF) algorithms. However, by working with aggregate counts, this procedure ignores the non-homogeneous patterns of occurrence of somatic mutations along the genome, as well as the tissue-specific characteristics that notoriously influence their rate of appearance. This gap is primarily due to a lack of adequate methodologies to leverage locus-specific covariates directly in the factorization. In this paper, we address these limitations by introducing a model based on Poisson point processes to infer mutational signatures and their activities as they vary across genomic regions. Using covariate-dependent factorized intensity functions, our Poisson process factorization (PPF) generalizes the baseline NMF model to include regression coefficients that capture the effect of commonly known genomic features on the mutation rates from each latent process. Furthermore, our method relies on sparsity-inducing hierarchical priors to automatically infer the number of active latent factors in the data, avoiding the need to fit multiple models for a range of plausible ranks. We present algorithms to obtain maximum a posteriori estimates and uncertainty quantification via Markov chain Monte Carlo. We test the method on simulated data and on real data from breast cancer, using covariates on alterations in chromosomal copies, histone modifications, cell replication timing, nucleosome positioning, and DNA methylation. Our results shed light on the joint effect that epigenetic marks have on the latent processes at high resolution.
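Schematically (the exact parameterization is the paper's and is not reproduced here), a covariate-dependent factorized Poisson intensity for mutation channel $m$ in genomic region $g$ of patient $j$ might take the form

$$
\lambda_{j,g,m} \;=\; \sum_{k=1}^{K} \theta_{jk}\, s_{km}\, \exp\!\big(x_g^{\top}\gamma_k\big),
$$

where $s_{k\cdot}$ is the $k$-th signature, $\theta_{jk}$ its activity in patient $j$, $x_g$ the locus-specific covariates (copy-number alterations, histone marks, replication timing, nucleosome positioning, methylation), and $\gamma_k$ the regression coefficients capturing how those covariates modulate the rate of the $k$-th latent process; setting $\gamma_k = 0$ recovers an aggregate NMF-style model.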
This paper reinterprets the Synthetic Control (SC) framework through the lens of weighting philosophy, arguing that the contrast between traditional SC and Difference-in-Differences (DID) reflects two distinct modeling mindsets: sparse versus dense weighting schemes. Rather than viewing sparsity as inherently superior, we treat it as a modeling choice: simple but potentially fragile. We propose an L-infinity-regularized SC method that combines the strengths of both approaches. Like DID, it employs a denser weighting scheme that distributes weights more evenly across control units, enhancing robustness and reducing overreliance on a few control units. Like traditional SC, it remains flexible and data-driven, increasing the likelihood of satisfying the parallel trends assumption while preserving interpretability. We develop an interior point algorithm for efficient computation, derive asymptotic theory under weak dependence, and demonstrate strong finite-sample performance through simulations and real-world applications.
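A minimal sketch of an L-infinity-regularized synthetic control fit is given below; the simplex constraint on the weights and the placement of the penalty are assumptions made for illustration rather than the paper's exact formulation (which is solved with a dedicated interior point algorithm).

```python
import cvxpy as cp
import numpy as np

def linf_synthetic_control(y1_pre, Y0_pre, lam):
    """y1_pre: (T0,) pre-treatment outcomes of the treated unit.
    Y0_pre: (T0, J) pre-treatment outcomes of J control units.
    lam: strength of the L-infinity penalty, which caps the largest weight
    and thus pushes toward a denser, more evenly spread weighting."""
    J = Y0_pre.shape[1]
    w = cp.Variable(J, nonneg=True)
    objective = cp.Minimize(cp.sum_squares(y1_pre - Y0_pre @ w) + lam * cp.norm(w, "inf"))
    cp.Problem(objective, [cp.sum(w) == 1]).solve()
    return w.value

# w_hat = linf_synthetic_control(y1_pre, Y0_pre, lam=0.5)
# y1_synth_post = Y0_post @ w_hat   # synthetic counterfactual in post-treatment periods
```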
We study the statistical properties of nonparametric distance-based (isotropic) local polynomial regression estimators of the boundary average treatment effect curve, a key causal functional parameter capturing heterogeneous treatment effects in boundary discontinuity designs. We present necessary and/or sufficient conditions for identification, estimation, and inference in large samples, both pointwise and uniformly along the boundary. Our theoretical results highlight the crucial role played by the ``regularity'' of the boundary (a one-dimensional manifold) over which identification, estimation, and inference are conducted. Our methods are illustrated with simulated data. Companion general-purpose software is provided.
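The following sketch shows a basic distance-based (isotropic) local linear fit on each side of the boundary at a single boundary point $\mathbf b$; it is a simplified illustration of this type of estimator (fixed bandwidth, triangular kernel), not the paper's estimator or its companion software.

```python
import numpy as np

def local_linear_mean(coords, y, b, h):
    """Isotropic local linear estimate of E[Y | location = b] using a triangular
    kernel in the Euclidean distance to the boundary point b."""
    d = np.linalg.norm(coords - b, axis=1)
    w = np.clip(1.0 - d / h, 0.0, None)                   # triangular kernel weights
    X = np.column_stack([np.ones(len(y)), coords - b])    # local linear design
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]                                        # fitted value at b

def boundary_effect_at(coords, y, treated, b, h):
    """Boundary treatment effect at b: difference of the one-sided local linear fits."""
    t = np.asarray(treated, dtype=bool)
    return local_linear_mean(coords[t], y[t], b, h) - local_linear_mean(coords[~t], y[~t], b, h)
```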
Imbalanced data, in which positive samples represent only a small proportion relative to negative samples, makes it challenging for classifiers to balance false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
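As context for where the bias comes from, the sketch below shows a typical SMOTE-style generation step for the minority class, interpolating between a minority point and one of its nearest minority neighbors; the paper's bias estimators and correction procedure are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by linear interpolation between
    randomly chosen minority points and one of their k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                           # column 0 is the point itself
    base = rng.integers(0, len(X_min), size=n_new)
    nbr = idx[base, rng.integers(1, k + 1, size=n_new)]     # a random true neighbor
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[nbr] - X_min[base])
```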
We consider the problem of estimating a causal effect in a multi-domain setting. The causal effect of interest is confounded by an unobserved confounder and can change between the different domains. We assume that we have access to a proxy of the hidden confounder and that all variables are discrete or categorical. We propose methodology to estimate the causal effect in the target domain, where we assume to observe only the proxy variable. Under these conditions, we prove identifiability (even when treatment and response variables are continuous). We introduce two estimation techniques, prove consistency, and derive confidence intervals. The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.
In modern industrial settings, advanced acquisition systems allow for the collection of data in the form of profiles, that is, as functional relationships linking responses to explanatory variables. In this context, statistical process monitoring (SPM) aims to assess the stability of profiles over time in order to detect unexpected behavior. This review focuses on SPM methods that model profiles as functional data, i.e., smooth functions defined over a continuous domain, and apply functional data analysis (FDA) tools to address limitations of traditional monitoring techniques. A reference framework for monitoring multivariate functional data is first presented. This review then offers a focused survey of several recent FDA-based profile monitoring methods that extend this framework to address common challenges encountered in real-world applications. These include approaches that integrate additional functional covariates to enhance detection power, a robust method designed to accommodate outlying observations, a real-time monitoring technique for partially observed profiles, and two adaptive strategies that target the characteristics of the out-of-control distribution. These methods are all implemented in the R package funcharts, available on CRAN. Finally, a review of additional existing FDA-based profile monitoring methods is also presented, along with suggestions for future research.
The partial correlation graphical LASSO (PCGLASSO) is a penalised likelihood method for Gaussian graphical models which provides scale invariant sparse estimation of the precision matrix and improves upon the popular graphical LASSO method. However, the PCGLASSO suffers from computational challenges due to the non-convexity of its associated optimisation problem. This paper provides some important breakthroughs in the computation of the PCGLASSO. First, the existence of the PCGLASSO estimate is proven when the sample size is smaller than the dimension, a case in which the maximum likelihood estimate does not exist. This means that the PCGLASSO can be used with any Gaussian data. Second, a new alternating algorithm for computing the PCGLASSO is proposed and implemented in the R package PCGLASSO available at https://github.com/JackStorrorCarter/PCGLASSO. This is the first publicly available implementation of the PCGLASSO and provides competitive computation times for moderate dimensions.
We extend a heuristic method for automatic dimensionality selection, which maximizes a profile likelihood to identify "elbows" in scree plots. Our extension enables researchers to make automatic choices of multiple hyper-parameters simultaneously. To facilitate the extension to multiple dimensions, we propose a "softened" profile likelihood. We present two distinct parameterizations of our solution and demonstrate our approach on elastic nets, support vector machines, and neural networks. We also briefly discuss applications of our method to data-analytic tasks other than hyper-parameter selection.
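For reference, the base heuristic being extended, a profile likelihood over candidate elbow positions in a scree plot, can be sketched as follows; the softened, multi-dimensional version proposed in the paper is not shown.

```python
import numpy as np
from scipy.stats import norm

def profile_loglik_elbow(values):
    """Profile log-likelihood for each candidate elbow position q: model the first q
    (sorted, descending) values and the rest as Gaussians with separate means and a
    common pooled (MLE) variance; the maximizing q is the selected dimensionality."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    n = len(v)
    ll = np.full(n - 1, -np.inf)
    for q in range(1, n):
        g1, g2 = v[:q], v[q:]
        var = (np.sum((g1 - g1.mean()) ** 2) + np.sum((g2 - g2.mean()) ** 2)) / n
        if var <= 0:
            continue
        sd = np.sqrt(var)
        ll[q - 1] = norm.logpdf(g1, g1.mean(), sd).sum() + norm.logpdf(g2, g2.mean(), sd).sum()
    return ll

# eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
# k_hat = int(np.argmax(profile_loglik_elbow(eigvals))) + 1
```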
We propose a method for variable selection in the intensity function of spatial point processes that combines sparsity-promoting estimation with noise-robust model selection. As high-resolution spatial data becomes increasingly available through remote sensing and automated image analysis, identifying spatial covariates that influence the localization of events is crucial for understanding the underlying mechanisms. However, results from automated acquisition techniques are often noisy, for example due to measurement uncertainties or detection errors, which leads to spurious displacements and missed events. We study the impact of such noise on sparse point-process estimation across different models, including Poisson and Thomas processes. To improve noise robustness, we propose to use stability selection based on point-process subsampling and to incorporate a non-convex best-subset penalty to enhance model-selection performance. In extensive simulations, we demonstrate that such an approach reliably recovers true covariates under diverse noise scenarios and improves both selection accuracy and stability. We then apply the proposed method to a forestry data set, analyzing the distribution of trees in relation to elevation and soil nutrients in a tropical rain forest. This shows the practical utility of the method, which provides a systematic framework for robust variable selection in spatial point-process models under noise, without requiring additional knowledge of the process.
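A generic sketch of the stability-selection component is given below; `fit_sparse_intensity` is a hypothetical placeholder for a sparsity-penalised intensity fit (with covariate rasters captured in a closure), and the paper's non-convex best-subset penalty is not reproduced here.

```python
import numpy as np

def stability_selection(points, n_covariates, fit_sparse_intensity, n_sub=100,
                        frac=0.5, threshold=0.6, rng=None):
    """Stability selection by point-process subsampling (a generic sketch).
    `fit_sparse_intensity` is a hypothetical user-supplied routine that fits a
    sparsity-penalised intensity model to a (thinned) point pattern and returns a
    length-n_covariates boolean array of covariates with nonzero coefficients."""
    rng = np.random.default_rng(rng)
    freq = np.zeros(n_covariates)
    for _ in range(n_sub):
        keep = rng.random(len(points)) < frac          # independent thinning of the pattern
        freq += np.asarray(fit_sparse_intensity(points[keep]), dtype=float)
    freq /= n_sub
    return freq, freq >= threshold                     # selection frequencies and stable set
```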