We develop a framework for the operationalization of models and parameters by combining de Finetti's representation theorem with a conditional form of Sanov's theorem. This synthesis, the tilted de Finetti theorem, shows that conditioning exchangeable sequences on empirical moment constraints yields predictive laws in exponential families via the I-projection of a baseline measure. Parameters emerge as limits of empirical functionals, providing a probabilistic foundation for maximum entropy (MaxEnt) principles. This explains why exponential tilting governs likelihood methods and Bayesian updating, connecting naturally to finite-sample concentration rates that anticipate PAC-Bayes bounds. Examples include Gaussian scale mixtures, where symmetry uniquely selects location-scale families, and Jaynes' Brandeis dice problem, where partial information tilts the uniform law. Broadly, the theorem unifies exchangeability, large deviations, and entropy concentration, clarifying the ubiquity of exponential families and MaxEnt's role as the inevitable predictive limit under partial information.
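As a reminder of the standard result underlying the tilted limit described above, the display below (our notation, not the paper's) records the I-projection of a baseline measure Q onto a moment constraint, which is exactly an exponentially tilted, i.e. exponential-family, law.

```latex
% I-projection of a baseline Q onto the constraint E_P[T(X)] = t:
%   P* = argmin_{P : E_P[T(X)] = t} D(P || Q)
% has the exponentially tilted (exponential-family) form
\[
  \frac{dP^{*}}{dQ}(x) = \frac{e^{\langle \theta, T(x) \rangle}}{Z(\theta)},
  \qquad
  Z(\theta) = \int e^{\langle \theta, T(x) \rangle}\, dQ(x),
\]
\[
  \text{with } \theta \text{ chosen so that } \mathbb{E}_{P^{*}}[T(X)] = t .
\]
```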
Discrete Bayesian networks (DBNs) provide a broadly useful framework for modeling dependence structures in multivariate categorical data. There is a vast literature on methods for inferring conditional probabilities and graphical structure in DBNs, but data sparsity and parametric assumptions are major practical issues. In this article, we detail a comprehensive Bayesian framework for learning DBNs. First, we propose a hierarchical prior for the conditional probabilities that accommodates complicated interactions between parent variables while remaining stable in sparse regimes. We give a novel Markov chain Monte Carlo (MCMC) algorithm utilizing parallel Langevin proposals to generate exact posterior samples, avoiding the pitfalls of variational approximations. Moreover, we verify that the full conditional distribution of the concentration parameters is log-concave under mild conditions, facilitating efficient sampling. We then propose two methods for learning network structures, including parent sets, Markov blankets, and DAGs, from categorical data. The first cycles through individual edges each MCMC iteration, whereas the second updates the entire structure as a single step. We evaluate the accuracy, power, and MCMC performance of our methods on several simulation studies. Finally, we apply our methodology to uncover prognostic network structure from primary breast cancer samples.
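The sampler above relies on Langevin proposals for the concentration parameters; the sketch below shows a generic Metropolis-adjusted Langevin (MALA) update for a log-concave target with a placeholder log-posterior. It illustrates the proposal mechanism only and is not the paper's parallel sampler.

```python
import numpy as np

def mala_step(theta, log_post, grad_log_post, step=0.05, rng=None):
    """One Metropolis-adjusted Langevin step for a (log-concave) target.

    theta         : current parameter vector
    log_post      : callable returning the log posterior density
    grad_log_post : callable returning its gradient
    """
    if rng is None:
        rng = np.random.default_rng()

    # Langevin proposal: gradient drift plus Gaussian noise
    mean_fwd = theta + 0.5 * step * grad_log_post(theta)
    prop = mean_fwd + np.sqrt(step) * rng.standard_normal(theta.shape)

    # Reverse-proposal mean, needed for the Metropolis-Hastings correction
    mean_bwd = prop + 0.5 * step * grad_log_post(prop)

    def log_q(x, mean):  # log density of the Gaussian proposal (up to a constant)
        return -np.sum((x - mean) ** 2) / (2.0 * step)

    log_alpha = (log_post(prop) + log_q(theta, mean_bwd)
                 - log_post(theta) - log_q(prop, mean_fwd))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return theta, False

# Toy usage with a standard normal "posterior" as a placeholder target:
log_post = lambda th: -0.5 * np.sum(th ** 2)
grad_log_post = lambda th: -th
theta = np.zeros(3)
for _ in range(1000):
    theta, _ = mala_step(theta, log_post, grad_log_post)
```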
We propose a semiparametric framework for causal inference with right-censored survival outcomes and many weak invalid instruments, motivated by Mendelian randomization in biobank studies where classical methods may fail. We adopt an accelerated failure time model and construct a moment condition based on augmented inverse probability of censoring weighting, incorporating both uncensored and censored observations. Under a heteroscedasticity-based condition on the treatment model, we establish point identification of the causal effect despite censoring and invalid instruments. We propose GEL-NOW (Generalized Empirical Likelihood with Non-Orthogonal and Weak moments) for valid inference under these conditions. A divergent number of Neyman orthogonal nuisance functions is estimated using deep neural networks. A key challenge is that the conditional censoring distribution is a non-Neyman orthogonal nuisance, contributing to the first-order asymptotics of the estimator for the target causal effect parameter. We derive the asymptotic distribution and explicitly incorporate this additional uncertainty into the asymptotic variance formula. We also introduce a censoring-adjusted over-identification test that accounts for this variance component. Simulation studies and UK Biobank applications demonstrate the method's robustness and practical utility.
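As background for the inverse-probability-of-censoring weighting used above, the sketch below builds basic IPCW weights from a Kaplan-Meier estimate of the censoring distribution and plugs them into a weighted log-linear (AFT-style) fit; the paper's augmented moment condition, invalid-instrument handling, and GEL-NOW inference are not reproduced. The function name and the lifelines/statsmodels choices are ours.

```python
import numpy as np
import statsmodels.api as sm
from lifelines import KaplanMeierFitter

def ipcw_aft_fit(y, delta, X):
    """Basic IPCW-weighted log-linear fit.

    y     : observed times (NumPy array, > 0)
    delta : event indicators (1 = event, 0 = censored)
    X     : covariate matrix (including treatment)
    """
    # Kaplan-Meier estimate of the censoring survival function G(t) = P(C > t)
    kmf = KaplanMeierFitter()
    kmf.fit(y, event_observed=1 - delta)
    G_hat = kmf.survival_function_at_times(y).to_numpy()
    G_hat = np.clip(G_hat, 1e-3, None)      # guard against exploding weights

    w = delta / G_hat                       # uncensored observations weighted by 1 / G(T)
    wls = sm.WLS(np.log(y), sm.add_constant(X), weights=w)
    return wls.fit()
```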
The identification of causal effects in observational studies typically relies on two standard assumptions: unconfoundedness and overlap. However, both assumptions are often questionable in practice: unconfoundedness is inherently untestable, and overlap may fail in the presence of extreme unmeasured confounding. While various approaches have been developed to address unmeasured confounding and extreme propensity scores separately, few methods accommodate simultaneous violations of both assumptions. In this paper, we propose a sensitivity analysis framework that relaxes both unconfoundedness and overlap, building upon the marginal sensitivity model. Specifically, we allow the bound on unmeasured confounding to hold for only a subset of the population, thereby accommodating heterogeneity in confounding and allowing treatment probabilities to be zero or one. Moreover, unlike prior work, our approach does not require bounded outcomes and focuses on overlap-weighted average treatment effects, which are both practically meaningful and robust to non-overlap. We develop computationally efficient methods to obtain worst-case bounds via linear programming, and introduce a novel augmented percentile bootstrap procedure for statistical inference. This bootstrap method handles parameters defined through over-identified estimating equations involving unobserved variables and may be of independent interest. Our work provides a unified and flexible framework for sensitivity analysis under violations of both unconfoundedness and overlap.
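To illustrate the linear-programming step in a much simplified form, the sketch below bounds a normalized weighted mean when each weight is only known to lie in an interval, via the standard Charnes-Cooper reformulation of the linear-fractional program; the paper's subset-level sensitivity constraints, overlap weighting, and augmented bootstrap are beyond this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_weighted_mean(y, w_lo, w_hi, maximize=True):
    """Bound  sum(w_i * y_i) / sum(w_i)  over  w_lo_i <= w_i <= w_hi_i.

    Uses the Charnes-Cooper transformation t_i = s * w_i, s = 1 / sum(w_i),
    which turns the linear-fractional program into a linear program.
    """
    n = len(y)
    sign = -1.0 if maximize else 1.0
    c = np.append(sign * np.asarray(y, float), 0.0)   # variables: (t_1..t_n, s)

    # t_i - w_hi_i * s <= 0   and   -t_i + w_lo_i * s <= 0
    A_ub = np.zeros((2 * n, n + 1))
    A_ub[:n, :n] = np.eye(n);   A_ub[:n, n] = -np.asarray(w_hi, float)
    A_ub[n:, :n] = -np.eye(n);  A_ub[n:, n] = np.asarray(w_lo, float)
    b_ub = np.zeros(2 * n)

    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)  # sum_i t_i = 1
    b_eq = [1.0]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n + [(1e-12, None)], method="highs")
    return -res.fun if maximize else res.fun

# Example: inverse-propensity weights only known up to a factor Lambda = 2
rng = np.random.default_rng(0)
y = rng.normal(size=50)
w_hat = 1.0 / rng.uniform(0.2, 0.8, size=50)
lo, hi = w_hat / 2.0, w_hat * 2.0
print(worst_case_weighted_mean(y, lo, hi),
      worst_case_weighted_mean(y, lo, hi, maximize=False))
```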
Beta regression is frequently used when the outcome variable y is bounded within a specific interval, transformed to the (0, 1) domain if necessary. However, standard beta regression cannot handle data observed at the boundary values of 0 or 1, as the likelihood function takes on values of either 0 or infinity. To address this issue, we propose the Scale-Location-Truncated beta (SLTB) regression model, which extends the beta distribution's domain to the [0, 1] interval. By using a scale-location transformation and truncation, the SLTB distribution assigns positive finite mass to the boundary values, offering a flexible approach to handling values at 0 and 1. In this paper, we demonstrate the effectiveness of the SLTB regression model in comparison to standard beta regression models and other approaches like the Zero-One Inflated Beta (ZOIB) mixture model and XBX regression. Using empirical and simulated data, we compare the performance, including predictive accuracy, of the SLTB regression model with that of other methods, particularly in cases with observed boundary data values for y. The SLTB model is shown to offer great flexibility, supporting both linear and nonlinear relationships. Additionally, we implement the SLTB model within maximum likelihood and Bayesian frameworks, employing both hierarchical and non-hierarchical models. These comprehensive implementations demonstrate the broad applicability of the SLTB model for modeling data with bounded values in a variety of contexts.
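One way to read the scale-location-plus-truncation construction is sketched below: a Beta density is rescaled to a slightly wider interval and renormalized over [0, 1], so it is finite and positive at 0 and 1 and boundary observations contribute a finite likelihood. The widening parameter eps and this particular parameterization are illustrative assumptions, not the paper's definition of the SLTB distribution.

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

def sltb_pdf(y, a, b, eps=0.05):
    """Illustrative 'scale-location-truncated' Beta density on [0, 1].

    A Beta(a, b) variable is location-scale mapped to (-eps, 1 + eps) and the
    resulting density is truncated (renormalized) back to [0, 1], so it is
    strictly positive and finite at y = 0 and y = 1.
    """
    width = 1.0 + 2.0 * eps
    base = lambda t: beta.pdf((t + eps) / width, a, b) / width
    norm_const, _ = quad(base, 0.0, 1.0)   # probability mass retained on [0, 1]
    return base(np.asarray(y, float)) / norm_const

# Boundary observations now have finite positive likelihood:
print(sltb_pdf([0.0, 0.5, 1.0], a=2.0, b=3.0))
```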
We make three contributions to conformal prediction. First, we propose fuzzy conformal confidence sets that offer a degree of exclusion, generalizing beyond the binary inclusion/exclusion offered by classical confidence sets. We connect fuzzy confidence sets to e-values to show this degree of exclusion is equivalent to an exclusion at different confidence levels, capturing precisely what e-values bring to conformal prediction. We show that a fuzzy confidence set is a predictive distribution with a more appropriate error guarantee. Second, we derive optimal conformal confidence sets by interpreting the minimization of the expected measure of the confidence set as an optimal testing problem against a particular alternative. We use this to characterize exactly in what sense traditional conformal prediction is optimal. Third, we generalize the inheritance of guarantees by subsequent minimax decisions from confidence sets to fuzzy confidence sets. All our results generalize beyond the exchangeable conformal setting to prediction sets for arbitrary models. In particular, we find that any valid test (e-value) for a hypothesis automatically defines a (fuzzy) prediction confidence set.
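For reference, the sketch below is the classical split-conformal prediction set that the fuzzy sets above generalize; it is the textbook construction, not the paper's fuzzy or e-value version.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Classical split-conformal interval with absolute-residual scores."""
    scores = np.abs(y_cal - model.predict(X_cal))   # calibration nonconformity scores
    n = len(scores)
    # Finite-sample-corrected quantile level ceil((n + 1)(1 - alpha)) / n
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    pred = model.predict(np.atleast_2d(x_new))
    return pred - q, pred + q   # marginal coverage >= 1 - alpha under exchangeability
```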
Spatial two-component mixture models offer a robust framework for analyzing spatially correlated data with zero inflation. To circumvent potential biases introduced by assuming a specific distribution for the response variables, we employ a flexible spatial zero-inflated model. Despite its flexibility, this model poses significant computational challenges, particularly with large datasets, due to the high dimensionality of spatially dependent latent variables, the complexity of matrix operations, and the slow convergence of estimation procedures. To overcome these challenges, we propose a projection-based approach that reduces the dimensionality of the problem by projecting spatially dependent latent variables onto a lower-dimensional space defined by a selected set of basis functions. We further develop an efficient iterative algorithm for parameter estimation, incorporating a generalized estimating equation (GEE) framework. The optimal number of basis functions is determined using Akaike's information criterion (AIC), and the stability of the parameter estimates is assessed using the block jackknife method. The proposed method is validated through a comprehensive simulation study and applied to the analysis of Taiwan's daily rainfall data for 2016, demonstrating its practical utility and effectiveness.
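The dimension-reduction step can be pictured as follows: the n-dimensional spatially dependent latent vector is replaced by K basis-function coefficients. The Gaussian radial basis used below is a generic illustrative choice, not necessarily the basis family selected in the paper.

```python
import numpy as np

def gaussian_basis_matrix(coords, knots, bandwidth):
    """n x K matrix of Gaussian radial basis functions evaluated at coords."""
    d2 = ((coords[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

# The latent field alpha (length n) is represented as Phi @ eta with eta of
# length K << n, so estimation involves K-dimensional rather than
# n-dimensional quantities.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 1, size=(2000, 2))   # observation locations
knots = rng.uniform(0, 1, size=(50, 2))      # K = 50 basis centers
Phi = gaussian_basis_matrix(coords, knots, bandwidth=0.15)
eta = rng.standard_normal(50)
alpha_low_rank = Phi @ eta                   # low-rank representation of the latent field
```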
In this article, we propose a least squares method for the estimation of the transition density in bifurcating Markov models. Unlike kernel estimation, this method does not use a quotient, which can be a source of errors. In order to study the rate of convergence of the least squares estimators, we develop exponential inequalities for empirical processes of bifurcating Markov chains under a bracketing assumption. Unlike for classical processes, we observe that for bifurcating Markov chains the complexity parameter depends on the ergodicity rate, and as a consequence the convergence rate of our estimator is a function of the ergodicity rate. We conclude with a numerical study to validate our theoretical results.
Nonstationary spatial processes can often be represented as stationary processes on a warped spatial domain. Selecting an appropriate spatial warping function for a given application is often difficult and, as a result of this, warping methods have largely been limited to two-dimensional spatial domains. In this paper, we introduce a novel approach to modeling nonstationary, anisotropic spatial processes using neural autoregressive flows (NAFs), a class of invertible mappings capable of generating complex, high-dimensional warpings. Through simulation studies we demonstrate that a NAF-based model has greater representational capacity than other commonly used spatial process models. We apply our proposed modeling framework to a subset of the 3D Argo Floats dataset, highlighting the utility of our framework in real-world applications.
This paper proposes a novel low-rank approximation to the multivariate State-Space Model. The Stochastic Partial Differential Equation (SPDE) approach is applied component-wise to the independent-in-time Matérn Gaussian innovation term in the latent equation, assuming component independence. This results in a sparse representation of the latent process on a finite element mesh, allowing for scalable inference through sparse matrix operations. Dependencies among observed components are introduced through a matrix of weights applied to the latent process. Model parameters are estimated using the Expectation-Maximisation algorithm, which features closed-form updates for most parameters and efficient numerical routines for the remaining parameters. We prove theoretical results regarding the accuracy and convergence of the SPDE-based approximation under fixed-domain asymptotics. Simulation studies corroborate our theoretical results. We include an empirical application on air quality to demonstrate the practical usefulness of the proposed model, which maintains computational efficiency in high-dimensional settings. In this application, we reduce computation time by about 93%, with only a 15% increase in the validation error.
This paper develops a novel framework for estimation theory by introducing a second-order diagnostic for estimator design. While classical analysis focuses on the bias-variance trade-off, we present a more foundational constraint. This result is model-agnostic and domain-agnostic, and is valid for both parametric and non-parametric problems and for both Bayesian and frequentist frameworks. We propose to classify estimators into three primary power regimes. We theoretically establish that any estimator operating in the 'power-dominant regime' incurs an unavoidable mean-squared error penalty, making it structurally prone to sub-optimal performance. We propose a 'safe-zone law' and make this diagnostic intuitive through two safe-zone maps. One map is a geometric visualization analogous to a receiver operating characteristic curve for estimators, and the other map shows that the safe-zone corresponds to a bounded optimization problem, while the forbidden 'power-dominant zone' represents an unbounded optimization landscape. This framework reframes estimator design as a path optimization problem, providing new theoretical underpinnings for regularization and inspiring novel design philosophies.
For two-component load-sharing systems, a doubly-flexible model is developed where the generalized Freund bivariate (GFB) distribution is used for the baseline of the component lifetimes, and the generalized gamma (GG) family of distributions is used to incorporate a shared frailty that captures dependence between the component lifetimes. The proposed model structure results in a very general two-way class of models that enables a researcher to choose an appropriate model for a given two-component load-sharing dataset within the respective families of distributions. The GFB-GG model structure provides better fit to two-component load-sharing systems compared to existing models. Fitting methods for the proposed model, based on direct optimization and an expectation maximization (EM) type algorithm, are discussed. Through simulations, the effectiveness of the fitting methods is demonstrated. Also, through simulations, it is shown that the proposed model serves the intended purpose of model choice for a given two-component load-sharing dataset. A simulated case and an analysis of a real dataset are presented to illustrate the strength of the proposed model.
With multiple outcomes in empirical research, a common strategy is to define a composite outcome as a weighted average of the original outcomes. However, the choices of weights are often subjective and can be controversial. We propose an inverse regression strategy for causal inference with multiple outcomes. The key idea is to regress the treatment on the outcomes, which is the inverse of the standard regression of the outcomes on the treatment. Although this strategy is simple, even seemingly counterintuitive, it has several advantages. First, testing for zero coefficients of the outcomes is equivalent to testing for the null hypothesis of zero effects, even though the inverse regression is, in general, a misspecified working model. Second, the coefficients of the outcomes provide a data-driven choice of the weights for defining a composite outcome. We also discuss the associated inference issues. Third, this strategy is applicable to general study designs. We illustrate the theory in both randomized experiments and observational studies.
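A minimal version of the inverse regression idea: regress the treatment on the outcomes and jointly test the outcome coefficients; the fitted coefficients then suggest data-driven weights (up to scale) for a composite outcome. The plain OLS implementation below is an illustration, not the paper's inference procedure.

```python
import numpy as np
import statsmodels.api as sm

def inverse_regression_test(D, Y):
    """Regress treatment D (n,) on outcomes Y (n, k); test all outcome coefficients = 0."""
    X = sm.add_constant(Y)                         # intercept plus k outcome columns
    fit = sm.OLS(D, X).fit()
    k = Y.shape[1]
    R = np.hstack([np.zeros((k, 1)), np.eye(k)])   # restrictions on the outcome coefficients
    joint = fit.f_test(R)                          # joint test of the null of zero effects
    weights = fit.params[1:]                       # data-driven composite-outcome weights (up to scale)
    return joint.pvalue, weights
```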
With nonignorable nonresponse, an effective method for constructing valid estimators of population parameters is to use a covariate vector, called an instrument, that can be excluded from the nonresponse propensity but remains a useful covariate even when the other covariates are conditioned on. Existing work in this approach assumes that such an instrument is given, which is frequently not the case in applications. In this paper we investigate how to search for an instrument within a given set of covariates. The estimation method we apply is the pseudo likelihood proposed by Tang et al. (2003) and Zhao and Shao (2015), which assumes that an instrument is given, that the distribution of the response given covariates is parametric, and that the propensity is nonparametric. Thus, in addition to the challenge of searching for an instrument, we also need to perform variable and model selection simultaneously. We propose a method for instrument, variable, and model selection and show that, under some regularity conditions, it produces consistent instrument and model selection as the sample size tends to infinity. Empirical results, including two simulation studies and two real examples, are presented to show that the proposed method works well.
Educational policymakers often lack data on student outcomes in regions where standardized tests were not administered. Machine learning techniques can be used to predict unobserved outcomes in target populations by training models on data from a source population. However, differences between the source and target populations, particularly in covariate distributions, can reduce the transportability of these models, potentially reducing predictive accuracy and introducing bias. We propose using double machine learning for a covariate-shift weighted model. First, we estimate the overlap score, namely the probability that an observation belongs to the source dataset given its covariates. Second, balancing weights, defined as the density ratio of target-to-source membership probabilities, are used to reweight the individual observations' contribution to the loss or likelihood function in the target outcome prediction model. This approach downweights source observations that are less similar to the target population, allowing predictions to rely more heavily on observations with greater overlap. As a result, predictions become more generalizable under covariate shift. We illustrate this framework in the context of uncertain data on students' standardized financial literacy scores (FLS). Using Bayesian Additive Regression Trees (BART), we predict missing FLS. We find minimal differences in predictive performance between the weighted and unweighted models, suggesting limited covariate shift in our empirical setting. Nonetheless, the proposed approach provides a principled framework for addressing covariate shift and is broadly applicable to predictive modeling in the social and health sciences, where differences between source and target populations are common.
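A compact sketch of the weighting scheme described above: a membership (overlap) model estimates the probability of belonging to the source data, and the target-to-source probability ratio reweights source observations when fitting the outcome model. Generic scikit-learn components stand in for the paper's cross-fitted double machine learning and BART steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def covariate_shift_weighted_fit(X_src, y_src, X_tgt):
    # 1. Overlap score: P(source | x) from a source-vs-target membership model
    X_all = np.vstack([X_src, X_tgt])
    member = np.r_[np.ones(len(X_src)), np.zeros(len(X_tgt))]   # 1 = source, 0 = target
    clf = LogisticRegression(max_iter=1000).fit(X_all, member)
    p_src = np.clip(clf.predict_proba(X_src)[:, 1], 1e-3, 1 - 1e-3)

    # 2. Balancing weights: density ratio P(target | x) / P(source | x)
    w = (1.0 - p_src) / p_src

    # 3. Outcome model fit on the source data with the balancing weights
    model = GradientBoostingRegressor().fit(X_src, y_src, sample_weight=w)
    return model, w
```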
The Cox proportional hazards model is the most widely used regression model in univariate survival analysis. Extensions of the Cox model to bivariate survival data, however, remain scarce. We propose two novel extensions based on a Lehmann-type representation of the survival function. The first, the simple Lehmann model, is a direct extension that retains a straightforward structure. The second, the generalized Lehmann model, allows greater flexibility by incorporating three distinct regression parameters and includes the simple Lehmann model as a special case. For both models, we derive the corresponding regression formulations for the three bivariate hazard functions and discuss their interpretation and model validity. To estimate the regression parameters, we adopt a bivariate pseudo-observations approach. For the generalized Lehmann model, we extend this approach to accommodate a trivariate structure: trivariate pseudo-observations and a trivariate link function. We then propose a two-step estimation procedure, where the marginal regression parameters are estimated in the first step, and the remaining parameters are estimated in the second step. Finally, we establish the consistency and asymptotic normality of the resulting estimators.
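To make the pseudo-observation ingredient concrete, the sketch below computes standard jackknife pseudo-values of a marginal Kaplan-Meier survival probability at a fixed time point (using lifelines); the bivariate and trivariate pseudo-observations and link functions of the proposed models are not shown.

```python
import numpy as np
from lifelines import KaplanMeierFitter

def km_pseudo_observations(time, event, t0):
    """Jackknife pseudo-values of S(t0) based on the Kaplan-Meier estimator.

    time, event : NumPy arrays of observed times and event indicators
    """
    n = len(time)
    kmf = KaplanMeierFitter().fit(time, event_observed=event)
    s_full = kmf.survival_function_at_times(t0).iloc[0]

    pseudo = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        kmf_i = KaplanMeierFitter().fit(time[mask], event_observed=event[mask])
        s_loo = kmf_i.survival_function_at_times(t0).iloc[0]
        pseudo[i] = n * s_full - (n - 1) * s_loo   # i-th pseudo-observation
    return pseudo   # usable as responses in a link-function (GEE-type) regression
```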
Maximizing statistical power in experimental design often involves imbalanced treatment allocation, but several challenges hinder its practical adoption: (1) the misconception that equal allocation always maximizes power, (2) when only targeting maximum power, more than half of the participants may be expected to receive the inferior treatment, and (3) response-adaptive randomization (RAR) targeting maximum statistical power may inflate type I error rates substantially. Recent work identified issue (3) and proposed a novel allocation procedure combined with the asymptotic score test. Instead, the current research focuses on finite-sample guarantees. First, we analyze the power for traditional power-maximizing RAR procedures under exact tests, including a novel generalization of Boschloo's test. Second, we evaluate constrained Markov decision process (CMDP) RAR procedures under exact tests. These procedures target maximum average power under constraints on pointwise and average type I error rates, with averages taken across the parameter space. A combination of the unconditional exact test and the CMDP procedure protecting allocations to the superior arm gives the best performance, providing substantial power gains over equal allocation while allocating more participants in expectation to the superior treatment. Future research could focus on the randomization test, in which CMDP procedures exhibited lower power compared to other examined RAR procedures.
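For reference on the exact-testing side, the snippet below runs SciPy's implementation of Boschloo's unconditional exact test on a 2x2 arm-by-outcome table; the paper's generalization of Boschloo's test and the CMDP allocation procedures are not available in SciPy and are not shown. The counts are made up for illustration.

```python
import numpy as np
from scipy.stats import boschloo_exact

# Rows: treatment arms; columns: successes / failures after a (possibly unequal) allocation
table = np.array([[14, 6],    # arm A: 14 successes out of 20
                  [ 7, 13]])  # arm B:  7 successes out of 20
res = boschloo_exact(table, alternative="two-sided")
print(res.statistic, res.pvalue)
```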
In a coherent reliability system composed of multiple components configured according to a specific structure function, the distribution of system time to failure, or system lifetime, is often of primary interest. Accurate estimation of system reliability is critical in a wide range of engineering and industrial applications, informing decisions in system design, maintenance planning, and risk assessment. The system lifetime distribution can be estimated directly using the observed system failure times. However, when component-level lifetime data is available, it can yield improved estimates of system reliability. In this work, we demonstrate that under nonparametric assumptions about the component time-to-failure distributions, traditional estimators such as the Product-Limit Estimator (PLE) can be further improved under specific loss functions. We propose a novel methodology that enhances the nonparametric system reliability estimation through a shrinkage transformation applied to component-level estimators. This shrinkage approach leads to improved efficiency in estimating system reliability.
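The flavor of the component-to-system construction is sketched below for a series system: component survival curves are estimated nonparametrically (Kaplan-Meier, i.e. the product-limit estimator) and multiplied to obtain system reliability, with a simple linear shrinkage toward a pooled curve standing in as a generic placeholder for the paper's shrinkage transformation.

```python
import numpy as np
from lifelines import KaplanMeierFitter

def series_system_reliability(component_data, t_grid, lam=0.9):
    """Shrunken nonparametric estimate of a series-system survival curve.

    component_data : list of (time, event) NumPy-array pairs, one per component
    lam            : shrinkage weight toward the pooled survival curve
                     (a generic stand-in for the paper's transformation)
    """
    # Pooled curve used as the shrinkage target
    all_t = np.concatenate([t for t, _ in component_data])
    all_e = np.concatenate([e for _, e in component_data])
    pooled = KaplanMeierFitter().fit(all_t, all_e).survival_function_at_times(t_grid).to_numpy()

    system = np.ones_like(t_grid, dtype=float)
    for t, e in component_data:
        km = KaplanMeierFitter().fit(t, e).survival_function_at_times(t_grid).to_numpy()
        shrunk = lam * km + (1 - lam) * pooled   # shrink each component-level estimator
        system *= shrunk                          # series structure: product of component survivals
    return system
```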