Continuous soil-moisture measurements provide a direct lens on subsurface hydrological processes, notably the post-rainfall "drydown" phase. Because these records consist of distinct, segment-specific behaviours whose forms and scales vary over time, realistic inference demands a model that captures piecewise dynamics while accommodating parameters that are unknown a priori. Building on Bayesian Online Changepoint Detection (BOCPD), we introduce two complementary extensions: a particle-filter variant that substitutes exact marginalisation with sequential Monte Carlo to enable real-time inference when critical parameters cannot be integrated out analytically, and an online-gradient variant that embeds stochastic gradient updates within BOCPD to learn application-relevant parameters on the fly without prohibitive computational cost. After validating both algorithms on synthetic data that replicate the temporal structure of field observations (detailing hyperparameter choices, priors, and cost-saving strategies), we apply them to soil-moisture series from experimental sites in Austria and the United States, quantifying site-specific drydown rates and demonstrating the advantages of our adaptive framework over static models.
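For orientation, here is a minimal sketch of the standard BOCPD run-length recursion that both proposed variants build on, assuming a Gaussian observation model with known variance and a constant hazard rate; the particle-filter and online-gradient extensions themselves are not reproduced, and all hyperparameter values are illustrative.

    import numpy as np
    from scipy import stats

    def bocpd_gaussian(x, hazard=1 / 100, mu0=0.0, kappa0=1.0, sigma2=1.0):
        """Run-length posterior after each observation (known-variance Gaussian model)."""
        T = len(x)
        R = np.zeros((T + 1, T + 1))
        R[0, 0] = 1.0
        mu, kappa = np.array([mu0]), np.array([kappa0])
        for t, xt in enumerate(x):
            # Predictive density of x_t under each current run length.
            pred = stats.norm.pdf(xt, mu, np.sqrt(sigma2 * (1 + 1 / kappa)))
            growth = R[t, : t + 1] * pred * (1 - hazard)   # run continues
            cp = np.sum(R[t, : t + 1] * pred * hazard)     # changepoint occurs
            R[t + 1, 1 : t + 2] = growth
            R[t + 1, 0] = cp
            R[t + 1] /= R[t + 1].sum()
            # Conjugate update of the per-run-length posterior mean.
            mu = np.append(mu0, (kappa * mu + xt) / (kappa + 1))
            kappa = np.append(kappa0, kappa + 1)
        return R

    # Example: a series with a mean shift halfway through.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
    R = bocpd_gaussian(x)
    print(R[-1].argmax())  # most probable current run length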
We extended Wikle's Bayesian hierarchical model based on a diffusion-reaction equation [Wikle, 2003] to investigate the spatio-temporal spread of COVID-19 across the USA from March 2020 to February 2022. Our model incorporates an advection term to account for the intra-state spread trend. We applied a Markov chain Monte Carlo (MCMC) method to obtain samples from the posterior distribution of the parameters, and implemented the approach on COVID-19 infection counts across the states over time collected by the New York Times. Our analysis shows that the approach is robust to model misspecification to a certain extent and outperforms several other approaches in simulation settings. The results confirm that the diffusion rate is heterogeneous across the USA, and that both the growth rate and the advection velocity are time-varying.
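A representative form of the dynamics described here, in our notation rather than the paper's: $u(\mathbf{s},t)$ is the infection intensity at location $\mathbf{s}$ and time $t$, $\delta(\mathbf{s})$ a spatially heterogeneous diffusion rate, $\mathbf{v}(t)$ a time-varying advection velocity, and $\gamma(t)$ a time-varying growth rate.

\[
\frac{\partial u(\mathbf{s},t)}{\partial t}
\;=\;
\nabla \cdot \big( \delta(\mathbf{s})\, \nabla u(\mathbf{s},t) \big)
\;-\;
\mathbf{v}(t) \cdot \nabla u(\mathbf{s},t)
\;+\;
\gamma(t)\, u(\mathbf{s},t).
\]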
This paper explores the impact of tournament design on the incentives of the contestants. We develop a simulation framework to quantify the potential gain and loss from attacking, based on changes in the probability of reaching critical ranking thresholds. The model is applied to investigate the 2024/25 UEFA Champions League reform. The novel incomplete round-robin league phase is found to create more powerful incentives for offensive play than the previous group stage, with an average increase of 119\% (58\%) with respect to the first (second) prize. Our study provides the first demonstration that the tournament format itself can strongly influence team behaviour in sports.
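An illustrative Monte Carlo sketch of the incentive measure described above: compare the probability of clearing a ranking threshold after an attacking match (higher win and loss probabilities, fewer draws) versus a cautious one. All probabilities, point totals, and rival distributions below are invented for illustration and are not the paper's calibrated model.

    import numpy as np

    rng = np.random.default_rng(1)

    def finish_probability(p_win, p_draw, current_pts, rival_pts_sim, threshold_rank, n_sim=100_000):
        """P(final points rank inside the threshold against simulated rival totals)."""
        outcomes = rng.choice([3, 1, 0], size=n_sim, p=[p_win, p_draw, 1 - p_win - p_draw])
        final_pts = current_pts + outcomes
        rank = 1 + np.sum(rival_pts_sim > final_pts[:, None], axis=1)
        return np.mean(rank <= threshold_rank)

    n_sim, n_rivals = 100_000, 35
    rival_pts_sim = rng.normal(11, 3, size=(n_sim, n_rivals))  # hypothetical rival totals

    p_cautious = finish_probability(0.35, 0.40, 10, rival_pts_sim, threshold_rank=8)
    p_attack   = finish_probability(0.45, 0.20, 10, rival_pts_sim, threshold_rank=8)
    print(f"gain from attacking: {p_attack - p_cautious:+.3f}")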
ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECGFounder, a state-of-the-art ECG foundation model pre-trained on over 7 million ECG recordings. Experiments on the PTB-XL benchmark show that our approach improves the baseline fine-tuning strategy by 1.2%-3.3% in macro AUROC and 5.3%-20.9% in macro AUPRC. Additionally, our method outperforms several recent state-of-the-art approaches, including task-specific and advanced architectures. Further evaluation reveals that our method is more stable and sample-efficient compared to the baseline, achieving a 9.1% improvement in macro AUROC and a 34.9% improvement in macro AUPRC using just 10% of the training data. Ablation studies identify key components, such as stochastic depth and preview linear probing, that contribute to the enhanced performance. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain.
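A hedged sketch of the kind of two-stage post-training described above, with a toy residual 1-D CNN standing in for ECGFounder: a short linear-probing phase with the backbone frozen, followed by full fine-tuning with stochastic depth on the residual blocks. The architecture, layer names, and hyperparameters are placeholders, not the paper's.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim, drop_prob=0.1):
            super().__init__()
            self.body = nn.Sequential(nn.Conv1d(dim, dim, 7, padding=3),
                                      nn.BatchNorm1d(dim), nn.ReLU(),
                                      nn.Conv1d(dim, dim, 7, padding=3))
            self.drop_prob = drop_prob

        def forward(self, x):
            # Stochastic depth: randomly skip the block during training
            # (test-time rescaling is omitted in this simplified sketch).
            if self.training and torch.rand(1).item() < self.drop_prob:
                return x
            return x + self.body(x)

    class ToyECGNet(nn.Module):
        def __init__(self, n_leads=12, dim=64, n_classes=5):
            super().__init__()
            self.stem = nn.Conv1d(n_leads, dim, 15, stride=2, padding=7)
            self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(4)])
            self.head = nn.Linear(dim, n_classes)

        def forward(self, x):
            h = self.blocks(self.stem(x)).mean(dim=-1)
            return self.head(h)

    model = ToyECGNet()
    x = torch.randn(8, 12, 1000)                   # fake 12-lead ECG batch
    y = torch.randint(0, 2, (8, 5)).float()        # fake multi-label targets
    loss_fn = nn.BCEWithLogitsLoss()

    # Stage 1: linear probing -- train only the classification head.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():
        p.requires_grad = True
    opt = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
    for _ in range(5):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    # Stage 2: full fine-tuning with stochastic depth active.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(5):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()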
This paper proposes a novel low-rank approximation to the multivariate state-space model. The Stochastic Partial Differential Equation (SPDE) approach is applied component-wise to the independent-in-time Mat\'ern Gaussian innovation term in the latent equation, assuming component independence. This results in a sparse representation of the latent process on a finite element mesh, allowing for scalable inference through sparse matrix operations. Dependencies among observed components are introduced through a matrix of weights applied to the latent process. Model parameters are estimated using the Expectation-Maximisation algorithm, which features closed-form updates for most parameters and efficient numerical routines for the remainder. We prove theoretical results regarding the accuracy and convergence of the SPDE-based approximation under fixed-domain asymptotics, and simulation studies corroborate these results. We include an empirical application on air quality to demonstrate the practical usefulness of the proposed model, which maintains computational efficiency in high-dimensional settings. In this application, we reduce computation time by about 93%, with only a 15% increase in the validation error.
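For concreteness, a generic form of the state-space structure described here (notation ours; the paper's exact parameterisation may differ), where $\boldsymbol{\Phi}$ is the matrix of weights linking the observed components to the latent process evaluated at the finite-element mesh nodes, and $\mathbf{Q}$ is the sparse precision implied by the component-wise SPDE representation of the Mat\'ern innovations:

\[
\mathbf{y}_t = \boldsymbol{\Phi}\, \mathbf{x}_t + \boldsymbol{\varepsilon}_t,
\qquad
\mathbf{x}_t = \mathbf{G}\, \mathbf{x}_{t-1} + \boldsymbol{\eta}_t,
\qquad
\boldsymbol{\eta}_t \sim \mathrm{N}\!\left(\mathbf{0},\, \mathbf{Q}^{-1}\right).
\]

The sparsity of $\mathbf{Q}$ is what keeps Kalman-type filtering and the EM updates scalable in high-dimensional settings.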
A detailed analysis of precipitation data over Europe is presented, with a focus on interpolation and forecasting applications. A Spatio-temporal DeepKriging (STDK) framework has been implemented using the PyTorch platform to achieve these objectives. The proposed model is capable of handling spatio-temporal irregularities while generating high-resolution interpolations and multi-step forecasts. Reproducible code modules have been developed as standalone PyTorch implementations for the interpolation\footnote[2]{Interpolation - https://github.com/pratiknag/Spatio-temporalDeepKriging-Pytorch.git} and forecasting\footnote[3]{Forecasting - https://github.com/pratiknag/pytorch-convlstm.git}, facilitating broader application to similar climate datasets. The effectiveness of this approach is demonstrated through extensive evaluation on daily precipitation measurements, highlighting predictive performance and robustness.
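A hedged PyTorch sketch of the DeepKriging idea underlying the STDK framework: coordinates are embedded with a fixed set of radial basis functions and mapped to the response by a feed-forward network. For brevity the sketch is purely spatial, and the knot layout, bandwidth, and network sizes are illustrative; the linked repositories contain the authors' actual spatio-temporal implementations.

    import torch
    import torch.nn as nn

    def rbf_embedding(coords, knots, bandwidth=0.1):
        """Gaussian radial-basis features of 2-D coordinates at fixed knots."""
        d2 = ((coords[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
        return torch.exp(-d2 / (2 * bandwidth ** 2))

    class DeepKriging(nn.Module):
        def __init__(self, n_basis, hidden=100):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_basis, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, features):
            return self.net(features).squeeze(-1)

    # Toy interpolation example on the unit square.
    torch.manual_seed(0)
    knots = torch.cartesian_prod(torch.linspace(0, 1, 10), torch.linspace(0, 1, 10))
    coords = torch.rand(500, 2)
    y = torch.sin(4 * coords[:, 0]) * torch.cos(4 * coords[:, 1])

    model = DeepKriging(n_basis=knots.shape[0])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    phi = rbf_embedding(coords, knots)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(phi), y)
        loss.backward()
        opt.step()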
Distributed scatterers in InSAR (DS-InSAR) processing are essential for retrieving surface deformation in areas lacking strong point targets. Conventional workflows typically involve selecting statistically homogeneous pixels based on amplitude similarity, followed by phase estimation under the complex circular Gaussian (CCG) model. However, amplitude statistics primarily reflect the backscattering strength of surface targets and may not sufficiently capture differences in decorrelation behavior. For example, when distinct scatterers exhibit similar backscatter strength but differ in coherence, amplitude-based selection methods may fail to differentiate them. Moreover, CCG-based phase estimators may lack robustness and suffer performance degradation under non-Rayleigh amplitude fluctuations. Centered around scale-invariant second-order statistics, we propose ``Shape-to-Scale,'' a novel DS-InSAR framework. We first identify pixels that share a common angular scattering structure (``shape statistically homogeneous pixels'') with an angular consistency adaptive filter: a parametric selection method based on the complex angular central Gaussian distribution. Then, we introduce a complex generalized Gaussian-based phase estimation approach that is robust to potential non-Rayleigh scattering. Experiments on both simulated and real SAR datasets show that the proposed framework improves coherence structure clustering and enhances phase estimation robustness. This work provides a unified and physically interpretable strategy for DS-InSAR processing and offers new insights for high-resolution SAR time series analysis.
For two-component load-sharing systems, a doubly flexible model is developed in which the generalized Freund bivariate (GFB) distribution is used for the baseline of the component lifetimes, and the generalized gamma (GG) family of distributions is used to incorporate a shared frailty that captures dependence between the component lifetimes. The proposed model structure results in a very general two-way class of models that enables a researcher to choose an appropriate model for given two-component load-sharing data within the respective families of distributions. The GFB-GG model structure provides a better fit to two-component load-sharing systems compared to existing models. Fitting methods for the proposed model, based on direct optimization and an expectation-maximization (EM) type algorithm, are discussed. Through simulations, the effectiveness of the fitting methods is demonstrated. Also through simulations, it is shown that the proposed model serves the intended purpose of model choice for given two-component load-sharing data. A simulated case and an analysis of a real dataset are presented to illustrate the strength of the proposed model.
Educational policymakers often lack data on student outcomes in regions where standardized tests were not administered. Machine learning techniques can be used to predict unobserved outcomes in target populations by training models on data from a source population. However, differences between the source and target populations, particularly in covariate distributions, can reduce the transportability of these models, potentially reducing predictive accuracy and introducing bias. We propose using double machine learning for a covariate-shift weighted model. First, we estimate the overlap score, namely the probability that an observation belongs to the source dataset given its covariates. Second, balancing weights, defined as the density ratio of target-to-source membership probabilities, are used to reweight each observation's contribution to the loss or likelihood function in the target outcome prediction model. This approach downweights source observations that are less similar to the target population, allowing predictions to rely more heavily on observations with greater overlap. As a result, predictions become more generalizable under covariate shift. We illustrate this framework in the context of uncertain data on students' standardized financial literacy scores (FLS). Using Bayesian Additive Regression Trees (BART), we predict missing FLS. We find minimal differences in predictive performance between the weighted and unweighted models, suggesting limited covariate shift in our empirical setting. Nonetheless, the proposed approach provides a principled framework for addressing covariate shift and is broadly applicable to predictive modeling in the social and health sciences, where differences between source and target populations are common.
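A minimal sketch of the weighting scheme described above, with off-the-shelf gradient boosting standing in for the paper's components (BART is not used here): a source-versus-target classifier yields the overlap score, its target-to-source density ratio becomes an observation weight, and the outcome model is fit on the reweighted source data.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X_source = rng.normal(0.0, 1.0, size=(1000, 5))
    X_target = rng.normal(0.5, 1.0, size=(800, 5))        # covariate-shifted target
    y_source = X_source[:, 0] + rng.normal(0, 0.5, 1000)

    # Step 1: overlap score P(source | x) from a source-vs-target classifier.
    X_pool = np.vstack([X_source, X_target])
    s_pool = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])
    clf = GradientBoostingClassifier().fit(X_pool, s_pool)
    p_source = np.clip(clf.predict_proba(X_source)[:, 1], 1e-3, 1 - 1e-3)

    # Step 2: balancing weights = P(target | x) / P(source | x) for source rows.
    weights = (1 - p_source) / p_source

    # Step 3: fit the outcome model with these weights, then predict the target.
    outcome = GradientBoostingRegressor().fit(X_source, y_source, sample_weight=weights)
    y_target_pred = outcome.predict(X_target)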
As researchers increasingly rely on machine learning models and LLMs to annotate unstructured data, such as texts or images, various approaches have been proposed to correct bias in downstream statistical analyses. However, existing methods tend to yield large standard errors and require some error-free human annotation. In this paper, I introduce Surrogate Representation Inference (SRI), which assumes that unstructured data fully mediate the relationship between human annotations and structured variables. This assumption is guaranteed by design provided that human coders rely only on the unstructured data for annotation. Under this setting, I propose a neural network architecture that learns a low-dimensional representation of unstructured data such that the surrogate assumption is still satisfied. When multiple human annotations are available, SRI can further correct non-differential measurement errors that may exist in human annotations. Focusing on text-as-outcome settings, I formally establish the identification conditions and semiparametric efficient estimation strategies that enable learning and leveraging such a low-dimensional representation. Simulation studies and a real-world application demonstrate that SRI reduces standard errors by over 50% when machine learning prediction accuracy is moderate, and provides valid inference even when human annotations contain non-differential measurement errors.
Preferential attachment is often suggested to be the underlying mechanism of the growth of a network, largely because many real networks are, to a certain extent, scale-free. However, such attribution is usually made under debatable practices of determining scale-freeness, and when only snapshots of the degree distribution are observed. In the presence of the evolution history of the network, modelling the increments of the evolution allows us to measure preferential attachment directly. Therefore, we propose a generalised linear model for this purpose, in which the in-degrees and their increments are the covariate and the response, respectively. Not only are the parameters that describe the preferential attachment directly incorporated, but they also ensure that the tail heaviness of the asymptotic degree distribution is realistic. The Bayesian approach to inference enables the hierarchical version of the model to be implemented naturally. The application to the dependency network of R packages reveals subtly different behaviours between new dependencies added by new and by existing packages, and between the addition and removal of dependencies.
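An illustrative, simplified version of the measurement idea: regress in-degree increments on the in-degree at the start of the window via a Poisson GLM with log link, so the coefficient on log(in-degree + 1) acts as a preferential-attachment exponent. This is not the paper's exact hierarchical, tail-aware specification, and the data below are synthetic.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    in_degree = rng.integers(0, 200, size=2000)
    # Synthetic increments with attachment roughly proportional to degree^0.8.
    increments = rng.poisson(0.02 * (in_degree + 1) ** 0.8)

    X = sm.add_constant(np.log(in_degree + 1))
    fit = sm.GLM(increments, X, family=sm.families.Poisson()).fit()
    print(fit.params)   # the slope estimates the attachment exponent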
Fairness and interpretability play an important role in the adoption of decision-making algorithms across many application domains. These requirements are intended to avoid undesirable group differences and to alleviate concerns related to transparency. This paper proposes a framework that integrates fairness and interpretability into algorithmic decision making by combining data transformation with policy trees, a class of interpretable policy functions. The approach is based on pre-processing the data to remove dependencies between sensitive attributes and decision-relevant features, followed by a tree-based optimization to obtain the policy. Since data pre-processing compromises interpretability, an additional transformation maps the parameters of the resulting tree back to the original feature space. This procedure enhances fairness by yielding policy allocations that are pairwise independent of sensitive attributes, without sacrificing interpretability. Using administrative data from Switzerland to analyze the allocation of unemployed individuals to active labor market programs (ALMP), the framework is shown to perform well in a realistic policy setting. Effects of integrating fairness and interpretability constraints are measured through the change in expected employment outcomes. The results indicate that, for this particular application, fairness can be substantially improved at relatively low cost.
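A hedged sketch of the pre-processing step, with an ordinary decision tree standing in for the policy tree: each decision-relevant feature is residualised on the sensitive attribute so that the tree's inputs are (approximately, and here only linearly) independent of it. The policy-tree optimisation and the mapping of tree parameters back to the original feature space are not reproduced, and all data are synthetic.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    sensitive = rng.integers(0, 2, size=2000)                   # binary group indicator
    X = rng.normal(size=(2000, 4)) + 0.6 * sensitive[:, None]   # features correlated with it
    treatment = rng.integers(0, 3, size=2000)                   # observed programme assignment

    # Remove the linear dependence of each feature on the sensitive attribute.
    s = sensitive.reshape(-1, 1)
    X_clean = X - LinearRegression().fit(s, X).predict(s)

    # Fit an interpretable tree on the cleaned features (stand-in for a policy tree).
    tree = DecisionTreeClassifier(max_depth=3).fit(X_clean, treatment)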
It is often of interest to test a global null hypothesis using multiple, possibly dependent, $p$-values by combining their strengths while controlling the Type I error. Recently, several heavy-tailed combination tests, such as the harmonic mean test and the Cauchy combination test, have been proposed: they map $p$-values into heavy-tailed random variables before combining them in some fashion into a single test statistic. The resulting tests, which are calibrated under the assumption of independence of the $p$-values, have been shown to be rather robust to dependence. A complete understanding of the calibration properties of the resulting combination tests of dependent and possibly tail-dependent $p$-values has remained an important open problem in the area. In this work, we show that the powerful framework of multivariate regular variation (MRV) offers a nearly complete solution to this problem. We first show that the precise asymptotic calibration properties of a large class of homogeneous combination tests can be expressed in terms of the angular measure, a characteristic of the asymptotic tail-dependence under MRV. Consequently, we show that under MRV, the Pareto-type linear combination tests, which are equivalent to the harmonic mean test, are universally calibrated regardless of the tail-dependence structure of the underlying $p$-values. In contrast, the popular Cauchy combination test is shown to be universally honest but often conservative; the Tippett combination test, while being honest, is calibrated if and only if the underlying $p$-values are tail-independent. One of our major findings is that the Pareto-type linear combination tests are the only universally calibrated ones among the large family of possibly non-linear homogeneous heavy-tailed combination tests.
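Concrete equal-weight forms of two of the combination tests discussed above. The Cauchy combination statistic is referred to the standard Cauchy tail for an approximate global p-value; the harmonic mean of the p-values is shown raw and would normally be calibrated via its Pareto-type asymptotics before use.

    import numpy as np
    from scipy import stats

    def cauchy_combination(pvals):
        """T = mean of tan((1/2 - p_i) * pi); approximate p-value from the Cauchy tail."""
        t = np.mean(np.tan((0.5 - np.asarray(pvals)) * np.pi))
        return stats.cauchy.sf(t)

    def harmonic_mean(pvals):
        """Equal-weight harmonic mean of the p-values (uncalibrated)."""
        pvals = np.asarray(pvals)
        return len(pvals) / np.sum(1.0 / pvals)

    pvals = [0.001, 0.2, 0.5, 0.8]
    print(cauchy_combination(pvals), harmonic_mean(pvals))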
Ambient air pollution poses significant health and environmental challenges. Exposure to high concentrations of PM$_{2.5}$ has been linked to increased respiratory and cardiovascular hospital admissions, more emergency department visits, and deaths. Traditional air quality monitoring systems such as EPA-certified stations provide limited spatial and temporal coverage. The advent of low-cost sensors has dramatically improved the granularity of air quality data, enabling real-time, high-resolution monitoring. This study exploits the extensive data from PurpleAir sensors to assess and compare the effectiveness of various statistical and machine learning models in producing accurate hourly PM$_{2.5}$ maps across California. We evaluate traditional geostatistical methods, including kriging and land use regression, against advanced machine learning approaches such as neural networks, random forests, and support vector machines, as well as ensemble models. We enhanced the predictive accuracy of PM$_{2.5}$ concentration estimates by correcting the bias in PurpleAir data with an ensemble model that incorporates both spatiotemporal dependencies and machine learning components.
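A hedged sketch of the machine-learning side of such a comparison: a stacked ensemble of a random forest, a support vector machine, and a small neural network fit to placeholder features. The kriging and land-use-regression baselines and the real PurpleAir/EPA data are not reproduced here.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, StackingRegressor
    from sklearn.svm import SVR
    from sklearn.neural_network import MLPRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    # Placeholder design matrix: e.g. sensor PM2.5, humidity, temperature, lon, lat, hour.
    X = rng.normal(size=(2000, 6))
    y = 0.6 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, 2000)  # synthetic reference PM2.5

    ensemble = StackingRegressor(
        estimators=[("rf", RandomForestRegressor(n_estimators=200)),
                    ("svm", SVR(C=1.0)),
                    ("nn", MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500))],
        final_estimator=LinearRegression())
    ensemble.fit(X, y)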
Land-atmosphere coupling is an important process for correctly modelling near-surface temperature profiles, but it involves various uncertainties due to subgrid-scale processes, such as turbulent fluxes or unresolved surface heterogeneities, suggesting a probabilistic modelling approach. We develop a copula Bayesian network (CBN) to interpolate temperature profiles, acting as an alternative to the 2 m temperature (T2m) diagnostics used in numerical weather prediction (NWP) systems. The new CBN results in (1) a reduction of the warm bias inherent to NWP predictions of wintertime stable boundary layers, allowing cold temperature extremes to be better represented, and (2) consideration of the uncertainty associated with subgrid-scale spatial variability. The use of CBNs combines the advantages of uncertainty propagation inherent to Bayesian networks with the ability to model complex dependence structures between random variables through copulas. By combining insights from copula modelling and information entropy, criteria for the applicability of CBNs in the further development of parameterizations in NWP models are derived.
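An illustrative two-node example of the copula machinery a CBN relies on: with an assumed Gaussian copula between a model-level temperature and T2m, the conditional distribution of T2m given an observed model-level value is sampled on the copula scale and back-transformed through a placeholder marginal. The marginals, correlation, and node structure below are invented for illustration and are not those of the paper.

    import numpy as np
    from scipy import stats

    rho = 0.9                                   # assumed copula correlation
    t_model = 268.0                             # observed model-level temperature (K)
    marg_model = stats.norm(270.0, 4.0)         # placeholder marginals
    marg_t2m = stats.norm(268.0, 5.0)

    # Transform the conditioning value to the standard-normal (copula) scale.
    z_model = stats.norm.ppf(marg_model.cdf(t_model))

    # Gaussian-copula conditional: Z_t2m | Z_model = z  ~  N(rho * z, 1 - rho**2).
    z_t2m = stats.norm.rvs(loc=rho * z_model, scale=np.sqrt(1 - rho ** 2),
                           size=10_000, random_state=0)

    # Back-transform to the T2m marginal to obtain a predictive ensemble.
    t2m_samples = marg_t2m.ppf(stats.norm.cdf(z_t2m))
    print(t2m_samples.mean(), np.percentile(t2m_samples, [5, 95]))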
The problem of identifying statistically significant inferences about the structure of a graphical model is considered, along with the related task of constructing a confidence set for the graphical model. It is proven that the procedure for constructing such a set is equivalent to the procedure for simultaneous testing of hypotheses and alternatives regarding the composition of the graphical model. Some variants of the simultaneous testing of hypotheses and alternatives are discussed. It is shown that, under the condition of free combination of hypotheses and alternatives, a simple generalization of the closure method leads to single-step procedures for simultaneous testing of hypotheses and alternatives. The structure of the confidence set for the graphical model is analyzed, demonstrating how the confidence set leads to a separation of inferences about the graphical model into statistically significant and insignificant categories, or into an area of uncertainty. General results are detailed by analyzing confidence sets for undirected Gaussian graphical model selection. Examples are provided that illustrate the separation of inferences about the composition of undirected Gaussian graphical models into significant results and areas of uncertainty, and a comparison is made with known results obtained using the SINful approach to undirected Gaussian graphical model selection.
This study develops and evaluates a novel hybrid Wavelet SARIMA Transformer (WST) framework to forecast monthly rainfall across five meteorological subdivisions of Northeast India over the 1971 to 2023 period. The approach employs the Maximal Overlap Discrete Wavelet Transform (MODWT) with four wavelet families (Haar, Daubechies, Symlet, and Coiflet) to achieve a shift-invariant, multiresolution decomposition of the rainfall series. Linear and seasonal components are modeled using seasonal ARIMA (SARIMA), nonlinear components are modeled by a Transformer network, and forecasts are reconstructed via the inverse MODWT. Comprehensive validation using an 80:20 train-test split and multiple performance indices (RMSE, MAE, SMAPE, Willmott's d, Skill Score, Percent Bias, Explained Variance, and Legates-McCabe's E1) demonstrates the superiority of the Haar-based hybrid model (WHST). Across all subdivisions, WHST consistently achieved lower forecast errors, stronger agreement with observed rainfall, and less biased predictions than stand-alone SARIMA, a stand-alone Transformer, and two-stage wavelet hybrids. Residual adequacy was confirmed through the Ljung-Box test, while Taylor diagrams provided an integrated assessment of correlation, variance fidelity, and RMSE, further reinforcing the robustness of the proposed approach. The results highlight the effectiveness of integrating multiresolution signal decomposition with complementary linear and deep learning models for hydroclimatic forecasting. Beyond rainfall, the proposed WST framework offers a scalable methodology for forecasting complex environmental time series, with direct implications for flood risk management, water resources planning, and climate adaptation strategies in data-sparse and climate-sensitive regions.
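A hedged sketch of the decomposition-then-model pipeline, with PyWavelets' stationary wavelet transform standing in for MODWT (both are undecimated and shift-invariant) and a SARIMA fit to a smooth component; the Transformer stage for the detail components and the inverse-transform reconstruction are only indicated in comments. The toy series and model orders are illustrative.

    import numpy as np
    import pywt
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(0)
    n = 512
    t = np.arange(n)
    rain = 50 + 30 * np.sin(2 * np.pi * t / 12) + rng.gamma(2.0, 5.0, n)  # toy monthly series

    # Shift-invariant multiresolution decomposition (Haar), analogous to MODWT.
    coeffs = pywt.swt(rain, "haar", level=2)   # list of (approximation, detail) pairs per level
    approx = coeffs[0][0]                      # a smooth approximation component
    details = [cD for _, cD in coeffs]         # noisier detail components

    # Seasonal ARIMA for the smooth component (the "linear/seasonal" part).
    sarima = SARIMAX(approx, order=(1, 0, 1), seasonal_order=(1, 0, 1, 12)).fit(disp=False)
    approx_fc = sarima.forecast(12)
    # The detail components would be forecast with a Transformer and all component
    # forecasts recombined through the inverse transform; that stage is omitted here.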
Cardiovascular diseases (CVD) remain one of the leading causes of hospitalization in Brazil. Exposure to air pollutants such as PM$_{10}$, NO$_2$, and SO$_2$ has been associated with the worsening of these diseases, especially in urban areas. This study evaluated the association between the daily concentration of these pollutants and daily hospitalizations for acute myocardial infarction and cerebrovascular diseases in S\~ao Paulo (2010-2019), using generalized additive models with lags of 0 to 4 days. Two approaches for choosing the degrees of freedom of the temporal smoothing were compared: one based on predicting the pollutant series and one based on predicting the outcome (hospitalizations). Data were obtained from official government databases. The modeling used the quasi-Poisson family in R (v. 4.4.0). Models with exposure-based smoothing generated more consistent estimates. For PM$_{10}$, the cumulative risk estimate was 1.08% under exposure-based smoothing versus 1.20% under outcome-based smoothing. For NO$_2$, the estimated risk was 1.47% (exposure-based) versus 1.33% (outcome-based). For SO$_2$, the difference was striking: 7.66% versus 14.31%. The significant lags were days 0, 1, and 2. The results show that smoothing based on outcome prediction can introduce bias, masking the true effect of pollutants; the appropriate choice of degrees of freedom in the smoothing function is therefore crucial. Smoothing based on the pollutant series was more robust and accurate, contributing to methodological improvements in time-series studies and reinforcing the importance of public policies for pollution control.
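A generic form of the model being compared, in our notation: $Y_t$ is the daily hospitalization count, $x_{t-\ell}$ the pollutant concentration at lag $\ell = 0,\dots,4$, $s(t;\mathrm{df})$ the temporal smooth whose degrees of freedom are chosen either from the pollutant series or from the outcome series, $\mathbf{z}_t$ other adjustment covariates, and $\phi$ the quasi-Poisson overdispersion parameter.

\[
\log \mathbb{E}[Y_t] \;=\; \beta_0 + \sum_{\ell=0}^{4} \beta_\ell\, x_{t-\ell} + s(t;\,\mathrm{df}) + \boldsymbol{\gamma}^{\top}\mathbf{z}_t,
\qquad
\operatorname{Var}(Y_t) = \phi\, \mathbb{E}[Y_t].
\]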