Econometrics in general, and panel data methods in particular, are becoming crucial in public health economics and social policy analysis. In this discussion paper, we employ Feasible Generalized Least Squares (FGLS) to assess whether there are statistically significant relationships between hemoglobin (adjusted to sea level), weight, and height from 2007 to 2022 in children up to five years of age in Peru. This method offers a tool for confirming whether the relationships that Peruvian agencies and authorities assume between the target variables point in the right direction for the fight against chronic malnutrition and stunting.
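The abstract above does not specify its FGLS variant, so as a minimal illustrative sketch (not the paper's panel-data estimator, which would also model cross-sectional correlation), here is the classic two-step FGLS for heteroskedastic errors: fit OLS, estimate the variance function from the log squared residuals, then reweight. All data and parameter values below are invented for demonstration.

```python
import numpy as np

def fgls(X, y):
    """Two-step feasible GLS sketch: OLS residuals first, then
    reweighting by estimated heteroskedastic error variances."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    # Model log squared residuals on X to estimate the variance function
    g, *_ = np.linalg.lstsq(X, np.log(resid**2 + 1e-12), rcond=None)
    w = np.sqrt(np.exp(-(X @ g)))          # inverse estimated std. deviations
    beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 + x, n)   # noise variance grows with x
beta = fgls(X, y)                                # roughly [1, 2]
```

The reweighting step is what distinguishes FGLS from plain OLS: observations with larger estimated error variance are down-weighted, restoring efficiency when homoskedasticity fails.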
Accurate prediction of spatially dependent functional data is critical for various engineering and scientific applications. In this study, a spatial functional deep neural network model was developed with a novel non-linear modeling framework that seamlessly integrates spatial dependencies and functional predictors using deep learning techniques. The proposed model extends classical scalar-on-function regression by incorporating a spatial autoregressive component while leveraging functional deep neural networks to capture complex non-linear relationships. To ensure robust estimation, the methodology employs an adaptive estimation approach, where the spatial dependence parameter is first inferred via maximum likelihood estimation, followed by non-linear functional regression using deep learning. The effectiveness of the proposed model was evaluated through extensive Monte Carlo simulations and an application to Brazilian COVID-19 data, where the goal was to predict the average daily number of deaths. Comparative analysis with maximum likelihood-based spatial functional linear regression and functional deep neural network models demonstrates that the proposed algorithm significantly improves predictive performance. The results for the Brazilian COVID-19 data showed that while all models achieved similar mean squared error values in the training phase, the proposed model achieved the lowest mean squared prediction error in the testing phase, indicating superior generalization ability.
We introduce a scalable framework for regressing multivariate distributions onto multivariate distributions, motivated by the application of inferring cell-cell communication from population-scale single-cell data. The observed data consist of pairs of multivariate distributions for ligands from one cell type and corresponding receptors from another. For each ordered pair $e=(l,r)$ of cell types $(l \neq r)$ and each sample $i = 1, \ldots, n$, we observe a pair of distributions $(F_{ei}, G_{ei})$ of gene expressions for ligands and receptors of cell types $l$ and $r$, respectively. The aim is to set up a regression of receptor distributions $G_{ei}$ given ligand distributions $F_{ei}$. A key challenge is that these distributions reside in distinct spaces of differing dimensions. We formulate the regression of multivariate densities on multivariate densities using a generalized Bayes framework with the sliced Wasserstein distance between fitted and observed distributions. Finally, we use inference under such regressions to define a directed graph for cell-cell communications.
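The sliced Wasserstein distance mentioned above has a simple Monte Carlo form: project both multivariate samples onto random directions, compute the one-dimensional W2 between sorted projections, and average. A minimal numpy sketch (equal-size point clouds assumed; the paper's generalized Bayes machinery is not reproduced here):

```python
import numpy as np

def sliced_w2(X, Y, n_proj=200, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between equal-size
    point clouds X, Y of shape (n, d): average the squared 1-D W2
    over random directions, using sorted samples as quantiles."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)        # uniform direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)      # exact 1-D W2^2 for equal sizes
    return float(np.sqrt(total / n_proj))

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 2))
Y = X + np.array([3.0, 0.0])      # same cloud shifted by 3 along axis 0
d_self = sliced_w2(X, X)           # 0
d_shift = sliced_w2(X, Y)          # about 3/sqrt(2)
```

Because each projection reduces the problem to sorting, the cost scales as O(n log n) per direction, which is what makes the distance attractive for population-scale single-cell data.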
Background: Researchers typically identify pregnancies in healthcare data based on observed outcomes (e.g., delivery). This outcome-based approach misses pregnancies that received prenatal care but whose outcomes were not recorded (e.g., at-home miscarriage), potentially inducing selection bias in effect estimates for prenatal exposures. Alternatively, prenatal encounters can be used to identify pregnancies, including those with unobserved outcomes. However, this prenatal approach requires methods to address missing data. Methods: We simulated 10,000,000 pregnancies and estimated the total effect of initiating treatment on the risk of preeclampsia. We generated data for 36 scenarios in which we varied the effect of treatment on miscarriage and/or preeclampsia; the percentage with missing outcomes (5% or 20%); and the cause of missingness: (1) measured covariates, (2) unobserved miscarriage, and (3) a mix of both. We then created three analytic samples to address missing pregnancy outcomes: observed deliveries, observed deliveries and miscarriages, and all pregnancies. Treatment effects were estimated using non-parametric direct standardization. Results: Risk differences (RDs) and risk ratios (RRs) from the three analytic samples were similarly biased when all missingness was due to unobserved miscarriage (log-transformed RR bias range: -0.12 to 0.33 among observed deliveries; -0.11 to 0.32 among observed deliveries and miscarriages; and -0.11 to 0.32 among all pregnancies). When predictors of missingness were measured, only the all-pregnancies approach was unbiased (-0.27 to 0.33; -0.29 to 0.03; and -0.02 to 0.01, respectively). Conclusions: When all missingness was due to miscarriage, the analytic samples returned similar effect estimates. Only among all pregnancies did bias decrease as the proportion of missingness due to measured variables increased.
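Non-parametric direct standardization, as used above, is just a weighted average of stratum-specific risks over a standard covariate distribution. A tiny sketch with invented numbers (two covariate strata; the risks and weights are hypothetical, not from the paper's simulation):

```python
import numpy as np

def standardized_risk(stratum_risks, stratum_weights):
    """Non-parametric direct standardization: weight stratum-specific
    risks by the standard population's covariate distribution."""
    return float(np.dot(stratum_risks, stratum_weights))

weights = np.array([0.7, 0.3])          # covariate distribution in the full cohort
risk_treated = np.array([0.10, 0.30])   # hypothetical stratum risks under treatment
risk_control = np.array([0.05, 0.20])   # ... and under no treatment
r1 = standardized_risk(risk_treated, weights)   # 0.16
r0 = standardized_risk(risk_control, weights)   # 0.095
rd, rr = r1 - r0, r1 / r0               # risk difference and risk ratio
```

Standardizing both arms to the same covariate distribution is what makes the resulting RD and RR comparable across the three analytic samples.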
Ocean microbes are critical to both ocean ecosystems and the global climate. Flow cytometry, which measures cell optical properties in fluid samples, is routinely used in oceanographic research. Despite decades of accumulated data, identifying key microbial populations (a process known as ``gating'') remains a significant analytical challenge. To address this, we focus on gating multidimensional, high-frequency flow cytometry data collected {\it continuously} on board oceanographic research vessels, capturing time- and space-wise variations in the dynamic ocean. Our paper proposes a novel mixture-of-experts model in which both the gating function and the experts are given by trend filtering. The model leverages two key assumptions: (1) Each snapshot of flow cytometry data is a mixture of multivariate Gaussians and (2) the parameters of these Gaussians vary smoothly over time. Our method uses regularization and a constraint to ensure smoothness and that cluster means match biologically distinct microbe types. We demonstrate, using flow cytometry data from the North Pacific Ocean, that our proposed model accurately matches human-annotated gating and corrects significant errors.
The mean-variance portfolio model, based on the risk-return trade-off for optimal asset allocation, remains foundational in portfolio optimization. However, its reliance on restrictive assumptions about asset return distributions limits its applicability to real-world data. Parametric copula structures provide a novel way to overcome these limitations by accounting for asymmetry, heavy tails, and time-varying dependencies. Existing methods have been shown to rely on fixed or static dependence structures, thus overlooking the dynamic nature of the financial market. In this study, a semiparametric model is proposed that combines non-parametrically estimated copulas with parametrically estimated marginals to allow all parameters to evolve dynamically over time. A novel framework was developed that integrates time-varying dependence modeling with flexible empirical beta copula structures. Marginal distributions were modeled using the Skewed Generalized T family, which effectively captures asymmetry and heavy tails and makes the model suitable for predictive inference in real-world scenarios. Furthermore, the model was applied to rolling windows of financial returns from the USA, India, and Hong Kong to understand the influence of dynamic market conditions. The approach addresses the limitations of models that rely on parametric assumptions. By accounting for asymmetry, heavy tails, and cross-correlated asset prices, the proposed method offers a robust solution for optimizing diverse portfolios in an interconnected financial market. Through adaptive modeling, it allows for better management of risk and return across varying economic conditions, leading to more efficient asset allocation and improved portfolio performance.
Agent-based simulation provides a powerful tool for in silico system modeling. However, these simulations do not provide built-in methods for uncertainty quantification (UQ). Within these types of models, a typical approach to UQ is to run multiple realizations of the model and then compute aggregate statistics. This approach is limited by the compute time required for a solution. When faced with an emerging biothreat, public health decisions need to be made quickly, and solutions for integrating near real-time data with analytic tools are needed. We propose an integrated Bayesian UQ framework for agent-based models (ABMs) based on sequential Monte Carlo sampling. Given streaming or static data about the evolution of an emerging pathogen, this Bayesian framework provides a distribution over the parameters governing the spread of a disease through a population. These estimates of the spread of a disease may be provided to public health agencies seeking to abate the spread. By coupling agent-based simulations with Bayesian modeling in a data-assimilation workflow, our framework provides a powerful tool for modeling dynamical systems in silico, reducing model error and providing a range of realistic possible outcomes. Moreover, our method addresses two primary limitations of ABMs: the lack of UQ and the inability to assimilate data. The proposed framework combines the flexibility of an agent-based model with the UQ provided by the Bayesian paradigm in a workflow that scales well to HPC systems. We provide algorithmic details and results on a simulated outbreak with both static and streaming data.
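To make the sequential Monte Carlo idea concrete, here is a toy bootstrap particle filter that infers a transmission rate from streaming infection counts. A cheap stochastic SIR simulator stands in for the full agent-based model, and all parameter values are invented; the paper's framework is considerably richer than this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def sir_step(state, beta, gamma=0.1, n_pop=1000):
    """One stochastic SIR transition -- a cheap stand-in for a full ABM."""
    s, i = state
    new_inf = rng.binomial(s, 1.0 - np.exp(-beta * i / n_pop))
    new_rec = rng.binomial(i, gamma)
    return s - new_inf, i + new_inf - new_rec

# Synthetic "streaming" observations generated from a known rate
true_beta, state, obs = 0.4, (990, 10), []
for _ in range(30):
    state = sir_step(state, true_beta)
    obs.append(rng.poisson(state[1]))          # noisy infection counts

# Bootstrap particle filter over the unknown transmission rate
n_part = 2000
betas = rng.uniform(0.05, 1.0, n_part)         # draws from a flat prior
states = [(990, 10)] * n_part
for y in obs:
    states = [sir_step(st, b) for st, b in zip(states, betas)]
    lam = np.maximum([st[1] for st in states], 1e-9)
    logw = y * np.log(lam) - lam               # Poisson log-likelihood (up to a constant)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(n_part, size=n_part, p=w) # multinomial resampling
    # Small jitter to fight particle degeneracy for the static parameter
    betas = np.clip(betas[idx] + rng.normal(0, 0.01, n_part), 0.01, None)
    states = [states[j] for j in idx]
# betas now approximates the posterior over the transmission rate
```

Each streamed observation triggers one propagate/weight/resample cycle, which is why this style of assimilation works with either static batches or live data feeds.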
Existing estimates of human migration are limited in their scope, reliability, and timeliness, prompting the United Nations and the Global Compact on Migration to call for improved data collection. Using privacy protected records from three billion Facebook users, we estimate country-to-country migration flows at monthly granularity for 181 countries, accounting for selection into Facebook usage. Our estimates closely match high-quality measures of migration where available but can be produced nearly worldwide and with less delay than alternative methods. We estimate that 39.1 million people migrated internationally in 2022 (0.63% of the population of the countries in our sample). Migration flows significantly changed during the COVID-19 pandemic, decreasing by 64% before rebounding in 2022 to a pace 24% above the pre-crisis rate. We also find that migration from Ukraine increased tenfold in the wake of the Russian invasion. To support research and policy interventions, we will release these estimates publicly through the Humanitarian Data Exchange.
Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial contexts and abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology, such as its relatively low resolution and comparatively insufficient sequencing depth, make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, to adaptively leverage cell-labeled information from external sources in inferring cell-level heterogeneity of a target spatial transcriptomics dataset. Results: Applications in several real studies as well as a number of simulation settings show that our approach significantly improves existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, TransST is the only method among those studied that is able to separate the adipose tissues from the connective tissues. Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.
The main form of freeway traffic congestion is the familiar stop-and-go wave, characterized by wide moving jams that propagate upstream indefinitely given sufficient traffic demand. These waves cause severe, long-lasting adverse effects, such as reduced traffic efficiency, increased driving risks, and higher vehicle emissions. This underscores the crucial importance of artificial intervention in the propagation of stop-and-go waves. Over the past two decades, two prominent strategies for stop-and-go wave suppression have emerged: variable speed limit (VSL) and jam-absorption driving (JAD). Although they share similar research motivations, objectives, and theoretical foundations, the development of these strategies has remained relatively disconnected. To synthesize fragmented advances and drive the field forward, this paper first provides a comprehensive review of the achievements of VSL and JAD for stop-and-go wave suppression, respectively. It then focuses on bridging the two areas and identifying research opportunities from the following perspectives: fundamental diagrams, traffic dynamics modeling, traffic state estimation and prediction, stochasticity, scenarios for strategy validation, and field tests and practical deployment. We expect that through this review, each area can effectively address its limitations by identifying and leveraging the strengths of the other, thus promoting the overall research goal of freeway stop-and-go wave suppression.
Sepsis remains a critical challenge due to its high mortality and complex prognosis. To address data limitations in studying MSSA sepsis, we extend existing transfer learning frameworks to accommodate transformation models for high-dimensional survival data. Specifically, we construct a C-index-based measurement index for intelligently identifying helpful source datasets, and we improve the target model's performance by leveraging information from the identified sources through a transfer step and a debiasing step. We further provide an algorithm to construct confidence intervals for each coefficient component. We also rigorously establish statistical properties, including $\ell_1/\ell_2$-estimation error bounds for the transfer learning algorithm, a detection consistency property for the transferable source detection algorithm, and asymptotic theory for the confidence interval construction. Extensive simulations and analysis of MIMIC-IV sepsis data demonstrate the estimation and prediction accuracy and practical advantages of our approach, providing significant improvements in survival estimates for MSSA sepsis patients.
Katz, Savage, and Brusch propose a two-part forecasting method for sectors where event timing differs from recording time. They treat forecasting as a time-shift operation, using univariate time series for total bookings and a Bayesian Dirichlet Auto-Regressive Moving Average (B-DARMA) model to allocate bookings across trip dates based on lead time. Analysis of Airbnb data shows that this approach is interpretable, flexible, and potentially more accurate for forecasting demand across multiple time axes.
Accurate modeling of daily rainfall, encompassing both dry and wet days as well as extreme precipitation events, is critical for robust hydrological and climatological analyses. This study proposes a zero-inflated extended generalized Pareto distribution model that unifies the modeling of dry days and low, moderate, and extreme rainfall within a single framework. Unlike traditional approaches that rely on prespecified threshold selection to identify extremes, the proposed model captures tail behavior intrinsically through a tail index that aligns with the generalized Pareto distribution. The model also accommodates covariate effects via generalized additive modeling, allowing for the representation of complex climatic variability. The current implementation is limited to a univariate setting, modeling daily rainfall independently of covariates. Model estimation is carried out using both maximum likelihood and Bayesian approaches. Simulation studies and empirical applications demonstrate the model's flexibility in capturing the zero inflation and heavy-tailed behavior characteristic of daily rainfall distributions.
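The zero-inflation mechanism above is easy to illustrate by simulation. The sketch below draws a dry day with some probability and otherwise samples from a plain generalized Pareto via its inverse CDF; note the paper uses an *extended* GPD that also shapes low and moderate rainfall, which this sketch does not reproduce, and all parameter values are invented.

```python
import numpy as np

def sample_zi_gpd(n, p_dry, sigma, xi, rng):
    """Zero-inflated rainfall sketch: zero with probability p_dry
    (a dry day), otherwise a generalized Pareto draw via the inverse
    CDF x = (sigma / xi) * ((1 - u)^(-xi) - 1), for xi > 0."""
    u = rng.uniform(size=n)
    wet = sigma / xi * ((1.0 - u) ** (-xi) - 1.0)
    return np.where(rng.uniform(size=n) < p_dry, 0.0, wet)

rng = np.random.default_rng(3)
x = sample_zi_gpd(100_000, p_dry=0.4, sigma=5.0, xi=0.2, rng=rng)
# About 40% of days are exactly zero; the wet-day tail is heavy (xi > 0)
```

A positive tail index `xi` gives the polynomially decaying tail that motivates GPD-type models for extreme precipitation.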
This work presents a conceptual synthesis of causal discovery and inference frameworks, with a focus on how foundational assumptions -- causal sufficiency, causal faithfulness, and the causal Markov condition -- are formalized and operationalized across methodological traditions. Through structured tables and comparative summaries, I map core assumptions, tasks, and analytical choices from multiple causal frameworks, highlighting their connections and differences. The synthesis provides practical guidance for researchers designing causal studies, especially in settings where observational or experimental constraints challenge standard approaches. This guide spans all phases of causal analysis, including question formulation, formalization of background knowledge, selection of appropriate frameworks, choice of study design or algorithm, and interpretation. It is intended as a tool to support rigorous causal reasoning across diverse empirical domains.
Spontaneous reporting system databases are key resources for post-marketing surveillance, providing real-world evidence (RWE) on the adverse events (AEs) of regulated drugs or other medical products. Various statistical methods have been proposed for AE signal detection in these databases, flagging drug-specific AEs with disproportionately high observed counts compared to expected counts under independence. However, signal detection remains challenging for rare AEs or newer drugs, which receive small observed and expected counts and thus suffer from reduced statistical power. Principled information sharing on signal strengths across drugs/AEs is crucial in such cases to enhance signal detection. However, existing methods typically ignore complex between-drug associations on AE signal strengths, limiting their ability to detect signals. We propose novel local-global mixture Dirichlet process (DP) prior-based nonparametric Bayesian models to capture these associations, enabling principled information sharing between drugs while balancing flexibility and shrinkage for each drug, thereby enhancing statistical power. We develop efficient Markov chain Monte Carlo algorithms for implementation and employ a false discovery rate (FDR)-controlled, false negative rate (FNR)-optimized hypothesis testing framework for AE signal detection. Extensive simulations demonstrate our methods' superior sensitivity -- often surpassing existing approaches by a twofold or greater margin -- while strictly controlling the FDR. An application to FDA FAERS data on statin drugs further highlights our methods' effectiveness in real-world AE signal detection. Software implementing our methods is provided as supplementary material.
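The observed-versus-expected comparison underlying disproportionality analysis can be stated in a few lines. This sketch computes the relative reporting ratio from a drug-by-AE contingency table with invented counts; it is the frequentist baseline the paper's Dirichlet process models improve on by sharing information across drugs and AEs.

```python
import numpy as np

def relative_reporting_ratio(counts):
    """Observed/expected ratio per (drug, AE) cell; expected counts
    come from the independence model row_total * col_total / N."""
    counts = np.asarray(counts, dtype=float)
    expected = (counts.sum(axis=1, keepdims=True)
                * counts.sum(axis=0, keepdims=True) / counts.sum())
    return counts / expected

# Hypothetical 3-drug x 3-AE report counts
tab = np.array([[20,  5,  5],
                [ 5, 10,  5],
                [ 5,  5, 40]])
rrr = relative_reporting_ratio(tab)   # values > 1 suggest a potential signal
```

The small-count problem described in the abstract is visible here: a cell with 2 observed versus 1 expected reports has RRR = 2 but almost no statistical power, which is exactly where borrowing strength across related drugs helps.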
This study develops a novel predictive framework for power grid vulnerability based on the statistical signatures of Self-Organized Criticality (SOC). By analyzing the evolution of the power-law critical exponents in outage size distributions from the Texas grid during 2014-2022, we demonstrate the method's ability to forecast system-wide vulnerability to catastrophic failures. Our results reveal a systematic decline in the critical exponent from 1.45 in 2018 to 0.95 in 2020, followed by a drop below the theoretical critical threshold ($\alpha$ = 1) to 0.62 in 2021, coinciding precisely with the catastrophic February 2021 power crisis. This predictive signal emerged 6-12 months before the crisis. By monitoring critical exponent transitions through subcritical and supercritical regimes, we provide quantitative early warning capabilities for catastrophic infrastructure failures, with significant implications for grid resilience planning, risk assessment, and emergency preparedness in increasingly stressed power systems.
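Estimating a power-law exponent from outage sizes is commonly done with the maximum likelihood (Hill-type) estimator. The sketch below assumes the standard continuous power-law density with $\alpha > 1$ above a threshold `xmin`; the paper does not specify its estimator or parametrization (its reported exponents dip below 1, suggesting a different convention), so this is illustrative only.

```python
import numpy as np

def alpha_mle(x, xmin):
    """Continuous power-law exponent MLE (Hill-type estimator):
    alpha_hat = 1 + n / sum(log(x_i / xmin)) over x_i >= xmin."""
    x = np.asarray(x, dtype=float)
    x = x[x >= xmin]
    return 1.0 + x.size / np.log(x / xmin).sum()

rng = np.random.default_rng(42)
alpha_true, xmin, n = 1.45, 1.0, 50_000
u = rng.uniform(size=n)
x = xmin * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))  # inverse-CDF power-law draws
alpha_hat = alpha_mle(x, xmin)                        # close to 1.45
```

Tracking `alpha_hat` on a rolling window of outage sizes is the kind of monitoring the abstract describes: a declining exponent means large outages are becoming relatively more frequent.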
Rural economies are largely dependent upon agriculture, which is greatly determined by climatic conditions such as rainfall. This study aims to forecast agricultural production in Maharashtra, India, using annual data from 1962 to 2021. Since rainfall plays a major role in crop yield, we analyze its impact using four time series models: ARIMA, ARIMAX, GARCH-ARIMA, and GARCH-ARIMAX. Rainfall is included as an external regressor to examine whether it improves model performance. One-, two-, and three-step-ahead forecasts are obtained, and model performance is assessed using MAE and RMSE. The models predict more accurately with rainfall as a predictor than when relying solely on historical production trends, with the largest improvements in the ARIMAX and GARCH-ARIMAX models. These findings underscore the need for climate-aware forecasting techniques that provide useful information to policymakers and farmers for agricultural planning.
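The core ARIMAX idea of adding an exogenous regressor can be shown with a stripped-down AR(1)-plus-rainfall model fit by least squares. This is a minimal stand-in, not the paper's specification, and the simulated series below is entirely synthetic.

```python
import numpy as np

def fit_arx1(y, x):
    """Least-squares fit of y_t = c + phi * y_{t-1} + b * x_t + e_t:
    an AR(1) with one exogenous regressor, the simplest ARIMAX flavor."""
    Z = np.column_stack([np.ones(len(y) - 1), y[:-1], x[1:]])
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    return coef  # (c, phi, b)

rng = np.random.default_rng(7)
T = 300
rain = rng.gamma(2.0, 1.0, T)           # synthetic rainfall regressor
y = np.zeros(T)
for t in range(1, T):
    y[t] = 1.0 + 0.6 * y[t - 1] + 0.8 * rain[t] + rng.normal(0, 0.3)
c, phi, b = fit_arx1(y, rain)           # phi near 0.6, b near 0.8
```

A clearly nonzero estimate of `b` is the single-model analogue of the abstract's finding: the exogenous rainfall series carries predictive information beyond the production history alone. Note that multi-step forecasting with this model also requires forecasts of future rainfall.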
Political scientists are increasingly interested in analyzing visual content at scale. However, the existing computational toolbox is still in need of methods and models attuned to the specific challenges and goals of social and political inquiry. In this article, we introduce a visual Structural Topic Model (vSTM) that combines pretrained image embeddings with a structural topic model. This has important advantages compared to existing approaches. First, pretrained embeddings allow the model to capture the semantic complexity of images relevant to political contexts. Second, the structural topic model provides the ability to analyze how topics and covariates are related, while maintaining a nuanced representation of images as a mixture of multiple topics. In our empirical application, we show that the vSTM is able to identify topics that are interpretable, coherent, and substantively relevant to the study of online political communication.