In recent years, Delhi has become notorious for its poor air quality. This study explores the trends of probable contributors to Delhi's deteriorating air quality by analyzing data from 2014 to 2024 -- a period that has not been the central focus of previous research. The study aims to reassess the contributors in light of recent shifts. The consistently worsening air quality has forced the people of Delhi to adapt to an unhealthy environment. People breathing this polluted air are at heightened risk of developing health issues such as respiratory infections, heart disease, and lung cancer. The study provides a quantified perspective on how each contributor has influenced pollution levels by identifying the percentage contributions of major sources. Over the years, Delhi's air pollution has been primarily attributed to stubble burning. However, the present study discusses the decline in stubble-burning cases in the current scenario and the evolving impact of contributors such as vehicular emissions, industrial activities, and population growth. Moreover, the study assesses the effectiveness of mitigation strategies such as Electric Vehicles (EVs), public transport expansion, and pollution control policies. The average levels of the Air Quality Index (AQI) during October-November and November-December remained consistently high from 2018 to 2024, reaching 374 in November 2024. Based on this data-driven analysis, the study demonstrates that existing measures have fallen short and makes a strong case for implementing new long-term strategies focused on root causes.
Data analysis often encounters missing data, which can lead to inaccurate conclusions, especially for ordinal variables. In trauma data, the Glasgow Coma Scale is useful for assessing the level of consciousness. This score is often missing in patients who are intubated or under sedation upon arrival at the hospital, and in those with normal reactivity without head injury, suggesting a Missing Not At Random (MNAR) mechanism. The difficulty with MNAR is the absence of a definitive analysis. While sensitivity analysis is often recommended, practical limitations sometimes restrict the analysis to a basic comparison between results under Missing Completely At Random (MCAR) and Missing At Random (MAR) assumptions, disregarding the plausibility of MNAR. Our objective is to propose a flexible and accessible sensitivity analysis method in the presence of an MNAR ordinal independent variable. The method is inspired by the sensitivity analysis approach proposed by Leurent et al. (2018) for a continuous response variable, which we extend to an ordinal independent variable. The method is evaluated on simulated data before being applied to Pan-Canadian trauma data from April 2013 to March 2018. The simulations show that MNAR estimates are less biased than MAR estimates and more precise than complete case analysis (CC) estimates. Confidence interval coverage rates are better for MNAR estimates than for CC and MAR estimates. In the application, the Glasgow Coma Scale is significant under the MNAR assumption, unlike under the MCAR and MAR assumptions.
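As a rough illustration of the kind of sensitivity analysis described above, the sketch below uses a generic delta-adjustment scheme in Python: missing values of an ordinal predictor are imputed under a MAR assumption, shifted by a sensitivity parameter delta to encode MNAR departures, and the analysis model is refit over a grid of delta values. The simulated data and variable names are hypothetical, and this is not the authors' extension of Leurent et al. (2018); it only conveys the general mechanics of an MNAR sensitivity analysis.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate: an ordinal predictor (a 3-15 coma-scale-like score), a fully observed
# covariate, a binary outcome, and MNAR missingness in the predictor.
n = 2000
score = rng.integers(3, 16, size=n)                   # ordinal predictor
x_other = rng.normal(size=n)                          # fully observed covariate
logit = -4.0 + 0.25 * (15 - score) + 0.5 * x_other    # worse score -> higher risk
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# MNAR: lower (more severe) scores are more likely to be missing.
p_miss = 1 / (1 + np.exp(-(-2.0 + 0.3 * (15 - score))))
score_obs = np.where(rng.uniform(size=n) < p_miss, np.nan, score).astype(float)

# Delta-adjustment sensitivity analysis:
# 1. impute missing scores under MAR (simple regression imputation),
# 2. shift the imputations by delta to encode an MNAR departure,
# 3. refit the analysis model for each delta (delta = 0 recovers MAR).
obs = ~np.isnan(score_obs)
imp_model = sm.OLS(
    score_obs[obs],
    sm.add_constant(np.column_stack([x_other[obs], y[obs]])),
).fit()
mar_imp = imp_model.predict(sm.add_constant(np.column_stack([x_other[~obs], y[~obs]])))

for delta in [0.0, -1.0, -2.0, -3.0]:
    filled = score_obs.copy()
    filled[~obs] = np.clip(np.round(mar_imp + delta), 3, 15)
    fit = sm.Logit(y, sm.add_constant(np.column_stack([filled, x_other]))).fit(disp=0)
    print(f"delta={delta:+.1f}  coef(score)={fit.params[1]:+.3f}")
```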
Recent multi-omic microbiome studies enable integrative analysis of microbes and metabolites, uncovering their associations with various host conditions. Such analyses require multivariate models capable of accounting for the complex correlation structures between microbes and metabolites. However, existing multivariate models often suffer from low statistical power for detecting microbiome-metabolome interactions due to small sample sizes and weak biological signals. To address these challenges, we introduce CoMMiT, Co-informed inference of Microbiome-Metabolome Interactions via novel Transfer learning models. Unlike conventional transfer-learning methods that borrow information from external datasets, CoMMiT leverages similarities across metabolites within a single cohort, reducing the risk of negative transfer often caused by differences in sequencing platforms and bioinformatic pipelines across studies. CoMMiT operates under the flexible assumption that auxiliary metabolites are collectively informative for the target metabolite, without requiring individual auxiliary metabolites to be informative. CoMMiT uses a novel data-driven approach to selecting the optimal set of auxiliary metabolites. Using this optimal set, CoMMiT employs a de-biasing framework to enable efficient calculation of p-values, facilitating the identification of statistically significant microbiome-metabolome interactions. Applying CoMMiT to a feeding study reveals biologically meaningful microbiome-metabolome interactions under a low glycemic load diet, demonstrating the diet-host link through gut metabolism.
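The following is a minimal, generic transfer-learning regression sketch of the "borrowing information across metabolites" idea described above: auxiliary responses are pooled to estimate a shared microbe-coefficient vector, which is then corrected on the target metabolite with a sparse adjustment. The simulated data, dimensions, and two-step estimator are illustrative assumptions; they do not reproduce CoMMiT's data-driven auxiliary selection or its de-biased p-value framework.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Toy data: n samples x p microbes, one target metabolite and K auxiliary
# metabolites whose microbe associations are similar (but not identical)
# to the target's.
n, p, K = 60, 100, 8
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 0.8
y_target = X @ beta + rng.normal(scale=1.0, size=n)
y_aux = np.column_stack([
    X @ (beta + rng.normal(scale=0.05, size=p)) + rng.normal(scale=1.0, size=n)
    for _ in range(K)
])

# Step 1: pool the auxiliary metabolites to estimate a shared coefficient vector.
X_pooled = np.tile(X, (K, 1))
w_hat = Ridge(alpha=1.0).fit(X_pooled, y_aux.T.reshape(-1)).coef_

# Step 2: correct the transferred estimate on the target metabolite by fitting
# a sparse adjustment to the residual.
delta_hat = Lasso(alpha=0.05).fit(X, y_target - X @ w_hat).coef_
beta_hat = w_hat + delta_hat

# Compare with a target-only fit that ignores the auxiliary metabolites.
target_only = Lasso(alpha=0.05).fit(X, y_target).coef_
print("transfer    error:", round(np.linalg.norm(beta_hat - beta), 3))
print("target-only error:", round(np.linalg.norm(target_only - beta), 3))
```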
Here, we outline how Rothman diagrams provide a geometric perspective that can help epidemiologists understand the relationships between effect measure modification (which we call association measure modification), collapsibility, and confounding. A Rothman diagram plots the risk of disease in the unexposed on the x-axis and the risk in the exposed on the y-axis. Crude and stratum-specific risks in the two exposure groups define points in the unit square. When there is modification of a measure of association $M$ by a covariate $C$, the stratum-specific values of $M$ differ across strata defined by $C$, so the stratum-specific points are on different contour lines of $M$. We show how collapsibility can be defined in terms of standardization instead of no confounding, and we show that a measure of association is collapsible if and only if all its contour lines are straight. We illustrate these ideas using data from a study in Newcastle, United Kingdom, where the causal effect of smoking on 20-year mortality was confounded by age. From this perspective, it is clear that association measure modification and collapsibility are logically independent of confounding. This distinction can be obscured when these concepts are taught using regression models.
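A quick way to see the "straight contour lines" characterization of collapsibility is to draw the contour lines of common association measures on the unit square of a Rothman diagram. The short matplotlib sketch below does this for the risk difference, risk ratio, and odds ratio; the grid resolution and contour levels are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Risk in the unexposed (p0) on the x-axis, risk in the exposed (p1) on the y-axis.
p0, p1 = np.meshgrid(np.linspace(0.01, 0.99, 200), np.linspace(0.01, 0.99, 200))

measures = {
    "Risk difference": p1 - p0,                          # straight contours
    "Risk ratio": p1 / p0,                               # straight contours through the origin
    "Odds ratio": (p1 / (1 - p1)) / (p0 / (1 - p0)),     # curved contours
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for ax, (name, values) in zip(axes, measures.items()):
    cs = ax.contour(p0, p1, values, levels=10)
    ax.clabel(cs, inline=True, fontsize=7)
    ax.set_title(name)
    ax.set_xlabel("risk in unexposed")
axes[0].set_ylabel("risk in exposed")
plt.tight_layout()
plt.show()
```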
Interest in summarizing complex and multidimensional phenomena, often related to specific sectors (social, economic, environmental, political, etc.), in a way that makes them easily understandable even to non-experts is far from waning. A widely adopted approach for this purpose is the use of composite indices: statistical measures that aggregate multiple indicators into a single comprehensive measure. In this paper, we present a novel methodology called AutoSynth, designed to condense potentially extensive datasets into a single synthetic index or a hierarchy of such indices. AutoSynth leverages an Autoencoder, a neural network technique, to represent a matrix of features in a lower-dimensional space. Although this approach is not limited to the creation of a particular composite index and can be applied broadly across various sectors, the motivation behind this work arises from a real-world need. Specifically, we aim to assess the vulnerability of the Italian city of Florence at the suburban level across three dimensions: economic, demographic, and social. To demonstrate the methodology's effectiveness, it is also applied to estimate a vulnerability index using a rich, publicly available dataset on U.S. counties and validated through a simulation study.
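As a rough sketch of the bottleneck idea behind an autoencoder-based synthetic index, the PyTorch snippet below trains a small autoencoder with a one-dimensional bottleneck on a toy standardized feature matrix and reads the bottleneck activation off as the index. The layer sizes, training schedule, and rescaling are illustrative assumptions, not the AutoSynth architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a units x indicators matrix (e.g. areas x vulnerability
# indicators), already standardized column-wise.
X = torch.randn(500, 12)

# Autoencoder with a 1-dimensional bottleneck: the bottleneck activation is
# read off as the synthetic index for each unit.
encoder = nn.Sequential(nn.Linear(12, 6), nn.ReLU(), nn.Linear(6, 1))
decoder = nn.Sequential(nn.Linear(1, 6), nn.ReLU(), nn.Linear(6, 12))
model = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)   # reconstruction error
    loss.backward()
    opt.step()

with torch.no_grad():
    index = encoder(X).squeeze(1)                # one synthetic index value per unit
    index = (index - index.min()) / (index.max() - index.min())  # rescale to [0, 1]
print(index[:5])
```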
Course-prerequisite networks (CPNs) are directed acyclic graphs that model complex academic curricula by representing courses as nodes and dependencies between them as directed links. These networks are indispensable tools for visualizing, studying, and understanding curricula. For example, CPNs can be used to detect important courses, improve advising, guide curriculum design, analyze graduation time distributions, and quantify the strength of knowledge flow between different university departments. However, most CPN analyses to date have focused only on micro- and meso-scale properties, leaving macro-scale structure largely unexplored. To fill this gap, we define and study three new global CPN measures: breadth, depth, and flux. All three measures are invariant under transitive reduction and are based on the concept of topological stratification, which generalizes topological ordering in directed acyclic graphs. These measures can be used for macro-scale comparison of different CPNs. We illustrate the new measures numerically by applying them to real and synthetic CPNs from three universities: the Cyprus University of Technology, the California Institute of Technology, and Johns Hopkins University. The CPN data analyzed in this paper are publicly available in a GitHub repository.
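To make topological stratification concrete, the sketch below builds a toy course-prerequisite DAG with networkx, assigns each course to the earliest stratum consistent with its prerequisites (longest path from a source), and computes two illustrative macro-scale summaries. The example courses are made up, and the "depth" and "breadth" shown here (number of strata and largest stratum size) are plausible readings of the terms rather than the paper's exact definitions; flux is not shown.

```python
import networkx as nx

# Toy course-prerequisite network: edge (a, b) means course a is a prerequisite of b.
cpn = nx.DiGraph([
    ("Calc I", "Calc II"), ("Calc II", "Calc III"),
    ("Calc I", "Linear Algebra"), ("Linear Algebra", "ML"),
    ("Calc III", "ML"), ("Intro CS", "ML"),
])
assert nx.is_directed_acyclic_graph(cpn)

# Topological stratification: place each course in the earliest stratum consistent
# with all of its prerequisites (length of the longest path from any source).
stratum = {}
for node in nx.topological_sort(cpn):
    preds = list(cpn.predecessors(node))
    stratum[node] = 0 if not preds else 1 + max(stratum[p] for p in preds)

strata = {}
for node, s in stratum.items():
    strata.setdefault(s, []).append(node)

# Illustrative global summaries (the paper's exact definitions may differ).
depth = len(strata)                                # number of strata
breadth = max(len(v) for v in strata.values())     # size of the largest stratum
print(f"depth={depth}, breadth={breadth}")
for s in sorted(strata):
    print(s, strata[s])
```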
The complexity of experimental setups in the field of cyber-physical energy systems has motivated the development of the Holistic Test Description (HTD), a well-adopted approach for documenting and communicating test designs. Uncertainty, in its many flavours, is an important factor influencing the communication of experiment plans, the execution of experiments, and the reproducibility of experimental results. The work presented here focuses on supporting the structured analysis of experimental uncertainty during the planning and documentation of complex energy systems tests. This paper introduces uncertainty extensions to the original HTD and an additional uncertainty analysis tool. The templates and tools are openly available, and their use is exemplified in two case studies.
Biotic interactions provide a valuable window into the inner workings of complex ecological communities and capture the loss of ecological function often precipitated by environmental change. However, the financial and logistical challenges associated with collecting interaction data result in networks that are recorded with geographical and taxonomic bias, particularly when studies are narrowly focused. We develop an approach to reduce bias in link prediction in the common scenario in which data are derived from studies focused on a small number of species. Our Extended Covariate-Informed Link Prediction (COIL+) framework utilizes a latent factor model that flexibly borrows information between species and incorporates dependence on covariates and phylogeny, and introduces a framework for borrowing information from multiple studies to reduce bias due to uncertain species occurrence. Additionally, we propose a new trait matching procedure which permits heterogeneity in trait-interaction propensity associations at the species level. We illustrate the approach through an application to a literature compilation data set of 268 sources reporting frugivory in Afrotropical forests and compare the performance with and without correction for uncertainty in occurrence. Our method results in a substantial improvement in link prediction, revealing 5,255 likely but unobserved frugivory interactions, and increasing model discrimination under conditions of great taxonomic bias and narrow study focus. This framework generalizes to a variety of network contexts and offers a useful tool for link prediction given networks recorded with bias.
Frailty assessment is crucial for stratifying populations and addressing healthcare challenges associated with ageing. This study proposes a Frailty Index based on administrative health data to support informed decision-making and resource allocation in population health management. Specifically, we aim to develop a Frailty Index that 1) accurately predicts multiple adverse health outcomes, 2) comprises a parsimonious set of variables, 3) aggregates variables without predefined weights, 4) can be regenerated when applied to different populations, and 5) relies solely on routinely collected administrative data. Using administrative data from a local health authority in Italy, we identified two cohorts of individuals aged $\ge$65 years. A set of six adverse outcomes (death, emergency room access with highest priority, hospitalisation, disability onset, dementia onset, and femur fracture) was selected to define frailty. Variable selection was performed using logistic regression modelling and a forward approach based on partially ordered set (POSET) theory. The final Frailty Index comprised eight variables: age, disability, total number of hospitalisations, mental disorders, neurological diseases, heart failure, kidney failure, and cancer. The Frailty Index performs well or very well across the adverse outcomes (AUC range: 0.664-0.854), with the weakest discrimination for hospitalisation (AUC: 0.664). The index also captured associations between frailty and chronic diseases, comorbidities, and socioeconomic deprivation. This study presents a validated, parsimonious Frailty Index based on routinely collected administrative data. The proposed approach offers a comprehensive toolkit for stratifying populations by frailty level, facilitating targeted interventions and resource allocation in population health management.
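The sketch below illustrates, on simulated data, the basic ingredients named above: fit a logistic regression of one adverse outcome on a handful of candidate administrative variables and summarize discrimination with the AUC. The variables, coefficients, and single outcome are hypothetical stand-ins; the POSET-based forward selection and the aggregation across six outcomes are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for administrative data on people aged 65+: a few candidate
# frailty variables and one binary adverse outcome (e.g. death within a year).
n = 5000
X = np.column_stack([
    rng.integers(65, 100, n),          # age
    rng.binomial(1, 0.15, n),          # disability flag
    rng.poisson(0.6, n),               # number of hospitalisations
    rng.binomial(1, 0.10, n),          # heart failure flag
])
risk = -6 + 0.05 * X[:, 0] + 0.9 * X[:, 1] + 0.5 * X[:, 2] + 0.8 * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-risk)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The index is read off as the predicted probability (no hand-picked weights);
# discrimination on held-out individuals is summarised by the AUC.
frailty_index = model.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, frailty_index), 3))
```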
This study explores the potential of large language models (LLMs) to enhance expert forecasting through ensemble learning. Leveraging the European Central Bank's Survey of Professional Forecasters (SPF) dataset, we propose a comprehensive framework to evaluate LLM-driven ensemble predictions under varying conditions, including the intensity of expert disagreement, dynamics of herd behavior, and limitations in attention allocation.
When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This creates a problem: when building a world model, even subtle shifts in policy or environment state can alter the very causal mechanisms that are observed. In this work, we introduce the \textbf{Meta-Causal Graph} as a world model, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state in the latent state space. Building on this representation, we introduce a \textbf{Causality-Seeking Agent} whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships through a curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.
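One way to picture the representation described above is as a mapping from discrete meta states to causal subgraphs over the same set of variables, with the active subgraph selected by the current latent meta state. The Python sketch below encodes this with adjacency matrices; the variables, meta states, and selection rule are invented for illustration and do not correspond to the paper's learned model.

```python
import numpy as np

# Variables of a toy manipulation environment.
variables = ["action", "object_pos", "gripper", "reward"]

# A Meta-Causal Graph as a collection of causal subgraphs (adjacency matrices),
# one per meta state. adj[i, j] = 1 means variable i causally influences variable j.
meta_causal_graph = {
    "object_free": np.array([[0, 1, 1, 0],
                             [0, 0, 0, 1],
                             [0, 0, 0, 0],
                             [0, 0, 0, 0]]),
    "object_grasped": np.array([[0, 0, 1, 0],
                                [0, 0, 0, 1],
                                [0, 1, 0, 0],   # gripper now moves the object
                                [0, 0, 0, 0]]),
}

def active_parents(meta_state, child):
    """Return the causal parents of `child` under the subgraph triggered by `meta_state`."""
    adj = meta_causal_graph[meta_state]
    j = variables.index(child)
    return [variables[i] for i in range(len(variables)) if adj[i, j]]

print(active_parents("object_free", "object_pos"))     # ['action']
print(active_parents("object_grasped", "object_pos"))  # ['gripper']
```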
Brain cognitive and sensory functions are often associated with electrophysiological activity at specific frequency bands. Clustering multivariate time series (MTS) data such as EEGs is important for understanding brain function but challenging due to complex non-stationary cross-dependencies, gradual transitions between cognitive states, noisy measurements, and ambiguous cluster boundaries. To address these issues, we develop a robust fuzzy clustering framework in the spectral domain. Our method leverages Kendall's tau-based canonical coherence (KenCoh), which extracts meaningful frequency-specific monotonic relationships between groups of channels or regions. KenCoh effectively captures dominant coherence structures while remaining robust against outliers and noise, making it suitable for real EEG datasets that typically contain artifacts. Our method first projects each MTS object onto vectors derived from the KenCoh estimates (i.e., canonical directions), which capture relevant information on the connectivity structure of oscillatory signals in predefined frequency bands. These spectral features are then used to determine clusters of epochs via a fuzzy partitioning strategy, accommodating gradual transitions and overlapping class structure. Lastly, we demonstrate the effectiveness of our approach on EEG data in which latent cognitive states such as alertness and drowsiness exhibit frequency-specific dynamics and ambiguity. Our method captures both spectral and spatial features by locating the frequency-dependent structure and brain functional connectivity. Built on the KenCoh framework for fuzzy clustering, it handles the complexity of high-dimensional time series data and is broadly applicable to domains such as neuroscience, wearable sensing, environmental monitoring, and finance.
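To give a flavor of frequency-band features plus fuzzy partitioning, the Python sketch below extracts alpha-band power per channel from simulated EEG-like epochs and runs a plain fuzzy c-means on those features. The band-power features are a simple stand-in for the Kendall's tau-based canonical coherence projections, and the simulated signals and fuzzy c-means settings are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)

# Toy EEG-like data: 200 epochs x 4 channels x 256 samples, with two latent
# "states" that differ in alpha-band (8-12 Hz) power. Sampling rate 128 Hz.
fs, n_epochs, n_ch, n_samp = 128, 200, 4, 256
state = rng.integers(0, 2, n_epochs)
t = np.arange(n_samp) / fs
epochs = rng.normal(0, 1, (n_epochs, n_ch, n_samp))
epochs += (0.5 + 1.5 * state)[:, None, None] * np.sin(2 * np.pi * 10 * t)

# Frequency-band features per epoch: average alpha-band power in each channel
# (a stand-in for the paper's canonical-coherence projections).
freqs, psd = welch(epochs, fs=fs, nperseg=128, axis=-1)
alpha = (freqs >= 8) & (freqs <= 12)
features = psd[:, :, alpha].mean(axis=-1)             # shape (n_epochs, n_ch)

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means returning soft membership weights per epoch."""
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(c), size=len(X))         # random soft memberships
    for _ in range(iters):
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=-1) + 1e-12
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

memberships, _ = fuzzy_cmeans(np.log(features))
hard = memberships.argmax(axis=1)
print("agreement with latent state:", max((hard == state).mean(), (hard != state).mean()))
```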
Causal inference is central to statistics and scientific discovery, enabling researchers to identify cause-and-effect relationships beyond associations. While traditionally studied within Euclidean spaces, contemporary applications increasingly involve complex, non-Euclidean data structures that reside in abstract metric spaces, known as random objects, such as images, shapes, networks, and distributions. This paper introduces a novel framework for causal inference with continuous treatments applied to non-Euclidean data. To address the challenges posed by the lack of linear structure, we leverage Hilbert space embeddings of the metric spaces to facilitate Fr\'echet mean estimation and causal effect mapping. Motivated by a study on the impact of exposure to fine particulate matter on age-at-death distributions across U.S. counties, we propose a nonparametric, doubly-debiased causal inference approach for outcomes that are random objects with continuous treatments. Our framework accommodates moderately high-dimensional vector-valued confounders, and we derive efficient influence functions for estimation to ensure both robustness and interpretability. We establish rigorous asymptotic properties of the cross-fitted estimators and employ conformal inference techniques for counterfactual outcome prediction. Validated through numerical experiments and applied to real-world environmental data, our framework extends causal inference methodologies to complex data structures, broadening its applicability across scientific disciplines.
The global energy landscape is experiencing a transformative shift, with an increasing emphasis on sustainable and clean energy sources. Hydrogen remains a promising candidate for decarbonization, energy storage, and use as an alternative fuel. This study explores the landscape of hydrogen pricing and demand dynamics by evaluating three collaboration scenarios: market-based pricing, cooperative integration, and coordinated decision-making. It incorporates price-sensitive demand, environmentally friendly production methods, and market penetration effects to provide insights into maximizing market share, profitability, and sustainability within the hydrogen industry. This study contributes to understanding the complexities of collaboration by analyzing these structures and their role in a rapid transition to clean hydrogen production while balancing economic viability and environmental goals. The findings reveal that the cooperative integration strategy is the most effective for sustainable growth, increasing green hydrogen's market share to 19.06% and highlighting the potential for environmentally conscious hydrogen production. They also suggest that the coordinated decision-making approach enhances profitability through collaborative tariff contracts while balancing economic viability and environmental goals. This study also underscores the importance of strategic pricing mechanisms, policy alignment, and the role of hydrogen hubs in achieving sustainable growth in the hydrogen sector. By highlighting the uncertainties and potential barriers, this research offers actionable guidance for policymakers and industry players in shaping a competitive and sustainable energy marketplace.
Electric vehicles (EVs) are a promising alternative to fuel vehicles (FVs), given unique characteristics of EVs such as lower air pollution and maintenance costs. However, the increasing prevalence of EVs is accompanied by widespread complaints regarding the high likelihood of motion sickness (MS) induction, especially when compared to FVs, which has become one of the major obstacles to the acceptance and popularity of EVs. Despite the prevalence of such complaints online and among EV users, the association between vehicle type (i.e., EV versus FV) and MS prevalence and severity has not been quantified. Thus, this study aims to investigate the existence of EV-induced MS and explore the potential factors leading to it. A survey study was conducted to collect passengers' MS experiences in EVs and FVs over the past year. In total, 639 valid responses were collected from mainland China. The results show that FVs were associated with a higher frequency of MS, while EVs were found to induce more severe MS symptoms. Further, we found that passengers' MS severity was associated with individual differences (i.e., age, gender, sleep habits, susceptibility to motion-induced MS), in-vehicle activities (i.e., chatting with others and watching in-vehicle displays), and road conditions (i.e., congestion and slope), while MS frequency was associated with vehicle ownership and riding frequency. The results from this study can guide the directions of future empirical studies that aim to quantify the inducers of MS in EVs and FVs, as well as the optimization of EVs to reduce MS.
While age-specific fertility rates (ASFRs) provide the most extensive record of reproductive change, their aggregate nature masks the underlying behavioral mechanisms that ultimately drive fertility trends. To recover these mechanisms, we develop a likelihood-free Bayesian framework that couples an individual-level model of the reproductive process with Sequential Neural Posterior Estimation (SNPE). This allows us to infer eight behavioral and biological parameters from just two aggregate series: ASFRs and the age-profile of planned versus unplanned births. Applied to U.S. National Survey of Family Growth cohorts and to Demographic and Health Survey cohorts from Colombia, the Dominican Republic, and Peru, the method reproduces observed fertility schedules and, critically, predicts out-of-sample micro-level distributions of age at first sex, inter-birth intervals, and family-size ideals, none of which inform the estimation step. Because the fitted model yields complete synthetic life histories, it enables behaviorally explicit population forecasts and supports the construction of demographic digital twins.
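The sketch below shows the general shape of such a simulation-based inference workflow using the sbi package's SNPE interface, which is one possible implementation choice and is not stated in the abstract. The simulator is a hypothetical stand-in that maps three illustrative parameters to a noisy age-profile summary; the paper's individual-level reproductive model and its eight parameters would take its place.

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

# Hypothetical simulator: maps behavioural/biological parameters to an
# aggregate "ASFR-like" summary vector over ages 15-49.
def simulator(theta):
    ages = torch.arange(15.0, 50.0)
    mode, spread, scale = theta[:, 0:1], theta[:, 1:2], theta[:, 2:3]
    schedule = scale * torch.exp(-((ages - mode) ** 2) / (2 * spread ** 2))
    return schedule + 0.01 * torch.randn(theta.shape[0], ages.shape[0])

# Priors over three illustrative parameters (the paper infers eight).
prior = BoxUniform(low=torch.tensor([20.0, 3.0, 0.05]),
                   high=torch.tensor([35.0, 10.0, 0.30]))

# Simulate a training set and fit the neural posterior estimator.
theta = prior.sample((2000,))
x = simulator(theta)
inference = SNPE(prior=prior)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

# Condition on an "observed" aggregate schedule and draw posterior samples.
x_obs = simulator(prior.sample((1,)))
samples = posterior.sample((1000,), x=x_obs)
print(samples.mean(dim=0))
```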
Monte Carlo techniques are the method of choice for making probabilistic predictions of an outcome in several disciplines. Usually, the aim is to generate calibrated predictions that are statistically indistinguishable from the outcome. Developers and users of such Monte Carlo predictions are interested in evaluating the degree of calibration of the forecasts. Here, we consider predictions of $p$-dimensional outcomes that sample a multivariate Gaussian distribution and apply the Box ordinate transform (BOT) to assess calibration. However, this approach is known to fail to reliably indicate calibration when the sample size $n$ is moderate. For some applications, the cost of obtaining Monte Carlo estimates is significant, which can limit the sample size, for instance, in model development when the model is improved iteratively. Thus, it would be beneficial to be able to reliably assess calibration even when the sample size $n$ is moderate. To address this need, we introduce a fair, sample size- and dimension-dependent version of the Gaussian sample BOT. In a simulation study, the fair Gaussian sample BOT is compared with alternative BOT versions for different miscalibrations and for different sample sizes. Results confirm that, in contrast to the alternative BOT versions, the fair Gaussian sample BOT correctly identifies miscalibration when the sample size is moderate. Subsequently, the fair Gaussian sample BOT is applied to 2- to 12-dimensional predictions of temperature and vector wind using operational ensemble forecasts of the European Centre for Medium-Range Weather Forecasts (ECMWF). First, perfectly reliable situations are considered, where the outcome is replaced by a forecast that samples the same distribution as the members of the ensemble. Second, the BOT is computed using estimates of the actual temperature and vector wind from ECMWF analyses.
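For readers unfamiliar with the Box ordinate transform, the sketch below implements one common formulation of the basic (uncorrected) Gaussian sample BOT: fit a Gaussian to the $n$ ensemble members and map the outcome's squared Mahalanobis distance through the chi-square CDF with $p$ degrees of freedom. The toy experiment draws outcomes from the same distribution as the members, so the histogram of BOT values should be roughly uniform for large $n$; the distortion at moderate $n$ is the issue the paper's fair, sample size- and dimension-dependent correction addresses, and that correction is not included here.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def gaussian_sample_bot(ensemble, outcome):
    """Basic Gaussian sample BOT (one common formulation, no finite-n correction):
    fit a Gaussian to the ensemble, then map the outcome's squared Mahalanobis
    distance through the chi-square CDF."""
    n, p = ensemble.shape
    mean = ensemble.mean(axis=0)
    cov = np.cov(ensemble, rowvar=False)
    diff = outcome - mean
    d2 = diff @ np.linalg.solve(cov, diff)      # squared Mahalanobis distance
    return chi2.cdf(d2, df=p)

# Perfectly calibrated toy setting: the outcome is drawn from the same
# distribution as the n ensemble members.
p, n = 3, 20
bots = []
for _ in range(5000):
    members = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
    truth = rng.multivariate_normal(np.zeros(p), np.eye(p))
    bots.append(gaussian_sample_bot(members, truth))
print(np.histogram(bots, bins=10, range=(0, 1))[0])   # non-flat despite calibration
```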
In Major League Baseball, every ballpark is different, with different dimensions and climates. These differences make some ballparks more conducive to hitting home runs than others. Several factors conspire to make estimation of these differences challenging. Home runs are relatively rare, occurring in roughly 3\% of plate appearances. The quality of personnel and the frequency of batter-pitcher handedness combinations that appear in the thirty ballparks vary considerably. Because of asymmetries, effects due to ballpark can depend strongly on hitter handedness. We consider generalized linear mixed effects models based on the Poisson distribution for home runs. We use as our observational unit the combination of game and handedness-matchup. Our model allows for four theoretical mean home run frequency functions for each ballpark. We control for variation in personnel across games by constructing ``elsewhere'' measures of batter ability to hit home runs and pitcher tendency to give them up, using data from parks other than the one in which the response is observed. We analyze 13 seasons of data and find that the estimated home run frequencies adjusted to average personnel are substantially different from observed home run frequencies, leading to considerably different ballpark rankings than often appear in the media.
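A much-simplified version of the modeling idea can be written down as a Poisson regression with a ballpark-by-matchup interaction, personnel controls, and plate appearances as exposure, as in the Python sketch below. The data are simulated, the "elsewhere" measures are placeholders, and a fixed-effects GLM stands in for the paper's generalized linear mixed effects model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Toy stand-in: one row per (game, handedness-matchup) with a home-run count,
# a ballpark label, the matchup, hypothetical "elsewhere" measures of batter
# power and pitcher susceptibility, and plate appearances as exposure.
n = 4000
df = pd.DataFrame({
    "hr": rng.poisson(0.3, n),
    "park": rng.choice([f"park_{i}" for i in range(6)], n),
    "matchup": rng.choice(["RvR", "RvL", "LvR", "LvL"], n),
    "bat_elsewhere": rng.normal(0, 1, n),
    "pit_elsewhere": rng.normal(0, 1, n),
    "pa": rng.integers(15, 25, n),
})

# Poisson regression: the park x matchup interaction allows one mean home-run
# frequency per (park, matchup) combination; the "elsewhere" covariates control
# for personnel; log plate appearances enters as an offset.
fit = smf.glm(
    "hr ~ C(park) * C(matchup) + bat_elsewhere + pit_elsewhere",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["pa"]),
).fit()
print(fit.params.filter(like="park_3").round(3))
```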