Econometrics in general, and panel data methods in particular, are becoming crucial in public health economics and social policy analysis. In this discussion paper, we employ Feasible Generalized Least Squares (FGLS) to assess whether there are statistically significant relationships between hemoglobin (adjusted to sea level), weight, and height from 2007 to 2022 in children up to five years of age in Peru. This method offers a tool for confirming whether the relationships among the target variables posited by Peruvian agencies and authorities point in the right direction for the fight against chronic malnutrition and stunting.
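As a rough illustration of the two-step FGLS logic in a panel with individual-level heteroskedasticity, the NumPy sketch below simulates invented data (the variable roles merely echo the hemoglobin/weight/height setting) and compares OLS with the reweighted feasible GLS estimate; it is not the paper's specification or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative panel: n children observed over T periods; the regressors stand in
# for weight and height, the outcome for sea-level-adjusted hemoglobin.
n, T = 200, 6
group = np.repeat(np.arange(n), T)
X = rng.normal(size=(n * T, 2))
beta_true = np.array([0.5, -0.3])
sigma_i = rng.uniform(0.5, 2.0, size=n)            # heteroskedasticity across individuals
y = X @ beta_true + rng.normal(scale=sigma_i[group])

X1 = np.column_stack([np.ones(n * T), X])          # add an intercept

# Step 1: OLS to obtain residuals.
beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta_ols

# Step 2: estimate the individual-specific variances and re-estimate by
# weighted least squares with inverse-variance weights (the feasible GLS step).
var_hat = np.array([np.mean(resid[group == i] ** 2) for i in range(n)])
w_sqrt = 1.0 / np.sqrt(var_hat[group])
beta_fgls, *_ = np.linalg.lstsq(X1 * w_sqrt[:, None], y * w_sqrt, rcond=None)

print("OLS: ", beta_ols[1:].round(3))
print("FGLS:", beta_fgls[1:].round(3))
```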
Moran Eigenvector Spatial Filtering (ESF) approaches have shown promise in accounting for spatial effects in statistical models. Can this extend to machine learning? This paper examines the effectiveness of using Moran Eigenvectors as additional spatial features in machine learning models. We generate synthetic datasets with known processes involving spatially varying and nonlinear effects across two different geometries. Moran Eigenvectors calculated from different spatial weights matrices, with and without a priori eigenvector selection, are tested. We assess the performance of popular machine learning models, including Random Forests, LightGBM, XGBoost, and TabNet, and benchmark their accuracies in terms of cross-validated R2 values against models that use only coordinates as features. We also extract coefficients and functions from the models using GeoShapley and compare them with the true processes. Results show that machine learning models using only location coordinates achieve better accuracies than eigenvector-based approaches across various experiments and datasets. Furthermore, we discuss that while these findings are relevant for spatial processes that exhibit positive spatial autocorrelation, they do not necessarily apply when modeling network autocorrelation and cases with negative spatial autocorrelation, where Moran Eigenvectors would still be useful.
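A minimal sketch of the comparison described above, under simplifying assumptions: an invented spatially varying process, a row-standardized inverse-distance weights matrix, and Moran eigenvectors taken from the doubly centered (and symmetrized) weights matrix, fed to a Random Forest and benchmarked against coordinates-as-features via cross-validated R2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic data on a grid with a smooth, spatially varying coefficient
# (a stand-in for the paper's simulated processes, not their exact DGP).
n_side = 25
gx, gy = np.meshgrid(np.linspace(0, 1, n_side), np.linspace(0, 1, n_side))
coords = np.column_stack([gx.ravel(), gy.ravel()])
n = coords.shape[0]
x1 = rng.normal(size=n)
beta_sv = np.sin(2 * np.pi * coords[:, 0]) * np.cos(2 * np.pi * coords[:, 1])
y = beta_sv * x1 + rng.normal(scale=0.3, size=n)

# Row-standardized inverse-distance weights; Moran eigenvectors are the
# eigenvectors of M W M with M = I - 11'/n (W symmetrized first).
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
with np.errstate(divide="ignore"):
    W = np.where(d > 0, 1.0 / d, 0.0)
W /= W.sum(axis=1, keepdims=True)
M = np.eye(n) - np.ones((n, n)) / n
evals, evecs = np.linalg.eigh(M @ ((W + W.T) / 2) @ M)
moran_evecs = evecs[:, np.argsort(evals)[::-1][:20]]   # leading eigenvectors

# Benchmark: coordinates as spatial features vs. Moran eigenvectors.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
X_coords = np.column_stack([x1, coords])
X_esf = np.column_stack([x1, moran_evecs])
print("coords R2:", cross_val_score(rf, X_coords, y, cv=5, scoring="r2").mean().round(3))
print("ESF    R2:", cross_val_score(rf, X_esf, y, cv=5, scoring="r2").mean().round(3))
```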
I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. It falls into place when OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, we equivalently learn optimal encoding and decoding operations for predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.
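The algebraic identity behind this framing can be checked in a few lines: the OLS prediction for a test point equals an inner-product ("attention") weighting of the training outcomes, with the query compared to the training keys in the (X'X)^{-1} metric. The code below only verifies that identity numerically; the attention vocabulary in the comments follows the paper's analogy, not a Transformer implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
x0 = rng.normal(size=k)                           # a single test point (the "query")

# Standard OLS prediction.
beta = np.linalg.solve(X.T @ X, X.T @ y)
yhat_ols = x0 @ beta

# Attention-style rewrite: the weights on the training outcomes ("values") are
# inner products of the query with the training "keys" in the (X'X)^{-1} metric,
# i.e. a linear, softmax-free analogue of an attention layer.
G_inv = np.linalg.inv(X.T @ X)
attn_weights = x0 @ G_inv @ X.T                   # one weight per training point
yhat_attention = attn_weights @ y

print(np.isclose(yhat_ols, yhat_attention))       # True: the two forms coincide
```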
We present four novel tests of equal predictive accuracy and encompassing for out-of-sample forecasts based on factor-augmented regression. We extend the work of Pitarakis (2023a,b) to develop the inferential theory of predictive regressions with generated regressors estimated by Common Correlated Effects (henceforth CCE), a technique that uses cross-sectional averages of grouped series. This is particularly useful as large datasets with such a structure are becoming increasingly common. Under our framework, CCE-based tests are asymptotically normal and robust to overspecification of the number of factors, in stark contrast to existing methodologies in the CCE context. Our tests are highly applicable in practice, as they accommodate different predictor types (e.g., stationary and highly persistent factors) and remain invariant to the location of structural breaks in loadings. Extensive Monte Carlo simulations indicate that our tests exhibit excellent local power properties. Finally, we apply our tests to the novel EA-MD-QD dataset of Barigozzi et al. (2024b), which covers the Euro Area as a whole as well as its primary member countries. We demonstrate that CCE factors offer substantial predictive power even under varying data persistence and structural breaks.
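For intuition only, the sketch below illustrates the core CCE device, namely using cross-sectional averages of a grouped panel as proxies for a latent factor inside a predictive regression. The simulated factor model and the regression are invented stand-ins; the paper's Pitarakis-type test statistics for equal predictive accuracy and encompassing are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative panel with one common factor: x_{it} = lambda_i * f_t + e_{it}.
N, T = 50, 200
f = np.cumsum(rng.normal(scale=0.1, size=T))       # persistent latent factor
lam = rng.uniform(0.5, 1.5, size=N)
X_panel = np.outer(f, lam) + rng.normal(scale=0.5, size=(T, N))

# Target series with a one-step-ahead predictive relation to the factor.
y = np.empty(T)
y[0] = 0.0
y[1:] = 0.8 * f[:-1] + rng.normal(scale=0.2, size=T - 1)

# CCE device: the cross-sectional average of the panel proxies the latent factor.
f_cce = X_panel.mean(axis=1)

# Factor-augmented predictive regression: y_{t+1} = a + b * f_cce_t + u_{t+1}.
Z = np.column_stack([np.ones(T - 1), f_cce[:-1]])
coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
print("slope on the CCE proxy:", coef[1].round(3),
      "| correlation of proxy with true factor:", np.corrcoef(f_cce, f)[0, 1].round(3))
```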
This paper provides a practical introduction to Double/Debiased Machine Learning (DML). DML provides a general approach to performing inference about a target parameter in the presence of nuisance parameters. The aim of DML is to reduce the impact of nuisance parameter estimation on estimators of the parameter of interest. We describe DML and its two essential components: Neyman orthogonality and cross-fitting. We highlight that DML reduces functional form dependence and accommodates the use of complex data types, such as text data. We illustrate its application through three empirical examples that demonstrate DML's applicability in cross-sectional and panel settings.
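A compact sketch of the two components in a partially linear model: residualize the outcome and the treatment with flexible learners (the Neyman-orthogonal partialling-out score) and do so out-of-fold (cross-fitting). The data-generating process and learners below are illustrative choices, not the paper's empirical examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)

# Partially linear model: y = theta*d + g(x) + u,  d = m(x) + v.
n = 2000
X = rng.normal(size=(n, 5))
g = np.sin(X[:, 0]) + X[:, 1] ** 2
m = np.cos(X[:, 2]) + 0.5 * X[:, 3]
d = m + rng.normal(size=n)
theta_true = 1.0
y = theta_true * d + g + rng.normal(size=n)

# Cross-fitting: nuisance functions are fit on one fold and used to
# residualize (partial out) on the held-out fold.
y_res, d_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    ml_y = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], y[train])
    ml_d = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], d[train])
    y_res[test] = y[test] - ml_y.predict(X[test])
    d_res[test] = d[test] - ml_d.predict(X[test])

# Final stage: regress the residualized outcome on the residualized treatment.
theta_hat = (d_res @ y_res) / (d_res @ d_res)
psi = (y_res - theta_hat * d_res) * d_res
se = np.sqrt(np.mean(psi ** 2) / np.mean(d_res ** 2) ** 2 / n)
print(f"theta_hat = {theta_hat:.3f} (true {theta_true}), 95% CI +/- {1.96 * se:.3f}")
```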
Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain shared states, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove an extension of a double machine learning (DML) theorem providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).
Equilibrium effects make it challenging to evaluate the impact of an individual-level treatment on outcomes in a single market, even with data from a randomized trial. In some markets, however, a centralized mechanism allocates goods and imposes useful structure on spillovers. For a class of strategy-proof "cutoff" mechanisms, we propose an estimator for global treatment effects using individual-level data from one market, where treatment assignment is unconfounded. Algorithmically, we re-run a weighted and perturbed version of the mechanism. Under a continuum market approximation, the estimator is asymptotically normal and semi-parametrically efficient. We extend this approach to learn spillover-aware treatment rules with vanishing asymptotic regret. Empirically, adjusting for equilibrium effects notably diminishes the estimated effect of information on inequality in the Chilean school system.
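As a toy illustration (not the paper's estimator) of why equilibrium adjustment matters in cutoff mechanisms, the example below simulates a single program with fixed capacity: a naive treated-control comparison suggests the intervention raises admission chances, while re-running the mechanism under "nobody treated" and "everybody treated" shows that the global effect on admissions is zero because the cutoff moves. All numbers and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12)

# Toy cutoff mechanism: a single program with fixed capacity admits the
# highest-scoring applicants; the admission cutoff is an equilibrium object.
n, capacity = 5000, 1000
scores = rng.normal(size=n)
treat = rng.binomial(1, 0.5, size=n)               # e.g., an information intervention
boost = 0.3                                        # treatment raises application scores

def run_mechanism(scores):
    cutoff = np.sort(scores)[-capacity]            # capacity-th highest score
    return (scores >= cutoff).astype(float), cutoff

# Observed market: only the treated half gets the boost.
adm_obs, cut_obs = run_mechanism(scores + boost * treat)

# Counterfactual re-runs of the mechanism: nobody vs. everybody treated.
adm_none, cut_none = run_mechanism(scores)
adm_all, cut_all = run_mechanism(scores + boost)

naive = adm_obs[treat == 1].mean() - adm_obs[treat == 0].mean()
gate = adm_all.mean() - adm_none.mean()            # zero: capacity is fixed
print(f"cutoffs: none={cut_none:.2f}, observed={cut_obs:.2f}, all={cut_all:.2f}")
print(f"naive treated-control difference: {naive:.3f}, global effect: {gate:.3f}")
```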
Randomized experiments are increasingly employed in two-sided markets, such as buyer-seller platforms, to evaluate treatment effects from marketplace interventions. These experiments must reflect the underlying two-sided market structure in their design (e.g., sellers and buyers), making them particularly challenging to analyze. In this paper, we propose a randomization inference framework to analyze outcomes from such two-sided experiments. Our approach is finite-sample valid under sharp null hypotheses for any test statistic and maintains asymptotic validity under weak null hypotheses through studentization. Moreover, we provide heuristic guidance for choosing among multiple valid randomization tests to enhance statistical power, which we demonstrate empirically. Finally, we demonstrate the performance of our methodology through a series of simulation studies.
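For readers unfamiliar with randomization inference, the sketch below shows the basic Fisher-style construction on a simulated one-sided experiment: under the sharp null, outcomes are held fixed and the test statistic is recomputed over re-draws of the assignment. The paper's contribution concerns the two-sided setting (joint buyer and seller randomization), studentized statistics for weak nulls, and the choice among valid tests, none of which is implemented here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy completely randomized experiment.
n = 500
z_obs = rng.binomial(1, 0.5, size=n)
y_obs = 0.2 * z_obs + rng.normal(size=n)           # modest true effect

def test_stat(y, z):
    return y[z == 1].mean() - y[z == 0].mean()     # difference in means

# Fisher randomization test of the sharp null of no effect for any unit:
# outcomes are held fixed, only the assignment is re-drawn from the design.
t_obs = test_stat(y_obs, z_obs)
t_null = np.array([test_stat(y_obs, rng.permutation(z_obs)) for _ in range(2000)])
p_value = np.mean(np.abs(t_null) >= np.abs(t_obs))
print(f"observed statistic = {t_obs:.3f}, randomization p-value = {p_value:.3f}")
```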
Multidimensional indexes are ubiquitous and popular, but they involve non-negligible normative choices when it comes to attributing weights to their dimensions. This paper provides a more rigorous approach to the choice of weights by defining a set of desirable properties that weighting models should meet. It shows that, among statistical, econometric, and machine learning computational models, Bayesian Networks are the only ones that meet these properties. An example with EU-SILC data illustrates this new approach, highlighting its potential for policy.
This paper introduces the Eigenvalue-Based Randomness (EBR) test - a novel approach rooted in the Tracy-Widom law from random matrix theory - and applies it to the context of residual analysis in panel data models. Unlike traditional methods, which target specific issues like cross-sectional dependence or autocorrelation, the EBR test simultaneously examines multiple assumptions by analyzing the largest eigenvalue of a symmetrized residual matrix. Monte Carlo simulations demonstrate that the EBR test is particularly robust in detecting not only standard violations such as autocorrelation and linear cross-sectional dependence (CSD) but also more intricate non-linear and non-monotonic dependencies, making it a comprehensive and highly flexible tool for enhancing the reliability of panel data analyses.
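A simplified sketch of the statistic's flavor: form a symmetrized matrix from the panel residuals, take its largest eigenvalue, and compare it with its null distribution. For illustration, the null is simulated by Monte Carlo rather than taken from the Tracy-Widom law, and the symmetrization below (a residual covariance matrix) is an assumption of this sketch, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(6)

def largest_eig_stat(resid_matrix):
    """Largest eigenvalue of a symmetrized, standardized residual matrix."""
    R = (resid_matrix - resid_matrix.mean()) / resid_matrix.std()
    S = (R @ R.T) / R.shape[1]                    # symmetrized via the sample covariance
    return np.linalg.eigvalsh(S)[-1]

# Panel residuals: N units, T periods.
N, T = 50, 100
resid_ok = rng.normal(size=(N, T))                           # well-behaved residuals
common = rng.normal(size=T)
resid_csd = rng.normal(size=(N, T)) + 0.7 * common           # cross-sectionally dependent

# Simulated null distribution of the statistic (the paper instead uses the
# Tracy-Widom law after appropriate centering and scaling).
null_draws = np.array([largest_eig_stat(rng.normal(size=(N, T))) for _ in range(500)])
crit = np.quantile(null_draws, 0.95)
print("95% critical value:  ", crit.round(2))
print("stat (iid residuals):", largest_eig_stat(resid_ok).round(2))
print("stat (CSD residuals):", largest_eig_stat(resid_csd).round(2))
```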
An analyst observes an agent take a sequence of actions. The analyst does not have access to the agent's information and ponders whether the observed actions could be justified through a rational Bayesian model with a known utility function. We show that the observed actions cannot be justified if and only if there is a single deviation argument that leaves the agent better off, regardless of the information. The result is then extended to allow for distributions over possible action sequences. Four applications are presented: monotonicity of rationalization with risk aversion, a potential rejection of the Bayesian model with observable data, feasible outcomes in dynamic information design, and partial identification of preferences without assumptions on information.
The conventional linear Phillips curve model, while widely used in policymaking, often struggles to deliver accurate forecasts in the presence of structural breaks and inherent nonlinearities. This paper addresses these limitations by leveraging machine learning methods within a New Keynesian Phillips Curve framework to forecast and explain headline inflation in India, a major emerging economy. Our analysis demonstrates that machine learning-based approaches significantly outperform standard linear models in forecasting accuracy. Moreover, by employing explainable machine learning techniques, we reveal that the Phillips curve relationship in India is highly nonlinear, characterized by thresholds and interaction effects among key variables. Headline inflation is primarily driven by inflation expectations, followed by past inflation and the output gap, while supply shocks, except rainfall, exert only a marginal influence. These findings highlight the ability of machine learning models to improve forecast accuracy and uncover complex, nonlinear dynamics in inflation data, offering valuable insights for policymakers.
We develop a new approach to estimating flexible demand models with exogenous supply-side shocks. Our approach avoids conventional assumptions of exogenous product characteristics, putting no restrictions on product entry, despite using instrumental variables that incorporate characteristic variation. The proposed instruments are model-predicted responses of endogenous variables to the exogenous shocks, recentered to avoid bias from endogenous characteristics. We illustrate the approach in a series of Monte Carlo simulations.
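A stylized sketch of the recentering idea, under invented functional forms: the raw instrument is a model-predicted response of the endogenous variable to the exogenous shock, and recentering subtracts its average over counterfactual shock draws holding characteristics fixed, which removes the correlation with the (possibly endogenous) characteristic.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy setting: an exogenous supply shock with non-zero mean and a (possibly
# endogenous) product characteristic; the response function is invented.
n = 1000
char = rng.normal(size=n)                          # possibly endogenous characteristic
shock = rng.normal(loc=1.0, size=n)                # exogenous supply-side shock

def predicted_response(shock, char):
    # Stand-in for the model-predicted response of the endogenous variable
    # (e.g., price) to the shock; in the paper this comes from the demand model.
    return shock * (1.0 + 0.5 * char)

z_raw = predicted_response(shock, char)

# Recentering: subtract the expected predicted response over counterfactual
# shock draws (here, permutations), holding characteristics fixed.
counterfactuals = np.stack([
    predicted_response(rng.permutation(shock), char) for _ in range(200)
])
z_rc = z_raw - counterfactuals.mean(axis=0)

print("corr(raw IV, characteristic):       ", round(np.corrcoef(z_raw, char)[0, 1], 3))
print("corr(recentered IV, characteristic):", round(np.corrcoef(z_rc, char)[0, 1], 3))
```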
This article introduces Regression Discontinuity Design (RDD) with Distribution-Valued Outcomes (R3D), extending the standard RDD framework to settings where the outcome is a distribution rather than a scalar. Such settings arise when treatment is assigned at a higher level of aggregation than the outcome: for example, when a subsidy is allocated based on a firm-level revenue cutoff while the outcome of interest is the distribution of employee wages within the firm. Since standard RDD methods cannot accommodate such two-level randomness, I propose a novel approach based on random distributions. The target estimand is a "local average quantile treatment effect", which averages across random quantiles. To estimate this target, I introduce two related approaches: one that extends local polynomial regression to random quantiles and another based on local Fréchet regression, a form of functional regression. For both estimators, I establish asymptotic normality and develop uniform, debiased confidence bands together with a data-driven bandwidth selection procedure. Simulations validate these theoretical properties and show existing methods to be biased and inconsistent in this setting. I then apply the proposed methods to study the effects of gubernatorial party control on within-state income distributions in the US, using a close-election design. The results suggest a classic equality-efficiency tradeoff under Democratic governorship, driven by reductions in income at the top of the distribution.
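As a bare-bones illustration of the quantile-by-quantile logic (not the paper's R3D estimator, which adds debiasing, uniform bands, and bandwidth selection), the sketch below computes unit-level empirical quantiles, runs a local linear RDD at each quantile level with a uniform kernel, and averages the jumps; all functional forms are invented.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy setup: units (e.g., firms) with a running variable and, within each unit,
# a sample of micro outcomes (e.g., wages) whose distribution shifts at the cutoff.
n_units, n_micro, cutoff, h = 400, 80, 0.0, 0.3
running = rng.uniform(-1, 1, size=n_units)
treated = (running >= cutoff).astype(float)
taus = np.linspace(0.1, 0.9, 9)

# Unit-level empirical quantiles of the micro outcomes.
Q = np.empty((n_units, taus.size))
for i in range(n_units):
    base = rng.normal(loc=running[i], scale=1.0, size=n_micro)
    micro = base + treated[i] * 0.5 * (1 + base)   # treatment stretches the upper tail
    Q[i] = np.quantile(micro, taus)

def local_linear_rdd(q, r, d, c, h):
    """Local linear fit within bandwidth h on both sides of the cutoff; returns the jump."""
    w = np.abs(r - c) <= h
    X = np.column_stack([np.ones(w.sum()), d[w], r[w] - c, d[w] * (r[w] - c)])
    coef, *_ = np.linalg.lstsq(X, q[w], rcond=None)
    return coef[1]                                  # coefficient on the treatment dummy

# Quantile-by-quantile jumps, then averaged across quantile levels.
jumps = np.array([local_linear_rdd(Q[:, j], running, treated, cutoff, h)
                  for j in range(taus.size)])
print("quantile-level jumps:", jumps.round(2))
print("average across quantiles:", jumps.mean().round(2))
```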
Critical bandwidth (CB) is used to test the multimodality of densities and regression functions, as well as for clustering methods. CB tests are known to be inconsistent if the function of interest is constant ("flat") over even a small interval, and to suffer from low power and incorrect size in finite samples if the function has a relatively small derivative over an interval. This paper proposes a solution, flatness-robust CB (FRCB), that exploits the novel observation that the inconsistency manifests only from regions consistent with the null hypothesis, and thus identifying and excluding them does not alter the null or alternative sets. I provide sufficient conditions for consistency of FRCB, and simulations of a test of regression monotonicity demonstrate the finite-sample properties of FRCB compared with CB for various regression functions. Surprisingly, FRCB performs better than CB in some cases where there are no flat regions, which can be explained by FRCB essentially giving more importance to parts of the function where there are larger violations of the null hypothesis. I illustrate the usefulness of FRCB with an empirical analysis of the monotonicity of the conditional mean function of radiocarbon age with respect to calendar age.
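For context, the sketch below computes the standard critical bandwidth that FRCB builds on: the smallest kernel bandwidth at which a Gaussian KDE has at most k modes, found by bisection (for Gaussian kernels the mode count is monotone in the bandwidth). The flatness-robust exclusion step proposed in the paper is not implemented here.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(9)

def n_modes(data, bw, grid):
    """Number of local maxima of a Gaussian KDE with kernel standard deviation bw."""
    kde = gaussian_kde(data, bw_method=bw / data.std(ddof=1))
    f = kde(grid)
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

def critical_bandwidth(data, k=1, lo=1e-3, hi=None, tol=1e-4):
    """Smallest bandwidth at which the KDE has at most k modes (bisection search)."""
    grid = np.linspace(data.min() - 1, data.max() + 1, 4000)
    hi = hi or 2 * data.std(ddof=1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n_modes(data, mid, grid) <= k:
            hi = mid                                # few enough modes: try smaller
        else:
            lo = mid                                # still too many modes
    return hi

# Bimodal sample: the critical bandwidth for unimodality (k=1) is noticeably large,
# which is what Silverman-type multimodality tests exploit.
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])
print("critical bandwidth (k=1):", round(critical_bandwidth(data, k=1), 3))
print("critical bandwidth (k=2):", round(critical_bandwidth(data, k=2), 3))
```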
A triangular structural panel data model with additively separable individual-specific effects is used to model the causal effect of a covariate on an outcome variable when there are unobservable confounders, some of which are time-invariant. In this setup, a linear reduced-form equation can be problematic when the conditional mean of the endogenous covariate given the instrumental variables is nonlinear: ignoring the nonlinearity may lead to weak instruments. As a solution, we propose a triangular simultaneous equation model for panel data with additively separable individual-specific fixed effects, composed of a linear structural equation and a nonlinear reduced-form equation. The parameter of interest is the structural parameter on the endogenous variable. Identification of this parameter is obtained under the assumption of available exclusion restrictions and via a control function approach. The parameter is estimated with an estimator we call the Super Learner Control Function estimator (SLCFE). The estimation procedure consists of two main steps combined with sample splitting across the individual dimension: we first estimate the control function with a super learner, and then use the estimated control function to control for endogeneity in the structural equation. Monte Carlo simulations show that the Super Learner Control Function estimator significantly outperforms Within 2SLS estimators.
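The sketch below mimics the two-step structure under strong simplifications: a stacking ensemble stands in for the super learner, sample splitting uses group folds across individuals, and fixed effects are removed by within-demeaning before the second-stage regression on the endogenous variable and the estimated control function. The data-generating process and learners are illustrative assumptions, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(10)

# Toy panel: N individuals over T periods, an endogenous covariate d with a
# nonlinear first stage, an excluded instrument z, and additive fixed effects.
N, T = 300, 4
n = N * T
ind = np.repeat(np.arange(N), T)
alpha = rng.normal(size=N)[ind]                    # individual fixed effects
z = rng.normal(size=n)                             # excluded instrument
u = rng.normal(size=n)                             # unobserved confounder
d = np.sin(2 * z) + 0.5 * z ** 2 + 0.8 * u + alpha + rng.normal(scale=0.3, size=n)
y = 1.0 * d + alpha + u + rng.normal(scale=0.3, size=n)   # structural coefficient = 1

def within(v, ind):
    """Demean by individual to remove additive fixed effects."""
    sums = np.zeros(ind.max() + 1)
    np.add.at(sums, ind, v)
    return v - (sums / np.bincount(ind))[ind]

# Step 1: cross-fitted first stage with a stacking ensemble (a stand-in for the
# super learner), splitting across the individual dimension; the out-of-fold
# residual is the estimated control function.
sl = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                ("ridge", RidgeCV())],
    final_estimator=RidgeCV(),
)
Z = z.reshape(-1, 1)
v_hat = np.zeros(n)
for tr, te in GroupKFold(n_splits=5).split(Z, d, groups=ind):
    v_hat[te] = d[te] - sl.fit(Z[tr], d[tr]).predict(Z[te])

# Step 2: within-transformed regression of y on d and the control function.
Xcf = np.column_stack([within(d, ind), within(v_hat, ind)])
coef, *_ = np.linalg.lstsq(Xcf, within(y, ind), rcond=None)
print("estimated structural coefficient on d:", coef[0].round(3))
```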
Probabilistic electricity price forecasting (PEPF) is a key task for market participants in short-term electricity markets. The increasing availability of high-frequency data and the need for real-time decision-making in energy markets require online estimation methods for efficient model updating. We present an online, multivariate, regularized distributional regression model, allowing for the modeling of all distribution parameters conditional on explanatory variables. Our approach combines multivariate distributional regression with an efficient online learning algorithm based on online coordinate descent for LASSO-type regularization. Additionally, we propose to regularize the estimation along a path of increasingly complex dependence structures of the multivariate distribution, allowing for parsimonious estimation and early stopping. We validate our approach through one of the first forecasting studies focusing on multivariate probabilistic forecasting in the German day-ahead electricity market while using only online estimation methods. We compare our approach to online LASSO-ARX models with adaptive marginal distributions and to online univariate distributional models combined with an adaptive copula. We show that the multivariate distributional regression, which allows modeling all distribution parameters - including the mean and the dependence structure - conditional on explanatory variables such as renewable in-feed or past prices, provides superior forecasting performance compared to modeling only the marginals and keeping a static/unconditional dependence structure. Additionally, online estimation yields a speed-up by a factor of 80 to over 400 relative to batch fitting.
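As a small illustration of the online estimation ingredient, the sketch below implements a univariate online LASSO: sufficient statistics are updated recursively with exponential forgetting, and a few coordinate-descent sweeps with soft-thresholding are run per observation. The paper's multivariate distributional regression, path-based regularization of the dependence structure, and electricity-market application are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(11)

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

class OnlineLassoCD:
    """Online LASSO via recursively updated sufficient statistics and coordinate descent."""

    def __init__(self, p, lam, forget=0.99, sweeps=5):
        self.G = np.zeros((p, p))     # discounted running X'X
        self.b = np.zeros(p)          # discounted running X'y
        self.beta = np.zeros(p)
        self.lam, self.forget, self.sweeps = lam, forget, sweeps

    def update(self, x, y):
        # Exponential forgetting of old information, then add the new observation.
        self.G = self.forget * self.G + np.outer(x, x)
        self.b = self.forget * self.b + x * y
        # A few coordinate-descent sweeps, warm-started at the previous solution.
        for _ in range(self.sweeps):
            for j in range(len(self.beta)):
                r_j = self.b[j] - self.G[j] @ self.beta + self.G[j, j] * self.beta[j]
                self.beta[j] = soft_threshold(r_j, self.lam) / (self.G[j, j] + 1e-12)
        return self.beta

# Simulated stream of observations with a sparse coefficient vector.
p, T = 20, 3000
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.8]
model = OnlineLassoCD(p, lam=8.0)
for t in range(T):
    x_t = rng.normal(size=p)
    y_t = x_t @ beta_true + rng.normal(scale=0.5)
    beta_hat = model.update(x_t, y_t)

print("estimates for the three true nonzeros:", beta_hat[:3].round(2))
print("largest absolute estimate elsewhere:  ", np.abs(beta_hat[3:]).max().round(3))
```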
In this work, we are interested in studying the causal effect of an endogenous binary treatment on a dependently censored duration outcome. By dependent censoring, it is meant that the duration time ($T$) and right censoring time ($C$) are not statistically independent of each other, even after conditioning on the measured covariates. The endogeneity issue is handled by making use of a binary instrumental variable for the treatment. To deal with the dependent censoring problem, it is assumed that on the stratum of compliers: (i) $T$ follows a semiparametric proportional hazards model; (ii) $C$ follows a fully parametric model; and (iii) the relation between $T$ and $C$ is modeled by a parametric copula, such that the association parameter can be left unspecified. In this framework, the treatment effect of interest is the complier causal hazard ratio (CCHR). We devise an estimation procedure based on a weighted maximum likelihood approach, where the weights are the probabilities of an observation coming from a complier. The weights are estimated non-parametrically in a first stage, followed by the estimation of the CCHR. Novel conditions under which the model is identifiable are given, a two-step estimation procedure is proposed and some important asymptotic properties are established. Simulations are used to assess the validity and finite-sample performance of the estimation procedure. Finally, we apply the approach to estimate the CCHR of job training programs on unemployment duration and of periodic screening examinations on time until death from breast cancer. The data come from the National Job Training Partnership Act study and the Health Insurance Plan of Greater New York experiment, respectively.