Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an AI Weather Prediction Model Against a High-Resolution NWP Model
Abstract
Recent advances in artificial intelligence have led to the development of AI weather prediction (AIWP) models whose forecast skill is competitive with that of traditional numerical weather prediction (NWP) models, at substantially reduced computational cost. There is a strong need for appropriate methods to evaluate their ability to predict extreme weather events, particularly when spatial coherence is important and grid resolutions differ between models. We introduce a verification framework that combines spatial verification methods with proper scoring rules. Specifically, the framework extends the High-Resolution Assessment (HiRA) approach with threshold-weighted scoring rules. It enables user-oriented evaluation consistent with how forecasts may be interpreted by operational meteorologists or used in simple post-processing systems. The method supports targeted evaluation of extreme events by allowing flexible weighting of the relative importance of different decision thresholds. We demonstrate this framework by evaluating 32 months of precipitation forecasts from an AIWP model and a high-resolution NWP model. Our results show that model rankings are sensitive to the choice of neighbourhood size. Increasing the neighbourhood size has a greater impact on scores that evaluate extreme-event performance for the high-resolution NWP model than for the AIWP model. At equivalent neighbourhood sizes, the high-resolution NWP model outperformed the AIWP model in predicting extreme precipitation events only at short lead times. We also demonstrate how this approach can be extended to evaluate discrimination ability for heavy precipitation. We find that the high-resolution NWP model had superior discrimination ability at short lead times, while the AIWP model had slightly better discrimination ability from a lead time of 24 hours onwards.