Tuesday, 30 January 2024: 4:30 PM
Holiday 6 (Hilton Baltimore Inner Harbor)
Reforecast datasets improve weather forecasts by rerunning a fixed version of a prediction model over historical cases, making it possible to identify and correct systematic biases. They are used to validate and calibrate forecast models, to build the model climatologies needed to calibrate real-time forecasts, and to improve the prediction of extreme events. With reforecast-based calibration, forecasters have more accurate guidance, leading to better preparedness for hazardous weather conditions.

In deep learning, dataset size plays a central role in determining model performance. Larger datasets offer a broader range of samples and variations, enabling a model to learn more complex patterns and generalize better to new data; the result is a more robust, less overfitting-prone model that captures underlying structure rather than memorizing the training data. However, increasing the dataset size does not always yield proportional improvements, and returns may diminish. Data quality is as crucial as quantity, and the data should accurately represent real-world scenarios. The balance between dataset size and quality must therefore be weighed against practical limitations such as computational resources and time constraints.

The release of GEFSv12 by NOAA NCEP is a significant development for users seeking sub-seasonal forecasts and for hydrological applications. The system provides access to reanalysis and reforecast data covering 2000-2019, a valuable historical record. Five ensemble members are available each day from 00 UTC initial conditions out to a 16-day lead time; on Wednesdays, 11 members are integrated out to 35 days, supporting sub-seasonal guidance. These data are expected to enhance the accuracy and reliability of weather and climate predictions, benefiting a wide range of industries and sectors.

The proposed study investigates the impact of sample size on the development of deep-learning models for sub-seasonal prediction of rainfall and temperature extremes over CONUS, using the GEFSv12 reforecast products. The results will be compared with benchmark post-processing methods for training periods of varying length (2000-2015, 2005-2015, 2010-2015, and 2014-2015), with 2016-2017 used for validation and 2018-2019 for testing. The goal is to establish the minimum data requirement for a deep-learning model that can effectively generalize to new weather and climate patterns. Because deep-learning models are computationally intensive to train and deploy, determining the optimal sample size can strike a balance between model accuracy and computational efficiency, enabling faster and more cost-effective predictions. Understanding the sensitivity to sample size can also guide data collection efforts toward essential acquisitions without unnecessary redundancy.
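
As a concrete illustration of the reforecast availability described above, the following minimal Python sketch (using pandas; not part of the study's codebase) builds the 2000-2019 initialization calendar and flags the extended Wednesday 00 UTC runs that reach sub-seasonal lead times. The member counts and lead-time limits follow the figures quoted above; the calendar construction itself is a hypothetical example.

```python
import pandas as pd

# GEFSv12 reforecast initialization calendar, 2000-2019 (00 UTC daily):
# 5 ensemble members out to 16 days every day, extended to 11 members
# and 35 days on Wednesdays.
init_dates = pd.date_range("2000-01-01", "2019-12-31", freq="D")

calendar = pd.DataFrame({"init": init_dates})
calendar["is_wednesday"] = calendar["init"].dt.dayofweek == 2  # Monday=0, Wednesday=2
calendar["n_members"] = calendar["is_wednesday"].map({True: 11, False: 5})
calendar["max_lead_days"] = calendar["is_wednesday"].map({True: 35, False: 16})

# Sub-seasonal (beyond 16-day) targets can only be built from the Wednesday runs.
extended_runs = calendar[calendar["is_wednesday"]]
print(f"{len(extended_runs)} extended 35-day initializations out of {len(calendar)} total")
```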

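The experimental design compares training periods of different lengths against fixed validation (2016-2017) and test (2018-2019) windows. The sketch below uses placeholder NumPy arrays in place of the actual reforecast-derived predictors and targets to show one way such year-based splits could be generated; the array names (X, y, years) and the synthetic data are assumptions for illustration only.

```python
import numpy as np

# Placeholder arrays standing in for assembled reforecast samples:
# 'years' holds each sample's initialization year, X the predictors, y the target.
rng = np.random.default_rng(0)
n_samples = 500
years = rng.integers(2000, 2020, size=n_samples)   # 2000-2019 inclusive
X = rng.normal(size=(n_samples, 8))                 # hypothetical predictor features
y = rng.normal(size=n_samples)                      # hypothetical extreme-event target

def split_by_years(train_start):
    """Mask-based train/validation/test split for one training-length experiment."""
    train = (years >= train_start) & (years <= 2015)
    val = (years >= 2016) & (years <= 2017)
    test = (years >= 2018) & (years <= 2019)
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# The four training-period lengths compared against the fixed validation/test windows.
for train_start in (2000, 2005, 2010, 2014):
    (X_tr, y_tr), (X_va, y_va), (X_te, y_te) = split_by_years(train_start)
    print(f"{train_start}-2015: train={len(y_tr)}, val={len(y_va)}, test={len(y_te)}")
```
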
