Introduction to Development of Meteorological Big Data Preprocessing Technology for Artificial Intelligence Training Data Production

Park, Junsang

Meteorological data combined with artificial intelligence (AI) technology has immense potential for a variety of applications, including weather forecasting, climate analysis, and disaster prediction. The Korea Meteorological Administration produces 20 million observation data items and 160 thousand model data items per day. For AI experts to use this big data effectively, substantial effort is required, including convergence technology, domain knowledge, and an understanding of meteorological phenomena. Our objective is to develop a comprehensive, universally applicable data preprocessing technology that alleviates the challenges faced by developers. Data preprocessing for AI training is the process of preparing raw data and making it suitable for an AI model. It involves three tasks: data cleaning, data integration, and data transformation.
The preprocessing procedure was divided into two categories: observational data (AWS, radar, and satellite) and model data (UM, KIM, ECMWF).
AWS data managed by the Korea Meteorological Administration and three local governments are integrated through quality control. Merging the AWS precipitation data with Hybrid Surface Rainfall (HSR) radar rain rate data yields comprehensive rainfall Ground Truth (GT) data that also covers ocean areas. In addition, a Quality Control (QC) process was applied to the WISSDOM data, which facilitated the derivation of three-dimensional wind field information.
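The abstract does not state how the AWS and HSR data are merged. As a minimal sketch only, the following assumes a simple overlay rule: the radar rain rate is the background field, and any grid cell containing a quality-controlled gauge takes the gauge value. The function name `merge_gauge_radar`, the use of NaN to mark gauges rejected by QC, and the toy values are all hypothetical and are not the authors' method.

```python
import numpy as np

def merge_gauge_radar(radar_rr, gauge_rows, gauge_cols, gauge_rr):
    """Overlay gauge rain rates on a radar rain-rate grid (illustrative rule only).

    radar_rr   : 2-D array of radar rain rate (mm/h), NaN where unobserved
    gauge_rows : grid row index of each gauge
    gauge_cols : grid column index of each gauge
    gauge_rr   : gauge rain rate (mm/h), NaN if the gauge failed QC (assumption)
    """
    merged = np.array(radar_rr, dtype=float, copy=True)  # leave the input untouched
    gauge_rows = np.asarray(gauge_rows)
    gauge_cols = np.asarray(gauge_cols)
    gauge_rr = np.asarray(gauge_rr, dtype=float)

    valid = np.isfinite(gauge_rr)  # skip gauges without a usable value
    merged[gauge_rows[valid], gauge_cols[valid]] = gauge_rr[valid]
    return merged

# Toy 2 x 3 grid: one radar gap (NaN) and three gauges, one of them rejected.
radar = np.array([[0.0, 1.0, 2.0],
                  [0.5, np.nan, 3.0]])
gt = merge_gauge_radar(radar, [0, 1, 1], [1, 1, 2], [1.4, 2.2, np.nan])
```

In this toy case the gauge fills the radar gap at cell (1, 1), while the cell whose gauge was rejected keeps its radar value. An operational merge would more likely use a bias adjustment or interpolation scheme rather than a direct overlay.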
Using AI, COMS satellite channel data were used to produce satellite information for the gap times of the GK2A satellite, and satellite data were used to generate radar (HSR) reflectivity information in unobserved regions. Throughout all of these phases, area and grid information was unified to make the heterogeneous data easier to use together. For the model data, the spatial resolution was standardized across all three models (UM, KIM, ECMWF). In addition, a set of rainfall training data was generated by applying clustering techniques, which categorized the data into five distinct rainfall patterns.
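The abstract names neither the regridding scheme nor the clustering algorithm. The sketch below is one possible reading under stated assumptions: fields are brought to a common coarser grid by block averaging, and the coarsened fields are grouped into five clusters with plain k-means (Lloyd's algorithm) using a deterministic farthest-point initialization. The helper names `block_mean`, `farthest_point_init`, and `kmeans`, and the synthetic 4 x 4 rainfall fields, are hypothetical illustrations.

```python
import numpy as np

def block_mean(field, factor):
    """Coarsen a 2-D field by averaging non-overlapping factor x factor blocks."""
    ny, nx = field.shape
    if ny % factor or nx % factor:
        raise ValueError("grid size must be divisible by factor")
    return field.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

def farthest_point_init(samples, k):
    """Pick k starting centroids: the first sample, then repeatedly the sample
    farthest from all centroids chosen so far."""
    centroids = [samples[0]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        dist = np.linalg.norm(samples[:, None, :] - chosen[None, :, :], axis=2)
        centroids.append(samples[dist.min(axis=1).argmax()])
    return np.array(centroids, dtype=float)

def kmeans(samples, k=5, n_iter=20):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_point_init(samples, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = samples[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data: five base rainfall patterns on a 4 x 4 grid, three noisy copies each.
rng = np.random.default_rng(0)
base = np.zeros((5, 4, 4))
base[0, :2, :2] = 10.0   # rain in the upper-left quadrant
base[1, :2, 2:] = 10.0   # upper-right
base[2, 2:, :2] = 10.0   # lower-left
base[3, 2:, 2:] = 10.0   # lower-right
base[4] = 5.0            # uniform rain everywhere
fields = np.repeat(base, 3, axis=0) + rng.uniform(0.0, 0.1, size=(15, 4, 4))

# Step 1: bring every field to the same 2 x 2 grid. Step 2: cluster into five patterns.
coarse = np.stack([block_mean(f, 2) for f in fields])
labels, centroids = kmeans(coarse.reshape(len(coarse), -1), k=5)
```

With these well-separated synthetic patterns, the three copies of each base pattern end up in the same cluster. Real rainfall fields would need feature scaling and a check that five clusters suit the data.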
All preprocessed data are saved in NetCDF format. A more universally adaptable file format composition should be considered in order to standardize preprocessing for AI training on meteorological big data.

3A.1 Introduction to Development of Meteorological Big Data Preprocessing Technology for Artificial Intelligence Training Data Production