Exploring Potential Bias in Data Selection & Processing for Artificial Intelligence in Environmental Sciences
Authors:
Haley Perez
Amy McGovern
Ann Bostrom
David Gagne
Abstract:
Artificial Intelligence (AI) is an emerging tool for the environmental sciences, with growing potential to improve weather prediction and preparedness. As these technologies are integrated, it is important to be cognizant of the potential for bias, which can affect both AI model training and deployment. Within the context of our work, we define bias as any effect that systematically distorts the statistical representativeness of a model and its outputs, thereby skewing its reliability.
This work extends our previous work (McGovern et al., 2023), which identified four major categories of bias that can be present in AI used for environmental sciences: Systemic and Structural bias, Data bias, Statistical and Computational bias, and Human bias. These categories are not mutually exclusive; rather, each category most likely influences and interacts with the others. Here, we take a deeper look at bias introduced by data, where the selection of available data or the methods used in processing can distort the statistical representativeness of a resulting model. In addition to the subcategories of data selection and data processing, there is the potential for a model to produce output that violates the laws of physics.
The investigation into data bias was conducted with ERA5 reanalysis data fed into a static U-Net model tuned for frontal analysis (Justin et al., 2023) over the continental United States (CONUS). For data selection, we examined oversampling, undersampling, and regional sampling. For data processing, we applied varying degrees of augmentation that adjusted the width of labels, varied the rotation of images, and introduced noise, both physically realistic and random. All outcomes were evaluated by comparing Critical Success Index (CSI) scores, together with performance and reliability diagrams.
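As an illustration of the evaluation metric named above, the following is a minimal sketch of how CSI and the related quantities plotted on a performance diagram (probability of detection, success ratio, and frequency bias) can be computed from binary masks. It assumes a simple pixel-by-pixel comparison of a thresholded prediction against a label mask; the function name `contingency_scores` and the toy grids are hypothetical, and the sketch is not the evaluation code used in this study, which may, for example, allow a spatial tolerance when matching fronts.

```python
from itertools import chain
from math import nan


def contingency_scores(pred_rows, truth_rows):
    """Pixelwise contingency-table scores for two binary 2D masks.

    Both arguments are lists of rows containing 0/1 (or False/True) values.
    This is an illustrative helper, not part of any published code base.
    """
    pred = [bool(v) for v in chain.from_iterable(pred_rows)]
    truth = [bool(v) for v in chain.from_iterable(truth_rows)]
    if len(pred) != len(truth):
        raise ValueError("prediction and truth masks must have the same size")

    hits = sum(p and t for p, t in zip(pred, truth))
    false_alarms = sum(p and not t for p, t in zip(pred, truth))
    misses = sum((not p) and t for p, t in zip(pred, truth))

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a score is undefined.
        return numerator / denominator if denominator > 0 else nan

    return {
        "pod": ratio(hits, hits + misses),                      # probability of detection
        "success_ratio": ratio(hits, hits + false_alarms),      # 1 - false alarm ratio
        "csi": ratio(hits, hits + false_alarms + misses),       # critical success index
        "frequency_bias": ratio(hits + false_alarms, hits + misses),
    }


# Toy 2 x 4 grids: 1 marks a grid cell labeled (or predicted) as a front.
truth = [[0, 1, 1, 0],
         [0, 0, 1, 0]]
pred = [[0, 1, 0, 0],
        [1, 0, 1, 0]]

scores = contingency_scores(pred, truth)
print(scores)
```

In this toy case there are two hits, one false alarm, and one miss, so CSI is 2 / (2 + 1 + 1) = 0.5. Comparing such scores across models trained with different sampling or augmentation choices is one way a shift in performance could be quantified.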
References:
Justin, A. D., C. Willingham, A. McGovern, and J. T. Allen, 2023: Toward Operational Real-time Identification of Frontal Boundaries Using Machine Learning. Artif. Intell. Earth Syst., 2, e220052, https://doi.org/10.1175/AIES-D-22-0052.
McGovern, A., A. Bostrom, D. J. Gagne, I. Ebert-Uphoff, K. Musgrave, M. McGraw, and R. Chase, 2023: Classifying and Addressing Bias in AI/ML for the Earth Sciences. Presented at the 22nd Conference on Artificial Intelligence for Environmental Science, American Meteorological Society Annual Meeting, Denver, CO, 8-12 January 2023.

