Predictive Modeling in Environmental Science with Machine Learning Algorithms

Erfani, Ehsan

This study uses large, remotely sensed datasets of environmental variables as inputs to machine learning (ML) models that predict wildfires. We will employ ensemble ML, which combines the outputs of multiple models, to obtain a more accurate and stable prediction than any single model provides. Our preliminary analyses show that, among ensemble ML techniques, Random Forest (RF) has strong predictive power and captures both linear and non-linear relationships, which makes it well suited to applications in environmental sciences. An RF model builds multiple decision trees: each tree is trained on a random subset of the data, and a random subset of the available features is considered at each node of the tree. Once the trees are constructed, their individual predictions are combined using either a majority voting scheme (for classification tasks) or an averaging scheme (for regression tasks).
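The sampling and aggregation steps of RF described above can be illustrated with a minimal Python sketch. It does not train real decision trees, and it is not the implementation used in this study. All function names, the toy labels, and the toy values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the data subset for one tree)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate feature indices for one node split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(labels):
    """Classification: return the most common label among the trees."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(values)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
rows = list(range(10))          # hypothetical row indices of a dataset
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=8, k=3, rng=rng)

# Hypothetical outputs of individual trees for a single location.
tree_votes = ["fire", "no fire", "fire", "fire", "no fire"]
tree_values = [0.8, 0.6, 0.7, 0.9]

print(majority_vote(tree_votes))   # label chosen by most trees
print(average(tree_values))        # mean of the trees' predictions
```

In practice a library implementation would handle tree construction; the sketch only shows where the randomness enters and how the individual predictions are combined.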

Moreover, we will develop additional ML algorithms and identify the best-performing model based on the evaluation metrics. These include Extreme Gradient Boosting, Lasso Regression, Support Vector Machines, Deep Neural Networks, and Adaptive Boosting. The ML models will be trained on the training dataset, and predictions will be evaluated on the independent test dataset. Partitioning the data is one of the fundamental steps in ML analyses. It is critical because a model trained on a given dataset may overfit: it makes highly accurate predictions on that particular dataset but fails to generalize to new data, leading to poor performance in real-world scenarios. To reduce the risk of overfitting, we will employ a well-established statistical method known as k-fold cross-validation. Finally, to evaluate model performance, several statistical metrics commonly used in environmental sciences will be computed from the collective results of the k-fold cross-validation procedure, serving as indicators of each model's predictive capabilities.
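The k-fold cross-validation procedure can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the study's pipeline: the mean-value baseline stands in for any of the ML models, the toy data are synthetic, and RMSE and MAE are assumed here as examples of metrics, since the specific metrics are not named above. All function names are hypothetical.

```python
import math
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean_baseline(X_train, y_train):
    """Placeholder model (assumption): remember the mean training target."""
    return mean(y_train)


def predict_mean_baseline(model, x):
    """Placeholder prediction: return the stored mean, ignoring the features."""
    return model


def out_of_fold_predictions(X, y, k, seed=0):
    """Predict every sample with a model that never saw it during training."""
    preds = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit_mean_baseline([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = predict_mean_baseline(model, X[i])
    return preds


def rmse(y_true, y_pred):
    """Root-mean-square error over the pooled predictions."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def mae(y_true, y_pred):
    """Mean absolute error over the pooled predictions."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


# Synthetic toy data (hypothetical): one feature, ten samples.
X = [[float(i)] for i in range(10)]
y = [0.5 * i for i in range(10)]

folds = k_fold_indices(len(y), k=5)
preds = out_of_fold_predictions(X, y, k=5)
print("RMSE:", rmse(y, preds))
print("MAE:", mae(y, preds))
```

Pooling the out-of-fold predictions before computing the metrics is one way to read "collective results"; averaging per-fold metrics is a common alternative.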
