The random forest technique is a machine learning method that improves on the decision tree approach with two key adjustments. A traditional decision tree is a hierarchical set of questions that partitions the predictor space into discrete values, with the questions determined through an exhaustive search of the predictors in a training set. Traditional decision trees are very sensitive to small changes in the training data and can easily overfit. Random forests address this problem in two ways: an ensemble of trees is trained on resampled versions of the original training set to capture the variability of the data, and the questions in each tree are selected from a random subset of the predictors. The random forest model produces more accurate predictions, is less likely to overfit, and accounts for nonlinear interactions among predictor variables. It also trains quickly and has built-in variable selection.
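A minimal sketch of the two adjustments, using scikit-learn; the library choice, parameter values, and synthetic data are illustrative assumptions, not the presentation's configuration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                  # 500 samples, 10 predictors
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# A single decision tree: exhaustive search over all predictors at each split.
tree = DecisionTreeRegressor().fit(X, y)

# A random forest: an ensemble of trees, each grown on a bootstrap resample
# of the training set (bootstrap=True) and splitting on a random subset of
# the predictors at each node (max_features).
forest = RandomForestRegressor(
    n_estimators=200,      # number of trees in the ensemble
    max_features="sqrt",   # predictors considered per split
    bootstrap=True,        # resample the training set for each tree
).fit(X, y)
```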
The random forest technique was used to directly predict power production for a group of wind farms in Texas from wind forecasts at wind turbine hub height, typically 80 m above ground level. The technique was applied to day-ahead forecasts for individual wind farms, and forecast performance was examined both for individual wind farms and for regional aggregates of wind farms. A large set of model state variables, including wind speed, temperature, and geopotential height at various levels, was used to train the random forest.
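A hedged sketch of this training setup; the column names and the idealized power curve are hypothetical stand-ins for the model state variables and observed farm output named above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000  # forecast hours in the training sample
nwp = pd.DataFrame({
    "wind_speed_80m": rng.uniform(0, 25, n),        # hub-height wind speed (m/s)
    "temp_850hPa": rng.normal(280, 8, n),           # temperature aloft (K)
    "geopotential_500hPa": rng.normal(5600, 80, n), # geopotential height (m)
})
# Idealized power curve standing in for observed power production.
power = np.clip((nwp["wind_speed_80m"] - 3) / 9, 0, 1)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(nwp, power)               # train on the NWP state variables
print(model.feature_importances_)   # built-in variable selection signal
```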
For this presentation, results using the random forest were compared with two linear regression-based model output statistics (MOS) approaches: one trained on observed power production and a second trained on observed wind speed. Both approaches were run using training samples of different sizes. Initial results show that random forest forecasts decrease forecast mean absolute error when given a sufficiently large and diverse training set. Adding NWP forecast variables beyond those derived from 80-m hub height wind speed also improved the forecast. The presentation will highlight the sensitivity of forecast performance to the number of decision trees, the number of variables per node, training sample size, variable selection, and the use of regime-based predictors, as well as a comparison to a screening multiple linear regression approach with a similar sample size.
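A minimal sketch of the mean absolute error comparison between a random forest and a linear regression baseline; the synthetic data, split, and model settings are illustrative assumptions, not the presentation's experiment:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 25, size=(2000, 3))                # NWP predictors
y = np.clip((X[:, 0] - 3) / 9, 0, 1) + 0.05 * rng.normal(size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

mos = LinearRegression().fit(X_tr, y_tr)              # regression-based baseline
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("Linear MAE:", mean_absolute_error(y_te, mos.predict(X_te)))
print("Forest MAE:", mean_absolute_error(y_te, rf.predict(X_te)))
```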