July 2017 at the Ontario NYSM site was selected as the case study for designing and validating the algorithm because wind turbines are installed near that site. When the month of training data is stacked by time of day, each timestep has a distribution of wind speeds across the days of the month, and this distribution forms the backend of the algorithm. The mean, μ, and the standard deviation, σ, of each distribution were calculated to define lower and upper limits of μ − σ and μ + σ, respectively. To generate a new dataset of wind speeds that is independent of the input data, values were stochastically sampled from within this range. The algorithm “learns” as more data is funneled into its training set: the most probable wind speed for each timestep becomes more identifiable because the Gaussian standard error of the mean is σ/√n, which shrinks as the training sample size n grows.
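
The sampling step described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: the uniform draw within [μ − σ, μ + σ], the clipping at zero, the 288 timesteps per day, and the synthetic Weibull training data are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def fit_timestep_stats(speeds):
    """Per-timestep mean and standard deviation.

    speeds: array of shape (n_days, n_timesteps), one row per day.
    """
    mu = speeds.mean(axis=0)
    sigma = speeds.std(axis=0, ddof=1)  # sample standard deviation
    return mu, sigma

def sample_day(mu, sigma, rng):
    """Draw one synthetic day, one value per timestep.

    Assumption: values are drawn uniformly from [mu - sigma, mu + sigma];
    the source does not state the sampling distribution.
    """
    lower = np.clip(mu - sigma, 0.0, None)  # wind speed cannot be negative
    upper = mu + sigma
    return rng.uniform(lower, upper)

rng = np.random.default_rng(0)

# Hypothetical training set: 31 days x 288 timesteps of synthetic speeds (m/s).
training = 4.0 * rng.weibull(2.0, size=(31, 288))

mu, sigma = fit_timestep_stats(training)
predicted = sample_day(mu, sigma, rng)

# Standard error of each timestep mean: sigma / sqrt(n), shrinking as n grows.
standard_error = sigma / np.sqrt(training.shape[0])
```

Repeating `sample_day` many times gives an ensemble of synthetic days, which is one way a set of repeated simulation runs could be produced.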

The energy that may be generated in the future is a function of wind speed, air density, and wind turbine parameters, including efficiency; the wind speed inputs are those estimated by the stochastic sampling algorithm. NYSM wind speeds are measured at 10 m, so the predictions were extrapolated to estimate 100 m wind speeds. Turbine hub heights are traditionally closer to 80 m, but the validation dataset currently in use consists of LiDAR wind speeds measured at 100 m. For this study, empirical engineering and efficiency data for a DOE/GE 1.5 MW turbine were retrieved from NREL and used as the basis for the energy conversion calculations.
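
A rough sketch of the height extrapolation and energy conversion is given below. The text does not say which extrapolation was used, so a power-law wind profile with exponent 1/7 is assumed here. The constant power coefficient, rotor diameter, air density, and cut-in/cut-out speeds are illustrative placeholders standing in for the empirical NREL data, and the function names are hypothetical.

```python
import numpy as np

def extrapolate_power_law(v_ref, z_ref=10.0, z_target=100.0, alpha=1.0 / 7.0):
    """Scale wind speed from z_ref to z_target with a power-law profile.

    Assumption: alpha = 1/7, a common neutral-stability default.
    """
    return np.asarray(v_ref, dtype=float) * (z_target / z_ref) ** alpha

def turbine_power_w(v, rho=1.225, rotor_diameter=77.0, cp=0.40,
                    rated_w=1.5e6, cut_in=3.5, cut_out=25.0):
    """Electrical power (W) from hub-height wind speed (m/s).

    All default parameters are placeholder values, not the NREL data.
    """
    v = np.asarray(v, dtype=float)
    area = np.pi * (rotor_diameter / 2.0) ** 2   # swept rotor area, m^2
    power = 0.5 * rho * area * v ** 3 * cp       # P = 1/2 * rho * A * v^3 * Cp
    power = np.minimum(power, rated_w)           # cap at rated power
    # No output below cut-in or above cut-out.
    return np.where((v >= cut_in) & (v <= cut_out), power, 0.0)

def energy_kwh(power_w, dt_hours):
    """Integrate a power series (W) sampled every dt_hours into kWh."""
    return float(np.sum(power_w) * dt_hours / 1000.0)

v10 = np.array([2.0, 4.0, 6.0, 8.0])             # hypothetical 10 m speeds
v100 = extrapolate_power_law(v10)
energy = energy_kwh(turbine_power_w(v100), dt_hours=5.0 / 60.0)
```

In practice the constant `cp` would be replaced by a lookup into the turbine's measured power or efficiency curve.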

For model validation, 100 m LiDAR data were compared with the machine learning output wind speeds, a constant persistence model, and a ten-point integrated persistence model over 5,000 simulation runs. The predicted wind speeds showed the closest agreement with the validation dataset (RMSE = 1.75, MAE = 1.52) and, in turn, with expected energy outputs; the ten-point integrated persistence model and the constant persistence model had errors of RMSE = 1.80, MAE = 1.62 and RMSE = 1.83, MAE = 1.65, respectively.
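
The error metrics and the two baselines can be sketched as below. The persistence definitions are assumptions, since the text does not define them: the constant model is taken to hold the last observed value, and the ten-point integrated model to hold the mean of the last ten observations. All function names are hypothetical.

```python
import numpy as np

def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mae(pred, obs):
    """Mean absolute error between predictions and observations."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    return float(np.mean(np.abs(pred - obs)))

def constant_persistence(history, horizon):
    """Assumed baseline: repeat the last observed value over the horizon."""
    return np.full(horizon, history[-1], dtype=float)

def integrated_persistence(history, horizon, window=10):
    """Assumed baseline: repeat the mean of the last `window` observations."""
    return np.full(horizon, np.mean(history[-window:]), dtype=float)

# Hypothetical usage with synthetic data in place of the LiDAR record.
rng = np.random.default_rng(1)
history = 6.0 + rng.normal(0.0, 1.0, size=50)
observed = 6.0 + rng.normal(0.0, 1.0, size=12)

scores = {
    "constant": (rmse(constant_persistence(history, 12), observed),
                 mae(constant_persistence(history, 12), observed)),
    "ten-point": (rmse(integrated_persistence(history, 12), observed),
                  mae(integrated_persistence(history, 12), observed)),
}
```

Lower RMSE and MAE indicate closer agreement with the validation data, so the models are ranked by ascending error.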