Support vector regression with hyper-parameters found using genetic algorithm and predictors screened using partial mutual information

Jenkner, Johannes

Missing values in the provided dataset are imputed by means of regularized expectation maximization (RegEM). In the algorithm used, each variable is first Gaussianized with a Box-Cox transformation. Noise in the data is then filtered using a continuous regularization parameter that is determined by cross-validation. Finally, a sequence of linear regressions iteratively estimates the missing values. RegEM is run twice to account for the differing availability of the air quality variables and the meteorological variables. A pool of predictor variables is created for both measuring stations together, with the air quality and the meteorological variables lagged by one to twenty-four hours plus the meteorological variables at the target time. The lagged setup automatically picks the second, third and fourth day within the provided six-day cycles for the training of the prediction method. In addition to the provided variables, the temporal cycles of a day, a week and the whole year are included in the pool of predictors. Subsequently, all predictors are screened by a forward selection method based on partial mutual information (PMI). Relevant predictors are selected sequentially based on their mutual information with the target data and with the previously selected predictors. The selection is terminated as soon as the PMI reaches zero for all the remaining predictors in the pool. The artificial intelligence method used here is support vector regression (SVR). The original data are mapped into a high-dimensional "feature space" where the inner product is defined in terms of a kernel function. In the current approach, the radial basis kernel function is used. Three hyper-parameters have to be set for SVR: the kernel width, the penalty parameter in the cost function and the width of the epsilon-insensitive zone in the error norm. In the current approach, they are set with a genetic algorithm whose initial generation is seeded with an analytically determined parameter combination. After the best hyper-parameters are found, the SVR is run separately for the two measuring stations. The final predictions are made sequentially, i.e. the twenty-four hours of the target days are predicted one after the other.
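
The predictor pool described above can be illustrated with a minimal sketch. It assumes a single air quality series and a single meteorological series at hourly resolution, and it encodes the daily, weekly and yearly cycles as sine/cosine pairs. The abstract does not state how the cycles are encoded, so that choice, along with the names `cyclic_features` and `build_predictor_row`, is hypothetical.

```python
import math
from datetime import datetime, timedelta


def cyclic_features(t):
    """Encode the daily, weekly and yearly cycles as sine/cosine pairs.

    The sine/cosine encoding is an assumption made for this sketch.
    """
    angles = [
        2 * math.pi * t.hour / 24,                           # cycle of a day
        2 * math.pi * t.weekday() / 7,                       # cycle of a week
        2 * math.pi * (t.timetuple().tm_yday - 1) / 365.25,  # cycle of a year
    ]
    features = []
    for angle in angles:
        features.extend([math.sin(angle), math.cos(angle)])
    return features


def build_predictor_row(times, air_quality, meteo, i, max_lag=24):
    """Candidate predictors for the target at index i.

    Contains both series lagged by 1..max_lag hours, the meteorological
    value at the target time, and the cyclic time features.
    """
    if i < max_lag:
        raise ValueError("not enough history for the requested lags")
    row = []
    for lag in range(1, max_lag + 1):
        row.append(air_quality[i - lag])
        row.append(meteo[i - lag])
    row.append(meteo[i])                   # meteorology at the target time
    row.extend(cyclic_features(times[i]))  # day, week and year cycles
    return row
```

With 24 lags of two series, one value at the target time and six cyclic features, each row holds 55 candidate predictors, which a screening step such as PMI-based forward selection would then reduce.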

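The seeded hyper-parameter search can be sketched as a small elitist genetic algorithm over the three SVR hyper-parameters in log10 space. This is not the author's implementation: the operators (uniform crossover, Gaussian mutation, elitism), the bounds, the seed values and the placeholder objective `toy_cv_error` are all assumptions chosen for illustration. In practice the objective would be a cross-validated SVR error.

```python
import random


def genetic_search(fitness, seed_individual, bounds, pop_size=20,
                   generations=30, mutation_scale=0.3, rng=None):
    """Minimise `fitness` with a simple elitist genetic algorithm."""
    rng = rng or random.Random(0)

    def clip(individual):
        return [min(max(g, lo), hi) for g, (lo, hi) in zip(individual, bounds)]

    # Initial generation: the supplied seed plus random individuals.
    population = [clip(list(seed_individual))]
    while len(population) < pop_size:
        population.append([rng.uniform(lo, hi) for lo, hi in bounds])

    for _ in range(generations):
        scored = sorted(population, key=fitness)
        parents = scored[: max(2, pop_size // 4)]
        children = [scored[0]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]           # uniform crossover
            child = [g + rng.gauss(0, mutation_scale) for g in child]  # Gaussian mutation
            children.append(clip(child))
        population = children
    return min(population, key=fitness)


def toy_cv_error(log_params):
    """Placeholder objective standing in for a cross-validated SVR error."""
    hypothetical_optimum = (-1.0, 1.5, -2.0)
    return sum((p - t) ** 2 for p, t in zip(log_params, hypothetical_optimum))


# log10 ranges for kernel width (gamma), penalty (C) and epsilon; hypothetical values.
bounds = [(-4, 2), (-2, 4), (-4, 0)]
seed = [0.0, 0.0, -1.0]  # stands in for the analytically determined combination
best = genetic_search(toy_cv_error, seed, bounds)
gamma, C, epsilon = (10 ** p for p in best)
```

Because the best individual is carried over unchanged in every generation, the returned combination is never worse than the seed under the chosen objective.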