9.2 Testing the Tests: A Look at Size and Power for Hypothesis Tests of Competing Forecasts (Invited Presentation)

Wednesday, 13 January 2016: 11:00 AM
Room 226/227 (New Orleans Ernest N. Morial Convention Center)
Eric Gilleland, NCAR, Boulder, CO; and B. G. Brown, T. L. Fowler, and A. S. Hering

One of the most common research questions in forecast verification that calls for hypothesis testing (or, equivalently, confidence intervals) is whether a new forecast model, or a modification to an existing one, is statistically significantly better than the model currently in use. Often the goal is to compare statistical summaries directly, together with confidence intervals (or a hypothesis test), for each model. Important considerations include temporal dependence and contemporaneous correlation (correlation between the two competing forecast models). In this study, several common tests are compared empirically by way of simulations. Some of the tests examined do not account for these types of dependence (e.g., Student's t-test, iid bootstrap), while others at least account for temporal dependence (e.g., the normal approximation with a variance inflation factor applied). A popular test from the economics literature, the Diebold-Mariano test, is analyzed along with a recent modification. These latter tests directly account for temporal dependence without making assumptions about the underlying distributions of the model series in question. Empirical size and power are analyzed for varying degrees of temporal and contemporaneous correlation, as well as for different temporal dependence structures, under three different loss functions. Not surprisingly, the tests that do not account for either type of correlation are found not to have adequate size, except when the dependence is very low. For those that do account for dependence in the series, some of the results are surprising. In particular, some tests are found to be heavily undersized, while others are slightly oversized, with a clear effect introduced by contemporaneous correlation. For some of the tests with good size, the power is found not to be very high. In each case, guidance is given on when certain tests perform best and when they should not be used.
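
To make the idea of empirical size concrete, the following is a minimal sketch of a Diebold-Mariano-type test and a Monte Carlo estimate of its rejection rate under the null hypothesis. It is not the authors' simulation design, which the abstract does not specify. The AR(1) error model, the squared-error loss, the Bartlett-weighted long-run variance estimate, the lag choice, and all function names (`ar1_pair`, `dm_test`, `empirical_size`) are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist, fmean


def ar1_pair(n, phi, rho, rng, burn_in=50):
    """Simulate two forecast-error series (assumed AR(1) model).

    phi: temporal dependence (AR(1) coefficient shared by both series).
    rho: contemporaneous correlation between the two innovation series.
    """
    e1, e2 = [], []
    x = y = 0.0
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(n + burn_in):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)  # correlated with z1
        x = phi * x + z1
        y = phi * y + z2
        e1.append(x)
        e2.append(y)
    return e1[burn_in:], e2[burn_in:]


def dm_test(e1, e2, max_lag):
    """Diebold-Mariano-type statistic for equal squared-error loss.

    The long-run variance of the loss differential uses Bartlett weights
    (an assumption here; it keeps the variance estimate non-negative).
    Returns (statistic, two-sided p-value from the standard normal).
    """
    d = [a * a - b * b for a, b in zip(e1, e2)]  # loss differential
    n = len(d)
    dbar = fmean(d)
    c = [v - dbar for v in d]

    def gamma(k):
        # Sample autocovariance of d at lag k (divisor n).
        return sum(c[t] * c[t - k] for t in range(k, n)) / n

    lrv = gamma(0) + 2.0 * sum(
        (1.0 - k / (max_lag + 1)) * gamma(k) for k in range(1, max_lag + 1)
    )
    stat = dbar / math.sqrt(lrv / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(stat)))
    return stat, p_value


def empirical_size(n_rep, n, phi, rho, alpha=0.05, max_lag=5, seed=0):
    """Fraction of null-hypothesis simulations rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        e1, e2 = ar1_pair(n, phi, rho, rng)
        _, p_value = dm_test(e1, e2, max_lag)
        rejections += p_value < alpha
    return rejections / n_rep
```

Because both simulated series have the same error distribution, the null hypothesis of equal expected loss holds by construction, so a call such as `empirical_size(1000, 200, phi=0.5, rho=0.3)` estimates the test's size; a value well below the nominal 0.05 would suggest an undersized test, and a value well above it an oversized one. Empirical power could be estimated in the same way by giving one series a larger error variance.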