We present a method for predicting the risk of significant train delays caused by challenging weather by analysing historical train delay data together with the prevailing weather observations. The method focuses exclusively on weather because our aim is a stand-alone method whose output can serve as an input in several domains. For example, the risk of train delays can inform resource planning and communications in train operation centres. Moreover, the predictor can be used by forecasters as an input when issuing weather warnings.
Specifically, we predict the sum of the delays of all trains that arrived at a specific station during a one-hour slot. Here, delay means the additional delay accumulated between the previous station and the current station. The prediction is based on 19 ground-observation weather parameters gathered for each station. The data are pre-processed in two ways. First, we gather observations within a 100-kilometre range and aggregate the measurements as the minimum, maximum and average of each parameter. Second, we calculate precipitation sums over three and six hours, because long-lasting snowfall is known to be particularly hard for rail traffic.
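The two pre-processing steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the haversine distance, the station coordinates and the sample values are all assumptions made for illustration, and the rolling window simply shortens at the start of the series.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def aggregate_nearby(station, observations, radius_km=100.0):
    """Min, max and mean of each parameter over observations within radius_km.

    station is a (lat, lon) pair; each observation is a dict holding 'lat',
    'lon' and one value per weather parameter (hypothetical layout).
    """
    nearby = [
        o for o in observations
        if haversine_km(station[0], station[1], o["lat"], o["lon"]) <= radius_km
    ]
    features = {}
    if not nearby:
        return features  # no weather station in range
    for param in (k for k in nearby[0] if k not in ("lat", "lon")):
        values = [o[param] for o in nearby]
        features[f"{param}_min"] = min(values)
        features[f"{param}_max"] = max(values)
        features[f"{param}_mean"] = sum(values) / len(values)
    return features

def rolling_sum(values, window):
    """Sum of the current and the previous (window - 1) hourly values."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

# Hypothetical example: a train station in Helsinki and three weather stations,
# one of which (near Tampere) lies well outside the 100 km range.
train_station = (60.17, 24.94)
weather_obs = [
    {"lat": 60.32, "lon": 24.96, "temperature": -5.0},
    {"lat": 60.39, "lon": 25.66, "temperature": -7.0},
    {"lat": 61.50, "lon": 23.76, "temperature": -12.0},
]
features = aggregate_nearby(train_station, weather_obs)

hourly_precipitation = [1, 0, 2, 3, 0, 1, 4]  # hypothetical hourly amounts
precip_3h = rolling_sum(hourly_precipitation, 3)
precip_6h = rolling_sum(hourly_precipitation, 6)
```

With 19 parameters, this kind of aggregation would yield three features per parameter, plus the two precipitation sums.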
We compared three models for predicting the delays. First, we built a simple linear regression (LR) model, which we tested with L1 and L2 regularisation. Second, we used random forest regression (RFR). Third, we built a Long Short-Term Memory (LSTM) neural network, which can also use information from the time series history. We optimised the hyperparameters of linear regression and random forest regression using random search; the LSTM topology and hyperparameters were found by trial and error. We also ran all models with different combinations of features, with only the data from the winter months, and both with and without PCA dimensionality reduction.
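As a rough illustration of random search over random forest hyperparameters, the sketch below uses scikit-learn on synthetic data. This is a swapped-in toolkit chosen for brevity, not the setup used in this work, and the search space, sample size and scoring choice are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the weather features and hourly delay sums.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + np.abs(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Hypothetical search space; random search samples n_iter combinations from it.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled configurations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
```

Random search evaluates only a sample of the configurations, so its cost is set by `n_iter` rather than by the size of the full grid.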
The data set used in our experiments was gathered from the Finnish Traffic Agency and the Finnish Meteorological Institute and covers the years 2010 to 2018. The delay data contain a history of most train shifts in Finland, aggregated to hourly intervals. Weather observations are gathered from ground observation stations as described above and likewise aggregated to hourly values. To validate model performance, we hand-picked three months from the data set and excluded them from training.
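Holding out whole calendar months can be sketched as follows. The row layout and the three held-out months are hypothetical; the months actually selected are not stated here.

```python
from datetime import datetime

def split_by_month(rows, holdout_months):
    """Split rows into (train, test) by the (year, month) of their timestamp.

    rows is a list of tuples whose first element is a datetime;
    holdout_months is a set of (year, month) pairs excluded from training.
    """
    train, test = [], []
    for row in rows:
        ts = row[0]
        if (ts.year, ts.month) in holdout_months:
            test.append(row)
        else:
            train.append(row)
    return train, test

# Hypothetical hourly rows: (timestamp, summed delay in minutes).
rows = [
    (datetime(2011, 2, 1, 8), 12.0),
    (datetime(2011, 2, 1, 9), 0.0),
    (datetime(2013, 7, 15, 8), 3.0),
    (datetime(2014, 3, 3, 12), 7.0),
    (datetime(2016, 12, 24, 17), 45.0),
]
holdout = {(2011, 2), (2013, 7), (2016, 12)}  # hypothetical choice of three months
train_rows, test_rows = split_by_month(rows, holdout)
```

Splitting on whole months keeps consecutive hours of the same weather event on the same side of the split, which a random row-level split would not.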
We treat our data analysis problem as a big data problem because the data amount to several gigabytes and because we analyse them with machine learning techniques. We use the Apache Spark framework, running on Google Cloud, as a unified analytics engine for retrieving and pre-processing the data and for developing the machine learning models. Applying random forest regression and LSTM in a distributed way with Apache Spark significantly reduced the time taken to train the models and to predict delays with them.
Our results are promising. Random forest regression provided the best results: the RMSE between observed and predicted delays was, on average over all stations, 13.86 minutes. The Brier skill score (BRR) against the average delay over all times and stations was 0.16. Model performance varied considerably between stations: while the BRR was quite good for many stations, the model showed a negative skill score for stations with fewer daily train shifts.
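The two metrics can be illustrated with a small sketch. The skill score below uses one common form, 1 - MSE_model / MSE_reference, with a constant average delay as the reference; whether this matches the exact definition used for the BRR is an assumption, and all numbers are invented.

```python
import math

def mse(observed, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean squared error, in the same unit as the delays (minutes)."""
    return math.sqrt(mse(observed, predicted))

def skill_score(observed, predicted, reference):
    """1 - MSE_model / MSE_reference, where the reference always predicts
    the same constant value (here: an average delay)."""
    mse_reference = mse(observed, [reference] * len(observed))
    return 1.0 - mse(observed, predicted) / mse_reference

# Hypothetical hourly delay sums in minutes.
observed = [0, 10, 20, 30]
good_prediction = [5, 10, 15, 30]
poor_prediction = [30, 20, 10, 0]
average_delay = 15.0  # hypothetical reference value

good_rmse = rmse(observed, good_prediction)
good_skill = skill_score(observed, good_prediction, average_delay)
poor_skill = skill_score(observed, poor_prediction, average_delay)
```

A score of 1 means a perfect prediction, 0 means no improvement over always predicting the average, and a negative score means the model does worse than that reference.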
There are a number of possible ways to improve the results. The current delay data contain all delays, which introduces a lot of noise into the training data. Cleaning the data, or otherwise indicating a possible link between delays and weather, would most probably improve the results considerably. Another clear improvement would be to use delays that have already occurred as a descriptive feature.