These restrictions motivate thinning methods, which aim to reduce the large satellite data sets while retaining the essential information content of the data for optimal use in data assimilation. In this work we devise two algorithms that are inspired by data simplification methods in vector quantization and computer graphics.
In the first thinning method, we follow the concept of data clustering, which was developed to partition a given data set into a number of groups (clusters), each of which contains data elements that are similar with respect to some distance measure. Each cluster has a representative element, which serves as an approximation of all elements in the cluster. Our method starts by assigning the complete data set to one cluster, which is then hierarchically split until all clusters satisfy an accuracy constraint. This procedure is completed by so-called Lloyd iterations, in which cluster centers are reassigned in order to improve the approximation quality.
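The top-down splitting and the subsequent Lloyd iterations can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the Euclidean distance measure, the accuracy constraint (maximum distance of a cluster member to its centroid) and the splitting rule (cut at the mean along the axis of largest spread) are all hypothetical choices made for the example.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def max_error(points, center):
    """Assumed accuracy measure: largest distance of a member to the center."""
    return max(math.dist(p, center) for p in points)

def split(points):
    """Assumed splitting rule: cut at the mean along the axis of largest spread."""
    dims = range(len(points[0]))
    axis = max(dims, key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    cut = centroid(points)[axis]
    left = [p for p in points if p[axis] <= cut]
    right = [p for p in points if p[axis] > cut]
    return left, right

def top_down_clusters(points, tol):
    """Start with one cluster and split until every cluster meets the tolerance."""
    pending, accepted = [list(points)], []
    while pending:
        cluster = pending.pop()
        if max_error(cluster, centroid(cluster)) <= tol:
            accepted.append(cluster)
        else:
            pending.extend(split(cluster))
    return [centroid(c) for c in accepted]

def lloyd(points, centers, iterations=10):
    """Lloyd iterations: reassign points to the nearest center, then recenter."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            groups[k].append(p)
        # An empty group keeps its previous center.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers
```

A call such as `lloyd(points, top_down_clusters(points, tol=0.5))` then returns one representative per cluster. In this sketch the representatives are centroids rather than original observations, which is a simplification.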
The second thinning method is based on an estimation filter, which approximates measurement values by a continuous estimation function. To estimate an observation at a given location, we compute the normalized weighted sum of the observations in a local neighborhood of that location. Our method aims at minimizing a global error function, which is defined as the average squared difference between the estimation function for the full data set and that for the simplified one. The estimation function is also used to define a redundancy measure for observational data: if replacing an observation by its local estimate leads to a sufficiently small change in the estimation function, the observation is considered redundant.
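A minimal sketch of such an estimation filter and of the resulting error and redundancy measures is given below. It is an illustration under assumptions, not the implementation used in this work: the one-dimensional locations, the Gaussian weight function, the bandwidth `h` and all function names are hypothetical, and redundancy is approximated here by the global error incurred when an observation is dropped.

```python
import math

def estimate(x, obs, h=1.0):
    """Normalized weighted sum of the observation values around location x.

    obs is a list of (location, value) pairs; the Gaussian weight is an assumption.
    """
    num = den = 0.0
    for xi, fi in obs:
        w = math.exp(-((x - xi) / h) ** 2)
        num += w * fi
        den += w
    return num / den if den > 0.0 else 0.0

def global_error(full, thinned, eval_points, h=1.0):
    """Mean squared difference between the estimates from the two data sets."""
    return sum((estimate(x, full, h) - estimate(x, thinned, h)) ** 2
               for x in eval_points) / len(eval_points)

def removal_cost(i, obs, eval_points, h=1.0):
    """Global error caused by dropping observation i (a small cost means redundant)."""
    return global_error(obs, obs[:i] + obs[i + 1:], eval_points, h)
```

With observations of value 1 at locations 0, 1 and 2 and a value of 5 at location 3, `removal_cost` is much smaller for the observation at location 1 than for the one at location 3, so the former would be treated as the more redundant one.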
In this work, we describe the gains in performance and accuracy achieved by modifying the former implementation of the EEA algorithm (Ochotta et al., in press), in which we evaluated the estimation function at the observation locations and iteratively removed the observation leading to the least degradation in global estimation error until a given number of retained observations was reached. This EEA implementation led to a slight improvement in the scores of 27 subsequent forecasts with the global GME model compared to the operational thinning method, which selects every third point in the zonal and meridional directions. However, the approach was suboptimal with respect to computing time and accuracy of the estimation function. Therefore, we extend the EEA concept in the present work by introducing new implementation variants, which differ in methodological and performance aspects and improve on the traditional approach. In the Grid-EEA algorithm, we propose to evaluate the estimation function on a regular grid on the globe. This approach is motivated by the fact that in most cases the observations are distributed non-uniformly. The traditional EEA algorithm evaluates squared error differences at the positions of the observations of the original data set, which leads to an overestimation of the error function in regions with increased data density. We apply the grid of the GME, the global model of the German Weather Service (DWD), which provides a near-uniform discretization of the sphere. Depending on the grid resolution, the Grid-EEA algorithm provides thinnings of varying accuracy at different run times.
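The effect of the evaluation locations on the error function can be illustrated with a small one-dimensional sketch. Everything in it is an assumption made for the example: the Gaussian weight function, the bandwidth `h`, the synthetic data and the regular grid, which only stands in for the GME grid.

```python
import math

def estimate(x, obs, h=0.5):
    """Normalized Gaussian-weighted sum of observation values (assumed filter)."""
    num = den = 0.0
    for xi, fi in obs:
        w = math.exp(-((x - xi) / h) ** 2)
        num += w * fi
        den += w
    return num / den if den > 0.0 else 0.0

def mean_squared_difference(full, thinned, eval_points, h=0.5):
    """Average squared difference of the two estimates over eval_points."""
    return sum((estimate(x, full, h) - estimate(x, thinned, h)) ** 2
               for x in eval_points) / len(eval_points)

# Hypothetical data: ten observations crowded into [0, 0.9], five spread over [2, 10].
dense = [(0.1 * k, float(k)) for k in range(10)]
sparse = [(float(x), 0.0) for x in (2, 4, 6, 8, 10)]
full = dense + sparse
# A deliberately poor thinning that keeps only one observation of the dense region.
thinned = [dense[0]] + sparse

at_observations = [x for x, _ in full]   # evaluation at the observation positions
on_grid = [float(x) for x in range(11)]  # evaluation on a regular grid

err_obs = mean_squared_difference(full, thinned, at_observations)
err_grid = mean_squared_difference(full, thinned, on_grid)
```

In this example ten of the fifteen observation positions lie in the crowded region but only two of the eleven grid points do, so `err_obs` is larger than `err_grid`: the same local discrepancy receives more weight when the error is evaluated at the observation positions.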
While the EEA algorithm adopts the concept of iterative point removal, we propose another EEA concept, in which the observations are iteratively inserted, starting with the empty set. This approach is preferable in settings where only a small part of the original data set is retained, e.g. 10%, as is the case with the satellite soundings in the assimilation process at the DWD. For such thinnings, the number of insertion operations is much smaller than the number of removal steps in the traditional EEA method, which makes the method faster and less sensitive to the unavoidable inaccuracies introduced by each step. We show that this results in improved accuracy compared to the point removal approach when the number of retained observations is sufficiently small.
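The insertion strategy can be sketched as a greedy loop that starts from the empty set and, in each step, adds the observation that yields the smallest global estimation error. The sketch is an assumption-laden illustration rather than the implementation used in this work: the one-dimensional locations, the Gaussian weight function, the bandwidth `h`, the function names and the brute-force search over all candidates are hypothetical.

```python
import math

def estimate(x, obs, h=1.0):
    """Normalized Gaussian-weighted sum of observation values (assumed filter).

    Returns 0.0 for an empty observation set.
    """
    num = den = 0.0
    for xi, fi in obs:
        w = math.exp(-((x - xi) / h) ** 2)
        num += w * fi
        den += w
    return num / den if den > 0.0 else 0.0

def thin_by_insertion(obs, n_keep, eval_points, h=1.0):
    """Greedily insert observations until n_keep of them are retained."""
    # Estimate from the full data set, computed once as the reference.
    reference = [estimate(x, obs, h) for x in eval_points]

    def error(subset):
        return sum((r - estimate(x, subset, h)) ** 2
                   for r, x in zip(reference, eval_points)) / len(eval_points)

    kept, candidates = [], list(obs)
    while candidates and len(kept) < n_keep:
        # Brute-force search for the insertion with the lowest resulting error.
        best = min(candidates, key=lambda c: error(kept + [c]))
        kept.append(best)
        candidates.remove(best)
    return kept
```

Retaining `n_keep` observations takes `n_keep` insertion steps, whereas point removal needs `len(obs) - n_keep` steps, which is the reason for the advantage at low retention rates.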
We finally focus on the efficient implementation of the methods. The proposed EEA variants rely on iterative removal or insertion of observations, where each step leads to an increase or decrease of the global estimation error. We propose to organize all candidates for removal or insertion in efficient data structures, such as priority queues, which keep the observations sorted according to their degree of redundancy. The removal or insertion of an observation corresponds to removing one element from the queue, followed by an updating procedure that restores the correct ordering of the observations. This update is mandatory, since each removal or insertion step affects the observations in a local neighborhood. We propose two variants of this updating procedure and compare them with respect to accuracy and computational complexity. One of the updating variants leads to a significant speed-up, e.g., a data set of about 10^5 observations is processed in only four seconds compared to two minutes for the traditional approach, while the loss in accuracy is negligible.
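A priority queue with a local updating step could be organized as in the following sketch for point removal. It does not reproduce either of the two updating variants compared in this work. The version stamps used to discard outdated queue entries, the neighborhood `radius`, the Gaussian weight function and the local redundancy proxy (deviation of an observation from the estimate built from its retained neighbors) are all assumptions made for the example.

```python
import heapq
import math

def estimate(x, obs, h=1.0):
    """Normalized Gaussian-weighted sum of observation values (assumed filter)."""
    num = den = 0.0
    for xi, fi in obs:
        w = math.exp(-((x - xi) / h) ** 2)
        num += w * fi
        den += w
    return num / den if den > 0.0 else 0.0

def thin_by_removal(obs, n_keep, radius=3.0, h=1.0):
    """Remove the most redundant observation until n_keep remain."""
    alive = set(range(len(obs)))

    def cost(i):
        # Assumed proxy: deviation from the estimate of the retained neighbors.
        near = [obs[j] for j in alive
                if j != i and abs(obs[j][0] - obs[i][0]) <= radius]
        if not near:
            return math.inf  # isolated observations are removed last
        return abs(obs[i][1] - estimate(obs[i][0], near, h))

    stamp = [0] * len(obs)  # current version of each observation's queue entry
    heap = [(cost(i), 0, i) for i in alive]
    heapq.heapify(heap)
    while len(alive) > n_keep and heap:
        _, version, i = heapq.heappop(heap)
        if i not in alive or version != stamp[i]:
            continue  # outdated entry, a newer one is already in the queue
        alive.remove(i)
        # Updating step: only neighbors of the removed observation are re-queued.
        for j in alive:
            if abs(obs[j][0] - obs[i][0]) <= radius:
                stamp[j] += 1
                heapq.heappush(heap, (cost(j), stamp[j], j))
    return [obs[i] for i in sorted(alive)]
```

The neighbor search in this sketch is a linear scan for brevity; a spatial index would be needed to benefit from the queue on large data sets.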
Ochotta, T., Gebhardt, C., Saupe, D., Wergen, W.: Adaptive thinning of atmospheric observations in data assimilation with vector quantization and filtering methods. Q. J. R. Meteorol. Soc., in press.