More than 40 data sources have been added to the databank so far. Although collection of new source datasets is ongoing, an initial version of the Stage 3 merged dataset has been developed. This required automated algorithms that remove duplicate station records, identify two or more station records that can be merged into a single record, and incorporate new and unique stations. The merge program runs iteratively through all the sources, which are ordered according to criteria established by the ISTI. The most preferred source, known as the master, is compared against each candidate source in turn to identify station pairs that are acceptable for merging. The process is Bayesian in approach, and the final fate of a candidate station is based upon metadata matching and data equivalence criteria. If there is not enough information, the station is withheld for further investigation. The algorithm has been validated using a pseudo-source of stations with a known time-of-observation bias, and correct matches were made nearly 95% of the time.
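The three possible fates of a candidate station (merged, added as unique, or withheld) can be illustrated with a minimal Python sketch. It is not the databank's merge program: the `Station` class, the scoring functions, the thresholds, and the use of a simple product of scores in place of the Bayesian combination are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Station:
    name: str
    lat: float
    lon: float
    # Monthly values keyed by (year, month); hypothetical layout.
    data: dict = field(default_factory=dict)


def distance_km(a, b):
    """Great-circle distance between two stations (haversine formula)."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dphi = p2 - p1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def metadata_score(a, b, scale_km=10.0):
    """Assumed metadata similarity in [0, 1]: decays with station separation."""
    return math.exp(-distance_km(a, b) / scale_km)


def data_score(a, b, tol=0.1, min_overlap=12):
    """Fraction of overlapping months that agree within tol.

    Returns None when the overlap is too short to judge equivalence.
    """
    common = a.data.keys() & b.data.keys()
    if len(common) < min_overlap:
        return None
    agree = sum(abs(a.data[k] - b.data[k]) <= tol for k in common)
    return agree / len(common)


def decide(candidate, master_stations, merge_thr=0.6, unique_thr=0.1):
    """Return (decision, matched master name or None) for one candidate."""
    best_score, best_match = 0.0, None
    for m in master_stations:
        ds = data_score(candidate, m)
        # With no usable overlap, treat the data evidence as uninformative.
        evidence = 0.5 if ds is None else ds
        score = metadata_score(candidate, m) * evidence
        if score > best_score:
            best_score, best_match = score, m
    if best_score >= merge_thr:
        return "merge", best_match.name
    if best_score <= unique_thr:
        return "unique", None
    # Neither clearly the same station nor clearly a new one.
    return "withhold", None
```

In this sketch, a candidate that sits at the same location as a master station but shares no data with it cannot reach the merge threshold, so it is withheld rather than merged or added as unique.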
The final Stage 3 product contains over 40,000 stations; however, slight changes in the algorithm can perturb the results. Subjective decisions, such as the ordering of the sources or the choice of metadata and data matching thresholds, can yield a different outcome. To address this uncertainty, multi-member ensembles of the merge program have been produced based upon expert decisions from the databank working group. All data and code will be provided openly and without charge so that anyone in the international community can access and use them easily. We strongly encourage use of these data and welcome feedback from interested parties on any relevant aspect of the Databank effort.
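As a rough illustration of how ensemble members might be configured, the Python sketch below varies the source ordering and the two matching thresholds for each member. The source names, threshold ranges, and function name are hypothetical, and random perturbation is used here only as a stand-in for the expert decisions of the working group.

```python
import random


def ensemble_configs(sources, n_members, seed=0):
    """Build one configuration per ensemble member.

    Each member gets its own source ordering and matching thresholds.
    The ranges below are illustrative assumptions, not databank values.
    """
    rng = random.Random(seed)  # fixed seed keeps the ensemble reproducible
    configs = []
    for _ in range(n_members):
        order = list(sources)
        rng.shuffle(order)
        configs.append({
            "source_order": order,
            "metadata_threshold": rng.uniform(0.4, 0.8),
            "data_threshold": rng.uniform(0.4, 0.8),
        })
    return configs


# Hypothetical source names, for illustration only.
members = ensemble_configs(["source_a", "source_b", "source_c"], n_members=5)
```

Running the merge once per configuration and comparing the resulting station counts and records would then give a simple measure of how sensitive the merged product is to these choices.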