The implementation is based on limiting costly data transfer between the GPU and the host processor. Traditionally, data is transferred to and from the GPU for the parts of the code that are executed on the GPU. We present an implementation in which all computation is executed on the GPU, so data transfer needs to occur only once at the beginning and once at the end of the simulation, avoiding transfers after each timestep. The model state thus remains on the GPU for the duration of the simulation run, and the CPU is used solely for input/output operations and for managing the execution of the GPU kernels. This allows for greater speed-up than a modular approach in which some parts of the code execute on the host and the remaining parts on the GPU.
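As a minimal sketch of the pattern described above (with hypothetical kernel and variable names, not the authors' code), the structure in CUDA looks as follows: one host-to-device copy before the time loop, a loop in which the CPU only launches kernels on device-resident state, and one device-to-host copy at the end.

```cuda
// Sketch of the GPU-resident simulation pattern: the model state is copied
// to the device once, every timestep runs as a kernel on data that never
// leaves the GPU, and the result is copied back once at the end.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical update kernel: one explicit timestep over the model state.
__global__ void step(float *state, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += dt * state[i];  // placeholder update rule
}

int main() {
    const int n = 1 << 20, nsteps = 1000;
    const float dt = 1e-3f;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // Single host-to-device transfer at the start of the simulation.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // The CPU only manages kernel launches; the state stays on the GPU.
    for (int t = 0; t < nsteps; ++t)
        step<<<(n + 255) / 256, 256>>>(d, n, dt);

    // Single device-to-host transfer at the end of the simulation.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("state[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```

In a modular approach, the two `cudaMemcpy` calls would instead sit inside the time loop, and their cost would be paid at every timestep.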
Preliminary speed-up results, as well as the programming challenges involved in this implementation, are presented.