The speed optimizations for the CPU-based version using pure NumPy are now largely tapped out. Numba could still squeeze out some gains, but even there the hardware sets the limit: NumPy's math routines are already heavily optimized, so the current bottleneck is moving data to and from memory rather than the arithmetic itself. Modern GPUs have dedicated video memory with far higher bandwidth than the RAM the CPU uses, so implementing the algorithm on the GPU should yield a significant performance boost.
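To illustrate the Numba route, here is a minimal sketch of what a JIT-compiled inner loop could look like. This is a hypothetical Mandelbrot-style escape-time kernel, not the project's actual code, and it falls back to plain Python when Numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the loop body to machine code
except ImportError:
    def njit(func):          # fallback: run the loop as plain Python
        return func

@njit
def escape_counts(xs, ys, max_iter):
    # Hypothetical escape-time iteration z <- z*z + c over a grid.
    out = np.zeros((ys.size, xs.size), dtype=np.int64)
    for i in range(ys.size):
        for j in range(xs.size):
            c = complex(xs[j], ys[i])
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:
                z = z * z + c
                n += 1
            out[i, j] = n
    return out
```

Numba compiles the explicit double loop to native code, so it avoids the temporary arrays a vectorized NumPy formulation allocates, but it is still bound by how fast the CPU can stream the grid through memory.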
One widely adopted framework for GPU computing is Nvidia CUDA. High-end consumer GeForce cards offer data transfer rates comparable to the much pricier RTX Titan, meaning you can compute large areas quickly even with a more affordable card.
After researching potential GPU computing implementations for NumPy, I chose CuPy. It provides plenty of flexibility and optimization options while maintaining easy compatibility with existing Python and NumPy code. As a first step, I’ve created a fork and am beginning the implementation of the iteration on the GPU.
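To show how little the NumPy code has to change, here is a sketch of the drop-in pattern CuPy enables. The escape-time function is illustrative rather than the fork's actual code, and the module switch falls back to NumPy on machines without CuPy:

```python
import numpy as np

try:
    import cupy as cp  # GPU arrays; assumes a CUDA-capable card is present
    xp = cp
except ImportError:
    xp = np            # fall back to NumPy so the same code runs on the CPU

def iterate(c, max_iter=100):
    """Vectorized escape-time iteration z <- z*z + c, running on the
    CPU or the GPU depending on which module xp resolved to."""
    z = xp.zeros_like(c)
    counts = xp.zeros(c.shape, dtype=xp.int32)
    mask = xp.ones(c.shape, dtype=bool)
    for _ in range(max_iter):
        z[mask] = z[mask] * z[mask] + c[mask]   # advance unescaped points
        mask = xp.abs(z) <= 2.0                 # points still bounded
        counts[mask] += 1
    return counts
```

When CuPy is active, the arrays live in video memory and each operation runs as a GPU kernel; `cp.asnumpy(counts)` copies the result back to host memory for plotting. The same source works unchanged under plain NumPy, which is exactly the compatibility that made CuPy attractive here.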

