This code has also been optimised to be much more memory
friendly by removing a _lot_ of extraneous copies. The result
is that, when profiled, it's around 8x faster than the previous
implementation.
Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>