Faster Image Transforms With Cython

Python to Cython

This is the second post on how to accelerate Python with Cython. The previous post used Cython to declare static C data types in Python, to speed up the runtime of a prime number generator. In this post we shall be modifying a more complex program, one that performs an image transform on a map. This will allow me to demonstrate some more advanced Cython techniques, such as importing C functions from the C math library, using memory views of Numpy arrays and turning off the Python global interpreter lock (GIL).

As with the previous post, we shall be making a series of Cython modifications to Python code, noting the speed improvement with each step. As we go, we’ll be using the Cython compiler’s annotation feature to see which lines are converted into C at each stage, and which lines are still using Python objects and functions. And as we tune the code to run at increasingly higher speeds, we shall be profiling it to see what’s still holding us up, and where to refocus our attention.

Although I will be using Python 3 on a Mac, the instructions I give will mostly be platform agnostic:  I will assume you have installed Cython on your system (on Windows, Linux or OS/X) and have followed and understood the installation and testing steps in my previous post. This will be essential if you are to follow the steps I outline below. As stated in the previous post, Cython is not for Python beginners.

Compilation Method

Just to recap, I’ll be using a simple editor (IDLE, in my case), editing my Cython code as .py files (not .pyx files) so that IDLE can recognize them. I’ll only be converting them to .pyx when I’m ready to compile them. To compile them with Cython I’m using a combination of two files. The first is a simple executable shell file I’ve written, containing these two lines:

cp MyCythonModule.py MyCythonModule.pyx
python3 setup.py build_ext --inplace

Save these lines to a file called CompileCython.sh and make it executable. Two things to note. First, you don’t have to use this file – you can just run the second line from the command line. Second, however you run it, this line will need some additional compiler flags for Windows, depending on your Python distribution.

Whichever way you choose to run it, this Python command in turn calls setup.py, which contains the compilation instructions for Cython:

from distutils.core import setup 
from distutils.extension import Extension 
from Cython.Build import cythonize 
 
ext_modules = [Extension("MyCythonModule", 
                            sources=["MyCythonModule.pyx"])] 
 
ext_options = {"compiler_directives": {"profile": True}, "annotate": True} 
 
setup(name="MyCythonModule",
      ext_modules=cythonize(ext_modules, **ext_options))

 

Once you have edited this file to be exactly how you want it, you will rarely have a reason to touch it. We shall only be making changes to it once, when we import the C math libraries later in the post.

Testing each Cython code mod will therefore be a five step process (A-E):

$ ./CompileCython.sh # step A - run the shell file to compile and build the external module 

Compiling MyCythonModule.pyx because it changed. 
[1/1] Cythonizing MyCythonModule.pyx 
running build_ext 
building 'MyCythonModule' extension 
creating build 
... etc ... 

$ ipython --no-banner # step B - running iPython 

In [1]: # step C - importing the new module:
In [2]: from MyCythonModule import inv_transform_map 
 
In [3]: # step D - testing the code: 
In [4]: inv_transform_map(displaying_images=True)  
Running the map transform...

In [5]: # step E - timing the code: 
In [6]: %timeit -r5 inv_transform_map()  
Running the map transform...
893 ms ± 6.1 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

 

The only other step will be to profile the code, which I’ll be doing early and often to re-check which functions are taking up most of the execution time. There are several great tools for this, but the main one I’ll be using is the Python code profiler prun, in combination with the Cython compiler’s annotation output to check which lines are still using Python after each mod.

The Starting Point

The code I’ll be using is a faster version of that originally posted for performing a transverse Mercator transform. The original code did not have to be perfect, and was written to explore some interesting mathematics and cartography. It was also reworked in my parallel processing post, where it served its purpose by being computationally intensive. But for this post it needs to start lean and get leaner. So I’ve removed some unnecessary rounding and type casting, and simplified some of the equations.

This means that almost all the gains from the changes I’m about to make are down to Cython, not just from restructuring poorly written code. I say ‘almost all’ because once things start to get fast, any lines of Python holding things up will start to stand out, and so restructuring the Python code will become necessary.

So here is the initial Python code we’ll be using:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
 
from PIL import Image
import numpy as np
 
 
def inv_merc_lat(translat_in_rads, translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    orig_lat_rad = -np.arcsin((np.sin(translong_in_rads)) /
                              (np.cosh(translat_in_rads)))
    return orig_lat_rad
 
 
def inv_merc_long(trans_lat_rads, trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    orig_long_rads = np.arctan(np.sinh(trans_lat_rads) / np.cos(trans_long_rads))
 
    if (trans_long_rads > np.pi/2) or (trans_long_rads < -np.pi / 2):
        if trans_lat_rads >= 0:
            orig_long_rads += np.pi
        elif trans_lat_rads < 0:
            orig_long_rads -= np.pi
    return orig_long_rads
 
 
def inv_transform_pixel(x_pix_num, x_max, y_max, transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    trans_longitude = (2 * np.pi * x_pix_num / x_max) - np.pi
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = int(0.5 * (orig_longitude / np.pi + 1.0) * x_max)
    original_y = int((0.5 - orig_latitude  / np.pi) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
 
    return original_y, original_x
 
 
def inv_transform_line(map_line_num, x_max, y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    # A pixel's x and y coordinates will usually exceed 256, so need uint16:
    line_pix_mapping = np.zeros((x_max, 2), dtype=np.uint16)
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * np.pi
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping[x_coord] = inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat)
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """    
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    print("Running the map transform...")
 
    line_pixel_mapping = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 2), dtype=np.uint16)
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping[map_image_line] = inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE)
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image.getpixel((
                    int(new_pix_x), int(new_pix_line)))[:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

Depending on the input map you’ve used, if you’ve got this far, you should have an output image that looks something like this:

 

Cleaning up the code in this way brought the runtime on a 700×350 map down from 7.5 seconds  to about 4 seconds. Here’s how it runs when you import the standard Python code into iPython:

$ ipython --no-banner

In [1]: from MyModule import inv_transform_map                                            
In [2]: inv_transform_map(displaying_images=True)                                         
Running the map transform...

In [3]: %timeit inv_transform_map()                                                   
Running the map transform...
...
3.99 s ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This will be the Python-only benchmark time for the image I’m testing it on. An initial profile of the code using the Python profiler prun shows where most of the time is being spent:

In [4]: %prun inv_transform_map()                                                         
 
Running the map transform...
 
         2021446 function calls in 4.653 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   252404    1.032    0.000    1.032    0.000 MyModule.py:16(inv_merc_long)
   252404    1.000    0.000    1.000    0.000 MyModule.py:8(inv_merc_lat)
        1    0.976    0.976    4.653    4.653 MyModule.py:60(inv_transform_map)
   252404    0.718    0.000    2.750    0.000 MyModule.py:29(inv_transform_pixel)
      356    0.357    0.001    3.109    0.009 MyModule.py:46(inv_transform_line)
   252404    0.158    0.000    0.560    0.000 Image.py:1303(getpixel)
   252404    0.135    0.000    0.333    0.000 ImageFile.py:134(load)
   252407    0.134    0.000    0.187    0.000 Image.py:809(load)
   252404    0.069    0.000    0.069    0.000 {method 'getpixel' of 'ImagingCore' objects}
   252406    0.053    0.000    0.053    0.000 {method 'pixel_access' of 'ImagingCore' objects}
       28    0.011    0.000    0.011    0.000 {method 'decode' of 'ImagingDecoder' objects}
        1    0.005    0.005    0.005    0.005 {method 'encode_to_file' of 'ImagingEncoder' objects}
      358    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.zeros}
	...  etc  ...

 

Note that the total runtime takes longer when we profile the code, especially from within iPython. The point though is to find the relative execution times of each function. It’s clear that the majority of the runtime is being spent in the low level function inv_transform_pixel(), and its sub-functions inv_merc_long() and inv_merc_lat() with their heavy use of Numpy trigonometry. Once we have declared all of our initial static C variables, these will be the areas we’ll examine first.

1. Compiling Python With Cython

But before we do that, first we need to see how much faster the code runs when it is compiled, rather than just interpreted. Compiling Python code with Cython is often the only step that many companies take, as a way of hiding their Python source code. (People often use the Python compiler Nuitka for the same reason.)

Here is the result of compiling the Python code with Cython which, apart from hiding the source code, brings the additional benefit of running slightly faster:

In [1]: from MyModule import inv_transform_map 
In [2]: %timeit inv_transform_map()                                                   
Running the map transform...
Running the map transform...
...
3.73 s ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The speed gain of 6% isn’t much, but since we are already dealing with some optimized Python Numpy code, it’s perhaps to be expected.

We can see from the Cython HTML annotation file from this compilation that Python’s objects and functions are being used on every line:

Compiling Python 1Compiling Python 2

As we proceed with the Cython mods, the yellow lines will gradually fade to white, indicating less and less Python interaction.

 

2. Using Static C Data Types

This step includes most of the changes I implemented in the last Cython post, apart from the array stuff. The changes I’ve made in this step are:

  • Using Cython’s cdef static C data types for all ints and floats, using longs and doubles where necessary.
  • Using cdef blocks, if declaring many static C variables at once.
  • Use Cython’s cdef type Py_ssize_t for any array indices.
  • Using fast C division, which does not check for a zero denominator, as Python does.
  • Some cosmetic changes: dropping the Python environment line, and using language_level=3 as a Cython compiler directive to reduce the length of the compiler message.

Here is the code:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
cimport cython
from PIL import Image
import numpy as np
 
 
@cython.cdivision(True)
def inv_merc_lat(float translat_in_rads, float translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -np.arcsin(np.sin(translong_in_rads) / np.cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
def inv_merc_long(float trans_lat_rads, float trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = np.arctan(np.sinh(trans_lat_rads) / np.cos(trans_long_rads))
 
    if (trans_long_rads > np.pi/2) or (trans_long_rads < -np.pi / 2):
        if trans_lat_rads >= 0:
            orig_long_rads += np.pi
        elif trans_lat_rads < 0:
            orig_long_rads -= np.pi
    return orig_long_rads
 
 
@cython.cdivision(True)
def inv_transform_pixel(int x_pix_num, int x_max, int y_max, float transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (2 * np.pi * x_pix_num / x_max) - np.pi
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = int(0.5 * (orig_longitude / np.pi + 1.0) * x_max)
    original_y = int((0.5 - orig_latitude  / np.pi) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
 
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
 
    # A pixel's x and y coordinates will usually exceed 256, so need uint16:
    line_pix_mapping = np.zeros((x_max, 2), dtype=np.uint16)
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * np.pi
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping[x_coord] = inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat)
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    print("Running the map transform...")
 
    line_pixel_mapping = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 2), dtype=np.uint16)
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping[map_image_line] = inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE)
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image.getpixel((
                    int(new_pix_x), int(new_pix_line)))[:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

And here is the new runtime:

In [3]: %timeit inv_transform_map() 
Running the map transform... 
...
3.46 s ± 5.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So again we’ve only achieved only a modest runtime improvement, this time of about 7%.

Note how the HTML annotation file has barely changed for the most highly used functions. The lines still using Python functions and objects are still in yellow, but we are starting to see some white lines appear for those lines handling only static C variables:

There is clearly a lot more we have to do a make this code run faster.

3. Replacing Numpy Math with C Math Functions

Even though Numpy is ultimately implemented in C, at its highest level every Numpy function call is still a Python function, each of which comes with an execution overhead. The purpose of this step is to replace these Python function calls with direct C calls. Since many Numpy functions and constants are also implemented in the C math library, it’s a fairly simple exercise to import the C functions corresponding to the Numpy trigonometry functions and mathematical constants and use them directly.

Here is the code to do so:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
from libc.math cimport sin, cos, tan, asin, acos, atan
from libc.math cimport sinh, cosh, tanh, asinh, acosh, atanh
from libc.math cimport M_PI, M_PI_2
 
cimport cython
from PIL import Image
import numpy as np
 
cdef double PI = M_PI
cdef double TWO_PI = 2 * M_PI
cdef double HALF_PI = M_PI_2
 
DTYPE = np.intc
 
 
@cython.cdivision(True)
def inv_merc_lat(float translat_in_rads, float translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -asin(sin(translong_in_rads) / cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
def inv_merc_long(float trans_lat_rads, float trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = atan(sinh(trans_lat_rads) /cos(trans_long_rads))
 
    if (trans_long_rads > HALF_PI) or (trans_long_rads < -HALF_PI):
        if trans_lat_rads >= 0:
            orig_long_rads += PI
        elif trans_lat_rads < 0:
            orig_long_rads -= PI
    return orig_long_rads
 
 
@cython.cdivision(True)
def inv_transform_pixel(int x_pix_num, int x_max, int y_max, float transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (TWO_PI * x_pix_num / x_max) - PI
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = <int>(0.5 * (orig_longitude / PI + 1.0) * x_max)
    original_y = <int>((0.5 - orig_latitude  / PI) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
 
    line_pix_mapping = np.zeros((x_max, 2), dtype=np.intc)
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * PI
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping[x_coord] = inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat)
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """    
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    print("Running the map transform...")
 
    line_pixel_mapping = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 2), dtype=np.intc)
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping[map_image_line] = inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE)
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image.getpixel((
                    int(new_pix_x), int(new_pix_line)))[:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

To compile this version, we first need to change the setup.py file to tell Cython about the C math libraries. Here is the version of setup.py we shall be using from now on:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
 
ext_modules = [Extension("MyCythonModule",
                        sources=["MyCythonModule.pyx"],
                        libraries=["m"]
                        )
              ]
 
ext_options = {"compiler_directives": {"profile": True},
                      "annotate": True}
 
setup(ext_modules=cythonize(ext_modules, **ext_options))

 

(If you have Windows, check your distribution for any additional compiler options you may need).  Once they are compiled and imported, the C math libraries give a nice boost to the execution speed:

In [3]: %timeit inv_transform_map() 
Running the map transform... 
... 
1.31 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

This is an incremental time improvement of 62% for one step, and with an overall runtime now 3 times faster than Python.  And here’s how Cython’s HTML annotation file now looks for the most frequently called functions:

The reason the return lines are still in Python is because the functions are still Python def functions. The next step will fix this.

4. Making Python Functions Inline C Functions

The purpose of this step is to stop making thousands of high repetition Python function calls, which are incredibly expensive time-wise. The hard way to do this is to rewrite your code into a flat, non-hierarchical block. The easy way is just to tell Cython just to treat your Python functions as if you had actually done this, while leaving the code with the apparent logical structure you’ve given it. This step also requires us to replace the Python def keyword with the Cython cdef keyword cdef. This has the primary effect of turning these Python functions into C functions, with the side effect that they cannot now be seen from outside the module by another program that may want to import them. As a final step, the functions’ return types have been added to the function signature/definition lines.

Here is the code in full:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
from libc.math cimport sin, cos, tan, asin, acos, atan
from libc.math cimport sinh, cosh, tanh, asinh, acosh, atanh
from libc.math cimport M_PI, M_PI_2
 
cimport cython
from PIL import Image
import numpy as np
 
cdef double PI = M_PI
cdef double TWO_PI = 2 * M_PI
cdef double HALF_PI = M_PI_2
 
DTYPE = np.intc
 
 
@cython.cdivision(True)
cdef inline float inv_merc_lat(float translat_in_rads, float translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -asin(sin(translong_in_rads) / cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
cdef inline float inv_merc_long(float trans_lat_rads, float trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = atan(sinh(trans_lat_rads) /cos(trans_long_rads))
 
    if (trans_long_rads > HALF_PI) or (trans_long_rads < -HALF_PI):
        if trans_lat_rads >= 0:
            orig_long_rads += PI
        elif trans_lat_rads < 0:
            orig_long_rads -= PI
    return orig_long_rads
 
 
@cython.cdivision(True)
cdef inline (int, int) inv_transform_pixel(int x_pix_num, int x_max, int y_max,
                                           float transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (TWO_PI * x_pix_num / x_max) - PI
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = <int>(0.5 * (orig_longitude / PI + 1.0) * x_max)
    original_y = <int>((0.5 - orig_latitude  / PI) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
 
    line_pix_mapping = np.zeros((x_max, 2), dtype=np.intc)
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * PI
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping[x_coord] = inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat)
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    print("Running the map transform...")
 
    line_pixel_mapping = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 2), dtype=np.intc)
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping[map_image_line] = inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE)
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image.getpixel((
                    int(new_pix_x), int(new_pix_line)))[:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

This only gives us a small runtime gain of about 4%:

In [3]: %timeit inv_transform_map() 
Running the map transform... 
...
1.25 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

It’s a small speed gain, but the Cython compiler’s annotation file reveals that the high frequency, low level functions are now entirely in C, which will give us more options later:

 

5. Using a C list to Store Transform Results

Moving to the higher level functions, there are two Numpy arrays used in this program. The first one [called line_pix_mapping in inv_transform_line() and  line_pix_mapping in inv_transform_map() ]  stores the transform results. This array is really just a correlation between the input and output pixel positions, without any pixel values. The second Numpy array [ called output_image_data in inv_transform_map() ]  uses this transform data to create the output image from the input image data. This Cython mod replaces the first Numpy array (both as line_pix_mapping and line_pixel_mapping) with a simple C list to store the image transform correlations, as they are calculated, line by line:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
from libc.math cimport sin, cos, tan, asin, acos, atan
from libc.math cimport sinh, cosh, tanh, asinh, acosh, atanh
from libc.math cimport M_PI, M_PI_2
 
cimport cython
from PIL import Image
import numpy as np
 
cdef double PI = M_PI
cdef double TWO_PI = 2 * M_PI
cdef double HALF_PI = M_PI_2
 
 
@cython.cdivision(True)
cdef inline float inv_merc_lat(float translat_in_rads, float translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -asin(sin(translong_in_rads) / cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
cdef inline float inv_merc_long(float trans_lat_rads, float trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = atan(sinh(trans_lat_rads) /cos(trans_long_rads))
 
    if (trans_long_rads > HALF_PI) or (trans_long_rads < -HALF_PI):
        if trans_lat_rads >= 0:
            orig_long_rads += PI
        elif trans_lat_rads < 0:
            orig_long_rads -= PI
    return orig_long_rads
 
 
@cython.cdivision(True)
cdef inline (int, int) inv_transform_pixel(int x_pix_num, int x_max, int y_max,
                                           float transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (TWO_PI * x_pix_num / x_max) - PI
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = <int>(0.5 * (orig_longitude / PI + 1.0) * x_max)
    original_y = <int>((0.5 - orig_latitude  / PI) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
        list line_pix_mapping = []
 
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * PI
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping.append(inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat))
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
        list line_pixel_mapping = []
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    print("Running the map transform...")
 
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping.append(inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE))
 
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image.getpixel((
                    int(new_pix_x), int(new_pix_line)))[:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

The speed gain from this step was another 420ms, or an improvement of about 33% :

In [5]: %timeit inv_transform_map() 
Running the map transform...
...
832 ms ± 5.39 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

This means the Cython code is now running 4.8 times faster than Python.

6. Reading the Input Image Into a Numpy Array

At this point the Python profiler was run again. Here is the result:

In [4]: %prun inv_transform_map()                                                         
Running the map transform...
 
         2020174 function calls in 1.143 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.455    0.455    1.143    1.143 MyCythonModule.pyx:84(inv_transform_map)
   252404    0.139    0.000    0.477    0.000 Image.py:1303(getpixel)
   252407    0.121    0.000    0.163    0.000 Image.py:809(load)
   252404    0.109    0.000    0.284    0.000 ImageFile.py:134(load)
   252404    0.076    0.000    0.143    0.000 MyCythonModule.pyx:42(inv_transform_pixel)
      356    0.061    0.000    0.204    0.001 MyCythonModule.pyx:66(inv_transform_line)
   252404    0.054    0.000    0.054    0.000 {method 'getpixel' of 'ImagingCore' objects}
   252406    0.042    0.000    0.042    0.000 {method 'pixel_access' of 'ImagingCore' objects}
   252404    0.034    0.000    0.034    0.000 MyCythonModule.pyx:27(inv_merc_long)
   252404    0.033    0.000    0.033    0.000 MyCythonModule.pyx:18(inv_merc_lat)
       28    0.011    0.000    0.011    0.000 {method 'decode' of 'ImagingDecoder' objects}
        1    0.005    0.005    0.005    0.005 {method 'encode_to_file' of 'ImagingEncoder' objects}
   ... etc ...

 

So the PIL Image function getpixel() is now the slowest function in the program. The getpixel() routine was therefore dropped, and the image data read into a Numpy array when it was first opened. This shaved another 460ms off the time:

In [3]: %timeit inv_transform_map()
Running the map transform...
...
370 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Which means the code is now running nearly 11 times faster than Python. This last gain was not directly due to Cython, of course. Cython acceleration has merely revealed a Python bottleneck which was not apparent earlier.

Here is how the code now looks:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
from libc.math cimport sin, cos, tan, asin, acos, atan
from libc.math cimport sinh, cosh, tanh, asinh, acosh, atanh
from libc.math cimport M_PI, M_PI_2
 
cimport cython
from PIL import Image
import numpy as np
 
cdef double PI = M_PI
cdef double TWO_PI = 2 * M_PI
cdef double HALF_PI = M_PI_2
 
 
@cython.cdivision(True)
cdef inline float inv_merc_lat(float translat_in_rads, float translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -asin(sin(translong_in_rads) / cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
cdef inline float inv_merc_long(float trans_lat_rads, float trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = atan(sinh(trans_lat_rads) /cos(trans_long_rads))
 
    if (trans_long_rads > HALF_PI) or (trans_long_rads < -HALF_PI):
        if trans_lat_rads >= 0:
            orig_long_rads += PI
        elif trans_lat_rads < 0:
            orig_long_rads -= PI
    return orig_long_rads
 
 
@cython.cdivision(True)
cdef inline (int, int) inv_transform_pixel(int x_pix_num, int x_max, int y_max,
                                           float transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (TWO_PI * x_pix_num / x_max) - PI
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = <int>(0.5 * (orig_longitude / PI + 1.0) * x_max)
    original_y = <int>((0.5 - orig_latitude  / PI) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
        list line_pix_mapping = []
 
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * PI
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping.append(inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat))
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
        list line_pixel_mapping = []
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        input_image_data = np.array(input_image)
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    print("Running the map transform...")
 
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping.append(inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE))
 
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image_data[int(new_pix_line),
                                                              int(new_pix_x)][:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

7. Using Memory Views of Numpy Arrays

Memory views are a way of bypassing Numpy’s administrative overhead by reading and writing Numpy array data to and from memory directly. They are only marginally slower than C arrays and much easier to set up. The Numpy arrays are still there, and can still be accessed for what they are good at – matrix mathematics and fast array transforms. But if you just want to get your hands on the data and manipulate it, memory views are the way to go. The code to implement memory views is as follows:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
from libc.math cimport sin, cos, tan, asin, acos, atan
from libc.math cimport sinh, cosh, tanh, asinh, acosh, atanh
from libc.math cimport M_PI, M_PI_2
 
cimport cython
from PIL import Image
import numpy as np
 
cdef double PI = M_PI
cdef double TWO_PI = 2 * M_PI
cdef double HALF_PI = M_PI_2
 
 
@cython.cdivision(True)
cdef inline float inv_merc_lat(float translat_in_rads, float translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -asin(sin(translong_in_rads) / cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
cdef inline float inv_merc_long(float trans_lat_rads, float trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = atan(sinh(trans_lat_rads) /cos(trans_long_rads))
 
    if (trans_long_rads > HALF_PI) or (trans_long_rads < -HALF_PI):
        if trans_lat_rads >= 0:
            orig_long_rads += PI
        elif trans_lat_rads < 0:
            orig_long_rads -= PI
    return orig_long_rads
 
 
@cython.cdivision(True)
cdef inline (int, int) inv_transform_pixel(int x_pix_num, int x_max, int y_max,
                                           float transformed_lat):
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (TWO_PI * x_pix_num / x_max) - PI
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = <int>(0.5 * (orig_longitude / PI + 1.0) * x_max)
    original_y = <int>((0.5 - orig_latitude  / PI) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
        list line_pix_mapping = []
 
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * PI
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping.append(inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat))
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
        list line_pixel_mapping = []
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        input_image_data = np.array(input_image)
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    cdef unsigned char [:, :, :] in_image_view = input_image_data
    print("Running the map transform...")
 
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping.append(inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE))
 
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
    cdef unsigned char [:, :, :] out_image_view = output_image_data
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is in case it's RGBA:
            out_image_view[line, pixel] = in_image_view[new_pix_line][new_pix_x][:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

This gave the runtime a huge boost:

In [3]: %timeit inv_transform_map() 
Running the map transform... 
... 
110 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So the code is now running 36 times faster than Python.

8. Turning Off the Python GIL

The fact that the three lowest level functions (the ones called thousands of times) are all now in C gives us an extra card to play in the Python acceleration game: parallel processing. As we noted earlier, the compiler’s HTML annotation file clearly shows that there are no Python objects or functions left in these functions, which means that – for these functions at least – we can turn off the Python global interpreter lock (GIL).

With Cython, this can done to whatever granularity you like –  on a function, a loop or even an individual line, by using the Cython keyword nogil. Since we now have entire functions is C, we will be using it on a function level, leaving Cython to handle all the threading issues in the background.

Note: you can’t just turn off the GIL for a mid-level function, if it’s calling other functions defined in the same scope.  You have to do it for all its locally-defined sub-functions as well, or else you’ll get this error message:

Error compiling Cython file:
...
Calling gil-requiring function not allowed without gil.

The rule is that a local function that requires the GIL can call another local function that does not, but not the other way round.

Here is the code:

# -*- coding: utf-8 -*-
# cython: language_level=3
 
from libc.math cimport sin, cos, tan, asin, acos, atan
from libc.math cimport sinh, cosh, tanh, asinh, acosh, atanh
from libc.math cimport M_PI, M_PI_2
 
cimport cython
from PIL import Image
import numpy as np
 
cdef double PI = M_PI
cdef double TWO_PI = 2 * M_PI
cdef double HALF_PI = M_PI_2
 
 
@cython.cdivision(True)
cdef inline float inv_merc_lat(float translat_in_rads, float translong_in_rads) nogil:
    """Calculates orig transverse mercator latitude for a given image coord
    """
    cdef float orig_lat_rad
    orig_lat_rad = -asin(sin(translong_in_rads) / cosh(translat_in_rads))
    return orig_lat_rad
 
 
@cython.cdivision(True)
cdef inline float inv_merc_long(float trans_lat_rads, float trans_long_rads) nogil:
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    cdef float orig_long_rads
 
    orig_long_rads = atan(sinh(trans_lat_rads) /cos(trans_long_rads))
 
    if (trans_long_rads > HALF_PI) or (trans_long_rads < -HALF_PI):
        if trans_lat_rads >= 0:
            orig_long_rads += PI
        elif trans_lat_rads < 0:
            orig_long_rads -= PI
    return orig_long_rads
 
 
@cython.cdivision(True)
cdef inline (int, int) inv_transform_pixel(int x_pix_num, int x_max, int y_max,
                                           float transformed_lat) nogil:
    """Calculates the pixel position needed from the input image
    """
    cdef:
        float trans_longitude, orig_latitude, orig_longitude
        int original_x, original_y
 
    trans_longitude = (TWO_PI * x_pix_num / x_max) - PI
 
    orig_latitude = inv_merc_lat(transformed_lat, trans_longitude)
    orig_longitude = inv_merc_long(transformed_lat, trans_longitude)
    original_x = <int>(0.5 * (orig_longitude / PI + 1.0) * x_max)
    original_y = <int>((0.5 - orig_latitude  / PI) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
    return original_y, original_x
 
 
def inv_transform_line(int map_line_num, int x_max, int y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
    """
    cdef:
        float y_norm, trans_lat
        Py_ssize_t x_coord
        list line_pix_mapping = []
 
    y_norm = - 2 * map_line_num / y_max + 1
    trans_lat = y_norm * PI
 
    for x_coord in range(x_max):  # for each pixel on the output line
        line_pix_mapping.append(inv_transform_pixel(x_coord, x_max,
                                                          y_max, trans_lat))
    return line_pix_mapping
 
 
def inv_transform_map(map_input_name='WorldMap.png',
                                  map_output_name='ShiftedPoleMap.jpg',
                                  displaying_images=False):
    """Builds an output map image from the input image via the inv transform
    """
    cdef:
        int X_PIC_SIZE, Y_PIC_SIZE
        Py_ssize_t map_image_line, line, pixel, new_pix_line, new_pix_x
        list line_pixel_mapping = []
 
    try:
        input_image = Image.open(map_input_name)
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        input_image_data = np.array(input_image)
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        raise SystemExit()
 
    cdef unsigned char [:, :, :] in_image_view = input_image_data
    print("Running the map transform...")
 
    for map_image_line in range(Y_PIC_SIZE):
        line_pixel_mapping.append(inv_transform_line(map_image_line,
                                                        X_PIC_SIZE, Y_PIC_SIZE))
 
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
    cdef unsigned char [:, :, :] out_image_view = output_image_data
 
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = line_pixel_mapping[line][pixel]
            # The [:3] is in case it's RGBA:
            out_image_view[line, pixel] = in_image_view[new_pix_line][new_pix_x][:3]
 
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
 
    output_image.save(map_output_name)
    return

 

On a dual core laptop, this gives us an 8% runtime boost:

In [3]: %timeit inv_transform_map() 
Running the map transform... 
... 
101 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Which means the code is now running 39 times faster. The overall gain would obviously be greater with more cores, but with an 8% speed improvement in going from 1 to 2 cores it’s probably not worth it. Generally speaking, the gain from going parallel with Cython increases with the amount of C-only code you can get to compile without the GIL, and what percentage of the total execution time that code was taking. If you have highly computational for-loops and functions taking up most of your execution time, the result for your own project might be more dramatic.

Cython vs. Parallel Python

It’s worth noting that the gains achieved from parallel computing in a previous post using almost the same code were nominally much higher – over 40% faster in going from 1 to 2 cores. A couple of things to note though.

First, it’s easier to make inefficient code run faster: on two cores it was still taking nearly 4 seconds.  Second, even with the best coders money can buy, the most you can theoretically expect from adding more cores is a linear division of the original runtime. So even if you managed to scale the original Python to run on 32 cores (after much time, coding effort and expense) and achieve the theoretical limit in speed, you would still only have a runtime of 220ms, half as fast as Python with Cython mods running on one core.

Finally, the notion of Cython vs. parallel Python doesn’t really make sense. Cython gives you near-C speeds, on top of which you still have the option of running on multiple cores, if you choose to do so.

The Final Profile

The final Python profiler run shows that the function inv_transform_line() still takes over half the execution time. The good news is that that same execution time has been reduced from 4 seconds to a little over 100ms:

In [4]: %prun inv_transform_map()                                                                                                   
Running the map transform...
 
         970 function calls in 0.120 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      356    0.066    0.000    0.066    0.000 MyCythonModule.pyx:64(inv_transform_line)
        1    0.034    0.034    0.120    0.120 MyCythonModule.pyx:81(inv_transform_map)
       28    0.011    0.000    0.011    0.000 {method 'decode' of 'ImagingDecoder' objects}
        1    0.005    0.005    0.005    0.005 {method 'encode_to_file' of 'ImagingEncoder' objects}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'bytes' objects}
       16    0.000    0.000    0.000    0.000 {method 'encode' of 'ImagingEncoder' objects}
        1    0.000    0.000    0.000    0.000 {built-in method PIL._imaging.fill}
        1    0.000    0.000    0.000    0.000 {built-in method PIL._imaging.new}
       91    0.000    0.000    0.000    0.000 {method 'read' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 {built-in method io.open}
        3    0.000    0.000    0.000    0.000 Image.py:601(__del__)
        1    0.000    0.000    0.012    0.012 ImageFile.py:134(load)
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.BufferedRandom' objects}
       30    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}
       27    0.000    0.000    0.000    0.000 PngImagePlugin.py:613(load_read)
       ... etc ...

Note how the Python profiler can no longer see inside inv_transform_line(), since the sub-functions that it calls are now C functions, not Python functions.

Summary of Changes

We have taken some simple Python image transform code and made it run 39 times faster than the original Python, using nothing but a laptop. Here is a summary of the changes we made:

Summary of changes and gains

From the table we can see that the single most dramatic change was step 3 – replacing the Numpy math functions with directly imported C math functions. We can also see that many of these changes had to happen in roughly this order. Certainly, 3, 4 & 5 could have happened in any order, but 1 and 2 had to happen first. And step 8 could have happened earlier, but not before the lower level functions were rendered entirely into C functions at steps 2, 3 & 4. Similarly, step 7’s memory views first needed all the input image data to be loaded into an array, which meant that the PIL routine getpixel had to happen first in step 6.

Ultimately, you can make as many Cython mods to Python as you like, but the trick is knowing when to stop. Using three dimensional C arrays for the image data illustrates the point. They would make the two highest level functions noticeably more complex, and therefore significantly harder to maintain – even by yourself in six months’ time – in return for perhaps only a few milliseconds’ gain. Memory views of Numpy arrays might be fractionally slower than C arrays, but the code is somewhat easier to understand. The law of diminishing returns very much applies here.

There is much more to Cython, but these two posts should be enough to illustrate what is possible. The changes are always a trade-off between how much work you are willing to put in, the gains you might achieve, how complex the code might become, and so how difficult it might be to maintain when you’re finished. If you’ve followed all the steps you should be able to work your way through the Cython online documentation, and you’ll soon be ready to bring C-like speeds to your own Python applications.

 

Parallel Python – 1: Prime Numbers

With the impending demise of Moore’s Law, multiple cores are a common manufacturers’ workaround for improving hardware performance, whether or not your installed apps can use the parallel architecture.

And with each new release of Python, parallel programming gets even easier. But the degree to which your code can use your multiple cores will depend on the kind of problem you are trying to solve, on the implementation of Python you are running and, as it turns out, how truly parallel the underlying architecture of your system actually is.

The goal of this series of posts is to see how adaptable some of my existing code is to take advantage of multi-core hardware, to see what changes need be made to scale it, and to measure the performance improvements from the exercise. Continue reading “Parallel Python – 1: Prime Numbers”