From Python To Cython

This longer post will show you some of the coding skills you’ll need for turning your existing Python code into the Python-C hybrid we call Cython. In doing so, we’ll be digging into some C static data types, to see how much faster Python code will run, and restructuring some Python code along the way for maximum speed.

With Cython, all the benefits of Python are still yours – easily readable code, fast development cycles, powerful high level commands, maintainability, a suite of web development frameworks, a huge standard library for data science, machine learning, imaging, databases and security, plus easy manipulation of files, documents and strings. You should still use Python for all these things – these are what Python does best. But you should also think about combining them with Cython to speed up your computationally intensive Python functions that needs to be fast.

For those unfamiliar with Cython, it’s Python with C or C++ extensions. It can be used for importing externally compiled C libraries into your Python code or, as we shall see, for adding static C data types into your existing Python code to speed it up. After reading and following this post you should be in a good position to start rewriting some of your existing Python functions with Cython’s C data types, importing them back into your Python project as a module, and then running your program as before, but much faster.

I’ll be making a series of Cython code modifications to one of my existing programs – a prime number calculator – to see what gains each change brings. This program has already been tuned to be fast, using Numpy masking of binary arrays. As we shall see, the speed gains can still be quite dramatic.

Note that this exercise will not be looking at using Cython to import external C code such as functions from the C math libraries. This is a more complex area that will be looked at in a future post.

Please also note that this post is not recommended for Python newbies, and would best suit those with a few years’ experience.  I’ll be using Python 3.7, although Cython will work with any version of Python 2.X or 3.X. Nor is this an advanced course in Cython. If you want to dig deeper, you should refer to the latest on-line documentation.

I’ve tried to give instructions that will work on either Mac, Windows or Linux. I’m mainly a Mac user, but I’ve tried to include the Linux and Windows commands where I know they differ. These differences should be minimal – the instructions will bypass the explicit use of a platform-specific C compiler (such as gcc), and will instead using the Cython.Build function cythonize() to manage all the compilation flags you need for your platform. Your Cython code will still be compiled, but Cython will handle all the details and build your Cython module from inside a Python script with the appropriate compiler options setting for your operating system.

Generally speaking, when you import a module, you usually don’t know what’s inside it – all you see is the API, which are the functions the developer lets you use when you import it. If you already use Python, the chances are you’ve already been importing modules with C-extensions (such as Numpy or Pandas) without realizing it. These are essentially modules written in C/C++ with a Python API. Once you’ve turned your Python into a Cython module and compiled it, you’ll be importing and running your code in exactly the same way.

What You’ll Need

Numpy. I’ll be converting some existing Python 3.X code, but Cython also works with Python 2.X (also known as legacy Python). As you probably know, support for Python 2.X ends in 2020, so if you’re working on a new project, just use Python 3.

Cython. Install it as you would any PyPi package. If you’re using Python 3, just use pip3:

$ sudo pip3 install Cython

If you’re still using Python 2.X, use pip. On Windows, if you have any problems, check the instructions for your Python distribution. See the Cython website for more details about installation. As of this post, the current version of Cython is version 3.0a0. Be careful about some of the online Cython documentation, web postings, YouTube videos and tutorials: many of them are out of date, as are some of the books. Cython used to be quite hard to use. With the latest tools, it’s now a little bit easier, but still tricky.

An editor. Use IDLE, Emacs, Vi, Sublime or equivalent simple non-IDE editor. Once you start making Cython mods to your code, most advanced Python IDEs will just complain – and then refuse to run your code. Two exceptions (there are probably more) are Spyder (from 3.2) and PyCharm Professional Edition. Both of these will recognize and highlight your Cython keywords (if you have Cython installed and give the file a .pyx ending) and tell you any compilation errors when you run, but you’ll still need to run the setup command (see below) from a command line to build your external module. So it’s back to basics for the editor, unless you want to use Spyder or PyCharm Pro. The instructions I’m giving will work fine with a simple editor, which anyone can use.

A debugger (optional). Many Cython programmers work without a debugger, using the helpful Cython build and C compiler errors to tell them what the problems are. If you feel you must have an interactive debugger, as far as I’m aware you’ve currently got only one choice: you’ll need to use Linux, where you can use cygdb (Cython gnu debugger). Note that for now, PyCharm Pro and Spyder will not let you run their debuggers on a Cython file.

iPython (optional). Not essential, but a powerful console environment that lets you load and run modules, time them, run shell commands, and test Cython commands all in one place, something you cannot do in a standard Python console. For example, just as you would do with Python code, you might want to quickly check if some Cython code will work:

$ ipython
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %load_ext cython  
In [2]: %%cython 
    ...: cdef float x = 3.0 
    ...: cdef float g = 0.0 
    ...: cdef int a = 10 
    ...: g = a ** x 
    ...: print ("g = {:,}".format(g)) 
    ...:  
    ...:                                                                                  
g = 1,000.0

If you use iPython the important thing to remember is that you can’t stay in iPython and simply re-import after running a compile/build as a shell command using the “!” shell escape character. In other words, if you make any edits to your code, you MUST remember to quit iPython {using ^D or exit() }, recompile, restart iPython, then re-import and rerun your code.

On Mac, install it as you would most Python packages:

$ pip3 install ipython

If you’re use Python 2, just use pip. On Linux, if you’re using Ubuntu/Debian use apt  (or yum, rpm, dnf, etc, for other Linux varieties):

$ sudo apt install ipython 

If you want to use Spyder or PyCharm Pro, it’s included.

Getting Started

I’ll be using my offset prime sieve program from a former post to demonstrate what’s possible with Cython. Here is the source listing for the functions I’ll be turning into Cython:

# -*- coding: utf-8 -*-
 
import numpy as np
 
def init_prime_array(maximum, prime_window_size):
    LastArray = np.ones(prime_window_size, dtype=np.uint16)
    LastArray[0::2] = False
    return LastArray
 
 
def primes_in_window_below(limit, window_size=1000):
    is_prime = init_prime_array(limit, window_size)
    window_start = limit - window_size
    root_limit = int(limit ** 0.5 + 1.5)
 
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = window_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        start_index = n2 - rem_2n
        is_prime[start_index : : n2] = False
 
# Generator:
    for m in range(1, window_size, 2):
        try:
            if is_prime[m]:
                n = window_start + m
                yield (n)
        except IndexError:
            break
    return

 

  • First, note that I’ve stripped away all the user interaction stuff, including asking the user for a limit and printing the results. The two functions are the only ones I want to speed up. Note also that it already contains some pretty fast Numpy binary masking. Finally, note that you don’t need the initial Python 3 environment line, as you will never be running this file as a script. And, strictly speaking, after you start making your Cython mods, it will no longer be Python.
  • Save this code as MyCythonModule.py to a new folder. This name will make the following shortcuts I’ve created for you work. So why haven’t I given my Cython file a .pyx ending, as is conventional? So you can right-click on it and edit it with your Python editor of choice. My shell file CompileCython.sh (see below) will create the .pyx file when you compile.
  • Create a Python file called setup.py containing the following lines. Note: if you use the same MyCythonModule name for every Cython module project, you’ll only ever have to edit this file when you want to change your Cython compiler options:
setup.py   
from distutils.core import setup
from Cython.Build import cythonize
 
ext_options = {"compiler_directives": {"profile": True}, "annotate": True}
 
setup(ext_modules = cythonize("MyCythonModule.pyx", **ext_options)
)

 

  • The “annotate”:True flag creates an interactive HTML view of your Cython code, as a result of the Cython compilation. You’ll be using this to check what’s now running at C speeds, and how much code is still running as Python. This will therefore tell you what’s left to do. More on this later.
  • Create a shell build script called CompileCython.sh (or BuildCython.sh, or something similar) containing the following shell commands on each line:
cp MyCythonModule.py MyCythonModule.pyx
python3 setup.py build_ext --inplace

 

Make this file executable with chmod +x and save it to the same folder. After you make your Cython mods to MyCythonModule.py in your basic editor, this shell command is the only command you’ll ever have to run. The above script will create the .pyx file before you run Python on setup.py for the Cython build. Note: if you’re using a Cython-aware IDE such as Spyder or PyCharm Pro, you won’t need this shell file. Once your code runs OK you’ll only have to run this command:

python3 setup.py build_ext --inplace

Before you run your build command, your folder’s contents should now look like this (minus the shell file if you’re using Spyder or PyCharm Pro):

CompileCython.sh	MyCythonModule.py	setup.py

 

  • Make sure you’re using Git or similar version control system. You’re going to be making a lot of incremental changes, some of which will completely mess things up. You’re going to have to keep track of what you’ve done, and you’re often going to need to roll things back to the last working version.
  • Make your Cython changes atomic. In other words, don’t make more than one sort of Cython change per commit. If you’re going to experiment with combinations of changes, try to make each change in isolation on its own Git branch, so that you can cherry pick from your Cython changes, once you’re happy with them.

You’re now ready to modify your Python code to use Cython’s C data types.

Timing Your Code As You Go

Before you make any Cython mods, you first need to time your pure Python program. As you make them, this will give you a speed comparison for each of your Cython mods. Here is the time for the Python interpreter running my offset prime sieve program, finding the primes just below 1.0e+12:

$ python3 -m timeit -s "from MyCythonModule import primes_in_window_below" "print('{:,}'.format(max(primes_in_window_below(10**12))))"
999,999,999,989
999,999,999,989
999,999,999,989
999,999,999,989
999,999,999,989
999,999,999,989
1 loop, best of 5: 355 msec per loop

I will use the 10**12 time as the benchmark test from now. You should decide on a similar benchmark test for you own program and write down the time. This is the time you will be comparing the speed of every Cython mod to.

Compiling Your Python With Cython

Even before you make any Cython mods, you’re ready to run your Cython build command. This should give some immediate speed gains, without any work. To build your Cython module, simply run this command:

$ ./CompileCython.sh

If your code had no errors in Python it should compile OK with Cython. If you’re running on a Mac, your folder contents should now look something like this:

CompileCython.sh			MyCythonModule.py
MyCythonModule.c MyCythonModule.pyx
MyCythonModule.cpython-37m-darwin.so build
MyCythonModule.html setup.py

If you’re building a Python module on Mac or Linux, the only file you will need is the .so (shared object) file (which will have a slightly different name on Linux). On Windows, you’ll get a .pyd file instead of a .so file, but the Python module import should work in the same way. The build directory and the .c file were created as part of the compilation process and you can ignore them.  Both were stepping stones to the same result, namely, a module called MyCythonModule, ready to import into a Python program.

If you do nothing except import the Python module you’ve built with setup (as above), you should get an immediate speed gain. In effect, all you’ve done is compile your Python code into C, without any Cython code changes. For my offset prime sieve program, this was the result:

python3 -m timeit -s "from MyCythonModule import primes_in_window_below" "print('{:,}'.format(max(primes_in_window_below(10**12))))"
999,999,999,989
999,999,999,989
999,999,999,989
999,999,999,989
999,999,999,989
999,999,999,989
1 loop, best of 5: 305 msec per loop

Which means that by only compiling it (using Cython), my offset prime sieve program now runs 14% faster than it does when launched by the Python interpreter. This is less than the usual 50% improvement you would see with most programs, but the code was already highly tuned with Numpy masking, so this is perhaps understandable.

Examining the Cython HTML File

If your Cython build was successful, and you used the “annotate”:True argument in your setup.py file, you should now have a MyCythonModule.html file in your folder. Since I haven’t yet made any Cython code mods, every Python line in MyCythonModule.html should appear yellow, which is Cython’s way of saying, ‘Surely we could do better – at least give me a hint.’:

Cython HTML - all Python

The lighter the yellow, the less C behind the line. Open your own file and click on any yellow line to reveal the C it is generating. As you add and compile more C static data types, the lines will start to turn white.

Now you can start making Cython code mods. Open your Cython file MyCythonModule.py in your simple code editor, or save it as MyCythonModule.pyx and open it in Spyder or PyCharm Pro. Let’s see how we can speed things up.

Making Your Cython Mods

Making Cython modifications to Python code is an iterative process that will yield different gains on different projects for each change. In increasing order of complexity, I’ve listed below a collection of typical Cython code changes, with the gains each one gave on the runtime for my offset prime sieve:

A. Use Cython’s infer types directive

This is the quick, lazy way of using static C data types in your Python code, and can give you some moderate speed improvements. Essentially, you tell the Cython compiler to guess the C data types it needs. This will not give you the highest speeds, but it can lead to some cleaner code inside your functions with minimal effort. Here is one way to do this:

# -*- coding: utf-8 -*-
# cython: infer_types=True
 
import numpy as np
 
def init_prime_array(prime_window_size):
    LastArray = np.ones(prime_window_size, dtype=np.uint16)
    LastArray[0::2] = False
    return LastArray
 
 
def primes_in_window_below(limit, window_size=1000):
    is_prime = init_prime_array(window_size)
    window_start = limit - window_size
    root_limit = int(limit ** 0.5 + 1.5)
 
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = window_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        start_index = n2 - rem_2n
        is_prime[start_index : : n2] = False
 
# Generator:
    for m in range(1, window_size, 2):
        try:
            if is_prime[m]:
                n = window_start + m
                yield (n)
        except IndexError:
            break
    return

 

This gave the same runtime (about 305ms) as the Cython-compiled Python with no mods. With no advantage in using this (in the code we are modifying at least) this change was rolled back. This will also allow us to examine the individual gains we get from the different Cython mods we’re going to make.

B. Using cdefs ints for all integers

This will not work, but it illustrates how to debug your code by reading the Cython, C and runtime errors. Starting again with the original Python code, set all the Python integer types to Cython cdef int types:

# -*- coding: utf-8 -*-
 
import numpy as np
 
def init_prime_array(int prime_window_size):
    LastArray = np.ones(prime_window_size, dtype=np.uint16)
    LastArray[0::2] = False
    return LastArray
 
 
def primes_in_window_below(int limit, int window_size=1000):
    cdef int window_start, root_limit, n, n2, m, rem_2n, start_index
 
    is_prime = init_prime_array(limit, window_size)
 
    window_start = limit - window_size
    root_limit = int(limit ** 0.5 + 1.5)
 
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = window_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        start_index = n2 - rem_2n
        is_prime[start_index : : n2] = False
 
# Generator:
    for m in range(1, window_size, 2):
        try:
            if is_prime[m]:
                n = window_start + m
                yield (n)
        except IndexError:
            break
    return

 

This compiled OK, but when the command line was input with a limit of 10**12, this was the result:

File "MyCythonModule.pyx", line 11, in MyCythonModule.primes_in_window_below
def primes_in_window_below(int limit, int window_size=1000):

OverflowError: value too large to convert to int 

The runtime error tells us to bump up the C integer type we’re using for limit in the function definition. This makes sense. As we start to assign C data types to Python variables, we need to be aware of their meanings. We are passing a value of 1.0e+12 into the function and asking it to use a signed 16-bit C int, which by definition is <= 32,767. We’re gonna need a bigger float. Sorry, I mean int.

C. Using cdef longs for integers

The standard C type long is 32-bits, and can be used to represent signed integers up to 2.147e+09. This was used for all the variables and indices that would hold large integers:

# -*- coding: utf-8 -*-
 
import numpy as np
 
def init_prime_array(int prime_window_size):
    LastArray = np.ones(prime_window_size, dtype=np.uint16)
    LastArray[0::2] = False
    return LastArray
 
 
def primes_in_window_below(long limit, int window_size=1000):
    cdef long window_start, root_limit, n, n2, rem_2n, start_index
    cdef int m
 
    is_prime = init_prime_array(window_size)
 
    window_start = limit - window_size
    root_limit = int(limit ** 0.5 + 1.5)
 
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = window_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        start_index = n2 - rem_2n
        is_prime[start_index : : n2] = False
 
# Generator:
    for m in range(1, window_size, 2):
        try:
            if is_prime[m]:
                n = window_start + m
                yield (n)
        except IndexError:
            break
    return

 

This will actually run OK up to a limit of 10**16, which means that it probably actually corresponds to the C type long long. The lesson here is that Cython’s versions of C data types do not necessarily correspond one-to-one with C’s data types. More on this later. Here is the time for the 10**12 test:

1 loop, best of 5: 228 msec per loop

Which is now 36% faster than Python.

D. Using Py_ssize_t for array indices

This is a step that is recommended in the Cython online manual. The indexing variable was declared on its own line as:

cdef Py_ssize_t m

It made almost no difference to the runtime, but was left in as a best practice.

E. Using inline code

This is something that goes against normal programming practice. The idea is to restructure hierarchical code into flat inline code. In doing so, we are moving code from Python sub-functions back into the calling function, producing one block of inline code. The reason is that as we start to get faster and faster code, the overhead of Python function calls become a larger factor. If a Python function is called thousands of times, it starts to add up. Here is the new code listing, as one code block:

# -*- coding: utf-8 -*-
 
import numpy as np
 
def primes_in_window_below(long limit, int window_size=1000):
    cdef long window_start, root_limit, n, n2, rem_2n, start_index
    cdef Py_ssize_t m
 
    is_prime = np.ones(window_size, dtype=np.uint16)
    is_prime[0::2] = False
 
    window_start = limit - window_size
    root_limit = int(limit ** 0.5 + 1.5)
 
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = window_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        start_index = n2 - rem_2n
        is_prime[start_index : : n2] = False
 
# Generator:
    for m in range(1, window_size, 2):
        try:
            if is_prime[m]:
                n = window_start + m
                yield (n)
        except IndexError:
            break
    return

 

This change gave no speed improvement for the 10**12 test, probably because the function is called only once, but the inline code was left in as the original function wasn’t doing very much anyway:

1 loop, best of 5: 219 msec per loop

This is such a common requirement that Cython provides a faster way to do this, without restructuring your code, using the Cython keyword inline in your function definitions, like this:

cdef inline float my_function(int a, float b):
	
    # function body...
	
    return 

This is the recommended way to make your code compile inline, while saving coding time and keeping your program readable.

F. Using intc for Numpy integer arrays

The folks at Cython recommend that you use the intc data type for Numpy integer arrays, rather than the Numpy types uint8 and uint16. The Numpy array declaration line now looks like this:

is_prime = np.ones(window_size, dtype=np.intc)

This also gave no speed improvement.

G. Removing the exception handling from around the generator

Exception handing is a great feature of Python which C lacks, but it can have a heavy cost in computation time. This section was needed in the earlier multi-array version of the code, but is no longer needed. Unfortunately, this gave no improvement in speed.

H. Using C arrays instead of Numpy arrays

I decided to see if I could write some faster array-handling code than Numpy. All I was doing with Numpy was masking a 1000-item binary array. With access to C’s static data types, surely I could write something leaner than Numpy using a C array. Here is the code:

# -*- coding: utf-8 -*-
 
DEF WIN_SIZE = 1000
 
def primes_in_window_below(long limit):
 
    cdef long window_start, root_limit, n, n2, start_index, rem_2n
    cdef Py_ssize_t counter1, counter2, m
    cdef bint is_prime[WIN_SIZE]
 
    counter1 = 0
    while counter1 < WIN_SIZE:
        is_prime[counter1] = 0
        is_prime[counter1 + 1] = 1
        counter1 += 2
 
    window_start = limit - WIN_SIZE    
    root_limit = int(limit ** 0.5 + 1.5)
 
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = window_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        start_index = n2 - rem_2n
 
        counter2 = start_index
        while counter2 < WIN_SIZE:
            is_prime[counter2] = 0
            counter2 += n2
 
# Generator:
    for m in range(1, WIN_SIZE, 2):
        if is_prime[m]:
            n = window_start + m
            yield (n)
 
    return

 

The effect of this change was fairly dramatic. This was the result for 10**12:

50 loops, best of 5: 6.22 msec per loop

Which means that the Cython runtime is now 1.7% of the original Python runtime. Or, more meaningfully, the Cython code is now over 57 times faster than the Python Numpy version. And this for an algorithm that had already been tuned for speed in Python with Numpy binary masking.

A few of things to note. First, we no longer need to import Numpy with this code. Second, another way would have been to keep Numpy and use memory views. This would have been the normal way to do things, and would also have yielded near-C speeds, while retaining Numpy’s strengths in areas such as image handling and matrix mathematics. In other words, Cython knows all about Numpy, and you can keep using your Numpy code. Third, this is the quick and easy way to dimension a C array with a constant. If you want to dimension an array with a variable, you’ll need to use the C function malloc(). Or use Numpy to dimension your arrays with variables, and use memory views.

I. Using unsigned longs for the integers

This gave another 12% boost in the speed:

50 loops, best of 5: 5.1 msec per loop

Which means that the Cython code is now running 69 times faster than Python.

The HTML file reveals that almost every line of the function is now being converted directly into C:

After the Cython mods

 

J. Converting The Last Python Code To C

Apart from the Python generator yield, the HTML file above reveals that there is still one line of Python left to convert to C: the modulus (remainder) calculation line. The reason this has not been converted directly into C is that Python has a thing it does with all divisions: it checks if the denominator isn’t zero, to avoid an overflow error. This is something that C does not do. So the quick way to turn this off (if you know it’s safe to do so) is to tell Cython via a compiler directive. One form of this is a decorator, around any function where you don’t want it to happen:

cimport cython

@cython.cdivision(True)
def primes_in_window_below(long limit):

This gave a further improvement in execution time:

50 loops, best of 5: 4.53 msec per loop

Which means that the code is now running 78 times faster than the original, Python-interpreted, Numpy masking version.

Possible Further Cython Mods

Here are just a few of the many more Cython modifications you could make:

  • Using C++ vector type for the is_prime array.
  • Using C memory views of the original Numpy array, as a speed comparison to C array used. Would probably be almost as fast.
  • Splitting the offset masking off into its own sub-function to separate it from the Python-only generator section, making the array masking section a potentially faster C function. Probably would not make much difference, as it is called only once.
  • However, if the offset masking is hived off into its own C function, then using Cython to turn off the Python GIL for this function would then be an option. This would then take advantage of any additional cores you have on your system, multiplying the speed gains you’ve achieved with Cython.
  • Increasing the maximum limit to 2^128 by importing the C header for 128-bit integers.

Conclusion

The exercise shows that Cython offers a great way of speeding up your Python apps. With only a few modifications, the Python code above was made to run 78 times faster than Numpy binary masking. Remember, all we have done is to add C static data types – we have not even touched the more complex area of using Cython to import C functions.

With parallel processing, after restructuring your code the best you can hope for is a linear increase in speed for every CPU you add. By re-typing only your speed-critical functions using Cython, your code might end up running over 1,000 times faster at near-C speeds on a single core, gains which can be achieved without having to rewrite all your code in C++ or Java. If you’re a company looking to write your next speed-critical app, Python+Cython will give you the best of both worlds – you can write most of your high-level code in Python, and set your Cython coders on to the parts that need to be fast.

After reading this, hopefully you’ll understand what it means when someone says that Python is slow. It means that they don’t know what they’re talking about. And that they’ve never used Cython.

2 Replies to “From Python To Cython”

Leave a Reply

Your email address will not be published. Required fields are marked *