With the impending demise of Moore’s Law, multiple cores are a common manufacturers’ workaround for improving hardware performance, whether or not your installed apps can use the parallel architecture.
And with each new release of Python, parallel programming gets even easier. But the degree to which your code can use your multiple cores will depend on the kind of problem you are trying to solve, on the implementation of Python you are running and, as it turns out, on how truly parallel the underlying architecture of your system actually is.
The goal of this series of posts is to see how well some of my existing code can be adapted to take advantage of multi-core hardware, what changes need to be made to scale it, and what performance improvements result from the exercise.
Options For Running Python In Parallel
There are many ways of running Python on multiple cores. The options available to you depend on what implementation of Python you are using. You can use Cython, run IronPython, write Java extensions for Jython or install PyPy. Python code can even be scaled to thousands of cores, using pyfora, as discussed on the Talk Python podcast.
This post is not, however, an exhaustive survey of the different ways of running Python in parallel. Instead, it looks at how to implement parallel processing using standard Python (CPython). Most Python users on Windows, Mac and Linux are actually already running CPython, which allows a form of parallel processing using the built-in multiprocessing module, accessed via the higher-level concurrent.futures module. concurrent.futures was introduced in Python 3.2 and has since been backported as far as Python 2.5. This will be the module we will be using in this post.
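If you're not sure which implementation you're running, the standard library can tell you (a minimal check):

import platform
import sys

print(platform.python_implementation())   # 'CPython' for standard Python
print(sys.version)                        # interpreter version details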
Criticism of This Approach
I once heard a Python expert express his loathing for the multiprocessing module’s implementation of parallel Python, mainly because Python reaches out into the operating system and clones its own running process, in order to bypass Python’s global interpreter lock, or GIL. The argument is that Python should stay in its box and not start controlling the OS. Python already interacts with the OS when you use IO threads (such as in web utilities) to avoid halting the program flow, and releases the GIL to do so. But the multiprocessing module creates a clone of its own Python process, with the GIL still active inside each cloned process. So you can see why some CompSci purists would have a problem with it.
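For example, threads that are blocked on IO run concurrently precisely because the interpreter releases the GIL while they wait. A minimal sketch, with time.sleep() standing in for a blocking network call:

import threading
import time

def fetch(n):
    time.sleep(1)   # stand-in for blocking IO; the GIL is released during the wait
    print("request", n, "done")

threads = [threading.Thread(target=fetch, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()        # all four finish in about 1 second, not 4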
And in the absence of any imminent change to how tightly Python controls its GIL, and short of rolling up your sleeves and writing C or Java extensions, this would seem to be the simplest way for standard Python users to go parallel.
Adapting Your Code To Run On Multiple Cores
Whichever way you choose to go, the key to making parallel programming work is identifying which parts of your code will benefit from parallel execution, and optimising your code structure to take advantage of your multiple cores. Each CPU process must be given its own safe variables to work on, so that it neither holds up another core’s process by locking access to some shared variable or array, nor, worse, modifies it at the same time. If you’re manipulating a large array, it must either be broken up and the parts passed to each CPU, or the parts created inside each process.
If it’s an image you’re working on, you have to decide whether to hand it to your cores block-by-block, line-by-line or pixel-by-pixel. The optimum chunk size is largely a case of trial and error, as you find the balance between the admin overhead of parallel processing and the useful work each core gets to do.
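As an illustration of the block-by-block option, Numpy can cut an image into row blocks for the workers (a sketch; the image and worker count here are made up):

import numpy as np

image = np.zeros((2000, 3000), dtype=np.uint8)   # hypothetical greyscale image

# One block per worker is the coarsest split; smaller blocks add scheduling
# overhead but can balance the load better:
num_workers = 4
row_blocks = np.array_split(image, num_workers, axis=0)   # four 500-row blocks
print([block.shape for block in row_blocks])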
Going Parallel With concurrent.futures
A common code structure using the concurrent.futures module is sketched below. The letter markers in the comments are referenced in the notes that follow; your_function, your_data and do_something_with are placeholder names:
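# A minimal, runnable sketch of the pattern:
import os
from concurrent import futures

def do_something_with(data):
    return data * 2                             # stand-in for your real computation

def your_function(loop_var, your_data):                          # (B)
    this_result = do_something_with(your_data)
    return (loop_var, this_result)                               # (C) tag the result

def main():
    your_dataset = [10, 20, 30, 40]             # stand-in input data
    workers = os.cpu_count()                                     # (A)
    numbered_results = []                                        # (E)
    with futures.ProcessPoolExecutor(workers) as executor:       # (D, F)
        jobs = [executor.submit(your_function, loop_var, data)   # (G)
                for loop_var, data in enumerate(your_dataset)]
        for future in futures.as_completed(jobs):                # (J)
            numbered_results.append(future.result())             # (K)
    numbered_results.sort()                                      # (L)
    return [result for _, result in numbered_results]

if __name__ == '__main__':
    print(main())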
The important things to note are as follows:
- The Python shown sets the number of CPUs (workers) to the maximum available (A, D).
- The futures.ProcessPoolExecutor line is the CPU manager (F), and takes the same form in every program. Depending on how you want to write your code, the desired number of workers can be set to the maximum available, hard-coded, or taken from the command line.
- The first argument passed to executor.submit() (G) is the function you want to execute in parallel on each CPU. The remaining arguments are the variables you want to pass to this function, and should match the function definition (B).
- The key to using this module is understanding that there is no guarantee that the order in which you submit jobs to the CPUs (via executor.submit) is the order in which they complete. You therefore have to know which result is which when they return, which means tagging them. One way to do this is to return each result as a tuple whose first item is a marker tag, in this case the variable loop_var (C). In the example above, the results are collected using futures.as_completed (J) inside the executor block (F), and so arrive from your_function() as (0, this_result), (1, this_result), (2, this_result), etc., in any order, as each CPU completes its job. As the results come back, the trick is to store them in a list, which in the example above is called numbered_results[] (E, K).
- So the last thing to do with the results before using them is to sort them by loop_var back into the correct order (L).
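Since each result carries its loop_var tag as the first item of a tuple, a plain sorted() call is enough to restore submission order, for example:

numbered_results = [(2, 'c'), (0, 'a'), (1, 'b')]    # arrival order
numbered_results = sorted(numbered_results)          # [(0, 'a'), (1, 'b'), (2, 'c')]
results = [res for _, res in numbered_results]       # ['a', 'b', 'c']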
First Attempt
To kick things off with real code, I’ve taken some single-core code from a previous post for calculating large prime numbers and converted it to run on multiple cores. The code runs an offset odd sieve, a computationally intensive algorithm that makes extensive use of Numpy binary array masking.
To achieve parallel computation with this code, I’ve divided up the task of masking the 1000-item Numpy array between the N cores being used. Thus, for 4-core calculation, each CPU will be handed a 250-item array to perform millions of Numpy binary masking operations on. This version sets up one 1000-item Numpy array before the parallel loop and sends slices of it to each core.
Here is a first-pass parallel version of the code, with the sieve inside the function del_non_primes():
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Program description: runs an offset odd sieve inside a small, 1000-wide window of
numbers just below a large, user-defined limit. Uses a binary Numpy array for
the integer window and uses parallel processing to mask slices of the array on
different cores.

Created 20 Nov, 2018
Author: matta_idlecoder at protonmail dot com
"""
import timeit
import datetime
import numpy as np
from concurrent import futures

timing_it = True
listing_primes = True


def init_prime_array(maximum, prime_window_size):
    """Returns a Boolean numpy array initialised for eliminating primes,
    plus the start of the small window just under the maximum.
    """
    maximum += 1 if (maximum % 2) else 0   # make maximum even
    LastArray = np.ones(prime_window_size, dtype=bool)
    LastArray[0::2] = False                # mask out the evens
    win_start = maximum - prime_window_size
    return LastArray, win_start


def del_non_primes(sieved_array, root_limit, array_start):
    """Mask out non-primes from the numbers in the array
    """
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = array_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        # index of the first odd multiple of n at or after array_start:
        start_index = n2 - rem_2n
        sieved_array[start_index::n2] = False
    # Note how the function returns array_start as a marker for which
    # slice this is. This is needed to sort the results as they come back:
    return array_start, sieved_array


def primes_in_window_below(limit, prime_window_size=1000):
    """An offset odd sieve for finding primes, using a Generator
    """
    # creates a blank boolean array with all the evens preset to False:
    is_prime, window_start = init_prime_array(limit, prime_window_size)
    root_limit = int(limit ** 0.5 + 1.5)

    # prep to slice up the problem into the number of cores you want to use:
    AVAIL_CORES = 2   # os.cpu_count() # Typically pseudo-4 on many laptops
    stride = int(prime_window_size / AVAIL_CORES)

    # This code creates a list of slice_pts, e.g. for 4 CPUs and a list 20
    # long, it would create slice_pts of [(0, 5), (5, 10), (10, 15), (15, 20)]:
    slices = list(zip(range(0, prime_window_size - stride + 1, stride),
                      range(stride, prime_window_size + 1, stride)))

    # Use the slice points to cut the Numpy array into one slice per core:
    is_prime_slices = [is_prime[slice_pt[0]:slice_pt[1]] for slice_pt in slices]

    # parallel bit:
    this_array_start, numbered_results = window_start, []
    with futures.ProcessPoolExecutor(AVAIL_CORES) as executor:
        result = (executor.submit(del_non_primes, array_slice, root_limit,
                                  this_array_start + (slice_num * stride))
                  for slice_num, array_slice in enumerate(is_prime_slices))
        for future in futures.as_completed(result):
            res = future.result()
            numbered_results.append(res)

    # Create an array for the results:
    is_prime_arrays = np.zeros([AVAIL_CORES, stride], dtype=bool)

    # sort the results into order by their first item (= array_start):
    numbered_results = sorted(numbered_results)
    for index, result in enumerate(numbered_results):
        is_prime_arrays[index] = result[1]

    # Generator: calculate each prime from its array position:
    for core_result in range(AVAIL_CORES):
        for array_index in range(1, stride, 2):
            try:
                if is_prime_arrays[core_result][array_index]:
                    n = window_start + (core_result * stride) + array_index
                    yield n
            except IndexError:
                break
    return


def main():
    PRIME_WINDOW_SIZE = 1000
    upper_limit = 1
    question = "\nNear what power of 10 do you want to find the primes?\n"
    question += "Note: it will take about 3 times longer for each additional\n"
    question += "power of 10. Enter an integer greater than 6: "
    while upper_limit < 10**6:
        try:
            ten_power = int(input(question))
            upper_limit = 10**ten_power
        except ValueError:
            upper_limit = 1

    print("Searching for the primes in the last {:,} numbers up to {:.1e}...".
          format(PRIME_WINDOW_SIZE, upper_limit))
    if timing_it:
        print("Starting the search at",
              datetime.datetime.now().strftime("%H:%M:%S"))
        start_time = timeit.default_timer()

    prime_set = list(primes_in_window_below(upper_limit,
                                            prime_window_size=PRIME_WINDOW_SIZE))
    if not prime_set:
        print("There were no primes found in the last 1,000 numbers below {:,}".
              format(upper_limit))
    elif listing_primes:
        print("\nThe last 20 primes are:\n")
        for prime1, prime2 in zip(prime_set[-20:][::2], prime_set[-20:][1::2]):
            print("{:<35,}{:<,}".format(prime1, prime2))
    else:
        prime_max = max(prime_set)
        print("The largest prime found was {:,d}".format(prime_max))

    if timing_it:
        prime_time = timeit.default_timer() - start_time
        days, hours = divmod(prime_time, 86400)
        hours, mins = divmod(hours, 3600)
        mins, secs = divmod(mins, 60)
        print('\nCompleted the search in {:,} days {:,} hours {:,} mins '
              '{:0,.1f} seconds.'.format(int(days), int(hours), int(mins), secs))
    return


if __name__ == '__main__':
    main()
First Pass Results
The results from running the offset sieve to 1.0e15 on 1 CPU are shown below:
Near what power of 10 do you want to find the primes?
Note: it will take about 3 times longer for each additional
power of 10. Enter an integer greater than 6: 15

Searching for the primes in the last 1,000 numbers up to 1.0e+15...
Starting the search at 15:36:07

The last 20 primes are:

999,999,999,999,103                999,999,999,999,109
999,999,999,999,163                999,999,999,999,199
999,999,999,999,283                999,999,999,999,353
999,999,999,999,389                999,999,999,999,409
999,999,999,999,491                999,999,999,999,521
999,999,999,999,571                999,999,999,999,577
999,999,999,999,643                999,999,999,999,659
999,999,999,999,809                999,999,999,999,827
999,999,999,999,877                999,999,999,999,883
999,999,999,999,947                999,999,999,999,989

Completed the search in 0 days 0 hours 0 mins 12.9 seconds.
Running my prime sieve on one core. Note how the user input is now a bit simpler. It’s not as flexible, but I got tired of tapping in all the zeros to test the larger limits.
For comparison, here are the results from running on 1, 2, 3 and 4 cores:
| # CPUs | Average Execution Time |
|--------|------------------------|
| 1      | 13.0 secs              |
| 2      | 13.2 secs              |
| 3      | 20.5 secs              |
| 4      | 28.5 secs              |
Note that there was no performance improvement from running on 2 cores, and that performance actually degrades going to 3 and 4 CPUs. The degradation is explained by the fact that the underlying architecture of my desktop is not true quad-core, but two physical cores emulating a quad (most likely via hyperthreading). Maybe it’s time to upgrade my hardware.
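Incidentally, os.cpu_count() reports logical CPUs, not physical cores. If you want to check what your own machine really has, the third-party psutil package can report both (a minimal check, assuming psutil is installed):

import os
import psutil   # third-party: pip install psutil

print("Logical CPUs:  ", os.cpu_count())
print("Physical cores:", psutil.cpu_count(logical=False))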
So, ignoring the results for 3 and 4 CPUs, the question was why I’d gained no advantage by running this code on two CPUs. There are three possible causes:
- The computational chunks being passed to each CPU are the wrong size,
- The code has been poorly adapted to run on multiple cores, or
- My algorithm is poorly suited to running on multiple cores.
To eliminate option (1), the prime window being masked was changed from 1,000 to 10,000, then 100,000, 1 million and finally 10 million, with no change in the relative performance of single- and dual-core execution. Note that this ruled out option (1), but said nothing about options (2) or (3).
Second Attempt
Next, using a strategy from my previous prime sieve post, the code was changed to avoid passing slices of the same list to each core process. The goal was to pass only independent variables to the CPU processes, which my function del_non_primes() would then use to build and initialize the prime elimination arrays from scratch. If this worked, it would mean that option (2) above had been the problem. If not, then option (3) would be the only conclusion. Here is the new version of the code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Program description: runs an offset odd sieve inside a small, 1000-wide window of
numbers just below a large, user-defined limit. Uses a binary Numpy array for
the integer window and uses parallel processing to mask slices of the array on
different cores.

Created 20 Nov, 2018
Author: matta_idlecoder at protonmail dot com
"""
import timeit
import datetime
import numpy as np
from concurrent import futures

timing_it = True
listing_primes = True


def init_prime_array(prime_window_size):
    """Returns a Boolean numpy array, initialised with the evens eliminated
    """
    LastArray = np.ones(prime_window_size, dtype=bool)
    LastArray[0::2] = False
    return LastArray


def del_non_primes(CPU_index, num_cores, window_size, limit):
    """Initialize the prime array slice and mask out non-primes

    The array initialization lines have been moved here, so that only simple
    variables are passed in from the futures.ProcessPoolExecutor, rather than
    refs to different sections of the same list.
    """
    array_start = limit - window_size + CPU_index * int(window_size / num_cores)
    is_prime = init_prime_array(int(window_size / num_cores))
    root_limit = int(limit ** 0.5 + 1.5)
    for n in range(3, root_limit, 2):
        n2 = n * 2
        rem_2n = array_start % n2
        rem_2n += n if (rem_2n <= n) else -n
        # index of the first odd multiple of n at or after array_start:
        start_index = n2 - rem_2n
        is_prime[start_index::n2] = False
    return array_start, is_prime


def primes_in_window_below(limit, prime_window_size=1000):
    """Pseudo-sieve of Eratosthenes, using a Generator
    """
    AVAIL_CORES = 2   # os.cpu_count() # Typically pseudo-4 on many laptops
    stride = int(prime_window_size / AVAIL_CORES)
    window_start = limit - prime_window_size
    numbered_results = []

    with futures.ProcessPoolExecutor(AVAIL_CORES) as executor:
        result = (executor.submit(del_non_primes, core_index, AVAIL_CORES,
                                  prime_window_size, limit)
                  for core_index in range(AVAIL_CORES))
        for future in futures.as_completed(result):
            res = future.result()
            numbered_results.append(res)

    # sort the results into order by their first item (= array_start):
    is_prime_arrays = np.zeros([AVAIL_CORES, stride], dtype=bool)
    numbered_results = sorted(numbered_results)
    for index, result in enumerate(numbered_results):
        is_prime_arrays[index] = result[1]

    # Generator: calculate each prime from its array position:
    for core_result in range(AVAIL_CORES):
        for array_index in range(1, stride, 2):
            try:
                if is_prime_arrays[core_result][array_index]:
                    n = window_start + (core_result * stride) + array_index
                    yield n
            except IndexError:
                break
    return


def main():
    PRIME_WINDOW_SIZE = 1000
    upper_limit = 1
    question = "\nNear what power of 10 do you want to find the primes?\n"
    question += "Note: it will take about 3 times longer for each additional\n"
    question += "power of 10. Enter an integer greater than 6: "
    while upper_limit < 10**6:
        try:
            ten_power = int(input(question))
            upper_limit = 10**ten_power
        except ValueError:
            upper_limit = 1

    print("Searching for the primes in the last {:,} numbers up to {:.1e}...".
          format(PRIME_WINDOW_SIZE, upper_limit))
    if timing_it:
        print("Starting the search at",
              datetime.datetime.now().strftime("%H:%M:%S"))
        start_time = timeit.default_timer()

    prime_set = list(primes_in_window_below(upper_limit,
                                            prime_window_size=PRIME_WINDOW_SIZE))
    if not prime_set:
        print("There were no primes found in the last 1,000 numbers below {:,}".
              format(upper_limit))
    elif listing_primes:
        print("\nThe last 20 primes are:\n")
        for prime1, prime2 in zip(prime_set[-20:][::2], prime_set[-20:][1::2]):
            print("{:<35,}{:<,}".format(prime1, prime2))
    else:
        prime_max = max(prime_set)
        print("The largest prime found was {:,d}".format(prime_max))

    if timing_it:
        prime_time = timeit.default_timer() - start_time
        days, hours = divmod(prime_time, 86400)
        hours, mins = divmod(hours, 3600)
        mins, secs = divmod(mins, 60)
        print('\nCompleted the search in {:,} days {:,} hours {:,} mins '
              '{:0,.1f} seconds.'.format(int(days), int(hours), int(mins), secs))
    return


if __name__ == '__main__':
    main()
Second Pass Results
The results for 1.0e15 on 1 and 2 cores are summarized below:
| # CPUs | Average Execution Time |
|--------|------------------------|
| 1      | 12.7 secs              |
| 2      | 13.7 secs              |
Given the typical margin of error, it seems that building and initializing the prime elimination arrays from scratch within each parallel process has given no real improvement: the 2-core run is still no faster than the single core, and is if anything marginally slower.
Conclusion
This has not been an entirely fruitless exercise. The finding is a useful one: algorithms that rely heavily on Numpy masking of simple binary arrays are poorly suited to multi-core implementation with the concurrent.futures module. The masking itself is so fast that the fixed costs of parallelism, spawning the worker processes and pickling the arguments and results that pass between them, most likely swallow any time saved.
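One way to see those fixed costs directly is to time a pool of do-nothing jobs (a rough sketch; the figures will vary with OS and Python version):

import timeit
from concurrent import futures

def no_op(x):
    return x

def pooled_run():
    # spin up 2 workers, push 100 trivial jobs through them, shut down:
    with futures.ProcessPoolExecutor(2) as executor:
        list(executor.map(no_op, range(100)))

if __name__ == '__main__':
    # the time measured is almost pure process-spawning and pickling overhead:
    print(timeit.timeit(pooled_run, number=5) / 5, "secs per pool run")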
Next
Next, in the search for algorithms better suited to multi-core processing, I’ll be testing concurrent.futures on a computationally intensive image processing problem. And in future posts, I’ll be looking at other ways of scaling Python through alternative parallel implementations.