Parallel Python – 2: Image Transforms

This post continues my investigation into running standard Python on multiple cores using the concurrent.futures module. This time, I will be taking some image processing code from a previous post on the transverse Mercator transform and attempting to scale it to multiple CPUs.

This kind of image transform is at the low end of image processing complexity. There is no edge detection or shape recognition happening, just building an output image line by line from an input image's pixels. The mathematics involves a few lines of trigonometry using NumPy to calculate each transformed pixel position.
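
For reference, the per-pixel calculation in the code below (inv_merc_lat(), inv_merc_long() and inv_transform_pixel()) comes down to the following equations, written out here from the code rather than taken from the previous post. φ and λ are the original latitude and longitude, x' is the output pixel's column scaled to the range -π to π, y' is the value the code derives from the line number (trans_lat), W and H are the image width and height, and int() truncates towards zero as Python's int() does:

```latex
\varphi = -\arcsin\!\left(\frac{\sin x'}{\cosh y'}\right), \qquad
\lambda = \arctan\!\left(\frac{\sinh y'}{\cos x'}\right) +
\begin{cases}
  +\pi & \text{if } |x'| > \pi/2 \text{ and } y' \ge 0,\\
  -\pi & \text{if } |x'| > \pi/2 \text{ and } y' < 0,\\
  0    & \text{otherwise,}
\end{cases}
\\[1ex]
x_{\text{in}} = \operatorname{int}\!\left(\frac{\lambda + \pi}{2\pi}\,W\right), \qquad
y_{\text{in}} = \operatorname{int}\!\left(\frac{\pi/2 - \varphi}{\pi}\,H\right)
```

The code also rounds intermediate values to 2 or 3 decimal places, and clamps an x_in of W or a y_in of H to W - 1 or H - 1 so that every lookup stays inside the input image.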

If I succeed in getting it to scale correctly, the goal will be to find the optimum granularity and type of calculation to send to each core, i.e. the kind of data structures to use and whether to calculate the results pixel-by-pixel, line-by-line or in multiple lines at a time for the fastest calculation time.

The Original Code

Here is the code from the original Mercator transform post:

 

The initial input maps are low-resolution (720×360) and higher-resolution (2k x 1k) PNG files of a Mercator projection extending to 85° N & S of the equator:

Mercator map

When the unchanged transform code from the previous post is run, this is the result: a Mercator projection of the Earth, but with the map's projection poles shifted to the Equator, at 90° E and W of Greenwich. In this projection, Central and South America seem a little on the large side, but Greenland is much closer to its actual size and position, at least relative to Europe and North America:

The time for running this as a normal, single-threaded program on one CPU on my desktop is about 6.5 seconds.

Making It Parallel

The concurrent.futures code structure described in the previous post will be adapted for parallel image processing as follows:

  1. your_function (def at B, called at G) is your highest-level image transform function, acting on whatever level of image data you want to give it. Remember, the first argument passed to executor.submit() is the function object itself that you want each worker process to run (not its name as a string); the argument given to ProcessPoolExecutor is simply the number of worker processes.
  2. After the function come the arguments (G) you are passing to it, and the FOR loop you want to execute (H) to hand out work to the cores. The loop is whatever you want it to be: for each pixel, line, block, whatever. The trick with image data is going to be avoiding data lock between the cores as each CPU process performs its calculation. For example, different cores cannot work on the same array: each worker process receives its own pickled copy of its arguments, so any changes it makes to an array are not seen by the parent process or by the other workers. To avoid this, in this exercise the ProcessPoolExecutor will merely be given pixel position calculations. The actual image data to which the results correspond will be handled later, outside the parallel loop.
  3. The FOR loop at line J collects the results from the core processes, in whatever order they are completed. Within this FOR loop, line K stores the results in a list for later sorting.
  4. Once sorted, the list numbered_results (L) is the data used to construct the output image. In this case, since we're just moving pixels around, it's simply a lookup table giving the correspondence between the pixel positions of the two images, from which the output image can be built using the input image's pixels.
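
The four steps above can be put together into a minimal, self-contained sketch. The names your_function, collect_in_order and run_parallel are hypothetical, and the squaring task is only a placeholder for a real image calculation; the structure (submit tagged tasks, collect them with futures.as_completed(), then sort by tag) follows the description above.

```python
from concurrent import futures


def your_function(tag, value):
    """Hypothetical task (step 1). It returns its tag along with the result,
    so the caller can tell which input each result belongs to."""
    return tag, value * value


def collect_in_order(tagged_results):
    """Step 4: sort the (tag, result) pairs by tag, then drop the tags."""
    return [result for _, result in sorted(tagged_results)]


def run_parallel(values, workers=2):
    """Steps 2 and 3: submit one task per value, gather results as they finish."""
    tagged_results = []
    with futures.ProcessPoolExecutor(max_workers=workers) as executor:
        # The function object comes first in submit(), then its arguments.
        pending = (executor.submit(your_function, tag, value)
                   for tag, value in enumerate(values))
        # as_completed() consumes the generator (submitting every task),
        # then yields each future as soon as its result is ready.
        for future in futures.as_completed(pending):
            tagged_results.append(future.result())
    return collect_in_order(tagged_results)


# Usage. The __main__ guard is needed on platforms that start worker
# processes by re-importing this module, such as Windows and macOS:
#
# if __name__ == "__main__":
#     print(run_parallel([3, 1, 4, 1, 5]))   # [9, 1, 16, 1, 25]
```

Whatever order the tasks finish in, sorting on the tag puts the results back in input order, which is the same trick the programs below rely on.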

Calculating Individual Pixels On Each Core

To give me a baseline for the parallel processing overhead, I began with the finest level of granularity – giving 2 CPUs each the trivial task of calculating the transverse Mercator transform for one pixel at a time, just as the original code had done. Here is the code to do this:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
{0}: redraws a given standard Mercator projection world map with the poles
moved to 90E & 90W at the equator. Uses the concurrent.futures parallel 
processing module to calculate the map transform pixel by pixel.
 
Equations are explained at:
en.wikipedia.org/wiki/
Transverse_Mercator_projection#Inverse_transformation_formulae
 
The following flags are optional:
    -i input_map
        The name of the input map filename. No extension required: program will
        auto-detect the image type. With no -i flag, it will look for an
        input file called WorldMap.png
 
    -o output_map
        The name you want to give your transformed output map. With no -o flag,
        it will output to a file called ShiftedPoleMap.jpg
        Requires an image type suffix, which can be any of the standard image
        file types (.gif .bmp .jpg .jpeg .ppm .png .tif or .tiff).
        Defaults to a .jpg output if none given.
 
    -d
        don't display maps when running the program
 
    -h
        Prints this __doc__ string
 
    -t
        Time the program execution
 
Example command line:
    python3 ./{0} -i myMap.png -o myNewMap.png -t
 
Or, if you cd to the same folder and make this file executable:
    ./{0} -i myMap.png -o myNewMap.png -t
 
Created 15 Dec, 2018
Author: matta_idlecoder at protonmail dot com
"""
 
from PIL import Image
import numpy as np
import sys
import getopt
import timeit
from concurrent import futures
import os
 
displaying_images = True
timing_it = False
 
def inv_merc_lat(translat_in_rads, translong_in_rads):
    """Converts a transverse mercator latitude to its original latitude.
    """
    orig_lat_rad = -np.arcsin((np.sin(translong_in_rads)) / (np.cosh(translat_in_rads)))
    return orig_lat_rad
 
 
def inv_merc_long(trans_lat_rads, trans_long_rads):
    """Converts a transverse mercator longitude to its original longitude.
    """
    orig_long_rads = np.arctan(np.sinh(trans_lat_rads) / np.cos(trans_long_rads))
    if (trans_long_rads > np.pi / 2) or (trans_long_rads < -np.pi / 2):
        if trans_lat_rads >= 0:
            orig_long_rads += np.pi
        elif trans_lat_rads < 0:
            orig_long_rads -= np.pi
    return orig_long_rads
 
 
def inv_transform_pixel(map_line_num, x_pix_num, x_max, y_max, transformed_lat):
    """Calculates the new pixel position in the transformed image
    """
    trans_longitude = round((float(x_pix_num) / x_max * 2 * np.pi) - np.pi, 2)
    orig_latitude = round(inv_merc_lat(transformed_lat, trans_longitude), 3)
    orig_longitude = round(inv_merc_long(transformed_lat, trans_longitude), 3)
    original_x = int((orig_longitude + np.pi) / (2 * np.pi) * x_max)
    original_y = int((orig_latitude - np.pi / 2) / (-np.pi) * y_max)
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
 
    return map_line_num, x_pix_num, original_y, original_x
 
 
def inv_transform_map(map_input_name, output_map):
    """Builds a new transformed map image line by line, pixel by pixel
    """
    workers = 4
    try:
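        # Convert to RGB so getpixel() returns an (R, G, B) tuple even for
        # palette or greyscale input images:
        input_image = Image.open(map_input_name).convert('RGB')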
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
    except FileNotFoundError:
        print("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
        sys.exit(2)
 
    print("\nRunning the map transform...\n")
 
    pixel_result_list = []
 
    for map_line in range(Y_PIC_SIZE):  # for each line
        y_norm = float(map_line) / Y_PIC_SIZE
        trans_lat = round(y_norm * (-35 * np.pi / 18) + (17 * np.pi / 18), 2)
 
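        # Note: this creates a new process pool for every line of the image,
        # which adds start-up cost on top of the per-pixel task overhead:
        with futures.ProcessPoolExecutor(workers) as executor: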
            result = (executor.submit(inv_transform_pixel, map_line, x_coord,
                                      X_PIC_SIZE, Y_PIC_SIZE, trans_lat)
                        for x_coord in range(X_PIC_SIZE))
 
            # save the results, in whatever order they are completed:
            for future in futures.as_completed(result):
                res = future.result()
                pixel_result_list.append(
                    ([res[0], res[1]], [res[2], res[3]]))
 
    # Create transformed image data:
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
    for pixel_mapping in sorted(pixel_result_list):
        # The [:3] is to read only the RGB input image data if it's RGBA:
        output_image_data[pixel_mapping[0][0], pixel_mapping[0][1]] = \
                    input_image.getpixel(tuple(pixel_mapping[1][::-1]))[:3]
 
    # Create transformed image:
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
    if not output_map:
        output_map = "ShiftedPoleMap.png"
    output_image.save(output_map)
    return
 
 
def usage(prog_name):
    """Prints a short command line guide, using whatever filename this prog has.
 
    """
    print("""\nUsage: from the command line type:
    {} [-i input_filename] [-o output_filename.ext] [-dht]
 
    If you don't want to use flags, give your input file the name WorldMap.png
    and your output filename will be given the name ShiftedPoleMap.jpg
 
    Use the -t flag to time your map transformation.
 
    Use the -h flag for a detailed flag listing.
 
    """.format(prog_name))
    return
 
 
def main(argv):
    global displaying_images, timing_it
 
    input_map_name = 'WorldMap.png'
    output_map_name = 'ShiftedPoleMap.jpg'
 
    try:
        opts, args = getopt.getopt(argv, "i:o:dht")
 
    except getopt.GetoptError as err:
        print("\n", str(err))
        usage(sys.argv[0])
        sys.exit(2)
 
    for opt, arg in opts:
        if opt == '-d':
            displaying_images = False
        elif opt == '-i':
            input_map_name = arg
        elif opt == '-o':
            output_map_name = arg
            ext = os.path.splitext(output_map_name)[-1].lower()
            if ext not in ".gif .bmp .jpg .jpeg .ppm .png .tif .tiff".split():
                output_map_name += '.jpg'  # smallest file format
                print('Unknown output file extension. Creating a JPEG file.')
        elif opt == '-h':
            print("\n", __doc__.format(os.path.basename(sys.argv[0])))
            sys.exit(2)
        elif opt == '-t':
            timing_it = True
        else:
            assert False, "unhandled option"
 
    if timing_it:
        start = timeit.default_timer()
 
    inv_transform_map(input_map_name, output_map_name)
 
    if timing_it:
        stop = timeit.default_timer()
        print('\nFinished the map transform in', round((stop - start), 2),
              'seconds.\n')
    return
 
 
if __name__ == '__main__':
    main(sys.argv[1:])

 

This was clearly too fine a granularity per core, but it was a revealing exercise. Calculating the individual pixel positions in separate worker processes pushed the execution time from 6.7 secs to around 90 secs. The fact that the times were the same for 1 and 2 cores suggested that, although both CPUs were busy, almost none of that effort was going into useful parallel work: with tasks this small, most of the time was probably spent sending each pixel to a worker process and collecting its result, rather than on the calculation itself. That would also explain why both times were so much slower than running the original code on 1 CPU. Still, it was a good place to start, since I had given each CPU almost nothing to do. If I now gave incrementally larger tasks to each CPU, things should only get faster.
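
Batching by hand, as the next section does, is one way to cut this overhead. ProcessPoolExecutor.map() also has a built-in option: its chunksize argument groups items from the input iterable into batches, each of which is sent to a worker process in a single round trip. The sketch below was not part of the timed experiments; pixel_longitude() and map_pixels() are hypothetical stand-ins for the per-pixel transform, and the best chunksize would still have to be found by measurement.

```python
from concurrent import futures
from functools import partial
import math


def pixel_longitude(x_pix_num, x_max):
    """Hypothetical stand-in for a cheap per-pixel calculation: the
    longitude, in radians, of pixel column x_pix_num (from -pi up to pi)."""
    return x_pix_num / x_max * 2 * math.pi - math.pi


def map_pixels(x_max, workers=2, chunksize=1):
    """Apply pixel_longitude() to every column of one x_max-pixel line."""
    # functools.partial of a module-level function can be pickled and sent
    # to worker processes (a lambda cannot).
    task = partial(pixel_longitude, x_max=x_max)
    with futures.ProcessPoolExecutor(max_workers=workers) as executor:
        # map() yields results in input order, so no tagging or sorting is
        # needed. chunksize=1 means one round trip per pixel; larger values
        # send that many pixels to a worker at a time.
        return list(executor.map(task, range(x_max), chunksize=chunksize))


# Usage, from under an  if __name__ == "__main__":  guard:
#     per_pixel = map_pixels(720, chunksize=1)
#     batched = map_pixels(720, chunksize=180)   # 4 batches of 180 pixels
#     assert per_pixel == batched
```

The trade-off is the same one explored in this post: larger chunks mean fewer round trips between processes, but too few chunks leave some cores idle at the end.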


Calculating Multiple Lines On Each Core

The code was then modified so that it could run on any number of the available cores, with each task calculating up to 10 lines at a time. The results were a great improvement over calculating one pixel at a time:

Calculating:   On 1 Core:      On 2 Cores:
1 pixel        1 min 30 secs   1 min 31 secs
1 line         7.53 secs       4.65 secs
2 lines        7.55 secs       4.63 secs
3 lines        7.51 secs       4.66 secs
4 lines        7.56 secs       4.63 secs

These results show that:

  1. Scaling to 2 cores had been a success for this image transform.
  2. There had been a small price in converting the code to multi-core and then only running on one core (7.55s vs 6.7s), but there had been an appreciable improvement over both times by running it on 2 cores (4.65s).
  3. The code would probably scale to more cores.
  4. For a small image (720×360), the ideal calculation size per core for this transform was in the order of lines, rather than pixels.

 

Here is how these times changed for a much higher-resolution map of 2k x 1k pixels:

Calculating:   On 1 Core:   On 2 Cores:
1 line         52.3 secs    33.1 secs
2 lines        52.4 secs    33.0 secs
3 lines        52.1 secs    32.1 secs
4 lines        53.3 secs    34.1 secs

Again, doubling the number of cores cut the execution time by a little under 40% (a speedup of roughly 1.6×), and the number of lines per task made little difference. So for this particular image transform, at both image sizes tested, the optimum calculation for each core seems to be a few lines at a time.
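
If each task is going to be a whole line, the per-pixel Python loop inside inv_transform_line() (in the code below) could itself be replaced by NumPy operations on the whole line at once. The following is an untimed sketch under the hypothetical name inv_transform_line_vec(); it applies the same formulas as inv_merc_lat(), inv_merc_long() and inv_transform_pixel(), but uses np.round() where the original calls round() on Python floats, so an occasional pixel might land differently in rare rounding edge cases.

```python
import numpy as np


def inv_transform_line_vec(map_line_num, x_max, y_max):
    """Sketch of inv_transform_line() with the per-pixel loop replaced by
    NumPy array operations. Returns (map_line_num, mapping), where
    mapping[x] = (original_y, original_x), as in the original function."""
    # Same line-number-to-trans_lat mapping as inv_transform_line():
    y_norm = (map_line_num + 1) / y_max
    trans_lat = np.round(y_norm * (-35 * np.pi / 18) + (17.5 * np.pi / 18), 2)

    # Every pixel column on the line at once:
    x = np.arange(x_max)
    trans_long = np.round(x / x_max * 2 * np.pi - np.pi, 2)

    # inv_merc_lat() for the whole line:
    orig_lat = np.round(-np.arcsin(np.sin(trans_long) / np.cosh(trans_lat)), 3)

    # inv_merc_long() for the whole line, including the +/- pi correction
    # for columns more than pi/2 from the centre of the line:
    orig_long = np.arctan(np.sinh(trans_lat) / np.cos(trans_long))
    shift = np.pi if trans_lat >= 0 else -np.pi
    orig_long = np.where(np.abs(trans_long) > np.pi / 2,
                         orig_long + shift, orig_long)
    orig_long = np.round(orig_long, 3)

    # Back to pixel coordinates; astype() truncates towards zero like int():
    original_x = ((orig_long + np.pi) / (2 * np.pi) * x_max).astype(np.int64)
    original_y = ((orig_lat - np.pi / 2) / (-np.pi) * y_max).astype(np.int64)
    # Same rounding and index fix as inv_transform_pixel():
    original_x[original_x == x_max] -= 1
    original_y[original_y == y_max] -= 1

    mapping = np.stack([original_y, original_x], axis=1).astype(np.uint16)
    return map_line_num, mapping
```

Because it returns the same (line number, mapping) pair, it could stand in for inv_transform_line() inside inv_trans_mult_lines(), though whether that shifts the best number of lines per task would need measuring.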


The Code

The code that gave these results is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
{0}: redraws a given standard Mercator projection world map with the poles
moved to 90E & 90W at the equator. Runs on multiple CPU cores. Will run on any
image format or size of input map. Assumes maps stop at 85º N/S.
 
Equations are explained at:
en.wikipedia.org/wiki/Transverse_Mercator_projection#Inverse_transformation_formulae
 
The following flags are optional:
    -i input_map
        The name of the input map filename. No extension required: program will
        auto-detect the image type. With no -i flag, it will look for an
        input file called WorldMap.png
 
    -o output_map
        The name you want to give your transformed output map. With no -o flag,
        it will output to a file called ShiftedPoleMap.jpg
        Requires an image type suffix, which can be any of the standard image
        file types (.gif .bmp .jpg .jpeg .ppm .png .tif or .tiff).
        Defaults to a .jpg output if none given.
 
    -c NumCores
        The number of cores you want to run the program on.
        On this computer you can set it to a number between 1 and {1}.
        Default setting is 2.
 
    -n LinesPerCore
        The number of lines you want each CPU to process at a time.
        Must be between 1 and 10. Default is 2.
 
    -d
        don't display the input and output maps on the console
 
    -h
        Prints this __doc__ string
 
    -t
        Time the program execution
 
Example command line:
    python3 {0} -i myMap.png -o myNewMap.png -c4 -n4 -t
 
Or, if you cd to the same folder and make this file executable using chmod +x:
    ./{0} -i myMap.png -o myNewMap.png -c4 -n4 -t
 
Created 20 Dec, 2018
Author: matta_idlecoder at protonmail dot com
"""
 
from PIL import Image
import numpy as np
import sys
import getopt
import timeit
from concurrent import futures
import os
 
timing_it = False
displaying_images = True
profiling = False
 
 
def inv_merc_lat(translat_in_rads, translong_in_rads):
    """Calculates orig transverse mercator latitude for a given image coord
    """
    orig_lat_rad = -np.arcsin((np.sin(translong_in_rads)) / (np.cosh(translat_in_rads)))
    return orig_lat_rad
 
 
def inv_merc_long(trans_lat_rads, trans_long_rads):
    """Calculates the orig transverse mercator longitude for a given image coord
    """
    orig_long_rads = np.arctan(np.sinh(trans_lat_rads) / np.cos(trans_long_rads))
    if (trans_long_rads > np.pi/2) or (trans_long_rads < -np.pi / 2):
        if trans_lat_rads >= 0:
            orig_long_rads += np.pi
        elif trans_lat_rads < 0:
            orig_long_rads -= np.pi
    return orig_long_rads
 
 
def inv_transform_pixel(x_pix_num, x_max, y_max, transformed_lat):
    """Calculates the corresponding pixel position from the input image
 
    :param x_pix_num: pixel number on the output image
    :param x_max: x dimension of the input and output images
    :param y_max: y dimension of the input and output images
    :param transformed_lat: radians of longitude of output image line
    :return: the (y, x) input image coordinate to use for the output image
    coordinate given by x_pix_num and map_line_num (in inv_transform_line)
    """
    trans_longitude = round((float(x_pix_num) / x_max * 2 * np.pi) - np.pi, 2)
    orig_latitude = round(inv_merc_lat(transformed_lat, trans_longitude), 3)
    orig_longitude = round(inv_merc_long(transformed_lat, trans_longitude), 3)
 
    original_x = int((orig_longitude + np.pi) / (2 * np.pi) * x_max)
    original_y = int((orig_latitude - np.pi / 2) / (-np.pi) * y_max)
 
    # fixes potential rounding and index errors:
    original_x -= 1 if original_x == x_max else 0
    original_y -= 1 if original_y == y_max else 0
 
    return original_y, original_x
 
 
def inv_transform_line(map_line_num, x_max, y_max):
    """Finds the transformed image pixel pos, for each pixel in the given line
 
    :param map_line_num: The line number being transformed.
    :param x_max: The width of the input image, in pixels.
    :param y_max: The height of the input image, in lines.
    :return: (1) the map_line_num that was sent to it. This is a modification to
            the single threaded version, to tag the returned data, so that we
            know which line it is. The order that the ProcessPoolExecutor
            returns the processed lines is not guaranteed.
            (2) line_pixel_mapping, a mapping for every pixel on the line, to
            the corresponding pixel coordinates before the transformation. This is
            not an image line, but rather a pixel-to-pixel mapping that will
            then be unpacked to construct the transformed output image of pixel
            values, line by line, from the input image.
    """
    # image may be 1000s of pixels wide. Thus > 8 bit & < 64 bit, and unsigned:
    line_pixel_mapping = np.zeros((x_max, 2), dtype=np.uint16)
 
    # mapping of line # to a decimal between 0 and 1.0:
    y_norm = float(map_line_num+1) / y_max
 
    # maps lines top to bottom to +175 -> -175 deg (in radians)
    # line number has now been converted into radians of longitude:
    trans_lat = round((y_norm * (-35 * np.pi / 18)) + (17.5 * np.pi / 18), 2)
 
    # for each pixel coord on the output image line:
    for x_coord in range(x_max):  
        # Map it to a curve on the input image map:
        line_pixel_mapping[x_coord] = inv_transform_pixel(x_coord, x_max, y_max,
                                                          trans_lat)
    # return the mapped line, plus its line number to identify it:
    return map_line_num, line_pixel_mapping
 
 
def inv_trans_mult_lines(max_x, max_y, *lines):
    """    Calculates multiple lines at a time
 
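    :param max_x: the x dimension of the image
    :param max_y: the y dimension of the image
    :param lines: the line numbers to transform
    :return: a list of (line number, pixel mapping) tuples, one per line,
             as returned by inv_transform_line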
    """
    line_pair_res = []
    for line_num in lines:
        line_pair_res.append(inv_transform_line(line_num, max_x, max_y))
    return line_pair_res
 
 
def inv_transform_map(map_input_name, map_output_name, workers=2,
                      lines_per_core=2):
    """Builds an output map image from the input image via the inv transform
 
    :param map_input_name: string
    :param map_output_name: string
    :param workers: number of CPUs to use in parallel
    :param lines_per_core: number of image lines to pass to each CPU
    :return: when complete
    """
 
    try:
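        # Convert to RGB so getpixel() returns an (R, G, B) tuple even for
        # palette or greyscale input images:
        input_image = Image.open(map_input_name).convert('RGB')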
        X_PIC_SIZE, Y_PIC_SIZE = input_image.size
        if displaying_images:
            input_image.show()
 
        map_trimmed = False
        while Y_PIC_SIZE % lines_per_core:  # there will be lines left over
            # trim it a little, so that zip() doesn't drop the leftover lines:
            Y_PIC_SIZE -= 1
            map_trimmed = True
 
    except FileNotFoundError:
        raise SystemExit("\nInput map {} isn't there. Check the filename is correct.\n".
              format(map_input_name))
 
    core_plural = "" if workers == 1 else "s"
    line_plural = "" if lines_per_core == 1 else "s"
    trim_string = "" if not map_trimmed else "slightly trimmed "
    print("\nRunning map transform on {} core{}, {} line{} at a time on a {}{}x{} map...".
          format(workers, core_plural, lines_per_core, line_plural, trim_string,
                 X_PIC_SIZE, Y_PIC_SIZE))
 
    transformed_lines = []
    # essentially, a short list of range objects which, once zipped, will
    # create groups of lines to be processed by each core:
    list_of_ranges = [range(line_number, Y_PIC_SIZE, lines_per_core)
                      for line_number in range(lines_per_core)]
 
    with futures.ProcessPoolExecutor(workers) as executor:
        result = (executor.submit(inv_trans_mult_lines, X_PIC_SIZE,
                    Y_PIC_SIZE, *lines) for lines in zip(*(list_of_ranges)))
 
        # save the results, in whatever order they are completed:
        for future in futures.as_completed(result):
            res = future.result()
            for res_line in res:
                transformed_lines.append((res_line[0], res_line[1]))
 
    # Put each line's pixel mapping in its own row, restoring line order:
    trans_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 2), dtype=np.uint16)
    for trans_line in transformed_lines:
        trans_image_data[trans_line[0]] = trans_line[1]
 
    # Create transformed image data:
    output_image_data = np.zeros((Y_PIC_SIZE, X_PIC_SIZE, 3), dtype=np.uint8)
    for line in range(Y_PIC_SIZE):
        for pixel in range(X_PIC_SIZE):
            new_pix_line, new_pix_x = trans_image_data[line][pixel]
            # The [:3] is to read only the RGB input image data if it's RGBA:
            output_image_data[line][pixel] = input_image.getpixel(
                (int(new_pix_x), int(new_pix_line)))[:3]
 
    # Create transformed image:
    output_image = Image.fromarray(output_image_data, 'RGB')
 
    if displaying_images:
        output_image.show()
    if not map_output_name:
        map_output_name = "ShiftedPoleMap.png"
    output_image.save(map_output_name)
    return
 
 
def usage(prog_name):
    """Prints a short command line guide, using whatever filename this prog has.
    """
    print("""\nUsage: from the command line type:
    {} [-i input_filename] [-o output_filename.ext] [-c NumCores] [-n LinesPerCore] [-dht]
 
    If you don't want to use flags, give your input file the name WorldMap.png
    and your output file will be given the name ShiftedPoleMap.jpg
 
    NumCores is the number of cores/CPUs/processors you want to use. 
    On this computer you can set it to a number between 1 and {} (defaults to 2).
 
    LinesPerCore is the number of lines you want each core to process at a time.
    Must be between 1 and 10. Default is 2.
 
    Use the -d flag to suppress displaying the input and output images.
 
    Use the -t flag to time your map transformation.
 
    Use the -h flag for a detailed flag listing.
 
    """.format(prog_name, os.cpu_count()))
    return
 
 
def main(argv):
 
    global displaying_images, timing_it
 
    # Just in case your IDE hands you the whole path as sys.argv[0]:
    prog_name = os.path.split(sys.argv[0])[-1]
 
    input_map_name = 'WorldMap.png'
    output_map_name = 'ShiftedPoleMap.jpg'
 
    cores_in_use = 2
    AVAIL_CORES = os.cpu_count() or 1  # cpu_count() can return None
    MAX_LINES_PER_CORE = 10
    lines_per_calc = 2
 
    try:
        opts, args = getopt.getopt(argv, "c:n:i:o:dht")
 
    except getopt.GetoptError as err:
        print("\n", str(err))
        usage(prog_name)
        raise SystemExit()
 
    for opt, arg in opts:
        if opt == '-d':
            displaying_images = False
        elif opt == '-i':
            input_map_name = arg
        elif opt == '-o':
            output_map_name = arg
            ext = os.path.splitext(output_map_name)[-1].lower()
            if ext not in ".gif .bmp .jpg .jpeg .ppm .png .tif .tiff".split():
                output_map_name += '.jpg'  # smallest file format
                print('Unknown output file extension. Creating a JPEG file.')
        elif opt == '-n':
            try:
                if 1 <= int(arg) <= MAX_LINES_PER_CORE:
                    lines_per_calc = int(arg)
                else:
                    raise SystemExit("\nCommand line error: number of image lines per core must be between 1 and {}.\n".
                          format(MAX_LINES_PER_CORE))
            except ValueError:
                raise SystemExit("\nCommand line type error: -n flag needs an integer.\n")
 
        elif opt == '-c':
            try:
                if 1 <= int(arg) <= AVAIL_CORES:
                    cores_in_use = int(arg)
                else:
                    raise SystemExit("\nCommand line core number error: NumCores must be between 1 and {}.\n".
                          format(AVAIL_CORES))
            except ValueError:
                raise SystemExit("\nCommand line type error: -c flag needs an integer.\n")
        elif opt == '-h':
            print("\n", __doc__.format(prog_name, AVAIL_CORES))
            raise SystemExit()
        elif opt == '-t':
            timing_it = True
        else:
            assert False, "unhandled option"
 
    if timing_it:
        start = timeit.default_timer()
 
    inv_transform_map(input_map_name, output_map_name, workers=cores_in_use,
                      lines_per_core=lines_per_calc)
 
    if timing_it:
        stop = timeit.default_timer()
        print('\nFinished the map transform in', round((stop - start), 2),
              'seconds.\n')
    return
 
 
if __name__ == '__main__':
    if profiling:
        import profile
        profile.run('main(sys.argv[1:])')
 
    else:
        main(sys.argv[1:])
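
The final stage of inv_transform_map() copies pixels one at a time with getpixel() in the parent process, so that part does not get any faster as cores are added. Since trans_image_data is already a (rows, columns, 2) lookup table, NumPy integer-array indexing could do the whole copy in one operation. This is an untimed sketch, with apply_pixel_mapping() as a hypothetical helper, assuming the input image has first been turned into an RGB NumPy array:

```python
import numpy as np


def apply_pixel_mapping(input_rgb, mapping):
    """Gather output pixels from the input image via a lookup table.

    input_rgb: (H_in, W_in, 3) uint8 array of input pixels.
    mapping:   (H_out, W_out, 2) integer array; mapping[y, x] holds the
               (source_y, source_x) of the input pixel for output pixel (y, x),
               the same layout as trans_image_data in inv_transform_map().
    Returns a new (H_out, W_out, 3) array.
    """
    source_y = mapping[..., 0]
    source_x = mapping[..., 1]
    # Integer-array ("fancy") indexing looks up every output pixel at once,
    # in place of the nested getpixel() loops.
    return input_rgb[source_y, source_x]


if __name__ == "__main__":
    # Tiny example: a 2x3 RGB input and a 1x2 output.
    demo_input = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
    demo_mapping = np.array([[[1, 2], [0, 0]]], dtype=np.uint16)
    print(apply_pixel_mapping(demo_input, demo_mapping))
```

Inside inv_transform_map() this would amount to passing np.asarray(input_image.convert('RGB')) and trans_image_data to the helper, then handing the result to Image.fromarray() as before.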

 

For a list of command line instructions, save and run this file on your own system as follows:

$  python3 InvMerc_MultiLine.py -h

Alternatively, make the Python program script executable and run it like this:

$ chmod +x InvMerc_MultiLine.py 
$ ./InvMerc_MultiLine.py -h

The command-line options for choosing the number of cores and lines are fairly simple. For example, if you've made the file executable, the command to run it on the default input file (WorldMap.png) on 2 cores, sending 5 lines at a time to each core and timing the result, would be:

$ ./InvMerc_MultiLine.py -n5 -c2 -t

The parallel part of the code has the same structure as the concurrent.futures template given above, but adapted to handle multiple lines. I've used a list comprehension and the built-in zip() function to hand several lines at a time to each core, in a way that's easy to understand:

transformed_lines = []

list_of_ranges = [range(line_number, Y_PIC_SIZE, lines_per_core)
                        for line_number in range(lines_per_core)]

with futures.ProcessPoolExecutor(workers) as executor:
    result = (executor.submit(inv_trans_mult_lines, X_PIC_SIZE,
                      Y_PIC_SIZE, *lines) for lines in zip(*list_of_ranges))

    # save the results, in whatever order they are completed:
    for future in futures.as_completed(result):
        res = future.result()
        for res_line in res:
            transformed_lines.append((res_line[0], res_line[1]))

 

zip() takes one line number from each of the interleaved ranges in turn, so that, for example, with lines_per_core == 3 the successive tasks handed to the cores process the lines (0, 1, 2), (3, 4, 5), (6, 7, 8) and so on.

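The list comprehension and zip() also work when lines_per_core == 1: list_of_ranges then holds a single range, and zip() turns it into one-element tuples (0,), (1,), (2,) and so on, so each task gets a single line.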
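
The grouping can be checked on its own. The hypothetical helper line_groups() below reproduces the list comprehension and zip() from inv_transform_map(); it also shows why the program trims the image height to a multiple of lines_per_core, since zip() stops at its shortest input. line_groups_keep_remainder() is a hypothetical alternative that was not used in the timed program.

```python
def line_groups(n_lines, lines_per_core):
    """Rebuilds the line grouping used in inv_transform_map(): one range per
    offset, zipped so each tuple takes the next line from every range."""
    list_of_ranges = [range(start, n_lines, lines_per_core)
                      for start in range(lines_per_core)]
    return list(zip(*list_of_ranges))


def line_groups_keep_remainder(n_lines, lines_per_core):
    """Hypothetical alternative: consecutive slices, keeping a short last group."""
    return [tuple(range(start, min(start + lines_per_core, n_lines)))
            for start in range(0, n_lines, lines_per_core)]


print(line_groups(9, 3))                  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
print(line_groups(3, 1))                  # [(0,), (1,), (2,)]
print(line_groups(10, 3))                 # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]: line 9 is lost
print(line_groups_keep_remainder(10, 3))  # [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]
```

With the remainder kept, inv_trans_mult_lines() would simply receive a shorter final group, since it already accepts a variable number of line numbers.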

Using the debugger in the Spyder IDE, you can see the order in which the ProcessPoolExecutor returns its results:

Note how they are not in order. The image shows a snapshot of transformed_lines after the results have been returned, but before they have been put back in order. Each entry is a tuple whose first item is the output image line number and whose second item is that line's pixel mapping. Tagging the results like this means you don't need to care in what order they come back from the cores, as long as you put them back in order once they are all in.


Scaling To More Cores

To see how the program scaled, it was run on a different system: a quad-core Intel i7 desktop Mac, using only its 4 physical cores rather than the 8 logical cores that Intel's hyper-threading presents. Running the parallel code on the low-resolution image (720×360), the best performance again came from the most cores, although with steadily diminishing returns for each added CPU:

Calculating:   On 1 Core:   On 2 Cores:   On 3 Cores:   On 4 Cores:
1 line         7.1 secs     3.9 secs      3.1 secs      2.6 secs
2 lines        6.9 secs     3.85 secs     3.15 secs     2.63 secs
3 lines        7.1 secs     3.9 secs      3.0 secs      2.61 secs
4 lines        7.0 secs     3.92 secs     3.2 secs      2.59 secs

 

For the higher resolution image (2k x 1k), there were again diminishing performance returns from each additional core:

Calculating:   On 1 Core:   On 2 Cores:   On 3 Cores:   On 4 Cores:
1 line         52.0 secs    30.5 secs     22.5 secs     20.5 secs
2 lines        52.5 secs    30.3 secs     23.0 secs     20.5 secs
3 lines        51.9 secs    30.7 secs     23.4 secs     19.2 secs
4 lines        52.5 secs    30.5 secs     23.6 secs     20.5 secs
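
To put numbers on those diminishing returns: the speedup on p cores is T1/Tp, the parallel efficiency is that speedup divided by p, and Amdahl's law, T1/Tp = 1/((1 - f) + f/p), can be solved for the fraction f of the run time that behaves as if it were perfectly parallel. The sketch below, with scaling_report() as a hypothetical helper, applies these to the 1-line row of the 2k x 1k table above; it is an after-the-fact reading of the measurements, not a new benchmark.

```python
def scaling_report(times):
    """times maps core count -> run time in seconds (must include 1 core).

    Returns (cores, speedup, efficiency, parallel_fraction) for each core
    count above 1, where parallel_fraction is the Amdahl's-law value f that
    solves  T1 / Tp = 1 / ((1 - f) + f / p).
    """
    t1 = times[1]
    report = []
    for p in sorted(times):
        if p == 1:
            continue
        speedup = t1 / times[p]
        efficiency = speedup / p
        parallel_fraction = (1 - 1 / speedup) / (1 - 1 / p)
        report.append((p, speedup, efficiency, parallel_fraction))
    return report


# 2k x 1k map, 1 line per task, from the table above:
for p, s, e, f in scaling_report({1: 52.0, 2: 30.5, 3: 22.5, 4: 20.5}):
    print(f"{p} cores: speedup {s:.2f}, efficiency {e:.0%}, "
          f"parallel fraction {f:.0%}")
```

For that row it gives speedups of about 1.70, 2.31 and 2.54 on 2, 3 and 4 cores (efficiencies of 85%, 77% and 63%) and parallel fractions of roughly 81–85%. Read through Amdahl's law, that would leave something like 15–20% of the run time effectively serial, which is consistent with the image loading and the pixel-by-pixel construction of the output image both running in the parent process, although these measurements alone can't confirm it.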
 

Summary of Findings

It would seem that the important lessons here were that:

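  1. In using the concurrent.futures module, there was a gain from adding each additional CPU, but one that began to tail off after the second core. These diminishing returns are probably not a limitation of ProcessPoolExecutor as such, since it sidesteps the GIL by running a separate Python process on each core. More likely culprits are the parts of the program that still run serially in the parent process (such as loading the input image and building the output image pixel by pixel) and the cost of passing arguments and results between processes.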
  2. One pixel calculation per CPU process was too little, while one or more lines per task gave almost identical runtimes. That suggests that, once each task was at least a line long, the per-task overhead was small compared with the calculation, so the cores spent little time waiting.
  3. This kind of tuning exercise is likely to be necessary whatever the algorithm, since the best task size depends on how much work each task does.
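  4. Scaling a CPU-bound calculation like this one to multiple processes pays off on real, physical cores; Intel's hyper-threaded logical cores share a physical core's execution units and should not be expected to add much. Since os.cpu_count() reports logical CPUs, check your system specification for the number of physical cores.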
  5. For more advanced image transforms, passing independent data to the different cores would be one of the challenges. Otherwise, image transforms, and probably many other kinds of image processing, are perfect applications for parallel processing, provided you can avoid data lock situations.
  6. This program would probably scale better on a more efficient parallel Python implementation.
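
On point 4, os.cpu_count() counts logical CPUs, so on a machine with 2-way hyper-threading it reports twice the number of physical cores. Here is a minimal sketch that asks for both, assuming the third-party psutil package is installed; core_counts() is a hypothetical helper, and it falls back to None for the physical count if psutil is missing.

```python
import os


def core_counts():
    """Return (logical, physical) CPU counts; either may be None if unknown.

    os.cpu_count() counts logical CPUs, including hyper-threads. The physical
    count comes from the third-party psutil package, if it is installed.
    """
    logical = os.cpu_count()
    try:
        import psutil
    except ImportError:
        return logical, None
    return logical, psutil.cpu_count(logical=False)


logical, physical = core_counts()
print(f"logical CPUs: {logical}, physical cores: {physical}")
```

The physical count would be a natural upper limit for the -c option when the work is CPU-bound, as it is here.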


Future Work

Now that I have an application that lends itself to parallel computing on multiple cores, for comparison purposes I’ll be looking to try porting it to other parallel Python implementations.

 
