A speed test

Let's write a small program (time_calc0.py) that compares the speed of a scalar multiplication on the CPU with the same operation on the GPU. We will generate an array of 50 million random 32-bit floating point values (at 4 bytes per value, this amounts to roughly 200 megabytes of data, which is entirely feasible on any reasonably modern host and GPU device with several gigabytes of memory), and then time how long it takes to multiply the array by the scalar two on each device. Finally, we will use NumPy's allclose function to check that the two outputs are equal. Here's how it's done:

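import numpy as np
import pycuda.autoinit  # initializes the CUDA context on import
from pycuda import gpuarray
from time import time

# 50 million random values, converted to 32-bit floats
host_data = np.float32( np.random.random(50000000) )

# time the scalar multiplication on the CPU
t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()

print('total time to compute on CPU: %f' % (t2 - t1))

# copy the array to the GPU before starting the GPU timer
device_data = gpuarray.to_gpu(host_data)

# time the same scalar multiplication on the GPU
t1 = time()
device_data_2x = device_data * np.float32( 2 )
t2 = time()

# copy the result back to the host for comparison
from_device = device_data_2x.get()
print('total time to compute on GPU: %f' % (t2 - t1))

print('Is the host computation the same as the GPU computation? : {}'.format(np.allclose(from_device, host_data_2x) ))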

(You can find the time_calc0.py file on the repository provided to you earlier.)
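The memory footprint of the test array follows from simple arithmetic: the number of values times the size of each value. The short sketch below works this out with plain Python; the variable names are illustrative only and are not part of time_calc0.py.

```python
# Size of 50 million 32-bit floats, worked out by hand.
n_values = 50000000
bytes_per_value = 4  # a 32-bit float occupies 4 bytes

total_bytes = n_values * bytes_per_value
total_mib = total_bytes / float(1024 ** 2)  # bytes -> mebibytes

print('%d bytes, about %.1f MiB' % (total_bytes, total_mib))
```

In NumPy, host_data.nbytes reports the same byte count for the array created in the listing above, so the figure can be checked directly once the array exists.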

Now, let's load up IPython and run this program a few times to get an idea of the general speed of each computation, and to see whether there is any variance. (Here, this is being run on a 2017-era Microsoft Surface Book 2 with a Kaby Lake i7 processor and a GTX 1050 GPU.)

We first notice that the CPU computation time is about the same on every run (roughly 0.08 seconds). The GPU computation, however, is far slower than the CPU computation the first time we run it (1.09 seconds); on the second run it becomes much faster, and it stays roughly constant on every run after that (around 7 to 9 milliseconds). If you exit IPython and then run the program again, the same thing will occur. What is the reason for this phenomenon? Well, let's do some investigative work using IPython's built-in prun profiler. (This works similarly to the cProfile module that was featured in Chapter 1, Why GPU Programming?.)

First, let's load our program as text within IPython with the following lines, which we can then run with our profiler via Python's exec command:

with open('time_calc0.py','r') as f:
    time_calc_code = f.read()

We now type %prun -s cumulative exec(time_calc_code) into our IPython console (with the leading %) and see what operations are taking the most time:

Now, there are a number of suspicious calls to a Python module file, compiler.py; these take roughly one second in total, a little less than the time it takes to do the GPU computation here. Now let's run this again and see if there are any differences:

Notice that this time, there are no calls to compiler.py. Why is this? By the nature of the PyCUDA library, GPU code is often compiled and linked with NVIDIA's nvcc compiler the first time it is run in a given Python session; it is then cached, so if the code is called again, it doesn't have to be recompiled. This may include even simple operations such as this scalar multiply! (We will eventually see that this can be ameliorated by using pre-compiled code, in Chapter 10, Working with Compiled GPU Code, or by using NVIDIA's own linear algebra libraries with the Scikit-CUDA module, which we will see in Chapter 7, Using the CUDA Libraries with Scikit-CUDA.)
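A similar profile can also be collected outside IPython with the standard-library cProfile and pstats modules. The following is a minimal sketch written for Python 3, not the book's own method: the helper name profile_source is hypothetical, and it sorts by cumulative time in the same spirit as the -s cumulative option. It is demonstrated on a tiny stand-in snippet so that it runs without a GPU.

```python
import cProfile
import io
import pstats


def profile_source(source, top=10):
    """Run `source` under cProfile; return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    code = compile(source, '<profiled source>', 'exec')
    namespace = {'__name__': '__main__'}  # fresh globals for the profiled code
    profiler.enable()
    try:
        exec(code, namespace)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats('cumulative').print_stats(top)  # show only the top entries
    return buffer.getvalue()


# Stand-in workload; any Python source string can be passed instead.
report = profile_source('total = sum(x * x for x in range(10000))')
print(report)
```

To profile the speed test itself, the text read from time_calc0.py could be passed in place of the stand-in string, assuming the script runs under the same interpreter. Calling the helper twice in one session should then show whether the compiler.py entries appear only in the first report.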

In PyCUDA, GPU code is often compiled at runtime with the NVIDIA nvcc compiler and then subsequently called from PyCUDA. This can lead to an unexpected slowdown, usually the first time a program or GPU operation is run in a given Python session.
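One way to keep this one-time cost from distorting a measurement is to time several consecutive runs and report the first one separately. The sketch below is only an illustration for Python 3: time_runs and fake_gpu_op are hypothetical names, and fake_gpu_op merely imitates a first-call compilation delay with a sleep so that the example runs without a GPU.

```python
from time import perf_counter, sleep


def time_runs(fn, repeats=5):
    """Call fn() `repeats` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return timings


_compiled = {}  # stands in for a cache of already compiled kernels


def fake_gpu_op():
    """Hypothetical stand-in: slow on the first call, fast once 'compiled'."""
    if 'kernel' not in _compiled:
        sleep(0.05)  # imitates a one-time compilation delay
        _compiled['kernel'] = True


timings = time_runs(fake_gpu_op, repeats=4)
first_run, later_runs = timings[0], timings[1:]
print('first run: %f s' % first_run)
print('later runs: %s' % ', '.join('%f s' % t for t in later_runs))
```

With PyCUDA installed, a callable such as lambda: device_data * np.float32(2) could be passed to time_runs in place of the stand-in, using the names from time_calc0.py.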