Using GPU(s) in Chainer
In this section, you will learn about the following things:
- Relationship between Chainer and PyCUDA
- Basics of GPUArray
- Single-GPU usage of Chainer
- Multi-GPU usage of model-parallel computing
- Multi-GPU usage of data-parallel computing
After reading this section, you will be able to:
- Use Chainer on a CUDA-enabled GPU
- Write model-parallel computing in Chainer
- Write data-parallel computing in Chainer
Relationship between Chainer and PyCUDA
Chainer uses PyCUDA as its backend for GPU computation and the
pycuda.gpuarray.GPUArray class as the GPU array implementation.
GPUArray has far fewer features than numpy.ndarray, though it is still enough to implement the features that Chainer requires.
The chainer.cuda module imports many important symbols from PyCUDA.
For example, the GPUArray class is referred to as cuda.GPUArray in the Chainer code.
Chainer provides wrappers of many PyCUDA functions and classes, mainly in order to support a customized default allocation mechanism.
As shown in the previous sections, Chainer constructs and destructs many arrays during learning and evaluating iterations.
This pattern is not well suited to the CUDA architecture, since memory allocation and release in CUDA (i.e. the cuMemAlloc and cuMemFree functions) synchronize CPU and GPU computations, which hurts performance.
In order to avoid memory allocation and deallocation during the computation, Chainer uses PyCUDA’s memory pool utilities as the standard memory allocator.
Since the memory pool is not the default allocator in PyCUDA, Chainer provides many wrapper functions and classes to use memory pools in a simple way.
At the same time, Chainer’s wrapper functions and classes make it easy to handle multiple GPUs.
This tutorial does not cover the details of PyCUDA. See PyCUDA’s documentation instead.
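The idea behind a memory pool can be illustrated with a small pure-Python sketch. This is not PyCUDA's or Chainer's implementation; the class name SimplePool and its counter are hypothetical, and a bytearray stands in for device memory. It only shows why reusing freed buffers avoids repeated calls to the expensive allocation routine.

```python
class SimplePool(object):
    """Toy memory pool (hypothetical): reuses freed buffers by size."""

    def __init__(self):
        self.free_buffers = {}    # size -> list of idle buffers
        self.raw_allocations = 0  # number of "real" allocations performed

    def allocate(self, size):
        idle = self.free_buffers.get(size)
        if idle:
            return idle.pop()      # reuse an idle buffer, no real allocation
        self.raw_allocations += 1  # stands in for a device allocation call
        return bytearray(size)

    def free(self, buf):
        # Keep the buffer for later reuse instead of releasing it.
        self.free_buffers.setdefault(len(buf), []).append(buf)


pool = SimplePool()
for _ in range(100):           # e.g. 100 training iterations
    buf = pool.allocate(1024)  # a temporary array of a fixed size
    pool.free(buf)
```

In this sketch only the first iteration performs a real allocation; the remaining 99 reuse the same buffer, which is the behavior a pool-based allocator aims for when arrays of the same size are created in every iteration.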
Basics of GPUArray in Chainer
In order to use a GPU in Chainer, we must initialize the
chainer.cuda module before any GPU-related operation.
The cuda.init() function initializes the global state and PyCUDA.
This function accepts an optional argument
device, which indicates the GPU device ID to select initially.
If you are using
multiprocessing, the initialization must take place for each process after the fork.
The main process is no exception, i.e.,
cuda.init() should not be called before all the children that use GPU have been forked.
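Then we can create a GPUArray object using functions of the cuda module.
Chainer provides many constructor functions resembling the ones of NumPy, such as cuda.ones() and cuda.empty().
An existing numpy.ndarray can also be copied to the GPU with the cuda.to_gpu() function.
For example, the following code

    x_cpu = np.ones((5, 4, 3), dtype=np.float32)
    x_gpu = cuda.to_gpu(x_cpu)

generates the same
x_gpu as the following code:

    x_gpu = cuda.ones((5, 4, 3))
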
Allocation functions of the
cuda module use
numpy.float32 as the default element type.
A GPUArray object can be copied back to a numpy.ndarray with the cuda.to_cpu() function:

    x_cpu = cuda.to_cpu(x_gpu)

All GPUArray constructors allocate memory on the current device.
In order to allocate memory on a different device, we can use device switching utilities.
The cuda.use_device() function changes the current device:

    cuda.use_device(1)
    x_gpu1 = cuda.empty((4, 3))

There are many situations in which we want to switch the device only temporarily. In these cases the
cuda.using_device() function is useful.
It returns a resource object that can be combined with the with statement:

    with cuda.using_device(1):
        x_gpu1 = cuda.empty((4, 3))

These device switching utilities also accept a GPUArray object as a device specifier. In this case, Chainer switches the current device to the one on which the array is allocated:

    with cuda.using_device(x_gpu1):
        y_gpu1 = x_gpu1 + 1

An array that is not allocated by Chainer’s allocator cannot be used as a device specifier.
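The difference between permanently and temporarily switching the device can be sketched with a plain Python context manager. This is only an illustration of the pattern, not Chainer's implementation: the module-level variable _current_device and the three functions below are hypothetical stand-ins that track a device ID without touching any GPU.

```python
import contextlib

_current_device = 0  # hypothetical global "current device" ID


def use_device(device_id):
    """Permanently switch the current device."""
    global _current_device
    _current_device = device_id


@contextlib.contextmanager
def using_device(device_id):
    """Temporarily switch the current device inside a with block."""
    global _current_device
    previous = _current_device
    _current_device = device_id
    try:
        yield
    finally:
        _current_device = previous  # restored even if the body raises


def current_device():
    return _current_device


seen = []
with using_device(1):
    seen.append(current_device())  # inside the block: device 1
seen.append(current_device())      # after the block: back to device 0
```

The try/finally in the sketch is the important design point: the previous device is restored when the block exits, whether it finishes normally or raises an exception.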
A GPUArray object allocated by Chainer can be copied between GPUs by the cuda.copy() function:

    cuda.use_device(0)
    x0 = cuda.ones((4, 3))
    x1 = cuda.copy(x0, out_device=1)

Run Neural Networks on a Single GPU
Single-GPU usage is very simple.
All you have to do is transfer the
FunctionSet and input arrays to the GPU beforehand.
In this subsection, the code is based on our first MNIST example in this tutorial.

    model = FunctionSet(
        l1=F.Linear(784, 100),
        l2=F.Linear(100, 100),
        l3=F.Linear(100, 10),
    ).to_gpu()

    optimizer = optimizers.SGD()
    optimizer.setup(model.collect_parameters())

Note that the to_gpu() method returns the function set itself. The device specifier can be omitted, in which case the current device is used.
Then, all we have to do is transfer each minibatch to the GPU:

    batchsize = 100
    for epoch in xrange(20):
        print 'epoch', epoch
        indexes = np.random.permutation(60000)
        for i in xrange(0, 60000, batchsize):
            x_batch = cuda.to_gpu(x_train[indexes[i : i + batchsize]])
            y_batch = cuda.to_gpu(y_train[indexes[i : i + batchsize]])

            optimizer.zero_grads()
            loss, accuracy = forward(x_batch, y_batch)
            loss.backward()
            optimizer.update()

This is almost identical to the code of the original example;
we just added a call to the
cuda.to_gpu() function for the minibatch arrays.
Model-parallel Computation on Multiple GPUs
Parallelization of machine learning is roughly classified into two types called “model-parallel” and “data-parallel”. Model-parallel means parallelizing the computations inside the model. In contrast, data-parallel means parallelizing by sharding the data. In this subsection, we show how to use the model-parallel approach on multiple GPUs in Chainer.
Recall the MNIST example. Now suppose that we want to modify this example by expanding the network to 6 layers with 2000 units each using two GPUs. In order to make multi-GPU computation efficient, we only make the two GPUs communicate at the third and sixth layer. The overall architecture looks like the following diagram:

    (GPU0) input --+--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+--> output
                   |                       |                       |
    (GPU1)         +--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+

We first have to define a FunctionSet.
Note that parameters that will be used on a device must reside on that device.
Here is a simple example of the model definition:

    model = FunctionSet(
        gpu0 = FunctionSet(
            l1=F.Linear( 784, 1000),
            l2=F.Linear(1000, 1000),
            l3=F.Linear(1000, 2000),
            l4=F.Linear(2000, 1000),
            l5=F.Linear(1000, 1000),
            l6=F.Linear(1000, 10)
        ).to_gpu(0),
        gpu1 = FunctionSet(
            l1=F.Linear( 784, 1000),
            l2=F.Linear(1000, 1000),
            l3=F.Linear(1000, 2000),
            l4=F.Linear(2000, 1000),
            l5=F.Linear(1000, 1000),
            l6=F.Linear(1000, 10)
        ).to_gpu(1)
    )

FunctionSet.to_gpu() returns the FunctionSet object itself.
Note that FunctionSet can be nested as above.
Now we can define the network architecture that we have shown in the diagram:

    def forward(x_data, y_data):
        x_0 = Variable(cuda.to_gpu(x_data, 0))
        x_1 = Variable(cuda.to_gpu(x_data, 1))
        t = Variable(cuda.to_gpu(y_data, 0))

        h1_0 = F.relu(model.gpu0.l1(x_0))
        h1_1 = F.relu(model.gpu1.l1(x_1))

        h2_0 = F.relu(model.gpu0.l2(h1_0))
        h2_1 = F.relu(model.gpu1.l2(h1_1))

        h3_0 = F.relu(model.gpu0.l3(h2_0))
        h3_1 = F.relu(model.gpu1.l3(h2_1))

        # Synchronize
        h3_0 += F.copy(h3_1, 0)
        h3_1 = F.copy(h3_0, 1)

        h4_0 = F.relu(model.gpu0.l4(h3_0))
        h4_1 = F.relu(model.gpu1.l4(h3_1))

        h5_0 = F.relu(model.gpu0.l5(h4_0))
        h5_1 = F.relu(model.gpu1.l5(h4_1))

        h6_0 = F.relu(model.gpu0.l6(h5_0))
        h6_1 = F.relu(model.gpu1.l6(h5_1))

        # Synchronize
        y = h6_0 + F.copy(h6_1, 0)
        return F.softmax_cross_entropy(y, t), F.accuracy(y, t)

First, recall that
cuda.to_gpu() accepts an optional argument to specify the device identifier.
We use this to transfer the input minibatch to both the 0th and the 1st devices.
Then, we can write this model-parallel example employing the F.copy() function.
This function transfers an input array to another device.
Since it is a function on
Variable, the operation supports backprop, which transfers the output gradient back to the input device.
The above code is not parallelized on the CPU, but it is parallelized on the GPUs. This is because most GPU computation is asynchronous to the host CPU.
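The two synchronization steps in the forward function can be traced with a small pure-Python sketch. It does not use Chainer or a GPU: DeviceArray, copy_to and add are hypothetical helpers that only tag a list of numbers with a device ID, so that the data movement at a synchronization point becomes visible.

```python
import collections

# Hypothetical stand-in for an array that lives on one device.
DeviceArray = collections.namedtuple('DeviceArray', ['device', 'values'])


def copy_to(arr, device):
    """Return a copy of arr placed on the given device."""
    return DeviceArray(device, list(arr.values))


def add(a, b):
    """Elementwise addition; both operands must be on the same device."""
    assert a.device == b.device, 'operands live on different devices'
    return DeviceArray(a.device, [x + y for x, y in zip(a.values, b.values)])


h3_0 = DeviceArray(0, [1, 2, 3])     # branch output on device 0
h3_1 = DeviceArray(1, [10, 20, 30])  # branch output on device 1

# Synchronize: sum on device 0, then send the sum back to device 1.
h3_0 = add(h3_0, copy_to(h3_1, 0))
h3_1 = copy_to(h3_0, 1)
```

After the synchronization both branches hold the same summed values, each on its own device, so the following layers can again run independently on the two devices.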
An almost identical example code can be found at
Data-parallel Computation on Multiple GPUs
Data-parallel computation is another strategy to parallelize online processing. In the context of neural networks, it means that a different device does computation on a different subset of the input data. In this subsection, we review the way to achieve data-parallel learning on two GPUs.
Suppose again our task is the MNIST example. This time we want to directly parallelize the three-layer network. The simplest form of data-parallelization is parallelizing the gradient computation for distinct sets of data. First, define the model:

    model = FunctionSet(
        l1=F.Linear(784, 100),
        l2=F.Linear(100, 100),
        l3=F.Linear(100, 10),
    )

We have to place a copy of this model on each of the two devices, using copy.deepcopy() and the to_gpu() method:

    import copy
    model_0 = copy.deepcopy(model).to_gpu(0)
    model_1 = model.to_gpu(1)

Then, set up the optimizer as follows:

    optimizer = optimizers.SGD()
    optimizer.setup(model_0.collect_parameters())

Here we use the first copy of the model as the master model.
Before its update, the gradients of
model_1 must be aggregated to those of model_0.
The forward function is almost the same as in the original example:

    def forward(x_data, y_data, model):
        x = Variable(x_data)
        t = Variable(y_data)
        h1 = F.relu(model.l1(x))
        h2 = F.relu(model.l2(h1))
        y = model.l3(h2)
        return F.softmax_cross_entropy(y, t), F.accuracy(y, t)

The only difference is that forward() accepts
model as an argument.
We can feed it with a model and arrays on an appropriate device.
Then, we can write a data-parallel learning loop as follows:

    batchsize = 100
    for epoch in xrange(20):
        print 'epoch', epoch
        indexes = np.random.permutation(60000)
        for i in xrange(0, 60000, batchsize):
            x_batch = x_train[indexes[i : i + batchsize]]
            y_batch = y_train[indexes[i : i + batchsize]]

            optimizer.zero_grads()

            # First half of the minibatch on GPU 0
            loss_0, accuracy_0 = forward(
                cuda.to_gpu(x_batch[:batchsize/2], 0),
                cuda.to_gpu(y_batch[:batchsize/2], 0),
                model_0)
            loss_0.backward()

            # Second half of the minibatch on GPU 1
            loss_1, accuracy_1 = forward(
                cuda.to_gpu(x_batch[batchsize/2:], 1),
                cuda.to_gpu(y_batch[batchsize/2:], 1),
                model_1)
            loss_1.backward()

            # Add the gradients of model_1 to those of the master model
            optimizer.accumulate_grads(model_1.gradients)
            optimizer.update()

            # Propagate the updated parameters back to model_1
            model_1.copy_parameters_from(model_0.parameters)

One half of the minibatch is forwarded to GPU 0, the other half to GPU 1.
Then the gradients are accumulated by the optimizer.accumulate_grads() method.
After the gradients are prepared, we can update the optimizer in the usual way.
Note that the update only modifies the parameters of model_0.
So we must manually copy them to model_1 using the copy_parameters_from() method.
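The accumulate, update and copy steps can be checked on a toy problem in pure Python. The sketch below is an illustration under simplifying assumptions, not Chainer code: the model is a single scalar weight, the loss is a plain sum of squared errors, and the function name grad_sum_squared_error and all variable names are hypothetical.

```python
def grad_sum_squared_error(w, xs, ys):
    """Gradient w.r.t. w of sum((w * x - y) ** 2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w_master = 1   # parameter of the master copy
w_replica = 1  # parameter of the second copy

half = len(xs) // 2
grad_0 = grad_sum_squared_error(w_master, xs[:half], ys[:half])
grad_1 = grad_sum_squared_error(w_replica, xs[half:], ys[half:])

# Accumulate the gradient of the second copy into the master gradient.
grad_total = grad_0 + grad_1

# For comparison: the gradient computed on the whole minibatch at once.
grad_full = grad_sum_squared_error(w_master, xs, ys)

# Update only the master copy, then copy its parameter to the second copy.
learning_rate = 0.01
w_master = w_master - learning_rate * grad_total
w_replica = w_master
```

Here grad_total equals grad_full because the loss is a plain sum over samples. A loss that averages over each half of the minibatch would give the same direction, scaled by a constant factor when the halves have equal size.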
Now you can use Chainer with GPUs.
All examples in the
examples directory support GPU computation, so please refer to them if you want to learn more about using GPUs in practice.
In the next section, we will show how to define a differentiable (i.e. backpropable) function on Variable objects.
We will also show there how to write a simple (elementwise) CUDA kernel using Chainer’s CUDA utilities.