Chainer – A flexible framework of neural networks¶
Chainer is a powerful, flexible and intuitive deep learning framework.
Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
Chainer supports various network architectures including feedforward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
Forward computation can include any Python control flow statements without losing the ability to backpropagate. This makes code intuitive and easy to debug.
Chainer at a Glance¶
Welcome to Chainer!
Chainer is a rapidly growing neural network platform. The strengths of Chainer are:
Python-based – Chainer is developed in Python, allowing for inspection and customization of all code in Python, with understandable Python messages at run time
Define-by-Run – neural network definitions are created on-the-fly at run time, allowing for dynamic network changes
NumPy-based syntax for working with arrays, thanks to the CuPy implementation
Fully customizable – since Chainer is pure Python, all classes and methods can be adapted to allow for the latest cutting edge or specialized approaches
Broad and deep support – Chainer is actively used for most of the current approaches for neural nets (CNN, RNN, RL, etc.), aggressively adds new approaches as they’re developed, and provides support for many kinds of hardware as well as parallelization for multiple GPUs
Mushrooms – tasty or deadly?¶
Let’s take a look at a basic program of Chainer to see how it works. For a dataset, we’ll work with Kaggle’s edible vs. poisonous mushroom dataset, which has over 8,000 examples of mushrooms, labelled by 22 categories including odor, cap color, habitat, etc., in a mushrooms.csv file.
How will Chainer learn which mushrooms are edible and which mushrooms will kill you? Let’s see!
The code below is from the glance example in the examples/glance directory.
Code Breakdown¶
Initialization¶
Let’s start the program. Here are the typical imports for a Chainer program. The links in chainer.links contain trainable parameters, while the functions in chainer.functions do not.
import chainer as ch
from chainer import datasets
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions
import numpy as np

We’ll use Matplotlib for the graphs to show training progress.
import matplotlib
matplotlib.use('Agg')

Trainer Structure¶
A trainer is used to set up our neural network and data for training. The components of the trainer are generally hierarchical, and are organized as follows: the trainer wraps an updater; the updater wraps an iterator (which steps through the dataset) and an optimizer (which is set up on the model); extensions attach to the trainer.
Each of the components is fed information from the components within it. Setting up the trainer starts at the inner components and moves outward, with the exception of extensions, which are added after the trainer is defined.
Dataset¶
Our first step is to format the dataset. From the raw mushrooms.csv, we format the data into a Chainer TupleDataset.
mushroomsfile = 'mushrooms.csv'
data_array = np.genfromtxt(
    mushroomsfile, delimiter=',', dtype=str, skip_header=1)
# Encode every categorical column as integer codes
for col in range(data_array.shape[1]):
    data_array[:, col] = np.unique(data_array[:, col], return_inverse=True)[1]

# Column 0 is used as the label; the remaining columns are the features
X = data_array[:, 1:].astype(np.float32)
Y = data_array[:, 0].astype(np.int32)[:, None]
# Random 70% / 30% split into training and test sets
train, test = datasets.split_dataset_random(
    datasets.TupleDataset(X, Y), int(data_array.shape[0] * .7))
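
To see what the np.unique(..., return_inverse=True)[1] call in the loop does, here is a minimal NumPy sketch on a made-up column (the values are illustrative and not taken from mushrooms.csv). Since the distinct values are sorted, a label column that uses 'e' and 'p' (an assumption about the file) would map 'e' to 0 and 'p' to 1.

```python
import numpy as np

# Toy stand-in for one categorical column (hypothetical values).
column = np.array(['p', 'e', 'e', 'p', 'x'])

# np.unique returns the sorted distinct values; with return_inverse=True it
# also returns, for every element, the index of its value in that sorted list.
categories, codes = np.unique(column, return_inverse=True)

print(categories)  # ['e' 'p' 'x']
print(codes)       # [1 0 0 1 2]
```

The integer codes carry no ordering information about the categories; they are simply a compact numeric encoding that can be cast to float32.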

Iterator¶
Configure iterators to step through batches of the data for training and for validation on the test set. In this case, we’ll use a batch size of 100. For the training iterator, repeating and shuffling are implicitly enabled, while they are explicitly disabled for the testing iterator.
train_iter = ch.iterators.SerialIterator(train, 100)
test_iter = ch.iterators.SerialIterator(
    test, 100, repeat=False, shuffle=False)
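
As a rough illustration of what a single non-repeating, unshuffled pass looks like (the setting used for the test iterator), the following sketch slices example indices into batches with plain NumPy. The helper name batches is hypothetical and this is not how SerialIterator is implemented; it only shows the batching idea.

```python
import numpy as np

def batches(n_examples, batch_size, shuffle, seed=0):
    """Yield index arrays covering one pass over the data (hypothetical helper)."""
    order = np.arange(n_examples)
    if shuffle:
        # Shuffle the visiting order in place with a seeded generator.
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

# 250 examples with batch size 100: the last batch is smaller.
sizes = [len(b) for b in batches(250, 100, shuffle=False)]
print(sizes)  # [100, 100, 50]
```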

Model¶
Next, we need to define the neural network for inclusion in our model. For our mushrooms, we’ll chain together two fully-connected Linear hidden layers between the input and output layers.
As an activation function, we’ll use standard Rectified Linear Units (relu()).
Using Sequential allows us to define the neural network model in a compact format.
# Network definition
def MLP(n_units, n_out):
    layer = ch.Sequential(L.Linear(n_units), F.relu)
    model = layer.repeat(2)
    model.append(L.Linear(n_out))

    return model

Since mushrooms are either edible or poisonous (no information on psychedelic effects!) in the dataset, we’ll use a Link Classifier for the output, with 44 units (double the features of the data) in the hidden layers and a single edible/poisonous category for classification.
model = L.Classifier(
    MLP(44, 1), lossfun=F.sigmoid_cross_entropy, accfun=F.binary_accuracy)

Note that in the two code snippets above we have not specified the size of the input layer. Once we start feeding the neural network with samples, Chainer will recognize the dimensionality of the input automatically and initialize the matrix for each layer with the appropriate shape. In the example above, that is 44×22 for the first hidden layer, 44×44 for the second hidden layer, and 1×44 for the output layer.
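
The shapes quoted above can be checked with a small NumPy sketch that pushes a dummy minibatch through zero-filled weight matrices. It assumes the (output size, input size) layout for W that the shapes above imply; the parameter count is computed here and is not stated in the original text.

```python
import numpy as np

n_features, n_units, n_out = 22, 44, 1
# (output size, input size) for the two hidden layers and the output layer.
shapes = [(n_units, n_features), (n_units, n_units), (n_out, n_units)]

h = np.zeros((100, n_features), dtype=np.float32)  # one minibatch of 100 rows
for out_size, in_size in shapes:
    W = np.zeros((out_size, in_size), dtype=np.float32)
    b = np.zeros(out_size, dtype=np.float32)
    h = h @ W.T + b  # ReLU is omitted because it does not change shapes

# Each layer has out*in weights plus out biases.
n_params = sum(o * i + o for o, i in shapes)
print(h.shape, n_params)  # (100, 1) 3037
```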
Optimizer¶
Pick an optimizer, and set up the model to use it.
# Setup an optimizer
optimizer = ch.optimizers.SGD().setup(model)

Updater¶
Now that we have the training iterator and optimizer set up, we link them both together into the updater. The updater uses the minibatches from the iterator, does the forward and backward processing of the model, and updates the parameters of the model according to the optimizer. Setting device=-1 selects the CPU. To use a GPU, set device equal to the number of the GPU, usually device=0.
# Create the updater, using the optimizer
updater = training.StandardUpdater(train_iter, optimizer, device=-1)

Finally we create a Trainer object. The trainer processes minibatches using the updater defined above until a certain stop condition is met and allows the use of extensions during the training. We set it to run for 50 epochs and store all files created by the extensions (see below) in the result directory.
# Set up a trainer
trainer = training.Trainer(updater, (50, 'epoch'), out='result')

Extensions¶
Extensions can be used to execute code at certain events during the training, such as every epoch or every 1000 iterations. This mechanism is used in Chainer to evaluate models during training, print progress messages, or dump intermediate model files.
First, use the testing iterator defined above for an Evaluator extension to the trainer to provide test scores. If using a GPU instead of the CPU, set device to the ID of the GPU, usually 0.
# Evaluate the model with the test dataset for each epoch
trainer.extend(extensions.Evaluator(test_iter, model, device=-1))

Save a computational graph from the loss variable at the first iteration. main refers to the target link of the main optimizer. The graph is saved in Graphviz’s dot format. The output directory for the graph is set by the out argument of the trainer.
# Dump a computational graph from 'loss' variable at the first iteration
# The "main" refers to the target link of the "main" optimizer.
trainer.extend(extensions.DumpGraph('main/loss'))

Take a snapshot of the trainer object every 20 epochs.
trainer.extend(extensions.snapshot(), trigger=(20, 'epoch'))

Write a log of evaluation statistics for each epoch.
# Write a log of evaluation statistics for each epoch
trainer.extend(extensions.LogReport())

Save two plot images to the result directory.
# Save two plot images to the result dir
trainer.extend(
    extensions.PlotReport(['main/loss', 'validation/main/loss'],
                          'epoch', file_name='loss.png'))
trainer.extend(
    extensions.PlotReport(
        ['main/accuracy', 'validation/main/accuracy'],
        'epoch', file_name='accuracy.png'))

Print selected entries of the log to standard output.
# Print selected entries of the log to stdout
trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss',
     'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))

Main Loop¶
Finally, with the trainer and all the extensions set up, we can add the line that actually starts the main loop:
# Run the training
trainer.run()

Inference¶
Once the training is complete, only the model is necessary to make predictions. Let’s check a random line from the test data set and see if the inference is correct:
x, t = test[np.random.randint(len(test))]

predict = model.predictor(x[None]).array
predict = predict[0][0]

if predict >= 0:
    print('Predicted Poisonous, Actual ' + ['Edible', 'Poisonous'][t[0]])
else:
    print('Predicted Edible, Actual ' + ['Edible', 'Poisonous'][t[0]])
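
The comparison against 0 works because the predictor outputs a raw score (a logit) and the model is trained with a sigmoid cross-entropy loss: a logit of 0 corresponds to a sigmoid probability of exactly 0.5. A minimal NumPy sketch of that equivalence, using made-up logits:

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-2.0, 0.0, 3.0])  # illustrative values only
probs = sigmoid(logits)

# Thresholding the logit at 0 matches thresholding the probability at 0.5.
by_logit = logits >= 0
by_prob = probs >= 0.5
print(by_logit.tolist() == by_prob.tolist())  # True
```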

Output¶
Output for this instance will look like:
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.550724 0.502818 0.733509 0.752821 0.215426
2 0.454206 0.446234 0.805439 0.786926 0.902108
3 0.402783 0.395893 0.838421 0.835979 1.50414
4 0.362979 0.359988 0.862807 0.852632 2.24171
5 0.32713 0.329881 0.88 0.874232 2.83247
6 0.303469 0.31104 0.892456 0.887284 3.45173
7 0.284755 0.288553 0.901754 0.903284 3.9877
8 0.26801 0.272033 0.9125 0.907137 4.54794
9 0.25669 0.261355 0.920175 0.917937 5.21672
10 0.241789 0.251821 0.927193 0.917937 5.79541
11 0.232291 0.238022 0.93 0.925389 6.3055
12 0.222805 0.22895 0.934035 0.923389 6.87083
13 0.21276 0.219291 0.93614 0.928189 7.54113
14 0.204822 0.220736 0.938596 0.922589 8.12495
15 0.197671 0.207017 0.938393 0.936042 8.69219
16 0.190285 0.199129 0.941053 0.934842 9.24302
17 0.182827 0.193303 0.944386 0.942695 9.80991
18 0.176776 0.194284 0.94614 0.934042 10.3603
19 0.16964 0.177684 0.945789 0.945242 10.8531
20 0.164831 0.171988 0.949825 0.947347 11.3876
21 0.158394 0.167459 0.952982 0.949747 11.9866
22 0.153353 0.161774 0.956964 0.949347 12.6433
23 0.148209 0.156644 0.957368 0.951747 13.3825
24 0.144814 0.15322 0.957018 0.955495 13.962
25 0.138782 0.148277 0.958947 0.954147 14.6
26 0.135333 0.145225 0.961228 0.956695 15.2284
27 0.129593 0.141141 0.964561 0.958295 15.7413
28 0.128265 0.136866 0.962632 0.960547 16.2711
29 0.123848 0.133444 0.966071 0.961347 16.7772
30 0.119687 0.129579 0.967193 0.964547 17.3311
31 0.115857 0.126606 0.968596 0.966547 17.8252
32 0.113911 0.124272 0.968772 0.962547 18.3121
33 0.111502 0.122548 0.968596 0.965095 18.8973
34 0.107427 0.116724 0.970526 0.969747 19.4723
35 0.104536 0.114517 0.970877 0.969095 20.0804
36 0.099408 0.112128 0.971786 0.970547 20.6509
37 0.0972982 0.107618 0.973158 0.970947 21.2467
38 0.0927064 0.104918 0.973158 0.969347 21.7978
39 0.0904702 0.101141 0.973333 0.969747 22.3328
40 0.0860733 0.0984015 0.975263 0.971747 22.8447
41 0.0829282 0.0942095 0.977544 0.974947 23.5113
42 0.082219 0.0947418 0.975965 0.969347 24.0427
43 0.0773362 0.0906804 0.977857 0.977747 24.5252
44 0.0751769 0.0886449 0.977895 0.972147 25.1722
45 0.072056 0.0916797 0.978246 0.977495 26.0778
46 0.0708111 0.0811359 0.98 0.979347 26.6648
47 0.0671919 0.0783265 0.982456 0.978947 27.2929
48 0.0658817 0.0772342 0.981754 0.977747 27.8119
49 0.0634615 0.0762576 0.983333 0.974947 28.3876
50 0.0622394 0.0710278 0.982321 0.981747 28.9067
Predicted Edible, Actual Edible
Our prediction was correct. Success!
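The loss function (plot saved as loss.png in the result directory).
And the accuracy (plot saved as accuracy.png in the result directory).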
Concepts Walkthrough¶
Define-by-Run¶
As mentioned on the top page, Chainer is a flexible framework for neural networks. One major goal is flexibility, so it must enable us to write complex architectures simply and intuitively.
Most existing deep learning frameworks are based on the “Define-and-Run” scheme. That is, first a network is defined and fixed, and then the user periodically feeds it with minibatches of training data. Since the network is statically defined before any forward/backward computation, all the logic must be embedded into the network architecture as data. Consequently, defining a network architecture in such systems (e.g. Caffe) follows a declarative approach. Note that one can still produce such a static network definition using imperative languages (e.g. torch.nn, Theano-based frameworks, and TensorFlow).
In contrast, Chainer adopts a “Define-by-Run” scheme, i.e., the network is defined dynamically via the actual forward computation. More precisely, Chainer stores the history of computation instead of programming logic. This strategy enables us to fully leverage the power of programming logic in Python. For example, Chainer does not need any magic to introduce conditionals and loops into the network definitions. The Define-by-Run scheme is the core concept of Chainer. We will show in this tutorial how to define networks dynamically.
This strategy also makes it easy to write multi-GPU parallelization, since logic comes closer to network manipulation. We will review such amenities in later sections of this tutorial.
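
To make "storing the history of computation" concrete, here is a toy sketch in plain Python. The Var class is hypothetical and far simpler than Chainer's Variable: each result remembers its parents and the local derivatives, and an ordinary Python loop decides what the graph looks like.

```python
class Var:
    """Hypothetical scalar variable that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # Walk the recorded history, summing contributions over all paths.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x = Var(3.0)
y = x
for _ in range(2):  # a plain Python loop builds the graph as it runs
    y = y * x       # after the loop, y = x**3

y.backward()
print(y.value, x.grad)  # 27.0 27.0
```

The recursive path walk is chosen for brevity; a real implementation would visit each node once in reverse topological order.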
Variables and Derivatives¶
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
As described previously, Chainer uses the “Define-by-Run” scheme, so forward computation itself defines the network.
In order to start forward computation, we have to set the input array to a chainer.Variable object.
Here we start with a simple ndarray with only one element:
>>> x_data = np.array([5], dtype=np.float32)
>>> x = Variable(x_data)
A Variable object supports basic arithmetic operators. In order to compute \(y = x^2 - 2x + 1\), just write:
>>> y = x**2 - 2 * x + 1
The resulting y is also a Variable object, whose value can be extracted by accessing the array attribute:
>>> y.array
array([16.], dtype=float32)
Note
Variable has two attributes to represent the underlying array: array and data.
There is no difference between the two; both refer to exactly the same object.
However, it is not recommended that you use .data because it might be confused with the numpy.ndarray.data attribute.
What y holds is not only the result value.
It also holds the history of computation (or computational graph), which enables us to compute its derivative.
This is done by calling its backward() method:
>>> y.backward()
This runs error backpropagation (a.k.a. backprop or reverse-mode automatic differentiation).
Then, the gradient is computed and stored in the grad attribute of the input variable x:
>>> x.grad
array([8.], dtype=float32)
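
The value 8 can be cross-checked without Chainer. For \(y = x^2 - 2x + 1\) the derivative is \(2x - 2\), which is 8 at \(x = 5\); the sketch below compares that with a central finite difference in NumPy (the step size eps is an arbitrary choice).

```python
import numpy as np

def f(x):
    return x**2 - 2 * x + 1

x0 = np.float64(5.0)
eps = 1e-6

# Central difference approximation of dy/dx at x0.
numeric = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
analytic = 2 * x0 - 2

print(analytic)                         # 8.0
print(abs(numeric - analytic) < 1e-4)   # True
```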
We can also compute gradients of intermediate variables.
Note that Chainer, by default, releases the gradient arrays of intermediate variables for memory efficiency.
In order to preserve gradient information, pass the retain_grad argument to the backward method:
>>> z = 2*x
>>> y = x**2 - z + 1
>>> y.backward(retain_grad=True)
>>> z.grad
array([-1.], dtype=float32)
All these computations can be generalized to a multi-element array input.
For a variable holding a single-element array, the initial gradient is automatically set to [1]; to start backward computation from a variable holding a multi-element array, we must set the initial error manually.
This is done simply by setting the grad attribute of the output variable:
>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x**2 - 2*x + 1
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward()
>>> x.grad
array([[ 0., 2., 4.],
[ 6., 8., 10.]], dtype=float32)
Note
Many functions taking Variable object(s) are defined in the chainer.functions module.
You can combine them to realize complicated functions with automatic backward computation.
Note
Instead of using backward(), you can also calculate gradients of any variables in a computational graph w.r.t. any other variables in the graph using the chainer.grad() function.
Higher-Order Derivatives¶
Variable also supports higher-order derivatives (a.k.a. double backpropagation).
Let’s see a simple example.
First calculate the first-order derivative.
Note that enable_double_backprop=True is passed to y.backward().
>>> x = chainer.Variable(np.array([[0, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x ** 3
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward(enable_double_backprop=True)
>>> x.grad_var
variable([[ 0., 12., 27.],
[ 48., 75., 108.]])
>>> assert x.grad_var.array is x.grad
>>> assert (x.grad == (3 * x**2).array).all()
chainer.Variable.grad_var is a Variable for chainer.Variable.grad (which is an ndarray).
By passing enable_double_backprop=True to backward(), a computational graph for the backward calculation is recorded.
So, you can start backpropagation from x.grad_var to calculate the second-order derivative.
>>> gx = x.grad_var
>>> x.cleargrad()
>>> gx.grad = np.ones((2, 3), dtype=np.float32)
>>> gx.backward()
>>> x.grad
array([[ 0., 12., 18.],
[24., 30., 36.]], dtype=float32)
>>> assert (x.grad == (6 * x).array).all()
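
The two results above match the closed forms for \(y = x^3\): the first derivative is \(3x^2\) and the second is \(6x\). A short NumPy-only check of those closed forms on the same input array:

```python
import numpy as np

x = np.array([[0, 2, 3], [4, 5, 6]], dtype=np.float32)

first = 3 * x**2   # d/dx of x**3
second = 6 * x     # d/dx of 3 * x**2

print(first.tolist())   # [[0.0, 12.0, 27.0], [48.0, 75.0, 108.0]]
print(second.tolist())  # [[0.0, 12.0, 18.0], [24.0, 30.0, 36.0]]
```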
Links¶
In order to write neural networks, we have to combine functions with parameters and optimize the parameters.
You can use the class Link to do this.
A Link is an object that holds parameters (i.e. optimization targets).
The most fundamental ones are links that behave like regular functions while replacing some arguments by their parameters. We will introduce higher level links, but here think of links as simply functions with parameters.
One of the most frequently used links is the Linear link (a.k.a. fully-connected layer or affine transformation).
It represents a mathematical function \(f(x) = Wx + b\), where the matrix \(W\) and the vector \(b\) are parameters.
This link corresponds to its pure counterpart linear(), which accepts \(x, W, b\) as arguments.
A linear link from three-dimensional space to two-dimensional space is defined by the following line:
>>> f = L.Linear(3, 2)
Note
Most functions and links only accept minibatch input, where the first dimension of the input array is considered as the batch dimension. In the above Linear link case, input must have shape of \((N, 3)\), where \(N\) is the minibatch size.
The parameters of a link are stored as attributes.
Each parameter is an instance of Variable.
In the case of the Linear link, two parameters, W and b, are stored.
By default, the matrix W is initialized randomly, while the vector b is initialized with zeros.
This is the preferred way to initialize these parameters.
>>> f.W.array
array([[ 1.0184761 ,  0.23103087,  0.5650746 ],
       [ 1.2937803 ,  1.0782351 , -0.56423163]], dtype=float32)
>>> f.b.array
array([0., 0.], dtype=float32)
An instance of the Linear link acts like a usual function:
>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = f(x)
>>> y.array
array([[3.1757617, 1.7575557],
[8.619507 , 7.1809077]], dtype=float32)
Note
Sometimes it is cumbersome to compute the dimension of the input space. The linear link and some of the (de)convolution links can omit the input dimension in their instantiation and infer it from the first minibatch.
For example, the following line creates a linear link whose output dimension is two:
>>> f = L.Linear(2)
If we feed a minibatch of shape \((2, M)\), the input dimension will be inferred as M, which means f.W will be a 2 x M matrix.
Note that its parameters are initialized in a lazy manner at the first minibatch.
Therefore, f does not have a W attribute until data is passed to the link.
Gradients of parameters are computed by the backward() method.
Note that gradients are accumulated by the method rather than overwritten.
So first you must clear the gradients to renew the computation.
It can be done by calling the cleargrads() method.
>>> f.cleargrads()
Now we can compute the gradients of parameters by simply calling the backward method and access them via the grad property.
>>> y.grad = np.ones((2, 2), dtype=np.float32)
>>> y.backward()
>>> f.W.grad
array([[5., 7., 9.],
[5., 7., 9.]], dtype=float32)
>>> f.b.grad
array([2., 2.], dtype=float32)
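
These numbers follow directly from \(f(x) = Wx + b\) applied row by row to the minibatch: the gradient with respect to W is the output gradient transposed times the input, and the gradient with respect to b is the output gradient summed over the batch. A NumPy sketch of that calculation, independent of the random values in W:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)  # minibatch, shape (2, 3)
gy = np.ones((2, 2), dtype=np.float32)                  # output gradient, shape (2, 2)

gW = gy.T @ x        # shape (2, 3), same as W
gb = gy.sum(axis=0)  # shape (2,), same as b

print(gW.tolist())  # [[5.0, 7.0, 9.0], [5.0, 7.0, 9.0]]
print(gb.tolist())  # [2.0, 2.0]
```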
Define your own function¶
In this section, you will learn about the following things:
How to define a function on variables
Useful tools to write a function using a GPU
How to test the function definition
After reading this section, you will be able to:
Write your own functions
Define simple kernels in the function definition
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
Differentiable Functions¶
Chainer provides a collection of functions in the chainer.functions module.
It covers typical use cases in deep learning, so many existing works can be implemented with them.
On the other hand, deep learning is evolving rapidly and we cannot cover all possible functions to define unseen architectures.
So it is important to learn how to define your own functions.
New-Style vs. Old-Style Functions¶
In Chainer, you can define a function in two ways: new-style and old-style.
New-style functions inherit from the chainer.FunctionNode class (introduced in Chainer v3). Forward computation can be implemented using NumPy/CuPy. Backward computation needs to be implemented by using (possibly a composition of) other new-style functions.
Old-style functions inherit from the chainer.Function class. Forward and backward computation can be implemented using NumPy/CuPy.
The primary advantage of using new-style functions is that they support computation of higher-order gradients (a.k.a. higher-order derivative or double backpropagation). Higher-order gradients are used in some models, e.g., recently-proposed GAN architectures. New-style functions are also better in terms of performance of backward, as the interface allows an implementation to skip the computation of unneeded input gradients.
Currently, most of the built-in functions are implemented in new-style (with a few exceptions listed in #4449). Basically, we recommend you use new-style when implementing new functions. However, you can still continue to use existing old-style functions for the foreseeable future.
In the following sections, we describe steps to implement user-defined functions in new-style. You can also refer to Implementing Old-Style Functions and Migrating From Old-Style Functions To New-Style Functions if you are interested.
Implementing New-Style Functions¶
First, suppose we want to define an elementwise function \(f(x, y, z) = x * y + z\).
While it is possible to implement this equation using a combination of the * and + functions, defining it as a single function may reduce memory consumption, so it is not only a toy example.
Here we call this function MulAdd.
Let’s start with defining MulAdd working on the CPU.
New-style functions must inherit the chainer.FunctionNode class.
The skeleton of a function looks like:
class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        # do forward computation on CPU
        return some_tuple

    def backward(self, target_input_indexes, grad_outputs):
        # do backward computation
        return some_tuple
We must implement the forward_cpu() and backward() methods.
In the forward_cpu() function, inputs is a tuple of array(s). You need to return a tuple of array(s), which is the result of the forward computation.
In the backward() function, grad_outputs is a tuple of Variable(s) which are the gradients with regard to each output (i.e., the length of the grad_outputs tuple equals the number of outputs returned by forward_cpu). You need to return a tuple of Variable(s) which are the gradients with regard to each input (i.e., the length of the returned tuple equals the number of inputs to forward_cpu). You can optionally use target_input_indexes (a tuple of indices required to compute gradients) to omit computing unnecessary gradients. We will show you the usage of target_input_indexes later.
Warning
Be careful to return a tuple even if you have just one array or Variable to return.
Note
Unlike old-style functions, inputs and outputs of the backward method in new-style functions are Variables.
In other words, the backward method is device agnostic; there are no backward_cpu or backward_gpu in FunctionNode.
MulAdd is simple and can be implemented as follows:
class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        # Unpack input arrays (``numpy.ndarray``).
        x, y, z = inputs

        # Mark inputs (``x`` and ``y``) as retained so that it can be
        # accessed during the backward process.
        self.retain_inputs((0, 1))

        # Compute results.
        w = x * y + z

        # Return the result as a tuple.
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        # Unpack inputs retained in the forward process (``Variable``).
        x, y = self.get_retained_inputs()

        # Get gradients w.r.t. the output (Variable).
        gw, = grad_outputs

        # Compute gradients w.r.t the inputs.
        gx = y * gw
        gy = x * gw
        gz = gw

        # Return the result as a tuple.
        return gx, gy, gz
As per the warning above, the forward_cpu() method returns a tuple of a single element.
Note that all arrays appearing in forward_cpu are numpy.ndarray.
The forward function is straightforward; it unpacks the input tuple, computes the output, and packs it into a tuple.
The backward function is a bit more complicated.
Recall the rule of differentiation of multiplication.
This example just implements the rule.
Looking at the return values, the function just packs the gradient of each input in the same order and returns them.
Given just the core computation of forward and backward, the FunctionNode class provides the chaining logic on top of it (i.e., storing the history of computation, etc.).
Note
Assuming we implement a (forward) function \(y=f(x)\) which takes as input the vector \(x \in \mathbb{R}^n\) and produces as output a vector \(y \in \mathbb{R}^m\). Then the backward method has to compute
\[\lambda_i = \sum_{j=1}^m \frac{\partial y_j}{\partial x_i} \, \gamma_j \quad \forall i \in \{1, \dots, n\}\]
where \(\gamma\) is the grad_outputs. Note that the resulting vector \(\lambda\) must have the same shape as the arguments of the forward method.
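
Read as a vector-Jacobian product, \(\lambda_i = \sum_j \frac{\partial y_j}{\partial x_i} \gamma_j\), the note can be illustrated with a linear map \(y = Ax\), for which \(\partial y_j / \partial x_i = A_{ji}\). The sketch below uses made-up numbers and plain NumPy; the matrix A and the vector gamma are assumptions chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # m = 2 outputs, n = 3 inputs, y = A @ x
gamma = np.array([1.0, -1.0])     # plays the role of grad_outputs, length m

# lambda_i = sum_j (dy_j / dx_i) * gamma_j, with dy_j / dx_i = A[j, i]
lam = np.array([sum(A[j, i] * gamma[j] for j in range(2)) for i in range(3)])

print(lam.tolist())                   # [-3.0, -3.0, -3.0]
print(np.allclose(lam, A.T @ gamma))  # True
```

The result has length n, the same shape as the input x, which is the shape requirement stated in the note.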
Now let’s define the corresponding GPU method.
You can easily predict that the method we have to write is named forward_gpu():
class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        ...

    def forward_gpu(self, inputs):
        # Unpack input arrays (``cupy.ndarray``).
        x, y, z = inputs

        # Mark inputs (``x`` and ``y``) as retained so that it can be
        # accessed during the backward process.
        self.retain_inputs((0, 1))

        # Compute results.
        w = x * y + z

        # Return the result as a tuple.
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        ...
In the forward_gpu method, arrays are of type cupy.ndarray.
We use arithmetic operators defined for this class.
These operators implement the basic elementwise arithmetic.
You may notice that the definition of forward_gpu is exactly the same as that of forward_cpu.
In that case, we can merge them into forward().
class MulAdd(FunctionNode):
    def forward(self, inputs):
        # Unpack input arrays (``numpy.ndarray`` or ``cupy.ndarray``).
        x, y, z = inputs

        # Mark inputs (``x`` and ``y``) as retained so that it can be
        # accessed during the backward process.
        self.retain_inputs((0, 1))

        # Compute results.
        w = x * y + z

        # Return the result as a tuple.
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        # Unpack inputs retained in the forward process (``Variable``).
        x, y = self.get_retained_inputs()
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
Since the cupy.ndarray class implements many methods of numpy.ndarray, we can write these unified methods in most cases.
The MulAdd function can be used as follows:
x = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
y = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
z = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
w, = MulAdd().apply((x, y, z))
It looks a bit ugly: we have to explicitly instantiate MulAdd before applying it to variables. We also have to be careful that one instance of MulAdd must not be used multiple times, since it acts as a node in the computational graph. In Chainer, we often define a thin wrapper Python function that hides the instantiation:
def muladd(x, y, z):
    # apply() returns a tuple of outputs; unpack the single result
    w, = MulAdd().apply((x, y, z))
    return w

w = muladd(x, y, z)
All functions under chainer.functions are implemented as wrapper functions like this.
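
Before trusting a hand-written backward rule, it helps to compare it with a numerical gradient. The sketch below does this for the MulAdd rule in plain NumPy, without Chainer: it treats the sum of w * gw as a scalar loss, so the analytic gradient with respect to x should be y * gw. The helper name loss and the step size eps are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z, gw = (rng.uniform(-1, 1, (3, 2)) for _ in range(4))

def loss(x, y, z):
    # Scalar whose gradient w.r.t. w = x * y + z is exactly gw.
    return np.sum((x * y + z) * gw)

# Analytic gradients from the MulAdd backward rule.
gx, gy, gz = y * gw, x * gw, gw

# Central finite differences for the gradient w.r.t. x.
eps = 1e-6
num_gx = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    d = np.zeros_like(x)
    d[idx] = eps
    num_gx[idx] = (loss(x + d, y, z) - loss(x - d, y, z)) / (2 * eps)

print(np.allclose(num_gx, gx, atol=1e-6))  # True
```

The same loop can be repeated for y and z. The gradient_check module imported at the top of this section is the tool intended for this kind of check inside Chainer.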
Unified forward/backward methods with NumPy/CuPy functions¶
CuPy implements many functions that are compatible with those of NumPy. We can write unified forward/backward methods with them. Suppose we want to write a backprop-able function \(f(x, y) = \exp(x) + \exp(y)\). We name it ExpAdd here. It can be written straightforwardly as follows:
from chainer.backends import cuda

class ExpAdd(FunctionNode):
    def forward_cpu(self, inputs):
        self.retain_inputs((0, 1))
        x, y = inputs
        z = np.exp(x) + np.exp(y)
        return z,

    def forward_gpu(self, inputs):
        self.retain_inputs((0, 1))
        cupy = cuda.cupy
        x, y = inputs
        z = cupy.exp(x) + cupy.exp(y)
        return z,

    def backward(self, target_input_indexes, grad_outputs):
        x, y = self.get_retained_inputs()
        gz, = grad_outputs

        gx = gz * F.exp(x)
        gy = gz * F.exp(y)
        return gx, gy

def expadd(x, y):
    z, = ExpAdd().apply((x, y))
    return z
Note
Here we used chainer.backends.cuda.cupy instead of directly accessing cupy.
This is because the cupy module cannot be imported if CUDA is not installed.
In order to keep the implementation valid in a non-CUDA environment, we have to defer the access to the cupy module.
Note that the chainer.backends.cuda module can be imported even if CUDA is not installed.
Of course, the module in such an environment is almost useless, but if the interpreter does not run through the code accessing CUDA-dedicated functions, the code is still valid.
The CPU and GPU implementations are almost the same, except that numpy is replaced by cupy in forward_gpu.
We can unify these functions using the chainer.backend.get_array_module() function.
This function accepts an arbitrary number of arrays, and returns an appropriate module for them.
See the following code:
class ExpAdd(FunctionNode):
    def forward(self, inputs):
        self.retain_inputs((0, 1))
        xp = backend.get_array_module(*inputs)
        x, y = inputs
        z = xp.exp(x) + xp.exp(y)
        return z,

    def backward(self, target_input_indexes, grad_outputs):
        x, y = self.get_retained_inputs()
        gz, = grad_outputs

        gx = gz * F.exp(x)
        gy = gz * F.exp(y)
        return gx, gy

def expadd(x, y):
    z, = ExpAdd().apply((x, y))
    return z
Note that this code works correctly even if CUDA is not installed in the environment.
If CUDA is not found, the get_array_module() function always returns numpy.
We often use the name xp for the variadic module name, which is analogous to the abbreviation np for NumPy and cp for CuPy.
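
The xp idiom can also be tried outside of Chainer. The sketch below uses a hypothetical helper, get_xp, that falls back to NumPy when CuPy cannot be imported and otherwise delegates to cupy.get_array_module; it is a stand-in for illustration, not Chainer's implementation.

```python
import numpy as np

def get_xp(*arrays):
    """Hypothetical stand-in for chainer.backend.get_array_module."""
    try:
        import cupy
    except ImportError:
        # No CuPy available: every array must be a NumPy array.
        return np
    return cupy.get_array_module(*arrays)

def exp_add(x, y):
    # The same code path serves NumPy and CuPy arrays.
    xp = get_xp(x, y)
    return xp.exp(x) + xp.exp(y)

out = exp_add(np.zeros(3), np.zeros(3))
print(out.tolist())  # [2.0, 2.0, 2.0]
```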
Write an Elementwise Kernel Function¶
Let’s turn back to the MulAdd example.
The GPU implementation of MulAdd as shown above is already fast and parallelized on GPU cores.
However, it invokes two kernels during each of the forward (w = x * y + z) and backward (gx = y * gw and gy = x * gw) computations.
This might hurt performance, since the intermediate temporary arrays are read and written by possibly different GPU cores, which consumes a lot of bandwidth.
We can reduce the number of invocations by defining our own kernel.
It also reduces the memory consumption.
CuPy provides a useful tool to define elementwise kernels, the cupy.ElementwiseKernel class, and Chainer wraps it with the chainer.backends.cuda.elementwise() function.
Our MulAdd implementation can be improved as follows:
class MulAdd(FunctionNode):
    def forward_cpu(self, inputs):
        self.retain_inputs((0, 1))
        x, y, z = inputs
        w = x * y + z
        return w,

    def forward_gpu(self, inputs):
        self.retain_inputs((0, 1))
        x, y, z = inputs
        # One fused kernel instead of separate multiply and add kernels
        w = cuda.elementwise(
            'float32 x, float32 y, float32 z',
            'float32 w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward(self, target_input_indexes, grad_outputs):
        # Only x and y were retained in the forward pass
        x, y = self.get_retained_inputs()
        gw, = grad_outputs
        return MulAddGrad().apply((x, y, gw))


class MulAddGrad(FunctionNode):
    def forward_cpu(self, inputs):
        x, y, gw = inputs
        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz

    def forward_gpu(self, inputs):
        x, y, gw = inputs
        gx, gy = cuda.elementwise(
            'float32 x, float32 y, float32 gw',
            'float32 gx, float32 gy',
            '''
               gx = y * gw;
               gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)
        gz = gw
        return gx, gy, gz

    def backward(self, target_input_indexes, grad_outputs):
        # You can leave this unimplemented unless you need to compute
        # higher-order derivative using this function.
        raise NotImplementedError()
The chainer.backends.cuda.elementwise() function accepts the essential implementation of the kernel function and returns a kernel invocation function (actually, it returns an ElementwiseKernel object, which is callable).
In typical usage, we pass four arguments to this function as follows:
Input argument list. This is a comma-separated string, each entry of which consists of a type specification and an argument name.
Output argument list in the same format as the input argument list.
Body of parallel loop. We can use the input/output argument names as an element of these arrays.
Name of the kernel function, which is shown in debuggers and profilers.
The above code is not compiled on every forward/backward computation, thanks to two caching mechanisms provided by chainer.backends.cuda.elementwise().
The first one is binary caching: the chainer.backends.cuda.elementwise() function caches the compiled binary in the $(HOME)/.cupy/kernel_cache directory with a hash value of the CUDA code, and reuses it if the given code matches the hash value.
This caching mechanism is actually implemented in CuPy.
The second one is upload caching: given a compiled binary code, we have to upload it to the current GPU in order to execute it.
The chainer.backends.cuda.elementwise() function memoizes the arguments and the current device, and if it is called with the same arguments for the same device, it reuses the previously uploaded kernel code.
The above MulAdd code only works for float32 arrays.
ElementwiseKernel also supports type-variadic kernel definitions.
In order to define variadic kernel functions, you can use a type placeholder by placing a single character as the type specifier:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'T x, T y, T z',
            'T w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs
        gx, gy = cuda.elementwise(
            'T x, T y, T gw',
            'T gx, T gy',
            '''
            gx = y * gw;
            gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)
        gz = gw
        return gx, gy, gz
The type placeholder T indicates an arbitrary data type that CuPy supports.
CuPy offers more functionality for user-defined kernels. See the CuPy documentation on user-defined kernels for more details.
Advanced Topics¶
Write a function with training/test mode¶
We sometimes want to make a function behave differently in training and test modes.
The training/test mode in Chainer is configured by chainer.config.
This is a thread-local configuration object, and users can assign True or False to its train attribute.
You can refer to Configuring Chainer to see how to configure this flag as well as other configuration items.
Here, we just show how to use this flag to make a function support training/test mode.
You will need to check the value of the boolean flag chainer.config.train and branch appropriately.
For example, consider the following simple dropout function:
def dropout(x):
    xp = backend.get_array_module(x.array)
    mask = 2 * (xp.random.rand(*x.shape) > 0.5).astype(x.dtype)
    return x * mask
This function applies dropout to each element and doubles the surviving elements to preserve the scale. The above implementation applies dropout even in test mode, which is not the desired behavior. We can fix it as follows:
def dropout(x):
    if not chainer.config.train:
        return x
    xp = backend.get_array_module(x.array)
    mask = 2 * (xp.random.rand(*x.shape) > 0.5).astype(x.dtype)
    return x * mask
The function now supports test mode.
Note that you usually do not have to implement your own dropout function because dropout() is officially provided.
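The effect of the train/test branch can be seen in a plain NumPy sketch. The function dropout_np and its explicit train argument are assumptions made for this illustration; in Chainer itself the flag is read from chainer.config.train rather than passed in.

```python
import numpy as np


def dropout_np(x, train, rng):
    """NumPy-only sketch: drop with probability 0.5, double the survivors."""
    if not train:
        # Test mode: the input passes through unchanged.
        return x
    mask = 2 * (rng.random(x.shape) > 0.5).astype(x.dtype)
    return x * mask


rng = np.random.default_rng(0)
x = np.ones((1000, 100), dtype=np.float32)

y_train = dropout_np(x, True, rng)   # entries are either 0 or 2
y_test = dropout_np(x, False, rng)   # identical to x
```

Because survivors are doubled, the mean of y_train stays close to the mean of x, which is why no rescaling is needed in test mode. In Chainer, test mode is typically entered with a block such as with chainer.using_config('train', False).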
Testing Functions¶
In order to isolate the cause of learning failure from implementation bugs, it is important to test function implementations.
Chainer provides simple utilities to help write unit tests.
They are defined in the gradient_check module.
The most important test utility is the numerical_grad() function.
This function computes the numerical gradient of a given function using finite differences.
It can be used as follows:
x = np.random.randn(4, 3).astype(np.float32)
gy = np.ones((4, 3), dtype=np.float32)
f = lambda: (x * x,)
gx = gradient_check.numerical_grad(f, (x,), (gy,))
f is a closure that returns a tuple of array(s) computed from input arrays.
The second and third arguments of numerical_grad() are tuples of input arrays and output gradient arrays, respectively.
The code above computes the numerical gradients of sum(f(x)), where sum indicates the summation over all elements.
The summation can be weighted by changing gy.
The numerical_grad() function also accepts an additional eps argument, which indicates the quantization width of finite differences.
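The idea behind a finite-difference gradient can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Chainer's implementation: numerical_grad_np is a hypothetical name, it handles a single input array, it uses central differences, and it works in float64 to keep rounding error small.

```python
import numpy as np


def numerical_grad_np(f, x, gy, eps=1e-3):
    """Central-difference gradient of sum(f() * gy) with respect to x.

    f is a closure over x, so x is perturbed in place and then restored.
    """
    gx = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        y_plus = f()
        x[idx] = orig - eps
        y_minus = f()
        x[idx] = orig  # restore the original value
        gx[idx] = np.sum((y_plus - y_minus) * gy) / (2 * eps)
    return gx


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
gy = np.ones((4, 3))
gx = numerical_grad_np(lambda: x * x, x, gy)  # analytic gradient is 2 * x
```

For the quadratic used here the central difference is exact up to rounding, so gx matches 2 * x very closely; for other functions the error depends on eps.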
Note
The numerical_grad() function accepts both CPU and GPU arrays.
Note that we cannot mix CPU and GPU arrays.
Another utility is the chainer.testing.assert_allclose() function.
This is similar to the numpy.testing.assert_allclose() function.
The difference is that Chainer’s version accepts CPU and GPU arrays as inputs.
We can mix them in one invocation of chainer.testing.assert_allclose().
The default values of optional arguments are also different.
Here is a typical usage of gradient checking utilities.
This is a test example of the functions.relu() function:
import unittest
from chainer import testing

class TestReLU(unittest.TestCase):
    def test_backward_cpu(self):
        x = Variable(np.random.randn(3, 2).astype(np.float32))
        y = F.relu(x)
        y.grad = np.random.randn(3, 2).astype(np.float32)
        y.backward(retain_grad=True)

        def f():
            return F.relu(x).array,

        gx, = gradient_check.numerical_grad(f, (x.array,), (y.grad,))
        testing.assert_allclose(gx, x.grad)
The first four lines of the test code are a simple forward and backward computation of the ReLU function. The next two lines compute the numerical gradient using the same forward function without the backward routine. Finally, we compare these two results elementwise. Note that the above test code can be easily modified to test the GPU version just by replacing CPU arrays with GPU arrays.
In most cases, we do not write code like the above explicitly, because Chainer offers a utility function chainer.gradient_check.check_backward() that follows this procedure.
import unittest
from chainer import gradient_check

class TestReLU(unittest.TestCase):
    def test_backward_cpu(self):

        def f(x):
            return F.relu(x)

        x = np.random.randn(3, 2).astype(np.float32)
        y_grad = np.random.randn(3, 2).astype(np.float32)

        gradient_check.check_backward(f, x, y_grad, atol=1e-4, rtol=1e-4)
You can find many examples of function tests under the tests/chainer_tests/functions_tests directory.
You can use chainer.gradient_check.check_double_backward() to run a gradient check for the second-order gradient computed by new-style functions.
This function runs two backpropagations: the first computes the gradient gx of y w.r.t. x, and the second computes the gradient of gx w.r.t. x.
It can be used like check_backward(), but check_double_backward() expects an additional argument x_grad_grad, which is an array or a tuple of arrays used for initializing the gradient array of each gradient w.r.t. an input.
In other words, this argument is used to initialize gx.grad for the second backprop.
Implementing User-Defined Links¶
Some functions are meant to be combined with parameters.
In such cases, it is useful to write a small link that wraps the function.
We have already seen how to define a chain that wraps other links (by inheriting the Chain class) in Creating Models.
Here we study how to define a link that does not hold any other links.
As the first example, suppose that we want to implement an elementwise product function between the input array and the parameter array. It can be defined as follows:
class EltwiseParamProduct(Link):
    def __init__(self, shape):
        super(EltwiseParamProduct, self).__init__()
        with self.init_scope():
            self.W = chainer.Parameter(initializers.Normal(scale=1.), shape)

    def __call__(self, x):
        return self.W * x
For another example, assume we want to define a simple linear layer.
It is already defined as chainer.links.Linear, so this is an educational example.
The linear layer is divided into two parts: a function and its wrapper link.
First, we have to define a function on variables:
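class LinearFunction(FunctionNode):
    def forward(self, inputs):
        # Retain x and W, which are needed to compute the gradients.
        self.retain_inputs((0, 1))
        x, W, b = inputs
        return x.dot(W.T) + b,

    def backward(self, target_input_indexes, grad_outputs):
        # In FunctionNode, retained inputs and gradients are Variables,
        # so the gradients are built from differentiable functions.
        x, W = self.get_retained_inputs()
        gy, = grad_outputs
        gx = F.matmul(gy, W)
        gW = F.matmul(gy, x, transa=True)
        gb = F.sum(gy, axis=0)
        return gx, gW, gb

def linear(x, W, b):
    # FunctionNode is not callable; apply() returns a tuple of outputs.
    y, = LinearFunction().apply((x, W, b))
    return y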
This function takes three arguments: input, weight, and bias. It can be used as a part of a model definition, though it is inconvenient since the user has to manage the weight and bias parameters directly. In order to make a convenient module, let’s wrap it into a link:
class Linear(Link):
    def __init__(self, in_size, out_size):
        super(Linear, self).__init__()
        with self.init_scope():
            self.W = chainer.Parameter(
                initializers.Normal(1. / math.sqrt(in_size)),
                (out_size, in_size))
            self.b = chainer.Parameter(0, (out_size,))

    def __call__(self, x):
        return linear(x, self.W, self.b)
This link hides the parameters of the linear layer.
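The gradient formulas behind the linear function (gx = gy W, gW = gy^T x, and gb as the column sums of gy) can be sanity-checked with plain NumPy. The following is only a sketch with made-up shapes and random data; it compares one entry of gW with a central finite difference of the scalar sum((x W^T + b) * gy).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))    # batch of 5 inputs with 4 features
W = rng.standard_normal((3, 4))    # weight of shape (out_size, in_size)
b = rng.standard_normal(3)
gy = rng.standard_normal((5, 3))   # upstream gradient w.r.t. the output

# Analytic gradients of the linear function y = x W^T + b.
gx = gy.dot(W)
gW = gy.T.dot(x)
gb = gy.sum(axis=0)


def weighted_output(W_):
    # Scalar whose gradient w.r.t. W_ equals gW.
    return np.sum((x.dot(W_.T) + b) * gy)


# Central finite difference for the single entry W[1, 2].
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
fd = (weighted_output(W_plus) - weighted_output(W_minus)) / (2 * eps)
```

Each gradient has the same shape as the input it belongs to, and fd agrees with gW[1, 2] up to rounding error.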
Note
An advanced tip to implement functions: if you want to preserve some information between forward and backward computations (e.g. to cache some arrays), you can store it as attributes. Be careful that it might increase the memory consumption during the whole forward-backward computation. If you want to train very large networks on a GPU with limited memory, it is not recommended that you cache arrays between forward and backward. There is one exception to this: caching the output arrays does not change the memory consumption, because they are also held by the output Variable objects.
Warning
You should not assume a one-to-one match of calls of forward and backward. Some users may call backward more than once after one forward call.
Migrating From Old-Style Functions To New-Style Functions¶
Here are the key differences between Function and FunctionNode.
Implementing forward computation (difference between chainer.Function.forward() and chainer.FunctionNode.forward())
There is no difference between Function and FunctionNode except that the input arrays are NOT retained by default.
If you want the inputs to be retained to use them in backward, call retain_inputs() explicitly. In other words, self.retain_inputs(()) has no effect in FunctionNode.
Implementing backward computation (difference between chainer.Function.backward() and chainer.FunctionNode.backward())
Arguments to the method have been changed.
The inputs argument is no longer passed. You can use get_retained_inputs() and get_retained_outputs() to retrieve the inputs/outputs retained in the forward method. Note that grad_outputs and these retained inputs/outputs are all given as Variable objects, and the backward method must return a tuple of Variable objects.
The target_input_indexes argument has been added. It contains the sorted indices of the input variables w.r.t. which the gradients are required. You can use it to skip calculation of unneeded gradients. The use of target_input_indexes is optional; it is acceptable to calculate and return all gradients.
All inputs (grad_outputs) and retained values are given as Variable in FunctionNode, whereas they are given as ndarray in Function.
Invoking forward computation
Function is a callable, whereas FunctionNode is not.
You need to use f.apply((x,)) instead of f(x). Note that apply() always returns outputs as a tuple even if the function generates only one output value.
When migrating from old-style to new-style, you will typically need to write a new function class that implements the first-order gradient of the original function.
Here is an example of rewriting the old-style unary function MyOldFunc as the new-style function MyFunc.
class MyOldFunc(chainer.Function):

    def forward(self, inputs):
        x, = inputs
        ...  # forward computation code
        return y,

    def backward(self, inputs, grad_outputs):
        x, = inputs
        gy, = grad_outputs
        ...  # backward computation code
        return gx,

class MyFunc(chainer.FunctionNode):

    def forward(self, inputs):
        self.retain_inputs((0,))
        x, = inputs
        ...  # forward computation code in MyOldFunc
        return y,

    def backward(self, target_input_indexes, grad_outputs):
        x, = self.get_retained_inputs()
        gy, = grad_outputs
        gx, = MyFuncGrad().apply((x, gy))
        return gx,

class MyFuncGrad(chainer.FunctionNode):

    def forward(self, inputs):
        x, gy = inputs
        ...  # backward computation code in MyOldFunc
        return gx,

    def backward(self, target_input_indexes, grad_outputs):
        # You can leave this unimplemented unless you need to compute
        # higher-order derivative using this function.
        raise NotImplementedError()
Implementing Old-Style Functions¶
Note
As noted in New-Style v.s. Old-Style Functions, we recommend that you use new-style for newly implemented functions. This section uses the same example as in Implementing New-Style Functions but using old-style.
First, suppose we want to define an elementwise function \(f(x, y, z) = x * y + z\).
While it is possible to implement this equation using a combination of the * and + functions, defining it as a single function may reduce memory consumption, so it is not only a toy example.
Here we call this function MulAdd.
Let’s start with defining MulAdd working on the CPU.
Old-style functions must inherit the Function class.
The skeleton of a function looks like:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        # do forward computation on CPU
        return some_tuple

    def backward_cpu(self, inputs, grad_outputs):
        # do backward computation on CPU
        return some_tuple
We must implement the forward_cpu() and backward_cpu() methods.
The non-self arguments of these functions are tuples of array(s), and these functions must return a tuple of array(s).
Warning
Be careful to return a tuple of arrays even if you have just one array to return.
MulAdd is simple and implemented as follows:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward_cpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
As per the warning above, the forward_cpu method returns a tuple of a single element.
Note that all arrays appearing in CPU functions are numpy.ndarray.
The forward function is straightforward; it unpacks the input tuple, computes the output, and packs it into a tuple.
The backward function is a bit more complicated.
Recall the rule of differentiation of multiplication.
This example just implements the rule.
Look at the return values: the function just packs the gradient of each input in the same order and returns them.
By just defining the core computation of forward and backward, the Function class provides chaining logic on top of it (i.e., storing the history of computation, etc.).
Note
Assuming we implement a (forward) function \(y=f(x)\) which takes as input the vector \(x \in \mathbb{R}^n\) and produces as output a vector \(y \in \mathbb{R}^m\). Then the backward method has to compute
\[\lambda_i = \sum_{j=1}^m \frac{\partial y_j}{\partial x_i} \, \gamma_j \quad \text{for } i = 1 \dots n\]
where \(\gamma\) is the grad_outputs. Note that the resulting vector \(\lambda\) must have the same shape as the arguments of the forward method.
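As a small worked instance of this note, read as the vector-Jacobian product \(\lambda = J^\top \gamma\), consider the linear map \(y = Ax\), whose Jacobian is simply \(A\). The matrix and vectors below are made-up values for illustration only.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # m = 2 outputs, n = 3 inputs
x = np.array([1.0, 0.0, -1.0])
gamma = np.array([10.0, 1.0])     # plays the role of grad_outputs

y = A.dot(x)                      # forward: y_j = sum_i A[j, i] * x[i]

# backward: lambda_i = sum_j (dy_j / dx_i) * gamma_j = sum_j A[j, i] * gamma_j
lam = A.T.dot(gamma)
```

Here lam has three entries, matching the shape of x, while gamma has two entries, matching the shape of y.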
Now let’s define the corresponding GPU methods.
You can easily predict that the methods we have to write are named forward_gpu() and backward_gpu():
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
In GPU methods, arrays are of type cupy.ndarray.
We use arithmetic operators defined for this class.
These operators implement the basic elementwise arithmetic.
You may find that the definitions of the GPU methods are exactly the same as those of the CPU methods.
In that case, we can reduce them to forward() and backward() methods.
class MulAdd(Function):
    def forward(self, inputs):
        x, y, z = inputs
        w = x * y + z
        return w,

    def backward(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx = y * gw
        gy = x * gw
        gz = gw
        return gx, gy, gz
Since the cupy.ndarray class implements many methods of numpy.ndarray, we can write these unified methods in most cases.
The MulAdd function can be used as follows:
x = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
y = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
z = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
w = MulAdd()(x, y, z)
It looks a bit ugly: we have to explicitly instantiate MulAdd before applying it to variables. We also have to be careful that one instance of MulAdd must not be used multiple times, since it acts as a node in the computational graph. In Chainer, we often define a thin wrapper Python function that hides the instantiation:
def muladd(x, y, z):
    return MulAdd()(x, y, z)

w = muladd(x, y, z)
All functions under chainer.functions are implemented as wrapper functions like this.
Unified forward/backward methods with NumPy/CuPy functions¶
CuPy implements many functions that are compatible with those of NumPy. We can write unified forward/backward methods with them. Suppose that we want to write a backprop-able function \(f(x, y) = \exp(x) + \exp(y)\). We name it ExpAdd here. It can be written straightforwardly as follows:
from chainer.backends import cuda

class ExpAdd(Function):
    def forward_cpu(self, inputs):
        x, y = inputs
        z = np.exp(x) + np.exp(y)
        return z,

    def backward_cpu(self, inputs, grad_outputs):
        x, y = inputs
        gz, = grad_outputs

        gx = gz * np.exp(x)
        gy = gz * np.exp(y)
        return gx, gy

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y = inputs
        z = cupy.exp(x) + cupy.exp(y)
        return z,

    def backward_gpu(self, inputs, grad_outputs):
        cupy = cuda.cupy
        x, y = inputs
        gz, = grad_outputs

        gx = gz * cupy.exp(x)
        gy = gz * cupy.exp(y)
        return gx, gy

def expadd(x, y):
    return ExpAdd()(x, y)
Note
Here we used chainer.backends.cuda.cupy instead of directly accessing cupy.
This is because the cupy module cannot be imported if CUDA is not installed.
In order to keep the implementation valid in a non-CUDA environment, we have to defer the access to the cupy module.
Note that the chainer.backends.cuda module can be imported even if CUDA is not installed.
Of course, the module is almost useless in such an environment, but as long as the interpreter does not run through code that accesses CUDA-dedicated functions, the code is still valid.
The CPU and GPU implementations are almost the same, except that numpy is replaced by cupy in GPU methods.
We can unify these functions using the chainer.backend.get_array_module() function.
This function accepts an arbitrary number of arrays and returns an appropriate module for them.
See the following code:
class ExpAdd(Function):
    def forward(self, inputs):
        xp = backend.get_array_module(*inputs)
        x, y = inputs
        z = xp.exp(x) + xp.exp(y)
        return z,

    def backward(self, inputs, grad_outputs):
        xp = backend.get_array_module(*inputs)
        x, y = inputs
        gz, = grad_outputs

        gx = gz * xp.exp(x)
        gy = gz * xp.exp(y)
        return gx, gy

def expadd(x, y):
    return ExpAdd()(x, y)
Note that this code works correctly even if CUDA is not installed in the environment.
If CUDA is not found, the get_array_module() function always returns numpy.
We often use the name xp for the variadic module name, which is analogous to the abbreviations np for NumPy and cp for CuPy.
Write an Elementwise Kernel Function¶
Let’s turn back to the MulAdd example.
The GPU implementation of MulAdd as shown above is already fast and parallelized on GPU cores.
However, it invokes two kernels during each of the forward (w = x * y + z) and backward (gx = y * gw and gy = x * gw) computations.
This might hurt performance, since the intermediate temporary arrays are read and written by possibly different GPU cores, which consumes a lot of bandwidth.
We can reduce the number of invocations by defining our own kernel.
It also reduces memory consumption.
Most functions only require elementwise operations like MulAdd.
CuPy provides a useful tool to define elementwise kernels, the cupy.ElementwiseKernel class, and Chainer wraps it with the chainer.backends.cuda.elementwise() function.
Our MulAdd implementation can be improved as follows:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'float32 x, float32 y, float32 z',
            'float32 w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'float32 x, float32 y, float32 gw',
            'float32 gx, float32 gy',
            '''
            gx = y * gw;
            gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz
The chainer.backends.cuda.elementwise() function accepts the essential implementation of the kernel function and returns a kernel invocation function (actually, it returns an ElementwiseKernel object, which is callable).
In typical usage, we pass four arguments to this function as follows:
Input argument list. This is a comma-separated string, each entry of which consists of a type specification and an argument name.
Output argument list in the same format as the input argument list.
Body of parallel loop. We can use the input/output argument names as an element of these arrays.
Name of the kernel function, which is shown in debuggers and profilers.
The above code is not compiled on every forward/backward computation, thanks to two caching mechanisms provided by chainer.backends.cuda.elementwise().
The first one is binary caching: the chainer.backends.cuda.elementwise() function caches the compiled binary in the $(HOME)/.cupy/kernel_cache directory with a hash value of the CUDA code, and reuses it if the given code matches the hash value.
This caching mechanism is actually implemented in CuPy.
The second one is upload caching: given a compiled binary code, we have to upload it to the current GPU in order to execute it.
The chainer.backends.cuda.elementwise() function memoizes the arguments and the current device, and if it is called with the same arguments for the same device, it reuses the previously uploaded kernel code.
The above MulAdd code only works for float32 arrays.
ElementwiseKernel also supports type-variadic kernel definitions.
In order to define variadic kernel functions, you can use a type placeholder by placing a single character as the type specifier:
class MulAdd(Function):
    def forward_cpu(self, inputs):
        ...

    def backward_cpu(self, inputs, grad_outputs):
        ...

    def forward_gpu(self, inputs):
        cupy = cuda.cupy
        x, y, z = inputs
        w = cuda.elementwise(
            'T x, T y, T z',
            'T w',
            'w = x * y + z',
            'muladd_fwd')(x, y, z)
        return w,

    def backward_gpu(self, inputs, grad_outputs):
        x, y, z = inputs
        gw, = grad_outputs

        gx, gy = cuda.elementwise(
            'T x, T y, T gw',
            'T gx, T gy',
            '''
            gx = y * gw;
            gy = x * gw;
            ''',
            'muladd_bwd')(x, y, gw)

        gz = gw
        return gx, gy, gz
The type placeholder T indicates an arbitrary data type that CuPy supports.
CuPy offers more functionality for user-defined kernels. See the CuPy documentation on user-defined kernels for more details.
Creating Models¶
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
Most neural network architectures contain multiple links. For example, a multilayer perceptron consists of multiple linear layers. We can write complex procedures with parameters by combining multiple links like this:
>>> l1 = L.Linear(4, 3)
>>> l2 = L.Linear(3, 2)
>>> def my_forward(x):
...     h = l1(x)
...     return l2(h)
Here L indicates the links module.
A procedure with parameters defined in this way is hard to reuse.
A more Pythonic way is to combine the links and procedures into a class:
>>> class MyProc(object):
...     def __init__(self):
...         self.l1 = L.Linear(4, 3)
...         self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)
In order to make it more reusable, we want to support parameter management, CPU/GPU migration, robust and flexible save/load features, etc.
These features are all supported by the Chain class in Chainer.
Then, all we have to do here is define the above class as a subclass of Chain:
>>> class MyChain(Chain):
...     def __init__(self):
...         super(MyChain, self).__init__()
...         with self.init_scope():
...             self.l1 = L.Linear(4, 3)
...             self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)
It shows how a complex chain is constructed from simpler links.
Links like l1 and l2 are called child links of MyChain.
Note that Chain itself inherits Link.
It means we can define more complex chains that hold MyChain objects as their child links.
Note
We often define the forward computation of a link in a single forward method.
Such links and chains are callable and behave like regular functions of Variables.
Another way to define a chain is using the ChainList class, which behaves like a list of links:
>>> class MyChain2(ChainList):
...     def __init__(self):
...         super(MyChain2, self).__init__(
...             L.Linear(4, 3),
...             L.Linear(3, 2),
...         )
...
...     def forward(self, x):
...         h = self[0](x)
...         return self[1](h)
ChainList can conveniently hold an arbitrary number of links; however, if the number of links is fixed as in the above case, the Chain class is recommended as a base class.
Optimizer¶
From the previous guide on Creating Models, let’s use the MyChain class:
>>> class MyChain(Chain):
...     def __init__(self):
...         super(MyChain, self).__init__()
...         with self.init_scope():
...             self.l1 = L.Linear(4, 3)
...             self.l2 = L.Linear(3, 2)
...
...     def forward(self, x):
...         h = self.l1(x)
...         return self.l2(h)
To tune parameter values to minimize the loss, etc., we have to optimize them with the Optimizer class.
It runs a numerical optimization algorithm on a given link.
Many algorithms are implemented in the optimizers module.
Here we use the simplest one, called Stochastic Gradient Descent (SGD):
>>> model = MyChain()
>>> optimizer = optimizers.SGD().setup(model)
The method setup() prepares for the optimization given a link.
Some parameter/gradient manipulations, e.g. weight decay and gradient clipping, can be done by setting hook functions to the optimizer. Hook functions are called after the gradient computation and right before the actual update of parameters. For example, we can set weight decay regularization by running the next line beforehand:
>>> optimizer.add_hook(chainer.optimizer_hooks.WeightDecay(0.0005))
Of course, you can write your own hook functions. A hook should be a function or a callable object.
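To illustrate the order of operations (gradient computation, then hook, then update), here is a NumPy-only sketch. The functions weight_decay_hook and sgd_step are hypothetical names for this illustration and do not reproduce Chainer's hook interface; the sketch assumes the usual weight decay form in which the decay rate times the parameter is added to the gradient.

```python
import numpy as np


def weight_decay_hook(param, grad, rate=0.0005):
    # Applied after the gradient is computed and before the update.
    return grad + rate * param


def sgd_step(param, grad, lr=0.01):
    # Plain SGD update.
    return param - lr * grad


w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])     # gradient from backpropagation (made-up values)

g = weight_decay_hook(w, g)  # becomes [0.5005, 0.499]
w_new = sgd_step(w, g)
```

The hook only modifies the gradient; the optimizer's update rule then consumes the modified gradient as usual.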
There are two ways to use the optimizer.
One is using it via Trainer, which automates the training loop; see the Trainer guide.
The other way is using it directly, which we review here.
There are two further ways to use the optimizer directly.
One is manually computing gradients and then calling the update() method with no arguments.
Do not forget to clear the gradients beforehand!
>>> x = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)
>>> model.cleargrads()
>>> # compute gradient here...
>>> loss = F.sum(model(chainer.Variable(x)))
>>> loss.backward()
>>> optimizer.update()
The other way is just passing a loss function to the update() method.
In this case, cleargrads() is automatically called by the update method, so the user does not have to call it manually.
>>> def lossfun(arg1, arg2):
...     # calculate loss
...     loss = F.sum(model(arg1 - arg2))
...     return loss
>>> arg1 = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)
>>> arg2 = np.random.uniform(-1, 1, (2, 4)).astype(np.float32)
>>> optimizer.update(lossfun, chainer.Variable(arg1), chainer.Variable(arg2))
See chainer.Optimizer.update() for the full specification.
Trainer¶
When we want to train neural networks, we have to run training loops that update the parameters many times. A typical training loop consists of the following procedures:
1. Iterations over training datasets
2. Preprocessing of extracted mini-batches
3. Forward/backward computations of the neural networks
4. Parameter updates
5. Evaluations of the current parameters on validation datasets
6. Logging and printing of the intermediate results
Chainer provides a simple yet powerful way to make it easy to write such training processes. The training loop abstraction mainly consists of two components:
Dataset abstraction. It implements 1 and 2 in the above list. The core components are defined in the dataset module. There are also many implementations of datasets and iterators in the datasets and iterators modules, respectively.
Trainer. It implements 3, 4, 5, and 6 in the above list. The whole procedure is implemented by Trainer. The way to update parameters (3 and 4) is defined by Updater, which can be freely customized. 5 and 6 are implemented by instances of Extension, which appends an extra procedure to the training loop. Users can freely customize the training procedure by adding extensions. Users can also implement their own extensions.
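For orientation, the six procedures listed above can be written out by hand for a toy problem. The sketch below is plain NumPy, not the Trainer API: it fits a linear model to synthetic, noise-free data, and every name and hyperparameter in it is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data; true_w is the weight vector to recover.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((200, 3))
t = X.dot(true_w)
X_train, t_train = X[:160], t[:160]
X_val, t_val = X[160:], t[160:]

w = np.zeros(3)
lr, batch_size = 0.05, 16
log = []

for epoch in range(20):
    perm = rng.permutation(len(X_train))             # 1. iterate over the dataset
    for i in range(0, len(perm), batch_size):
        idx = perm[i:i + batch_size]
        xb, tb = X_train[idx], t_train[idx]          # 2. extract a mini-batch
        err = xb.dot(w) - tb                         # 3. forward computation
        grad = 2 * xb.T.dot(err) / len(idx)          # 3. backward computation
        w -= lr * grad                               # 4. parameter update
    val_loss = float(np.mean((X_val.dot(w) - t_val) ** 2))  # 5. evaluation
    log.append(val_loss)                             # 6. logging
```

In Chainer, steps 1 and 2 correspond to the dataset and iterator, steps 3 and 4 to the Updater, and steps 5 and 6 to extensions, so this boilerplate does not have to be rewritten for each model.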
Trainer Extensions¶
In this section, you will learn about the following topics:
How to create your own trainer extension
What is trainer Extension?¶
Extension is a callable object that takes a Trainer object as an argument. By adding an Extension to a Trainer using the extend() method, the Extension will be called according to the schedule specified by a trigger object (see the details in 1. trigger).
The Trainer object contains all information used in a training loop, e.g., models, optimizers, updaters, iterators, and datasets. This makes it possible to change settings such as the learning rate of an optimizer.
Write a simple function¶
You can make a new Extension
by writing a simple function which takes a Trainer
object as its argument. For example, when you want to reduce the learning rate periodically during training, an lr_drop
extension can be written as follows:
def lr_drop(trainer):
    trainer.updater.get_optimizer('main').lr *= 0.1
Then you can add this function to a Trainer
object via extend()
method.
trainer.extend(lr_drop, trigger=(10, 'epoch'))
It lowers the learning rate every 10 epochs by multiplying the current learning rate by 0.1.
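To make the effect of this schedule concrete, here is a small sketch in plain Python. FakeOptimizer and run are hypothetical stand-ins invented for this illustration; they only mimic an optimizer with an lr attribute and a loop that fires the extension every `period` epochs, and are not part of Chainer.

```python
class FakeOptimizer:
    """Hypothetical stand-in exposing only an ``lr`` attribute."""

    def __init__(self, lr):
        self.lr = lr


def lr_drop(optimizer):
    # Same update as the extension above, applied directly to the optimizer.
    optimizer.lr *= 0.1


def run(n_epochs, period, optimizer):
    for epoch in range(1, n_epochs + 1):
        # ... one epoch of training would happen here ...
        if epoch % period == 0:  # mimics trigger=(period, 'epoch')
            lr_drop(optimizer)
    return optimizer.lr


# After 30 epochs the rate has been multiplied by 0.1 three times.
final_lr = run(n_epochs=30, period=10, optimizer=FakeOptimizer(lr=0.01))
```

Starting from 0.01, the learning rate ends up at roughly 0.00001 after epochs 10, 20, and 30 have each triggered a drop.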
Write a function decorated with @make_extension¶
make_extension()
is a decorator that adds some attributes to a given function. For example, the simple extension we created above can be written in this form:
@training.make_extension(trigger=(10, 'epoch'))
def lr_drop(trainer):
    trainer.updater.get_optimizer('main').lr *= 0.1
The difference between the above example and this is whether it has a default trigger
or not. In the latter case, lr_drop()
has its default trigger
so that unless another trigger
is specified via extend()
method, the trigger
specified in make_extension()
is used by default. The code below acts the same as the former example, i.e., it reduces the learning rate every 10 epochs.
trainer.extend(lr_drop)
There are several attributes you can add using the make_extension()
decorator.
1. trigger¶
trigger
is an object that takes a Trainer
object as an argument and returns a boolean value. If a tuple in the form (period, unit)
is given as a trigger, it will be considered as an IntervalTrigger
that invokes the extension every period
unit
. For example, when the given tuple is (10, 'epoch')
, the extension will run every 10 epochs.
trigger can also be given to the extend() method that adds an extension to a Trainer object. The priority of triggers is as follows:
When both extend() and a given Extension have triggers, the trigger given to extend() is used.
When None is given to extend() as the trigger argument and a given Extension has a trigger, the trigger given to the Extension is used.
When the trigger attributes in both extend() and the Extension are None, the Extension will be fired every iteration.
See the details in the documentation of get_trigger()
for more information.
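The three rules above amount to a simple fallback chain. The function below is a hypothetical sketch of that decision only (resolve_trigger is not a Chainer function, and the real logic lives inside Trainer.extend()); it uses (1, 'iteration') to stand for "fired every iteration".

```python
def resolve_trigger(extend_trigger=None, extension_trigger=None):
    """Return the trigger that would take effect under the rules above."""
    if extend_trigger is not None:
        # Rule 1: a trigger passed to extend() always wins.
        return extend_trigger
    if extension_trigger is not None:
        # Rule 2: otherwise fall back to the extension's own default.
        return extension_trigger
    # Rule 3: neither is given, so fire every iteration.
    return (1, 'iteration')


chosen = resolve_trigger(extend_trigger=(5, 'epoch'),
                         extension_trigger=(10, 'epoch'))
```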
2. default_name¶
An Extension
is kept in a dictionary which is a property in a Trainer
. This argument gives the name of the Extension
. Users will see this name in the keys of the snapshot which is a dictionary generated by serialization.
3. priority¶
As a Trainer
object can be assigned multiple Extension
objects, the execution order is defined according to the following three values:
PRIORITY_WRITER: The priority for extensions that write some records to the observation dictionary. It includes cases where the extension directly adds values to the observation dictionary, or uses the chainer.report() function to report values to it. Extensions that write to the reporter should go first, because other extensions that read those values may be added.
PRIORITY_EDITOR: The priority for extensions that edit the observation dictionary based on already reported values. Such extensions should run after the extensions that write values to the reporter but before the extensions that read the final values.
PRIORITY_READER: The priority for extensions that only read records from the observation dictionary. This is also suitable for extensions that do not use the observation dictionary at all. Extensions that read the reported values should be fired after all extensions with other priorities, e.g., PRIORITY_WRITER and PRIORITY_EDITOR, because they should read the final values.
See the details in the documentation of Trainer
for more information.
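The ordering described above can be illustrated with a short sketch. The numeric values and extension names below are assumptions chosen only for illustration (consult the Trainer documentation for the actual constants); the point is that writers run before editors, and editors before readers, regardless of the order in which the extensions were added.

```python
# Illustrative values only (assumption): a larger number means "runs earlier".
PRIORITY_WRITER, PRIORITY_EDITOR, PRIORITY_READER = 300, 200, 100

# Hypothetical extensions, listed in the order they were added.
added = [
    ('print_report', PRIORITY_READER),   # only reads the observation
    ('evaluator', PRIORITY_WRITER),      # writes new values
    ('rescale_loss', PRIORITY_EDITOR),   # edits already reported values
]

# sorted() is stable, so extensions with equal priority keep insertion order.
execution_order = [name for name, priority
                   in sorted(added, key=lambda item: item[1], reverse=True)]
```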
Write a class inherited from the Extension class¶
This is the way to define your own extension with the maximum degree of freedom. You can keep any values inside of the extension and serialize them.
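As an example, let’s make an extension that drops the learning rate polynomially. It calculates the learning rate by this equation:
\[\eta = \eta_{\rm init} \left( 1 - \frac{t}{t_{\rm max}} \right)^{\rm power}\]
Here \(\eta_{\rm init}\) is the initial learning rate, \(t\) is the number of iterations performed so far, and \(t_{\rm max}\) is the total number of iterations.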
The learning rate will be dropped according to the curve below with \({\rm power} = 0.5\):
class PolynomialShift(training.Extension):

    def __init__(self, attr, power, stop_trigger, batchsize=None,
                 len_dataset=None):
        self._attr = attr
        self._power = power
        self._init = None
        self._t = 0
        self._last_value = 0
        if stop_trigger[1] == 'iteration':
            self._maxiter = stop_trigger[0]
        elif stop_trigger[1] == 'epoch':
            if batchsize is None or len_dataset is None:
                raise ValueError(
                    'When the unit of \'stop_trigger\' is \'epoch\', '
                    '\'batchsize\' and \'len_dataset\' should be '
                    'specified to calculate the maximum iteration.')
            n_iter_per_epoch = len_dataset / float(batchsize)
            self._maxiter = float(stop_trigger[0] * n_iter_per_epoch)

    def initialize(self, trainer):
        optimizer = trainer.updater.get_optimizer('main')
        # ensure that _init is set
        if self._init is None:
            self._init = getattr(optimizer, self._attr)

    def __call__(self, trainer):
        self._t += 1
        optimizer = trainer.updater.get_optimizer('main')
        value = self._init * ((1 - (self._t / self._maxiter)) ** self._power)
        setattr(optimizer, self._attr, value)
        self._last_value = value

    def serialize(self, serializer):
        self._t = serializer('_t', self._t)
        self._last_value = serializer('_last_value', self._last_value)
        if isinstance(self._last_value, np.ndarray):
            self._last_value = self._last_value.item()

stop_trigger = (10000, 'iteration')
trainer.extend(PolynomialShift('lr', 0.5, stop_trigger))
This extension PolynomialShift takes five arguments.
attr: The name of the optimizer property you want to update using this extension.
power: The power of the above equation to calculate the learning rate.
stop_trigger: The trigger given to the Trainer object to specify when to stop the training loop.
batchsize: The training minibatch size.
len_dataset: The length of the dataset, i.e., the number of data in the training dataset.
This extension calculates the number of iterations which will be performed during training by using stop_trigger
, batchsize
, and len_dataset
, then stores it as a property _maxiter
. This property will be used in the __call__()
method to update the learning rate. The initialize()
method obtains the initial learning rate from the optimizer given to the Trainer
object. The serialize()
method stores or recovers the properties, _t
(number of iterations) and _last_value
(the latest learning rate), belonging to this extension.
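The arithmetic performed by this extension can be checked in isolation. The sketch below uses two hypothetical helper functions (polynomial_lr and max_iterations are names invented here, not Chainer APIs) that mirror the computations in __call__() and __init__(). The initial rate of 0.01 and the dataset size of 60000 are assumptions chosen for the example.

```python
def polynomial_lr(init, t, max_iter, power):
    """Learning rate after ``t`` iterations under polynomial decay."""
    return init * (1 - t / max_iter) ** power


def max_iterations(stop_trigger, batchsize=None, len_dataset=None):
    """Total number of iterations implied by a (period, unit) stop trigger."""
    period, unit = stop_trigger
    if unit == 'iteration':
        return period
    # unit == 'epoch': iterations per epoch times the number of epochs
    return period * len_dataset / batchsize


# With power = 0.5 the rate is halved once 75% of the iterations are done.
lr_start = polynomial_lr(0.01, 0, 10000, 0.5)       # initial rate
lr_late = polynomial_lr(0.01, 7500, 10000, 0.5)     # 0.01 * sqrt(0.25)
lr_end = polynomial_lr(0.01, 10000, 10000, 0.5)     # reaches zero

# 20 epochs over 60000 examples with minibatches of 100
total = max_iterations((20, 'epoch'), batchsize=100, len_dataset=60000)
```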
Using GPU(s) in Chainer¶
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
In this section, you will learn about the following topics:
Relationship between Chainer and CuPy
Basics of CuPy
Single-GPU usage of Chainer
Multi-GPU usage of model-parallel computing
Multi-GPU usage of data-parallel computing
After reading this section, you will be able to:
Use Chainer on a CUDA-enabled GPU
Write model-parallel computing in Chainer
Write data-parallel computing in Chainer
Relationship between Chainer and CuPy¶
Note
Even if you have CUDA installed in your environment, you have to install CuPy separately to use GPUs. See Working with Custom CUDA Installation for the way to set up CUDA support.
Chainer uses CuPy as its backend for GPU computation.
In particular, the cupy.ndarray
class is the GPU array implementation for Chainer.
CuPy supports a subset of features of NumPy with a compatible interface.
It enables us to write a common code for CPU and GPU.
It also supports PyCUDA-like user-defined kernel generation, which enables us to write fast implementations dedicated to GPU.
Note
The chainer.backends.cuda
module imports many important symbols from CuPy.
For example, the cupy namespace is referred to as cuda.cupy in the Chainer code.
Note that the chainer.backends.cuda
module can be imported even if CUDA is not installed.
Chainer uses a memory pool for GPU memory allocation.
As shown in the previous sections, Chainer constructs and destructs many arrays during learning and evaluating iterations.
It is not well suited for CUDA architecture, since memory allocation and release in CUDA (i.e. cudaMalloc
and cudaFree
functions) synchronize CPU and GPU computations, which hurts performance.
In order to avoid memory allocation and deallocation during the computation, Chainer uses CuPy’s memory pool as the standard memory allocator.
Chainer changes the default allocator of CuPy to the memory pool, so users can use CuPy functions directly without dealing with the memory allocator.
Basics of cupy.ndarray¶
See the documentation of CuPy for the basic usage of cupy.ndarray
CuPy is a GPU array backend that implements a subset of NumPy interface.
The cupy.ndarray
class is in its core, which is a compatible GPU alternative of numpy.ndarray
.
CuPy implements many functions on cupy.ndarray
objects.
See the reference for the supported subset of NumPy API.
Understanding NumPy might help utilizing most features of CuPy.
See the NumPy documentation for learning it.
The main difference of cupy.ndarray
from numpy.ndarray
is that the content is allocated on the device memory.
The allocation takes place on the current device by default.
The current device can be changed by cupy.cuda.Device
object as follows:
with cupy.cuda.Device(1):
    x_on_gpu1 = cupy.array([1, 2, 3, 4, 5])
Most operations of CuPy are done on the current device. Be careful: processing an array on a non-current device causes an error.
Chainer provides some convenient functions to automatically switch and choose the device.
For example, the chainer.backends.cuda.to_gpu()
function copies a numpy.ndarray
object to a specified device:
x_cpu = np.ones((5, 4, 3), dtype=np.float32)
x_gpu = cuda.to_gpu(x_cpu, device=1)
It is equivalent to the following code using CuPy:
x_cpu = np.ones((5, 4, 3), dtype=np.float32)
with cupy.cuda.Device(1):
    x_gpu = cupy.array(x_cpu)
Moving a device array to the host can be done by chainer.backends.cuda.to_cpu()
as follows:
x_cpu = cuda.to_cpu(x_gpu)
It is equivalent to the following code using CuPy:
with x_gpu.device:
    x_cpu = x_gpu.get()
Note
The with statements in these code snippets are required to select the appropriate CUDA device.
If you use only one device, this device switching is not needed.
chainer.backends.cuda.to_cpu()
and chainer.backends.cuda.to_gpu()
functions automatically switch the current device correctly.
Chainer also provides the convenient functions chainer.backends.cuda.get_device_from_id() and chainer.backends.cuda.get_device_from_array() to select a device.
The former function accepts an integer or None. When None is given, it returns a dummy device object. Otherwise, it returns the corresponding device object.
The latter function accepts a CuPy array or a NumPy array. When a NumPy array is given, it returns a dummy device object. Otherwise, it returns the device object corresponding to the given CuPy array.
The dummy device object also supports with statements like the above example but does nothing.
Here are some other examples:
cuda.get_device_from_id(1).use()
x_gpu1 = cupy.empty((4, 3), dtype=cupy.float32)
with cuda.get_device_from_id(1):
    x_gpu1 = cupy.empty((4, 3), dtype=cupy.float32)
with cuda.get_device_from_array(x_gpu1):
    y_gpu1 = x_gpu1 + 1
Since it accepts NumPy arrays, we can write a function that accepts both NumPy and CuPy arrays with correct device switching:
def add1(x):
    with cuda.get_device_from_array(x):
        return x + 1
The compatibility of CuPy with NumPy enables us to write CPU/GPU generic code.
It can be made easy by the chainer.backend.get_array_module()
function.
This function returns the numpy
or cupy
module based on arguments.
A CPU/GPU generic function is defined using it as follows:
# Stable implementation of log(1 + exp(x))
def softplus(x):
    xp = backend.get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))
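Why is the form max(0, x) + log1p(exp(-|x|)) called stable? The scalar sketch below, written with the standard math module instead of NumPy or CuPy, compares it with the direct formula log(1 + exp(x)). The function names are hypothetical and the code is only an illustration of the numerical argument, not of Chainer's own implementation.

```python
import math


def softplus_naive(x):
    # Direct formula: exp(x) overflows for large positive x.
    return math.log(1 + math.exp(x))


def softplus_stable(x):
    # exp(-|x|) never exceeds 1, so it cannot overflow;
    # for large |x| it underflows harmlessly to 0.0.
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))


small_naive = softplus_naive(1.0)
small_stable = softplus_stable(1.0)   # agrees with the naive value
large_stable = softplus_stable(1000.0)

try:
    softplus_naive(1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```

Both forms agree for moderate inputs, while only the stable form returns a finite value for x = 1000.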
Run Neural Networks on a Single GPU¶
Single-GPU usage is very simple.
All you have to do is transfer the Link and the input arrays to the GPU beforehand.
In this subsection, the code is based on our first MNIST example in this tutorial.
A Link
object can be transferred to the specified GPU using the to_gpu()
method.
This time, we make the number of input, hidden, and output units configurable.
The to_gpu()
method also accepts a device ID like model.to_gpu(0)
.
In this case, the link object is transferred to the appropriate GPU device.
The current device is used by default.
If we use chainer.training.Trainer
, what we have to do is just let the updater know the device ID to send each minibatch.
updater = training.updaters.StandardUpdater(train_iter, optimizer, device=0)
trainer = training.Trainer(updater, (20, 'epoch'), out='result')
We also have to specify the device ID for an evaluator extension as well.
trainer.extend(extensions.Evaluator(test_iter, model, device=0))
When we write down the training loop by hand, we have to transfer each minibatch to the GPU manually:
model.to_gpu()
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x = Variable(cuda.to_gpu(x_train[indexes[i : i + batchsize]]))
        t = Variable(cuda.to_gpu(y_train[indexes[i : i + batchsize]]))
        optimizer.update(model, x, t)
Model-parallel Computation on Multiple GPUs¶
Parallelization of machine learning is roughly classified into two types called “model-parallel” and “data-parallel”. Model-parallel means parallelization of the computations inside the model. In contrast, data-parallel means parallelization using data sharding. In this subsection, we show how to use the model-parallel approach on multiple GPUs in Chainer.
Recall the MNIST example. Now suppose that we want to modify this example by expanding the network to 6 layers with 2000 units each using two GPUs. In order to make multi-GPU computation efficient, we only make the two GPUs communicate at the third and sixth layer. The overall architecture looks like the following diagram:
(GPU0) input --+--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+--> output
               |                       |                       |
(GPU1)         +--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+
We can use the above MLP chain as in the following diagram:
(GPU0) input --+--> mlp1 --+--> mlp2 --+--> output
               |           |           |
(GPU1)         +--> mlp1 --+--> mlp2 --+
Let’s write a link for the whole network.
class ParallelMLP(Chain):
    def __init__(self):
        super(ParallelMLP, self).__init__()
        with self.init_scope():
            # the input size, 784, is inferred
            self.mlp1_gpu0 = MLP(1000, 2000).to_gpu(0)
            self.mlp1_gpu1 = MLP(1000, 2000).to_gpu(1)
            # the input size, 2000, is inferred
            self.mlp2_gpu0 = MLP(1000, 10).to_gpu(0)
            self.mlp2_gpu1 = MLP(1000, 10).to_gpu(1)

    def forward(self, x):
        # assume x is on GPU 0
        z0 = self.mlp1_gpu0(x)
        z1 = self.mlp1_gpu1(F.copy(x, 1))
        # sync
        h0 = F.relu(z0 + F.copy(z1, 0))
        h1 = F.relu(z1 + F.copy(z0, 1))
        y0 = self.mlp2_gpu0(h0)
        y1 = self.mlp2_gpu1(h1)
        # sync
        y = y0 + F.copy(y1, 0)
        return y  # output is on GPU0
Recall that the Link.to_gpu() method returns the link itself.
The copy() function copies an input variable to the specified GPU device and returns a new variable on that device.
The copy supports backprop, which simply transfers the output gradient back to the input device.
Note
The above code is not parallelized on CPU, but is parallelized on GPU. This is because all the functions in the above code run asynchronously to the host CPU.
An almost identical example code can be found at examples/mnist/train_mnist_model_parallel.py.
Data-parallel Computation on Multiple GPUs with Trainer¶
Data-parallel computation is another strategy to parallelize online processing. In the context of neural networks, it means that a different device does computation on a different subset of the input data. In this subsection, we review the way to achieve data-parallel learning on two GPUs.
Suppose again our task is the MNIST example. This time we want to directly parallelize the three-layer network. The simplest form of data-parallelization is parallelizing the gradient computation for a distinct set of data. First, define a model and optimizer instances:
model = L.Classifier(MLP(1000, 10)) # the input size, 784, is inferred
optimizer = optimizers.SGD()
optimizer.setup(model)
Recall that the MLP
link implements the multilayer perceptron, and the Classifier
link wraps it to provide a classifier interface.
We used StandardUpdater
in the previous example.
In order to enable data-parallel computation with multiple GPUs, we only have to replace it with ParallelUpdater.
updater = training.updaters.ParallelUpdater(train_iter, optimizer,
                                            devices={'main': 0, 'second': 1})
The devices option specifies which devices to use in data-parallel learning.
The device with name 'main'
is used as the main device.
The original model is sent to this device, so the optimization runs on the main device.
In the above example, the model is also cloned and sent to GPU 1.
Half of each minibatch is fed to this cloned model.
After every backward computation, the gradient is accumulated into the main device, the parameter update runs on it, and then the updated parameters are sent to GPU 1 again.
See also the example code in examples/mnist/train_mnist_data_parallel.py.
Data-parallel Computation on Multiple GPUs without Trainer¶
Here we introduce a way to write data-parallel computation without the help of Trainer.
Most users can skip this section.
If you are interested in how to write a data-parallel computation by yourself, this section should be informative.
It is also helpful if you want to, e.g., customize the ParallelUpdater class.
We again start from the MNIST example.
This time, we use a suffix like _0 and _1 to distinguish objects on each device.
First, we define a model.
model_0 = L.Classifier(MLP(1000, 10)) # the input size, 784, is inferred
We want to make two copies of this instance on different GPUs.
The Link.to_gpu()
method runs in place, so we cannot use it to make a copy.
In order to make a copy, we can use Link.copy()
method.
model_1 = model_0.copy()
model_0.to_gpu(0)
model_1.to_gpu(1)
The Link.copy()
method copies the link into another instance.
It just copies the link hierarchy, and does not copy the arrays it holds.
Then, set up an optimizer:
optimizer = optimizers.SGD()
optimizer.setup(model_0)
Here we use the first copy of the model as the master model.
Before its update, gradients of model_1
must be aggregated to those of model_0
.
Then, we can write a data-parallel learning loop as follows:
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
    print('epoch %d' % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        x_batch = x_train[indexes[i : i + batchsize]]
        y_batch = y_train[indexes[i : i + batchsize]]
        x0 = Variable(cuda.to_gpu(x_batch[:batchsize//2], 0))
        t0 = Variable(cuda.to_gpu(y_batch[:batchsize//2], 0))
        x1 = Variable(cuda.to_gpu(x_batch[batchsize//2:], 1))
        t1 = Variable(cuda.to_gpu(y_batch[batchsize//2:], 1))
        loss_0 = model_0(x0, t0)
        loss_1 = model_1(x1, t1)
        model_0.cleargrads()
        model_1.cleargrads()
        loss_0.backward()
        loss_1.backward()
        model_0.addgrads(model_1)
        optimizer.update()
        model_1.copyparams(model_0)
Do not forget to clear the gradients of both model copies!
One half of the minibatch is forwarded to GPU 0, the other half to GPU 1.
Then the gradients are accumulated by the Link.addgrads() method. This method adds the gradients of a given link to those of the link on which it is called.
After the gradients are prepared, we can update the parameters with the optimizer in the usual way.
Note that the update only modifies the parameters of model_0, so we must manually copy them to model_1 using the Link.copyparams() method.
Note
If the batch size used by each model remains the same, the scale of the aggregated gradient is roughly proportional to the number of models when we aggregate gradients from all models by chainer.Link.addgrads(). So you need to adjust the batch size and/or the learning rate of the optimizer accordingly.
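A small numeric sketch may help to see where this scaling comes from. It assumes that each model's loss is the mean over its own sub-batch (as with a mean-reduced loss) and uses made-up per-example gradient values; nothing here calls Chainer.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical per-example gradients of one scalar parameter.
per_example_grads = [1.0, 2.0, 3.0, 4.0]

# Single model: one gradient, averaged over the whole minibatch.
single_model_grad = mean(per_example_grads)

# Two models: each averages over its own half, then the results are ADDED,
# which is what accumulating gradients across models does.
grad_0 = mean(per_example_grads[:2])
grad_1 = mean(per_example_grads[2:])
aggregated_grad = grad_0 + grad_1

scale = aggregated_grad / single_model_grad  # equals the number of models
```

Under these assumptions the aggregated gradient is twice the single-model gradient, which is why the learning rate or batch size may need adjusting when the number of models changes.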
Now you can use Chainer with GPUs. All examples in the examples directory support GPU computation, so please refer to them if you want to learn more about using GPUs in practice. In the next section, we will show how to define a differentiable (i.e. backpropable) function on Variable objects. We will also show there how to write a simple (elementwise) CUDA kernel using Chainer’s CUDA utilities.
Type Checks¶
In this section, you will learn about the following things:
Basic usage of type check
Detail of type information
Internal mechanism of type check
More complicated cases
Call functions
Typical type check example
After reading this section, you will be able to:
Write a code to check types of input arguments of your own functions
Basic usage of type check¶
When you call a function with an invalid type of array, you sometimes receive no error, but get an unexpected result by broadcasting. When you use CUDA with an illegal type of array, it causes memory corruption, and you get a serious error. These bugs are hard to fix. Chainer can check preconditions of each function, and helps to prevent such problems. These conditions may help a user to understand specification of functions.
Each implementation of Function
has a method for type check, check_type_forward()
.
This function is called just before the forward()
method of the Function
class.
You can override this method to check the condition on types and shapes of arguments.
check_type_forward()
gets an argument in_types
:
def check_type_forward(self, in_types):
    ...
in_types
is an instance of TypeInfoTuple
, which is a subclass of tuple
.
To get type information about the first argument, use in_types[0]
.
If the function gets multiple arguments, we recommend using new variables for readability:
x_type, y_type = in_types
In this case, x_type
represents the type of the first argument, and y_type
represents the second one.
We describe usage of in_types
with an example.
When you want to check whether the number of dimensions of x_type equals 2, write this code:
utils.type_check.expect(x_type.ndim == 2)
When this condition is true, nothing happens. Otherwise this code throws an exception, and the user gets a message like this:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].ndim == 2
Actual: 3 != 2
This error message means that “ndim of the first argument is expected to be 2, but actually it is 3”.
Detail of type information¶
You can access three pieces of information from x_type.
.shape is a tuple of ints. Each value is the size of the corresponding dimension.
.ndim is an int value representing the number of dimensions. Note that ndim == len(shape).
.dtype is a numpy.dtype representing the data type of the value.
You can check all members. For example, to require that the size of the first dimension be positive, you can write:
utils.type_check.expect(x_type.shape[0] > 0)
You can also check data types with .dtype
:
utils.type_check.expect(x_type.dtype == np.float64)
And an error is like this:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].dtype == <class 'numpy.float64'>
Actual: float32 != <class 'numpy.float64'>
You can also check the kind of the dtype.
This code checks whether the type is floating point:
utils.type_check.expect(x_type.dtype.kind == 'f')
You can also compare variables with each other. For example, the following code checks whether the first argument and the second argument have the same size along the second axis:
utils.type_check.expect(x_type.shape[1] == y_type.shape[1])
Internal mechanism of type check¶
How does it show an error message like "in_types[0].ndim == 2"?
If x_type were an object containing an ndim member variable, we could not show such an error message, because the expression would be evaluated to a boolean value by the Python interpreter.
Actually, x_type is an Expr object, and doesn’t have an ndim member variable itself.
Expr represents a syntax tree.
x_type.ndim makes an Expr object representing (getattr, x_type, 'ndim').
x_type.ndim == 2 makes an object like (eq, (getattr, x_type, 'ndim'), 2).
expect() gets an Expr object and evaluates it.
When it is True, it causes no error and shows nothing.
Otherwise, this method shows a readable error message.
If you want to evaluate an Expr object, call the eval() method:
actual_type = x_type.eval()
actual_type is an instance of TypeInfo, while x_type is an instance of Expr.
In the same way, x_type.shape[0].eval() returns an int value.
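The idea of building a syntax tree instead of evaluating immediately can be sketched in a few lines of plain Python. LazyExpr and expect below are hypothetical stand-ins written for this illustration; they are far simpler than, and not identical to, Chainer's actual Expr and expect().

```python
from types import SimpleNamespace


class LazyExpr:
    """Hypothetical lazily evaluated expression (not Chainer's Expr)."""

    def __init__(self, text, evaluate):
        self.text = text            # human-readable form of the expression
        self._evaluate = evaluate   # zero-argument callable producing the value

    def eval(self):
        return self._evaluate()

    def __getattr__(self, name):
        # Called only for attributes not found normally (e.g. ``ndim``):
        # build a new node instead of reading the attribute right away.
        return LazyExpr('%s.%s' % (self.text, name),
                        lambda: getattr(self.eval(), name))

    def __eq__(self, other):
        # Build a comparison node; nothing is compared until eval() runs.
        return LazyExpr('%s == %r' % (self.text, other),
                        lambda: self.eval() == other)


def expect(expr):
    # Evaluate lazily; on failure, the stored text gives a readable message.
    if not expr.eval():
        raise TypeError('Expect: ' + expr.text)


# A stand-in "type" whose actual ndim is 3.
x_type = LazyExpr('in_types[0]', lambda: SimpleNamespace(ndim=3))
condition = x_type.ndim == 2   # a LazyExpr, not a bool
```

Because the comparison returns an object that remembers its own textual form, a failing check can report the expression that was expected rather than just False.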
More powerful methods¶
The Expr class is more powerful.
It supports all mathematical operators such as + and *.
You can write a condition that the first dimension of x_type is the first dimension of y_type times four:
utils.type_check.expect(x_type.shape[0] == y_type.shape[0] * 4)
When x_type.shape[0] == 3
and y_type.shape[0] == 1
, users can get the error message below:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == in_types[1].shape[0] * 4
Actual: 3 != 4
To compare against a member variable of your function, wrap the value with utils.type_check.Variable to get a readable error message:
x_type.shape[0] == utils.type_check.Variable(self.in_size, "in_size")
This code can check the equivalent condition below:
x_type.shape[0] == self.in_size
However, the latter condition doesn’t know the meaning of this value. When this condition is not satisfied, the latter code shows an unreadable error message:
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == 4 # what does '4' mean?
Actual: 3 != 4
Note that the second argument of utils.type_check.Variable
is only for readability.
The former shows this message:
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == in_size # OK, `in_size` is a value that is given to the constructor
Actual: 3 != 4 # You can also check actual value here
Call functions¶
How can we check the sum of all values of shape?
Expr also supports function calls:
sum = utils.type_check.Variable(np.sum, 'sum')
utils.type_check.expect(sum(x_type.shape) == 10)
Why do we need to wrap the function numpy.sum
with utils.type_check.Variable
?
x_type.shape
is not a tuple but an object of Expr
as we have seen before.
Therefore, numpy.sum(x_type.shape)
fails.
We need to evaluate this function lazily.
The above example produces an error message like this:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: sum(in_types[0].shape) == 10
Actual: 7 != 10
More complicated cases¶
How can we write a more complicated condition that can’t be written with these operators?
You can evaluate an Expr and get its result value with the eval() method.
Then check the condition and raise an error with a hand-written message:
x_shape = x_type.shape.eval()  # get actual shape (int tuple)
if not more_complicated_condition(x_shape):
    expect_msg = 'Shape is expected to be ...'
    actual_msg = 'Shape is ...'
    raise utils.type_check.InvalidType(expect_msg, actual_msg)
Please write a readable error message. This code generates the following error message:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: Shape is expected to be ...
Actual: Shape is ...
Typical type check example¶
We show a typical type check for a function.
First check the number of arguments:
utils.type_check.expect(in_types.size() == 2)
in_types.size() returns an Expr object representing the number of arguments.
You can check it in the same way.
And then, get each type:
x_type, y_type = in_types
Don’t get each value before checking in_types.size().
When the number of arguments is illegal, type_check.expect might output unhelpful error messages.
For example, this code doesn’t work when the size of in_types is 0:
utils.type_check.expect(
    in_types.size() == 2,
    in_types[0].ndim == 3,
)
After that, check each type:
utils.type_check.expect(
    x_type.dtype == np.float32,
    x_type.ndim == 3,
    x_type.shape[1] == 2,
)
The above example works correctly even when x_type.ndim == 0
as all conditions are evaluated lazily.
Serializers – saving and loading¶
Serializer is a simple interface to serialize or deserialize an object.
Link
, Optimizer
, and Trainer
support serialization.
Concrete serializers are defined in the serializers
module.
It supports NumPy NPZ and HDF5 formats.
For example, we can serialize a link object into an NPZ file with the save_npz() function.
Assuming we have defined a model:
>>> from chainer import serializers
>>> serializers.save_npz('my.model', model)
This saves the parameters of model
into the file 'my.model'
in NPZ format.
The saved parameters can be read back from my.model into model by the load_npz() function:
>>> serializers.load_npz('my.model', model)
Note
Note that only the parameters and the persistent values are serialized by this serialization code.
Other attributes are not saved automatically.
You can register arrays, scalars, or any serializable objects as persistent values by the add_persistent()
method.
The registered values can be accessed by attributes of the name passed to the add_persistent method.
The state of an optimizer can also be saved by the same functions:
>>> serializers.save_npz('my.state', optimizer)
>>> serializers.load_npz('my.state', optimizer)
Note
Note that serialization of optimizer only saves its internal states including number of iterations, momentum vectors of MomentumSGD, etc.
It does not save the parameters and persistent values of the target link.
We have to explicitly save the target link with the optimizer to resume the optimization from saved states.
This can be done by saving the entire Trainer
object, like this:
>>> serializers.save_npz('my.state', trainer)
Support of the HDF5 format is enabled if the h5py package is installed.
Serialization and deserialization with the HDF5 format are almost identical to those with the NPZ format;
just replace save_npz()
and load_npz()
by save_hdf5()
and load_hdf5()
, respectively.
Customize your own logging¶
In this section, you will learn about the following things:
What is chainer.Reporter?
How to report logging with chainer.Reporter?
The naming rule for the reported values.
After reading this section, you will be able to:
Write your own report.
What is Reporter?¶
chainer.Reporter
is used to collect values that users want to watch.
The reporter object manipulates a dictionary from value names to the actually
observed values. We call this dictionary the observation.
See the following example:
>>> from chainer import Reporter, report, report_scope
>>>
>>> reporter = Reporter()
>>> observer = object() # it can be an arbitrary (reference) object
>>> reporter.add_observer('my_observer:', observer)
>>> observation = {}
>>> with reporter.scope(observation):
...     reporter.report({'x': 1}, observer)
...
>>> observation
{'my_observer:/x': 1}
When a value is passed to the reporter
, an object called observer
can be
optionally attached. In this case, the name of the observer
is added as the
prefix of the value name. The observer
name should be registered beforehand.
Using reporter.scope
, you can select which observation
to save the
observed values.
There is also a global API, chainer.report()
, which reports observed values
with the current reporter object. In this case, current means which with
statement scope the current code line is in. This function calls the
Reporter.report()
method of the current reporter.
>>> observation = {}
>>> with reporter.scope(observation):
...     report({'x': 1}, observer)
...
>>> observation
{'my_observer:/x': 1}
Use report in Chain or Link¶
The most important application of Reporter
is to report
observed values from each Link
or Chain
in the training and validation procedures.
But how do we report the observed values from each link or chain? Should we
prepare the Reporter ourselves? No, you only need to call report() in the chain or link,
because Trainer and some extensions prepare their own
Reporter object with the hierarchy of the target link registered
as observers. We can use the report() function inside any link or chain to
report the observed values (e.g., training loss, accuracy, activation statistics).
See the following example:
>>> class Classifier(Chain):
...     def __init__(self, predictor):
...         super(Classifier, self).__init__()
...         with self.init_scope():
...             self.predictor = predictor
...
...     def forward(self, x, t):
...         y = self.predictor(x)
...         loss = F.softmax_cross_entropy(y, t)
...         accuracy = F.accuracy(y, t)
...         report({'loss': loss, 'accuracy': accuracy}, self)
...         return loss
...
If the link is named 'main'
in the hierarchy (which is the default
name of the target link in the StandardUpdater
),
these reported values are named 'main/loss'
and 'main/accuracy'
.
If these values are reported inside the Evaluator
extension, 'validation/'
is added at the head of the link name, thus
the item names are changed to 'validation/main/loss'
and 'validation/main/accuracy'
('validation'
is the default name of the Evaluator extension).
Naming rule for the reported values¶
So, you know almost everything about Reporter
.
However, there is one more thing: the naming rule for the reported values,
especially when the values are reported from a link that is not the root of the link hierarchy.
As we explained in the previous section, the root link is named 'main'
by the StandardUpdater, and the names of values reported
from the root have the prefix 'main/'.
When the values are reported from a link that is not the root of the link hierarchy,
the prefix of the names is determined by the link hierarchy, or
namedlinks().
See the following example:
>>> class MLP(Chain):
...     def __init__(self, n_units, n_out):
...         super(MLP, self).__init__()
...         with self.init_scope():
...             # the size of the inputs to each layer will be inferred
...             self.l1 = L.Linear(None, n_units)  # n_in -> n_units
...             self.l2 = L.Linear(None, n_units)  # n_units -> n_units
...             self.l3 = L.Linear(None, n_out)  # n_units -> n_out
...
...     def forward(self, x):
...         h1 = F.relu(self.l1(x))
...         h2 = F.relu(self.l2(h1))
...         y = self.l3(h2)
...         report({'sum_y': F.sum(y)}, self)
...         return y
...
>>> model = Classifier(MLP(100, 10))
>>> for name, observer in model.namedlinks(skipself=True):
...     print(name)
/predictor
/predictor/l1
/predictor/l2
/predictor/l3
You can get the names of the links in the hierarchy with namedlinks().
In this example, we report 'loss' and 'accuracy' in the root link, and
'sum_y' in the '/predictor' link.
So, you can access the reported values by 'main/loss',
'main/accuracy', and 'main/predictor/sum_y'.
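The naming rule can be summarized as simple string concatenation. The helper below is a hypothetical illustration written for this section (reported_name and its parameters are not part of Chainer); it assumes the default target name 'main' and the default Evaluator name 'validation' mentioned above.

```python
def reported_name(value_name, link_path='', target_name='main',
                  extension_name=None):
    """Build the full name of a reported value (illustrative helper only).

    link_path is the path of the reporting link as printed by namedlinks(),
    e.g. '/predictor'; the root link has an empty path.
    """
    name = target_name + link_path + '/' + value_name
    if extension_name is not None:
        # e.g. values reported while the Evaluator extension is running
        name = extension_name + '/' + name
    return name


print(reported_name('loss'))                      # main/loss
print(reported_name('sum_y', '/predictor'))       # main/predictor/sum_y
print(reported_name('accuracy', extension_name='validation'))
# validation/main/accuracy
```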
Let's confirm that this explanation is correct:
>>> train, test = datasets.get_mnist()
>>> train_iter = iterators.SerialIterator(train, batch_size=100, shuffle=True)
>>> test_iter = iterators.SerialIterator(test, batch_size=100, repeat=False, shuffle=False)
>>> optimizer = optimizers.SGD()
>>> optimizer.setup(model)
>>> updater = training.StandardUpdater(train_iter, optimizer)
>>> trainer = training.Trainer(updater, (1, 'epoch'), out='result')
>>> trainer.extend(extensions.Evaluator(test_iter, model))
>>> trainer.extend(extensions.LogReport())
>>> trainer.extend(extensions.PrintReport(
...     ['epoch', 'main/accuracy', 'main/loss', 'main/predictor/sum_y', 'validation/main/accuracy']))
>>> trainer.run()
epoch main/accuracy main/loss main/predictor/sum_y validation/main/accuracy
1 0.662317 1.38345 47.9927 0.8498
Neural Net Examples¶
MNIST using Trainer¶
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
By using Trainer
, you don’t need to write the training loop explicitly any more. Furthermore, Chainer provides many useful extensions that can be used with Trainer
to visualize your results, evaluate your model, store and manage log files more easily.
This example will show how to use the Trainer
to train a fully-connected feed-forward neural network on the MNIST dataset.
Note
If you would like to know how to write a training loop without using the Trainer
, please check MNIST with a Manual Training Loop instead of this tutorial.
1. Prepare the dataset¶
Load the MNIST dataset, which contains a training set of images and class labels as well as a corresponding test set.
from chainer.datasets import mnist
train, test = mnist.get_mnist()
Note
You can use a Python list as a dataset. That’s because Iterator
can take any object as a dataset whose elements can be accessed via []
accessor and whose length can be obtained with len()
function. For example,
train = [(x1, t1), (x2, t2), ...]
a list of tuples like this can be used as a dataset.
There are many utility dataset classes defined in datasets
. It is recommended that you utilize them in the actual applications.
For example, if your dataset consists of a number of image files, it would take a large amount of memory to load those data into a list like above. In that case, you can use ImageDataset
, which just keeps the paths to image files. The actual image data will be loaded from the disk when the corresponding element is requested via []
accessor. Until then, no images are loaded to the memory to reduce memory use.
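The dataset contract described in this note (indexing with [] plus len()) and the lazy-loading idea can be sketched in plain Python. LazyDataset and fake_loader are hypothetical names used only for this illustration; this is not how ImageDataset is implemented, and the loader here returns a string instead of reading a real image file.

```python
class LazyDataset:
    """Keeps only paths in memory and loads an item when it is indexed."""

    def __init__(self, paths, loader):
        self._paths = list(paths)
        self._loader = loader   # function that turns a path into data

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, index):
        # Nothing is loaded until this point.
        return self._loader(self._paths[index])


loaded = []   # records which paths were actually read


def fake_loader(path):
    loaded.append(path)
    return 'pixels of ' + path


dataset = LazyDataset(['a.png', 'b.png', 'c.png'], fake_loader)
print(len(dataset), loaded)   # 3 []
item = dataset[1]
print(item, loaded)           # pixels of b.png ['b.png']
```

Any object with these two methods satisfies the contract stated above, which is why a plain list of tuples works as well.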
2. Prepare the dataset iterations¶
Iterator
creates a minibatch from the given dataset.
batchsize = 128
train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize, False, False)
3. Prepare the model¶
Here, we are going to use the same model as the one defined in MNIST with a Manual Training Loop.
class MLP(Chain):
    def __init__(self, n_mid_units=100, n_out=10):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_mid_units)
            self.l2 = L.Linear(None, n_mid_units)
            self.l3 = L.Linear(None, n_out)
    def forward(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)
gpu_id = 0  # Set to -1 if you use CPU
model = MLP()
if gpu_id >= 0:
    model.to_gpu(gpu_id)
4. Prepare the Updater¶
Trainer
is a class that holds all of the necessary components needed for training. The main components are shown below.
Basically, all you need to pass to Trainer
is an Updater
. However, Updater
contains an Iterator
and Optimizer
. Since Iterator
can access the dataset and Optimizer
has references to the model, Updater can access the model to update its parameters.
So, Updater can perform the training procedure as shown below:
Retrieve the data from the dataset and construct a minibatch (Iterator)
Pass the minibatch to the model and calculate the loss
Update the parameters of the model (Optimizer)
Now let's create the Updater object!
max_epoch = 10
# Wrap your model by Classifier and include the process of loss calculation within your model.
# Since we do not specify a loss function here, the default 'softmax_cross_entropy' is used.
model = L.Classifier(model)
# selection of your optimizing method
optimizer = optimizers.MomentumSGD()
# Give the optimizer a reference to the model
optimizer.setup(model)
# Get an updater that uses the Iterator and Optimizer
updater = training.updaters.StandardUpdater(train_iter, optimizer, device=gpu_id)
Note
Here, the model defined above is passed to Classifier and changed to a new Chain. Classifier, which in fact inherits from the Chain class, keeps the given Chain model in its predictor attribute. Once you give the input data and the corresponding class labels to the model by the () operator,
forward() of the model is invoked. The data is then given to predictor to obtain the output y.
Next, together with the given labels, the output y is passed to the loss function, which is determined by the lossfun argument in the constructor of Classifier.
The loss is returned as a Variable.
In Classifier, the lossfun is set to softmax_cross_entropy() by default.
StandardUpdater
is the simplest class among several updaters. There are also the ParallelUpdater
and the MultiprocessParallelUpdater
to utilize multiple GPUs. The MultiprocessParallelUpdater
uses the NVIDIA NCCL library, so you need to install NCCL and reinstall CuPy before using it.
5. Setup Trainer¶
Lastly, we will set up the Trainer. The only requirement for creating a Trainer is to pass the Updater object that we created above. You can also pass a stop_trigger as the second argument of Trainer, as a tuple like (length, unit), to tell the trainer when to stop the training. The length is given as an integer and the unit is given as a string, which should be either epoch or iteration. Without setting stop_trigger, the training will never be stopped.
# Setup a Trainer
trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='mnist_result')
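The meaning of a (length, unit) stop trigger can be sketched as a small predicate. should_stop is a hypothetical helper written for this explanation, not a Chainer function; it assumes that training stops as soon as the chosen counter reaches the given length.

```python
def should_stop(stop_trigger, epoch, iteration):
    """Return True when training should stop (illustrative sketch only)."""
    length, unit = stop_trigger
    if unit == 'epoch':
        return epoch >= length
    if unit == 'iteration':
        return iteration >= length
    raise ValueError("unit must be 'epoch' or 'iteration'")


print(should_stop((10, 'epoch'), epoch=9, iteration=4000))     # False
print(should_stop((10, 'epoch'), epoch=10, iteration=4690))    # True
print(should_stop((500, 'iteration'), epoch=1, iteration=500)) # True
```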
The out argument specifies an output directory used to save the
log files and, when you use the PlotReport extension, the image files of plots that show the progress of loss, accuracy, etc. over time. Next, we will explain how to display or save this information by using trainer Extension objects.
6. Add Extensions to the Trainer object¶
The Trainer
extensions provide the following capabilities:
Save log files automatically (LogReport)
Display the training information to the terminal periodically (PrintReport)
Visualize the loss progress by plotting a graph periodically and save it as an image file (PlotReport)
Automatically serialize the state periodically (snapshot() / snapshot_object())
Display a progress bar to the terminal to show the progress of training (ProgressBar)
Save the model architecture as a Graphviz dot file (DumpGraph())
To use this wide variety of tools for your training task, pass Extension objects to the extend() method of your Trainer object.
from chainer.training import extensions
trainer.extend(extensions.LogReport())
trainer.extend(extensions.snapshot(filename='snapshot_epoch{.updater.epoch}'))
trainer.extend(extensions.snapshot_object(model.predictor, filename='model_epoch{.updater.epoch}'))
trainer.extend(extensions.Evaluator(test_iter, model, device=gpu_id))
trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy', 'validation/main/loss', 'validation/main/accuracy', 'elapsed_time']))
trainer.extend(extensions.PlotReport(['main/loss', 'validation/main/loss'], x_key='epoch', file_name='loss.png'))
trainer.extend(extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
trainer.extend(extensions.DumpGraph('main/loss'))
LogReport
¶
Collect loss
and accuracy
automatically every epoch
or iteration
and store the information under the log
file in the directory specified by the out
argument when you create a Trainer
object.
snapshot()
¶
The snapshot()
method saves the Trainer
object at the designated timing (default: every epoch) in the directory specified by out
. The Trainer
object, as mentioned before, has an Updater
which contains an Optimizer
and a model inside. Therefore, as long as you have the snapshot file, you can use it to come back to the training or make inferences using the previously trained model later.
snapshot_object()
¶
However, when you keep the whole Trainer
object, in some cases, it is very tedious to retrieve only the inside of the model. By using snapshot_object()
, you can save the particular object (in this case, the model wrapped by Classifier
) as a separate snapshot. Classifier
is a Chain
object which keeps the model that is also a Chain
object as its predictor
property, and all the parameters are under the predictor
, so taking the snapshot of predictor
is enough to keep all the trained parameters.
This is a list of commonly used trainer extensions:
LogReport
This extension collects the loss and accuracy values every epoch or iteration and stores them in a log file. The log file will be located under the output directory (specified by the out argument of the Trainer object).
snapshot()
This extension saves the Trainer object at the designated timing (default: every epoch) in the output directory. The Trainer object, as mentioned before, has an Updater which contains an Optimizer and a model inside. Therefore, as long as you have the snapshot file, you can use it to come back to the training or make inferences using the previously trained model later.
snapshot_object()
The snapshot() extension above saves the whole Trainer object. However, in some cases, it is tedious to retrieve only the inside of the model. By using snapshot_object(), you can save a particular object (in the example above, the model wrapped by Classifier) as a separate snapshot. Taking the snapshot of predictor is enough to keep all the trained parameters, because Classifier (which is a subclass of Chain) keeps the model as its predictor property, and all the parameters are under this property.
DumpGraph()
This extension saves the structure of the computational graph of the model. The graph is saved in Graphviz dot format under the output directory of the Trainer.
Evaluator
The Evaluator extension requires an Iterator that uses the evaluation dataset, and the model object. It evaluates the model using the given dataset (typically a validation dataset) at the specified timing interval.
PrintReport
This extension outputs the specified values to the standard output.
PlotReport
This extension plots the values specified by its arguments and saves the plot as an image file.
This is not an exhaustive list of built-in extensions. Please take a look at Extensions for more of them.
7. Start Training¶
Just call the run() method of the Trainer object to start training.
trainer.run()
epoch main/loss main/accuracy validation/main/loss validation/main/accuracy elapsed_time
1 1.53241 0.638409 0.74935 0.835839 4.93409
2 0.578334 0.858059 0.444722 0.882812 7.72883
3 0.418569 0.886844 0.364943 0.899229 10.4229
4 0.362342 0.899089 0.327569 0.905558 13.148
5 0.331067 0.906517 0.304399 0.911788 15.846
6 0.309019 0.911964 0.288295 0.917722 18.5395
7 0.292312 0.916128 0.272073 0.921776 21.2173
8 0.278291 0.92059 0.261351 0.923457 23.9211
9 0.266266 0.923541 0.253195 0.927314 26.6612
10 0.255489 0.926739 0.242415 0.929094 29.466
Let’s see the plot of loss progress saved in the mnist_result
directory.
How about the accuracy?
Furthermore, let’s visualize the computational graph saved with DumpGraph()
using Graphviz.
% dot -Tpng mnist_result/cg.dot -o mnist_result/cg.png
From the top to the bottom, you can see the data flow in the computational graph. It basically shows how data and parameters are passed to the Function
s.
8. Evaluate a pretrained model¶
Evaluation using the snapshot of a model is as easy as what is explained in MNIST with a Manual Training Loop.
import matplotlib.pyplot as plt
model = MLP()
serializers.load_npz('mnist_result/model_epoch10', model)
# Show the output
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.show()
print('label:', t)
y = model(x[None, ...])
print('predicted_label:', y.array.argmax(axis=1)[0])
label: 7
predicted_label: 7
The prediction looks correct. Success!
MNIST with a Manual Training Loop¶
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
In this tutorial section, we will learn how to train a deep neural network to classify images of hand-written digits in the popular MNIST dataset. This dataset contains 60,000 training examples and 10,000 test examples. Each example is a pair of a 28 x 28 greyscale image and a corresponding class label. Since the digits from 0 to 9 are used, there are 10 classes for the labels.
Chainer provides a feature called Trainer
that can simplify the training procedure of your model. However, it is also good to know how the training works in Chainer before starting to use the useful Trainer
class that hides the actual processes. Writing your own training loop can be useful for learning how Trainer
works or for implementing features not included in the standard trainer.
The complete training procedure consists of the following steps:

Retrieve a set of examples (minibatch) from the training dataset.
Feed the minibatch to your network.
Run a forward pass of the network and compute the loss.
Call the backward() method of the loss Variable to compute the gradients for all trainable parameters.
Run the optimizer to update those parameters.
Perform classification by the saved model and check the network performance on validation/test sets.
1. Prepare a dataset¶
Chainer contains some built-in functions to use some popular datasets like MNIST, CIFAR-10/100, etc. Those can automatically download the data from servers and provide dataset objects which are easy to use.
The code below shows how to retrieve the MNIST dataset from the server and save an image from its training split to make sure the images are correctly obtained.
from __future__ import print_function
import matplotlib.pyplot as plt
from chainer.datasets import mnist
# Download the MNIST data if you haven't downloaded it yet
train, test = mnist.get_mnist(withlabel=True, ndim=1)
# Display an example from the MNIST dataset.
# `x` contains the input image array and `t` contains that target class
# label as an integer.
x, t = train[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.savefig('5.png')
print('label:', t)
label: 5
The saved image 5.png
will look like:
2. Create a dataset iterator¶
Although this is an optional step, we’d like to introduce the Iterator
class that retrieves a set of data and labels from the given dataset to easily make a minibatch. There are some subclasses that can perform the same thing in different ways, e.g., using multiprocessing to parallelize the data loading part, etc.
Here, we use SerialIterator
, which is also a subclass of Iterator
in the example code below. The SerialIterator
can provide minibatches with or without shuffling the order of data in the given dataset.
All Iterator
s produce a new minibatch by calling its next()
method. All
Iterator
s also have properties to know how many times we have taken all the data from the given dataset (epoch
) and whether the next minibatch will be the start of a new epoch (is_new_epoch
), and so on.
The code below shows how to create a SerialIterator
object from a dataset object.
from chainer import iterators
# Choose the minibatch size.
batchsize = 128
train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize,
                                     repeat=False, shuffle=False)
Note
Iterators can take a built-in Python list as a given dataset. This means that the example code below works,
train = [(x1, t1), (x2, t2), ...] # A list of tuples
train_iter = iterators.SerialIterator(train, batchsize)
where x1, x2, ...
denote the input data and t1, t2, ...
denote the corresponding labels.
Details of SerialIterator¶
SerialIterator is a built-in subclass of Iterator that can retrieve a minibatch from a given dataset in either sequential or shuffled order.
The Iterator's constructor takes two arguments: a dataset object and a minibatch size.
If you want to use the same dataset repeatedly during the training process, set the repeat argument to True (default). Otherwise, the dataset will be used only one time. The latter case is actually for the evaluation.
If you want to shuffle the training dataset every epoch, set the shuffle argument to True. Otherwise, the order of each data retrieved from the dataset will be always the same at each epoch.
In the example code shown above, we set batchsize = 128
in both train_iter
and test_iter
. So, these iterators will provide 128 images and corresponding labels at a time.
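The roles of repeat, shuffle, epoch and is_new_epoch can be illustrated with a small plain-Python iterator. MiniSerialIterator is a hypothetical toy class written for this section; it follows the behavior described above under the assumption that, with repeat=True, a batch that crosses the end of the dataset is completed with examples from the next pass. It is a sketch, not Chainer's SerialIterator.

```python
import random


class MiniSerialIterator:
    """Toy minibatch iterator (illustrative sketch only)."""

    def __init__(self, dataset, batch_size, repeat=True, shuffle=True, seed=0):
        self.dataset = dataset
        self.batch_size = batch_size
        self.repeat = repeat
        self.shuffle = shuffle
        self._rng = random.Random(seed)
        self.reset()

    def _new_order(self):
        order = list(range(len(self.dataset)))
        if self.shuffle:
            self._rng.shuffle(order)
        return order

    def reset(self):
        self.epoch = 0
        self.is_new_epoch = False
        self._position = 0
        self._order = self._new_order()

    def __iter__(self):
        return self

    def __next__(self):
        if not self.repeat and self.epoch > 0:
            raise StopIteration   # one pass only, e.g. for evaluation
        n = len(self.dataset)
        start = self._position
        indices = self._order[start:start + self.batch_size]
        self._position = start + self.batch_size
        self.is_new_epoch = self._position >= n
        if self.is_new_epoch:
            self.epoch += 1
            if self.repeat:
                # Complete the batch from the beginning of the next pass.
                overflow = self._position - n
                self._order = self._new_order()
                indices = indices + self._order[:overflow]
                self._position = overflow
        return [self.dataset[i] for i in indices]

    next = __next__


data = list(range(5))

test_iter = MiniSerialIterator(data, 2, repeat=False, shuffle=False)
batches = list(test_iter)
print(batches)  # [[0, 1], [2, 3], [4]]

train_iter = MiniSerialIterator(data, 2, repeat=True, shuffle=False)
first_three = [train_iter.next() for _ in range(3)]
print(first_three, train_iter.epoch, train_iter.is_new_epoch)
# [[0, 1], [2, 3], [4, 0]] 1 True
```

With repeat=False the iterator stops after one pass and must be reset before it can be used again, which is why an evaluation loop resets its test iterator after each pass.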
3. Define a network¶
Now let's define a neural network that we will train to classify the MNIST images. For simplicity, we use a three-layer perceptron here. We set each hidden layer to have 100 units and set the output layer to have 10 units, which corresponds to the number of class labels in MNIST.
Create your network as a subclass of Chain¶
You can create your network by writing a new subclass of Chain
.
The main steps are twofold:
Register the network components which have trainable parameters to the subclass. Each of them must be instantiated and assigned to a property in the scope specified by init_scope().
Define a forward() method that represents the actual forward computation of your network. This method takes one or more Variable, numpy.ndarray, or cupy.ndarray objects as its inputs and calculates the forward pass using them.
class MyNetwork(Chain):
    def __init__(self, n_mid_units=100, n_out=10):
        super(MyNetwork, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_mid_units)
            self.l2 = L.Linear(n_mid_units, n_mid_units)
            self.l3 = L.Linear(n_mid_units, n_out)
    def forward(self, x):
        h = F.relu(self.l1(x))
        h = F.relu(self.l2(h))
        return self.l3(h)
model = MyNetwork()
gpu_id = 0  # Set to -1 if you use CPU
if gpu_id >= 0:
    model.to_gpu(gpu_id)
Link
, Chain
, ChainList
, and those subclass objects which contain trainable parameters should be registered to the model by assigning it as a property inside the init_scope()
. For example, a FunctionNode
does not contain any trainable parameters, so there is no need to keep the object as a property of your network. When you want to use relu()
in your network, using it as a function in forward()
works correctly.
In Chainer, the Python code that implements the forward computation itself represents the network. In other words, we can conceptually think of the computation graph for our network being constructed dynamically as this forward computation code executes. This allows Chainer to describe networks in which different computations can be performed in each iteration, such as branched networks, intuitively and with a high degree of flexibility. This is the key feature of Chainer that we call Define-by-Run.
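The idea that the executed code determines the graph can be illustrated without any framework. In the sketch below, a hypothetical forward function records each operation it performs in a list named trace; the recorded sequence plays the role of the computation graph. This is only an analogy for the concept and does not use or reproduce Chainer's graph machinery.

```python
def forward(x, n_steps, trace):
    """Run a data-dependent computation and record each operation."""
    h = x
    for _ in range(n_steps):      # the loop length is decided at run time
        if h < 0:                 # the branch is decided by the data
            trace.append('negate')
            h = -h
        else:
            trace.append('halve')
            h = h / 2.0
    return h


trace_a, trace_b = [], []
out_a = forward(-4.0, 3, trace_a)
out_b = forward(8.0, 2, trace_b)
print(out_a, trace_a)  # 1.0 ['negate', 'halve', 'halve']
print(out_b, trace_b)  # 2.0 ['halve', 'halve']
```

Two calls to the same function produce two different operation sequences, which is what makes ordinary Python control flow usable inside a network definition.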
4. Select an optimization algorithm¶
Chainer provides a wide variety of optimization algorithms that can be used to optimize the network parameters during training. They are located in optimizers
module.
Here, we are going to use the stochastic gradient descent (SGD) method with momentum, which is implemented by MomentumSGD
. To use the optimizer, we give the network object (typically it’s a Chain
or ChainList
) to the setup()
method of the optimizer object to register it. In this way, the Optimizer
can automatically find the model parameters and update them during training.
You can easily try out other optimizers as well. Please test and observe the results of various optimizers. For example, you could try to change MomentumSGD
to Adam
,
RMSprop
, etc.
from chainer import optimizers
# Choose an optimizer algorithm
optimizer = optimizers.MomentumSGD(lr=0.01, momentum=0.9)
# Give the optimizer a reference to the model so that it
# can locate the model's parameters.
optimizer.setup(model)
Note
In the above example, we set lr
to 0.01 in the constructor. This value is known as the “learning rate”, one of the most important hyperparameters that need to be adjusted in order to obtain the best performance. The various optimizers may each have different hyperparameters and so be sure to check the documentation for the details.
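To show what the lr and momentum hyperparameters control, here is a scalar sketch of the classical momentum update rule. momentum_sgd_step is a hypothetical helper for illustration; it assumes the common formulation in which a velocity term is decayed by the momentum and pushed against the gradient, and it is not taken from the optimizer's source code.

```python
def momentum_sgd_step(param, grad, velocity, lr=0.01, momentum=0.9):
    """One update of a single scalar parameter (illustrative sketch)."""
    velocity = momentum * velocity - lr * grad   # decay, then push downhill
    return param + velocity, velocity


w, v = 1.0, 0.0
w, v = momentum_sgd_step(w, grad=2.0, velocity=v)
print(round(w, 6), round(v, 6))   # 0.98 -0.02
w, v = momentum_sgd_step(w, grad=2.0, velocity=v)
print(round(w, 6), round(v, 6))   # 0.942 -0.038
```

With a constant gradient, the second step is larger than the first because the velocity accumulates; a larger lr scales every step, while a larger momentum makes past gradients persist longer.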
5. Write a training loop¶
We now show how to write the training loop. Since we are working on a digit classification problem, we will use
softmax_cross_entropy()
as the loss function for the optimizer to minimize. For other types of problems, such as regression models, other loss functions might be more appropriate. See the Chainer documentation for detailed information on the various loss functions.
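For intuition about what this loss measures, the following plain-Python sketch computes softmax cross entropy for a small batch by hand. The function below is an illustration written for this section, assuming the loss is averaged over the minibatch; it is not the library implementation and works on Python lists rather than arrays.

```python
import math


def softmax_cross_entropy(logits_batch, labels):
    """Mean over the batch of -log(softmax(logits)[label])."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        m = max(logits)   # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        losses.append(log_sum_exp - logits[label])
    return sum(losses) / len(losses)


# Two equal scores: the model is undecided, so the loss is log(2).
undecided = softmax_cross_entropy([[0.0, 0.0]], [0])
print(abs(undecided - math.log(2)) < 1e-12)   # True

# A confident and correct prediction gives a loss close to zero.
confident = softmax_cross_entropy([[100.0, 0.0]], [0])
print(confident < 1e-12)                      # True
```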
Our training loop will be structured as follows.
We will first get a minibatch of examples from the training dataset.
We will then feed the batch into our network by calling it (a Chain object) like a function. This will execute the forward-pass code that is written in the forward() method.
This will return the network output that represents class label predictions. We supply it to the loss function along with the true (that is, target) values. The loss function will output the loss as a Variable object.
We then clear any previous gradients in the network and perform the backward pass by calling the backward() method on the loss variable, which computes the parameter gradients. We need to clear the gradients first because the backward() method accumulates gradients instead of overwriting the previous values.
Since the optimizer already has a reference to the network, it has access to the parameters and the computed gradients, so we can now call the update() method of the optimizer, which will update the model parameters.
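The reason for clearing gradients before each backward pass can be seen in a tiny scalar autograd sketch. The Var class below is a hypothetical toy written for this explanation, supporting only addition and multiplication; it mimics the accumulate-on-backward behavior described above and is not Chainer's Variable.

```python
class Var:
    """Scalar value that remembers how it was computed (toy sketch)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents   # pairs of (parent, local derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, upstream=1.0):
        # Gradients are ADDED to whatever is already stored.
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)


w = Var(3.0)
x = Var(2.0)

(w * x).backward()
first = w.grad          # d(w*x)/dw = x = 2.0

(w * x).backward()      # second pass without clearing
accumulated = w.grad    # 2.0 + 2.0 = 4.0

w.grad = 0.0            # plays the role of clearing the gradients
(w * x).backward()
cleared = w.grad        # back to 2.0

print(first, accumulated, cleared)  # 2.0 4.0 2.0
```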
In addition to the above steps, you might want to check the performance of the network with a validation dataset. This allows you to observe how well it is generalized to new data so far, namely, you can check whether it is overfitting to the training data. The code below checks the performance on the test set at the end of each epoch. The code has the same structure as the training code except that no backpropagation is performed and we also compute the accuracy on the test data using the accuracy()
function.
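As a reference for what the accuracy value means, classification accuracy is the fraction of examples whose highest-scoring class equals the target label. The helper below is a plain-Python sketch written for this explanation, operating on lists of scores; it is not the library function.

```python
def accuracy(scores_batch, labels):
    """Fraction of rows whose argmax equals the corresponding label."""
    correct = 0
    for scores, label in zip(scores_batch, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


scores = [[0.1, 0.9],   # predicts class 1
          [0.8, 0.2],   # predicts class 0
          [0.3, 0.7]]   # predicts class 1
value = accuracy(scores, [1, 0, 0])
print(round(value, 4))  # 0.6667
```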
The training loop code is as follows:
import numpy as np
from chainer.dataset import concat_examples
from chainer.backends.cuda import to_cpu
max_epoch = 10
while train_iter.epoch < max_epoch:
    # ---------- One iteration of the training loop ----------
    train_batch = train_iter.next()
    image_train, target_train = concat_examples(train_batch, gpu_id)
    # Calculate the prediction of the network
    prediction_train = model(image_train)
    # Calculate the loss with softmax_cross_entropy
    loss = F.softmax_cross_entropy(prediction_train, target_train)
    # Calculate the gradients in the network
    model.cleargrads()
    loss.backward()
    # Update all the trainable parameters
    optimizer.update()
    # --------------------- until here ---------------------
    # Check the validation accuracy of prediction after every epoch
    if train_iter.is_new_epoch:  # If this iteration is the final iteration of the current epoch
        # Display the training loss
        print('epoch:{:02d} train_loss:{:.04f} '.format(
            train_iter.epoch, float(to_cpu(loss.array))), end='')
        test_losses = []
        test_accuracies = []
        for test_batch in test_iter:
            image_test, target_test = concat_examples(test_batch, gpu_id)
            # Forward the test data
            prediction_test = model(image_test)
            # Calculate the loss
            loss_test = F.softmax_cross_entropy(prediction_test, target_test)
            test_losses.append(to_cpu(loss_test.array))
            # Calculate the accuracy
            accuracy = F.accuracy(prediction_test, target_test)
            accuracy.to_cpu()
            test_accuracies.append(accuracy.array)
        test_iter.reset()
        print('val_loss:{:.04f} val_accuracy:{:.04f}'.format(
            np.mean(test_losses), np.mean(test_accuracies)))
Output¶
epoch:01 train_loss:0.8072 val_loss:0.7592 val_accuracy:0.8289
epoch:02 train_loss:0.5021 val_loss:0.4467 val_accuracy:0.8841
epoch:03 train_loss:0.3539 val_loss:0.3673 val_accuracy:0.9007
epoch:04 train_loss:0.2524 val_loss:0.3307 val_accuracy:0.9067
epoch:05 train_loss:0.4232 val_loss:0.3076 val_accuracy:0.9136
epoch:06 train_loss:0.3033 val_loss:0.2910 val_accuracy:0.9167
epoch:07 train_loss:0.2004 val_loss:0.2773 val_accuracy:0.9222
epoch:08 train_loss:0.2885 val_loss:0.2679 val_accuracy:0.9239
epoch:09 train_loss:0.2818 val_loss:0.2579 val_accuracy:0.9266
epoch:10 train_loss:0.2403 val_loss:0.2484 val_accuracy:0.9307
6. Save the trained model¶
Chainer provides two types of serializers
that can be used to save and restore model state. One supports the HDF5 format and the other supports the NumPy NPZ format. For this example, we are going to use the NPZ
format to save our model since it is easy to use with NumPy and does not require installing any additional dependencies or libraries.
serializers.save_npz('my_mnist.model', model)
7. Perform classification by the saved model¶
Let’s use the saved model to classify a new image. In order to load the trained model parameters, we need to perform the following two steps:
Instantiate the same network as the one you trained.
Overwrite all parameters in the model instance with the saved weights using the
load_npz()
function.
Once the model is restored, it can be used to predict image labels on new input data.
from chainer import serializers
# Create an instance of the network you trained
model = MyNetwork()
# Load the saved parameters into the instance
serializers.load_npz('my_mnist.model', model)
# Get a test image and label
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.savefig('7.png')
print('label:', t)
label: 7
The saved test image looks like:
# Change the shape of the minibatch.
# In this example, the size of minibatch is 1.
# Inference using any minibatch size can be performed.
print(x.shape, end=' -> ')
x = x[None, ...]
print(x.shape)
# Forward calculation of the model by sending X
y = model(x)
# The result is given as a Variable, so we can look at its contents through the .array attribute.
y = y.array
# Look up the most probable digit number using argmax
pred_label = y.argmax(axis=1)
print('predicted label:', pred_label[0])
(784,) -> (1, 784)
predicted label: 7
The prediction result looks correct. Yay!
Convolutional Network for Visual Recognition Tasks¶
In this section, you will learn how to write
A small convolutional network with a model class that is inherited from Chain,
A large convolutional network that has several building block networks with ChainList.
After reading this section, you will be able to:
Write your own original convolutional network in Chainer
A convolutional network (ConvNet) is mainly comprised of convolutional layers. This type of network is commonly used for various visual recognition tasks, e.g., classifying handwritten digits or natural images into given object classes, detecting objects from an image, and labeling all pixels of an image with the object classes (semantic segmentation), and so on.
In such tasks, a typical ConvNet takes a set of images whose shape is \((N, C, H, W)\), where
\(N\) denotes the number of images in a minibatch,
\(C\) denotes the number of channels of those images,
\(H\) and \(W\) denote the height and width of those images,
respectively. Then, it typically outputs a fixed-sized vector as membership probabilities over the target object classes. It also can output a set of feature maps that have the corresponding size to the input image for a pixel labeling task, etc.
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
LeNet5¶
Here, let's start by defining LeNet5 [LeCun98] in Chainer. In this example, we show a simplified version of LeNet5 introduced in Deep Learning Tutorials. This is a ConvNet model that has 5 layers comprised of 3 convolutional layers and 2 fully-connected layers. This was proposed to classify hand-written digit images in 1998. In Chainer, the model can be written as follows:
class LeNet5(Chain):
    def __init__(self):
        super(LeNet5, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(
                in_channels=1, out_channels=6, ksize=5, stride=1)
            self.conv2 = L.Convolution2D(
                in_channels=6, out_channels=16, ksize=5, stride=1)
            self.conv3 = L.Convolution2D(
                in_channels=16, out_channels=120, ksize=4, stride=1)
            self.fc4 = L.Linear(None, 84)
            self.fc5 = L.Linear(84, 10)

    def forward(self, x):
        h = F.sigmoid(self.conv1(x))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv2(h))
        h = F.max_pooling_2d(h, 2, 2)
        h = F.sigmoid(self.conv3(h))
        h = F.sigmoid(self.fc4(h))
        if chainer.config.train:
            return self.fc5(h)
        return F.softmax(self.fc5(h))
A typical way to write your network is creating a new class inherited from
Chain
class. When defining your model in this way, typically,
all the layers which have trainable parameters are registered to the model
by assigning the objects of Link
as an attribute.
The model class is instantiated before the forward and backward computations.
To give input images and label vectors simply by calling the model object
like a function, forward()
is usually defined in the model class.
This method performs the forward computation of the model. Chainer uses
the powerful autograd system for any computational graphs written with
FunctionNode
s and Link
s (actually a
Link
calls a corresponding FunctionNode
inside of it), so that you don’t need to explicitly write the code for backward
computations in the model. Just prepare the data, then give it to the model.
This works because the output Variable resulting from the forward computation has a backward() method that performs autograd. In the above model, forward() has an if statement at the end to switch its behavior according to Chainer’s running mode, i.e., training mode or not. Chainer exposes the running mode as a global configuration value, chainer.config.train.
When it’s in training mode, forward()
returns the output value of the
last layer as is to compute the loss later on, otherwise it returns a
prediction result by calculating softmax()
.
It is recommended that you use the global configuration chainer.config.train
to switch the running mode.
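As a sanity check on the layer sizes in LeNet5 above, the spatial size of the feature maps can be traced by hand. The sketch below is plain Python, not Chainer code. The helper names conv_out and pool_out are hypothetical, the input is assumed to be a 28x28 single-channel image, and the pooling formula assumes that max_pooling_2d keeps partially covered windows (its cover_all=True default).

```python
def conv_out(size, ksize, stride=1, pad=0):
    # Output length of a convolution along one spatial axis.
    return (size + 2 * pad - ksize) // stride + 1


def pool_out(size, ksize, stride, pad=0):
    # Output length of pooling when partially covered windows are kept
    # (assumed to match the cover_all=True default of max_pooling_2d).
    return (size + 2 * pad - ksize + stride - 1) // stride + 1


size = 28                        # assumed input height/width
size = conv_out(size, 5)         # conv1 -> 24
size = pool_out(size, 2, 2)      # pool  -> 12
size = conv_out(size, 5)         # conv2 -> 8
size = pool_out(size, 2, 2)      # pool  -> 4
size = conv_out(size, 4)         # conv3 -> 1
fc4_inputs = 120 * size * size   # flattened features seen by fc4
print(size, fc4_inputs)          # 1 120
```

Under these assumptions, L.Linear(None, 84) would infer an input size of 120 on the first forward pass.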
If you don’t want to write conv1 and the other layers more than once, you can also write the same model this way:
from functools import partial

class LeNet5(Chain):
    def __init__(self):
        super(LeNet5, self).__init__()
        net = [('conv1', L.Convolution2D(1, 6, 5, 1))]
        net += [('_sigm1', F.sigmoid)]
        net += [('_mpool1', partial(F.max_pooling_2d, ksize=2, stride=2))]
        net += [('conv2', L.Convolution2D(6, 16, 5, 1))]
        net += [('_sigm2', F.sigmoid)]
        net += [('_mpool2', partial(F.max_pooling_2d, ksize=2, stride=2))]
        net += [('conv3', L.Convolution2D(16, 120, 4, 1))]
        net += [('_sigm3', F.sigmoid)]
        net += [('_mpool3', partial(F.max_pooling_2d, ksize=2, stride=2))]
        net += [('fc4', L.Linear(None, 84))]
        net += [('_sigm4', F.sigmoid)]
        net += [('fc5', L.Linear(84, 10))]
        net += [('_sigm5', F.sigmoid)]
        with self.init_scope():
            for n in net:
                if not n[0].startswith('_'):
                    setattr(self, n[0], n[1])
        self.layers = net

    def forward(self, x):
        for n, f in self.layers:
            if not n.startswith('_'):
                x = getattr(self, n)(x)
            else:
                x = f(x)
        if chainer.config.train:
            return x
        return F.softmax(x)
Note
You can also use Sequential
to write the above model
more simply. Please note that Sequential
is an
experimental feature introduced in Chainer v4 and its interface may
change in future versions.
This code creates a list of pairs, each consisting of a component name (e.g., conv1, _sigm1, etc.) and the corresponding Link or function (e.g., F.sigmoid, which internally invokes a FunctionNode), after calling its superclass’s constructor. In this case, components whose names start with _ are functions (FunctionNode), which don’t have any trainable parameters, so we don’t register (setattr) them to the model. The others (conv1, fc4, etc.) are Links, which are trainable layers that hold parameters. This naming convention is only a convenient way to pick out the Links from the list net, so it can be freely replaced with any other scheme.
The list net is stored as the attribute layers so that it can be referred to in forward(). In forward(), it retrieves all layers in the network from self.layers sequentially and gives the input variable or the intermediate output from the previous layer to the current layer. The last part of forward(), which switches its behavior by the training/inference mode, is the same as in the former way.
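The name-prefix dispatch described above can be tried without Chainer. The following is a minimal sketch in plain Python: TinyNet and its three steps are hypothetical stand-ins, with ordinary callables playing the role of Links (registered as attributes) and of parameter-free functions (names starting with an underscore).

```python
class TinyNet:
    def __init__(self):
        # (name, callable) pairs; names starting with '_' mark parameter-free steps.
        net = [('scale1', lambda v: 2 * v)]
        net += [('_relu1', lambda v: max(v, 0))]
        net += [('scale2', lambda v: v - 3)]
        for name, fn in net:
            if not name.startswith('_'):
                setattr(self, name, fn)   # stands in for Link registration
        self.layers = net

    def forward(self, x):
        for name, fn in self.layers:
            if not name.startswith('_'):
                x = getattr(self, name)(x)   # registered component
            else:
                x = fn(x)                    # plain function
        return x


net = TinyNet()
print(net.forward(5))    # 7
print(net.forward(-5))   # -3
```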
Ways to calculate loss¶
When you train the model with label vector t
, the loss should be calculated
using the output from the model. There also are several ways to calculate the
loss:
model = LeNet5()
# Input data and label
x = np.random.rand(32, 1, 28, 28).astype(np.float32)
t = np.random.randint(0, 10, size=(32,)).astype(np.int32)
# Forward computation
y = model(x)
# Loss calculation
loss = F.softmax_cross_entropy(y, t)
This is a primitive way to calculate a loss value from the output of the model.
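To make the quantity concrete, here is a plain-Python sketch of a softmax cross entropy averaged over the minibatch. It is an illustration of the formula, not Chainer's implementation, and it assumes mean reduction over the batch. It also shows why the model returns the raw output of the last layer in training mode: the softmax is applied inside the loss.

```python
import math


def softmax_cross_entropy(logits, labels):
    # logits: list of rows of raw scores; labels: list of integer class ids.
    total = 0.0
    for row, t in zip(logits, labels):
        m = max(row)                                    # subtract max for stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]                         # -log softmax(row)[t]
    return total / len(labels)                          # mean over the minibatch


uniform = [[0.0] * 10 for _ in range(4)]
loss = softmax_cross_entropy(uniform, [3, 1, 4, 1])
print(abs(loss - math.log(10)) < 1e-12)   # True
```

With uniform scores over 10 classes, every class has probability 1/10, so the loss equals log(10).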
On the other hand, the loss computation can be included in the model itself by
wrapping the model object (Chain
or
ChainList
object) with a class inherited from
Chain
. The outer Chain
should take the
model defined above and register it with init_scope()
.
Chain is actually inherited from Link, so a Chain itself can also be registered as a trainable Link to another Chain. In fact, there is already a Classifier class that can be used to wrap the model and include the loss computation as well. It can be used like this:
model = L.Classifier(LeNet5())
# Forward & loss calculation
loss = model(x, t)
This class takes a model object as an input argument and registers it as its predictor attribute, so that the model's parameters are trained through the wrapper. As shown above, the returned object can then be called like a function to which we pass x and t as the input arguments, and the resulting loss value (which we recall is a Variable) is returned.
See the detailed implementation of Classifier
from
here: chainer.links.Classifier
and check the implementation by looking
at the source.
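The wrapping idea can be sketched in a few lines of plain Python. ToyClassifier below is a hypothetical stand-in, not the real chainer.links.Classifier: it only shows how a wrapper that holds a predictor and a loss function turns a call with (x, t) into a loss value.

```python
class ToyClassifier:
    """Wraps a predictor and a loss function (hypothetical stand-in)."""

    def __init__(self, predictor, lossfun):
        self.predictor = predictor
        self.lossfun = lossfun

    def __call__(self, x, t):
        y = self.predictor(x)        # forward pass of the wrapped model
        return self.lossfun(y, t)    # the loss is the wrapper's output


def squared_error(y, t):
    return (y - t) ** 2


model = ToyClassifier(predictor=lambda x: 3 * x, lossfun=squared_error)
print(model(2, 5))   # 1
```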
From the above examples, we can see that Chainer provides the flexibility to write our original network in many different ways. Such flexibility intends to make it intuitive for users to design new and complex models.
VGG16¶
Next, let’s write some larger models in Chainer. When you write a large network
consisting of several building block networks, ChainList
is
useful. First, let’s see how to write a VGG16 [Simonyan14] model.
class VGG16(chainer.ChainList):
    def __init__(self):
        super(VGG16, self).__init__(
            VGGBlock(64),
            VGGBlock(128),
            VGGBlock(256, 3),
            VGGBlock(512, 3),
            VGGBlock(512, 3, True))

    def forward(self, x):
        for f in self.children():
            x = f(x)
        if chainer.config.train:
            return x
        return F.softmax(x)


class VGGBlock(chainer.Chain):
    def __init__(self, n_channels, n_convs=2, fc=False):
        w = chainer.initializers.HeNormal()
        super(VGGBlock, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, n_channels, 3, 1, 1, initialW=w)
            self.conv2 = L.Convolution2D(
                n_channels, n_channels, 3, 1, 1, initialW=w)
            if n_convs == 3:
                self.conv3 = L.Convolution2D(
                    n_channels, n_channels, 3, 1, 1, initialW=w)
            if fc:
                self.fc4 = L.Linear(None, 4096, initialW=w)
                self.fc5 = L.Linear(4096, 4096, initialW=w)
                self.fc6 = L.Linear(4096, 1000, initialW=w)
        self.n_convs = n_convs
        self.fc = fc

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        if self.n_convs == 3:
            h = F.relu(self.conv3(h))
        h = F.max_pooling_2d(h, 2, 2)
        if self.fc:
            h = F.dropout(F.relu(self.fc4(h)))
            h = F.dropout(F.relu(self.fc5(h)))
            h = self.fc6(h)
        return h
That’s it. VGG16 is a model which won 1st place in the classification + localization task at ILSVRC 2014, and since then it has become one of the standard models for many different tasks as a pretrained model. It has 16 layers, so it’s called “VGG16”, but we can write this model without writing all layers independently. Since this model consists of several building blocks that have the same architecture, we can build the whole network by reusing the building block definition. Each part of the network consists of 2 or 3 convolutional layers, each followed by an activation function (relu()), and a max_pooling_2d() operation. This block is written as VGGBlock in the above example code. The whole network just calls these blocks one by one in a sequential manner.
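The "16" can be recovered by counting the parameterized layers in the block list. The short sketch below is plain Python. The tuple layout is a hypothetical summary of the VGGBlock arguments, and the 224x224 input size is an assumption used only to show what the first fully-connected layer would see.

```python
# (n_channels, n_convs, has_fc) for each VGGBlock passed to VGG16 above
blocks = [(64, 2, False), (128, 2, False), (256, 3, False),
          (512, 3, False), (512, 3, True)]

n_conv = sum(n_convs for _, n_convs, _ in blocks)
n_fc = sum(3 for _, _, has_fc in blocks if has_fc)   # fc4, fc5, fc6

size = 224          # assumed input height/width
for _ in blocks:
    size //= 2      # each block ends with a 2x2, stride-2 max pooling

print(n_conv, n_fc, n_conv + n_fc)   # 13 3 16
print(size, 512 * size * size)       # 7 25088
```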
ResNet152¶
How about ResNet? ResNet [He16] came in the following year’s ILSVRC. It is a much deeper model than VGG16, having up to 152 layers. This sounds super laborious to build, but it can be implemented in almost the same manner as VGG16. In other words, it’s easy. One possible way to write ResNet152 is:
class ResNet152(chainer.Chain):
    def __init__(self, n_blocks=[3, 8, 36, 3]):
        w = chainer.initializers.HeNormal()
        super(ResNet152, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, 64, 7, 2, 3, initialW=w, nobias=True)
            self.bn1 = L.BatchNormalization(64)
            self.res2 = ResBlock(n_blocks[0], 64, 64, 256, 1)
            self.res3 = ResBlock(n_blocks[1], 256, 128, 512)
            self.res4 = ResBlock(n_blocks[2], 512, 256, 1024)
            self.res5 = ResBlock(n_blocks[3], 1024, 512, 2048)
            self.fc6 = L.Linear(2048, 1000)

    def forward(self, x):
        h = self.bn1(self.conv1(x))
        h = F.max_pooling_2d(F.relu(h), 2, 2)
        h = self.res2(h)
        h = self.res3(h)
        h = self.res4(h)
        h = self.res5(h)
        h = F.average_pooling_2d(h, h.shape[2:], stride=1)
        h = self.fc6(h)
        if chainer.config.train:
            return h
        return F.softmax(h)


class ResBlock(chainer.ChainList):
    def __init__(self, n_layers, n_in, n_mid, n_out, stride=2):
        super(ResBlock, self).__init__()
        # The first bottleneck changes the channel count (and stride).
        self.add_link(BottleNeck(n_in, n_mid, n_out, stride, True))
        for _ in range(n_layers - 1):
            self.add_link(BottleNeck(n_out, n_mid, n_out))

    def forward(self, x):
        for f in self.children():
            x = f(x)
        return x


class BottleNeck(chainer.Chain):
    def __init__(self, n_in, n_mid, n_out, stride=1, proj=False):
        w = chainer.initializers.HeNormal()
        super(BottleNeck, self).__init__()
        with self.init_scope():
            self.conv1x1a = L.Convolution2D(
                n_in, n_mid, 1, stride, 0, initialW=w, nobias=True)
            self.conv3x3b = L.Convolution2D(
                n_mid, n_mid, 3, 1, 1, initialW=w, nobias=True)
            self.conv1x1c = L.Convolution2D(
                n_mid, n_out, 1, 1, 0, initialW=w, nobias=True)
            self.bn_a = L.BatchNormalization(n_mid)
            self.bn_b = L.BatchNormalization(n_mid)
            self.bn_c = L.BatchNormalization(n_out)
            if proj:
                self.conv1x1r = L.Convolution2D(
                    n_in, n_out, 1, stride, 0, initialW=w, nobias=True)
                self.bn_r = L.BatchNormalization(n_out)
        self.proj = proj

    def forward(self, x):
        h = F.relu(self.bn_a(self.conv1x1a(x)))
        h = F.relu(self.bn_b(self.conv3x3b(h)))
        h = self.bn_c(self.conv1x1c(h))
        if self.proj:
            x = self.bn_r(self.conv1x1r(x))
        return F.relu(h + x)
In the BottleNeck class, depending on the value of the proj argument supplied to the initializer, forward() conditionally applies a convolutional layer conv1x1r, which extends the number of channels of the input x to be equal to the number of channels of the output of conv1x1c, followed by a batch normalization layer, before the final ReLU. Writing the building block in this way improves the reusability of the class. The flag switches not only the behavior in forward() but also the parameter registration. In this case, when proj is False, the BottleNeck doesn’t have the conv1x1r and bn_r layers, so it uses less memory than it would if it registered both anyway and just ignored them when proj is False.
Using nested Chains and ChainList for the sequential parts enables us to write complex and very deep models easily.
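The depth in the model name follows from the block counts. As a small plain-Python check (resnet_depth is a hypothetical helper, and the block counts for the 50- and 101-layer variants are the commonly cited ones, stated here as assumptions), each BottleNeck contributes three convolutions on its main path, and conv1 and fc6 add two more:

```python
def resnet_depth(n_blocks):
    # Each BottleNeck contributes 3 convolutions on the main path;
    # conv1 and fc6 add 2 more. Projection shortcuts are not counted.
    return 3 * sum(n_blocks) + 2


print(resnet_depth([3, 8, 36, 3]))   # 152
print(resnet_depth([3, 4, 23, 3]))   # 101 (assumed block counts)
print(resnet_depth([3, 4, 6, 3]))    # 50  (assumed block counts)
```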
Use Pretrained Models¶
Various ways to write your models were described above. It turns out that VGG16 and ResNet are very useful as general feature extractors for many kinds of tasks, including but not limited to image classification. So, Chainer provides you with the pretrained VGG16 and ResNet50/101/152 models with a simple API. You can use these models as follows:
from chainer.links import VGG16Layers
model = VGG16Layers()
When VGG16Layers
is instantiated, the pretrained
parameters are automatically downloaded from the author’s server. So you can
immediately start to use VGG16 with pretrained weight as a good image feature
extractor. See the details of this model here:
chainer.links.VGG16Layers
.
In the case of ResNet models, there are three variations differing in the number
of layers. We have chainer.links.ResNet50Layers
,
chainer.links.ResNet101Layers
, and chainer.links.ResNet152Layers
models
with an easy parameter loading feature. ResNet’s pretrained parameters are not available for direct downloading, so you need to download the weights from the author’s web page first, and then place them into the directory $CHAINER_DATASET_ROOT/pfnet/chainer/models or your favorite place. Once the preparation is finished, the usage is the same as VGG16:
from chainer.links import ResNet152Layers
model = ResNet152Layers()
Traceback (most recent call last):
OSError: The pretrained caffemodel does not exist. Please download it from 'https://github.com/KaimingHe/deep-residual-networks', and place it on ...
Please see the details of usage and how to prepare the pretrained weights for
ResNet here: chainer.links.ResNet50Layers
References¶
LeCun98
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324, 1998.
Simonyan14
Simonyan, K. and Zisserman, A., Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
He16
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
DCGAN: Generate images with Deep Convolutional GAN¶
0. Introduction¶
In this tutorial, we generate images with generative adversarial networks (GAN). GAN are a kind of deep neural network for generative modeling that is often applied to image generation. GAN-based models are also used in PaintsChainer, an automatic colorization service.
In this tutorial, you will learn the following things:
Generative Adversarial Networks (GAN)
Implementation of DCGAN in Chainer
1. Generative Adversarial Networks (GAN)¶
1.1 What are GAN?¶
As explained in the GAN tutorial at NIPS 2016 [1], generative models can be classified into the categories shown in the following figure:
Besides GAN, other famous generative models include Fully visible belief networks (FVBNs) and Variational autoencoder (VAE). Unlike FVBNs and VAE, GAN do not explicitly model the probability distribution \(p({\bf s})\) that generates training data. Instead, we model a generator \(G: {\bf z} \mapsto {\bf s}\). The generator \(G\) samples \({\bf s} \sim p({\bf s})\) from the latent variable \({\bf z}\). Apart from the generator \(G\), we create a discriminator \(D({\bf x})\) which discriminates between samples from the generator G and examples from training data. While training the discriminator \(D\), the generator \(G\) tries to maximize the probability of the discriminator \(D\) making a mistake. So, the generator \(G\) tries to create samples that seem to be drawn from the same distribution as the training data.
The advantages of GAN are low sampling cost and state-of-the-art performance in image generation. The disadvantage is that we cannot calculate the likelihood \(p_{\mathrm {model}}({\bf s})\) because we do not model any probability distribution, and we cannot infer the latent variable \({\bf z}\) from a sample.
1.2 How do GAN work?¶
As explained above, GAN use two models, the generator and the discriminator. When training the networks, we should match the data distribution \(p({\bf s})\) with the distribution of the samples \({\bf s} = G ({\bf z})\) generated from the generator.
The generator \(G\) learns the target distribution, and ideally eventually reaches a Nash equilibrium [2] of game theory. In detail, while training the discriminator \(D\), the generator \(G\) is also trained, so that the discriminator \(D\) makes a mistake.
As an intuitive example, the relationship between counterfeiters of banknotes and the police is frequently used. The counterfeiters try to make counterfeit notes that look like real banknotes. The police try to distinguish real banknotes from counterfeit notes. Suppose that the ability of the police gradually rises, so that real banknotes and counterfeit notes can be told apart well. Then, the counterfeiters will not be able to use their counterfeit banknotes, so they will create counterfeit banknotes that appear more realistic. As the police improve their skill further, they can again distinguish real and counterfeit notes… Eventually, the counterfeiters will be able to produce counterfeit banknotes that look as real as genuine ones.
The training process is explained by the following mathematical expressions. First, since the discriminator \(D({\bf s})\) is the probability that a sample \({\bf s}\) is generated from the data distribution, it can be expressed as follows:
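Assuming that real and generated samples are presented to the discriminator equally often, the usual form of this probability is:

```latex
\[
D({\bf s}) = \frac{p({\bf s})}{p({\bf s}) + p_{\mathrm{model}}({\bf s})}
\]
```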
Then, when we match the data distribution \({\bf s} \sim p({\bf s})\) and the distribution of generated samples by \(G\), it means that we should minimize the dissimilarity between the two distributions. It is common to use the Jensen-Shannon divergence \(D_{\mathrm{JS}}\) to measure the dissimilarity between distributions [3].
The \(D_{\mathrm{JS}}\) of \(p_{\mathrm{model}}({\bf s})\) and \(p({\bf s})\) can be written as follows by using \(D({\bf s})\):
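Assuming the discriminator is optimal, i.e. \(D({\bf s}) = p({\bf s}) / (p({\bf s}) + p_{\mathrm{model}}({\bf s}))\), the standard identity can be written as:

```latex
\[
\begin{aligned}
2 D_{\mathrm{JS}}
&= D_{\mathrm{KL}}\bigl(p({\bf s}) \,\|\, \bar{p}({\bf s})\bigr)
 + D_{\mathrm{KL}}\bigl(p_{\mathrm{model}}({\bf s}) \,\|\, \bar{p}({\bf s})\bigr) \\
&= \mathbb{E}_{p({\bf s})}\!\left[\log \frac{2 p({\bf s})}{p({\bf s}) + p_{\mathrm{model}}({\bf s})}\right]
 + \mathbb{E}_{p_{\mathrm{model}}({\bf s})}\!\left[\log \frac{2 p_{\mathrm{model}}({\bf s})}{p({\bf s}) + p_{\mathrm{model}}({\bf s})}\right] \\
&= \mathbb{E}_{p({\bf s})}\bigl[\log D({\bf s})\bigr]
 + \mathbb{E}_{p_{\mathrm{model}}({\bf s})}\bigl[\log (1 - D({\bf s}))\bigr] + 2 \log 2 \\
&= \mathbb{E}_{p({\bf s})}\bigl[\log D({\bf s})\bigr]
 + \mathbb{E}_{p_{\bf z}}\bigl[\log (1 - D(G({\bf z})))\bigr] + 2 \log 2
\end{aligned}
\]
```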
where \(\bar{p}({\bf s}) = \frac{p({\bf s}) + p_{\mathrm{model}}({\bf s})}{2}\). The \(D_{\mathrm{JS}}\) will be maximized by the discriminator \(D\) and minimized by the generator \(G\), namely, \(p_{\mathrm{model}}\). As a result, the distribution \(p_{\mathrm{model}}({\bf s})\) generated by \(G({\bf z})\) can match the data distribution \(p({\bf s})\).
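The relation between the Jensen-Shannon divergence and the discriminator output can be checked numerically on a small discrete example. The sketch below is plain Python with two made-up distributions. It assumes the optimal discriminator D(s) = p(s) / (p(s) + p_model(s)) and compares twice the divergence, computed directly from the two KL terms, with the expression E_p[log D] + E_p_model[log(1 - D)] + 2 log 2.

```python
import math

p = [0.5, 0.3, 0.2]   # toy data distribution
q = [0.1, 0.6, 0.3]   # toy model distribution


def kl(a, b):
    # Kullback-Leibler divergence between two discrete distributions.
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


p_bar = [(pi + qi) / 2 for pi, qi in zip(p, q)]
two_js_direct = kl(p, p_bar) + kl(q, p_bar)

d = [pi / (pi + qi) for pi, qi in zip(p, q)]   # optimal discriminator
two_js_via_d = (sum(pi * math.log(di) for pi, di in zip(p, d))
                + sum(qi * math.log(1 - di) for qi, di in zip(q, d))
                + 2 * math.log(2))

print(abs(two_js_direct - two_js_via_d) < 1e-12)   # True
```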
When we actually train the model, the above min-max problem is solved by alternately updating the discriminator \(D({\bf s})\) and the generator \(G({\bf z})\) [4]. The actual training procedures are described as follows:
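In outline, the alternating scheme is a loop that performs some number of discriminator updates followed by one generator update. The skeleton below is a plain-Python sketch: train_gan, its arguments, and the parameter k (discriminator steps per generator step) are hypothetical names, and the update callables are placeholders for real optimizer steps.

```python
def train_gan(update_discriminator, update_generator, n_iterations, k=1):
    # Alternate k discriminator updates with one generator update.
    for _ in range(n_iterations):
        for _ in range(k):
            update_discriminator()
        update_generator()


calls = []
train_gan(lambda: calls.append('D'), lambda: calls.append('G'),
          n_iterations=2, k=2)
print(calls)   # ['D', 'D', 'G', 'D', 'D', 'G']
```

The DCGAN updater shown later in this tutorial corresponds to k=1: one discriminator update and one generator update per minibatch.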
1.3 What are DCGAN?¶
In this section, we will introduce the model called DCGAN (Deep Convolutional GAN) proposed by Radford et al. [5]. As shown below, it is a model using a CNN (Convolutional Neural Network), as its name suggests.
In addition, although GAN are known for their difficulty in training, this paper introduces various techniques for successful training:
Convert max-pooling layers to convolution layers with larger or fractional strides
Convert fully connected layers to global average pooling layers in the discriminator
Use batch normalization layers in the generator and the discriminator
Use leaky ReLU activation functions in the discriminator
2. Implementation of DCGAN in Chainer¶
There is an example of DCGAN in the official repository of Chainer, so we will explain how to implement DCGAN based on this: chainer/examples/dcgan
2.1 Define the generator model¶
First, let’s define a network for the generator.
class Generator(chainer.Chain):
    def __init__(self, n_hidden, bottom_width=4, ch=512, wscale=0.02):
        super(Generator, self).__init__()
        self.n_hidden = n_hidden
        self.ch = ch
        self.bottom_width = bottom_width

        with self.init_scope():
            w = chainer.initializers.Normal(wscale)
            self.l0 = L.Linear(self.n_hidden, bottom_width * bottom_width * ch,
                               initialW=w)
            self.dc1 = L.Deconvolution2D(ch, ch // 2, 4, 2, 1, initialW=w)
            self.dc2 = L.Deconvolution2D(ch // 2, ch // 4, 4, 2, 1, initialW=w)
            self.dc3 = L.Deconvolution2D(ch // 4, ch // 8, 4, 2, 1, initialW=w)
            self.dc4 = L.Deconvolution2D(ch // 8, 3, 3, 1, 1, initialW=w)
            self.bn0 = L.BatchNormalization(bottom_width * bottom_width * ch)
            self.bn1 = L.BatchNormalization(ch // 2)
            self.bn2 = L.BatchNormalization(ch // 4)
            self.bn3 = L.BatchNormalization(ch // 8)

    def make_hidden(self, batchsize):
        # Sample the latent variable z uniformly from [-1, 1].
        # This excerpt assumes `import numpy` at module level.
        dtype = chainer.get_dtype()
        return numpy.random.uniform(-1, 1, (batchsize, self.n_hidden, 1, 1))\
            .astype(dtype)

    def forward(self, z):
        h = F.reshape(F.relu(self.bn0(self.l0(z))),
                      (len(z), self.ch, self.bottom_width, self.bottom_width))
        h = F.relu(self.bn1(self.dc1(h)))
        h = F.relu(self.bn2(self.dc2(h)))
        h = F.relu(self.bn3(self.dc3(h)))
        x = F.sigmoid(self.dc4(h))
        return x
When we make a network in Chainer, there are some conventions:
Define a network class which inherits Chain.
Make instances of chainer.links in the init_scope() of the initializer __init__.
Define network connections in the forward method by using the chainer.links instances and chainer.functions.
If you are not familiar with constructing a new network, please refer to this tutorial.
As we can see from the initializer __init__, the Generator uses deconvolution layers Deconvolution2D and batch normalization layers BatchNormalization. In forward, each layer is called and followed by relu, except the last layer, which is followed by sigmoid. Because the first argument of L.Deconvolution2D is the channel size of the input and the second is the channel size of the output, we can see that each of the first three deconvolution layers halves the channel size. When we construct Generator with ch=1024, the network is the same as in the above image.
Note
Be careful when passing the output of a fully connected layer to a convolution layer, because the convolutional layer needs additional dimensions for its inputs. As we can see in the first line of forward, the output of the fully connected layer is reshaped by reshape to add the dimensions of the channel, the width and the height of images.
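The shapes flowing through the Generator can be traced with the usual output-size formula for a transposed convolution. The sketch below is plain Python, not Chainer code. deconv_out is a hypothetical helper, and the (out_channels, ksize, stride, pad) tuples are read off the dc1 to dc4 definitions with the default ch=512 and bottom_width=4.

```python
def deconv_out(size, ksize, stride, pad):
    # Output length of a transposed convolution along one spatial axis.
    return stride * (size - 1) + ksize - 2 * pad


ch, size = 512, 4   # defaults: ch=512, bottom_width=4
shapes = [(ch, size)]
for out_ch, k, s, p in [(ch // 2, 4, 2, 1), (ch // 4, 4, 2, 1),
                        (ch // 8, 4, 2, 1), (3, 3, 1, 1)]:
    size = deconv_out(size, k, s, p)
    shapes.append((out_ch, size))

print(shapes)   # [(512, 4), (256, 8), (128, 16), (64, 32), (3, 32)]
```

Under these assumptions the reshaped 4x4 feature map grows to a 3-channel 32x32 image, which matches the size of the CIFAR-10 images used for training.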
2.2 Define the discriminator model¶
In addition, let’s define the network for the discriminator.
class Discriminator(chainer.Chain):
    def __init__(self, bottom_width=4, ch=512, wscale=0.02):
        w = chainer.initializers.Normal(wscale)
        super(Discriminator, self).__init__()
        with self.init_scope():
            self.c0_0 = L.Convolution2D(3, ch // 8, 3, 1, 1, initialW=w)
            self.c0_1 = L.Convolution2D(ch // 8, ch // 4, 4, 2, 1, initialW=w)
            self.c1_0 = L.Convolution2D(ch // 4, ch // 4, 3, 1, 1, initialW=w)
            self.c1_1 = L.Convolution2D(ch // 4, ch // 2, 4, 2, 1, initialW=w)
            self.c2_0 = L.Convolution2D(ch // 2, ch // 2, 3, 1, 1, initialW=w)
            self.c2_1 = L.Convolution2D(ch // 2, ch // 1, 4, 2, 1, initialW=w)
            self.c3_0 = L.Convolution2D(ch // 1, ch // 1, 3, 1, 1, initialW=w)
            self.l4 = L.Linear(bottom_width * bottom_width * ch, 1, initialW=w)
            self.bn0_1 = L.BatchNormalization(ch // 4, use_gamma=False)
            self.bn1_0 = L.BatchNormalization(ch // 4, use_gamma=False)
            self.bn1_1 = L.BatchNormalization(ch // 2, use_gamma=False)
            self.bn2_0 = L.BatchNormalization(ch // 2, use_gamma=False)
            self.bn2_1 = L.BatchNormalization(ch // 1, use_gamma=False)
            self.bn3_0 = L.BatchNormalization(ch // 1, use_gamma=False)

    def forward(self, x):
        device = self.device
        h = add_noise(device, x)
        h = F.leaky_relu(add_noise(device, self.c0_0(h)))
        h = F.leaky_relu(add_noise(device, self.bn0_1(self.c0_1(h))))
        h = F.leaky_relu(add_noise(device, self.bn1_0(self.c1_0(h))))
        h = F.leaky_relu(add_noise(device, self.bn1_1(self.c1_1(h))))
        h = F.leaky_relu(add_noise(device, self.bn2_0(self.c2_0(h))))
        h = F.leaky_relu(add_noise(device, self.bn2_1(self.c2_1(h))))
        h = F.leaky_relu(add_noise(device, self.bn3_0(self.c3_0(h))))
        return self.l4(h)
The Discriminator network is almost a mirror of the Generator network. However, there are minor differences:
Use leaky_relu as activation functions
Deeper than Generator
Add some noise to every intermediate output before giving it to the next layer
def add_noise(device, h, sigma=0.2):
    if chainer.config.train:
        xp = device.xp
        # TODO(niboshi): Support random.randn in ChainerX
        if device.xp is chainerx:
            fallback_device = device.fallback_device
            with chainer.using_device(fallback_device):
                randn = device.send(fallback_device.xp.random.randn(*h.shape))
        else:
            randn = xp.random.randn(*h.shape)
        return h + sigma * randn
    else:
        return h
2.3 Prepare dataset and iterator¶
Let’s retrieve the CIFAR-10 dataset by using Chainer’s dataset utility function get_cifar10. CIFAR-10 is a set of small natural images. Each example is an RGB color image of size 32x32. In the original images, each of the R, G, B values of a pixel is represented by a one-byte unsigned integer (i.e. from 0 to 255). This function rescales the pixel values into float values in [0, scale].
train, _ = chainer.datasets.get_cifar10(withlabel=False, scale=255.)
train_iter = chainer.iterators.SerialIterator(train, args.batchsize)
2.4 Prepare model and optimizer¶
Let’s make the instances of the generator and the discriminator.
gen = Generator(n_hidden=args.n_hidden)
dis = Discriminator()
gen.to_device(device) # Copy the model to the device
dis.to_device(device)
Next, let’s make optimizers for the models created above.
def make_optimizer(model, alpha=0.0002, beta1=0.5):
    optimizer = chainer.optimizers.Adam(alpha=alpha, beta1=beta1)
    optimizer.setup(model)
    optimizer.add_hook(
        chainer.optimizer_hooks.WeightDecay(0.0001), 'hook_dec')
    return optimizer

opt_gen = make_optimizer(gen)
opt_dis = make_optimizer(dis)
2.5 Prepare updater¶
GAN need two models: the generator and the discriminator. Usually, the default updaters predefined in Chainer take only one model. So, we need to define a custom updater for GAN training.
The definition of DCGANUpdater
is a little complicated. However, it just
minimizes the loss of the discriminator and that of the generator alternately.
As you can see in the class definition, DCGANUpdater inherits StandardUpdater. Since almost all necessary functions are already defined in StandardUpdater, we just override __init__ and update_core.
Note
Strictly speaking, we do not need to define loss_dis and loss_gen as separate methods, because they are called only in update_core. They are separated to improve readability.
class DCGANUpdater(chainer.training.updaters.StandardUpdater):

    def __init__(self, *args, **kwargs):
        self.gen, self.dis = kwargs.pop('models')
        super(DCGANUpdater, self).__init__(*args, **kwargs)

    def loss_dis(self, dis, y_fake, y_real):
        batchsize = len(y_fake)
        # Real samples should get large logits, fake samples small ones.
        L1 = F.sum(F.softplus(-y_real)) / batchsize
        L2 = F.sum(F.softplus(y_fake)) / batchsize
        loss = L1 + L2
        chainer.report({'loss': loss}, dis)
        return loss

    def loss_gen(self, gen, y_fake):
        batchsize = len(y_fake)
        # The generator wants the discriminator to give fakes large logits.
        loss = F.sum(F.softplus(-y_fake)) / batchsize
        chainer.report({'loss': loss}, gen)
        return loss

    def update_core(self):
        gen_optimizer = self.get_optimizer('gen')
        dis_optimizer = self.get_optimizer('dis')

        batch = self.get_iterator('main').next()
        device = self.device
        x_real = Variable(self.converter(batch, device)) / 255.

        gen, dis = self.gen, self.dis
        batchsize = len(batch)

        y_real = dis(x_real)

        z = Variable(device.xp.asarray(gen.make_hidden(batchsize)))
        x_fake = gen(z)
        y_fake = dis(x_fake)

        dis_optimizer.update(self.loss_dis, dis, y_fake, y_real)
        gen_optimizer.update(self.loss_gen, gen, y_fake)
In the initializer __init__, an additional keyword argument models is required, as you can see in the code below. Also, we use the keyword arguments iterator, optimizer and device. It should be noted that the optimizer argument takes a dictionary. The two different models require two different optimizers. To specify the different optimizers for the models, we give a dictionary, {'gen': opt_gen, 'dis': opt_dis}, to the optimizer argument.
In the DCGANUpdater
, you can access the iterator with self.get_iterator('main')
.
Also, you can access the optimizers with
self.get_optimizer('gen')
and self.get_optimizer('dis')
.
In update_core, the two loss functions loss_dis and loss_gen are minimized by the optimizers. In the first two lines, we access the optimizers. Then, we create the next minibatch of training data by self.get_iterator('main').next(), copy batch to the device by self.converter, and make it a Variable object. After that, we minimize the loss functions with the optimizers.
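The softplus terms in the two losses can be related to the usual log-probability form of the GAN objective. The sketch below is plain Python on two made-up logits. It assumes the discriminator loss is softplus(-y_real) + softplus(y_fake) and the generator loss is softplus(-y_fake), i.e. the common non-saturating formulation, and checks that these equal -log D(real) - log(1 - D(fake)) and -log D(fake) when D is the sigmoid of the logit.

```python
import math


def softplus(v):
    return math.log1p(math.exp(v))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


y_real, y_fake = 1.5, -0.7   # toy discriminator logits

loss_dis = softplus(-y_real) + softplus(y_fake)
loss_gen = softplus(-y_fake)

expected_dis = -math.log(sigmoid(y_real)) - math.log(1 - sigmoid(y_fake))
expected_gen = -math.log(sigmoid(y_fake))
print(abs(loss_dis - expected_dis) < 1e-12)   # True
print(abs(loss_gen - expected_gen) < 1e-12)   # True
```

Working on logits with softplus avoids computing the sigmoid and its logarithm separately, which is why the discriminator can end in a plain linear layer.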
Note
When defining update_core
, we may want to manipulate the
underlying array
of a Variable
with numpy
or cupy
library.
Note that the type of arrays on CPU is numpy.ndarray
, while the type
of arrays on GPU is cupy.ndarray
. However, users do not need to write
if
condition explicitly, because the appropriate array module can be
obtained by xp = chainer.backend.get_array_module(variable.array)
.
If variable
is on GPU, cupy
is assigned to xp
, otherwise
numpy
is assigned to xp
.
updater = DCGANUpdater(
    models=(gen, dis),
    iterator=train_iter,
    optimizer={
        'gen': opt_gen, 'dis': opt_dis},
    device=device)
2.6 Prepare trainer and run¶
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

snapshot_interval = (args.snapshot_interval, 'iteration')
display_interval = (args.display_interval, 'iteration')
trainer.extend(
    extensions.snapshot(filename='snapshot_iter_{.updater.iteration}.npz'),
    trigger=snapshot_interval)
trainer.extend(extensions.snapshot_object(
    gen, 'gen_iter_{.updater.iteration}.npz'), trigger=snapshot_interval)
trainer.extend(extensions.snapshot_object(
    dis, 'dis_iter_{.updater.iteration}.npz'), trigger=snapshot_interval)
trainer.extend(extensions.LogReport(trigger=display_interval))
trainer.extend(extensions.PrintReport([
    'epoch', 'iteration', 'gen/loss', 'dis/loss',
]), trigger=display_interval)
trainer.extend(extensions.ProgressBar(update_interval=10))
trainer.extend(
    out_generated_image(
        gen, dis,
        10, 10, args.seed, args.out),
    trigger=snapshot_interval)

trainer.run()
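The snapshot file names above are ordinary str.format templates that are filled in with the trainer object, so {.updater.iteration} is an attribute lookup on the first format argument. The sketch below reproduces this with a stand-in object (fake_trainer is hypothetical and only carries the attribute path the template needs):

```python
from types import SimpleNamespace

# Stand-in for a Trainer: only the attribute path used by the template exists.
fake_trainer = SimpleNamespace(updater=SimpleNamespace(iteration=1000))

name = 'snapshot_iter_{.updater.iteration}.npz'.format(fake_trainer)
print(name)   # snapshot_iter_1000.npz
```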
2.7 Start training¶
We can run the example as follows.
$ pwd
/root2chainer/chainer/examples/dcgan
$ python train_dcgan.py --gpu 0
GPU: 0
# Minibatch-size: 50
# n_hidden: 100
# epoch: 1000
epoch iteration gen/loss dis/loss ................] 0.01%
0 100 1.2292 1.76914
total [..................................................] 0.02%
this epoch [#########.........................................] 19.00%
190 iter, 0 epoch / 1000 epochs
10.121 iters/sec. Estimated time to finish: 1 day, 3:26:26.372445.
The results will be saved in the directory /root2chainer/chainer/examples/dcgan/result/
.
The image is generated by the generator trained for 1000 epochs, and the GIF image
on the top of this page shows generated images after every 10 epochs.
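The numbers in the progress output above can be cross-checked with a little arithmetic. The sketch below is plain Python and assumes that the CIFAR-10 training split has 50,000 images; the other values are taken from the log.

```python
n_train = 50000         # assumed size of the CIFAR-10 training split
batchsize = 50          # "Minibatch-size" in the log
n_epochs = 1000
iteration = 190
iters_per_sec = 10.121

iters_per_epoch = n_train // batchsize
total_iters = iters_per_epoch * n_epochs

print(iters_per_epoch)                              # 1000
print(100.0 * iteration / iters_per_epoch)          # 19.0
print(round(100.0 * iteration / total_iters, 2))    # 0.02
remaining_hours = (total_iters - iteration) / iters_per_sec / 3600
print(round(remaining_hours, 1))                    # 27.4
```

These values agree with the 19.00% shown for the current epoch, the 0.02% total progress, and the estimate of a little over one day and three hours.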
Recurrent Nets and their Computational Graph¶
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported.
import math
import numpy as np
import chainer
from chainer import backend
from chainer import backends
from chainer.backends import cuda
from chainer import Function, FunctionNode, gradient_check, report, training, utils, Variable
from chainer import datasets, initializers, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
In this section, you will learn how to write
recurrent nets with full backprop,
recurrent nets with truncated backprop,
evaluation of networks with little memory.
After reading this section, you will be able to:
Handle input sequences of variable length
Truncate upper stream of the network during forward computation
Use no-backprop mode to prevent network construction
Recurrent Nets¶
Recurrent nets are neural networks with loops. They are often used to learn from sequential input/output. Given an input stream \(x_1, x_2, \dots, x_t, \dots\) and the initial state \(h_0\), a recurrent net iteratively updates its state by \(h_t = f(x_t, h_{t-1})\), and at some or every point in time \(t\), it outputs \(y_t = g(h_t)\). If we expand the procedure along the time axis, it looks like a regular feed-forward network except that the same parameters are repeatedly used within the network.
Here we learn how to write a simple one-layer recurrent net. The task is language modeling: given a finite sequence of words, we want to predict the next word at each position without peeking at the successive words. Suppose there are 1,000 different word types, and that we use 100-dimensional real vectors to represent each word (a.k.a. word embedding).
Let’s start from defining the recurrent neural net language model (RNNLM) as a chain.
We can use the chainer.links.LSTM link, which implements a fully-connected stateful LSTM layer.
This link looks like an ordinary fully-connected layer.
On construction, you pass the input and output sizes to the constructor:
>>> l = L.LSTM(100, 50)
Then, calling this instance as l(x) executes one step of the LSTM layer:
>>> l.reset_state()
>>> x = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y = l(x)
Do not forget to reset the internal state of the LSTM layer before the forward computation! Every recurrent layer holds its internal state (i.e. the output of the previous call). At the first application of the recurrent layer, you must reset the internal state. Then, the next input can be directly fed to the LSTM instance:
>>> x2 = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y2 = l(x2)
Based on this LSTM link, let’s write our recurrent network as a new chain:
class RNN(Chain):
    def __init__(self):
        super(RNN, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(1000, 100)  # word embedding
            self.mid = L.LSTM(100, 50)  # the first LSTM layer
            self.out = L.Linear(50, 1000)  # the feed-forward output layer

    def reset_state(self):
        self.mid.reset_state()

    def forward(self, cur_word):
        # Given the current word ID, predict the next word.
        x = self.embed(cur_word)
        h = self.mid(x)
        y = self.out(h)
        return y

rnn = RNN()
model = L.Classifier(rnn)
optimizer = optimizers.SGD()
optimizer.setup(model)
Here EmbedID is a link for word embedding.
It converts input integers into the corresponding fixed-dimensional embedding vectors.
The last linear link out represents the feed-forward output layer.
The RNN chain implements a one-step-forward computation.
It does not handle sequences by itself, but we can use it to process sequences by feeding the items of a sequence to the chain one at a time.
Suppose we have a list of word variables x_list.
Then, we can compute loss values for the word sequence with a simple for loop.
def compute_loss(x_list):
    loss = 0
    for cur_word, next_word in zip(x_list, x_list[1:]):
        loss += model(cur_word, next_word)
    return loss
Of course, the accumulated loss is a Variable object with the full history of computation.
So we can just call its backward() method to compute gradients of the total loss with respect to the model parameters:
# Suppose we have a list of word variables x_list.
rnn.reset_state()
model.cleargrads()
loss = compute_loss(x_list)
loss.backward()
optimizer.update()
Or equivalently we can use compute_loss as a loss function:
rnn.reset_state()
optimizer.update(compute_loss, x_list)
Truncate the Graph by Unchaining¶
Learning from very long sequences is also a typical use case of recurrent nets. Suppose the input and state sequence is too long to fit into memory. In such cases, we often truncate the backpropagation into a short time range. This technique is called truncated backprop. It is heuristic, and it makes the gradients biased. However, this technique works well in practice if the time range is long enough.
How can we implement truncated backprop in Chainer?
Chainer has a smart mechanism to achieve truncation, called backward unchaining.
It is implemented in the Variable.unchain_backward() method.
Backward unchaining starts from the Variable object, and it chops the computation history backwards from the variable.
The chopped variables are disposed automatically (if they are not referenced explicitly from any other user object).
As a result, they are no longer a part of computation history, and are not involved in backprop anymore.
Let’s write an example of truncated backprop. Here we use the same network as the one used in the previous subsection. Suppose we are given a very long sequence, and we want to run backprop truncated at every 30 time steps. We can write truncated backprop using the model defined above:
loss = 0
count = 0
seqlen = len(x_list[1:])

rnn.reset_state()
for cur_word, next_word in zip(x_list, x_list[1:]):
    loss += model(cur_word, next_word)
    count += 1
    if count % 30 == 0 or count == seqlen:
        model.cleargrads()
        loss.backward()
        loss.unchain_backward()
        optimizer.update()
State is updated at model(), and the losses are accumulated to the loss variable.
Every 30 steps, backprop takes place at the accumulated loss.
Then, the unchain_backward() method is called, which deletes the computation history backward from the accumulated loss.
Note that the last state of model is not lost, since the RNN instance holds a reference to it.
The implementation of truncated backprop is simple, and since it involves no complicated tricks, we can generalize this method to different situations. For example, we can easily extend the above code to use different schedules between backprop timing and truncation length.
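To make the schedule concrete, the following sketch lists the step counts at which the truncated-backprop loop would run backward, unchain, and update. It only mirrors the loop condition; the helper name backprop_steps and its arguments are hypothetical and no Chainer calls are involved.

```python
def backprop_steps(seqlen, bprop_len):
    """Return the step counts at which backward/unchain/update would run."""
    steps = []
    for count in range(1, seqlen + 1):
        # Same condition as in the training loop: every bprop_len steps,
        # plus once at the very end of the sequence.
        if count % bprop_len == 0 or count == seqlen:
            steps.append(count)
    return steps

print(backprop_steps(70, 30))  # [30, 60, 70]
```

A different schedule, for example backprop every 10 steps but unchaining only every 30, could be expressed by using two separate conditions in place of the single one.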
Network Evaluation without Storing the Computation History¶
On evaluation of recurrent nets, there is typically no need to store the computation history. While unchaining enables us to walk through unlimited length of sequences with limited memory, it is a bit of a workaround.
As an alternative, Chainer provides an evaluation mode of forward computation which does not store the computation history.
This is enabled by just using the no_backprop_mode() context:
with chainer.no_backprop_mode():
    x_list = [Variable(...) for _ in range(100)]  # list of 100 words
    loss = compute_loss(x_list)
Note that we cannot call loss.backward() to compute the gradient here, since the variable created in the no-backprop context does not remember the computation history.
The no-backprop context is also useful for evaluating feed-forward networks with a reduced memory footprint.
We can combine a fixed feature extractor network and a trainable predictor network using no_backprop_mode().
For example, suppose we want to train a feed-forward network predictor_func, which is located on top of another fixed pre-trained network fixed_func.
We want to train predictor_func without storing the computation history for fixed_func.
This is simply done by the following code snippet (suppose x_data and y_data indicate input data and label, respectively):
with chainer.no_backprop_mode():
    x = Variable(x_data)
    feat = fixed_func(x)
y = predictor_func(feat)
y.backward()
At first, the input variable x is in no-backprop mode, so fixed_func does not memorize the computation history.
Then predictor_func is executed in backprop mode, i.e., with memorizing the history of computation.
Since the history of computation is only memorized between variables feat and y, the backward computation stops at the feat variable.
Making it with Trainer¶
The code above is written with the plain Function/Variable APIs. When we write a training loop, it is better to use Trainer, since we can then easily add functionality through extensions.
Before implementing it on Trainer, let’s clarify the training settings.
Here we use the Penn Tree Bank dataset as a set of sentences.
Each sentence is represented as a word sequence.
We concatenate all sentences into one long word sequence, in which each sentence is separated by a special word <eos>, which stands for “End of Sequence”.
This dataset is easily obtained by chainer.datasets.get_ptb_words().
This function returns train, validation, and test datasets, each of which is represented as a long array of integers.
Each integer represents a word ID.
Our task is to learn a recurrent neural net language model from the long word sequence. We use words in different locations to form minibatches. It means we maintain \(B\) indices pointing to different locations in the sequence, read from these indices at each iteration, and increment all indices after the read. Of course, when one index reaches the end of the whole sequence, we turn the index back to 0.
In order to implement this training procedure, we have to customize the following components of Trainer:
Iterator. Built-in iterators do not support reading from different locations and aggregating them into a minibatch.
Update function. The default update function does not support truncated BPTT.
When we write a dataset iterator dedicated to the dataset, the dataset implementation can be arbitrary; even the interface is not fixed.
On the other hand, the iterator must support the Iterator interface.
The important methods and attributes to implement are batch_size, epoch, epoch_detail, is_new_epoch, iteration, __next__, and serialize.
The following code is from the official example in the examples/ptb directory.
from __future__ import division
class ParallelSequentialIterator(chainer.dataset.Iterator):
    def __init__(self, dataset, batch_size, repeat=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.epoch = 0
        self.is_new_epoch = False
        self.repeat = repeat
        self.offsets = [i * len(dataset) // batch_size for i in range(batch_size)]
        self.iteration = 0

    def __next__(self):
        length = len(self.dataset)
        if not self.repeat and self.iteration * self.batch_size >= length:
            raise StopIteration
        cur_words = self.get_words()
        self.iteration += 1
        next_words = self.get_words()

        epoch = self.iteration * self.batch_size // length
        self.is_new_epoch = self.epoch < epoch
        if self.is_new_epoch:
            self.epoch = epoch

        return list(zip(cur_words, next_words))

    @property
    def epoch_detail(self):
        return self.iteration * self.batch_size / len(self.dataset)

    def get_words(self):
        return [self.dataset[(offset + self.iteration) % len(self.dataset)]
                for offset in self.offsets]

    def serialize(self, serializer):
        self.iteration = serializer('iteration', self.iteration)
        self.epoch = serializer('epoch', self.epoch)

train_iter = ParallelSequentialIterator(train, 20)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)
Although the code is slightly long, the idea is simple.
First, this iterator creates offsets pointing to positions equally spaced within the whole sequence.
The i-th example of each minibatch reads from the sequence at the i-th offset.
The iterator returns a list of tuples of the current words and the next words.
Each minibatch is converted to a tuple of integer arrays by the concat_examples function in the standard updater (see the previous tutorial).
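The offset logic can be tried on a toy sequence without Chainer. The sketch below is a simplified stand-in for the iterator: the generator name parallel_batches and the ten-element dataset are assumptions for illustration, and epoch bookkeeping is omitted.

```python
def parallel_batches(dataset, batch_size, n_iterations):
    """Yield minibatches of (current, next) pairs read from equally spaced offsets."""
    length = len(dataset)
    offsets = [i * length // batch_size for i in range(batch_size)]
    for iteration in range(n_iterations):
        cur = [dataset[(o + iteration) % length] for o in offsets]
        nxt = [dataset[(o + iteration + 1) % length] for o in offsets]
        yield list(zip(cur, nxt))

toy = list(range(10))  # stand-in for a long sequence of word IDs
for batch in parallel_batches(toy, batch_size=2, n_iterations=3):
    print(batch)
# [(0, 1), (5, 6)]
# [(1, 2), (6, 7)]
# [(2, 3), (7, 8)]
```

Each column of the printed output follows one contiguous stream of the sequence, which is what lets a stateful LSTM carry its state from one minibatch to the next.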
Backprop Through Time is implemented as follows.
class BPTTUpdater(training.updaters.StandardUpdater):

    def __init__(self, train_iter, optimizer, bprop_len):
        super(BPTTUpdater, self).__init__(train_iter, optimizer)
        self.bprop_len = bprop_len

    # The core part of the update routine can be customized by overriding.
    def update_core(self):
        loss = 0
        # When we pass one iterator and optimizer to StandardUpdater.__init__,
        # they are automatically named 'main'.
        train_iter = self.get_iterator('main')
        optimizer = self.get_optimizer('main')

        # Progress the dataset iterator for bprop_len words at each iteration.
        for i in range(self.bprop_len):
            # Get the next batch (a list of tuples of two word IDs)
            batch = train_iter.__next__()

            # Concatenate the word IDs to matrices and send them to the device
            # self.converter does this job
            # (it is chainer.dataset.concat_examples by default)
            x, t = self.converter(batch)

            # Compute the loss at this time step and accumulate it
            loss += optimizer.target(chainer.Variable(x), chainer.Variable(t))

        optimizer.target.cleargrads()  # Clear the parameter gradients
        loss.backward()  # Backprop
        loss.unchain_backward()  # Truncate the graph
        optimizer.update()  # Update the parameters

updater = BPTTUpdater(train_iter, optimizer, bprop_len)  # instantiation
In this case, we update the parameters on every bprop_len consecutive words.
The call of unchain_backward cuts the history of computation accumulated to the LSTM links.
The rest of the code for setting up Trainer is almost the same as the one given in the previous tutorial.
In this section we have demonstrated how to write recurrent nets in Chainer and some fundamental techniques to manage the history of computation (a.k.a. computational graph). The example in the examples/ptb directory implements truncated backprop learning of an LSTM language model from the Penn Treebank corpus. In the next section, we will review how to use GPU(s) in Chainer.
RNN Language Models¶
0. Introduction¶
A language model models the probability of generating natural language sentences or documents. You can use a language model to estimate how natural a sentence or a document is. Also, with a language model, you can generate new sentences or documents.
Let’s start with modeling the probability of generating sentences. We represent a sentence as \({\bf X} = ({\bf x}_0, {\bf x}_1, ..., {\bf x}_T)\), in which \({\bf x}_t\) is a one-hot vector. Generally, \({\bf x}_0\) is the one-hot vector of BOS (beginning of sentence), and \({\bf x}_T\) is that of EOS (end of sentence).
A language model models the probability of a word occurrence under the condition of its previous words in a sentence. Let \({\bf X}_{[i, j]}\) be \(({\bf x}_i, {\bf x}_{i+1}, ..., {\bf x}_j)\); the occurrence probability of sentence \(\bf X\) can then be represented as follows:
\[P({\bf X}) = P({\bf x}_0) \prod_{t=1}^{T} P({\bf x}_t \mid {\bf X}_{[0, t-1]})\]
So, the language model \(P({\bf X})\) can be decomposed into word probabilities conditioned on the previous words. In this tutorial, we model \(P({\bf x}_t \mid {\bf X}_{[0, t-1]})\) with a recurrent neural network to obtain a language model \(P({\bf X})\).
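As a small numerical illustration of this decomposition, the sketch below multiplies hypothetical conditional word probabilities by summing their logarithms, which is how such products are usually accumulated in practice. The probability values and the function name sentence_log_prob are made up for the example, and the BOS symbol is assumed to have probability 1.

```python
import math

def sentence_log_prob(cond_probs):
    """Sum log P(x_t | x_0 .. x_{t-1}) over the positions t = 1 .. T."""
    return sum(math.log(p) for p in cond_probs)

# Hypothetical conditional probabilities for a three-word continuation.
cond_probs = [0.5, 0.25, 0.5]
log_p = sentence_log_prob(cond_probs)
print(math.exp(log_p))  # approximately 0.0625 = 0.5 * 0.25 * 0.5
```

Working in log space avoids numerical underflow when sentences are long.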
1. Basic Idea of Recurrent Neural Net Language Model¶
1.1 Recurrent Neural Net Language Model¶
A recurrent neural net language model (RNNLM) is a type of neural net language model that contains RNNs in the network. Since an RNN can deal with variable-length inputs, it is suitable for modeling sequential data such as sentences in natural language.
We show one layer of an RNNLM with these parameters.
Symbol | Definition
--- | ---
\({\bf x}_t\) | the one-hot vector of the \(t\)-th word
\({\bf y}_t\) | the \(t\)-th output
\({\bf h}_t^{(i)}\) | the \(t\)-th hidden layer of the \(i\)-th layer
\({\bf p}_t\) | the next-word probability for the \(t\)-th word
\({\bf E}\) | Embedding matrix
\({\bf W}_h\) | Hidden layer matrix
\({\bf W}_o\) | Output layer matrix
The process to get a next word prediction from the \(t\)-th input word \({\bf x}_t\)¶
Get the embedding vector: \({\bf h}_t^{(0)} = {\bf E} {\bf x}_t\)
Calculate the hidden layer: \({\bf h}_t^{(1)} = {\rm tanh} \left( {\bf W}_h \left[ \begin{array}{cc} {\bf h}_t^{(0)} \\ {\bf h}_{t-1}^{(1)} \end{array} \right] \right)\)
Calculate the output layer: \({\bf y}_t = {\bf W}_o {\bf h}_t^{(1)}\)
Transform to probability: \({\bf p}_t = {\rm softmax}({\bf y}_t)\)
Note
Note that \(\rm tanh\) in the above equation is applied to the input vector element-wise.
Note that \(\left[ \begin{array}{cc} {\bf a} \\ {\bf b} \end{array} \right]\) denotes a concatenated vector of \({\bf a}\) and \({\bf b}\).
Note that \({\rm softmax}\) in the above equation converts an arbitrary real vector to a probability vector whose elements sum to \(1\).
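The four steps of the prediction process can be sketched with plain Python lists. This is a toy illustration, not Chainer code: the helper names matvec, softmax, and rnnlm_step, the tiny matrix sizes, and all weight values are assumptions chosen for readability.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(y):
    """Convert real-valued scores to probabilities that sum to 1."""
    m = max(y)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def rnnlm_step(x_t, h_prev, E, W_h, W_o):
    h0 = matvec(E, x_t)                    # 1. embedding of the one-hot input
    concat = h0 + h_prev                   # list concatenation [h0; h_prev]
    h1 = [math.tanh(v) for v in matvec(W_h, concat)]  # 2. hidden layer
    y = matvec(W_o, h1)                    # 3. output layer
    return softmax(y), h1                  # 4. probabilities, plus the new state

# Toy sizes (assumed): 3-word vocabulary, 2-dimensional embedding and state.
E = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]                 # 2 x 3
W_h = [[0.5, -0.2, 0.1, 0.3], [0.2, 0.4, -0.3, 0.1]]    # 2 x 4
W_o = [[0.3, -0.1], [0.2, 0.5], [-0.4, 0.1]]            # 3 x 2
p, h = rnnlm_step([0, 1, 0], [0.0, 0.0], E, W_h, W_o)
```

The returned h would be passed back in as h_prev for the next word, which is the state that a stateful LSTM link keeps internally.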
1.2 Perplexity (Evaluation of the language model)¶
Perplexity is the common evaluation metric for a language model. Generally, it measures how well the proposed probability model \(P_{\rm model}({\bf X})\) represents the target data \(P^*({\bf X})\). Let a validation dataset be \(D = \{{\bf X}^{(n)}\}_{n=1}^{|D|}\), which is a set of sentences, where the length of the \(n\)-th sentence is \(T^{(n)}\) and the vocabulary size of the dataset is \(|\mathcal{V}|\). The perplexity is then represented as follows:
We usually use \(b = 2\) or \(b = e\). The perplexity shows how varied the predicted distribution for the next word is. When a language model represents the dataset well, it should assign a high probability only to the correct next word, so that the log-likelihood should be high. In the above equation, the sign is reversed, so that a smaller perplexity means a better model.
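As a rough illustration of the metric, the sketch below computes a token-normalized perplexity \(b^{z}\) with \(z = -\frac{1}{N} \sum \log_b p\), where \(N\) is the number of predicted tokens and \(p\) is the probability the model assigned to each correct token. Normalizing by \(N\) and the helper name perplexity are assumptions made for this example rather than something taken from the PTB example code.

```python
import math

def perplexity(token_probs, base=math.e):
    """Compute base ** (-(1/N) * sum(log_base p)) over N predicted tokens."""
    n = len(token_probs)
    z = -sum(math.log(p, base) for p in token_probs) / n
    return base ** z

# A model that spreads its probability uniformly over 4 candidates
# has a perplexity close to 4, regardless of the base.
print(perplexity([0.25] * 8))
print(perplexity([0.25] * 8, base=2))
```

The result does not depend on the base \(b\), which is why \(b = 2\) and \(b = e\) are interchangeable; a model that always assigns probability 1 to the correct word reaches the minimum perplexity of 1.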
During training, we minimize the below cross entropy:
where \(\hat P\) is the empirical distribution of a sequence in the training dataset.
2. Implementation of Recurrent Neural Net Language Model¶
There is an example of an RNN language model in the official repository, so we will explain how to implement an RNNLM in Chainer based on that: examples/ptb
2.1 Model Overview¶
The RNNLM used in this notebook is depicted in the above figure. The symbols that appear in the figure are defined as follows:
Symbol | Definition
--- | ---
\({\bf x}_t\) | the one-hot vector of the \(t\)-th word
\({\bf y}_t\) | the \(t\)-th output
\({\bf h}_t^{(i)}\) | the \(t\)-th hidden layer of the \(i\)-th layer
\({\bf p}_t\) | the next-word probability for the \(t\)-th word
\({\bf E}\) | Embedding matrix
\({\bf W}_h\) | Hidden layer matrix
\({\bf W}_o\) | Output layer matrix
LSTMs (long short-term memory) are used for the connections between hidden layers. An LSTM is one of the major recurrent neural net modules. It is designed to retain long-term memory, so that it should be able to consider relationships between distant words, such as a word at the beginning of a sentence and one at the end. We also use Dropout before both the LSTMs and the linear transformation. Dropout is a regularization technique for preventing overfitting on the training dataset.
2.2 Step-by-step Implementation¶
2.2.1 Import Package¶
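First, let’s import the necessary packages.
from __future__ import division
import argparse
import sys
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions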
2.2.2 Define Training Settings¶
Define all training settings here.
parser = argparse.ArgumentParser()
parser.add_argument('--batchsize', '-b', type=int, default=20,
                    help='Number of examples in each mini-batch')
parser.add_argument('--bproplen', '-l', type=int, default=35,
                    help='Number of words in each mini-batch '
                         '(= length of truncated BPTT)')
parser.add_argument('--epoch', '-e', type=int, default=39,
                    help='Number of sweeps over the dataset to train')
parser.add_argument('--device', '-d', type=str, default='-1',
                    help='Device specifier. Either ChainerX device '
                         'specifier or an integer. If non-negative integer, '
                         'CuPy arrays with specified device id are used. If '
                         'negative integer, NumPy arrays are used')
parser.add_argument('--gradclip', '-c', type=float, default=5,
                    help='Gradient norm threshold to clip')
parser.add_argument('--out', '-o', default='result',
                    help='Directory to output the result')
parser.add_argument('--resume', '-r', type=str,
                    help='Resume the training from snapshot')
parser.add_argument('--test', action='store_true',
                    help='Use tiny datasets for quick tests')
parser.set_defaults(test=False)
parser.add_argument('--unit', '-u', type=int, default=650,
                    help='Number of LSTM units in each layer')
parser.add_argument('--model', '-m', default='model.npz',
                    help='Model file name to serialize')
args = parser.parse_args()
2.2.3 Define Network Structure¶
An RNNLM written in Chainer is shown below. It implements the model depicted in the above figure.
class RNNForLM(chainer.Chain):

    def __init__(self, n_vocab, n_units):
        super(RNNForLM, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l1 = L.LSTM(n_units, n_units)
            self.l2 = L.LSTM(n_units, n_units)
            self.l3 = L.Linear(n_units, n_vocab)

        # Initialize every parameter uniformly in [-0.1, 0.1).
        for param in self.params():
            param.array[...] = np.random.uniform(-0.1, 0.1, param.shape)

    def reset_state(self):
        self.l1.reset_state()
        self.l2.reset_state()

    def forward(self, x):
        h0 = self.embed(x)
        h1 = self.l1(F.dropout(h0))
        h2 = self.l2(F.dropout(h1))
        y = self.l3(F.dropout(h2))
        return y
When we instantiate this class for making a model, we give the vocabulary size to n_vocab and the size of the hidden vectors to n_units.
This network uses chainer.links.LSTM, chainer.links.Linear, and chainer.functions.dropout as its building blocks. All the layers are registered and initialized in the context with self.init_scope().
You can access all the parameters in those layers by calling self.params().
In the constructor, it initializes all parameters with values sampled from a uniform distribution \(U(-0.1, 0.1)\).
The forward method takes a word ID x, calculates the output for the next word by forwarding it through the network, and returns that output.
Note that the word ID x is automatically converted to a \(|\mathcal{V}|\)-dimensional one-hot vector and then multiplied with the input embedding matrix in self.embed(x) to obtain an embedding vector h0 at the first line of forward.
2.2.4 Load the Penn Tree Bank Long Word Sequence Dataset¶
In this notebook, we use the Penn Tree Bank dataset, which contains a number of sentences.
Chainer provides a utility function to obtain this dataset from the server and convert it to a single long sequence of word IDs. chainer.datasets.get_ptb_words() actually returns three separate datasets, which are for train, validation, and test.
Let’s download and make dataset objects using it:
# Load the Penn Tree Bank long word sequence dataset
train, val, test = chainer.datasets.get_ptb_words()
2.2.5 Define Iterator for Making a Minibatch from the Dataset¶
The dataset iterator creates a minibatch of word pairs at different positions, namely, pairs of a current word and its next word. Each example is a part of the sequence starting from a different offset, with the offsets equally spaced within the whole sequence.
class ParallelSequentialIterator(chainer.dataset.Iterator):

    def __init__(self, dataset, batch_size, repeat=True):
        super(ParallelSequentialIterator, self).__init__()
        self.dataset = dataset
        self.batch_size = batch_size  # batch size
        self.repeat = repeat
        length = len(dataset)
        # Offsets maintain the position of each sequence in the mini-batch.
        self.offsets = [i * length // batch_size for i in range(batch_size)]
        self.reset()

    def reset(self):
        # Number of completed sweeps over the dataset. In this case, it is
        # incremented if every word is visited at least once after the last
        # increment.
        self.epoch = 0
        # True if the epoch is incremented at the last iteration.
        self.is_new_epoch = False
        # NOTE: this is not a count of parameter updates. It is just a count of
        # calls of ``__next__``.
        self.iteration = 0
        # use -1 instead of None internally
        self._previous_epoch_detail = -1.

    def __next__(self):
        # This iterator returns a list representing a mini-batch. Each item
        # indicates a different position in the original sequence. Each item is
        # represented by a pair of two word IDs. The first word is at the
        # "current" position, while the second word at the next position.
        # At each iteration, the iteration count is incremented, which pushes
        # forward the "current" position.
        length = len(self.dataset)
        if not self.repeat and self.iteration * self.batch_size >= length:
            # If not self.repeat, this iterator stops at the end of the first
            # epoch (i.e., when all words are visited once).
            raise StopIteration
        cur_words = self.get_words()
        self._previous_epoch_detail = self.epoch_detail
        self.iteration += 1
        next_words = self.get_words()

        epoch = self.iteration * self.batch_size // length
        self.is_new_epoch = self.epoch < epoch
        if self.is_new_epoch:
            self.epoch = epoch

        return list(zip(cur_words, next_words))

    @property
    def epoch_detail(self):
        # Floating point version of epoch.
        return self.iteration * self.batch_size / len(self.dataset)

    @property
    def previous_epoch_detail(self):
        if self._previous_epoch_detail < 0:
            return None
        return self._previous_epoch_detail

    def get_words(self):
        # It returns a list of current words.
        return [self.dataset[(offset + self.iteration) % len(self.dataset)]
                for offset in self.offsets]

    def serialize(self, serializer):
        # It is important to serialize the state to be recovered on resume.
        self.iteration = serializer('iteration', self.iteration)
        self.epoch = serializer('epoch', self.epoch)
        try:
            self._previous_epoch_detail = serializer(
                'previous_epoch_detail', self._previous_epoch_detail)
        except KeyError:
            # guess previous_epoch_detail for older version
            self._previous_epoch_detail = self.epoch + \
                (self.current_position - self.batch_size) / len(self.dataset)
            if self.epoch_detail > 0:
                self._previous_epoch_detail = max(
                    self._previous_epoch_detail, 0.)
            else:
                self._previous_epoch_detail = -1.
2.2.6 Define Updater¶
We use backpropagation through time (BPTT) to optimize the RNNLM. BPTT can be implemented by overriding the update_core() method of StandardUpdater. First, the constructor of BPTTUpdater takes bprop_len as an argument in addition to the other arguments StandardUpdater needs. bprop_len defines the length of sequence \(T\) used to calculate the loss:
where \(\hat{P}({\bf x}_t^n)\) is a probability for \(n\)th word in the vocabulary at the position \(t\) in the training data sequence.
class BPTTUpdater(training.updaters.StandardUpdater):

    def __init__(self, train_iter, optimizer, bprop_len, device):
        super(BPTTUpdater, self).__init__(
            train_iter, optimizer, device=device)
        self.bprop_len = bprop_len

    # The core part of the update routine can be customized by overriding.
    def update_core(self):
        loss = 0
        # When we pass one iterator and optimizer to StandardUpdater.__init__,
        # they are automatically named 'main'.
        train_iter = self.get_iterator('main')
        optimizer = self.get_optimizer('main')

        # Progress the dataset iterator for bprop_len words at each iteration.
        for i in range(self.bprop_len):
            # Get the next batch (a list of tuples of two word IDs)
            batch = train_iter.__next__()

            # Concatenate the word IDs to matrices and send them to the device
            # self.converter does this job
            # (it is chainer.dataset.concat_examples by default)
            x, t = self.converter(batch, self.device)

            # Compute the loss at this time step and accumulate it
            loss += optimizer.target(x, t)

        optimizer.target.cleargrads()  # Clear the parameter gradients
        loss.backward()  # Backprop
        loss.unchain_backward()  # Truncate the graph
        optimizer.update()  # Update the parameters
2.2.7 Define Evaluation Function (Perplexity)¶
Define a function to calculate the perplexity from the loss value. If we take \(e\) as \(b\) in the above definition of perplexity, calculating the perplexity amounts to raising \(e\) to the power of the loss value:
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])
2.2.8 Create Iterator¶
Here, the code below just creates iterator objects from dataset splits (train/val/test).
train_iter = ParallelSequentialIterator(train, args.batchsize)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)
test_iter = ParallelSequentialIterator(test, 1, repeat=False)
2.2.9 Create RNN and Classification Model¶
Instantiate the RNNLM model and wrap it with chainer.links.Classifier, because it calculates softmax cross entropy as the loss.
rnn = RNNForLM(n_vocab, args.unit)
model = L.Classifier(rnn)
model.compute_accuracy = False # we only want the perplexity
Note that Classifier computes not only the loss but also the accuracy based on a given input/label pair. To learn the RNN language model, we only need the loss (cross entropy) in the Classifier, because we calculate the perplexity instead of classification accuracy to check the performance of the model. So, we turn off computing the accuracy by setting the model.compute_accuracy attribute to False.
2.2.10 Setup Optimizer¶
Prepare an optimizer. Here, we use GradientClipping to prevent gradient explosion. It automatically clips the gradient used to update the parameters in the model with the given constant gradclip.
optimizer = chainer.optimizers.SGD(lr=1.0)
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer_hooks.GradientClipping(args.gradclip))
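The idea behind clipping by norm can be sketched in a few lines of plain Python. This illustrates the general technique only and is not the source of Chainer's GradientClipping hook; the helper name clip_by_norm and the flat list of gradient values are assumptions for the example.

```python
import math

def clip_by_norm(grads, threshold):
    """Rescale grads so that their overall L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)          # small gradients pass through unchanged
    rate = threshold / norm
    return [g * rate for g in grads]

print(clip_by_norm([3.0, 4.0], 2.5))  # [1.5, 2.0]: norm 5 is scaled down to 2.5
```

Because all components are scaled by the same factor, the direction of the update is preserved and only its length is bounded.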
2.2.11 Setup and Run Trainer¶
Let’s make a trainer object and start the training! Note that we add an eval_hook to the Evaluator extension to reset the internal states before starting the evaluation process. This prevents state accumulated on training data from being used while evaluating the model.
updater = BPTTUpdater(train_iter, optimizer, args.bproplen, device)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

eval_model = model.copy()  # Model with shared params and distinct states
eval_rnn = eval_model.predictor
trainer.extend(extensions.Evaluator(
    val_iter, eval_model, device=device,
    # Reset the RNN state at the beginning of each evaluation
    eval_hook=lambda _: eval_rnn.reset_state()))

interval = 10 if args.test else 500
trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                    trigger=(interval, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'perplexity', 'val_perplexity']
), trigger=(interval, 'iteration'))
trainer.extend(extensions.ProgressBar(
    update_interval=1 if args.test else 10))
trainer.extend(extensions.snapshot())
trainer.extend(extensions.snapshot_object(
    model, 'model_iter_{.updater.iteration}'))
if args.resume is not None:
    chainer.serializers.load_npz(args.resume, trainer)

trainer.run()
2.2.12 Evaluate the trained model on test dataset¶
Let’s see the perplexity on the test split. A Trainer extension can be used as just a normal function outside of Trainer.
print('test')
eval_rnn.reset_state()
evaluator = extensions.Evaluator(test_iter, eval_model, device=device)
result = evaluator()
print('test perplexity: {}'.format(np.exp(float(result['main/loss']))))
2.3 Run Example¶
2.3.1 Training the model¶
You can train the model with the script: examples/ptb/train_ptb.py
$ pwd
/root2chainer/chainer/examples/ptb
$ python train_ptb.py --test  # run in test mode. If you want to use all data, remove "--test".
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt...
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt...
Downloading from https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt...
#vocab = 10000
test
test perplexity: 29889.9857364
2.3.2 Generating sentences¶
You can generate the sentence which starts with a word in the vocabulary. In this example, we generate a sentence which starts with the word apple. We use the script in the PTB example of the official repository: examples/ptb/gentxt.py
$ pwd
/root2chainer/chainer/examples/ptb
$ python gentxt.py -m model.npz -p apple
apple a new u.s. economist with <unk> <unk> fixed more than to N the company said who is looking back to
Word2Vec: Obtain word embeddings¶
0. Introduction¶
Word2vec is a tool for generating distributed representations of words, proposed by Mikolov et al. [1]. When the tool assigns a real-valued vector to each word, the closer the meanings of the words, the greater the similarity the vectors will indicate.
Distributed representation means assigning a real-valued vector to each word and representing the word by the vector. When a word is represented by a distributed representation, we call the vector a word embedding. In this tutorial, we aim at explaining how to get the word embeddings from the Penn Tree Bank dataset.
Let’s think about what the meaning of a word is. Since we are human, we can understand that the words “animal” and “dog” are deeply related to each other. But what information will Word2vec use to learn the vectors for words? The words “animal” and “dog” should have similar vectors, but the words “food” and “dog” should be far from each other. How can we learn the features of those words automatically?
1. Basic Idea¶
Word2vec learns the similarity of word meanings from simple information. It learns the representation of words from sentences. The core idea is based on the assumption that the meaning of a word is affected by the words around it. This idea follows the distributional hypothesis [2].
The word we focus on to learn its representation is called the center word, and the words around it are called context words. The window size \(C\) determines the number of context words that are considered.
Here, let’s see the algorithm by using an example sentence: “The cute cat jumps over the lazy dog.”
All of the following figures consider “cat” as the center word.
The number of context words changes according to the window size \(C\).
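The effect of the window size can be checked with a short sketch. The helper name context_words is hypothetical, and whitespace tokenization of the example sentence (without the final period) is an assumption made to keep the example small.

```python
def context_words(tokens, center_index, window):
    """Return up to `window` words on each side of the center word."""
    left = tokens[max(0, center_index - window):center_index]
    right = tokens[center_index + 1:center_index + 1 + window]
    return left + right

tokens = "The cute cat jumps over the lazy dog".split()
center = tokens.index("cat")
print(context_words(tokens, center, 1))  # ['cute', 'jumps']
print(context_words(tokens, center, 2))  # ['The', 'cute', 'jumps', 'over']
```

Near the start or end of a sentence fewer than \(2C\) context words are available, which the slice bounds handle automatically.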
2. Main Algorithm¶
Word2vec, the tool for creating the word embeddings, is actually built with two models, which are called Skip-gram and CBoW.
To explain the models with the figures below, we will use the following symbols.
Symbol | Definition
--- | ---
\(|\mathcal{V}|\) | The size of vocabulary
\(D\) | The size of embedding vector
\({\bf v}_t\) | A one-hot center word vector
\(V_{t \pm C}\) | A set of \(2C\) context vectors around \({\bf v}_t\), namely, \(\{{\bf v}_{t+c}\}_{c=-C}^{C} \backslash {\bf v}_t\)
\({\bf l}_H\) | An embedding vector of an input word vector
\({\bf l}_O\) | An output vector of the network
\({\bf W}_H\) | The embedding matrix for inputs
\({\bf W}_O\) | The embedding matrix for outputs
Note
Negative sampling and hierarchical softmax are commonly used for the loss function. In this tutorial, however, we use the softmax over all words and skip the other variants for the sake of simplicity.
2.1 Skip-gram¶
This model learns to predict the context words \(V_{t \pm C}\) when a center word \({\bf v}_t\) is given. In the model, each row of the input embedding matrix \({\bf W}_H\) becomes the word embedding of one word.
When you input a center word \({\bf v}_t\) into the network, you can predict one of the context words \(\hat {\bf v}_{t+c} \in V_{t \pm C}\) as follows:
Calculate an embedding vector of the input center word vector: \({\bf l}_H = {\bf W}_H {\bf v}_t\)
Calculate an output vector of the embedding vector: \({\bf l}_O = {\bf W}_O {\bf l}_H\)
Calculate a probability vector of a context word: \(\hat {\bf v}_{t+c} = \text{softmax}({\bf l}_O)\)
Each element of the \(\mathcal{V}\)-dimensional vector \(\hat {\bf v}_{t+c}\) is the probability that the corresponding word in the vocabulary is the context word at position \(c\). So, the probability \(p({\bf v}_{t+c} \mid {\bf v}_t)\) can be estimated by the dot product of the one-hot vector \({\bf v}_{t+c}\), which represents the actual word at position \(c\), and the output vector \(\hat {\bf v}_{t+c}\).
The loss function to predict all the context words \(V_{t \pm C}\) given a center word \({\bf v}_t\) is defined as follows:
\[L(V_{t \pm C} \mid {\bf v}_t; {\bf W}_H, {\bf W}_O) = \sum_{V_{t \pm C}} -\log p({\bf v}_{t+c} \mid {\bf v}_t) = \sum_{V_{t \pm C}} -\log({\bf v}_{t+c}^\top \hat {\bf v}_{t+c})\]
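The three Skip-gram steps and the resulting loss can be sketched with NumPy as follows. This is only a minimal illustration under stated assumptions: the matrices are random, both are stored with one row per word (so multiplying by a one-hot vector becomes a row lookup), and the word IDs passed at the end are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 2                      # vocabulary size and embedding size
W_H = rng.normal(size=(V, D))     # input embeddings, one row per word (assumed layout)
W_O = rng.normal(size=(V, D))     # output embeddings, one row per word (assumed layout)


def softmax(z):
    e = np.exp(z - z.max())       # subtract the maximum for numerical stability
    return e / e.sum()


def skipgram_loss(center_id, context_ids):
    l_H = W_H[center_id]          # step 1: embedding of the center word
    l_O = W_O @ l_H               # step 2: one score per vocabulary word
    probs = softmax(l_O)          # step 3: predicted distribution over words
    # The dot product with a one-hot vector simply selects one probability.
    return -np.sum(np.log(probs[context_ids]))


loss = skipgram_loss(center_id=2, context_ids=[0, 1, 3, 4])
print(loss)
```

Indexing `probs` with the context word IDs plays the role of the dot product with each one-hot vector \({\bf v}_{t+c}\).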
2.2 Continuous Bag of Words (CBoW)¶
This model learns to predict the center word \({\bf v}_t\) when the context words \(V_{t \pm C}\) are given. When you give a set of context words \(V_{t \pm C}\) to the network, you can estimate the probability of the center word \(\hat {\bf v}_t\) as follows:
Calculate a mean embedding vector over all context words: \({\bf l}_H = \frac{1}{2C} \sum_{V_{t \pm C}} {\bf W}_H {\bf v}_{t+c}\)
Calculate an output vector of the embedding vector: \({\bf l}_O = {\bf W}_O {\bf l}_H\)
Calculate a probability vector of a center word: \(\hat {\bf v}_t = \text{softmax}({\bf l}_O)\)
Each element of the \(\mathcal{V}\)-dimensional vector \(\hat {\bf v}_t\) is the probability that the corresponding word in the vocabulary is the center word. So, the probability \(p({\bf v}_t \mid V_{t \pm C})\) can be estimated by the dot product of the one-hot vector \({\bf v}_t\), which represents the actual center word, and the output vector \(\hat {\bf v}_t\).
The loss function to predict the center word \({\bf v}_t\) given the context words \(V_{t \pm C}\) is defined as follows:
\[L({\bf v}_t \mid V_{t \pm C}; {\bf W}_H, {\bf W}_O) = -\log p({\bf v}_t \mid V_{t \pm C}) = -\log({\bf v}_t^\top \hat {\bf v}_t)\]
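For comparison, the CBoW steps can be sketched in the same way. As before, this is an illustration with random matrices stored one row per word and with arbitrary word IDs, not the code of the official example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 2                      # vocabulary size and embedding size
W_H = rng.normal(size=(V, D))     # input embeddings (assumed row-per-word layout)
W_O = rng.normal(size=(V, D))     # output embeddings (assumed row-per-word layout)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def cbow_loss(context_ids, center_id):
    l_H = W_H[context_ids].mean(axis=0)   # mean embedding over the context words
    l_O = W_O @ l_H                       # one score per vocabulary word
    probs = softmax(l_O)                  # predicted distribution for the center word
    return -np.log(probs[center_id])


loss = cbow_loss(context_ids=[0, 1, 3, 4], center_id=2)
print(loss)
```

The only structural difference from Skip-gram is that several input embeddings are averaged into one vector before the output layer, so one forward pass yields one prediction.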
3. Details of Skip-gram¶
In this tutorial, we mainly explain the Skip-gram model because
It is easier to understand the algorithm than CBoW.
Even if the number of words increases, the accuracy is largely maintained. So, it is more scalable.
So, let’s think about a concrete example of calculating Skip-gram under this setup:
The size of vocabulary \(\mathcal{V}\) is 10.
The size of embedding vector \(D\) is 2.
Center word is “dog”.
Context word is “animal”.
Since there is usually more than one context word, repeat the following process for each context word.
The one-hot vector of “dog” is [0 0 1 0 0 0 0 0 0 0], and you input it as the center word.
The third row of the embedding matrix \({\bf W}_H\) is used as the word embedding of “dog”, \({\bf l}_H\).
Then, multiply \({\bf W}_O\) with \({\bf l}_H\) to obtain the output vector \({\bf l}_O\).
Give \({\bf l}_O\) to the softmax function to turn it into a predicted probability vector \(\hat {\bf v}_{t+c}\) for a context word at position \(c\).
Calculate the error between \(\hat {\bf v}_{t+c}\) and the one-hot vector of “animal”, [1 0 0 0 0 0 0 0 0 0].
Propagate the error back through the network to update the parameters.
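The walk-through above, including the backward step, can be sketched with NumPy. This is a hand-written illustration, not how Chainer performs backpropagation: the gradients are derived manually for softmax cross entropy, the matrices are random, the word IDs for “dog” and “animal” follow the one-hot vectors above, and the learning rate `lr` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 2
W_H = rng.normal(size=(V, D))      # input embeddings, one row per word
W_O = rng.normal(size=(V, D))      # output embeddings, one row per word
dog, animal = 2, 0                 # positions of the 1 in the one-hot vectors
lr = 0.01                          # assumed learning rate for plain gradient descent


def forward(center_id, target_id):
    l_H = W_H[center_id]
    l_O = W_O @ l_H
    e = np.exp(l_O - l_O.max())
    probs = e / e.sum()
    return l_H, probs, -np.log(probs[target_id])


l_H, probs, loss_before = forward(dog, animal)

# Error between the prediction and the one-hot vector of "animal".
error = probs.copy()
error[animal] -= 1.0

grad_W_O = np.outer(error, l_H)    # gradient of the loss with respect to W_O
grad_l_H = W_O.T @ error           # gradient with respect to the embedding of "dog"

W_O -= lr * grad_W_O
W_H[dog] -= lr * grad_l_H          # only the row of the center word changes

_, _, loss_after = forward(dog, animal)
print(loss_before, loss_after)
```

Only one row of \({\bf W}_H\) is updated per center word, which is why the rows of this matrix end up as the word embeddings.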
4. Implementation of Skip-gram in Chainer¶
There is an example of Word2vec in the official repository of Chainer, so we will explain how to implement Skip-gram based on it: examples/word2vec
4.1 Preparation¶
First, let’s import necessary packages:
import argparse
import collections
import os
import six
import warnings
import numpy as np
import chainer
from chainer.backends import cuda
import chainer.functions as F
import chainer.initializers as I
import chainer.links as L
import chainer.optimizers as O
from chainer import reporter
from chainer import training
from chainer.training import extensions
4.2 Define a Skip-gram model¶
Next, let’s define a network for Skip-gram.
class SkipGram(chainer.Chain):
    """Definition of Skip-gram Model"""

    def __init__(self, n_vocab, n_units, loss_func):
        super(SkipGram, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(
                n_vocab, n_units, initialW=I.Uniform(1. / n_units))
            self.loss_func = loss_func

    def forward(self, x, contexts):
        e = self.embed(contexts)
        batch_size, n_context, n_units = e.shape
        x = F.broadcast_to(x[:, None], (batch_size, n_context))
        e = F.reshape(e, (batch_size * n_context, n_units))
        x = F.reshape(x, (batch_size * n_context,))
        loss = self.loss_func(e, x)
        reporter.report({'loss': loss}, self)
        return loss


class SoftmaxCrossEntropyLoss(chainer.Chain):
    """Softmax cross entropy loss function preceded by linear transformation.
    """

    def __init__(self, n_in, n_out):
        super(SoftmaxCrossEntropyLoss, self).__init__()
        with self.init_scope():
            self.out = L.Linear(n_in, n_out, initialW=0)

    def forward(self, x, t):
        return F.softmax_cross_entropy(self.out(x), t)
Note
The weight matrix self.embed.W is the embedding matrix for the input vector x.
The function call forward takes the word ID of a center word x and the word IDs of context words contexts as inputs, and outputs the error calculated by the loss function loss_func, such as SoftmaxCrossEntropyLoss.
Note that the initial shapes of x and contexts are (batch_size,) and (batch_size, n_context), respectively.
The batch_size means the size of a minibatch, and n_context means the number of context words.
First, we obtain the embedding vectors of contexts by e = self.embed(contexts).
Then F.broadcast_to(x[:, None], (batch_size, n_context)) broadcasts x (its shape is (batch_size,)) to (batch_size, n_context) by copying the same value n_context times to fill the second axis. The broadcasted x is then reshaped into a 1-D vector of shape (batch_size * n_context,), while e is reshaped to (batch_size * n_context, n_units).
In the Skip-gram model, predicting a context word from the center word is the same as predicting the center word from a context word, because the center word is always a context word when the context word is considered as a center word. So, we create batch_size * n_context center word predictions by applying the self.out linear layer to the embedding vectors of the context words. Then, we calculate the softmax cross entropy between the broadcasted center word ID x and the predictions.
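The shape manipulation in forward can be checked with plain NumPy, whose `broadcast_to` and `reshape` behave analogously to the Chainer functions used above. The sizes and word IDs are arbitrary, and the zero array `e` only stands in for the output of `self.embed(contexts)`.

```python
import numpy as np

batch_size, n_context, n_units = 3, 4, 2
x = np.array([7, 8, 9], dtype=np.int32)           # center word IDs, shape (batch_size,)
e = np.zeros((batch_size, n_context, n_units))    # stand-in for self.embed(contexts)

# Repeat each center word ID once per context word.
x_b = np.broadcast_to(x[:, None], (batch_size, n_context))

# Flatten both arrays so that row k of e_flat is paired with x_flat[k].
x_flat = x_b.reshape(batch_size * n_context)
e_flat = e.reshape(batch_size * n_context, n_units)

print(x_flat)        # each ID appears n_context times in a row
print(e_flat.shape)
```

After flattening, every context embedding is paired with the ID of its own center word, so a single call to the loss function covers all batch_size * n_context predictions.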
4.3 Prepare dataset and iterator¶
Let’s retrieve the Penn Tree Bank (PTB) dataset by using Chainer’s dataset utility method get_ptb_words().
train, val, _ = chainer.datasets.get_ptb_words()
counts = collections.Counter(train)
train and val are the training data and the validation data. Each of them is an array of word IDs:
>>> train
array([ 0, 1, 2, ..., 39, 26, 24], dtype=int32)
>>> val
array([2211, 396, 1129, ..., 108, 27, 24], dtype=int32)
Then define an iterator to make minibatches that contain a set of center words with their context words.
class WindowIterator(chainer.dataset.Iterator):
    """Dataset iterator to create a batch of sequences at different positions.

    This iterator returns a pair of the current words and the context words.
    """

    def __init__(self, dataset, window, batch_size, repeat=True):
        self.dataset = np.array(dataset, np.int32)
        self.window = window  # size of context window
        self.batch_size = batch_size
        self._repeat = repeat
        # order is the array which is shuffled ``[window, window + 1, ...,
        # len(dataset) - window - 1]``
        self.order = np.random.permutation(
            len(dataset) - window * 2).astype(np.int32)
        self.order += window
        self.current_position = 0
        # Number of completed sweeps over the dataset. In this case, it is
        # incremented if every word is visited at least once after the last
        # increment.
        self.epoch = 0
        # True if the epoch is incremented at the last iteration.
        self.is_new_epoch = False

    def __next__(self):
        """This iterator returns a list representing a minibatch.

        Each item indicates a different position in the original sequence.
        """
        if not self._repeat and self.epoch > 0:
            raise StopIteration
        i = self.current_position
        i_end = i + self.batch_size
        position = self.order[i:i_end]
        # Sample a window width w between 1 and self.window - 1.
        w = np.random.randint(self.window - 1) + 1
        offset = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])
        pos = position[:, None] + offset[None, :]
        contexts = self.dataset.take(pos)
        center = self.dataset.take(position)
        if i_end >= len(self.order):
            np.random.shuffle(self.order)
            self.epoch += 1
            self.is_new_epoch = True
            self.current_position = 0
        else:
            self.is_new_epoch = False
            self.current_position = i_end
        return center, contexts

    @property
    def epoch_detail(self):
        return self.epoch + float(self.current_position) / len(self.order)

    def serialize(self, serializer):
        self.current_position = serializer('current_position',
                                           self.current_position)
        self.epoch = serializer('epoch', self.epoch)
        self.is_new_epoch = serializer('is_new_epoch', self.is_new_epoch)
        if self.order is not None:
            serializer('order', self.order)
In the constructor, we create an array self.order, which holds the shuffled indices [window, window + 1, ..., len(dataset) - window - 1], in order to choose center words randomly from dataset in a minibatch.
The iterator method __next__ returns batch_size sets of a center word and its context words.
The code self.order[i:i_end] returns the indices for a set of center words from the randomly ordered array self.order. The center word IDs center at those indices are retrieved by self.dataset.take.
np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)]) creates a set of offsets to retrieve context words from the dataset.
The code position[:, None] + offset[None, :] generates the indices of context words for each center word index in position. The context word IDs contexts are retrieved by self.dataset.take.
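The index arithmetic in __next__ can be tried on a toy array. The dataset values, the window width `w` and the center positions below are made up for the illustration; only the NumPy calls mirror the iterator.

```python
import numpy as np

dataset = np.arange(100, 120, dtype=np.int32)   # toy "word IDs": 100, 101, ..., 119
w = 2                                           # window width chosen for this batch
position = np.array([5, 10], dtype=np.int32)    # indices of two center words

# Offsets of the context words relative to a center word: [-2, -1, 1, 2]
offset = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])

# One row of context indices per center word, shape (2, 2 * w)
pos = position[:, None] + offset[None, :]

contexts = dataset.take(pos)
center = dataset.take(position)
print(center)
print(contexts)
```

Because self.order only contains indices between window and len(dataset) - window - 1, the context indices computed this way never fall outside the dataset.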
4.4 Prepare model, optimizer, and updater¶
# args, n_vocab, loss_func, convert and device are not defined in the
# snippets above; they come from the rest of the example script.
model = SkipGram(n_vocab, args.unit, loss_func)
optimizer = O.Adam()
optimizer.setup(model)
train_iter = WindowIterator(train, args.window, args.batchsize)
val_iter = WindowIterator(val, args.window, args.batchsize, repeat=False)
# Set up an updater
updater = training.updaters.StandardUpdater(
    train_iter, optimizer, converter=convert, device=device)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)
trainer.extend(extensions.Evaluator(
    val_iter, model, converter=convert, device=device))
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(
    ['epoch', 'main/loss', 'validation/main/loss']))
trainer.extend(extensions.ProgressBar())
trainer.extend(
    extensions.snapshot(filename='snapshot_epoch_{.updater.epoch}'),
    trigger=(args.snapshot_interval, 'epoch'))
if args.resume is not None:
    chainer.serializers.load_npz(args.resume, trainer)
trainer.run()
4.5 Start training¶
$ pwd
/root2chainer/chainer/examples/word2vec
$ python train_word2vec.py --test # run in test mode. If you want to use all data, remove "--test".
GPU: 1
# unit: 100
Window: 5
Minibatchsize: 1000
# epoch: 20
Training model: skipgram
Output type: hsm
n_vocab: 10000
data length: 100
epoch main/loss validation/main/loss
1 4233.75 2495.33
2 1411.14 4990.66
3 4233.11 1247.66
4 2821.66 4990.65
5 4231.94 1247.66
6 5642.04 2495.3
7 5640.82 4990.64
8 5639.31 2495.28
9 2817.89 4990.62
10 1408.03 3742.94
11 5633.11 1247.62
12 4221.71 2495.21
13 4219.3 4990.56
14 4216.57 2495.16
15 4213.52 2495.12
16 5616.03 1247.55
17 5611.34 3742.78
18 2800.31 3742.74
19 1397.79 2494.95
20 2794.1 3742.66
4.6 Search for similar words¶
$ pwd
/root2chainer/chainer/examples/word2vec
$ python search.py
>> apple
query: apple
compaq: 0.6169619560241699
chip: 0.49579331278800964
retailer: 0.4904134273529053
maker: 0.4684058427810669
computer: 0.4652436673641205
>> animal
query: animal
beauty: 0.5680124759674072
human: 0.5404794216156006
insulin: 0.5365156531333923
cell: 0.5186758041381836
photographs: 0.5077002048492432
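The scores printed above look like cosine similarities between word vectors. A minimal sketch of such a search is shown below; it is an assumption about what a similarity search does, not a copy of search.py, and the function `most_similar`, the tiny vocabulary and the 2-dimensional vectors are all made up for the illustration.

```python
import numpy as np


def most_similar(query, words, vectors, k=5):
    """Rank words by cosine similarity to the query word (illustrative helper)."""
    word_to_id = {w: i for i, w in enumerate(words)}
    # Normalize every vector to unit length so that a dot product is a cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[word_to_id[query]]
    order = np.argsort(-sims)                     # most similar first
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query]
    return ranked[:k]


words = ["apple", "compaq", "chip", "animal"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]])

result = most_similar("apple", words, vectors, k=2)
for word, score in result:
    print("{}: {}".format(word, score))
```

With trained embeddings, `vectors` would be the learned embedding matrix and `words` the vocabulary in ID order.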
Write a Sequence to Sequence (seq2seq) Model¶
0. Introduction¶
The sequence to sequence (seq2seq) model[1][2] is a learning model that converts an input sequence into an output sequence. In this context, the sequence is a list of symbols, corresponding to the words in a sentence. The seq2seq model has achieved great success in fields such as machine translation, dialogue systems, question answering, and text summarization. All of these tasks can be regarded as the task to learn a model that converts an input sequence into an output sequence.
1. Basic Idea of Seq2seq Model¶
1.1 Overview of Seq2seq Model¶
The Notations of Sequence¶
The seq2seq model converts an input sequence into an output sequence. Let the input sequence and the output sequence be \(\bf X\) and \(\bf Y\). The \(i\)-th element of the input sequence is represented as \({\bf x}_i\), and the \(j\)-th element of the output sequence is represented as \({\bf y}_j\). Generally, each \({\bf x}_i\) and \({\bf y}_j\) is the one-hot vector of a symbol. For example, in natural language processing (NLP), the one-hot vector represents a word and its size equals the vocabulary size.
Let’s think about the seq2seq model in the context of NLP. Let the vocabularies of the inputs and the outputs be \({\mathcal V}^{(s)}\) and \({\mathcal V}^{(t)}\). All the elements \({\bf x}_i\) and \({\bf y}_j\) satisfy \({\bf x}_i \in \mathbb{R}^{|{\mathcal V}^{(s)}|}\) and \({\bf y}_j \in \mathbb{R}^{|{\mathcal V}^{(t)}|}\). The input sequence \(\bf X\) and the output sequence \(\bf Y\) are represented by the following equations:
\[{\bf X} = ({\bf x}_1, \ldots, {\bf x}_I) = ({\bf x}_i)_{i=1}^{I}\]
\[{\bf Y} = ({\bf y}_1, \ldots, {\bf y}_J) = ({\bf y}_j)_{j=1}^{J}\]
\(I\) and \(J\) are the lengths of the input sequence and the output sequence. Using the typical NLP notation, \({\bf y}_0\) is the one-hot vector of BOS, which is the virtual word representing the beginning of the sentence, and \({\bf y}_{J+1}\) is that of EOS, which is the virtual word representing the end of the sentence.
The Notations of Conditional Probability \(P({\bf Y} \mid {\bf X})\)¶
Next, let’s think about the conditional probability \(P({\bf Y} \mid {\bf X})\) of generating the output sequence \(\bf Y\) when the input sequence \(\bf X\) is given. The purpose of the seq2seq model is to model the probability \(P({\bf Y} \mid {\bf X})\). However, the seq2seq model does not model this probability directly. Instead, it models the probability \(P({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})\), which is the probability of generating the \(j\)-th element of the output sequence \({\bf y}_j\) given \({\bf Y}_{<j}\) and \({\bf X}\). \({\bf Y}_{<j}\) means the output sequence from \(1\) to \(j-1\), or \(({\bf y}_k)_{k=1}^{j-1}\). In this notation, you can write the model \(P_{\theta}({\bf Y} \mid {\bf X})\) as the product of \(P_{\theta}({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})\):
\[P_{\theta}({\bf Y} \mid {\bf X}) = \prod_{j=1}^{J+1} P_{\theta}({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})\]
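As a small numeric illustration of this factorization, the probability of a whole output sequence is the product of the per-step probabilities, and its logarithm is the sum of the per-step log probabilities. The values in `step_probs` are hypothetical and are not produced by any model.

```python
import math

# Hypothetical values of P(y_j | Y_<j, X) for j = 1, ..., J + 1.
# The last entry corresponds to generating EOS.
step_probs = [0.5, 0.8, 0.9, 0.7]

sequence_prob = math.prod(step_probs)                 # P(Y | X)
log_prob = sum(math.log(p) for p in step_probs)       # log P(Y | X)

print(sequence_prob, log_prob)
```

Working with the sum of log probabilities avoids the numerical underflow that the raw product suffers from on long sequences, and it is what a cross entropy loss accumulates over the words of a sentence.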
Processing Steps in Seq2seq Model¶
Now, let’s think about the processing steps in the seq2seq model. The characteristic of the seq2seq model is that it consists of two processes:
The process that generates the fixed-size vector \(\bf z\) from the input sequence \(\bf X\)
The process that generates the output sequence \(\bf Y\) from \(\bf z\)
In other words, the information of \(\bf X\) is conveyed by \(\bf z\), and \(P_{\theta}({\bf y}_j \mid {\bf Y}_{<j}, {\bf X})\) is actually calculated as \(P_{\theta}({\bf y}_j \mid {\bf Y}_{<j}, {\bf z})\).
First, we represent the process that generates \(\bf z\) from \(\bf X\) by the function \(\Lambda\):
\[{\bf z} = \Lambda({\bf X})\]
The function \(\Lambda\) may be a recurrent neural network such as an LSTM.
Second, we represent the process that generates \(\bf Y\) from \(\bf z\) by the following formulas:
\[P_{\theta}({\bf y}_j \mid {\bf Y}_{<j}, {\bf X}) = \Upsilon({\bf h}_j^{(t)}, {\bf y}_j)\]
\[{\bf h}_j^{(t)} = \Psi({\bf h}_{j-1}^{(t)}, {\bf y}_{j-1})\]
\(\Psi\) is the function that generates the hidden vectors \({\bf h}_j^{(t)}\), and \(\Upsilon\) is the function that calculates the generative probability of the one-hot vector \({\bf y}_j\). When \(j=1\), \({\bf h}_{j-1}^{(t)}\), that is \({\bf h}_0^{(t)}\), is the vector \(\bf z\) generated by \(\Lambda({\bf X})\), and \({\bf y}_{j-1}\), that is \({\bf y}_0\), is the one-hot vector of BOS.
1.2 Model Architecture of Seq2seq Model¶
In this section, we describe the architecture of the seq2seq model. To simplify the explanation, we use the most basic architecture. The architecture of the seq2seq model can be divided into five major parts.
Encoder Embedding Layer
Encoder Recurrent Layer
Decoder Embedding Layer
Decoder Recurrent Layer
Decoder Output Layer
The encoder consists of two layers: the embedding layer and the recurrent layer, and the decoder consists of three layers: the embedding layer, the recurrent layer, and the output layer.
In the explanation, we use the following symbols:
Symbol | Definition
\(H\) | the size of the hidden vector
\(D\) | the size of the embedding vector
\({\bf x}_i\) | the one-hot vector of the \(i\)-th word in the input sentence
\({\bf \bar x}_i\) | the embedding vector of the \(i\)-th word in the input sentence
\({\bf E}^{(s)}\) | the embedding matrix of the encoder
\({\bf h}_i^{(s)}\) | the \(i\)-th hidden vector of the encoder
\({\bf y}_j\) | the one-hot vector of the \(j\)-th word in the output sentence
\({\bf \bar y}_j\) | the embedding vector of the \(j\)-th word in the output sentence
\({\bf E}^{(t)}\) | the embedding matrix of the decoder
\({\bf h}_j^{(t)}\) | the \(j\)-th hidden vector of the decoder
1.2.1 Encoder Embedding Layer¶
The first layer, the encoder embedding layer, converts each word in the input sentence to an embedding vector. When processing the \(i\)-th word in the input sentence, the input and the output of the layer are the following:
The input is \({\bf x}_i\) : the one-hot vector which represents the \(i\)-th word
The output is \({\bf \bar x}_i\) : the embedding vector which represents the \(i\)-th word
Each embedding vector is calculated by the following equation:
\[{\bf \bar x}_i = {\bf E}^{(s)} {\bf x}_i\]
\({\bf E}^{(s)} \in {\mathbb R}^{D \times |{\mathcal V}^{(s)}|}\) is the embedding matrix of the encoder.
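Multiplying the embedding matrix by a one-hot vector is the same as selecting one column of the matrix, which is why implementations look the embedding up by word ID instead of building one-hot vectors. The NumPy sketch below checks this with a random matrix; the sizes and the word ID are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, D = 6, 3
E = rng.normal(size=(D, vocab_size))   # embedding matrix with shape D x |V|

word_id = 4
x = np.zeros(vocab_size)
x[word_id] = 1.0                       # one-hot vector of the word

x_bar = E @ x                          # embedding via matrix multiplication
lookup = E[:, word_id]                 # embedding via column lookup

print(np.allclose(x_bar, lookup))
```

A lookup touches only \(D\) numbers, whereas the explicit product touches the whole matrix, so the lookup is the practical choice for large vocabularies.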
1.2.2 Encoder Recurrent Layer¶
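The encoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the \(i\)-th embedding vector, the input and the output of the layer are the following:
The input is \({\bf \bar x}_i\) : the embedding vector which represents the \(i\)-th word
The output is \({\bf h}_i^{(s)}\) : the hidden vector of the \(i\)-th position
For example, when using a unidirectional RNN with one layer, the process can be represented as the following function \(\Psi^{(s)}\):
\[{\bf h}_i^{(s)} = \Psi^{(s)}({\bf \bar x}_i, {\bf h}_{i-1}^{(s)}) = {\rm tanh}\left({\bf W}^{(s)} \begin{bmatrix} {\bf h}_{i-1}^{(s)} \\ {\bf \bar x}_i \end{bmatrix} + {\bf b}^{(s)}\right)\]
Here \({\bf W}^{(s)}\) and \({\bf b}^{(s)}\) denote the weight matrix and the bias vector of the layer. In this case, we use \({\rm tanh}\) as the activation function.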
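A one-layer unidirectional tanh RNN of this kind can be sketched with NumPy. The weight matrix `W`, the bias `b`, the sizes and the random inputs are assumptions made for the illustration; the Chainer example itself uses a stacked LSTM rather than this simple recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 4                                  # embedding size and hidden size
W = rng.normal(scale=0.1, size=(H, H + D))   # acts on [h_prev; x_bar]
b = np.zeros(H)


def rnn_step(x_bar, h_prev):
    """One recurrent step: combine the previous hidden vector with the input."""
    return np.tanh(W @ np.concatenate([h_prev, x_bar]) + b)


embedded = rng.normal(size=(5, D))   # embedding vectors of a 5-word sentence
h = np.zeros(H)                      # initial hidden vector
for x_bar in embedded:
    h = rnn_step(x_bar, h)

z = h                                # the last hidden vector summarizes the sentence
print(z)
```

The final hidden vector plays the role of the fixed-size vector \(\bf z\) that is handed to the decoder.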
1.2.3 Decoder Embedding Layer¶
The decoder embedding layer converts each word in the output sentence to an embedding vector. When processing the \(j\)-th word in the output sentence, the input and the output of the layer are the following:
The input is \({\bf y}_{j-1}\) : the one-hot vector which represents the \((j-1)\)-th word generated by the decoder output layer
The output is \({\bf \bar y}_j\) : the embedding vector which represents the \((j-1)\)-th word
Each embedding vector is calculated by the following equation:
\[{\bf \bar y}_j = {\bf E}^{(t)} {\bf y}_{j-1}\]
\({\bf E}^{(t)} \in {\mathbb R}^{D \times |{\mathcal V}^{(t)}|}\) is the embedding matrix of the decoder.
1.2.4 Decoder Recurrent Layer¶
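The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the \(j\)-th embedding vector, the input and the output of the layer are the following:
The input is \({\bf \bar y}_j\) : the embedding vector
The output is \({\bf h}_j^{(t)}\) : the hidden vector of the \(j\)-th position
For example, when using a unidirectional RNN with one layer, the process can be represented as the following function \(\Psi^{(t)}\):
\[{\bf h}_j^{(t)} = \Psi^{(t)}({\bf \bar y}_j, {\bf h}_{j-1}^{(t)}) = {\rm tanh}\left({\bf W}^{(t)} \begin{bmatrix} {\bf h}_{j-1}^{(t)} \\ {\bf \bar y}_j \end{bmatrix} + {\bf b}^{(t)}\right)\]
Here \({\bf W}^{(t)}\) and \({\bf b}^{(t)}\) denote the weight matrix and the bias vector of the layer. In this case, we use \({\rm tanh}\) as the activation function. We must also use the encoder’s hidden vector of the last position as the decoder’s hidden vector of the first position, as follows:
\[{\bf h}_0^{(t)} = {\bf z} = {\bf h}_I^{(s)}\]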
1.2.5 Decoder Output Layer¶
The decoder output layer generates the probability of the \(j\)-th word of the output sentence from the hidden vector. When processing the \(j\)-th hidden vector, the input and the output of the layer are the following:
The input is \({\bf h}_j^{(t)}\) : the hidden vector of the \(j\)-th position
The output is \(p_j\) : the probability of generating the one-hot vector \({\bf y}_j\) of the \(j\)-th word
Note
There are many varieties of seq2seq models. We can use different RNN models in terms of: (1) directionality (unidirectional or bidirectional), (2) depth (single-layer or multi-layer), (3) type (a vanilla RNN, a long short-term memory (LSTM), or a gated recurrent unit (GRU)), and (4) additional functionality (such as an attention mechanism).
2. Implementation of Seq2seq Model¶
The official Chainer repository includes a neural machine translation example using the seq2seq model. We will now provide an overview of the example and explain its implementation in detail. chainer/examples/seq2seq
2.1 Model Overview¶
In this simple example, an input sequence is processed by a stacked LSTM-RNN (long short-term memory recurrent neural network) and it is encoded as a fixed-size vector. The output sequence is also processed by another stacked LSTM-RNN. At decoding time, an output sequence is generated using argmax.
2.2 Step-by-step Implementation¶
2.2.1 Import Package¶
First, let’s import necessary packages.
import argparse
import datetime
import io
from nltk.translate import bleu_score
import numpy
import progressbar
import six
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
2.2.2 Define Training Settings¶
Define all training settings here.
parser = argparse.ArgumentParser()
parser.add_argument('SOURCE', help='source sentence list')
parser.add_argument('TARGET', help='target sentence list')
parser.add_argument('SOURCE_VOCAB', help='source vocabulary file')
parser.add_argument('TARGET_VOCAB', help='target vocabulary file')
parser.add_argument('--validation-source',
                    help='source sentence list for validation')
parser.add_argument('--validation-target',
                    help='target sentence list for validation')
parser.add_argument('--batchsize', '-b', type=int, default=64,
                    help='number of sentence pairs in each minibatch')
parser.add_argument('--epoch', '-e', type=int, default=20,
                    help='number of sweeps over the dataset to train')
parser.add_argument('--resume', '-r', type=str,
                    help='resume the training from snapshot')
parser.add_argument('--save', '-s', type=str,
                    help='save a snapshot of the training')
parser.add_argument('--unit', '-u', type=int, default=1024,
                    help='number of units')
parser.add_argument('--layer', '-l', type=int, default=3,
                    help='number of layers')
parser.add_argument('--use-dataset-api', default=False,
                    action='store_true',
                    help='use TextDataset API to reduce CPU memory usage')
parser.add_argument('--min-source-sentence', type=int, default=1,
                    help='minimum length of source sentence')
parser.add_argument('--max-source-sentence', type=int, default=50,
                    help='maximum length of source sentence')
parser.add_argument('--min-target-sentence', type=int, default=1,
                    help='minimum length of target sentence')
parser.add_argument('--max-target-sentence', type=int, default=50,
                    help='maximum length of target sentence')
parser.add_argument('--log-interval', type=int, default=200,
                    help='number of iterations between log outputs')
parser.add_argument('--validation-interval', type=int, default=4000,
                    help='number of iterations between evaluations of the '
                    'model with the validation dataset')
parser.add_argument('--device', '-d', type=str, default='-1',
                    help='Device specifier. Either ChainerX device '
                    'specifier or an integer. If non-negative integer, '
                    'CuPy arrays with specified device id are used. If '
                    'negative integer, NumPy arrays are used')
parser.add_argument('--out', '-o', default='result',
                    help='directory to output the result')
group = parser.add_argument_group('deprecated arguments')
group.add_argument('--gpu', '-g', dest='device',
                   type=int, nargs='?', const=0,
                   help='GPU ID (negative value indicates CPU)')
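The code later in this tutorial reads options through attributes such as args.min_source_sentence and args.use_dataset_api. The short sketch below shows how argparse derives those attribute names from long option names by replacing hyphens with underscores. The file name 'train.fr' and the reduced set of options are made up for the illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('SOURCE', help='source sentence list')
parser.add_argument('--min-source-sentence', type=int, default=1)
parser.add_argument('--use-dataset-api', default=False, action='store_true')

# Parse an explicit argument list instead of sys.argv for the demonstration.
args = parser.parse_args(['train.fr', '--min-source-sentence', '3'])

print(args.SOURCE)               # positional arguments keep their name
print(args.min_source_sentence)  # '--min-source-sentence' becomes min_source_sentence
print(args.use_dataset_api)      # a store_true flag stays False unless it is passed
```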
2.2.3 Define Network Structure¶
The Chainer implementation of seq2seq is shown below. It implements the model depicted in the above figure.
class Seq2seq(chainer.Chain):

    def __init__(self, n_layers, n_source_vocab, n_target_vocab, n_units):
        super(Seq2seq, self).__init__()
        with self.init_scope():
            self.embed_x = L.EmbedID(n_source_vocab, n_units)
            self.embed_y = L.EmbedID(n_target_vocab, n_units)
            self.encoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
            self.decoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
            self.W = L.Linear(n_units, n_target_vocab)
        self.n_layers = n_layers
        self.n_units = n_units

    def forward(self, xs, ys):
        # Reverse each source sentence.
        xs = [x[::-1] for x in xs]
        eos = self.xp.array([EOS], numpy.int32)
        ys_in = [F.concat([eos, y], axis=0) for y in ys]
        ys_out = [F.concat([y, eos], axis=0) for y in ys]
        # Both xs and ys_in are lists of arrays.
        exs = sequence_embed(self.embed_x, xs)
        eys = sequence_embed(self.embed_y, ys_in)
        batch = len(xs)
        # None represents a zero vector in an encoder.
        hx, cx, _ = self.encoder(None, None, exs)
        _, _, os = self.decoder(hx, cx, eys)
        # It is faster to concatenate data before calculating loss
        # because only one matrix multiplication is called.
        concat_os = F.concat(os, axis=0)
        concat_ys_out = F.concat(ys_out, axis=0)
        loss = F.sum(F.softmax_cross_entropy(
            self.W(concat_os), concat_ys_out, reduce='no')) / batch
        chainer.report({'loss': loss}, self)
        n_words = concat_ys_out.shape[0]
        perp = self.xp.exp(loss.array * batch / n_words)
        chainer.report({'perp': perp}, self)
        return loss

    def translate(self, xs, max_length=100):
        batch = len(xs)
        with chainer.no_backprop_mode(), chainer.using_config('train', False):
            xs = [x[::-1] for x in xs]
            exs = sequence_embed(self.embed_x, xs)
            h, c, _ = self.encoder(None, None, exs)
            ys = self.xp.full(batch, EOS, numpy.int32)
            result = []
            for i in range(max_length):
                eys = self.embed_y(ys)
                eys = F.split_axis(eys, batch, 0)
                h, c, ys = self.decoder(h, c, eys)
                cys = F.concat(ys, axis=0)
                wy = self.W(cys)
                ys = self.xp.argmax(wy.array, axis=1).astype(numpy.int32)
                result.append(ys)
        # Using `xp.concatenate(...)` instead of `xp.stack(result)` here to
        # support NumPy 1.9.
        result = chainer.get_device('@numpy').send(
            self.xp.concatenate([x[None, :] for x in result]).T)
        # Remove EOS tags
        outs = []
        for y in result:
            inds = numpy.argwhere(y == EOS)
            if len(inds) > 0:
                y = y[:inds[0, 0]]
            outs.append(y)
        return outs
In Seq2seq, three functions are defined: the constructor __init__, the function call forward, and the function for translation translate.
def __init__(self, n_layers, n_source_vocab, n_target_vocab, n_units):
    super(Seq2seq, self).__init__()
    with self.init_scope():
        self.embed_x = L.EmbedID(n_source_vocab, n_units)
        self.embed_y = L.EmbedID(n_target_vocab, n_units)
        self.encoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
        self.decoder = L.NStepLSTM(n_layers, n_units, n_units, 0.1)
        self.W = L.Linear(n_units, n_target_vocab)
    self.n_layers = n_layers
    self.n_units = n_units
When we instantiate this class to make a model, we give the number of stacked LSTMs to n_layers, the vocabulary size of the source language to n_source_vocab, the vocabulary size of the target language to n_target_vocab, and the size of hidden vectors to n_units.
This network uses chainer.links.NStepLSTM, chainer.links.EmbedID, and chainer.links.Linear as its building blocks. All the layers are registered and initialized in the context with self.init_scope().
You can access all the parameters in those layers by calling self.params().
In the constructor, it initializes all parameters with values sampled from a uniform distribution \(U(-1, 1)\).
def forward(self, xs, ys):
    xs = [x[::-1] for x in xs]
    eos = self.xp.array([EOS], numpy.int32)
    ys_in = [F.concat([eos, y], axis=0) for y in ys]
    ys_out = [F.concat([y, eos], axis=0) for y in ys]
    # Both xs and ys_in are lists of arrays.
    exs = sequence_embed(self.embed_x, xs)
    eys = sequence_embed(self.embed_y, ys_in)
    batch = len(xs)
    # None represents a zero vector in an encoder.
    hx, cx, _ = self.encoder(None, None, exs)
    _, _, os = self.decoder(hx, cx, eys)
    # It is faster to concatenate data before calculating loss
    # because only one matrix multiplication is called.
    concat_os = F.concat(os, axis=0)
    concat_ys_out = F.concat(ys_out, axis=0)
    loss = F.sum(F.softmax_cross_entropy(
        self.W(concat_os), concat_ys_out, reduce='no')) / batch
    chainer.report({'loss': loss}, self)
    n_words = concat_ys_out.shape[0]
    perp = self.xp.exp(loss.array * batch / n_words)
    chainer.report({'perp': perp}, self)
    return loss
The forward method takes sequences of the source language’s word IDs xs and sequences of the target language’s word IDs ys. Each sequence represents a sentence, and the size of xs is the minibatch size.
Note that the sequences of word IDs xs and ys are converted to vocabulary-size one-hot vectors and then multiplied with the embedding matrix in sequence_embed to obtain the embedding vectors exs and eys.
def sequence_embed(embed, xs):
    x_len = [len(x) for x in xs]
    x_section = numpy.cumsum(x_len[:-1])
    ex = embed(F.concat(xs, axis=0))
    exs = F.split_axis(ex, x_section, 0)
    return exs
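The idea behind sequence_embed is to embed all sentences of a minibatch with one lookup and then cut the result back into per-sentence pieces. The NumPy sketch below mirrors that idea under simplifying assumptions: `sequence_embed_np` is a hypothetical stand-in, the embedding matrix `E` holds one row per word ID, and the sentences are made up.

```python
import numpy as np


def sequence_embed_np(embed_matrix, xs):
    """Embed variable-length ID sequences with a single lookup (illustrative)."""
    x_len = [len(x) for x in xs]
    # Positions at which the concatenated result must be cut,
    # e.g. lengths [3, 2] give the single cut point [3].
    x_section = np.cumsum(x_len[:-1])
    ex = embed_matrix[np.concatenate(xs, axis=0)]   # one lookup for all words
    return np.split(ex, x_section, axis=0)


E = np.arange(12, dtype=np.float32).reshape(6, 2)   # 6 words, 2 dimensions
xs = [np.array([1, 2, 3]), np.array([4, 5])]        # two sentences of word IDs

exs = sequence_embed_np(E, xs)
print([ex.shape for ex in exs])
```

Doing a single lookup over the concatenated IDs avoids calling the embedding layer once per sentence, which matters when the minibatch contains many short sentences.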
self.encoder and self.decoder are the encoder and the decoder of the seq2seq model. Each element of the decoder output os is \(h_{[1:J]}^{(t)}\) in the figure above.
After calculating the recurrent layer output, the loss loss and the perplexity perp are calculated, and the values are logged by chainer.report.
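The relation between the reported loss and the perplexity can be verified numerically. In forward, the loss is the sum of the per-word cross entropy values divided by the number of sentences, and the perplexity rescales it to a per-word average before exponentiating. The per-word values below are hypothetical.

```python
import numpy as np

# Hypothetical per-word cross entropy values for a minibatch of 2 sentences
# containing 5 target words in total (EOS included).
word_losses = np.array([2.0, 1.0, 3.0, 2.0, 2.0])
batch = 2
n_words = word_losses.shape[0]

loss = word_losses.sum() / batch          # per-sentence average, as returned by forward
perp = np.exp(loss * batch / n_words)     # per-word perplexity, as reported by forward

print(loss, perp)
```

Multiplying by batch and dividing by n_words turns the per-sentence average back into a per-word average, so perp equals the exponential of the mean per-word cross entropy.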
Note
It is well known that the seq2seq model learns much better when the source
sentences are reversed.
The paper[1] says that “While the LSTM is capable of solving problems with
long term dependencies, we discovered that the LSTM learns much better when
the source sentences are reversed (the target sentences are not reversed).
By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test
BLEU scores of its decoded translations increased from 25.9 to 30.6.”
So, in the first line of forward, the input sentences are reversed: xs = [x[::-1] for x in xs].
def translate(self, xs, max_length=100):
    batch = len(xs)
    with chainer.no_backprop_mode(), chainer.using_config('train', False):
        xs = [x[::-1] for x in xs]
        exs = sequence_embed(self.embed_x, xs)
        h, c, _ = self.encoder(None, None, exs)
        ys = self.xp.full(batch, EOS, numpy.int32)
        result = []
        for i in range(max_length):
            eys = self.embed_y(ys)
            eys = F.split_axis(eys, batch, 0)
            h, c, ys = self.decoder(h, c, eys)
            cys = F.concat(ys, axis=0)
            wy = self.W(cys)
            ys = self.xp.argmax(wy.array, axis=1).astype(numpy.int32)
            result.append(ys)
    # Using `xp.concatenate(...)` instead of `xp.stack(result)` here to
    # support NumPy 1.9.
    result = chainer.get_device('@numpy').send(
        self.xp.concatenate([x[None, :] for x in result]).T)
    # Remove EOS tags
    outs = []
    for y in result:
        inds = numpy.argwhere(y == EOS)
        if len(inds) > 0:
            y = y[:inds[0, 0]]
        outs.append(y)
    return outs
After the model has learned the parameters, the function translate is called to generate the translated sentences outs from the source sentences xs.
So as not to change the parameters, the code for the translation is nested in the scopes chainer.no_backprop_mode() and chainer.using_config('train', False).
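The last part of translate cuts every generated sequence at its first EOS. The sketch below runs that post-processing on a made-up result array; the value EOS = 1 follows the vocabulary convention of this example, and the generated IDs are arbitrary.

```python
import numpy as np

EOS = 1

# Two generated sequences of length max_length = 4, one row per sentence.
result = np.array([[5, 7, 1, 4],
                   [3, 2, 6, 8]])

outs = []
for y in result:
    inds = np.argwhere(y == EOS)   # positions of EOS in this sequence
    if len(inds) > 0:
        y = y[:inds[0, 0]]         # keep only the words before the first EOS
    outs.append(y)

print([y.tolist() for y in outs])
```

A sequence that never produces EOS, like the second row, is kept at its full length of max_length.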
2.2.4 Load French-English Corpus from WMT15 Dataset¶
In this tutorial, we use the French-English corpus from the WMT15 website, which contains 10^9 documents. We must prepare additional libraries, the dataset, and the parallel corpus. To understand the preprocessing, see 2.3.1 Requirements.
After preprocessing the dataset, let’s make dataset objects:
# Load preprocessed dataset
print('[{}] Loading dataset... (this may take several minutes)'.format(
    datetime.datetime.now()))
source_ids = load_vocabulary(args.SOURCE_VOCAB)
target_ids = load_vocabulary(args.TARGET_VOCAB)

if args.use_dataset_api:
    # By using TextDataset, you can avoid loading whole dataset on memory.
    # This significantly reduces the host memory usage.
    def _filter_func(s, t):
        sl = len(s.strip().split())  # number of words in source line
        tl = len(t.strip().split())  # number of words in target line
        return (
            args.min_source_sentence <= sl <= args.max_source_sentence and
            args.min_target_sentence <= tl <= args.max_target_sentence)

    train_data = load_data_using_dataset_api(
        source_ids, args.SOURCE,
        target_ids, args.TARGET,
        _filter_func,
    )
else:
    # Load all records on memory.
    train_source = load_data(source_ids, args.SOURCE)
    train_target = load_data(target_ids, args.TARGET)
    assert len(train_source) == len(train_target)
    train_data = [
        (s, t)
        for s, t in six.moves.zip(train_source, train_target)
        if (args.min_source_sentence <= len(s) <= args.max_source_sentence
            and
            args.min_target_sentence <= len(t) <= args.max_target_sentence)
    ]
print('[{}] Dataset loaded.'.format(datetime.datetime.now()))

if not args.use_dataset_api:
    # Skip printing statistics when using TextDataset API, as it is slow.
    train_source_unknown = calculate_unknown_ratio(
        [s for s, _ in train_data])
    train_target_unknown = calculate_unknown_ratio(
        [t for _, t in train_data])
    print('Source vocabulary size: %d' % len(source_ids))
    print('Target vocabulary size: %d' % len(target_ids))
    print('Train data size: %d' % len(train_data))
    print('Train source unknown ratio: %.2f%%' % (
        train_source_unknown * 100))
    print('Train target unknown ratio: %.2f%%' % (
        train_target_unknown * 100))

target_words = {i: w for w, i in target_ids.items()}
source_words = {i: w for w, i in source_ids.items()}
This code uses utility functions below:
def load_vocabulary(path):
    with io.open(path, encoding='utf-8') as f:
        # +2 for UNK and EOS
        word_ids = {line.strip(): i + 2 for i, line in enumerate(f)}
    word_ids['<UNK>'] = 0
    word_ids['<EOS>'] = 1
    return word_ids

def load_data(vocabulary, path):
    n_lines = count_lines(path)
    bar = progressbar.ProgressBar()
    data = []
    print('loading...: %s' % path)
    with io.open(path, encoding='utf-8') as f:
        for line in bar(f, max_value=n_lines):
            words = line.strip().split()
            array = numpy.array([vocabulary.get(w, UNK)
                                 for w in words], numpy.int32)
            data.append(array)
    return data

def calculate_unknown_ratio(data):
    unknown = sum((s == UNK).sum() for s in data)
    total = sum(s.size for s in data)
    return unknown / total
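To see what these utilities produce, here is a minimal sketch that runs the same steps on an in-memory vocabulary instead of files. It uses plain Python lists in place of NumPy arrays and omits the progress bar; the names load_vocabulary_from, encode and unknown_ratio, and the toy sentences, are hypothetical and only mirror the functions above.

```python
import io

UNK = 0  # id reserved for out-of-vocabulary words
EOS = 1  # id reserved for the end-of-sentence token


def load_vocabulary_from(f):
    """Map each line of a vocabulary file object to an id, starting at 2."""
    word_ids = {line.strip(): i + 2 for i, line in enumerate(f)}
    word_ids['<UNK>'] = UNK
    word_ids['<EOS>'] = EOS
    return word_ids


def encode(vocabulary, line):
    """Convert a whitespace-tokenized sentence into a list of word ids."""
    return [vocabulary.get(w, UNK) for w in line.strip().split()]


def unknown_ratio(data):
    """Fraction of all tokens in the data that were mapped to UNK."""
    unknown = sum(s.count(UNK) for s in data)
    total = sum(len(s) for s in data)
    return unknown / total


vocab = load_vocabulary_from(io.StringIO('the\ncat\nsat\n'))
data = [encode(vocab, 'the cat sat'), encode(vocab, 'the dog sat down')]
print(data)
print('unknown ratio: %.2f%%' % (unknown_ratio(data) * 100))
```

Here "dog" and "down" are not in the vocabulary, so two of the seven tokens map to UNK.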
2.2.5 Define Evaluation Function (BLEU Score)¶
BLEU [3] (bilingual evaluation understudy) is an evaluation metric for the quality of text that has been machine-translated from one natural language to another.
class CalculateBleu(chainer.training.Extension):
    trigger = 1, 'epoch'
    priority = chainer.training.PRIORITY_WRITER

    def __init__(
            self, model, test_data, key, device, batch=100, max_length=100):
        self.model = model
        self.test_data = test_data
        self.key = key
        self.batch = batch
        self.device = device
        self.max_length = max_length

    def __call__(self, trainer):
        device = self.device
        with chainer.no_backprop_mode():
            references = []
            hypotheses = []
            for i in range(0, len(self.test_data), self.batch):
                sources, targets = zip(*self.test_data[i:i + self.batch])
                references.extend([[t.tolist()] for t in targets])
                sources = [device.send(x) for x in sources]
                ys = [y.tolist()
                      for y in self.model.translate(sources, self.max_length)]
                hypotheses.extend(ys)
        bleu = bleu_score.corpus_bleu(
            references, hypotheses,
            smoothing_function=bleu_score.SmoothingFunction().method1)
        chainer.report({self.key: bleu})
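The extension above delegates the actual scoring to bleu_score.corpus_bleu, which appears to come from NLTK. To give a feel for what BLEU measures, the sketch below computes only its core ingredient, the modified (clipped) n-gram precision, in pure Python. It is a simplified illustration with hypothetical helper names: it does not combine several n-gram orders, apply the brevity penalty, or perform the smoothing that the extension requests.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision of one hypothesis against its references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # A hypothesis n-gram is credited at most as often as a reference has it.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


references = ['the cat is on the mat'.split(),
              'there is a cat on the mat'.split()]
hypothesis = 'the the the the the the the'.split()
p1 = modified_precision(references, hypothesis, 1)  # 2 of 7 unigrams credited
p2 = modified_precision(references, hypothesis, 2)  # no bigram matches
print(p1, p2)
```

Clipping is what prevents a degenerate hypothesis that repeats a common word from scoring well: "the" occurs seven times in the hypothesis but is credited only twice, the most it occurs in any one reference.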
2.2.6 Create Iterator¶
The code below creates the iterator object for the training data.
train_iter = chainer.iterators.SerialIterator(train_data, args.batchsize)
2.2.7 Create RNN and Classification Model¶
Instantiate the Seq2seq model.
model = Seq2seq(args.layer, len(source_ids), len(target_ids), args.unit)
2.2.8 Setup Optimizer¶
Prepare an optimizer. We use chainer.optimizers.Adam.
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)
2.2.9 Setup and Run Trainer¶
Let’s make a trainer object.
updater = training.updaters.StandardUpdater(
    train_iter, optimizer, converter=convert, device=device)
trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)
trainer.extend(extensions.LogReport(
    trigger=(args.log_interval, 'iteration')))
trainer.extend(extensions.PrintReport(
    ['epoch', 'iteration', 'main/loss', 'main/perp',
     'validation/main/bleu', 'elapsed_time']),
    trigger=(args.log_interval, 'iteration'))
trainer.extend(
    extensions.snapshot(filename='snapshot_epoch_{.updater.iteration}'),
    trigger=(args.validation_interval, 'iteration'))
Set up the trainer’s extensions to see the BLEU score on the test data.
test_source = load_data(source_ids, args.validation_source)
test_target = load_data(target_ids, args.validation_target)
assert len(test_source) == len(test_target)
test_data = list(six.moves.zip(test_source, test_target))
test_data = [(s, t) for s, t in test_data if 0 < len(s) and 0 < len(t)]
test_source_unknown = calculate_unknown_ratio(
    [s for s, _ in test_data])
test_target_unknown = calculate_unknown_ratio(
    [t for _, t in test_data])
print('Validation data: %d' % len(test_data))
print('Validation source unknown ratio: %.2f%%' %
      (test_source_unknown * 100))
print('Validation target unknown ratio: %.2f%%' %
      (test_target_unknown * 100))

@chainer.training.make_extension()
def translate(trainer):
    source, target = test_data[numpy.random.choice(len(test_data))]
    result = model.translate([model.xp.array(source)])[0]
    source_sentence = ' '.join([source_words[x] for x in source])
    target_sentence = ' '.join([target_words[y] for y in target])
    result_sentence = ' '.join([target_words[y] for y in result])
    print('# source : ' + source_sentence)
    print('# result : ' + result_sentence)
    print('# expect : ' + target_sentence)

trainer.extend(
    translate, trigger=(args.validation_interval, 'iteration'))
trainer.extend(
    CalculateBleu(
        model, test_data, 'validation/main/bleu', device),
    trigger=(args.validation_interval, 'iteration'))
if args.resume is not None:
    # Resume from a snapshot
    chainer.serializers.load_npz(args.resume, trainer)
Let’s start the training!
trainer.run()
if args.save is not None:
    # Save a snapshot
    chainer.serializers.save_npz(args.save, trainer)
2.3 Run Example¶
2.3.1 Requirements¶
Before running the example, you must prepare additional libraries, the dataset, and the parallel corpus.
See the detailed description in chainer/examples/seq2seq/README.md.
2.3.2 Training the model¶
You can train the model with the script chainer/examples/seq2seq/seq2seq.py:
$ pwd
/root2chainer/chainer/examples/seq2seq
$ python seq2seq.py --gpu=0 giga-fren.preprocess.en giga-fren.preprocess.fr \
    vocab.en vocab.fr \
    --validation-source newstest2013.preprocess.en \
    --validation-target newstest2013.preprocess.fr > log
100% (22520376 of 22520376) ############# Elapsed Time: 0:09:20 Time: 0:09:20
100% (22520376 of 22520376) ############# Elapsed Time: 0:10:36 Time: 0:10:36
100% (3000 of 3000) ##################### Elapsed Time: 0:00:00 Time: 0:00:00
100% (3000 of 3000) ##################### Elapsed Time: 0:00:00 Time: 0:00:00
epoch  iteration  main/loss  validation/main/loss  main/perp  validation/main/perp  validation/main/bleu  elapsed_time
0      200        171.449                          991.556                                                85.6739
0      400        143.918                          183.594                                                172.473
0      600        133.48                           126.945                                                260.315
0      800        128.734                          104.127                                                348.062
0      1000       124.741                          91.5988                                                436.536
...
Note
Before running the script, check your locale and Python’s encoding, and set both to use UTF-8.
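One way to check the current settings is to ask Python which encodings it has picked up from the environment. The snippet below is a small diagnostic sketch using only the standard library; the function name report_encodings is made up for illustration.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        'preferred': locale.getpreferredencoding(False),
        'filesystem': sys.getfilesystemencoding(),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


encodings = report_encodings()
for name, value in encodings.items():
    print('{}: {}'.format(name, value))
```

If any of these is not a UTF-8 variant, a common remedy is to set a UTF-8 locale (for example through the LANG environment variable) or PYTHONIOENCODING=utf-8 before launching the training script.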
2.3.3 Validate the model¶
While you are training the model, you can get the validation results:
...
# source : We knew the Government had tried many things , like launching <UNK> with <UNK> or organising speed dating evenings .
# result : Nous savions que le gouvernement avait <UNK> plusieurs fois , comme le <UNK> <UNK> , le <UNK> ou le <UNK> <UNK> .
# expect : Nous savions que le gouvernement avait tenté plusieurs choses comme lancer des parfums aux <UNK> ou organiser des soirées de <UNK>
...
API Reference¶
Variable and Parameter¶
Variable classes and utilities¶
Array with a structure to keep track of computation. 

Returns the underlying array from a variable or an array. 

Converts an array or a variable into 

Runs backpropagation from variables simultaneously. 

Parameter variable that can be registered to a link. 

Node in the backward computational graph representing a variable. 
N-dimensional array¶
chainer.Variable holds its value as an n-dimensional array (ndarray).
Chainer supports the following classes:
numpy.ndarray, including ideep4py.mdarray
Note
Python scalars (float, etc.) and NumPy scalars (numpy.float16, numpy.float32, etc.) cannot be used as chainer.Variable.array.
See also chainer.utils.force_array().
Functions¶
Chainer provides a variety of built-in function implementations in the chainer.functions package.
These functions usually return a Variable object or a tuple of multiple Variable objects.
For a Variable argument of a function, an N-dimensional array can be passed if you do not need its gradient.
Some functions additionally support scalar arguments.
Note
Functions implemented in Chainer consist of the following two parts:
A class that inherits FunctionNode, which defines forward/backward computation.
A “wrapper” function around the class.
APIs listed on this page are “wrappers” of FunctionNode implementations.
In most cases, you don’t have to use FunctionNode classes directly.
For example, chainer.functions.sum() is a wrapper function defined as def sum(...): in chainer/functions/math/sum.py, and it calls its corresponding FunctionNode implementation, Sum.
Some functions may not have a corresponding FunctionNode implementation; one example is chainer.functions.average(), which is defined in chainer/functions/math/average.py and calls other wrapper functions to calculate the average.
If you are implementing your own functions, please see Define your own function.
Arithmetic functions¶
Basic arithmetic operations for Variables are implemented as operators.
Refer to the Notes section of Variable for details.
chainer.functions.add() provides better performance when accumulating three or more Variables at once.
Element-wise addition.
Activation functions¶
Clipped Rectifier Unit function. 

Concatenated Rectified Linear Unit function. 

Exponential Linear Unit function. 

Element-wise hard-sigmoid function.

Leaky Rectified Linear Unit function.

Channel-wise log-softmax function.

Long Short-Term Memory units as an activation function.

Maxout activation function.

Parametric ReLU function.

Randomized Leaky Rectified Linear Unit function.

Rectified Linear Unit function.

Rectifier Unit function clipped at 6.

Scaled Exponential Linear Unit function.

Element-wise sigmoid logistic function.

S-LSTM units as an activation function.

Softmax function.

Element-wise softplus function.

Swish activation function.

Element-wise hyperbolic tangent function.

TreeLSTM unit as an activation function. 
Array manipulations¶
Create a new view of array with the given shape, strides, and offset. 

Broadcast given variables. 

Broadcast a given variable to a given shape. 

Cast an input variable to a given type. 

Concatenates given variables along an axis. 

Copies the input variable onto the specified device. 

Computes the depth2space transformation for subpixel calculations. 

Take diagonal 

Concatenate variables along third axis (depth wise). 

Expands dimensions of an input variable without copy. 

Flatten a given array into one dimension. 

Flips an input variable in reverse order along the given axis. 

Flip array in the left/right direction. 

Flip array in the up/down direction. 

Extract elements from array with specified shape, axes and offsets. 

Concatenate variables horizontally (column wise). 

Extract patches from an image based on the filter. 

Move the source axes to the destination. 

Pad an input variable. 

Pad given arrays to make a matrix. 

Permutates a given variable along an axis. 

Construct an array by repeating a given array. 

Reshapes an input variable without copy. 

Resize images to the given shape. 

Roll the axis backwards to the given position. 

Adds given values to specified elements of an array. 

Select elements stored in given indices. 

Separates an array along a given axis. 

Computes the space2depth transformation for subpixel calculations. 

2D Spatial Transformer grid. 

2D Spatial Transformer sampler. 

Splits given variables along an axis. 

Remove dimensions of size one from the shape of an ndarray.

Concatenate variables along a new axis. 

Swap two axes of a variable. 

Construct an array by tiling a given array. 

Permute the dimensions of an input variable without copy. 

Transpose a list of Variables. 

Concatenate variables vertically (row wise). 

Choose elements depending on condition. 
Neural network connections¶
Applies a bilinear function based on given parameters. 

1-dimensional convolution function.

Two-dimensional convolution function.

3-dimensional convolution function.

N-dimensional convolution function.

1-dimensional deconvolution function.

Two-dimensional deconvolution function.

3-dimensional deconvolution function.

N-dimensional deconvolution function.

Two-dimensional depthwise convolution function.

Two-dimensional deformable convolution function using computed offset.

Two-dimensional dilated convolution function.

Efficient linear function for one-hot input.

Linear function, or affine transformation.

Two-dimensional local convolution function.

Stacked Bi-directional Gated Recurrent Unit function.

Stacked Bi-directional Long Short-Term Memory function.

Stacked Bi-directional RNN function for sequence inputs.

Stacked Uni-directional Gated Recurrent Unit function.

Stacked Uni-directional Long Short-Term Memory function.

Stacked Unidirectional RNN function for sequence inputs. 

Shift function. 
Evaluation functions¶
Computes multiclass classification accuracy of the minibatch. 

Computes binary classification accuracy of the minibatch. 

Calculates Precision, Recall, F beta Score, and support. 

Computes R^2 (coefficient of determination) regression score function.

Loss functions¶
Element-wise absolute error function.

Computes the negative log-likelihood of a Bernoulli distribution.

BlackOut loss function.

Connectionist Temporal Classification loss function.

Computes contrastive loss.

Calculates negative log-likelihood of linear-chain CRF.

Computes a state that maximizes a joint probability of the given CRF.

Computes the sum-squared cross-covariance penalty between

Computes the DeCov loss of

Discriminative margin-based clustering loss function

Computes the KL-divergence of Gaussian variables from the standard one.

Computes the negative log-likelihood of a Gaussian distribution.

Computes the hinge loss for a one-of-many classification task.

Computes the Huber loss.

Mean absolute error function.

Mean squared error function.

Negative sampling loss function.

Computes cross entropy loss for pre-sigmoid activations.

Computes cross entropy loss for pre-softmax activations.

Squared error function. 

Computes triplet loss. 
Mathematical functions¶
Element-wise absolute.
