Chainer – A flexible framework of neural networks¶
This is the Chainer documentation.
Tutorial¶
Introduction to Chainer¶
This is the first section of the Chainer Tutorial. In this section, you will learn about the following things:
- Pros and cons of existing frameworks and why we are developing Chainer
- Simple example of forward and backward computation
- Usage of links and their gradient computation
- Construction of chains (a.k.a. “model” in most frameworks)
- Parameter optimization
- Serialization of links and optimizers
After reading this section, you will be able to:
- Compute gradients of simple arithmetic expressions
- Write a multi-layer perceptron with Chainer
Core Concept¶
As mentioned on the front page, Chainer is a flexible framework for neural networks. One major goal is flexibility, so it must enable us to write complex architectures simply and intuitively.
Most existing deep learning frameworks are based on the “Define-and-Run” scheme. That is, first a network is defined and fixed, and then the user periodically feeds it with mini-batches. Since the network is statically defined before any forward/backward computation, all the logic must be embedded into the network architecture as data. Consequently, defining a network architecture in such systems (e.g. Caffe) follows a declarative approach. Note that one can still produce such a static network definition using imperative languages (e.g. torch.nn, Theano-based frameworks, and TensorFlow).
Define-by-Run¶
In contrast, Chainer adopts a “Define-by-Run” scheme, i.e., the network is defined on-the-fly via the actual forward computation. More precisely, Chainer stores the history of computation instead of programming logic. This strategy enables us to fully leverage the power of programming logic in Python. For example, Chainer does not need any magic to introduce conditionals and loops into the network definitions. The Define-by-Run scheme is the core concept of Chainer. We will show in this tutorial how to define networks dynamically.
This strategy also makes it easy to write multi-GPU parallelization, since logic comes closer to network manipulation. We will review such amenities in later sections of this tutorial.
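As a small illustration of this idea (a minimal sketch using only Variable arithmetic, which is introduced in detail below), ordinary Python conditionals and loops become part of the recorded computation:

import numpy as np
from chainer import Variable

x = Variable(np.array([3.0], dtype=np.float32))

# Ordinary Python control flow defines the graph as it runs:
# only the branch actually executed is recorded.
if float(x.data) > 0:
    y = x * 2
else:
    y = -x

# A Python loop likewise unrolls into the recorded history.
for _ in range(3):
    y = y + x

y.backward()
print(x.grad)  # [5.] -- the gradient of exactly the operations that ran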
Chainer represents a network as an execution path on a computational graph. A computational graph is a series of function applications, so it can be described with multiple Function objects. When such a function is a layer of a neural network, its parameters are updated through training. Because these functions need to hold trainable parameters, Chainer provides the Link class, which keeps trainable parameters inside the object. The parameters of the function performed inside a Link object are represented as Variable objects. In short, the difference between Link and Function is whether the object contains trainable parameters or not. A neural network model is typically described as a series of Function and Link applications.
You can build a computational graph by dynamically 'chaining' various kinds of Links and Functions to define a Chain. The network is defined by running the chained graph; hence the name Chainer.
Note
In the example code of this tutorial, we assume for simplicity that the following symbols are already imported:
import numpy as np
import chainer
from chainer import cuda, Function, gradient_check, report, training, utils, Variable
from chainer import datasets, iterators, optimizers, serializers
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L
from chainer.training import extensions
These imports appear widely in Chainer code and examples. For simplicity, we omit these imports in this tutorial.
Forward/Backward Computation¶
As described above, Chainer uses the “Define-by-Run” scheme, so forward computation itself defines the network.
To start forward computation, we have to wrap the input array in a Variable object. Here we start with a simple ndarray with only one element:
>>> x_data = np.array([5], dtype=np.float32)
>>> x = Variable(x_data)
A Variable object has basic arithmetic operators. In order to compute \(y = x^2 - 2x + 1\), just write:
>>> y = x**2 - 2 * x + 1
The resulting y is also a Variable object, whose value can be extracted by accessing the data attribute:
>>> y.data
array([16.], dtype=float32)
y holds not only the result value but also the history of computation (i.e., the computational graph), which enables us to compute its differentiation. This is done by calling its backward() method:
>>> y.backward()
This runs error backpropagation (a.k.a. backprop or reverse-mode automatic differentiation).
Then, the gradient is computed and stored in the grad attribute of the input variable x:
>>> x.grad
array([8.], dtype=float32)
We can also compute gradients of intermediate variables. Note that, by default, Chainer releases the gradient arrays of intermediate variables for memory efficiency. To preserve gradient information, pass the retain_grad argument to the backward() method:
>>> z = 2*x
>>> y = x**2 - z + 1
>>> y.backward(retain_grad=True)
>>> z.grad
array([-1.], dtype=float32)
Otherwise, z.grad will be None, as follows:
>>> y.backward() # The default value of retain_grad is False
>>> z.grad is None
True
All these computations are easily generalized to multi-element array inputs. Note that if we want to start backward computation from a variable holding a multi-element array, we must set the initial error manually. When the size of a variable (i.e., the number of elements in the array) is 1, it is considered to represent a loss value, so its grad attribute is automatically filled with 1. When the size of a variable is larger than 1, the grad attribute remains None, and it is necessary to set the initial error explicitly before calling backward(). This is done by setting the grad attribute of the output variable as follows:
>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x**2 - 2*x + 1
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward()
>>> x.grad
array([[ 0., 2., 4.],
[ 6., 8., 10.]], dtype=float32)
Links¶
To write neural networks, we have to combine functions that have parameters and optimize those parameters. You can use links to do this. A Link is an object that holds parameters (i.e., optimization targets).
The most fundamental links behave like regular functions while replacing some of their arguments with parameters. We will introduce higher-level links later, but for now think of links simply as functions with parameters.
One of the most frequently used links is the Linear link (a.k.a. fully-connected layer or affine transformation). It represents the mathematical function \(f(x) = xW^\top + b\), where the matrix \(W\) and the vector \(b\) are parameters. This link corresponds to its pure-function counterpart linear(), which accepts \(x, W, b\) as arguments.
A linear link from three-dimensional space to two-dimensional space is defined by the following line:
>>> f = L.Linear(3, 2)
Note
Most functions and links only accept mini-batch input, where the first dimension of the input array is considered the batch dimension. In the above Linear link case, the input must have the shape (N, 3), where N is the mini-batch size.
The parameters of a link are stored as attributes. Each parameter is an instance of Variable. In the case of the Linear link, two parameters, W and b, are stored. By default, the matrix W is initialized randomly, while the vector b is initialized with zeros.
>>> f.W.data
array([[ 1.0184761 , 0.23103087, 0.5650746 ],
[ 1.2937803 , 1.0782351 , -0.56423163]], dtype=float32)
>>> f.b.data
array([0., 0.], dtype=float32)
An instance of the Linear link acts like a usual function:
>>> x = Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = f(x)
>>> y.data
array([[3.1757617, 1.7575557],
[8.619507 , 7.1809077]], dtype=float32)
Note
Sometimes it is cumbersome to compute the dimension of the input space. The Linear link and some of the (de)convolution links can omit the input dimension at instantiation and infer it from the first mini-batch.
For example, the following line creates a linear link whose output dimension is two:
f = L.Linear(2)
If we feed a mini-batch of shape (N, M), the input dimension will be inferred as M, which means f.W will be a 2 x M matrix. Note that the parameters are initialized lazily at the first mini-batch, so f does not have a W attribute until data has been passed to the link.
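A minimal sketch of this lazy behavior (the variable names and sizes are ours for illustration): the parameter shape is only determined after the first call.

g = L.Linear(2)                      # output dimension only; input dimension unknown
# As noted above, g.W is not available yet; it is created at the first call.
x = np.zeros((1, 5), dtype=np.float32)
y = g(x)                             # input dimension inferred as 5
print(g.W.shape)                     # (2, 5)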
Gradients of parameters are computed by the backward() method. Note that gradients are accumulated by this method rather than overwritten, so you must first clear the gradients to start a fresh computation. This can be done by calling the cleargrads() method.
>>> f.cleargrads()
Note
cleargrads() was introduced in v1.15 to replace zerograds() for efficiency. zerograds() is left only for backward compatibility.
Now we can compute the gradients of parameters by simply calling the backward method.
>>> y.grad = np.ones((2, 2), dtype=np.float32)
>>> y.backward()
>>> f.W.grad
array([[5., 7., 9.],
[5., 7., 9.]], dtype=float32)
>>> f.b.grad
array([2., 2.], dtype=float32)
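To see the accumulation behavior mentioned above, here is a minimal sketch, continuing with the same f and x: running backward() twice without clearing adds the gradients together, while cleargrads() resets them.

f.cleargrads()                          # start from a clean slate
y = f(x)
y.grad = np.ones((2, 2), dtype=np.float32)
y.backward()
first = f.W.grad.copy()

# A second backward pass adds to the existing gradients...
y = f(x)
y.grad = np.ones((2, 2), dtype=np.float32)
y.backward()
print(np.allclose(f.W.grad, 2 * first))  # True: gradients were accumulated

# ...so clear them before starting a new computation.
f.cleargrads()
print(f.W.grad)                          # None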
Write a model as a chain¶
Most neural network architectures contain multiple links. For example, a multi-layer perceptron consists of multiple linear layers. We can write complex procedures with trainable parameters by combining multiple links like this:
>>> l1 = L.Linear(4, 3)
>>> l2 = L.Linear(3, 2)
>>> def my_forward(x):
... h = l1(x)
... return l2(h)
Here, L indicates the links module. A procedure with parameters defined in this way is hard to reuse. A more Pythonic way is to combine the links and the procedure into a class:
>>> class MyProc(object):
... def __init__(self):
... self.l1 = L.Linear(4, 3)
... self.l2 = L.Linear(3, 2)
...
... def forward(self, x):
... h = self.l1(x)
... return self.l2(h)
To make it more reusable, we want to support parameter management, CPU/GPU migration, robust and flexible save/load features, and so on. These features are all supported by the Chain class in Chainer. All we have to do is define the above class as a subclass of Chain:
>>> class MyChain(Chain):
... def __init__(self):
... super(MyChain, self).__init__()
... with self.init_scope():
... self.l1 = L.Linear(4, 3)
... self.l2 = L.Linear(3, 2)
...
... def __call__(self, x):
... h = self.l1(x)
... return self.l2(h)
This shows how a complex chain is constructed from simpler links. Links such as l1 and l2 are called child links of MyChain. Note that Chain itself inherits from Link, which means we can define more complex chains that hold MyChain objects as their child links.
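For example (a minimal sketch with arbitrary layer sizes), a chain can hold MyChain instances as child links just like any other link:

class MyChain3(Chain):
    def __init__(self):
        super(MyChain3, self).__init__()
        with self.init_scope():
            # MyChain instances are registered as child links, so their
            # parameters are managed by MyChain3 as well.
            self.block = MyChain()
            self.out = L.Linear(2, 2)

    def __call__(self, x):
        h = self.block(x)
        return self.out(h)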
Note
We often define the single forward method of a link via the __call__ operator. Such links and chains are callable and behave like regular functions of Variables.
Note
In Chainer v1, we could also register trainable layers (i.e., Links) to the model by passing them to the __init__() of Chain or by registering them via add_link(). These approaches are deprecated in Chainer v2, so users are recommended to use the approach explained above.
Another way to define a chain is to use the ChainList class, which behaves like a list of links:
>>> class MyChain2(ChainList):
... def __init__(self):
... super(MyChain2, self).__init__(
... L.Linear(4, 3),
... L.Linear(3, 2),
... )
...
... def __call__(self, x):
... h = self[0](x)
... return self[1](h)
ChainList is convenient when the chain uses an arbitrary number of links, as in the sketch below; however, if the number of links is fixed as in the above case, the Chain class is recommended as the base class.
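As a minimal sketch of where ChainList shines (the layer count and sizes are arbitrary choices for illustration), the number of child links can be decided at construction time:

class DeepMLP(ChainList):
    def __init__(self, n_layers, n_units):
        # The number of child links is decided at run time, which is where
        # ChainList is more convenient than a fixed Chain.
        links = [L.Linear(None, n_units) for _ in range(n_layers)]
        super(DeepMLP, self).__init__(*links)

    def __call__(self, x):
        for link in self.children():
            x = F.relu(link(x))
        return x

model = DeepMLP(n_layers=5, n_units=10)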
Optimizer¶
To find good values for the parameters, we have to optimize them with the Optimizer class. It runs a numerical optimization algorithm on a given link. Many algorithms are implemented in the optimizers module. Here we use the simplest one, called Stochastic Gradient Descent (SGD):
>>> model = MyChain()
>>> optimizer = optimizers.SGD()
>>> optimizer.setup(model)
The setup() method prepares the optimizer to operate on the given link.
Some parameter/gradient manipulations, e.g. weight decay and gradient clipping, can be done by setting hook functions to the optimizer. Hook functions are called after the gradient computation and right before the actual update of parameters. For example, we can set weight decay regularization by running the next line beforehand:
>>> optimizer.add_hook(chainer.optimizer.WeightDecay(0.0005))
Of course, you can write your own hook functions. A hook should be a function or a callable object that takes the optimizer as its argument, as in the sketch below.
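As a minimal sketch of a custom hook (the GradientScaling class and its scaling factor are our own illustration, not a built-in), a callable that scales every gradient before the update could look like this:

class GradientScaling(object):
    """Multiply all parameter gradients by a constant factor."""
    name = 'GradientScaling'   # used by the optimizer to identify the hook

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, opt):
        # Called after gradients are computed and before the parameter update.
        for param in opt.target.params():
            if param.grad is not None:
                param.grad *= self.factor

optimizer.add_hook(GradientScaling(0.1))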
There are two ways to use the optimizer. One is to use it via a Trainer, which we will see in the following sections. The other is to use it directly; we review that case here. If you only want to use the optimizer in a simple way, skip the rest of this subsection and go to the next one.
There are two further ways to use the optimizer directly. One is to compute the gradients manually and then call the update() method with no arguments. Do not forget to clear the gradients beforehand!
>>> x = np.random.uniform(-1, 1, (2, 4)).astype('f')
>>> model.cleargrads()
>>> # compute gradient here...
>>> loss = F.sum(model(chainer.Variable(x)))
>>> loss.backward()
>>> optimizer.update()
The other way is to pass a loss function directly to the update() method. In this case, cleargrads() is called automatically by the update method, so the user does not have to call it manually.
>>> def lossfun(arg1, arg2):
... # calculate loss
... loss = F.sum(model(arg1 - arg2))
... return loss
>>> arg1 = np.random.uniform(-1, 1, (2, 4)).astype('f')
>>> arg2 = np.random.uniform(-1, 1, (2, 4)).astype('f')
>>> optimizer.update(lossfun, chainer.Variable(arg1), chainer.Variable(arg2))
See Optimizer.update() for the full specification.
Trainer¶
When we want to train neural networks, we have to run training loops that update the parameters many times. A typical training loop consists of the following procedures:
- Iterations over training datasets
- Preprocessing of extracted mini-batches
- Forward/backward computations of the neural networks
- Parameter updates
- Evaluations of the current parameters on validation datasets
- Logging and printing of the intermediate results
Chainer provides a simple yet powerful way to make it easy to write such training processes. The training loop abstraction mainly consists of two components:
- Dataset abstraction. It implements 1 and 2 in the above list. The core components are defined in the dataset module. There are also many implementations of datasets and iterators in the datasets and iterators modules, respectively.
- Trainer. It implements 3, 4, 5, and 6 in the above list. The whole procedure is implemented by Trainer. The way to update parameters (3 and 4) is defined by Updater, which can be freely customized. 5 and 6 are implemented by instances of Extension, which appends an extra procedure to the training loop. Users can freely customize the training procedure by adding extensions, and can also implement their own extensions.
We will see how to use Trainer in the example section below.
Serializer¶
Before proceeding to the first example, we introduce the Serializer, which is the last core feature described on this page. A Serializer is a simple interface to serialize or deserialize an object. Link, Optimizer, and Trainer all support serialization. Concrete serializers are defined in the serializers module, which supports the NumPy NPZ and HDF5 formats.
For example, we can serialize a link object into an NPZ file with the serializers.save_npz() function:
>>> serializers.save_npz('my.model', model)
This saves the parameters of model into the file 'my.model' in NPZ format. The saved model can be read back with the serializers.load_npz() function:
>>> serializers.load_npz('my.model', model)
Note
Note that only the parameters and persistent values are serialized by this code; other attributes are not saved automatically. You can register arrays, scalars, or any serializable objects as persistent values with the Link.add_persistent() method. The registered values can then be accessed as attributes with the name passed to add_persistent(), as sketched below.
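A minimal sketch of registering a persistent value (the attribute and file names are ours for illustration):

class StatefulChain(Chain):
    def __init__(self):
        super(StatefulChain, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(3, 2)
        # Not a parameter, but it is saved and loaded together with them.
        self.add_persistent('n_processed', 0)

    def __call__(self, x):
        self.n_processed += len(x)   # accessible as a regular attribute
        return self.l1(x)

chain = StatefulChain()
serializers.save_npz('stateful.model', chain)  # includes n_processed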
The state of an optimizer can also be saved by the same functions:
>>> serializers.save_npz('my.state', optimizer)
>>> serializers.load_npz('my.state', optimizer)
Note
Note that serializing an optimizer only saves its internal state, including the number of iterations, the momentum vectors of MomentumSGD, and so on. It does not save the parameters and persistent values of the target link; we have to save the target link explicitly, along with the optimizer, to resume the optimization from a saved state.
Support of the HDF5 format is enabled if the h5py package is installed.
Serialization and deserialization with the HDF5 format are almost identical to those with the NPZ format; just replace save_npz() and load_npz() with save_hdf5() and load_hdf5(), respectively.
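For example, assuming the h5py package is available, the HDF5 counterpart of the snippet above is:

serializers.save_hdf5('my.model', model)   # requires the h5py package
serializers.load_hdf5('my.model', model)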
Example: Multi-layer Perceptron on MNIST¶
Now you can solve a multiclass classification task using a multi-layer perceptron (MLP).
We use a hand-written digits dataset called MNIST, which is one of the long-standing de facto “hello world” examples used in machine learning.
This MNIST example is also found in the examples/mnist directory of the official repository.
In this section, we show how to use Trainer to construct and run the training loop.
We first have to prepare the MNIST dataset.
The MNIST dataset consists of 70,000 greyscale images of size 28x28 (i.e. 784 pixels) and corresponding digit labels.
The dataset is divided into 60,000 training images and 10,000 test images by default.
We can obtain the vectorized version (i.e., a set of 784-dimensional vectors) with datasets.get_mnist():
>>> train, test = datasets.get_mnist()
...
This code automatically downloads the MNIST dataset and saves the NumPy arrays to the $(HOME)/.chainer directory. The returned train and test can be seen as lists of image-label pairs (strictly speaking, they are instances of TupleDataset).
We also have to define how to iterate over these datasets.
We want to shuffle the training dataset for every epoch, i.e. at the beginning of every sweep over the dataset.
In this case, we can use iterators.SerialIterator:
>>> train_iter = iterators.SerialIterator(train, batch_size=100, shuffle=True)
On the other hand, we do not have to shuffle the test dataset. In this case, we can pass the shuffle=False argument to disable shuffling. This makes iteration faster when the underlying dataset supports fast slicing.
>>> test_iter = iterators.SerialIterator(test, batch_size=100, repeat=False, shuffle=False)
We also pass repeat=False, which means iteration stops once all examples have been visited. This option is usually required for test/validation datasets; without it, the iteration enters an infinite loop.
Next, we define the architecture. We use a simple three-layer rectifier network with 100 units per layer as an example.
>>> class MLP(Chain):
... def __init__(self, n_units, n_out):
... super(MLP, self).__init__()
... with self.init_scope():
... # the size of the inputs to each layer will be inferred
... self.l1 = L.Linear(None, n_units) # n_in -> n_units
... self.l2 = L.Linear(None, n_units) # n_units -> n_units
... self.l3 = L.Linear(None, n_out) # n_units -> n_out
...
... def __call__(self, x):
... h1 = F.relu(self.l1(x))
... h2 = F.relu(self.l2(h1))
... y = self.l3(h2)
... return y
This link uses relu() as its activation function. Note that l3 is the final linear layer, whose output corresponds to scores for the ten digits.
In order to compute loss values or evaluate the accuracy of the predictions, we define a classifier chain on top of the above MLP chain:
>>> class Classifier(Chain):
... def __init__(self, predictor):
... super(Classifier, self).__init__()
... with self.init_scope():
... self.predictor = predictor
...
... def __call__(self, x, t):
... y = self.predictor(x)
... loss = F.softmax_cross_entropy(y, t)
... accuracy = F.accuracy(y, t)
... report({'loss': loss, 'accuracy': accuracy}, self)
... return loss
This Classifier class computes the accuracy and loss, and returns the loss value. The pair of arguments x and t corresponds to each example in the dataset (a tuple of an image and a label). softmax_cross_entropy() computes the loss value given the prediction and the ground-truth labels, while accuracy() computes the prediction accuracy. We can set an arbitrary predictor link on an instance of the classifier.
The report() function reports the loss and accuracy values to the trainer. For the detailed mechanism of collecting training statistics, see Reporter. You can also collect other types of observations, such as activation statistics, in a similar way.
Note that a class similar to the Classifier above is already defined as chainer.links.Classifier, so instead of using the class above, we will use this predefined Classifier chain.
>>> model = L.Classifier(MLP(100, 10)) # the input size, 784, is inferred
>>> optimizer = optimizers.SGD()
>>> optimizer.setup(model)
Now we can build a trainer object.
>>> updater = training.StandardUpdater(train_iter, optimizer)
>>> trainer = training.Trainer(updater, (20, 'epoch'), out='result')
The second argument, (20, 'epoch'), represents the duration of training. We can use either epoch or iteration as the unit. In this case, we train the multi-layer perceptron by iterating over the training set 20 times. To invoke the training loop, we just call the run() method.
>>> trainer.run()
This method executes the whole training sequence.
The above code just optimizes the parameters. In most cases, we also want to see how the training proceeds; for that, we can register extensions before calling the run() method.
>>> trainer.extend(extensions.Evaluator(test_iter, model))
>>> trainer.extend(extensions.LogReport())
>>> trainer.extend(extensions.PrintReport(['epoch', 'main/accuracy', 'validation/main/accuracy']))
>>> trainer.extend(extensions.ProgressBar())
>>> trainer.run()
These extensions perform the following tasks:
Evaluator
- Evaluates the current model on the test dataset at the end of every epoch. It automatically switches to test mode (see Configuring Chainer for details), so we do not have to take any special action for functions that behave differently in training/test modes (e.g. dropout(), BatchNormalization).
LogReport
- Accumulates the reported values and emits them to the log file in the output directory.
PrintReport
- Prints the selected items from the LogReport.
ProgressBar
- Shows a progress bar.
There are many more extensions implemented in the chainer.training.extensions module. The most important one not included above is snapshot(), which saves a snapshot of the training procedure (i.e., the Trainer object) to a file in the output directory; a usage sketch follows below.
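For example, a minimal sketch of taking a snapshot at the end of every epoch (the trigger interval is an arbitrary choice):

trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))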
The example code in the examples/mnist directory additionally contains GPU support, though the essential part is the same as the code in this tutorial. We will review in later sections how to use GPU(s).
How to Write a New Network¶
Convolutional Network for Visual Recognition Tasks¶
In this section, you will learn how to write
- A small convolutional network with a model class that is inherited from Chain,
- A large convolutional network that has several building-block networks with ChainList.
After reading this section, you will be able to:
- Write your own original convolutional network in Chainer
A convolutional network (ConvNet) is mainly composed of convolutional layers. This type of network is commonly used for various visual recognition tasks, e.g., classifying hand-written digits or natural images into given object classes, detecting objects in an image, or labeling every pixel of an image with an object class (semantic segmentation).
In such tasks, a typical ConvNet takes a set of images whose shape is \((N, C, H, W)\), where
- \(N\) denotes the number of images in a mini-batch,
- \(C\) denotes the number of channels of those images,
- \(H\) and \(W\) denote the height and width of those images,
respectively. Then, it typically outputs a fixed-size vector of membership probabilities over the target object classes. For a pixel-labeling task, it can also output a set of feature maps whose size corresponds to the input image.
Note
The example code below assumes that some packages are already imported. Please see the details here: Introduction to Chainer.
LeNet5¶
Here, let’s start by defining LeNet5 [LeCun98] in Chainer. This is a ConvNet model with 5 layers: 3 convolutional layers and 2 fully-connected layers. It was proposed in 1998 to classify hand-written digit images. In Chainer, the model can be written as follows:
class LeNet5(Chain):
def __init__(self):
super(LeNet5, self).__init__()
with self.init_scope():
self.conv1 = L.Convolution2D(
in_channels=1, out_channels=6, ksize=5, stride=1)
self.conv2 = L.Convolution2D(
in_channels=6, out_channels=16, ksize=5, stride=1)
self.conv3 = L.Convolution2D(
in_channels=16, out_channels=120, ksize=4, stride=1)
self.fc4 = L.Linear(None, 84)
self.fc5 = L.Linear(84, 10)
def __call__(self, x):
h = F.sigmoid(self.conv1(x))
h = F.max_pooling_2d(h, 2, 2)
h = F.sigmoid(self.conv2(h))
h = F.max_pooling_2d(h, 2, 2)
h = F.sigmoid(self.conv3(h))
h = F.sigmoid(self.fc4(h))
if chainer.config.train:
return self.fc5(h)
return F.softmax(self.fc5(h))
A typical way to write your network is to create a new class that inherits from the Chain class. When defining your model in this way, all the layers that have trainable parameters are registered to the model by assigning Link objects as attributes. The model class is instantiated before the forward and backward computations. To give it input images and label vectors simply by calling the model object like a function, __call__() is usually defined in the model class. This method performs the forward computation of the model.
Chainer uses a powerful autograd system for any computational graph written with FunctionNodes and Links (a Link actually calls a corresponding FunctionNode inside it), so you do not need to write the code for backward computation explicitly in the model. Just prepare the data and give it to the model. The resulting output Variable from the forward computation has a backward() method that performs autograd.
In the above model, __call__() has an if statement at the end to switch its behavior according to Chainer's running mode, i.e., training mode or not. Chainer exposes the running mode as the global configuration chainer.config.train. In training mode, __call__() returns the output value of the last layer as is, so that the loss can be computed later; otherwise it returns a prediction result by applying softmax().
Note
In Chainer v1, if a function or link behaved differently in training and other modes, it commonly held an attribute representing its running mode or received the mode from outside as an argument. In Chainer v2, it is recommended to use the global configuration chainer.config.train to switch the running mode.
If you do not want to write conv1 and the other layers more than once, you can also write the model in the following way:
class LeNet5(Chain):
def __init__(self):
super(LeNet5, self).__init__()
net = [('conv1', L.Convolution2D(1, 6, 5, 1))]
net += [('_sigm1', F.Sigmoid())]
net += [('_mpool1', F.MaxPooling2D(2, 2))]
net += [('conv2', L.Convolution2D(6, 16, 5, 1))]
net += [('_sigm2', F.Sigmoid())]
net += [('_mpool2', F.MaxPooling2D(2, 2))]
net += [('conv3', L.Convolution2D(16, 120, 4, 1))]
net += [('_sigm3', F.Sigmoid())]
net += [('_mpool3', F.MaxPooling2D(2, 2))]
net += [('fc4', L.Linear(None, 84))]
net += [('_sigm4', F.Sigmoid())]
net += [('fc5', L.Linear(84, 10))]
net += [('_sigm5', F.Sigmoid())]
with self.init_scope():
for n in net:
if not n[0].startswith('_'):
setattr(self, n[0], n[1])
self.forward = net
def __call__(self, x):
for n, f in self.forward:
if not n.startswith('_'):
x = getattr(self, n)(x)
else:
x = f(x)
if chainer.config.train:
return x
return F.softmax(x)
This code creates a list of all Links and FunctionNodes after calling the superclass's constructor. Elements of the list are registered to the model as trainable layers when their name does not start with the _ character. This convention can be freely replaced by many other schemes, because the names are only used to easily select the Links from the list net. FunctionNodes do not have any trainable parameters, so they cannot be registered to the model, but we still want to use them to construct the forward path. The list net is stored as the attribute forward so that it can be referenced in __call__(). In __call__(), all layers are retrieved from self.forward sequentially, regardless of whether each object is a Link or a FunctionNode, and the input variable or the intermediate output from the previous layer is given to the current layer. The last part of __call__(), which switches the behavior by training/inference mode, is the same as in the previous version.
Ways to calculate loss¶
When you train the model with a label vector t, the loss should be calculated using the output from the model. There are several ways to calculate the loss:
model = LeNet5()
# Input data and label
x = np.random.rand(32, 1, 28, 28).astype(np.float32)
t = np.random.randint(0, 10, size=(32,)).astype(np.int32)
# Forward computation
y = model(x)
# Loss calculation
loss = F.softmax_cross_entropy(y, t)
This is a primitive way to calculate a loss value from the output of the model. On the other hand, the loss computation can be included in the model itself by wrapping the model object (a Chain or ChainList object) with a class inherited from Chain. The outer Chain should take the model defined above and register it with init_scope(). Chain itself inherits from Link, so a Chain can also be registered as a trainable Link in another Chain. In fact, there is already a Classifier class that can be used to wrap the model and include the loss computation. It can be used like this:
model = L.Classifier(LeNet5())
# Foward & Loss calculation
loss = model(x, t)
This class takes a model object as an input argument and registers it as its predictor attribute, so that its parameters are trained. As shown above, the resulting object can then be called like a function, passing x and t as input arguments, and the resulting loss value (which, recall, is a Variable) is returned.
See the detailed documentation of Classifier here: chainer.links.Classifier, and check the implementation by looking at the source.
From the above examples, we can see that Chainer provides the flexibility to write our original networks in many different ways. This flexibility is intended to make it intuitive for users to design new and complex models.
VGG16¶
Next, let’s write some larger models in Chainer. When you write a large network consisting of several building-block networks, ChainList is useful. First, let’s see how to write a VGG16 [Simonyan14] model.
class VGG16(chainer.ChainList):
def __init__(self):
super(VGG16, self).__init__(
VGGBlock(64),
VGGBlock(128),
VGGBlock(256, 3),
VGGBlock(512, 3),
VGGBlock(512, 3, True))
def __call__(self, x):
for f in self.children():
x = f(x)
if chainer.config.train:
return x
return F.softmax(x)
class VGGBlock(chainer.Chain):
def __init__(self, n_channels, n_convs=2, fc=False):
w = chainer.initializers.HeNormal()
super(VGGBlock, self).__init__()
with self.init_scope():
self.conv1 = L.Convolution2D(None, n_channels, 3, 1, 1, initialW=w)
self.conv2 = L.Convolution2D(
n_channels, n_channels, 3, 1, 1, initialW=w)
if n_convs == 3:
self.conv3 = L.Convolution2D(
n_channels, n_channels, 3, 1, 1, initialW=w)
if fc:
self.fc4 = L.Linear(None, 4096, initialW=w)
self.fc5 = L.Linear(4096, 4096, initialW=w)
self.fc6 = L.Linear(4096, 1000, initialW=w)
self.n_convs = n_convs
self.fc = fc
def __call__(self, x):
h = F.relu(self.conv1(x))
h = F.relu(self.conv2(h))
if self.n_convs == 3:
h = F.relu(self.conv3(h))
h = F.max_pooling_2d(h, 2, 2)
if self.fc:
h = F.dropout(F.relu(self.fc4(h)))
h = F.dropout(F.relu(self.fc5(h)))
h = self.fc6(h)
return h
That’s it. VGG16 won first place in the classification + localization task at ILSVRC 2014 and has since become one of the standard pre-trained models for many different tasks. It has 16 layers, hence the name “VGG-16”, but we can write this model without writing all of the layers independently. Since the model consists of several building blocks that share the same architecture, we can build the whole network by re-using the building-block definition. Each part of the network consists of 2 or 3 convolutional layers, each followed by an activation function (relu()), and a max_pooling_2d() operation. This block is written as VGGBlock in the above example code, and the whole network simply calls this block one by one in a sequential manner.
ResNet152¶
How about ResNet? ResNet [He16] appeared in the following year’s ILSVRC. It is a much deeper model than VGG16, having up to 152 layers. This sounds laborious to build, but it can be implemented in almost the same manner as VGG16. In other words, it’s easy. One possible way to write ResNet-152 is:
class ResNet152(chainer.Chain):
def __init__(self, n_blocks=[3, 8, 36, 3]):
w = chainer.initializers.HeNormal()
super(ResNet152, self).__init__()
with self.init_scope():
self.conv1 = L.Convolution2D(None, 64, 7, 2, 3, initialW=w, nobias=True)
self.bn1 = L.BatchNormalization(64)
self.res2 = ResBlock(n_blocks[0], 64, 64, 256, 1)
self.res3 = ResBlock(n_blocks[1], 256, 128, 512)
self.res4 = ResBlock(n_blocks[2], 512, 256, 1024)
self.res5 = ResBlock(n_blocks[3], 1024, 512, 2048)
self.fc6 = L.Linear(2048, 1000)
def __call__(self, x):
h = self.bn1(self.conv1(x))
h = F.max_pooling_2d(F.relu(h), 2, 2)
h = self.res2(h)
h = self.res3(h)
h = self.res4(h)
h = self.res5(h)
h = F.average_pooling_2d(h, h.shape[2:], stride=1)
h = self.fc6(h)
if chainer.config.train:
return h
return F.softmax(h)
class ResBlock(chainer.ChainList):
def __init__(self, n_layers, n_in, n_mid, n_out, stride=2):
super(ResBlock, self).__init__()
self.add_link(BottleNeck(n_in, n_mid, n_out, stride, True))
for _ in range(n_layers - 1):
self.add_link(BottleNeck(n_out, n_mid, n_out))
def __call__(self, x):
for f in self.children():
x = f(x)
return x
class BottleNeck(chainer.Chain):
def __init__(self, n_in, n_mid, n_out, stride=1, proj=False):
w = chainer.initializers.HeNormal()
super(BottleNeck, self).__init__()
with self.init_scope():
self.conv1x1a = L.Convolution2D(
n_in, n_mid, 1, stride, 0, initialW=w, nobias=True)
self.conv3x3b = L.Convolution2D(
n_mid, n_mid, 3, 1, 1, initialW=w, nobias=True)
self.conv1x1c = L.Convolution2D(
n_mid, n_out, 1, 1, 0, initialW=w, nobias=True)
self.bn_a = L.BatchNormalization(n_mid)
self.bn_b = L.BatchNormalization(n_mid)
self.bn_c = L.BatchNormalization(n_out)
if proj:
self.conv1x1r = L.Convolution2D(
n_in, n_out, 1, stride, 0, initialW=w, nobias=True)
self.bn_r = L.BatchNormalization(n_out)
self.proj = proj
def __call__(self, x):
h = F.relu(self.bn_a(self.conv1x1a(x)))
h = F.relu(self.bn_b(self.conv3x3b(h)))
h = self.bn_c(self.conv1x1c(h))
if self.proj:
x = self.bn_r(self.conv1x1r(x))
return F.relu(h + x)
In the BottleNeck class, depending on the value of the proj argument supplied to the initializer, it conditionally computes a convolutional layer conv1x1r, which extends the number of channels of the input x to match the number of channels of the output of conv1x1c, followed by a batch normalization layer before the final ReLU. Writing the building block in this way improves the reusability of the class. It switches not only the behavior of __call__() by flags but also the parameter registration. In this case, when proj is False, BottleNeck does not have the conv1x1r and bn_r layers, so memory usage is more efficient than if it registered both anyway and simply ignored them when proj is False.
Using nested Chains and a ChainList for the sequential parts enables us to write complex and very deep models easily.
Use Pre-trained Models¶
Various ways to write your models were described above. It turns out that VGG16 and ResNet are very useful as general feature extractors for many kinds of tasks, including but not limited to image classification. So, Chainer provides you with the pre-trained VGG16 and ResNet-50/101/152 models with a simple API. You can use these models as follows:
from chainer.links import VGG16Layers
model = VGG16Layers()
When VGG16Layers is instantiated, the pre-trained parameters are automatically downloaded from the author’s server, so you can immediately start using VGG16 with pre-trained weights as a good image feature extractor. See the details of this model here: chainer.links.VGG16Layers.
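As a hedged sketch of using it as a feature extractor (the image file name and the choice of the 'fc7' layer are our assumptions; check chainer.links.VGG16Layers for the exact API and preprocessing details):

import chainer
from PIL import Image
from chainer.links import VGG16Layers

model = VGG16Layers()                     # downloads the weights on first use
img = Image.open('sample.jpg')            # hypothetical input image
with chainer.using_config('train', False):
    feat = model.extract([img], layers=['fc7'])['fc7']
print(feat.shape)                         # (1, 4096)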
In the case of ResNet models, there are three variations that differ in the number of layers. We have chainer.links.ResNet50Layers, chainer.links.ResNet101Layers, and chainer.links.ResNet152Layers, each with an easy parameter-loading feature. ResNet’s pre-trained parameters are not available for direct download, so you need to download the weights from the author’s web page first and then place them in the directory $CHAINER_DATASET_ROOT/pfnet/chainer/models or another location of your choice. Once the preparation is finished, the usage is the same as for VGG16:
from chainer.links import ResNet152Layers
model = ResNet152Layers()
Please see the details of usage and how to prepare the pre-trained weights for ResNet here: chainer.links.ResNet50Layers.
References¶
[LeCun98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Simonyan14] Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
[He16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
Recurrent Nets and their Computational Graph¶
In this section, you will learn how to write
- recurrent nets with full backprop,
- recurrent nets with truncated backprop,
- evaluation of networks with little memory.
After reading this section, you will be able to:
- Handle input sequences of variable length
- Truncate the upper stream of the network during forward computation
- Use no-backprop mode to prevent network construction
Recurrent Nets¶
Recurrent nets are neural networks with loops. They are often used to learn from sequential input/output. Given an input stream \(x_1, x_2, \dots, x_t, \dots\) and the initial state \(h_0\), a recurrent net iteratively updates its state by \(h_t = f(x_t, h_{t-1})\), and at some or every point in time \(t\), it outputs \(y_t = g(h_t)\). If we expand the procedure along the time axis, it looks like a regular feed-forward network except that the same parameters are repeatedly used within the network.
Here we learn how to write a simple one-layer recurrent net. The task is language modeling: given a finite sequence of words, we want to predict the next word at each position without peeking at the successive words. Suppose there are 1,000 different word types, and that we use 100-dimensional real vectors to represent each word (a.k.a. word embedding).
Let’s start from defining the recurrent neural net language model (RNNLM) as a chain.
We can use the chainer.links.LSTM link, which implements a fully-connected stateful LSTM layer. This link looks like an ordinary fully-connected layer. On construction, you pass the input and output sizes to the constructor:
>>> l = L.LSTM(100, 50)
Then, calling this instance as l(x) executes one step of the LSTM layer:
>>> l.reset_state()
>>> x = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y = l(x)
Do not forget to reset the internal state of the LSTM layer before the forward computation! Every recurrent layer holds its internal state (i.e. the output of the previous call). At the first application of the recurrent layer, you must reset the internal state. Then, the next input can be directly fed to the LSTM instance:
>>> x2 = Variable(np.random.randn(10, 100).astype(np.float32))
>>> y2 = l(x2)
Based on this LSTM link, let’s write our recurrent network as a new chain:
class RNN(Chain):
def __init__(self):
super(RNN, self).__init__()
with self.init_scope():
self.embed = L.EmbedID(1000, 100) # word embedding
self.mid = L.LSTM(100, 50) # the first LSTM layer
self.out = L.Linear(50, 1000) # the feed-forward output layer
def reset_state(self):
self.mid.reset_state()
def __call__(self, cur_word):
# Given the current word ID, predict the next word.
x = self.embed(cur_word)
h = self.mid(x)
y = self.out(h)
return y
rnn = RNN()
model = L.Classifier(rnn)
optimizer = optimizers.SGD()
optimizer.setup(model)
Here, EmbedID is a link for word embedding. It converts input integers into corresponding fixed-dimensional embedding vectors. The last linear link, out, represents the feed-forward output layer.
The RNN chain implements a one-step-forward computation. It does not handle sequences by itself, but we can use it to process a sequence by simply feeding the items of the sequence to the chain one by one. Suppose we have a list of word variables x_list. Then, we can compute the loss values for the word sequence with a simple for loop.
def compute_loss(x_list):
loss = 0
for cur_word, next_word in zip(x_list, x_list[1:]):
loss += model(cur_word, next_word)
return loss
Of course, the accumulated loss is a Variable object with the full history of computation, so we can just call its backward() method to compute the gradients of the total loss with respect to the model parameters:
# Suppose we have a list of word variables x_list.
rnn.reset_state()
model.cleargrads()
loss = compute_loss(x_list)
loss.backward()
optimizer.update()
Or, equivalently, we can use compute_loss as a loss function:
rnn.reset_state()
optimizer.update(compute_loss, x_list)
Truncate the Graph by Unchaining¶
Learning from very long sequences is also a typical use case of recurrent nets. Suppose the input and state sequence is too long to fit into memory. In such cases, we often truncate the backpropagation into a short time range. This technique is called truncated backprop. It is heuristic, and it makes the gradients biased. However, this technique works well in practice if the time range is long enough.
How to implement truncated backprop in Chainer?
Chainer has a smart mechanism to achieve truncation, called backward unchaining.
It is implemented in the Variable.unchain_backward() method.
Backward unchaining starts from the Variable object, and it chops the computation history backwards from the variable.
The chopped variables are disposed automatically (if they are not referenced explicitly from any other user object).
As a result, they are no longer a part of computation history, and are not involved in backprop anymore.
Let’s write an example of truncated backprop. Here we use the same network as the one used in the previous subsection. Suppose we are given a very long sequence, and we want to run backprop truncated at every 30 time steps. We can write truncated backprop using the model defined above:
loss = 0
count = 0
seqlen = len(x_list[1:])
rnn.reset_state()
for cur_word, next_word in zip(x_list, x_list[1:]):
loss += model(cur_word, next_word)
count += 1
if count % 30 == 0 or count == seqlen:
model.cleargrads()
loss.backward()
loss.unchain_backward()
optimizer.update()
The state is updated inside model(), and the losses are accumulated into the loss variable. Every 30 steps, backprop is run on the accumulated loss. Then the unchain_backward() method is called, which deletes the computation history backward from the accumulated loss. Note that the last state of model is not lost, since the RNN instance holds a reference to it.
The implementation of truncated backprop is simple, and since there are no complicated tricks involved, we can generalize this method to different situations. For example, we can easily extend the above code to use different schedules for the backprop timing and truncation length.
Network Evaluation without Storing the Computation History¶
When evaluating recurrent nets, there is typically no need to store the computation history. While unchaining enables us to walk through sequences of unlimited length with limited memory, it is a bit of a workaround.
As an alternative, Chainer provides an evaluation mode of forward computation that does not store the computation history. This is enabled by entering the no_backprop_mode() context:
with chainer.no_backprop_mode():
x_list = [Variable(...) for _ in range(100)] # list of 100 words
loss = compute_loss(x_list)
Note that we cannot call loss.backward() to compute gradients here, since a variable created inside the no-backprop context does not remember the computation history.
The no-backprop context is also useful for evaluating feed-forward networks with a reduced memory footprint. We can combine a fixed feature-extractor network and a trainable predictor network using no_backprop_mode(). For example, suppose we want to train a feed-forward network predictor_func that sits on top of another fixed, pre-trained network fixed_func. We want to train predictor_func without storing the computation history for fixed_func. This is done simply with the following code snippet (suppose x_data and y_data denote the input data and label, respectively):
with chainer.no_backprop_mode():
x = Variable(x_data)
feat = fixed_func(x)
y = predictor_func(feat)
y.backward()
At first, the input variable x is in no-backprop mode, so fixed_func does not memorize the computation history. Then predictor_func is executed in backprop mode, i.e., it does memorize the history of computation. Since the history of computation is memorized only between the variables feat and y, the backward computation stops at the feat variable.
Making it with Trainer¶
The above code is written with the plain Function/Variable APIs. When we write a training loop, it is better to use Trainer, since we can then easily add functionality via extensions.
Before implementing it with Trainer, let’s clarify the training settings. We use the Penn Treebank dataset as a set of sentences. Each sentence is represented as a word sequence. We concatenate all sentences into one long word sequence, in which each sentence is separated by a special word <eos>, which stands for “End of Sequence”. This dataset is easily obtained with chainer.datasets.get_ptb_words(). This function returns the train, validation, and test datasets, each of which is represented as a long array of integers. Each integer represents a word ID.
Our task is to learn a recurrent neural net language model from the long word sequence. We use words in different locations to form mini-batches. It means we maintain \(B\) indices pointing to different locations in the sequence, read from these indices at each iteration, and increment all indices after the read. Of course, when one index reaches the end of the whole sequence, we turn the index back to 0.
In order to implement this training procedure, we have to customize the following components of Trainer:
- Iterator. Built-in iterators do not support reading from different locations and aggregating them into a mini-batch.
- Update function. The default update function does not support truncated BPTT.
When we write a dataset iterator dedicated to a dataset, the dataset implementation can be arbitrary; even its interface is not fixed. On the other hand, the iterator must support the Iterator interface. The important methods and attributes to implement are batch_size, epoch, epoch_detail, is_new_epoch, iteration, __next__, and serialize.
The following code is taken from the official example in the examples/ptb directory.
from __future__ import division
class ParallelSequentialIterator(chainer.dataset.Iterator):
def __init__(self, dataset, batch_size, repeat=True):
self.dataset = dataset
self.batch_size = batch_size
self.epoch = 0
self.is_new_epoch = False
self.repeat = repeat
self.offsets = [i * len(dataset) // batch_size for i in range(batch_size)]
self.iteration = 0
def __next__(self):
length = len(self.dataset)
if not self.repeat and self.iteration * self.batch_size >= length:
raise StopIteration
cur_words = self.get_words()
self.iteration += 1
next_words = self.get_words()
epoch = self.iteration * self.batch_size // length
self.is_new_epoch = self.epoch < epoch
if self.is_new_epoch:
self.epoch = epoch
return list(zip(cur_words, next_words))
@property
def epoch_detail(self):
return self.iteration * self.batch_size / len(self.dataset)
def get_words(self):
return [self.dataset[(offset + self.iteration) % len(self.dataset)]
for offset in self.offsets]
def serialize(self, serializer):
self.iteration = serializer('iteration', self.iteration)
self.epoch = serializer('epoch', self.epoch)
train_iter = ParallelSequentialIterator(train, 20)
val_iter = ParallelSequentialIterator(val, 1, repeat=False)
Although the code is slightly long, the idea is simple. First, this iterator creates offsets pointing to positions equally spaced within the whole sequence. The i-th example of each mini-batch refers to the sequence at the i-th offset. The iterator returns a list of tuples of the current words and the next words. Each mini-batch is converted to a tuple of integer arrays by the concat_examples function in the standard updater (see the previous tutorial).
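For reference, a minimal sketch of what concat_examples does with such a batch (the word IDs are made up for illustration):

from chainer.dataset import concat_examples

# A batch as produced by the iterator: a list of (current word, next word) pairs.
batch = [(1, 2), (5, 6), (9, 10)]
x, t = concat_examples(batch)
print(x)  # array([1, 5, 9])
print(t)  # array([ 2,  6, 10])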
Backprop Through Time is implemented as follows.
class BPTTUpdater(training.StandardUpdater):
def __init__(self, train_iter, optimizer, bprop_len):
super(BPTTUpdater, self).__init__(train_iter, optimizer)
self.bprop_len = bprop_len
# The core part of the update routine can be customized by overriding.
def update_core(self):
loss = 0
# When we pass one iterator and optimizer to StandardUpdater.__init__,
# they are automatically named 'main'.
train_iter = self.get_iterator('main')
optimizer = self.get_optimizer('main')
# Progress the dataset iterator for bprop_len words at each iteration.
for i in range(self.bprop_len):
# Get the next batch (a list of tuples of two word IDs)
batch = train_iter.__next__()
# Concatenate the word IDs to matrices and send them to the device
# self.converter does this job
# (it is chainer.dataset.concat_examples by default)
x, t = self.converter(batch)
# Compute the loss at this time step and accumulate it
loss += optimizer.target(chainer.Variable(x), chainer.Variable(t))
optimizer.target.cleargrads() # Clear the parameter gradients
loss.backward() # Backprop
loss.unchain_backward() # Truncate the graph
optimizer.update() # Update the parameters
updater = BPTTUpdater(train_iter, optimizer, bprop_len) # instantiation
In this case, we update the parameters every bprop_len consecutive words. The call to unchain_backward() cuts the history of computation accumulated in the LSTM links. The rest of the code for setting up the Trainer is almost the same as in the previous tutorial; a minimal sketch is given below.
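A minimal sketch of that remaining setup, reusing the updater defined above (the number of epochs and the reported items are arbitrary choices; the official examples/ptb script adds more extensions, such as perplexity evaluation):

trainer = training.Trainer(updater, (39, 'epoch'), out='result')
trainer.extend(extensions.LogReport())
trainer.extend(extensions.PrintReport(['epoch', 'iteration', 'main/loss']))
trainer.extend(extensions.ProgressBar())
trainer.run()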
In this section we have demonstrated how to write recurrent nets in Chainer and some fundamental techniques for managing the history of computation (a.k.a. the computational graph). The example in the examples/ptb directory implements truncated-backprop learning of an LSTM language model from the Penn Treebank corpus. In the next section, we will review how to use GPU(s) in Chainer.
How to Train a Network¶
How to write a training loop in Chainer¶
In this tutorial section, we will learn how to train a deep neural network to classify images of hand-written digits in the popular MNIST dataset. This dataset contains 60,000 training examples and 10,000 test examples. Each example is a pair of a 28 x 28 greyscale image and a corresponding class label. Since the digits from 0 to 9 are used, there are 10 classes for the labels.
Chainer provides a feature called Trainer that can simplify the training procedure of your model. However, it is also good to know how training works in Chainer before starting to use the useful Trainer class, which hides the actual processes. Writing your own training loop can be useful for learning how Trainer works or for implementing features not included in the standard trainer.
The complete training procedure consists of the following steps:
1. Prepare a dataset.
2. Create a dataset iterator.
3. Define a network.
4. Select an optimization algorithm.
5. Write a training loop that repeats the following:
- Retrieve a set of examples (a mini-batch) from the training dataset.
- Feed the mini-batch to your network.
- Run a forward pass of the network and compute the loss.
- Call the backward() method on the loss Variable to compute the gradients for all trainable parameters.
- Run the optimizer to update those parameters.
6. Perform classification with the saved model and check the network's performance on the validation/test sets.
1. Prepare a dataset¶
Chainer contains some built-in functions to use some popular datasets like MNIST, CIFAR10/100, etc. Those can automatically download the data from servers and provide dataset objects which are easy to use.
The code below shows how to retrieve the MNIST dataset from the server and save an image from its training split to make sure the images are correctly obtained.
from __future__ import print_function
import matplotlib.pyplot as plt
from chainer.datasets import mnist
# Download the MNIST data if you haven't downloaded it yet
train, test = mnist.get_mnist(withlabel=True, ndim=1)
# Display an example from the MNIST dataset.
# `x` contains the input image array and `t` contains the target class
# label as an integer.
x, t = train[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.savefig('5.png')
print('label:', t)
label: 5
The saved image 5.png
will look like:

2. Create a dataset iterator¶
Although this is an optional step, we’d like to introduce the Iterator class, which retrieves a set of data and labels from a given dataset to easily make a mini-batch. There are several subclasses that perform the same job in different ways, e.g., using multi-processing to parallelize the data-loading part.
Here, we use SerialIterator, a subclass of Iterator, in the example code below. The SerialIterator can provide mini-batches with or without shuffling the order of the data in the given dataset.
All Iterators produce a new mini-batch when their next() method is called. All Iterators also have properties that tell us how many times we have taken all the data from the given dataset (epoch) and whether the next mini-batch will be the start of a new epoch (is_new_epoch), and so on.
The code below shows how to create a SerialIterator object from a dataset object.
from chainer import iterators
# Choose the minibatch size.
batchsize = 128
train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize,
repeat=False, shuffle=False)
Note
Iterators can take a built-in Python list as the dataset, which means the example code below works,
train = [(x1, t1), (x2, t2), ...] # A list of tuples
train_iter = iterators.SerialIterator(train, batchsize)
where x1, x2, ... denote the input data and t1, t2, ... denote the corresponding labels.
Details of SerialIterator¶
- SerialIterator is a built-in subclass of Iterator that can retrieve mini-batches from a given dataset in either sequential or shuffled order.
- The Iterator's constructor takes two arguments: a dataset object and a mini-batch size.
- If you want to use the same dataset repeatedly during the training process, set the repeat argument to True (the default). Otherwise, the dataset will be used only once; this case is intended for evaluation.
- If you want to shuffle the training dataset every epoch, set the shuffle argument to True. Otherwise, the order in which data is retrieved from the dataset will be the same at every epoch.
In the example code shown above, we set batchsize = 128 for both train_iter and test_iter, so these iterators will provide 128 images and their corresponding labels at a time.
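As a quick sketch of what an iterator yields, calling next() returns a list of (image, label) pairs of length batchsize:

batch = train_iter.next()          # a list of 128 (image, label) tuples
x0, t0 = batch[0]
print(len(batch), x0.shape, t0)    # 128 (784,) and the integer label of the first example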
3. Define a network¶
Now let’s define a neural network that we will train to classify the MNIST images. For simplicity, we use a three-layer perceptron here. We set each hidden layer to have 100 units and set the output layer to have 10 units, which corresponds to the number of class labels in MNIST.
Create your network as a subclass of Chain¶
You can create your network by writing a new subclass of Chain. There are two main steps:
1. Register the network components that have trainable parameters to the subclass. Each of them must be instantiated and assigned to a property in the scope specified by init_scope().
2. Define a __call__() method that represents the actual forward computation of your network. This method takes one or more Variable, numpy.array, or cupy.array objects as its inputs and calculates the forward pass using them.

class MyNetwork(Chain):

    def __init__(self, n_mid_units=100, n_out=10):
        super(MyNetwork, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_mid_units)
            self.l2 = L.Linear(n_mid_units, n_mid_units)
            self.l3 = L.Linear(n_mid_units, n_out)

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = F.relu(self.l2(h))
        return self.l3(h)

model = MyNetwork()
Link
, Chain
, ChainList
, and other objects of their subclasses that contain trainable parameters should be registered to the model by assigning them as properties inside init_scope()
. For example, a FunctionNode
does not contain any trainable parameters, so there is no need to keep such an object as a property of your network. When you want to use relu()
in your network, using it as a function inside __call__()
works correctly.
In Chainer, the Python code that implements the forward computation itself represents the network. In other words, we can conceptually think of the computation graph for our network being constructed dynamically as this forward computation code executes. This allows Chainer to describe networks in which different computations can be performed in each iteration, such as branched networks, intuitively and with a high degree of flexibility. This is the key feature of Chainer that we call Define-by-Run.
4. Select an optimization algorithm¶
Chainer provides a wide variety of optimization algorithms that can be used to optimize the network parameters during training. They are located in the optimizers
module.
Here, we are going to use the stochastic gradient descent (SGD) method with momentum, which is implemented by MomentumSGD
. To use the optimizer, we give the network object (typically it’s a Chain
or ChainList
) to the setup()
method of the optimizer object to register it. In this way, the Optimizer
can automatically find the model parameters and update them during training.
You can easily try out other optimizers as well and observe how the results change. For example, you could change MomentumSGD
to Adam
,
RMSprop
, etc.
from chainer import optimizers
# Choose an optimizer algorithm
optimizer = optimizers.MomentumSGD(lr=0.01, momentum=0.9)
# Give the optimizer a reference to the model so that it
# can locate the model's parameters.
optimizer.setup(model)
Note
In the above example, we set lr
to 0.01 in the constructor. This value is known as the “learning rate”, one of the most important hyperparameters that need to be adjusted in order to obtain the best performance. The various optimizers may each have different hyperparameters, so be sure to check the documentation for the details.
5. Write a training loop¶
We now show how to write the training loop. Since we are working on a digit classification problem, we will use
softmax_cross_entropy()
as the loss function for the optimizer to minimize. For other types of problems, such as regression models, other loss functions might be more appropriate. See the Chainer documentation for detailed information on the various loss functions.
Our training loop will be structured as follows.
- We will first get a mini-batch of examples from the training dataset.
- We will then feed the batch into our network by calling it (a Chain object) like a function. This will execute the forward-pass code written in the __call__() method.
- This will return the network output, which represents class label predictions. We supply it to the loss function along with the true (that is, target) values. The loss function will output the loss as a Variable object.
- We then clear any previous gradients in the network and perform the backward pass by calling the backward() method on the loss variable, which computes the parameter gradients. We need to clear the gradients first because the backward() method accumulates gradients instead of overwriting the previous values.
- Since the optimizer already has a reference to the network, it has access to the parameters and the computed gradients, so we can now call the update() method of the optimizer, which will update the model parameters.
In addition to the above steps, you might want to check the performance of the network with a validation dataset. This allows you to observe how well it generalizes to new data, namely, whether it is overfitting to the training data. The code below checks the performance on the test set at the end of each epoch. It has the same structure as the training code, except that no backpropagation is performed and we also compute the accuracy on the test data using the accuracy()
function.
The training loop code is as follows:
import numpy as np
from chainer.dataset import concat_examples
from chainer.cuda import to_cpu
max_epoch = 10
gpu_id = -1  # Assumed here: -1 means run on the CPU; set a GPU device ID to use the GPU
while train_iter.epoch < max_epoch:
# ---------- One iteration of the training loop ----------
train_batch = train_iter.next()
image_train, target_train = concat_examples(train_batch, gpu_id)
# Calculate the prediction of the network
prediction_train = model(image_train)
# Calculate the loss with softmax_cross_entropy
loss = F.softmax_cross_entropy(prediction_train, target_train)
# Calculate the gradients in the network
model.cleargrads()
loss.backward()
# Update all the trainable parameters
optimizer.update()
# --------------------- until here ---------------------
# Check the validation accuracy of prediction after every epoch
if train_iter.is_new_epoch: # If this iteration is the final iteration of the current epoch
# Display the training loss
print('epoch:{:02d} train_loss:{:.04f} '.format(
train_iter.epoch, float(to_cpu(loss.data))), end='')
test_losses = []
test_accuracies = []
while True:
test_batch = test_iter.next()
image_test, target_test = concat_examples(test_batch, gpu_id)
# Forward the test data
prediction_test = model(image_test)
# Calculate the loss
loss_test = F.softmax_cross_entropy(prediction_test, target_test)
test_losses.append(to_cpu(loss_test.data))
# Calculate the accuracy
accuracy = F.accuracy(prediction_test, target_test)
accuracy.to_cpu()
test_accuracies.append(accuracy.data)
if test_iter.is_new_epoch:
test_iter.epoch = 0
test_iter.current_position = 0
test_iter.is_new_epoch = False
test_iter._pushed_position = None
break
print('val_loss:{:.04f} val_accuracy:{:.04f}'.format(
np.mean(test_losses), np.mean(test_accuracies)))
Output¶
epoch:01 train_loss:0.8072 val_loss:0.7592 val_accuracy:0.8289
epoch:02 train_loss:0.5021 val_loss:0.4467 val_accuracy:0.8841
epoch:03 train_loss:0.3539 val_loss:0.3673 val_accuracy:0.9007
epoch:04 train_loss:0.2524 val_loss:0.3307 val_accuracy:0.9067
epoch:05 train_loss:0.4232 val_loss:0.3076 val_accuracy:0.9136
epoch:06 train_loss:0.3033 val_loss:0.2910 val_accuracy:0.9167
epoch:07 train_loss:0.2004 val_loss:0.2773 val_accuracy:0.9222
epoch:08 train_loss:0.2885 val_loss:0.2679 val_accuracy:0.9239
epoch:09 train_loss:0.2818 val_loss:0.2579 val_accuracy:0.9266
epoch:10 train_loss:0.2403 val_loss:0.2484 val_accuracy:0.9307
6. Save the trained model¶
Chainer provides two types of serializers
that can be used to save and restore model state. One supports the HDF5 format and the other supports the NumPy NPZ format. For this example, we are going to use the NPZ
format to save our model since it is easy to use with NumPy and doesn’t require installing any additional dependencies or libraries.
serializers.save_npz('my_mnist.model', model)
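If you prefer HDF5 and have h5py installed, the HDF5 counterpart can be used in the same way (the file name below is just an example):
# Requires h5py to be installed.
serializers.save_hdf5('my_mnist.h5', model)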
7. Perform classification by the saved model¶
Let’s use the saved model to classify a new image. In order to load the trained model parameters, we need to perform the following two steps:
- Instantiate the same network as the one you trained.
- Overwrite all parameters in the model instance with the saved weights using the
load_npz()
function.
Once the model is restored, it can be used to predict image labels on new input data.
from chainer import serializers
# Create an instance of the network you trained
model = MyNetwork()
# Load the saved parameters into the instance
serializers.load_npz('my_mnist.model', model)
# Get a test image and label
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.savefig('7.png')
print('label:', t)
label: 7
The saved test image looks like:

# Change the shape of the mini-batch.
# In this example, the mini-batch size is 1.
# Inference can be performed with any mini-batch size.
print(x.shape, end=' -> ')
x = x[None, ...]
print(x.shape)
# Forward computation of the model with the input x
y = model(x)
# The result is given as a Variable; we can take a look at its contents via the .data attribute.
y = y.data
# Look up the most probable digit number using argmax
pred_label = y.argmax(axis=1)
print('predicted label:', pred_label[0])
(784,) -> (1, 784)
predicted label: 7
The prediction result looks correct. Yay!
Let’s try using the Trainer feature¶
By using Trainer
, you don’t need to write the training loop explicitly any more. Furthermore, Chainer provides many useful extensions that can be used with Trainer
to visualize your results, evaluate your model, and store and manage log files more easily.
This example will show how to use the Trainer
to train a fully-connected feed-forward neural network on the MNIST dataset.
Note
If you would like to know how to write a training loop without using the Trainer
, please check How to write a training loop in Chainer instead of this tutorial.
1. Prepare the dataset¶
Load the MNIST dataset, which contains a training set of images and class labels as well as a corresponding test set.
from chainer.datasets import mnist
train, test = mnist.get_mnist()
Note
You can use a Python list as a dataset. That’s because Iterator
can take any object as a dataset as long as its elements can be accessed via the []
operator and its length can be obtained with the len()
function. For example,
train = [(x1, t1), (x2, t2), ...]
a list of tuples like this can be used as a dataset.
There are many utility dataset classes defined in datasets
. It’s recommended to use them in actual applications.
For example, if your dataset consists of a number of image files, it would take a large amount of memory to load them into a list as above. In that case, you can use ImageDataset
, which just keeps the paths to the image files. The actual image data are loaded from disk only when the corresponding element is requested via the []
operator, which keeps memory use low.
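As a small sketch (the image paths below are hypothetical), ImageDataset only stores the paths and loads each image lazily:
from chainer.datasets import ImageDataset
paths = ['images/img0.png', 'images/img1.png', 'images/img2.png']  # hypothetical files
dataset = ImageDataset(paths, root='.')
x = dataset[0]  # the first image is read from disk only at this point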
2. Prepare the dataset iterations¶
Iterator
creates a mini-batch from the given dataset.
batchsize = 128
train_iter = iterators.SerialIterator(train, batchsize)
test_iter = iterators.SerialIterator(test, batchsize, False, False)
3. Prepare the model¶
Here, we are going to use the same model as the one defined in How to write a training loop in Chainer.
class MLP(Chain):
def __init__(self, n_mid_units=100, n_out=10):
super(MLP, self).__init__()
with self.init_scope():
self.l1 = L.Linear(None, n_mid_units)
self.l2 = L.Linear(None, n_mid_units)
self.l3 = L.Linear(None, n_out)
def __call__(self, x):
h1 = F.relu(self.l1(x))
h2 = F.relu(self.l2(h1))
return self.l3(h2)
gpu_id = 0 # Set to -1 if you use CPU
model = MLP()
if gpu_id >= 0:
model.to_gpu(gpu_id)
4. Prepare the Updater¶
Trainer
is a class that holds all of the necessary components needed for training. The main components are shown below.

Basically, all you need to pass to Trainer
is an Updater
. However, Updater
contains an Iterator
and Optimizer
. Since Iterator
can access the dataset and Optimizer
has references to the model, Updater
can access the model to update its parameters.
So, Updater
can perform the training procedure as shown below:
- Retrieve the data from the dataset and construct a mini-batch (Iterator)
- Pass the mini-batch to the model and calculate the loss
- Update the parameters of the model (Optimizer)
Now let’s create the Updater
object!
max_epoch = 10
# Wrap your model with Classifier to include the loss calculation within the model.
# Since we do not specify a loss function here, the default 'softmax_cross_entropy' is used.
model = L.Classifier(model)
# Select an optimization method
optimizer = optimizers.MomentumSGD()
# Give the optimizer a reference to the model
optimizer.setup(model)
# Get an updater that uses the Iterator and Optimizer
updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)
Note
Here, the model defined above is passed to Classifier
and changed to a new Chain
. Classifier
, which in fact inherits from the Chain
class, keeps the given Chain
model in its predictor
attribute.
- Once you give the input data and the corresponding class labels to the model via the () operator, the __call__() method of the model is invoked. The data is then given to predictor to obtain the output y.
- Next, together with the given labels, the output y is passed to the loss function, which is determined by the lossfun argument in the constructor of Classifier.
- The loss is returned as a Variable.
In Classifier
, lossfun
is set to
softmax_cross_entropy()
by default.
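As a minimal sketch, calling the wrapped model created above directly with a mini-batch returns the loss (x_batch and t_batch are hypothetical arrays of images and integer labels):
# x_batch, t_batch: hypothetical mini-batch of images and integer labels
loss = model(x_batch, t_batch)  # softmax cross entropy, returned as a Variable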
StandardUpdater
is the simplest class among several updaters. There are also the ParallelUpdater
and the MultiprocessParallelUpdater
to utilize multiple GPUs. The MultiprocessParallelUpdater
uses the NVIDIA NCCL library, so you need to install NCCL and re-install CuPy before using it.
5. Setup Trainer¶
Lastly, we will set up the Trainer
. The only requirement for creating a Trainer
is to pass the Updater
object that we created above. You can also pass a stop_trigger
to the second trainer argument as a tuple like (length, unit)
to tell the trainer when to stop the training. The length
is given as an integer and the unit
is given as a string, which should be either epoch
or iteration
. Without setting stop_trigger
, the training will never stop.
# Setup a Trainer
trainer = training.Trainer(updater, (max_epoch, 'epoch'), out='mnist_result')
The out
argument specifies the output directory used to save the
log files and the image files of plots that show the progress of the loss, accuracy, etc. over time when you use the PlotReport
extension. Next, we will explain how to display or save this information by using trainer Extensions
.
6. Add Extensions to the Trainer object¶
The Trainer
extensions provide the following capabilities:
- Save log files automatically (LogReport)
- Display the training information to the terminal periodically (PrintReport)
- Visualize the loss progress by plotting a graph periodically and save it as an image file (PlotReport)
- Automatically serialize the state periodically (snapshot() / snapshot_object())
- Display a progress bar to the terminal to show the progress of training (ProgressBar)
- Save the model architecture as a Graphviz dot file (dump_graph())
To use this wide variety of tools for your training task, pass Extension
objects to the extend()
method of your Trainer
object.
trainer.extend(extensions.LogReport())
trainer.extend(extensions.snapshot(filename='snapshot_epoch-{.updater.epoch}'))
trainer.extend(extensions.snapshot_object(model.predictor, filename='model_epoch-{.updater.epoch}'))
trainer.extend(extensions.Evaluator(test_iter, model, device=gpu_id))
trainer.extend(extensions.PrintReport(['epoch', 'main/loss', 'main/accuracy', 'validation/main/loss', 'validation/main/accuracy', 'elapsed_time']))
trainer.extend(extensions.PlotReport(['main/loss', 'validation/main/loss'], x_key='epoch', file_name='loss.png'))
trainer.extend(extensions.PlotReport(['main/accuracy', 'validation/main/accuracy'], x_key='epoch', file_name='accuracy.png'))
trainer.extend(extensions.dump_graph('main/loss'))
LogReport
¶
Collects the loss
and accuracy
automatically every epoch
or iteration
and stores them in a log
file in the directory specified by the out
argument when you create the Trainer
object.
snapshot()
¶
The snapshot()
method saves the Trainer
object at the designated timing (default: every epoch) in the directory specified by out
. The Trainer
object, as mentioned before, has an Updater
which contains an Optimizer
and a model inside. Therefore, as long as you have the snapshot file, you can use it later to resume the training or to make inferences using the previously trained model.
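As a hedged sketch of resuming (the file name below assumes the snapshot filename pattern used in this example), you can recreate the trainer exactly as before, restore its state with load_npz, and continue training:
# The snapshot file name is assumed from the pattern used above.
serializers.load_npz('mnist_result/snapshot_epoch-10', trainer)
trainer.run()  # resumes from the restored iteration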
snapshot_object()
¶
However, when you keep only the whole Trainer
object, it can be tedious to retrieve just the model from it. By using snapshot_object()
, you can save a particular object (in this case, the model wrapped by Classifier
) as a separate snapshot. Classifier
is a Chain
object that keeps the model, which is also a Chain
object, as its predictor
property, and all the parameters are under predictor
, so taking a snapshot of predictor
is enough to keep all the trained parameters.
dump_graph()
¶
This method saves the structure of the computational graph of the model. The graph is saved in the
Graphviz dot format. The output location (directory) to save the graph is set by the out
argument of Trainer
.
Evaluator
¶
Evaluator
requires an Iterator
over the evaluation dataset and the model object. It evaluates the model using the given dataset (typically a validation dataset) at the specified timing interval.
PrintReport
¶
It outputs the specified values to the standard output.
PlotReport
¶
PlotReport
plots the values specified by its arguments and saves the plot as an image file with the name given by the file_name
argument.
Each Extension
class has its own options, and some extensions are not mentioned here. Another important feature is that, by using the trigger
option, you can set an individual timing at which each Extension
is fired. For more details on all the extensions, please take a look at the official document: Trainer extensions.
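For example, here is a small sketch of overriding an extension's timing with the trigger option, taking a snapshot every 5 epochs instead of the default of every epoch:
trainer.extend(extensions.snapshot(), trigger=(5, 'epoch'))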
7. Start Training¶
Just call the run()
method of the
Trainer
object to start training.
trainer.run()
epoch main/loss main/accuracy validation/main/loss validation/main/accuracy elapsed_time
1 1.53241 0.638409 0.74935 0.835839 4.93409
2 0.578334 0.858059 0.444722 0.882812 7.72883
3 0.418569 0.886844 0.364943 0.899229 10.4229
4 0.362342 0.899089 0.327569 0.905558 13.148
5 0.331067 0.906517 0.304399 0.911788 15.846
6 0.309019 0.911964 0.288295 0.917722 18.5395
7 0.292312 0.916128 0.272073 0.921776 21.2173
8 0.278291 0.92059 0.261351 0.923457 23.9211
9 0.266266 0.923541 0.253195 0.927314 26.6612
10 0.255489 0.926739 0.242415 0.929094 29.466
Let’s see the plot of loss progress saved in the mnist_result
directory.

How about the accuracy?

Furthermore, let’s visualize the computational graph saved with dump_graph()
using Graphviz.
% dot -Tpng mnist_result/cg.dot -o mnist_result/cg.png

From the top to the bottom, you can see the data flow in the computational graph. It basically shows how data and parameters are passed to the Function
s.
8. Evaluate a pre-trained model¶
Evaluation using the snapshot of a model is as easy as what was explained in How to write a training loop in Chainer.
import matplotlib.pyplot as plt
model = MLP()
serializers.load_npz('mnist_result/model_epoch-10', model)
# Show the output
x, t = test[0]
plt.imshow(x.reshape(28, 28), cmap='gray')
plt.show()
print('label:', t)
y = model(x[None, ...])
print('predicted_label:', y.data.argmax(axis=1)[0])

label: 7
predicted_label: 7
The prediction looks correct. Success!
Using GPU(s) in Chainer¶
In this section, you will learn about the following things:
- Relationship between Chainer and CuPy
- Basics of CuPy
- Single-GPU usage of Chainer
- Multi-GPU usage of model-parallel computing
- Multi-GPU usage of data-parallel computing
After reading this section, you will be able to:
- Use Chainer on a CUDA-enabled GPU
- Write model-parallel computing in Chainer
- Write data-parallel computing in Chainer
Relationship between Chainer and CuPy¶
Note
As of v2.0.0, CuPy has become a separate package and repository. Even if you have CUDA installed in your environment, you have to install CuPy separately to use GPUs. See Enable CUDA/cuDNN support for how to set up CUDA support.
Chainer uses CuPy as its backend for GPU computation.
In particular, the cupy.ndarray
class is the GPU array implementation for Chainer.
CuPy supports a subset of features of NumPy with a compatible interface.
It enables us to write a common code for CPU and GPU.
It also supports PyCUDA-like user-defined kernel generation, which enables us to write fast implementations dedicated to GPU.
Note
The chainer.cuda
module imports many important symbols from CuPy.
For example, the cupy namespace is referred to as cuda.cupy
in the Chainer code.
Note that the chainer.cuda
module can be imported even if CUDA is not installed.
Chainer uses a memory pool for GPU memory allocation.
As shown in the previous sections, Chainer constructs and destructs many arrays during learning and evaluating iterations.
This is not well suited to the CUDA architecture, since memory allocation and release in CUDA (i.e. the cudaMalloc
and cudaFree
functions) synchronize CPU and GPU computations, which hurts performance.
In order to avoid memory allocation and deallocation during the computation, Chainer uses CuPy’s memory pool as the standard memory allocator.
Chainer changes the default allocator of CuPy to the memory pool, so users can use CuPy functions directly without dealing with the memory allocator.
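If CuPy is installed, you can inspect the default memory pool with the following small sketch (these are CuPy utilities, not required for normal Chainer usage, and availability may depend on your CuPy version):
import cupy
pool = cupy.get_default_memory_pool()
print('bytes currently in use:', pool.used_bytes())
print('bytes held by the pool:', pool.total_bytes())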
Basics of cupy.ndarray
¶
See the CuPy documentation for the basic usage of cupy.ndarray.
CuPy is a GPU array backend that implements a subset of the NumPy interface.
The cupy.ndarray
class is at its core; it is a GPU-compatible alternative to numpy.ndarray
.
CuPy implements many functions on cupy.ndarray
objects.
See the reference for the supported subset of the NumPy API.
Understanding NumPy will help you utilize most features of CuPy.
See the NumPy documentation to learn it.
The main difference of cupy.ndarray
from numpy.ndarray
is that the content is allocated on the device memory.
The allocation takes place on the current device by default.
The current device can be changed by a cupy.cuda.Device
object as follows:
with cupy.cuda.Device(1):
x_on_gpu1 = cupy.array([1, 2, 3, 4, 5])
Most CuPy operations are performed on the current device. Be careful: processing an array on a non-current device causes an error.
Chainer provides some convenient functions to automatically switch and choose the device.
For example, the chainer.cuda.to_gpu()
function copies a numpy.ndarray
object to a specified device:
x_cpu = np.ones((5, 4, 3), dtype=np.float32)
x_gpu = cuda.to_gpu(x_cpu, device=1)
It is equivalent to the following code using CuPy:
x_cpu = np.ones((5, 4, 3), dtype=np.float32)
with cupy.cuda.Device(1):
x_gpu = cupy.array(x_cpu)
Moving a device array to the host can be done by chainer.cuda.to_cpu()
as follows:
x_cpu = cuda.to_cpu(x_gpu)
It is equivalent to the following code using CuPy:
with x_gpu.device:
x_cpu = x_gpu.get()
Note
The with statements in these code snippets are required to select the appropriate CUDA device.
If you use only one device, this device switching is not needed.
The chainer.cuda.to_cpu()
and chainer.cuda.to_gpu()
functions automatically switch the current device correctly.
Chainer also provides the convenient functions chainer.cuda.get_device_from_id()
and chainer.cuda.get_device_from_array()
to select a device.
The former function accepts an integer or None
.
When None
is given, it returns a dummy device object.
Otherwise, it returns the corresponding device object.
The latter function accepts a CuPy array or a NumPy array.
When a NumPy array is given, it returns a dummy device object.
Otherwise, it returns the device object corresponding to the given CuPy array.
The dummy device object also supports with statements like the above example, but does nothing.
Here are some other examples:
cuda.get_device_from_id(1).use()
x_gpu1 = cupy.empty((4, 3), dtype='f') # 'f' indicates float32
with cuda.get_device_from_id(1):
x_gpu1 = cupy.empty((4, 3), dtype='f')
with cuda.get_device_from_array(x_gpu1):
    y_gpu1 = x_gpu1 + 1
Since it accepts NumPy arrays, we can write a function that accepts both NumPy and CuPy arrays with correct device switching:
def add1(x):
with cuda.get_device_from_array(x):
return x + 1
The compatibility of CuPy with NumPy enables us to write CPU/GPU generic code.
This can be made easy by the chainer.cuda.get_array_module()
function.
This function returns the numpy
or cupy
module based on its arguments.
A CPU/GPU generic function is defined using it as follows:
# Stable implementation of log(1 + exp(x))
def softplus(x):
xp = cuda.get_array_module(x)
return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))
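As a small usage sketch, the same softplus function works on NumPy arrays and, if CuPy is available, on CuPy arrays as well:
x_cpu = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
print(softplus(x_cpu))
# x_gpu = cuda.to_gpu(x_cpu)  # requires a CUDA-enabled GPU with CuPy installed
# print(softplus(x_gpu))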
Run Neural Networks on a Single GPU¶
Single-GPU usage is very simple.
What you have to do is transfer the Link
and input arrays to the GPU beforehand.
In this subsection, the code is based on our first MNIST example in this tutorial.
A Link
object can be transferred to the specified GPU using the to_gpu()
method.
This time, we make the number of input, hidden, and output units configurable.
The to_gpu()
method also accepts a device ID like model.to_gpu(0)
.
In this case, the link object is transferred to the appropriate GPU device.
The current device is used by default.
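For example, a minimal sketch using the MLP chain defined earlier (this requires a CUDA-enabled GPU and CuPy):
model = MLP()
model.to_gpu()    # transfer parameters to the current GPU
# model.to_gpu(0) # or specify a device ID explicitly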
If we use chainer.training.Trainer
, what we have to do is just let the updater know the device ID to send each mini-batch.
updater = training.StandardUpdater(train_iter, optimizer, device=0)
trainer = training.Trainer(updater, (20, 'epoch'), out='result')
We also have to specify the device ID for an evaluator extension as well.
trainer.extend(extensions.Evaluator(test_iter, model, device=0))
When we write down the training loop by hand, we have to transfer each mini-batch to the GPU manually:
model.to_gpu()
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
print('epoch %d' % epoch)
indexes = np.random.permutation(datasize)
for i in range(0, datasize, batchsize):
x = Variable(cuda.to_gpu(x_train[indexes[i : i + batchsize]]))
t = Variable(cuda.to_gpu(y_train[indexes[i : i + batchsize]]))
optimizer.update(model, x, t)
Model-parallel Computation on Multiple GPUs¶
Parallelization of machine learning is roughly classified into two types called “model-parallel” and “data-parallel”. Model-parallel means parallelizations of the computations inside the model. In contrast, data-parallel means parallelizations using data sharding. In this subsection, we show how to use the model-parallel approach on multiple GPUs in Chainer.
Recall the MNIST example. Now suppose that we want to modify this example by expanding the network to 6 layers with 2000 units each using two GPUs. In order to make multi-GPU computation efficient, we only make the two GPUs communicate at the third and sixth layer. The overall architecture looks like the following diagram:
(GPU0) input --+--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+--> output
| | |
(GPU1) +--> l1 --> l2 --> l3 --+--> l4 --> l5 --> l6 --+
We can use the above MLP chain as in the following diagram:
(GPU0) input --+--> mlp1 --+--> mlp2 --+--> output
| | |
(GPU1) +--> mlp1 --+--> mlp2 --+
Let’s write a link for the whole network.
class ParallelMLP(Chain):
def __init__(self):
super(ParallelMLP, self).__init__()
with self.init_scope():
# the input size, 784, is inferred
self.mlp1_gpu0 = MLP(1000, 2000).to_gpu(0)
self.mlp1_gpu1 = MLP(1000, 2000).to_gpu(1)
# the input size, 2000, is inferred
self.mlp2_gpu0 = MLP(1000, 10).to_gpu(0)
self.mlp2_gpu1 = MLP(1000, 10).to_gpu(1)
def __call__(self, x):
# assume x is on GPU 0
z0 = self.mlp1_gpu0(x)
z1 = self.mlp1_gpu1(F.copy(x, 1))
# sync
h0 = F.relu(z0 + F.copy(z1, 0))
h1 = F.relu(z1 + F.copy(z0, 1))
y0 = self.mlp2_gpu0(h0)
y1 = self.mlp2_gpu1(h1)
# sync
y = y0 + F.copy(y1, 0)
return y # output is on GPU0
Recall that the Link.to_gpu()
method returns the link itself.
The copy()
function copies an input variable to the specified GPU device and returns a new variable on that device.
The copy supports backprop, which simply transfers the output gradient back to the input device.
Note
The above code is not parallelized on the CPU, but is parallelized on the GPU. This is because all the functions in the above code run asynchronously with respect to the host CPU.
An almost identical example code can be found at examples/mnist/train_mnist_model_parallel.py.
Data-parallel Computation on Multiple GPUs with Trainer¶
Data-parallel computation is another strategy to parallelize online processing. In the context of neural networks, it means that a different device does computation on a different subset of the input data. In this subsection, we review the way to achieve data-parallel learning on two GPUs.
Suppose again our task is the MNIST example. This time we want to directly parallelize the three-layer network. The simplest form of data-parallelization is parallelizing the gradient computation for distinct sets of data. First, define model and optimizer instances:
model = L.Classifier(MLP(1000, 10)) # the input size, 784, is inferred
optimizer = optimizers.SGD()
optimizer.setup(model)
Recall that the MLP
link implements the multi-layer perceptron, and the Classifier
link wraps it to provide a classifier interface.
We used StandardUpdater
in the previous example.
In order to enable data-parallel computation with multiple GPUs, we only have to replace it with ParallelUpdater
.
updater = training.ParallelUpdater(train_iter, optimizer,
devices={'main': 0, 'second': 1})
The devices
option specifies which devices to use in data-parallel learning.
The device with name 'main'
is used as the main device.
The original model is sent to this device, so the optimization runs on the main device.
In the above example, the model is also cloned and sent to GPU 1.
Half of each mini-batch is fed to this cloned model.
After every backward computation, the gradient is accumulated into the main device, the parameter update runs on it, and then the updated parameters are sent to GPU 1 again.
See also the example code in examples/mnist/train_mnist_data_parallel.py.
Data-parallel Computation on Multiple GPUs without Trainer¶
We here introduce a way to write data-parallel computation without the help of Trainer
.
Most users can skip this section.
If you are interested in how to write a data-parallel computation by yourself, this section should be informative.
It is also helpful to, e.g., customize the ParallelUpdater
class.
We again start from the MNIST example.
At this time, we use a suffix like _0
and _1
to distinguish objects on each device.
First, we define a model.
model_0 = L.Classifier(MLP(1000, 10)) # the input size, 784, is inferred
We want to make two copies of this instance on different GPUs.
The Link.to_gpu()
method runs in place, so we cannot use it to make a copy.
In order to make a copy, we can use Link.copy()
method.
model_1 = model_0.copy()
model_0.to_gpu(0)
model_1.to_gpu(1)
The Link.copy()
method copies the link into another instance.
It just copies the link hierarchy, and does not copy the arrays it holds.
Then, set up an optimizer:
optimizer = optimizers.SGD()
optimizer.setup(model_0)
Here we use the first copy of the model as the master model.
Before its update, gradients of model_1
must be aggregated to those of model_0
.
Then, we can write a data-parallel learning loop as follows:
batchsize = 100
datasize = len(x_train)
for epoch in range(20):
print('epoch %d' % epoch)
indexes = np.random.permutation(datasize)
for i in range(0, datasize, batchsize):
x_batch = x_train[indexes[i : i + batchsize]]
y_batch = y_train[indexes[i : i + batchsize]]
x0 = Variable(cuda.to_gpu(x_batch[:batchsize//2], 0))
t0 = Variable(cuda.to_gpu(y_batch[:batchsize//2], 0))
x1 = Variable(cuda.to_gpu(x_batch[batchsize//2:], 1))
t1 = Variable(cuda.to_gpu(y_batch[batchsize//2:], 1))
loss_0 = model_0(x0, t0)
loss_1 = model_1(x1, t1)
model_0.cleargrads()
model_1.cleargrads()
loss_0.backward()
loss_1.backward()
model_0.addgrads(model_1)
optimizer.update()
model_1.copyparams(model_0)
Do not forget to clear the gradients of both model copies!
One half of the mini-batch is forwarded to GPU 0, the other half to GPU 1.
Then the gradients are accumulated by the Link.addgrads()
method.
This method adds the gradients of a given link to those of the self.
After the gradients are prepared, we can update the optimizer in the usual way.
Note that the update only modifies the parameters of model_0
.
So we must manually copy them to model_1
using Link.copyparams()
method.
Note
If the batch size used in each model remains the same, the scale of the gradient
is roughly proportional to the number of models when we aggregate
gradients from all models with chainer.Link.addgrads()
. So you need to adjust the batch size
and/or learning rate of the optimizer accordingly.
Now you can use Chainer with GPUs. All examples in the examples directory support GPU computation, so please refer to them if you want to know more practices on using GPUs. In the next section, we will show how to define a differentiable (i.e. backpropable) function on Variable objects. We will also show there how to write a simple (elementwise) CUDA kernel using Chainer’s CUDA utilities.
Customizing Chainer¶
Define your own function¶
In this section, you will learn about the following things:
- How to define a function on variables
- Useful tools to write a function using a GPU
- How to test the function definition
After reading this section, you will be able to:
- Write your own functions
- Define simple kernels in the function definition
Differentiable Functions¶
Chainer provides a collection of functions in the functions
module.
It covers typical use cases in deep learning, so many existing works can be implemented with them.
On the other hand, deep learning is evolving rapidly and we cannot cover all possible functions to define unseen architectures.
So it is important to learn how to define your own functions.
First, suppose we want to define an elementwise function \(f(x, y, z) = x * y + z\).
While it is possible to implement this equation using a combination of the *
and +
functions,
defining it as a single function may reduce memory consumption, so it is not only a toy example.
Here we call this function MulAdd.
Let’s start with defining MulAdd working on the CPU.
Any function must inherit the Function
class.
The skeleton of a function looks like:
class MulAdd(Function):
def forward_cpu(self, inputs):
# do forward computation on CPU
return some_tuple
def backward_cpu(self, inputs, grad_outputs):
# do backward computation on CPU
return some_tuple
We must implement forward_cpu()
and backward_cpu()
methods.
The non-self arguments of these functions are tuples of array(s), and these functions must return a tuple of array(s).
Warning
Be careful to return a tuple of arrays even if you have just one array to return.
MulAdd is simple and can be implemented as follows:
class MulAdd(Function):
def forward_cpu(self, inputs):
x, y, z = inputs
w = x * y + z
return w,
def backward_cpu(self, inputs, grad_outputs):
x, y, z = inputs
gw, = grad_outputs
gx = y * gw
gy = x * gw
gz = gw
return gx, gy, gz
As per the warning above, the forward_cpu
method returns a tuple with a single element.
Note that all arrays appearing in CPU functions are numpy.ndarray
.
The forward function is straightforward:
It unpacks the input tuple, computes the output, and packs it into a tuple.
The backward function is a bit more complicated.
Recall the rule of differentiation of multiplication.
This example just implements the rule.
Looking at the return values, the function just packs the gradient of each input in the same order and returns them.
By just defining the core computation of forward and backward, the Function class provides the chaining logic on top of it (i.e. storing the history of computation, etc.).
Note
Assume we implement a (forward) function \(y=f(x)\) which takes as input the
vector \(x \in \mathbb{R}^n\) and produces as output a vector
\(y \in \mathbb{R}^m\). Then the backward
method has to compute

\[\lambda_i = \sum_{j=1}^{m} \frac{\partial y_j}{\partial x_i} \, \gamma_j \qquad \text{for } i = 1 \dots n,\]

where \(\gamma\) is the grad_outputs
. Note that the
resulting vector \(\lambda\) must have the same shape as the arguments of the forward
method.
Now let’s define the corresponding GPU methods.
You can easily predict that the methods we have to write are named forward_gpu()
and backward_gpu()
:
class MulAdd(Function):
def forward_cpu(self, inputs):
...
def backward_cpu(self, inputs, grad_outputs):
...
def forward_gpu(self, inputs):
x, y, z = inputs
w = x * y + z
return w,
def backward_gpu(self, inputs, grad_outputs):
x, y, z = inputs
gw, = grad_outputs
gx = y * gw
gy = x * gw
gz = gw
return gx, gy, gz
In GPU methods, arrays are of type cupy.ndarray
.
We use arithmetic operators defined for this class.
These operators implement the basic elementwise arithmetics.
You may find that the definitions of the GPU methods are exactly the same as those of the CPU methods.
In that case, we can reduce them to forward()
and backward()
methods:
class MulAdd(Function):
def forward(self, inputs):
x, y, z = inputs
w = x * y + z
return w,
def backward(self, inputs, grad_outputs):
x, y, z = inputs
gw, = grad_outputs
gx = y * gw
gy = x * gw
gz = gw
return gx, gy, gz
Since the cupy.ndarray
class implements many methods of numpy.ndarray
, we can write these unified methods in most cases.
The MulAdd function is used as follows:
x = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
y = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
z = Variable(np.random.uniform(-1, 1, (3, 2)).astype(np.float32))
w = MulAdd()(x, y, z)
It looks a bit ugly: we have to explicitly instantiate MulAdd before applying it to variables. We also have to be careful that one instance of MulAdd must not be used multiple times, since it acts as a node in the computational graph. In Chainer, we often define a thin wrapper Python function that hides the instantiation:
def muladd(x, y, z):
return MulAdd()(x, y, z)
w = muladd(x, y, z)
Unified forward/backward methods with NumPy/CuPy functions¶
CuPy also implements many functions that are compatible with those of NumPy. We can write unified forward/backward methods with them. Suppose we want to write a backprop-able function \(f(x, y) = \exp(x) + \exp(y)\). We name it ExpAdd here. It can be written straightforwardly as follows:
class ExpAdd(Function):
def forward_cpu(self, inputs):
x, y = inputs
z = np.exp(x) + np.exp(y)
return z,
def backward_cpu(self, inputs, grad_outputs):
x, y = inputs
gz, = grad_outputs
gx = gz * np.exp(x)
gy = gz * np.exp(y)
return gx, gy
def forward_gpu(self, inputs):
cupy = cuda.cupy
x, y = inputs
z = cupy.exp(x) + cupy.exp(y)
return z,
def backward_gpu(self, inputs, grad_outputs):
cupy = cuda.cupy
x, y = inputs
gz, = grad_outputs
gx = gz * cupy.exp(x)
gy = gz * cupy.exp(y)
return gx, gy
def expadd(x, y):
return ExpAdd()(x, y)
Note
Here we used cuda.cupy
instead of directly accessing cupy
.
This is because the cupy
module cannot be imported if CUDA is not installed.
In order to keep the implementation valid in a non-CUDA environment, we have to defer the access to the cupy
module.
Note that the chainer.cuda
module can be imported even if CUDA is not installed.
Of course, the module is almost useless in such an environment, but if the interpreter does not run through the code accessing CUDA-dedicated functions, the code is still valid.
The CPU and GPU implementations are almost the same, except that numpy
is replaced by cupy
in the GPU methods.
We can unify these functions using the cuda.get_array_module()
function.
This function accepts an arbitrary number of arrays and returns the appropriate module for them.
See the following code:
class ExpAdd(Function):
def forward(self, inputs):
xp = cuda.get_array_module(*inputs)
x, y = inputs
z = xp.exp(x) + xp.exp(y)
return z,
def backward(self, inputs, grad_outputs):
xp = cuda.get_array_module(*inputs)
x, y = inputs
gz, = grad_outputs
gx = gz * xp.exp(x)
gy = gz * xp.exp(y)
return gx, gy
def expadd(x, y):
return ExpAdd()(x, y)
Note that this code works correctly even if CUDA is not installed in the environment.
If CUDA is not found, the get_array_module() function always returns numpy
.
We often use the name xp
for the module variable, which is analogous to the abbreviations np
for NumPy and cp
for CuPy.
Write an Elementwise Kernel Function¶
Let’s turn back to the MulAdd example.
The GPU implementation of MulAdd as shown above is already fast and parallelized on GPU cores. However, it invokes two kernels during each of the forward and backward computations. This might hurt performance, since the intermediate temporary arrays are read and written by possibly different GPU cores, which consumes a lot of bandwidth. We can reduce the number of invocations by defining our own kernel. This also reduces memory consumption.
Most functions only require elementwise operations like MulAdd.
CuPy provides a useful tool to define elementwise kernels, the cupy.elementwise.ElementwiseKernel
class, and Chainer wraps it with the cuda.elementwise()
function.
Our MulAdd implementation can be improved as follows:
class MulAdd(Function):
def forward_cpu(self, inputs):
...
def backward_cpu(self, inputs, grad_outputs):
...
def forward_gpu(self, inputs):
cupy = cuda.cupy
x, y, z = inputs
w = cuda.elementwise(
'float32 x, float32 y, float32 z',
'float32 w',
'w = x * y + z',
'muladd_fwd')(x, y, z)
return w,
def backward_gpu(self, inputs, grad_outputs):
x, y, z = inputs
gw, = grad_outputs
gx, gy = cuda.elementwise(
'float32 x, float32 y, float32 gw',
'float32 gx, float32 gy',
'''
gx = y * gw;
gy = x * gw;
''',
'muladd_bwd')(x, y, gw)
gz = gw
return gx, gy, gz
The cuda.elementwise()
function accepts the essential implementation of the kernel function and returns a kernel invocation function (actually, it returns an ElementwiseKernel
object, which is callable).
In typical usage, we pass four arguments to this function as follows:
- Input argument list. This is a comma-separated string each entry of which consists of a type specification and an argument name.
- Output argument list in the same format as the input argument list.
- Body of parallel loop. We can use the input/output argument names as an element of these arrays.
- Name of the kernel function, which is shown in debuggers and profilers.
The above code is not compiled on every forward/backward computation thanks to two caching mechanisms provided by cuda.elementwise()
.
The first is binary caching:
the cuda.elementwise()
function caches the compiled binary in the $(HOME)/.cupy/kernel_cache
directory with a hash value of the CUDA code, and reuses it if the given code matches the hash value.
This caching mechanism is actually implemented in CuPy.
The second is upload caching:
given a compiled binary code, we have to upload it to the current GPU in order to execute it.
The cuda.elementwise()
function memoizes the arguments and the current device, and if it is called with the same arguments on the same device, it reuses the previously uploaded kernel code.
The above MulAdd code only works for float32 arrays.
The ElementwiseKernel
also supports the type-variadic kernel definition.
In order to define variadic kernel functions, you can use a type placeholder by placing a single character as the type specifier:
class MulAdd(Function):
def forward_cpu(self, inputs):
...
def backward_cpu(self, inputs, grad_outputs):
...
def forward_gpu(self, inputs):
cupy = cuda.cupy
x, y, z = inputs
w = cuda.elementwise(
'T x, T y, T z',
'T w',
'w = x * y + z',
'muladd_fwd')(x, y, z)
return w,
def backward_gpu(self, inputs, grad_outputs):
x, y, z = inputs
gw, = grad_outputs
gx, gy = cuda.elementwise(
'T x, T y, T gw',
'T gx, T gy',
'''
gx = y * gw;
gy = x * gw;
''',
'muladd_bwd')(x, y, gw)
gz = gw
return gx, gy, gz
The type placeholder T
indicates an arbitrary data type that CuPy supports.
There are more functionalities on user-defined kernels in CuPy. See the CuPy documentation on user-defined kernels for more details.
Write a function with training/test mode¶
We sometimes want to make a function behave differently in training and test modes.
The training/test mode in Chainer is configured by chainer.config
.
This is a thread-local configuration object, and users can assign True or False to its train
attribute.
You can refer to Configuring Chainer to see how to configure this flag as well as other configuration items.
Here, we just show how to use this flag to make a function support training/test mode.
You will need to check the value of the boolean flag chainer.config.train
and branch appropriately.
For example, consider the following simple dropout function:
def dropout(x):
xp = cuda.get_array_module(x.data)
mask = 2 * (xp.random.rand(*x.shape) > 0.5).astype(x.dtype)
return x * mask
This function applies dropout to each element and doubles the surviving elements to preserve the scale. The above implementation applies dropout even in test mode, but that is not the desired behavior. We can fix it as follows:
def dropout(x):
if not chainer.config.train:
return x
xp = cuda.get_array_module(x.data)
mask = 2 * (xp.random.rand(*x.shape) > 0.5).astype(x.dtype)
return x * mask
The function now supports test mode.
Note that you usually do not have to implement your own dropout function because dropout()
is officially provided.
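For example, a small sketch of switching to test mode temporarily so that the dropout above becomes a no-op (x is a hypothetical input):
with chainer.using_config('train', False):
    y = dropout(x)  # returns x unchanged in test mode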
Links that wrap functions¶
Some functions are meant to be combined with parameters.
In such cases, it is useful to write a small link that wraps the function.
We have already seen how to define a chain that wraps other links (by inheriting from the Chain
class).
Here we study how to define a link that does not hold any other links.
As the first example, suppose that we want to implement elementwise product function between the input array and the parameter array. It can be defined as follows:
from chainer import initializers

class EltwiseParamProduct(Link):
def __init__(self, shape):
super(EltwiseParamProduct, self).__init__()
with self.init_scope():
self.W = chainer.Parameter(initializers.Normal(scale=1.), shape)
def __call__(self, x):
return self.W * x
For another example, assume we want to define a simple linear layer.
It is already defined as Linear
, so this is an educational example.
The linear layer is divided into two parts: a function and its wrapper link.
First, we have to define a function on variables:
class LinearFunction(Function):
def forward(self, inputs):
x, W, b = inputs
return x.dot(W.T) + b,
def backward(self, inputs, grad_outputs):
x, W, b = inputs
gy, = grad_outputs
gx = gy.dot(W)
gW = gy.T.dot(x)
gb = gy.sum(axis=0)
return gx, gW, gb
def linear(x, W, b):
return LinearFunction()(x, W, b)
This function takes three arguments: input, weight, and bias. It can be used as part of a model definition, though it is inconvenient since the user has to manage the weight and bias parameters directly. In order to make a convenient module, let’s wrap it into a link:
import math

from chainer import initializers

class Linear(Link):
def __init__(self, in_size, out_size):
super(Linear, self).__init__()
with self.init_scope():
self.W = chainer.Parameter(
initializers.Normal(1. / math.sqrt(in_size)),
(out_size, in_size))
self.b = chainer.Parameter(0, (out_size,))
def __call__(self, x):
return linear(x, self.W, self.b)
This link hides the parameters of the linear layer.
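A minimal usage sketch of the Linear link defined above:
layer = Linear(3, 2)
x = Variable(np.random.randn(4, 3).astype(np.float32))
y = layer(x)  # y.shape == (4, 2)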
Note
An advanced tip for implementing functions: if you want to preserve some information between the forward and backward computations (e.g. to cache some arrays), you can store it as attributes. Be careful that doing so might increase the memory consumption during the whole forward-backward computation. If you want to train very large networks on a GPU with limited memory, it is not recommended to cache arrays between forward and backward. There is one exception to this: caching the output arrays does not change the memory consumption, because they are also held by the output Variable objects.
Warning
You should not assume a one-to-one match of calls of forward and backward. Some users may call backward more than once after one forward call.
Testing Function¶
In order to isolate the cause of a learning failure from implementation bugs, it is important to test function implementations.
Chainer provides simple utilities to help write unit tests.
They are defined in the gradient_check
module.
The most important test utility is the numerical_grad()
function.
This function computes the numerical gradient of a given function using finite differences.
It can be used as follows:
x = np.random.randn(4, 3).astype(np.float32)
gy = np.ones((4, 3), dtype=np.float32)
f = lambda: (x * x,)
gx = gradient_check.numerical_grad(f, (x,), (gy,))
f
is a closure that returns a tuple of array(s) computed from input arrays.
The second and third arguments of numerical_grad()
are tuples of input arrays and output gradient arrays, respectively.
The code above computes the numerical gradients of sum(f(x))
, where sum
indicates the summation over all elements.
The summation can be weighted by changing gy
.
The numerical_grad()
function also accepts an additional eps
argument, which indicates the quantization width of the finite differences.
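For example, a small sketch with a wider finite-difference step:
gx = gradient_check.numerical_grad(f, (x,), (gy,), eps=1e-3)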
Note
The numerical_grad()
function accepts both CPU and GPU arrays.
Note that CPU and GPU arrays cannot be mixed.
Another utility is the chainer.testing.assert_allclose()
function.
This is similar to the numpy.testing.assert_allclose()
function.
The difference is that Chainer’s version accepts both CPU and GPU arrays as inputs, and we can mix them in one invocation of chainer.testing.assert_allclose()
.
The default values of the optional arguments are also different.
Here is a typical usage of the gradient checking utilities.
This is a test example for the functions.relu()
function:
import unittest
from chainer import testing
class TestReLU(unittest.TestCase):
def test_backward_cpu(self):
x = Variable(np.random.randn(3, 2).astype(np.float32))
y = F.relu(x)
y.grad = np.random.randn(3, 2).astype(np.float32)
y.backward()
def f():
return F.relu(x).data,
gx, = gradient_check.numerical_grad(f, (x.data,), (y.grad,))
testing.assert_allclose(gx, x.grad)
The first four lines of the test code are a simple forward and backward computation of the ReLU function. The next two lines compute the numerical gradient using the same forward function without the backward routine. Finally, we compare these two results elementwise. Note that the above test code can easily be modified to test the GPU version just by replacing the CPU arrays with GPU arrays.
In most cases, we do not write code like the above explicitly because Chainer
offers a utility function, chainer.gradient_check.check_backward()
, that follows this procedure.
import unittest
from chainer import gradient_check
class TestReLU(unittest.TestCase):
def test_backward_cpu(self):
def f(x):
return F.relu(x)
x = np.random.randn(3, 2).astype(np.float32)
y_grad = np.random.randn(3, 2).astype(np.float32)
gradient_check.check_backward(f, x, y_grad, atol=1e-4, rtol=1e-4)
You can find many examples of function tests under tests/chainer_tests/functions_tests directory.
Create your own trainer extension¶
In this section, you will learn about the following things:
How to create your own trainer extension
What is trainer Extension?¶
Extension
is a callable object that takes a Trainer
object as an argument. When you add an Extension to a Trainer using the extend() method, the Extension will be called at the timing you specify with a trigger object (see the details in 1. trigger).
A Trainer
object has all the information used in a training loop, e.g., models, optimizers, updaters, iterators, datasets, etc., so from within an extension you can change the settings of optimizers and other components.
Write a simple function¶
You can make a new Extension
by writing a simple function that takes a Trainer
object as its argument. For example, when you want to reduce the learning rate at a specified timing during training, an lr_drop
extension can be written as follows:
def lr_drop(trainer):
trainer.updater.get_optimizer('main').lr *= 0.1
Then you can add this function to a Trainer
object via the extend() method.
trainer.extend(lr_drop, trigger=(10, 'epoch'))
It lowers the learning rate every 10 epochs by multiplying the current learning rate by 0.1.
Write a function decorated with @make_extension¶
make_extension()
is a decorator that adds some attributes to a given function. For example, the simple extension we created above can be written in this form:
@training.make_extension(trigger=(10, 'epoch'))
def lr_drop(trainer):
trainer.updater.get_optimizer('main').lr *= 0.1
The difference between the former and the latter is whether the extension has a default trigger
or not. In the latter case, lr_drop()
has its own default trigger
, so unless another trigger
is specified via the extend()
method, the trigger
specified in make_extension()
is used by default. So the code below acts the same as the former example, i.e., it reduces the learning rate every 10 epochs.
trainer.extend(lr_drop)
There are several attributes you can add using the make_extension()
decorator.
1. trigger¶
trigger
is an object that takes a Trainer
object as an argument and returns a boolean value. If a tuple in the form (period, unit)
is given as a trigger, it will be treated as an IntervalTrigger
that invokes the extension every period
units. For example, when the given tuple is (10, 'epoch')
, the extension will be fired every 10 epochs.
trigger
can also be given to the extend()
method that adds an extension to a Trainer
object. The priority of triggers is as follows:
- When both extend() and a given Extension have triggers, the trigger given to extend() is used.
- When None is given to extend() as the trigger argument and a given Extension has a trigger, the trigger given to the Extension is used.
- When the trigger attributes in both extend() and the Extension are None, the Extension will be fired every iteration.
See the details in the documentation of get_trigger()
.
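As a small sketch, get_trigger() converts a (period, unit) tuple into an IntervalTrigger object:
trigger = training.get_trigger((10, 'epoch'))  # fires every 10 epochs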
2. default_name¶
An Extension
is kept in a dictionary that is a property of the Trainer
. This argument gives the name of the Extension
. Users will see this name in the keys of a snapshot, which is a dictionary generated by serialization.
3. priority¶
The priority that is used to determine the order of execution of extensions in a Trainer
object. There are three standard values for the priorities:
- PRIORITY_WRITER: The priority for extensions that write some records to the observation dictionary. This includes cases where the extension adds values to the observation dictionary directly, or uses the chainer.report() function to report values to the observation dictionary. Extensions that write something to the reporter should go first, because other extensions that read those values may be added afterwards.
- PRIORITY_EDITOR: The priority for extensions that edit the observation dictionary based on already reported values. Extensions that edit some of the reported values should go after the extensions that write values to the reporter, but before the extensions that read the final values.
- PRIORITY_READER: The priority for extensions that only read records from the observation dictionary. This is also suitable for extensions that do not use the observation dictionary at all. Extensions that read the reported values should be fired after all the extensions with other priorities, e.g., PRIORITY_WRITER and PRIORITY_EDITOR, because they should read the final values.
See the details in the documentation of Trainer
.
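As a hedged sketch, the priority can be set through make_extension(); here a hypothetical extension writes a value into the observation dictionary with writer priority so that reader extensions see it afterwards:
@training.make_extension(trigger=(1, 'epoch'),
                         priority=training.extension.PRIORITY_WRITER)
def record_constant(trainer):
    # hypothetical example: write a value for later (reader) extensions
    trainer.observation['custom/value'] = 1.0

trainer.extend(record_constant)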
Write a class inherited from Extension class¶
This is the way to define your own extension with the maximum degree of freedom. You can keep any values inside of the extension and serialize them.
As an example, let’s make an extension that drops the learning rate polynomially. It calculates the learning rate by this equation:

\[\eta = \eta_{\rm init} \left(1 - \frac{t}{t_{\rm max}}\right)^{\rm power}\]

The learning rate will be dropped like the curve below with \({\rm power} = 0.5\):

class PolynomialShift(training.Extension):
def __init__(self, attr, power, stop_trigger, batchsize=None,
len_dataset=None):
self._attr = attr
self._power = power
self._init = None
self._t = 0
self._last_value = 0
if stop_trigger[1] == 'iteration':
self._maxiter = stop_trigger[0]
elif stop_trigger[1] == 'epoch':
if batchsize is None or len_dataset is None:
raise ValueError(
'When the unit of \'stop_trigger\' is \'epoch\', '
'\'batchsize\' and \'len_dataset\' should be '
'specified to calculate the maximum iteration.')
n_iter_per_epoch = len_dataset / float(batchsize)
self._maxiter = float(stop_trigger[0] * n_iter_per_epoch)
def initialize(self, trainer):
optimizer = trainer.updater.get_optimizer('main')
# ensure that _init is set
if self._init is None:
self._init = getattr(optimizer, self._attr)
def __call__(self, trainer):
self._t += 1
optimizer = trainer.updater.get_optimizer('main')
value = self._init * ((1 - (self._t / self._maxiter)) ** self._power)
setattr(optimizer, self._attr, value)
self._last_value = value
def serialize(self, serializer):
self._t = serializer('_t', self._t)
self._last_value = serializer('_last_value', self._last_value)
if isinstance(self._last_value, np.ndarray):
self._last_value = np.asscalar(self._last_value)
stop_trigger = (10000, 'iteration')
trainer.extend(PolynomialShift('lr', 0.5, stop_trigger))
This extension PolynomialShift takes five arguments.
- attr: The name of the optimizer property you want to update with this extension.
- power: The power in the above equation used to calculate the learning rate.
- stop_trigger: The trigger given to the Trainer object to specify when to stop the training loop.
- batchsize: The training mini-batch size.
- len_dataset: The length of the dataset, i.e., the number of data points in the training dataset.
This extension calculates the number of iterations which will be performed during training from stop_trigger, batchsize, and len_dataset, then stores it as the property _maxiter. This property is used in the __call__() method to update the learning rate. The initialize() method obtains the initial learning rate from the optimizer set to the given Trainer object. The serialize() method stores or recovers the properties this extension has: _t (the number of iterations) and _last_value (the latest learning rate).
Type check¶
In this section, you will learn about the following things:
- Basic usage of type check
- Detail of type information
- Internal mechanism of type check
- More complicated cases
- Call functions
- Typical type check example
After reading this section, you will be able to:
- Write code to check the types of input arguments of your own functions
Basic usage of type check¶
When you call a function with an invalid type of array, you sometimes receive no error, but get an unexpected result by broadcasting. When you use CUDA with an illegal type of array, it causes memory corruption, and you get a serious error. These bugs are hard to fix. Chainer can check the preconditions of each function and helps to prevent such problems. These conditions may also help users to understand the specification of functions.
Each implementation of Function
has a method for type check, check_type_forward()
.
This function is called just before the forward()
method of the Function
class.
You can override this method to check conditions on the types and shapes of the arguments.
check_type_forward()
gets an argument in_types
:
def check_type_forward(self, in_types):
...
in_types
is an instance of TypeInfoTuple
, which is a sub-class of tuple
.
To get type information about the first argument, use in_types[0]
.
If the function gets multiple arguments, we recommend using new variables for readability:
x_type, y_type = in_types
In this case, x_type
represents the type of the first argument, and y_type
represents the second one.
We describe usage of in_types
with an example.
When you want to check whether the number of dimensions of x_type equals 2, write this code:
utils.type_check.expect(x_type.ndim == 2)
When this condition is true, nothing happens. Otherwise this code throws an exception, and the user gets a message like this:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].ndim == 2
Actual: 3 != 2
This error message means that “ndim of the first argument is expected to be 2, but it is actually 3”.
Detail of type information¶
You can access three kinds of information of x_type.
- .shape is a tuple of ints. Each value is the size of the corresponding dimension.
- .ndim is an int value representing the number of dimensions. Note that ndim == len(shape).
- .dtype is a numpy.dtype representing the data type of the value.
You can check all members. For example, if the size of the first dimension must be positive, you can write it like this:
utils.type_check.expect(x_type.shape[0] > 0)
You can also check data types with .dtype
:
utils.type_check.expect(x_type.dtype == np.float64)
And an error message looks like this:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].dtype == <class 'numpy.float64'>
Actual: float32 != <class 'numpy.float64'>
You can also check the kind of dtype.
This code checks whether the type is a floating point type:
utils.type_check.expect(x_type.dtype.kind == 'f')
You can also compare values between variables. For example, the following code checks whether the first and the second arguments have the same size along the second axis:
utils.type_check.expect(x_type.shape[1] == y_type.shape[1])
Internal mechanism of type check¶
How does it show an error message like "in_types[0].ndim == 2"?
If x_type were an object containing an ndim member variable, we could not show such an error message, because the expression would be evaluated to a boolean value by the Python interpreter.
Actually, x_type is an Expr object and doesn't have an ndim member variable itself.
Expr represents a syntax tree.
x_type.ndim makes an Expr object representing (getattr, x_type, 'ndim').
x_type.ndim == 2 makes an object like (eq, (getattr, x_type, 'ndim'), 2).
type_check.expect() gets an Expr object and evaluates it.
When it is True
, it causes no error and shows nothing.
Otherwise, this method shows a readable error message.
If you want to evaluate an Expr object, call its eval() method:
actual_type = x_type.eval()
actual_type
is an instance of TypeInfo
, while x_type
is an instance of Expr
.
In the same way, x_type.shape[0].eval()
returns an int value.
More powerful methods¶
The Expr class is more powerful. It supports all mathematical operators, such as + and *.
You can write a condition that the first dimension of x_type is four times the first dimension of y_type:
utils.type_check.expect(x_type.shape[0] == y_type.shape[0] * 4)
When x_type.shape[0] == 3
and y_type.shape[0] == 1
, users can get the error message below:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == in_types[1].shape[0] * 4
Actual: 3 != 4
To compare against a member variable of your function, wrap the value with Variable to show a readable error message:
x_type.shape[0] == utils.type_check.Variable(self.in_size, "in_size")
This code can check the equivalent condition below:
x_type.shape[0] == self.in_size
However, the latter condition doesn't convey the meaning of this value. When the condition is not satisfied, the latter code shows an unreadable error message:
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == 4 # what does '4' mean?
Actual: 3 != 4
Note that the second argument of utils.type_check.Variable
is only for readability.
The former shows this message:
chainer.utils.type_check.InvalidType: Expect: in_types[0].shape[0] == in_size # OK, `in_size` is a value that is given to the constructor
Actual: 3 != 4 # You can also check actual value here
Call functions¶
How do you check the summation of all values of a shape?
Expr also supports function calls:
sum = utils.type_check.Variable(np.sum, 'sum')
utils.type_check.expect(sum(x_type.shape) == 10)
Why do we need to wrap the function numpy.sum
with utils.type_check.Variable
?
x_type.shape
is not a tuple but an object of Expr
as we have seen before.
Therefore, numpy.sum(x_type.shape)
fails.
We need to evaluate this function lazily.
The above example produces an error message like this:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: sum(in_types[0].shape) == 10
Actual: 7 != 10
More complicated cases¶
How do you write a more complicated condition that can't be expressed with these operators?
You can evaluate an Expr and get its result value with the eval() method, then check the condition and raise an error by hand:
x_shape = x_type.shape.eval() # get actual shape (int tuple)
if not more_complicated_condition(x_shape):
expect_msg = 'Shape is expected to be ...'
actual_msg = 'Shape is ...'
raise utils.type_check.InvalidType(expect_msg, actual_msg)
Please write a readable error message. This code generates the following error message:
Traceback (most recent call last):
...
chainer.utils.type_check.InvalidType: Expect: Shape is expected to be ...
Actual: Shape is ...
Typical type check example¶
We show a typical type check for a function.
First check the number of arguments:
utils.type_check.expect(in_types.size() == 2)
in_types.size() returns an Expr object representing the number of arguments.
You can check it in the same way.
And then, get each type:
x_type, y_type = in_types
Don't get each value before checking in_types.size().
When the number of arguments is invalid, type_check.expect might output unhelpful error messages.
For example, the following code doesn't work when the size of in_types is 0:
utils.type_check.expect(
in_types.size() == 2,
in_types[0].ndim == 3,
)
After that, check each type:
utils.type_check.expect(
x_type.dtype == np.float32,
x_type.ndim == 3,
x_type.shape[1] == 2,
)
The above example works correctly even when x_type.ndim == 0
as all conditions are evaluated lazily.
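Putting the pieces of this section together, a complete check_type_forward() for a hypothetical two-input function could look like the following sketch (the concrete dtype and shape constraints are only for illustration):
def check_type_forward(self, in_types):
    # Check the number of inputs first.
    utils.type_check.expect(in_types.size() == 2)
    x_type, y_type = in_types

    # All conditions below are evaluated lazily.
    utils.type_check.expect(
        x_type.dtype == np.float32,
        y_type.dtype == np.float32,
        x_type.ndim == 2,
        x_type.shape[0] == y_type.shape[0],
    )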
Reference Manual¶
Core functionalities¶
Variable and Parameter¶
chainer.Variable |
Array with a structure to keep track of computation. |
chainer.as_variable |
Converts an array or a variable into Variable . |
chainer.Parameter |
Parameter variable that can be registered to a link. |
chainer.variable.VariableNode |
Node in the backward computational graph representing a variable. |
Function¶
chainer.Function |
Old-style interface of a differentiable function. |
chainer.FunctionAdapter |
Adapter class to wrap Function with FunctionNode. |
chainer.FunctionNode |
Function node of the computational graph. |
chainer.force_backprop_mode |
Make a context manager which enables back-propagation. |
chainer.no_backprop_mode |
Make a context manager which disables back-propagation. |
chainer.grad |
Computes the gradient of output variables w.r.t. the input variables. |
Link and Chain¶
chainer.Link |
Building block of model definitions. |
chainer.Chain |
Composable link with object-like interface. |
chainer.ChainList |
Composable link with list-like interface. |
Optimizer¶
chainer.optimizer.Hyperparameter |
Set of hyperparameter entries of an optimizer. |
chainer.UpdateRule |
Base class of all update rules. |
chainer.Optimizer |
Base class of all numerical optimizers. |
chainer.GradientMethod |
Base class of all single gradient-based optimizers. |
Hook functions¶
chainer.optimizer.WeightDecay |
Optimizer/UpdateRule hook function for weight decay regularization. |
chainer.optimizer.Lasso |
Optimizer/UpdateRule hook function for Lasso regularization. |
chainer.optimizer.GradientClipping |
Optimizer hook function for gradient clipping. |
chainer.optimizer.GradientNoise |
Optimizer/UpdateRule hook function for adding gradient noise. |
Serializer¶
chainer.AbstractSerializer |
Abstract base class of all serializers and deserializers. |
chainer.Serializer |
Base class of all serializers. |
chainer.Deserializer |
Base class of all deserializers. |
Dataset abstraction¶
Chainer supports a common interface for training and validation datasets. The dataset support consists of three components: datasets, iterators, and batch conversion functions.
A dataset represents a set of examples. The interface is determined only by the combination with the iterators you want to use on it. The built-in iterators of Chainer require the dataset to support the __getitem__ and __len__ methods. In particular, the __getitem__ method should support indexing by both an integer and a slice. We can easily support slice indexing by inheriting from DatasetMixin, in which case users only have to implement the get_example() method for indexing. Some iterators also restrict the type of each example. Basically, datasets are considered stateless objects, so we do not need to save the dataset as part of a checkpoint of the training procedure.
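For example, a minimal dataset built on DatasetMixin only needs __len__() and get_example(); the following sketch wraps an in-memory NumPy array (the data itself is just a placeholder):
class MyDataset(chainer.dataset.DatasetMixin):

    def __init__(self, values):
        self._values = values  # e.g., a NumPy array of examples

    def __len__(self):
        return len(self._values)

    def get_example(self, i):
        # DatasetMixin builds __getitem__ (including slice support)
        # on top of this method.
        return self._values[i]

dataset = MyDataset(np.arange(10, dtype=np.float32))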
An iterator iterates over the dataset, and at each iteration it yields a mini-batch of examples as a list. Iterators should support the Iterator interface, which includes the standard iterator protocol of Python. Iterators manage where to read next, which means they are stateful.
A batch conversion function converts the mini-batch into arrays to feed to the neural nets. It is also responsible for sending each array to an appropriate device. Chainer currently provides concat_examples() as the only example of batch conversion functions.
These components are all customizable, and are designed to have a minimal interface restricting the types of datasets and the ways to handle them. In most cases, though, the implementations provided by Chainer itself are enough to cover typical usages.
Chainer also has a light system to download, manage, and cache concrete examples of datasets. All datasets managed through the system are saved under the dataset root directory, which is determined by the CHAINER_DATASET_ROOT
environment variable, and can also be set by the set_dataset_root()
function.
Dataset representation¶
See Dataset examples for dataset implementations.
chainer.dataset.DatasetMixin |
Default implementation of dataset indexing. |
Iterator interface¶
See Iterator examples for dataset iterator implementations.
chainer.dataset.Iterator |
Base class of all dataset iterators. |
Batch conversion function¶
chainer.dataset.concat_examples |
Concatenates a list of examples into array(s). |
chainer.dataset.to_device |
Send an array to a given device. |
Dataset management¶
chainer.dataset.get_dataset_root |
Gets the path to the root directory to download and cache datasets. |
chainer.dataset.set_dataset_root |
Sets the root directory to download and cache datasets. |
chainer.dataset.cached_download |
Downloads a file and caches it. |
chainer.dataset.cache_or_load_file |
Caches a file if it does not exist, or loads it otherwise. |
Training loop abstraction¶
Chainer provides a standard implementation of the training loops under the chainer.training
module. It is built on top of many other core features of Chainer, including Variable and Function, Link/Chain/ChainList, Optimizer, Dataset, and Reporter/Summary. Compared to the training loop abstraction of other machine learning toolkits, Chainer's training framework aims at maximal flexibility while keeping simplicity for typical usages. Most components are pluggable, and users can override their definitions.
The core of the training loop abstraction is Trainer
, which implements the training loop itself. The training loop consists of two parts: one is Updater
, which actually updates the parameters to train, and the other is Extension
for arbitrary functionalities other than the parameter update.
The updater and some extensions use chainer.dataset and Iterator to scan the datasets and load mini-batches. The trainer also uses Reporter to collect the observed values, and some extensions use DictSummary to accumulate them and compute statistics.
You can find many examples of the usage of these training utilities in the official examples. You can also browse the extension implementations in Trainer extensions.
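The following sketch shows how these components are typically wired together; it only relies on the symbols imported at the top of this documentation plus chainer.links, and the deliberately tiny model (logistic regression on MNIST) is just a placeholder:
import chainer.links as L
from chainer.training import extensions

# Dataset and iterator.
train, test = datasets.get_mnist()
train_iter = iterators.SerialIterator(train, batch_size=128)

# Model and optimizer.
model = L.Classifier(L.Linear(784, 10))
optimizer = optimizers.SGD()
optimizer.setup(model)

# Updater + Trainer implement the training loop; extensions add functionality.
updater = training.StandardUpdater(train_iter, optimizer)
trainer = training.Trainer(updater, (5, 'epoch'), out='result')
trainer.extend(extensions.LogReport())
trainer.run()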
Trainer¶
chainer.training.Trainer |
The standard training loop in Chainer. |
Updater¶
chainer.training.Updater |
Interface of updater objects for trainers. |
chainer.training.StandardUpdater |
Standard implementation of Updater. |
chainer.training.ParallelUpdater |
Implementation of a parallel GPU Updater. |
chainer.training.updaters.MultiprocessParallelUpdater |
Implementation of a multiprocess parallel GPU Updater. |
Extension¶
chainer.training.Extension |
Base class of trainer extensions. |
chainer.training.make_extension |
Decorator to make given functions into trainer extensions. |
Trigger¶
A trigger is a callable object that decides when to process some specific event within the training loop. It takes a Trainer object as the argument, and returns True if some event should be fired.
It is mainly used to determine when to call an extension. It is also used to determine when to quit the training loop.
chainer.training.get_trigger |
Gets a trigger object. |
Debug mode¶
In debug mode, Chainer checks the values of variables at runtime and shows more detailed error messages. It helps you to debug your programs. However, it incurs some additional overhead.
You can enable debug mode with chainer.using_config()
:
with chainer.using_config('debug', True):
...
See Configuring Chainer for Chainer’s configuration mechanism.
You can also set CHAINER_DEBUG
environment variable to 1
to enable this mode.
In debug mode, Chainer checks all results of forward and backward computation, and if it finds a NaN value, it raises RuntimeError
.
Some functions and links also check the validity of input values.
You can check whether debug mode is enabled with the chainer.is_debug() function.
chainer.is_debug |
Get the debug mode. |
chainer.set_debug |
Set the debug mode. |
Deprecated interface¶
As of v2.0.0, it is recommended to turn on the debug mode using chainer.config.debug
.
See Configuring Chainer for the way to use the config object.
We keep the reference to the conventional way (which has been available since Chainer v1) below.
chainer.DebugMode |
Debug mode context. |
Configuring Chainer¶
Chainer provides some global settings that affect the behavior of some functionalities. Such settings can be configured using the unified configuration system. The system provides a transparent way to manage the configuration for each process and for each thread.
The configuration is managed by two global objects: chainer.global_config
and chainer.config
.
- The global_config object maintains the configuration shared in the Python process. This is an instance of the GlobalConfig class. It can be used just as a plain object, and users can freely set any attributes on it.
- The config object, on the other hand, maintains the configuration for the current thread. This is an instance of the LocalConfig class. It behaves like a thread-local object, and any attribute modifications are only visible to the current thread.
If no value is set to config for a given key, global_config is transparently referred to.
Thanks to this transparent lookup, users can always use config to read any configuration: the thread-local configuration is used if available, and otherwise the default global setting is used.
The following entries of the configuration are currently provided by Chainer. Some entries support environment variables to set the default values. Note that the default values are set in the global config.
chainer.config.cudnn_deterministic
- Flag to configure deterministic computations in cuDNN APIs. If it is True, convolution functions that use cuDNN use the deterministic mode (i.e., the computation is reproducible). Otherwise, the results of convolution functions using cuDNN may be non-deterministic in exchange for better performance. The default value is False.
chainer.config.debug
- Debug mode flag. If it is True, Chainer runs in debug mode. See Debug mode for more information on the debug mode. The default value is given by the CHAINER_DEBUG environment variable (set to 0 or 1) if available, otherwise False.
chainer.config.enable_backprop
- Flag to enable backpropagation support. If it is True, computational graphs are created during forward passes by FunctionNodes, allowing backpropagation to start from any Variable in the graph. Otherwise, computational graphs are not created and memory consumption is reduced, so calling backward() on the result of a function will not compute any gradients of any input. The default value is True.
chainer.config.keep_graph_on_report
- Flag to configure whether or not report() keeps the computational graph. If it is False, report() does not keep the computational graph when a Variable object is reported; that is, report() stores a copy of the Variable object which is purged from the computational graph. If it is True, report() just stores the Variable object as is, with the computational graph left attached. The default value is False.
chainer.config.train
- Training mode flag. If it is True, Chainer runs in training mode. Otherwise, it runs in the testing (evaluation) mode. This configuration alters the behavior of, e.g., chainer.functions.dropout() and chainer.functions.batch_normalization(). It does not reduce memory consumption or affect the creation of computational graphs required to compute gradients. The default value is True.
chainer.config.type_check
- Type checking mode flag. If it is True, Chainer checks the types (data types and shapes) of inputs on Function applications. Otherwise, it skips type checking. The default value is given by the CHAINER_TYPE_CHECK environment variable (set to 0 or 1) if available, otherwise True.
chainer.config.use_cudnn
- Flag to configure whether or not to use cuDNN. This is a ternary flag with 'always', 'auto', and 'never' as its allowed values. The meaning of each value is as follows:
  - If it is 'always', Chainer will try to use cuDNN everywhere if possible.
  - If it is 'auto', Chainer will use cuDNN only if it is known that the usage does not degrade the performance.
  - If it is 'never', Chainer will never use cuDNN anywhere.
  The default value is 'auto'.
chainer.config.autotune
- Autotune flag for convolutional networks. If it is True, Chainer uses the cuDNN autotune feature to find the fastest calculation process for chainer.links.Convolution2D, ConvolutionND, Deconvolution2D, or DeconvolutionND links. The default value is False.
Users can also define their own configurations. There are two ways:
- Use Chainer's configuration objects. In this case, it is strongly recommended to prefix the name with "user_" to avoid name conflicts with configurations introduced to Chainer in the future.
- Use your own configuration objects. Users can define their own configuration objects using chainer.configuration.GlobalConfig and chainer.configuration.LocalConfig. In this case, there is no need to take care of name conflicts.
Example
If you want to share a setting within the process, set an attribute to the global configuration.
>>> chainer.global_config.user_my_setting = 123
This value can be read transparently through the local config.
>>> chainer.config.user_my_setting
123
If you set an attribute to the local configuration, the value is only visible to the current thread.
>>> chainer.config.user_my_setting = 123
We often want to temporarily modify the configuration for the current thread.
It can be done by using using_config()
.
For example, if you only want to enable debug mode in a fragment of code, write as follows.
>>> with chainer.using_config('debug', True):
... pass # code running in the debug mode
We often want to switch to the test mode for an evaluation. This is also done in the same way.
>>> with chainer.using_config('train', False):
... pass # code running in the test mode
Note that Evaluator
automatically switches to the test mode, and thus you do not need to manually switch in the loss function for the evaluation.
You can also make your own code behave differently in training and test modes as follows.
if chainer.config.train:
    pass  # code only running in the training mode
else:
    pass  # code only running in the test mode
chainer.global_config |
Global configuration of Chainer. |
chainer.config |
Thread-local configuration of Chainer. |
chainer.using_config |
Context manager to temporarily change the thread-local configuration. |
chainer.configuration.GlobalConfig |
The plain object that represents the global configuration of Chainer. |
chainer.configuration.LocalConfig |
Thread-local configuration of Chainer. |
Utilities¶
Convolution/Deconvolution utilities¶
chainer.utils.get_conv_outsize |
Calculates output size of convolution. |
chainer.utils.get_deconv_outsize |
Calculates output size of deconvolution. |
CUDA utilities¶
Device, context and memory management on CuPy.
Chainer uses CuPy (with a very thin wrapper) to exploit the speed of GPU computation. The following modules and classes defined in CuPy are imported to the chainer.cuda module for convenience (refer to this table when reading Chainer's source code).
imported name | original name |
---|---|
chainer.cuda.cupy |
cupy |
chainer.cuda.ndarray |
cupy.ndarray |
chainer.cuda.cupy.cuda |
cupy.cuda |
chainer.cuda.Device |
cupy.cuda.Device |
chainer.cuda.Event |
cupy.cuda.Event |
chainer.cuda.Stream |
cupy.cuda.Stream |
Chainer replaces the default allocator of CuPy with its memory pool implementation. This enables us to reuse device memory over multiple forward/backward computations, and to reuse temporary arrays for consecutive elementwise operations.
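As a small illustration of how these thin wrappers are typically used, the following sketch writes CPU/GPU-agnostic code with get_array_module() (listed under CPU/GPU generic code support below); the GPU part is commented out and assumes CuPy is installed:
def softplus_like(x):
    # Works for both numpy.ndarray and cupy.ndarray inputs.
    xp = cuda.get_array_module(x)
    return xp.log1p(xp.exp(x))

x_cpu = np.array([0.0, 1.0, 2.0], dtype=np.float32)
y_cpu = softplus_like(x_cpu)

# With a GPU and CuPy available, the same function works on device arrays:
# y_gpu = softplus_like(cuda.to_gpu(x_cpu))
# y_back = cuda.to_cpu(y_gpu)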
Devices¶
chainer.cuda.get_device |
Gets the device from a device object, an ID integer or an array object. |
chainer.cuda.get_device_from_id |
Gets the device from an ID integer. |
chainer.cuda.get_device_from_array |
Gets the device from a list of CuPy array or a single CuPy array. |
CuPy array allocation and copy¶
chainer.cuda.copy |
Copies a cupy.ndarray object using the default stream. |
chainer.cuda.to_cpu |
Copies the given GPU array to host CPU. |
chainer.cuda.to_gpu |
Copies the given CPU array to the specified device. |
Kernel definition utilities¶
chainer.cuda.memoize |
Makes a function memoizing the result for each argument and device. |
chainer.cuda.clear_memo |
Clears the memoized results for all functions decorated by memoize. |
chainer.cuda.elementwise |
Creates an elementwise kernel function. |
chainer.cuda.reduce |
Creates a global reduction kernel function. |
CPU/GPU generic code support¶
chainer.cuda.get_array_module |
Gets an appropriate one from numpy or cupy . |
cuDNN support¶
chainer.cuda.set_max_workspace_size |
Sets the workspace size for cuDNN. |
chainer.cuda.get_max_workspace_size |
Gets the workspace size for cuDNN. |
Common algorithms¶
chainer.utils.WalkerAlias |
Implementation of Walker’s alias method. |
Reporter¶
Reporter¶
chainer.Reporter |
Object to which observed values are reported. |
chainer.get_current_reporter |
Returns the current reporter object. |
chainer.report |
Reports observed values with the current reporter object. |
chainer.report_scope |
Returns a report scope with the current reporter. |
Summary and DictSummary¶
chainer.Summary |
Online summarization of a sequence of scalars. |
chainer.DictSummary |
Online summarization of a sequence of dictionaries. |
Experimental feature annotation¶
chainer.utils.experimental |
Declares that user is using an experimental feature. |
Assertion and Testing¶
Chainer provides some facilities to make debugging easy.
Type checking utilities¶
FunctionNode
uses a systematic type checking of the chainer.utils.type_check
module.
It enables users to easily find bugs in forward and backward implementations.
You can find examples of type checking in some function implementations.
chainer.utils.type_check.Expr |
Abstract syntax tree of an expression. |
chainer.utils.type_check.expect |
Evaluates and tests all given expressions. |
chainer.utils.type_check.TypeInfo |
Type information of an input/gradient array. |
chainer.utils.type_check.TypeInfoTuple |
Type information of input/gradient tuples. |
Gradient checking utilities¶
Most function implementations are numerically tested by gradient checking.
This method computes numerical gradients of forward routines and compares their results with the corresponding backward routines.
It helps us isolate the source of issues when we hit an error in gradient computations.
The chainer.gradient_check module makes it easy to implement gradient checking.
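For example, the backward pass of an existing function can be checked against numerical differentiation as in the following sketch (the shapes and the choice of F.sigmoid are arbitrary):
x = np.random.uniform(-1, 1, (3, 4)).astype(np.float64)
y_grad = np.random.uniform(-1, 1, (3, 4)).astype(np.float64)

# Compares the analytical gradient of F.sigmoid with a numerical one;
# raises an assertion error if they differ too much.
gradient_check.check_backward(F.sigmoid, x, y_grad)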
chainer.gradient_check.check_backward |
Test backward procedure of a given function. |
chainer.gradient_check.numerical_grad |
Computes numerical gradient by finite differences. |
Standard Assertions¶
The assertions have the same names as NumPy's.
The difference from NumPy is that they can accept both numpy.ndarray
and cupy.ndarray
.
chainer.testing.assert_allclose |
Asserts if some corresponding element of x and y differs too much. |
Function testing utilities¶
Chainer provides some utilities for testing its functions.
chainer.testing.unary_math_function_unittest |
Decorator for testing unary mathematical Chainer functions. |
Standard Function implementations¶
Chainer provides basic FunctionNode
implementations in the
chainer.functions
package. Most of them are wrapped by plain Python
functions, which users should use.
Note
As of v1.5, the concept of parameterized functions is gone; they are replaced by corresponding Link implementations. They are
found in the links
namespace.
Activation functions¶
chainer.functions.clipped_relu |
Clipped Rectifier Unit function. |
chainer.functions.crelu |
Concatenated Rectified Linear Unit function. |
chainer.functions.elu |
Exponential Linear Unit function. |
chainer.functions.hard_sigmoid |
Element-wise hard-sigmoid function. |
chainer.functions.leaky_relu |
Leaky Rectified Linear Unit function. |
chainer.functions.log_softmax |
Channel-wise log-softmax function. |
chainer.functions.lstm |
Long Short-Term Memory units as an activation function. |
chainer.functions.maxout |
Maxout activation function. |
chainer.functions.prelu |
Parametric ReLU function. |
chainer.functions.relu |
Rectified Linear Unit function. |
chainer.functions.selu |
Scaled Exponential Linear Unit function. |
chainer.functions.sigmoid |
Element-wise sigmoid logistic function. |
chainer.functions.slstm |
S-LSTM units as an activation function. |
chainer.functions.softmax |
Softmax function. |
chainer.functions.softplus |
Element-wise softplus function. |
chainer.functions.tanh |
Elementwise hyperbolic tangent function. |
chainer.functions.tree_lstm |
TreeLSTM unit as an activation function. |
Array manipulations¶
chainer.functions.broadcast |
Broadcast given variables. |
chainer.functions.broadcast_to |
Broadcast a given variable to a given shape. |
chainer.functions.cast |
Cast an input variable to a given type. |
chainer.functions.concat |
Concatenates given variables along an axis. |
chainer.functions.copy |
Copies the input variable onto the specified device. |
chainer.functions.depth2space |
Computes the depth2space transformation for subpixel calculations. |
chainer.functions.dstack |
Concatenate variables along third axis (depth wise). |
chainer.functions.expand_dims |
Expands dimensions of an input variable without copy. |
chainer.functions.flatten |
Flatten a given array into one dimension. |
chainer.functions.flip |
Flips an input variable in reverse order along the given axis. |
chainer.functions.fliplr |
Flip array in the left/right direction. |
chainer.functions.flipud |
Flip array in the up/down direction. |
chainer.functions.get_item |
Extract elements from array with specified shape, axes and offsets. |
chainer.functions.hstack |
Concatenate variables horizontally (column wise). |
chainer.functions.im2col |
Extract patches from an image based on the filter. |
chainer.functions.pad |
Pad an input variable. |
chainer.functions.pad_sequence |
Pad given arrays to make a matrix. |
chainer.functions.permutate |
Permutates a given variable along an axis. |
chainer.functions.reshape |
Reshapes an input variable without copy. |
chainer.functions.resize_images |
Resize images to the given shape. |
chainer.functions.rollaxis |
Roll the axis backwards to the given position. |
chainer.functions.select_item |
Select elements stored in given indices. |
chainer.functions.separate |
Separates an array along a given axis. |
chainer.functions.space2depth |
Computes the space2depth transformation for subpixel calculations. |
chainer.functions.spatial_transformer_grid |
2D Spatial Transformer grid. |
chainer.functions.spatial_transformer_sampler |
2D Spatial Transformer sampler. |
chainer.functions.split_axis |
Splits given variables along an axis. |
chainer.functions.squeeze |
Removes dimensions of size one from the shape of an ndarray. |
chainer.functions.stack |
Concatenate variables along a new axis. |
chainer.functions.swapaxes |
Swap two axes of a variable. |
chainer.functions.tile |
Construct an array by tiling a given array. |
chainer.functions.transpose |
Permute the dimensions of an input variable without copy. |
chainer.functions.transpose_sequence |
Transpose a list of Variables. |
chainer.functions.vstack |
Concatenate variables vertically (row wise). |
chainer.functions.where |
Choose elements depending on condition. |
Neural network connections¶
chainer.functions.bilinear |
Applies a bilinear function based on given parameters. |
chainer.functions.convolution_2d |
Two-dimensional convolution function. |
chainer.functions.convolution_nd |
N-dimensional convolution function. |
chainer.functions.deconvolution_2d |
Two dimensional deconvolution function. |
chainer.functions.deconvolution_nd |
N-dimensional deconvolution function. |
chainer.functions.depthwise_convolution_2d |
Two-dimensional depthwise convolution function. |
chainer.functions.dilated_convolution_2d |
Two-dimensional dilated convolution function. |
chainer.functions.embed_id |
Efficient linear function for one-hot input. |
chainer.functions.linear |
Linear function, or affine transformation. |
chainer.functions.n_step_bigru |
Stacked Bi-directional Gated Recurrent Unit function. |
chainer.functions.n_step_bilstm |
Stacked Bi-directional Long Short-Term Memory function. |
chainer.functions.n_step_birnn |
Stacked Bi-directional RNN function for sequence inputs. |
chainer.functions.n_step_gru |
Stacked Uni-directional Gated Recurrent Unit function. |
chainer.functions.n_step_lstm |
Stacked Uni-directional Long Short-Term Memory function. |
chainer.functions.n_step_rnn |
Stacked Uni-directional RNN function for sequence inputs. |
Evaluation functions¶
chainer.functions.accuracy |
Computes multiclass classification accuracy of the minibatch. |
chainer.functions.binary_accuracy |
Computes binary classification accuracy of the minibatch. |
chainer.functions.classification_summary |
Calculates Precision, Recall, F beta Score, and support. |
chainer.functions.f1_score |
|
chainer.functions.precision |
|
chainer.functions.r2_score |
Computes the R^2 (coefficient of determination) regression score function. |
chainer.functions.recall |
Loss functions¶
chainer.functions.absolute_error |
Element-wise absolute error function. |
chainer.functions.bernoulli_nll |
Computes the negative log-likelihood of a Bernoulli distribution. |
chainer.functions.black_out |
BlackOut loss function. |
chainer.functions.connectionist_temporal_classification |
Connectionist Temporal Classification loss function. |
chainer.functions.contrastive |
Computes contrastive loss. |
chainer.functions.crf1d |
Calculates negative log-likelihood of linear-chain CRF. |
chainer.functions.argmax_crf1d |
Computes a state that maximizes a joint probability of the given CRF. |
chainer.functions.cross_covariance |
Computes the sum-squared cross-covariance penalty between y and z |
chainer.functions.decov |
Computes the DeCov loss of h |
chainer.functions.gaussian_kl_divergence |
Computes the KL-divergence of Gaussian variables from the standard one. |
chainer.functions.gaussian_nll |
Computes the negative log-likelihood of a Gaussian distribution. |
chainer.functions.hinge |
Computes the hinge loss for a one-of-many classification task. |
chainer.functions.huber_loss |
Loss function which is less sensitive to outliers in data than MSE. |
chainer.functions.mean_absolute_error |
Mean absolute error function. |
chainer.functions.mean_squared_error |
Mean squared error function. |
chainer.functions.negative_sampling |
Negative sampling loss function. |
chainer.functions.sigmoid_cross_entropy |
Computes cross entropy loss for pre-sigmoid activations. |
chainer.functions.softmax_cross_entropy |
Computes cross entropy loss for pre-softmax activations. |
chainer.functions.squared_error |
Squared error function. |
chainer.functions.triplet |
Computes triplet loss. |
Mathematical functions¶
chainer.functions.absolute |
Element-wise absolute. |
chainer.functions.arccos |
Elementwise arccosine function. |
chainer.functions.arcsin |
Elementwise arcsine function. |
chainer.functions.arctan |
Elementwise arctangent function. |
chainer.functions.arctan2 |
Elementwise arctangent function with two arguments. |
chainer.functions.argmax |
Returns index which holds maximum of array elements over a given axis. |
chainer.functions.argmin |
Returns index which holds minimum of array elements over a given axis. |
chainer.functions.average |
Calculate weighted average of array elements over a given axis. |
chainer.functions.batch_inv |
Computes the inverse of a batch of square matrices. |
chainer.functions.batch_l2_norm_squared |
L2 norm (a.k.a. Euclidean norm) squared. |
chainer.functions.batch_matmul |
Computes the batch matrix multiplications of two sets of arrays. |
chainer.functions.bias |
Elementwise summation with broadcasting. |
chainer.functions.ceil |
Elementwise ceil function. |
chainer.functions.clip |
Clips (limits) elements of input variable. |
chainer.functions.cos |
Elementwise cos function. |
chainer.functions.cosh |
Elementwise hyperbolic cosine function. |
chainer.functions.det |
Computes the determinant of a single square matrix. |
chainer.functions.batch_det |
Computes the determinant of a batch of square matrices. |
chainer.functions.exp |
Elementwise exponential function. |
chainer.functions.expm1 |
Elementwise exponential minus one function. |
chainer.functions.fix |
Elementwise fix function. |
chainer.functions.fmod |
Elementwise mod function. |
chainer.functions.floor |
Elementwise floor function. |
chainer.functions.identity |
Just returns input variables. |
chainer.functions.inv |
Computes the inverse of square matrix. |
chainer.functions.linear_interpolate |
Elementwise linear-interpolation function. |
chainer.functions.log |
Elementwise natural logarithm function. |
chainer.functions.log10 |
Elementwise logarithm function to the base 10. |
chainer.functions.log1p |
Elementwise natural logarithm plus one function. |
chainer.functions.log2 |
Elementwise logarithm function to the base 2. |
chainer.functions.logsumexp |
Log-sum-exp of array elements over a given axis. |
chainer.functions.matmul |
Computes the matrix multiplication of two arrays. |
chainer.functions.max |
Maximum of array elements over a given axis. |
chainer.functions.maximum |
Element-wise maximum of input variables. |
chainer.functions.mean |
Calculate weighted average of array elements over a given axis. |
chainer.functions.min |
Minimum of array elements over a given axis. |
chainer.functions.minimum |
Element-wise minimum of input variables. |
chainer.functions.prod |
Product of array elements over a given axis. |
chainer.functions.rsqrt |
Computes elementwise reciprocal of square root of input \(x_i\). |
chainer.functions.scale |
Elementwise product with broadcasting. |
chainer.functions.sin |
Elementwise sin function. |
chainer.functions.sinh |
Elementwise hyperbolic sine function. |
chainer.functions.sign |
Elementwise sign function. |
chainer.functions.sqrt |
Elementwise square root function. |
chainer.functions.square |
Elementwise square function. |
chainer.functions.squared_difference |
Squared difference of input variables. |
chainer.functions.sum |
Sum of array elements over a given axis. |
chainer.functions.tanh |
Elementwise hyperbolic tangent function. |
chainer.functions.tan |
Elementwise tan function. |
Noise injections¶
chainer.functions.dropout |
Drops elements of input variable randomly. |
chainer.functions.gaussian |
Gaussian sampling function. |
chainer.functions.simplified_dropconnect |
Linear unit regularized by simplified dropconnect. |
chainer.functions.zoneout |
Drops elements of input variable and sets to previous variable randomly. |
Normalization functions¶
chainer.functions.batch_normalization |
Batch normalization function. |
chainer.functions.batch_renormalization |
Batch renormalization function. |
chainer.functions.fixed_batch_normalization |
Batch normalization function with fixed statistics. |
chainer.functions.fixed_batch_renormalization |
|
chainer.functions.layer_normalization |
Layer normalization. |
chainer.functions.local_response_normalization |
Local response normalization across neighboring channels. |
chainer.functions.normalize |
L2 norm squared (a.k.a. Euclidean norm). |
Spatial pooling¶
chainer.functions.average_pooling_2d |
Spatial average pooling function. |
chainer.functions.average_pooling_nd |
N-dimensionally spatial average pooling function. |
chainer.functions.max_pooling_2d |
Spatial max pooling function. |
chainer.functions.max_pooling_nd |
N-dimensionally spatial max pooling function. |
chainer.functions.roi_pooling_2d |
Spatial Region of Interest (ROI) pooling function. |
chainer.functions.spatial_pyramid_pooling_2d |
Spatial pyramid pooling function. |
chainer.functions.unpooling_2d |
Inverse operation of pooling for 2d array. |
chainer.functions.unpooling_nd |
Inverse operation of N-dimensional spatial pooling. |
chainer.functions.upsampling_2d |
Upsampling using pooling indices. |
Utility functions¶
chainer.functions.forget |
Calls a function without storing intermediate results. |
Standard Link implementations¶
Chainer provides many Link
implementations in the
chainer.links
package.
Note
Some of the links are originally defined in the chainer.functions
namespace. They are still left in the namespace for backward compatibility,
though it is strongly recommended to use them via the chainer.links
package.
Learnable connections¶
chainer.links.Bias |
Broadcasted elementwise summation with learnable parameters. |
chainer.links.Bilinear |
Bilinear layer that performs tensor multiplication. |
chainer.links.ChildSumTreeLSTM |
Child-Sum TreeLSTM unit. |
chainer.links.Convolution2D |
Two-dimensional convolutional layer. |
chainer.links.ConvolutionND |
N-dimensional convolution layer. |
chainer.links.Deconvolution2D |
Two dimensional deconvolution function. |
chainer.links.DeconvolutionND |
N-dimensional deconvolution function. |
chainer.links.DepthwiseConvolution2D |
Two-dimensional depthwise convolutional layer. |
chainer.links.DilatedConvolution2D |
Two-dimensional dilated convolutional layer. |
chainer.links.EmbedID |
Efficient linear layer for one-hot input. |
chainer.links.GRU |
Stateful Gated Recurrent Unit function (GRU) |
chainer.links.Highway |
Highway module. |
chainer.links.Inception |
Inception module of GoogLeNet. |
chainer.links.InceptionBN |
Inception module of the new GoogLeNet with BatchNormalization. |
chainer.links.Linear |
Linear layer (a.k.a. fully-connected layer). |
chainer.links.LSTM |
Fully-connected LSTM layer. |
chainer.links.MLPConvolution2D |
Two-dimensional MLP convolution layer of Network in Network. |
chainer.links.NaryTreeLSTM |
N-ary TreeLSTM unit. |
chainer.links.NStepBiGRU |
Stacked Bi-directional GRU for sequences. |
chainer.links.NStepBiLSTM |
Stacked Bi-directional LSTM for sequences. |
chainer.links.NStepBiRNNReLU |
Stacked Bi-directional RNN for sequences. |
chainer.links.NStepBiRNNTanh |
Stacked Bi-directional RNN for sequences. |
chainer.links.NStepGRU |
Stacked Uni-directional GRU for sequences. |
chainer.links.NStepLSTM |
Stacked Uni-directional LSTM for sequences. |
chainer.links.NStepRNNReLU |
Stacked Uni-directional RNN for sequences. |
chainer.links.NStepRNNTanh |
Stacked Uni-directional RNN for sequences. |
chainer.links.Parameter |
Link that just holds a parameter and returns it. |
chainer.links.Scale |
Broadcasted elementwise product with learnable parameters. |
chainer.links.StatefulGRU |
Stateful Gated Recurrent Unit function (GRU). |
chainer.links.StatelessGRU |
Stateless Gated Recurrent Unit function (GRU). |
chainer.links.StatefulMGU |
|
chainer.links.StatelessMGU |
|
chainer.links.StatefulPeepholeLSTM |
Fully-connected LSTM layer with peephole connections. |
chainer.links.StatefulZoneoutLSTM |
|
chainer.links.StatelessLSTM |
Stateless LSTM layer. |
Activation/loss/normalization functions with parameters¶
chainer.links.BatchNormalization |
Batch normalization layer on outputs of linear or convolution functions. |
chainer.links.BatchRenormalization |
Batch renormalization layer on outputs of linear or convolution functions. |
chainer.links.LayerNormalization |
Layer normalization layer on outputs of linear functions. |
chainer.links.BinaryHierarchicalSoftmax |
Hierarchical softmax layer over binary tree. |
chainer.links.BlackOut |
BlackOut loss layer. |
chainer.links.CRF1d |
Linear-chain conditional random field loss layer. |
chainer.links.SimplifiedDropconnect |
Fully-connected layer with simplified dropconnect regularization. |
chainer.links.PReLU |
Parametric ReLU function as a link. |
chainer.links.Maxout |
Fully-connected maxout layer. |
chainer.links.NegativeSampling |
Negative sampling loss layer. |
Machine learning models¶
chainer.links.Classifier |
A simple classifier model. |
Pre-trained models¶
Pre-trained models are mainly used to achieve good performance with a small
dataset, or to extract semantic feature vectors. Although CaffeFunction
automatically loads a pre-trained model released as a caffemodel,
the following link models provide an interface for automatically converting
caffemodels, and easily extracting semantic feature vectors.
For example, to extract the feature vectors with VGG16Layers
, which is
a common pre-trained model in the field of image recognition,
users need to write the following few lines:
from chainer.links import VGG16Layers
from PIL import Image
model = VGG16Layers()
img = Image.open("path/to/image.jpg")
feature = model.extract([img], layers=["fc7"])["fc7"]
where fc7
denotes a layer before the last fully-connected layer.
Unlike the usual links, these classes automatically load all the
parameters from the pre-trained models during initialization.
VGG16Layers¶
chainer.links.VGG16Layers |
A pre-trained CNN model with 16 layers provided by VGG team. |
chainer.links.model.vision.vgg.prepare |
Converts the given image to the numpy array for VGG models. |
GoogLeNet¶
chainer.links.GoogLeNet |
A pre-trained GoogLeNet model provided by BVLC. |
chainer.links.model.vision.googlenet.prepare |
Converts the given image to the numpy array for GoogLeNet. |
Residual Networks¶
chainer.links.model.vision.resnet.ResNetLayers |
A pre-trained CNN model provided by MSRA. |
chainer.links.ResNet50Layers |
A pre-trained CNN model with 50 layers provided by MSRA. |
chainer.links.ResNet101Layers |
A pre-trained CNN model with 101 layers provided by MSRA. |
chainer.links.ResNet152Layers |
A pre-trained CNN model with 152 layers provided by MSRA. |
chainer.links.model.vision.resnet.prepare |
Converts the given image to the numpy array for ResNets. |
Compatibility with other frameworks¶
chainer.links.TheanoFunction |
Theano function wrapper. |
chainer.links.caffe.CaffeFunction |
Caffe emulator based on the model file of Caffe. |
Optimizers¶
chainer.optimizers.AdaDelta |
Zeiler’s ADADELTA. |
chainer.optimizers.AdaGrad |
AdaGrad optimizer. |
chainer.optimizers.Adam |
Adam optimizer. |
chainer.optimizers.MomentumSGD |
Momentum SGD optimizer. |
chainer.optimizers.NesterovAG |
Nesterov’s Accelerated Gradient. |
chainer.optimizers.RMSprop |
RMSprop optimizer. |
chainer.optimizers.RMSpropGraves |
Alex Graves’s RMSprop. |
chainer.optimizers.SGD |
Vanilla Stochastic Gradient Descent. |
chainer.optimizers.SMORMS3 |
Simon Funk’s SMORMS3. |
Serializers¶
Serialization in NumPy NPZ format¶
NumPy serializers can be used in any environment that Chainer runs in.
They consist of an asymmetric serializer/deserializer pair because numpy.savez() does not support online serialization.
Therefore, serialization is a two-step manipulation: first packing the objects into a flat dictionary, and then serializing it into NPZ format.
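In practice, most users go through the convenience functions save_npz() and load_npz(), which wrap this two-step procedure; a minimal sketch follows, where model stands for any Link or Chain that has already been set up:
# Save the parameters and persistent values of `model` to an NPZ file.
serializers.save_npz('my.model', model)

# Later, restore them into a compatible model instance.
serializers.load_npz('my.model', model)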
chainer.serializers.DictionarySerializer |
Serializer for dictionary. |
chainer.serializers.NpzDeserializer |
Deserializer for NPZ format. |
chainer.serializers.save_npz |
Saves an object to the file in NPZ format. |
chainer.serializers.load_npz |
Loads an object from the file in NPZ format. |
Serialization in HDF5 format¶
chainer.serializers.HDF5Serializer |
Serializer for HDF5 format. |
chainer.serializers.HDF5Deserializer |
Deserializer for HDF5 format. |
chainer.serializers.save_hdf5 |
Saves an object to the file in HDF5 format. |
chainer.serializers.load_hdf5 |
Loads an object from the file in HDF5 format. |
Function hooks¶
Chainer provides a function-hook mechanism that enriches
the behavior of forward and backward propagation of
FunctionNodes.
Base class¶
chainer.FunctionHook |
Base class of hooks for Functions. |
Concrete function hooks¶
chainer.function_hooks.CUDAProfileHook |
|
chainer.function_hooks.CupyMemoryProfileHook |
Function hook for measuring memory usage of functions in cupy memory pool. |
chainer.function_hooks.PrintHook |
Function hook that prints debug information. |
chainer.function_hooks.TimerHook |
Function hook for measuring elapsed time of functions. |
Weight Initializers¶
Weight initializers are used to initialize arrays.
They destructively modify the content of numpy.ndarray
or cupy.ndarray
.
Typically, weight initializers are passed to Links to initialize their weights and biases.
A weight initializer can be any of the following objects.
- A chainer.Initializer class instance.
- A Python or NumPy scalar or a numpy.ndarray.
- A callable that takes an array (numpy.ndarray or cupy.ndarray) and feeds the initial data into it.
- None, in which case the default initializer is used. Unless explicitly specified, it is LeCunNormal with scale value 1.
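For example, initializers are typically passed to a link through arguments such as initialW and initial_bias of Linear; a minimal sketch (the layer sizes are arbitrary):
from chainer import initializers
import chainer.links as L

# Initialize the weight matrix with HeNormal and the bias with zeros.
layer = L.Linear(100, 50,
                 initialW=initializers.HeNormal(),
                 initial_bias=initializers.Zero())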
Base class¶
chainer.Initializer |
Initializes array. |
Concrete initializers¶
chainer.initializers.Identity |
Initializes array with the identity matrix. |
chainer.initializers.Constant |
Initializes array with constant value. |
chainer.initializers.Zero |
Returns initializer that initializes array with the all-zero array. |
chainer.initializers.One |
Returns initializer that initializes array with the all-one array. |
chainer.initializers.NaN |
Returns initializer that initializes array with the all-NaN array. |
chainer.initializers.Normal |
Initializes array with a normal distribution. |
chainer.initializers.LeCunNormal |
Initializes array with scaled Gaussian distribution. |
chainer.initializers.GlorotNormal |
Initializes array with scaled Gaussian distribution. |
chainer.initializers.HeNormal |
Initializes array with scaled Gaussian distribution. |
chainer.initializers.Orthogonal |
Initializes array with an orthogonal system. |
chainer.initializers.Uniform |
Initializes array with a scaled uniform distribution. |
chainer.initializers.LeCunUniform |
Initializes array with a scaled uniform distribution. |
chainer.initializers.GlorotUniform |
Initializes array with a scaled uniform distribution. |
chainer.initializers.HeUniform |
Initializes array with scaled uniform distribution. |
Helper function¶
chainer.initializers.generate_array |
Return initialized array. |
Dataset examples¶
The most basic dataset
implementation is an array.
Both NumPy and CuPy arrays can be used directly as datasets.
In many cases, though, simple arrays are not enough to write the training procedure. In order to cover most such cases, Chainer provides many built-in implementations of datasets.
These built-in datasets are divided into two groups.
One is a group of general datasets.
Most of them are wrappers of other datasets that introduce some structure (e.g., tuple or dict) to each data point.
The other one is a group of concrete, popular datasets.
These concrete examples use the downloading utilities in the chainer.dataset
module to cache downloaded and converted datasets.
General datasets¶
General datasets are further divided into four types.
The first one is DictDataset
and TupleDataset
, both of which combine other datasets and introduce some structures on them.
The second one is ConcatenatedDataset
and SubDataset
.
ConcatenatedDataset
represents a concatenation of existing datasets. It can be used to merge datasets and make a larger dataset.
SubDataset
represents a subset of an existing dataset. It can be used to separate a dataset for hold-out validation or cross validation. Convenient functions to make random splits are also provided.
The third one is TransformDataset
, which wraps around a dataset by applying a function to data indexed from the underlying dataset.
It can be used to modify the behavior of a dataset that is already prepared.
The last one is a group of domain-specific datasets. Currently, ImageDataset
and LabeledImageDataset
are provided for datasets of images.
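For example, a random hold-out split combined with an on-the-fly transform can be set up as in the following sketch (the scaling transform and the split size are only illustrative):
train_val, _ = datasets.get_mnist()

# Randomly split the dataset into training and validation subsets.
train, valid = datasets.split_dataset_random(train_val, 50000)

# Apply a transform lazily each time an example is indexed.
def scale_example(example):
    x, t = example
    return x * 2.0, t

train = datasets.TransformDataset(train, scale_example)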
DictDataset¶
chainer.datasets.DictDataset |
Dataset of a dictionary of datasets. |
TupleDataset¶
chainer.datasets.TupleDataset |
Dataset of tuples from multiple equal-length datasets. |
ConcatenatedDataset¶
class chainer.datasets.ConcatenatedDataset(*datasets)¶
Dataset which concatenates some base datasets.
This dataset wraps some base datasets and works as a concatenated dataset. For example, if a base dataset with 10 samples and another base dataset with 20 samples are given, this dataset works as a dataset which has 30 samples.
Parameters: datasets – The underlying datasets. Each dataset has to support __len__() and __getitem__().

get_example(i)¶
Returns the i-th example.
Implementations should override it. It should raise IndexError if the index is invalid.
Parameters: i (int) – The index of the example.
Returns: The i-th example.
SubDataset¶
chainer.datasets.SubDataset |
Subset of a base dataset. |
chainer.datasets.split_dataset |
Splits a dataset into two subsets. |
chainer.datasets.split_dataset_random |
Splits a dataset into two subsets randomly. |
chainer.datasets.get_cross_validation_datasets |
Creates a set of training/test splits for cross validation. |
chainer.datasets.get_cross_validation_datasets_random |
Creates a set of training/test splits for cross validation randomly. |
TransformDataset¶
chainer.datasets.TransformDataset |
Dataset that indexes the base dataset and transforms the data. |
ImageDataset¶
chainer.datasets.ImageDataset |
Dataset of images built from a list of paths to image files. |
LabeledImageDataset¶
chainer.datasets.LabeledImageDataset |
Dataset of image and label pairs built from a list of paths and labels. |
Concrete datasets¶
chainer.datasets.get_mnist |
Gets the MNIST dataset. |
chainer.datasets.get_cifar10 |
Gets the CIFAR-10 dataset. |
chainer.datasets.get_cifar100 |
Gets the CIFAR-100 dataset. |
chainer.datasets.get_ptb_words |
Gets the Penn Tree Bank dataset as long word sequences. |
chainer.datasets.get_ptb_words_vocabulary |
Gets the Penn Tree Bank word vocabulary. |
chainer.datasets.get_svhn |
Gets the SVHN dataset. |
Iterator examples¶
Chainer provides some iterators that implement typical strategies to create mini-batches by iterating over datasets.
SerialIterator is the simplest one; it extracts mini-batches in the main thread.
MultiprocessIterator
is a parallelized version of SerialIterator
. It maintains worker subprocesses to load the next mini-batch in parallel.
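A minimal sketch of scanning a dataset once with SerialIterator follows (the batch size is arbitrary):
train, _ = datasets.get_mnist()
it = iterators.SerialIterator(train, batch_size=32, repeat=False, shuffle=True)

for batch in it:
    # `batch` is a list of examples; concat_examples() converts it to arrays.
    x, t = chainer.dataset.concat_examples(batch)
    # ... forward/backward computation would go here ...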
chainer.iterators.SerialIterator |
Dataset iterator that serially reads the examples. |
chainer.iterators.MultiprocessIterator |
Dataset iterator that loads examples in parallel. |
Trainer extensions¶
chainer.training.extensions.dump_graph |
Returns a trainer extension to dump a computational graph. |
chainer.training.extensions.Evaluator |
Trainer extension to evaluate models on a validation set. |
chainer.training.extensions.ExponentialShift |
Trainer extension to exponentially shift an optimizer attribute. |
chainer.training.extensions.LinearShift |
Trainer extension to change an optimizer attribute linearly. |
chainer.training.extensions.LogReport |
Trainer extension to output the accumulated results to a log file. |
chainer.training.extensions.observe_lr |
Returns a trainer extension to record the learning rate. |
chainer.training.extensions.observe_value |
Returns a trainer extension to continuously record a value. |
chainer.training.extensions.snapshot |
Returns a trainer extension to take snapshots of the trainer. |
chainer.training.extensions.snapshot_object |
Returns a trainer extension to take snapshots of a given object. |
chainer.training.extensions.ParameterStatistics |
Trainer extension to report parameter statistics. |
chainer.training.extensions.PlotReport |
Trainer extension to output plots. |
chainer.training.extensions.PrintReport |
Trainer extension to print the accumulated results. |
chainer.training.extensions.ProgressBar |
Trainer extension to print a progress bar and recent training status. |
Trainer triggers¶
chainer.training.triggers.BestValueTrigger |
Trigger invoked when specific value becomes best. |
chainer.training.triggers.IntervalTrigger |
Trigger based on a fixed interval. |
chainer.training.triggers.ManualScheduleTrigger |
Trigger invoked at specified point(s) of iterations or epochs. |
chainer.training.triggers.MaxValueTrigger |
Trigger invoked when specific value becomes maximum. |
chainer.training.triggers.MinValueTrigger |
Trigger invoked when specific value becomes minimum. |
Caffe Reference Model Support¶
Caffe is a popular framework maintained by BVLC at UC Berkeley. It is widely used by computer vision communities, and aims at fast computation and easy usage without any programming. The BVLC team provides trained reference models in their Model Zoo, which is one of the reasons why this framework has become popular.
Chainer can import the reference models and emulate the network by Link
implementations.
This functionality is provided by the chainer.links.caffe.CaffeFunction
class.
chainer.links.caffe.CaffeFunction |
Caffe emulator based on the model file of Caffe. |
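A minimal sketch of loading a Caffe reference model and running a forward pass through it. The model file name and the output blob name are assumptions based on the BVLC GoogLeNet reference model, not part of Chainer itself:
from chainer.links.caffe import CaffeFunction

func = CaffeFunction('bvlc_googlenet.caffemodel')  # hypothetical local path
x = Variable(np.zeros((1, 3, 224, 224), dtype=np.float32))
y, = func(inputs={'data': x}, outputs=['loss3/classifier'])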
Visualization of Computational Graph¶
As neural networks get larger and more complicated, it becomes much harder to confirm that their architectures are constructed properly.
Chainer supports visualization of computational graphs.
Users can generate computational graphs by invoking build_computational_graph()
. Generated computational graphs are dumped to a specified format (currently only the DOT language is supported).
Basic usage is as follows:
import chainer.computational_graph as c
...
g = c.build_computational_graph(vs)
with open('path/to/output/file', 'w') as o:
o.write(g.dump())
where vs is a list of Variable instances and g is an instance of ComputationalGraph.
This code generates the computational graph that is backward-reachable (i.e., reachable by repeatedly following edges backward) from at least one of vs.
Here is an example of (a part of) the generated graph (inception(3a) in GoogLeNet). This example is from example/imagenet
.
chainer.computational_graph.build_computational_graph |
Builds a graph of functions and variables backward-reachable from outputs. |
chainer.computational_graph.ComputationalGraph |
Class that represents computational graph. |
Environment variables¶
Here are the environment variables Chainer uses.
CHAINER_CUDNN |
Set 0 to disable cuDNN in Chainer.
Otherwise cuDNN is enabled automatically. |
CHAINER_SEED |
Default seed value of random number generators for CUDA. If it is not set, the seed value is generated from Python random module. Set an integer value in decimal format. |
CHAINER_TYPE_CHECK |
Set 0 to disable type checking.
Otherwise type checking is enabled automatically.
See Type checking utilities for details. |
CHAINER_DEBUG |
Set 1 to enable debug mode. It is disabled by default.
In debug mode, Chainer performs various runtime checks that can help
debug user’s code at the cost of some overhead.
For details, see Debug mode. |
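For example, these variables can be set in the shell before launching a script (train.py below is a hypothetical script name):
$ export CHAINER_TYPE_CHECK=0
$ CHAINER_DEBUG=1 python train.py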
API Compatibility Policy¶
This document explains the design policy on compatibility of Chainer APIs. The development team should follow this policy when deciding to add, extend, or change APIs and their behaviors.
This document is written for both users and developers. Users can decide how much their code should depend on Chainer's implementation details based on this document. Developers should read through this document before creating pull requests that contain changes to the interface. Note that this document may contain ambiguities on the level of supported compatibilities.
Targeted Versions¶
This policy applies to Chainer v2.0.0 and higher. Note that it does not apply to earlier versions of Chainer. For older versions of Chainer, see the old version of the API Compatibility Policy.
Versioning and Backward Compatibility¶
The versioning of Chainer follows PEP 440 and, in part, Semantic Versioning. See the Contribution Guide for details of versioning.
Backward compatibility is kept for revision updates and minor updates, which are applied to the stable version. A major update from the latest release candidate basically keeps backward compatibility, although it is not guaranteed. Any pre-release may break backward compatibility.
Breaking the Compatibility¶
We sometimes need to break the backward compatibility to improve the framework design and to support new kinds of machine learning methods. Such a change is only made into pre-releases (alpha, beta, and release candidate) and sometimes into the major update.
A change that breaks the compatibility affects user code. We try to lower the cost of adapting your code to the newer version. The following list shows examples of what we can do to reduce the cost (note: this is not a promise; what kind of actions we can take depends on the situation).
- When an argument is removed from an existing API, passing the argument to the updated API will emit an error with a special error message. The error message tells you how to fix your code.
- When a function or a class is removed, we make the current stable version emit a deprecation warning. Note that deprecation warnings are not printed by default in Python; you have to turn them on manually with warnings.simplefilter('always', DeprecationWarning).
- When the definition of a link is changed, we try to enable it to deserialize a model dumped with an older version of Chainer. In most cases, we cannot guarantee that a model serialized with a newer version of Chainer is loadable by an older version of Chainer.
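For example, a minimal way to turn the deprecation warnings on in your own script:
import warnings

warnings.simplefilter('always', DeprecationWarning)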
Note
Since Chainer v2, we have stopped adopting any fixed process for breaking backward compatibility (e.g., a fixed schedule for deprecating and removing a feature) in order to keep development fast enough to support cutting-edge research. This does not mean we have stopped caring about the maintainability of user code; we still pay close attention to not breaking user code.
Experimental APIs¶
Thanks to many contributors, we have introduced many new features to Chainer.
However, we have sometimes released new features only to later notice that their APIs are not appropriate. In particular, we sometimes realize that an API is likely to be modified in the near future because we do not have enough knowledge about how well the current design fits real usage. The objective of experimental APIs is to declare that the APIs are likely to be updated in the near future so that users can decide whether to use them.
Any newly added API can be marked as experimental. Any API that is not experimental is called stable in this document.
Note
Undocumented behaviors are not considered APIs, so they can be changed at any time (even in a revision update). The treatment of undocumented behaviors is described in the Undocumented behaviors section.
When users use an experimental API for the first time, a warning is raised once for that API, unless users explicitly disable the emission of the warnings in advance.
See the documentation of chainer.utils.experimental() to learn how developers mark APIs as experimental and how users can enable or disable the warnings.
Note
It is up to developers whether an API should be annotated as experimental or not. We recommend marking APIs as experimental if they implement large modules or commit to one of several possible design choices.
Supported Backward Compatibility¶
This section defines backward compatibilities that revision updates must maintain.
Documented Interface¶
Chainer has the official API documentation. Many applications can be written based on the documented features. We support backward compatibility of documented features. In other words, code based only on the documented features runs correctly with revision-updated versions.
Developers are encouraged to use clearly distinguishable names for implementation-detail objects. For example, attributes outside of the documented APIs should have names prefixed with one or more underscores.
Note
Although it is not stated as a rule, we also try to keep the compatibility for any interface that looks like a stable feature. For example, if the name of a symbol (function, class, method, attribute, etc.) is not prefixed by an underscore and the API is not experimental, the API should be kept over revision updates even if it is not documented.
Undocumented behaviors¶
Behaviors of Chainer implementation not stated in the documentation are undefined. Undocumented behaviors are not guaranteed to be stable between different revision versions.
Even revision updates may contain changes to undefined behaviors. One of the typical examples is a bug fix. Another example is an improvement on implementation, which may change the internal object structures not shown in the documentation. As a consequence, even revision updates do not support compatibility of pickling, unless the full layout of pickled objects is clearly documented.
Documentation Error¶
Compatibility is basically determined based on the documentation, although it sometimes contains errors. Assuming that the documentation always takes precedence over the implementation may make the APIs confusing. We therefore may fix documentation errors in any update, which may break compatibility with respect to the documentation.
Note
Developers should not fix the documentation and implementation of the same functionality at the same time in revision updates as a “bug fix” unless the bug is so critical that no users are expected to be using the old version correctly.
Object Attributes and Properties¶
Object attributes and properties are sometimes replaced by each other. This does not break user code, except for code that depends on how the attributes and properties are implemented.
Functions and Methods¶
Methods may be replaced by callable attributes, keeping the compatibility of parameters and return values. This does not break user code, except for code that depends on how the methods and callable attributes are implemented.
Exceptions and Warnings¶
The specification of raising exceptions is considered part of standard backward compatibility. No exception will be raised in future revision versions for correct usage that the documentation allows.
On the other hand, warnings may be added in any revision update for any API. In other words, revision updates do not keep backward compatibility of warnings.
Model Format Compatibility¶
Links and chains serialized by the official serializers that Chainer provides are correctly loaded by future versions. They might not be correctly loaded by lower versions of Chainer.
Note
Current serialization APIs do not support versioning. This prevents us from introducing changes to the layout of objects that support serialization. We are discussing versioning of the serialization APIs.
Installation Compatibility¶
The installation process is another concern of compatibility.
Any changes to the set of dependent libraries that force modifications to existing environments should be made in pre-releases and major updates. Such changes include the following cases:
- dropping supported versions of dependent libraries (e.g. dropping cuDNN v2)
- adding new mandatory dependencies (e.g. adding h5py to setup_requires)
Note
We sometimes have to narrow the supported versions due to bugs in the specific versions of libraries. In such a case, we may drop the support of those versions even in revision updates unless a workaround is found for the issue.
Contribution Guide¶
This is a guide for all contributions to Chainer. The development of Chainer is hosted on the official repository at GitHub. Anyone who wants to register an issue or send a pull request should read through this document.
Note
Many parts of this document were updated for v2. We strongly recommend that all contributors from v1 read through the document again.
Classification of Contributions¶
There are several ways to contribute to Chainer community:
- Registering an issue
- Sending a pull request (PR)
- Sending a question/reply to StackOverflow (with the chainer tag) or the Chainer User Group
- Open-sourcing an external example
- Writing a post about Chainer
This document mainly focuses on 1 and 2, though other contributions are also appreciated.
Development Cycle¶
This section explains the development process of Chainer. Before contributing to Chainer, it is strongly recommended to understand the development cycle.
Versioning¶
The versioning of Chainer follows PEP 440 and, in part, Semantic Versioning.
The version number consists of three or four parts: X.Y.Zw, where X denotes the major version, Y denotes the minor version, Z denotes the revision number, and the optional w denotes the pre-release suffix.
While the major, minor, and revision numbers follow the rules of semantic versioning, the pre-release suffix follows PEP 440 so that the version string plays well with the Python ecosystem.
Note that a major update basically does not contain compatibility-breaking changes relative to the last release candidate (RC). This is not a strict rule, though; if there is a critical API bug that we have to fix for the major version, we may include breaking changes in the major version release.
As for the backward compatibility, see API Compatibility Policy.
Release Cycle¶
Starting from v2.0.0, we are developing two tracks of versions at the same time. The first one is the track of stable versions, which is a series of revision updates for the latest major version. The second one is the track of development versions, which is a series of pre-releases for the upcoming major version.
Consider that X.0.0
is the latest major version and Y.0.0
, Z.0.0
are the succeeding major versions.
Then, the timeline of the updates is depicted by the following table.
Date | ver X | ver Y | ver Z |
---|---|---|---|
0 weeks | X.0.0rc1 | – | – |
4 weeks | X.0.0 | Y.0.0a1 | – |
8 weeks | X.1.0* | Y.0.0b1 | – |
12 weeks | X.2.0* | Y.0.0rc1 | – |
16 weeks | – | Y.0.0 | Z.0.0a1 |
(* These might be revision releases)
The dates shown in the left-most column are relative to the release of X.0.0rc1
.
In particular, each revision/minor release is made four weeks after the previous one of the same major version, and the pre-release of the upcoming major version is made at the same time.
Whether these releases are revision or minor is determined based on the contents of each update.
Note that there are only three stable releases for the versions X.x.x
.
During the parallel development of Y.0.0
and Z.0.0a1
, the version Y
is treated as an almost-stable version and Z
is treated as a development version.
If there is a critical bug found in X.x.x
after stopping the development of version X
, we may release a hot-fix for this version at any time.
We create a milestone for each upcoming release at GitHub. The GitHub milestone is basically used for collecting the issues and PRs resolved in the release.
Git Branches¶
The master
branch is used to develop pre-release versions.
It means that alpha, beta, and RC updates are developed at the master
branch.
This branch contains the most up-to-date source tree that includes features newly added after the latest major version.
The stable version is developed on a separate branch named vN, where "N" reflects the version number (we call it a versioned branch).
For example, v3.0.0, v3.0.1, and v3.0.2 will be developed on the v3 branch.
Notes for contributors:
When you send a pull request, you basically have to send it to the master
branch.
If the change can also be applied to the stable version, a core team member will apply the same change to the stable version so that the change is also included in the next revision update.
If the change is only applicable to the stable version and not to the master
branch, please send it to the versioned branch.
We basically only accept changes to the latest versioned branch (where the stable version is developed) unless the fix is critical.
If you want to make a new feature of the master
branch available in the current stable version, please send a backport PR to the stable version (the latest vN
branch).
See the next section for details.
Note: a change that can be applied to both branches should be sent to the master
branch.
Each release of the stable version is also merged to the development version so that the change is also reflected to the next major version.
Feature Backport PRs¶
We basically do not backport new features of the development version to the stable versions.
If you want a feature to be included in the current stable version and you can do the backporting work yourself, we welcome such a contribution.
In such a case, you have to send a backport PR to the latest vN branch.
Note that we do not accept feature backport PRs to older versions, because we do not run quality assurance workflows (e.g. CI) for older versions and therefore cannot ensure that the PR is correctly ported.
There are some rules on sending a backport PR.
- Start the PR title from the prefix [backport].
- Clarify the original PR number in the PR description (something like “This is a backport of #XXXX”).
- (optional) Write to the PR description the motivation of backporting the feature to the stable version.
Please follow these rules when you create a feature backport PR.
Note: PRs that do not include any changes or additions to APIs (e.g. bug fixes, documentation improvements) are usually backported by core dev members. Such backport PRs from any contributor are also appreciated, though, so that the overall development proceeds more smoothly!
Issues and Pull Requests¶
In this section, we explain how to file issues and send pull requests (PRs).
Issue/PR Labels¶
Issues and PRs are labeled by the following tags:
- Bug: bug reports (issues) and bug fixes (PRs)
- Enhancement: implementation improvements without breaking the interface
- Feature: feature requests (issues) and their implementations (PRs)
- NoCompat: disrupts backward compatibility
- Test: test fixes and updates
- Document: document fixes and improvements
- Example: fixes and improvements on the examples
- Install: fixes installation script
- Contribution-Welcome: issues that we request for contribution (only issues are categorized to this)
- Other: other issues and PRs
Multiple tags may be attached to one issue/PR. Note that revision releases cannot include PRs in the Feature and NoCompat categories.
How to File an Issue¶
When registering an issue, write a precise explanation of how you want Chainer to behave. Bug reports must include the conditions necessary and sufficient to reproduce the bug. Feature requests must include what you want to do with Chainer (and why, if needed). You can include your thoughts on how to realize it in the feature request, though the "what" part is the most important for the discussion.
Warning
If you have a question about the usage of Chainer, it is highly recommended to post to StackOverflow or the Chainer User Group instead of the issue tracker. The issue tracker is not a place to share knowledge on practices. We may suggest these places and immediately close how-to question issues.
How to Send a Pull Request¶
If you can write code to fix an issue, we encourage you to send a PR.
First of all, before starting to write any code, do not forget to confirm the following points.
- Read through the Coding Guidelines and Unit Testing.
- Check which branch you should send the PR to, following Git Branches. If you have no idea which branch to choose, please choose the master branch.
In particular, check the branch before writing any code. The current source tree of the chosen branch is the starting point of your change.
After writing your code (including unit tests and, hopefully, documentation!), send a PR on GitHub. You have to write a precise explanation of what you fix and how; it is the first documentation of your code that developers read and a very important part of your PR.
Once you send a PR, it is automatically tested on Travis CI for Linux and Mac OS X, and on AppVeyor for Windows. Your PR needs to pass at least the tests for Linux on Travis CI. After the automatic tests pass, some of the core developers will start reviewing your code. Note that this automatic PR test only includes CPU tests.
Note
We are also running continuous integration with GPU tests for the master
branch and the versioned branch of the latest major version.
Since this service is currently running on our internal server, we do not use it for automatic PR tests to keep the server secure.
If you are planning to add a new feature or modify existing APIs, it is recommended to open an issue and discuss the design first. A design discussion costs the core developers less than a code review. Based on the outcome of the discussion, you can then send a PR that can be reviewed smoothly in a shorter time.
Even if your code is not complete, you can send a pull request as a work-in-progress PR by putting the [WIP] prefix in the PR title.
If you write a precise explanation of the PR, core developers and other contributors can join the discussion about how to proceed with it.
A WIP PR is also useful for having discussions based on concrete code.
Coding Guidelines¶
Note
Coding guidelines are updated at v3.0. Those who have contributed to older versions should read the guidelines again.
We use PEP 8 and a part of OpenStack Style Guidelines related to general coding style as our basic style guidelines.
To check your code, use the autopep8 and flake8 commands installed with the hacking package:
$ pip install autopep8 hacking
$ autopep8 --global-config .pep8 path/to/your/code.py
$ flake8 path/to/your/code.py
autopep8 automatically corrects Python code to conform to the PEP 8 style guide:
$ autopep8 --in-place --global-config .pep8 path/to/your/code.py
The flake8 command lets you know which parts of your code do not obey our style guidelines.
Before sending a pull request, be sure to check that your code passes the flake8 check.
Note that the flake8 command is not perfect; it does not check some of the style guidelines.
Here is an (incomplete) list of the rules that flake8 cannot check.
- Relative imports are prohibited. [H304]
- Importing non-module symbols is prohibited.
- Import statements must be organized into three parts: standard libraries, third-party libraries, and internal imports. [H306]
In addition, we restrict the usage of shortcut aliases in any global-scope code. In particular, you cannot use shortcut aliases to designate a parent class in global-scope class definitions. When you define a class that inherits from a class defined in another module, you have to spell out the full module name instead of importing a module that provides an alias.
For example, the following code is not allowed.
import chainer
class MyLink(chainer.Link): ...
Instead, import chainer.link
and use that.
import chainer.link
class MyLink(chainer.link.Link): ...
If you feel the code is too verbose, you can also use from ... import or import ... as.
from chainer import link
class MyLink(link.Link): ...
Note
From v3.0, we allow shortcut aliases inside functions and methods that are not called from any global-scope code.
For example, you can write chainer.Variable instead of chainer.variable.Variable inside functions and methods.
The use of such aliases was prohibited in the past to avoid confusing errors related to cyclic dependencies; we relaxed the rule so that library code looks more similar to user code.
When you use such shortcut aliases, please be careful of cyclic imports.
One typical pitfall is how to import chainer.functions.
An import like import chainer.functions as F within modules under chainer.functions does not work.
An import like from chainer import functions works with Python 3, but not with Python 2.
We recommend using import chainer.functions and spelling out chainer.functions.foo in your methods.
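A minimal sketch of this recommended style; the helper function below is illustrative only, standing in for a function defined in a hypothetical module under chainer.functions:
import chainer.functions

def apply_activation(x):
    # Spell out the full module path instead of using the F alias.
    return chainer.functions.relu(x)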
Once you send a pull request, your coding style is automatically checked by Travis-CI. The reviewing process starts after the check passes.
Unit Testing¶
Testing is one of the most important parts of your code. You must write test cases and verify your implementation by following our testing guide.
Note that we use the pytest and mock packages for testing, so install them before writing your code:
$ pip install pytest mock
How to Run Tests¶
You can run unit tests simply by running the python -m pytest command at the repository root:
$ python -m pytest
or specify the test script that you want to run:
$ python -m pytest path/to/your/test.py
You can also run all unit tests under a specified directory:
$ python -m pytest tests/chainer_tests/<directory name>
The full test suite requires CUDA and cuDNN by default.
In order to run unit tests that do not require CUDA and cuDNN, set the CHAINER_TEST_GPU_LIMIT=0 environment variable and pass the -m='not cudnn' option:
$ export CHAINER_TEST_GPU_LIMIT=0
$ python -m pytest path/to/your/test.py -m='not cudnn'
Some GPU tests involve multiple GPUs.
If you want to run GPU tests with an insufficient number of GPUs, set CHAINER_TEST_GPU_LIMIT to the number of available GPUs.
For example, if you have only one GPU, launch pytest with the following command to skip multi-GPU tests:
$ export CHAINER_TEST_GPU_LIMIT=1
$ python -m pytest path/to/gpu/test.py
Some tests take too much time.
If you want to skip such tests, pass the -m='not slow' option to the command:
$ python -m pytest path/to/your/test.py -m='not slow'
If you modify the code related to existing unit tests, you must run appropriate commands and confirm that the tests pass.
Test File and Directory Naming Conventions¶
Tests are put under the tests/chainer_tests directory. In order to enable the test runner to find test scripts correctly, we use a special naming convention for the test subdirectories and the test scripts.
- The name of each subdirectory of tests must end with the _tests suffix.
- The name of each test script must start with the test_ prefix.
When we write a test for a module, we choose a path and file name for the test script that make its correspondence to the tested module clear.
For example, if you want to write a test for a module chainer.x.y.z
, the test script must be located at tests/chainer_tests/x_tests/y_tests/test_z.py
.
How to Write Tests¶
There are many examples of unit tests under the tests directory, so reading some of them is a good and recommended way to learn how to write tests for Chainer.
They simply use the unittest
package of the standard library, while some tests are using utilities from chainer.testing
.
Even if your patch includes GPU-related code, your tests should not fail in environments without GPU capability.
Test functions that require CUDA must be tagged with the chainer.testing.attr.gpu decorator:
import unittest
from chainer.testing import attr
class TestMyFunc(unittest.TestCase):
...
@attr.gpu
def test_my_gpu_func(self):
...
The functions tagged by the gpu
decorator are skipped if CHAINER_TEST_GPU_LIMIT=0
environment variable is set.
We also have the chainer.testing.attr.cudnn
decorator to let pytest
know that the test depends on cuDNN.
The test functions decorated by cudnn
are skipped if -m='not cudnn'
is given.
The test functions decorated by gpu
must not depend on multiple GPUs.
In order to write tests for multiple GPUs, use chainer.testing.attr.multi_gpu()
decorator instead:
import unittest
from chainer.testing import attr
class TestMyFunc(unittest.TestCase):
...
@attr.multi_gpu(2) # specify the number of required GPUs here
def test_my_two_gpu_func(self):
...
If your test requires too much time, add chainer.testing.attr.slow
decorator.
The test functions decorated by slow
are skipped if -m='not slow'
is given:
import unittest
from chainer.testing import attr
class TestMyFunc(unittest.TestCase):
...
@attr.slow
def test_my_slow_func(self):
...
Note
If you want to specify two or more attributes, combine them with the and operator, like -m='not cudnn and not slow'.
See the pytest documentation for details.
Once you send a pull request, your code is automatically tested by Travis CI, except for tests annotated with gpu, multi_gpu, or slow. Since Travis CI does not support CUDA, we cannot check your CUDA-related code automatically. The reviewing process starts after the tests pass. Note that reviewers will test your code without the skip options in order to check the CUDA-related code.
Note
Some numerically unstable tests might cause errors unrelated to your changes. In such a case, we ignore the failures and go on to the review process, so do not worry about it!
Installation Guide¶
Recommended Environments¶
We recommend these Linux distributions.
The following versions of Python can be used: 2.7.6+, 3.4.3+, 3.5.1+, and 3.6.0+.
Note
We test Chainer automatically with Jenkins, where all of the above recommended environments are tested. We cannot guarantee that Chainer works on other environments, including Windows and macOS (especially with CUDA support), even if Chainer appears to run correctly.
Dependencies¶
Before installing Chainer, we recommend upgrading setuptools if you are using an old version:
$ pip install -U setuptools
The following Python packages are required to install Chainer. The latest version of each package will automatically be installed if missing.
The following packages are optional dependencies. Chainer can be installed without them, in which case the corresponding features are not available.
Install Chainer¶
Install Chainer via pip¶
We recommend installing Chainer via pip:
$ pip install chainer
Note
Any optional dependencies (including CuPy) can be added after installing Chainer. Chainer automatically detects the available packages and enables/disables the optional features appropriately.
Install Chainer from source¶
The tarball of the source tree is available via pip download chainer
or from the release notes page.
You can use setup.py
to install Chainer from the tarball:
$ tar zxf chainer-x.x.x.tar.gz
$ cd chainer-x.x.x
$ python setup.py install
You can also install the development version of Chainer from a cloned Git repository:
$ git clone https://github.com/chainer/chainer.git
$ cd chainer
$ python setup.py install
When an error occurs…¶
Use the -vvvv option with the pip command. It shows all installation logs and may help you:
$ pip install chainer -vvvv
Enable CUDA/cuDNN support¶
In order to enable CUDA support, you have to install CuPy manually. If you also want to use cuDNN, you have to install CuPy with cuDNN support. See CuPy’s installation guide to install CuPy. Once CuPy is correctly set up, Chainer will automatically enable CUDA support.
You can refer to the following flags to confirm whether CUDA/cuDNN support is actually available.
- chainer.cuda.available: True if Chainer successfully imports cupy.
- chainer.cuda.cudnn_enabled: True if cuDNN support is available.
Support image dataset¶
Install Pillow manually to activate image dataset feature:
$ pip install pillow
Note that this feature is optional.
Support HDF5 serialization¶
Install h5py manually to activate HDF5 serialization:
$ pip install h5py
Before installing h5py, you need to install libhdf5. The way to install it depends on your environment:
# Ubuntu 14.04/16.04
$ apt-get install libhdf5-dev
# CentOS 7
$ yum -y install epel-release
$ yum install hdf5-devel
Note that this feature is optional.
Uninstall Chainer¶
Use pip to uninstall Chainer:
$ pip uninstall chainer
Note
When you upgrade Chainer, pip sometimes installs the new version without removing the old one from site-packages.
In this case, pip uninstall only removes the latest one.
To ensure that Chainer is completely removed, run the above command repeatedly until pip returns an error.
Reinstall Chainer¶
If you want to reinstall Chainer, please uninstall it first and then install it again.
We recommend using the --no-cache-dir option, as pip sometimes uses a cached package:
$ pip uninstall chainer
$ pip install chainer --no-cache-dir
Run Chainer with Docker¶
We provide an official Docker image. Use the nvidia-docker command to run the Chainer image with GPU support. You can log in to the environment with bash and run the Python interpreter:
$ nvidia-docker run -it chainer/chainer /bin/bash
Or run the interpreter directly:
$ nvidia-docker run -it chainer/chainer /usr/bin/python
FAQ¶
The installer says “hdf5.h is not found”¶
You don’t have libhdf5. Please install it first. See Support HDF5 serialization.
Examples say “cuDNN is not enabled”¶
CuPy was not built with cuDNN support.
If you don't need cuDNN, you can ignore this message.
Otherwise, retry installing CuPy with cuDNN; the -vvvv option may help you.
There is no need to reinstall Chainer itself.
See CuPy's installation guide for more details.
Tips and FAQs¶
It takes too long to compile a computational graph. Can I skip it?¶
Chainer does not compile computational graphs, so you cannot skip it, or, I mean, you have already skipped it :).
It seems you have actually seen on-the-fly compilation of CUDA kernels. CuPy compiles kernels on demand to optimize them for the number of dimensions and element types of the input arguments. Pre-compilation is not available, because we would have to compile an exponential number of kernels to support all CuPy functionalities. This restriction is unavoidable because Python cannot call CUDA/C++ template functions in a generic way. Note that every framework using CUDA requires compilation at some point; the difference between other statically-compiled frameworks (such as cutorch) and Chainer is whether a kernel is compiled at installation time or at first use.
These compilations should run only on the first use of the kernels.
The compiled binaries are cached in the $(HOME)/.cupy/kernel_cache directory by default.
If you see compilations run every time you execute the same script, the caching has failed.
Please check that the directory is kept intact between executions of the script.
If your home directory is not suited to caching the kernels (e.g. because it is on NFS), change the kernel caching directory by setting the CUPY_CACHE_DIR environment variable to an appropriate path.
See CuPy Overview for more details.
MNIST example does not converge in CPU mode on Mac OS X¶
Note
Mac OS X is not officially supported. Please use it at your own risk.
Many users have reported that the MNIST example does not work correctly when using vecLib as the NumPy backend on Mac OS X. vecLib is the default BLAS library installed on Mac OS X.
We recommend using another BLAS library, such as OpenBLAS.
To use an alternative BLAS library, it is necessary to reinstall NumPy. Here are instructions for installing NumPy with OpenBLAS using Homebrew.
$ brew tap homebrew/science
$ brew install openblas
$ brew install numpy --with-openblas
If you want to install NumPy with pip, use the site.cfg file.
For details of this problem, see issue #704.
Upgrade Guide¶
This is a list of changes introduced in each release that users should be aware of when migrating from older versions. Most changes are carefully designed not to break existing code; however, changes that may break it are highlighted with a box.
Chainer v3¶
Introduction of New-style Functions¶
This release introduces new-style functions (classes inheriting from FunctionNode
) that support double backward (gradient of gradient).
See the Release Note for v3.0.0 for the usage of this feature.
Many of the Standard Function implementations have already been migrated to the new style, although some functions are still old-style (classes inheriting from Function).
We are going to migrate more old-style functions to new-style in upcoming minor releases.
This does not break the existing code.
Old-style functions (classes inheriting from Function
) are still supported in v3 and future versions of Chainer.
If you are going to write new functions, it is encouraged to use FunctionNode
to support double backward.
Attention
Users relying on undocumented function APIs (directly instantiating old-style classes) may experience an error like TypeError: 'SomeFunction' object is not callable
after upgrading to v3.
Please use the function APIs documented in Standard Function implementations.
Changed Behavior of matmul Function¶
The behavior of chainer.functions.matmul()
has been changed to behave like the corresponding NumPy function (numpy.matmul()
).
See the discussion in #2426 for more details.
Attention
The existing code using chainer.functions.matmul()
may require modification to work with Chainer v3.
Also note that chainer.functions.batch_matmul()
is now deprecated by this change.
You can rewrite it using chainer.functions.matmul()
.
Removed use_cudnn Argument in spatial_transformer_grid and spatial_transformer_sampler Functions¶
The use_cudnn argument has been removed from chainer.functions.spatial_transformer_grid() and chainer.functions.spatial_transformer_sampler().
See the discussion in #2955 for more details.
Attention
The existing code using the use_cudnn argument of chainer.functions.spatial_transformer_grid() and chainer.functions.spatial_transformer_sampler() requires modification to work with Chainer v3.
Please use the configuration context (e.g., with chainer.using_config('use_cudnn', 'auto'):
) to enable or disable use of cuDNN.
See Configuring Chainer for details.
Chainer v2¶
See Upgrade Guide from v1 to v2 for the changes introduced in Chainer v2.
Upgrade Guide from v1 to v2¶
This document provides detailed information of differences between Chainer v1 and v2. You will know by reading it which part of your code is required (or recommended) to be fixed when you upgrade Chainer from v1 to v2.
- CuPy
- Global configurations
- Variable
- Function
- Link/Chain/ChainList
- wscale option is removed from links
- bias option is removed from links
- The bias vector is enabled by default in N-dimensional convolution links
- init_weight function is removed
- The order of arguments of GRU is changed
- The default value of the forget bias for LSTM and StatelessLSTM is changed to 1
- The interfaces of GRU and LSTM are aligned
- Aliases of links in chainer.functions are removed
- Parameter link is removed
- New-style parameter registration APIs are added to Link
- New-style child link registration APIs are added to Chain
- The input-size placeholder of links is made optional
- Optimizer
- Serializer
- Trainer and Extension
- Reporter
- Other utilities
CuPy¶
CuPy has been separated from Chainer into a separate package¶
CuPy, which was originally a part of Chainer, has been separated into a different Python package since Chainer v2.
This changes the way to set up Chainer with CUDA support.
In particular, you have to install the cupy package separately to enable CUDA support.
See the Installation Guide for the recommended installation steps.
Fortunately, there is no need to update your source code to catch up with this change.
Global configurations¶
Training mode is configured by a thread-local flag¶
In Chainer v2, the concept of training mode is added.
It is represented by a thread-local flag chainer.config.train
, which is a part of the unified configuration.
When chainer.config.train
is True
, functions of Chainer run in the training mode, and otherwise they run in the test mode.
For example, BatchNormalization
and dropout()
behave differently in each mode.
In Chainer v1, such behavior was configured by the train or test argument of each function.
This train/test argument has been removed in Chainer v2.
If your code is using the train or test argument, you have to update it.
In most cases, all you have to do is remove the train/test argument from any function calls.
Example
Consider the following model definition and the code to call it in test mode written for Chainer v1.
# Chainer v1
import chainer.functions as F
class MyModel(chainer.Link):
...
def __call__(self, x, train=True):
return f(F.dropout(x, train=train))
m = MyModel(...)
y = m(x, train=False)
In Chainer v2, it should be updated into the following code:
# Chainer v2
import chainer.functions as F
class MyModel(chainer.Link):
...
def __call__(self, x):
return f(F.dropout(x))
m = MyModel(...)
with chainer.using_config('train', False):
y = m(x)
Configurations are added and replace some of existing global flags¶
Many global settings other than the training mode have been moved to the unified configuration. The following is the complete list of configuration entries that have corresponding features in Chainer v1.
chainer.config.cudnn_deterministic
- It corresponds to the deterministic argument of some convolution functions in Chainer v1. This argument has been removed since Chainer v2. If you are using this argument, you have to use the chainer.config.cudnn_deterministic flag to change the behavior of the convolution functions.
chainer.config.debug
- It corresponds to the debug mode in Chainer v1, which was configured by set_debug() and queried by is_debug(). These functions are also available in Chainer v2, so you basically do not need to update code related to the debug mode.
chainer.config.enable_backprop
- It corresponds to the backprop mode in Chainer v1. The functions no_backprop_mode() and force_backprop_mode() are still available in Chainer v2, and automatically turn the enable_backprop flag on/off. One important difference from Chainer v1 is that the volatile flag has been removed from Variable. Therefore, there are more situations in which you need to modify the enable_backprop flag.
chainer.config.keep_graph_on_report
- This flag configures whether or not to keep the computational graph alive for a reported variable. In Chainer v2, when a Variable object is reported by report(), a copy of the variable isolated from the computational graph is created and stored by default. By setting this flag to True, you can change this behavior so that the original Variable object is stored as is. See When a variable is reported, the variable is copied with the graph purged for the details.
chainer.config.train
- It corresponds to the train or test argument of some functions in Chainer v1. This argument has been removed since Chainer v2. If you are using this argument, you have to use the chainer.config.train flag instead. See Training mode is configured by a thread-local flag for more details.
chainer.config.type_check
- It corresponds to the Function.type_check_enable flag. If your code touches this flag, you have to use chainer.config.type_check instead. Note that the environment variable CHAINER_TYPE_CHECK is still available in Chainer v2, so if you are only using the environment variable, there is no need to update your code.
chainer.config.use_cudnn
- It corresponds to the use_cudnn argument of many functions that have cuDNN implementations. This argument has been removed since Chainer v2. If you are using this argument, you have to use the chainer.config.use_cudnn flag instead. Note that this flag is ternary, not binary. See Configuring Chainer for more details.
These configurations can be modified in two ways.
- Simply substitute a new value into an entry, like chainer.config.train = False.
- Use the chainer.using_config context manager. It can be used with the with statement of Python as follows:
with chainer.using_config('train', False):
    do_something()  # this code runs with chainer.config.train == False
It restores the original configuration after exiting the with block.
The chainer.config
manages the thread-local configuration.
You can also set the global configuration by modifying chainer.global_config
.
Note that the global configuration is used only if the entry of the thread-local configuration is not explicitly set up.
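A minimal sketch of the difference between the two levels (the values shown in comments assume a fresh session with no other thread-local overrides):
chainer.global_config.train = False  # process-wide default
print(chainer.config.train)          # False: falls back to the global value
chainer.config.train = True          # thread-local override
print(chainer.config.train)          # True
print(chainer.global_config.train)   # False: the global entry is unchanged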
Variable¶
Volatile flag is removed¶
The Variable.volatile
flag has been removed since Chainer v2.
Instead, the configuration chainer.config.enable_backprop
can be used to enable/disable the automatic differentiation feature.
If it is True
, Chainer always creates a computational graph on the forward propagation, which corresponds to passing non-volatile variables in Chainer v1.
Otherwise, Chainer does not create a graph, which corresponds to passing volatile variables in Chainer v1.
The biggest difference is that enable_backprop
is a thread-local flag, whereas volatile
was a flag local to each Variable
object.
Note that the enable_backprop flag already existed in Chainer v1, where it took effect only if all the inputs to the function had volatile == 'auto'.
The chainer.config.enable_backprop
flag can be modified directly or by using using_config()
.
See Configuring Chainer for details.
There is also a convenience function, no_backprop_mode()
, to turn off the flag.
If you are using the Variable.volatile
flag, you have to stop setting this flag (it will not take effect), and set the enable_backprop
flag instead.
Example
Let model
be your model, and consider the following code that calls it in volatile mode.
# Chainer v1
x_data = ... # ndarray
x = chainer.Variable(x_data, volatile=True)
y = model(x)
In Chainer v2, it should be updated as follows.
# Chainer v2
x_data = ... # ndarray
x = chainer.Variable(x_data)
with chainer.no_backprop_mode():
y = model(x)
Variable is not a part of a computational graph anymore¶
The Variable
class has been separated into two distinct classes, the Variable
class and the VariableNode
class, since Chainer v2.
Every Variable object owns its own VariableNode object.
A computational graph consists of Function
objects and VariableNode
objects.
When one applies a Function
to a Variable
, the VariableNode
object of the variable is extracted and set to one of the inputs of the function.
Note that the underlying data array of the variable is still held by the Variable object.
This allows each Function implementation to release unneeded arrays from the computational graph, resulting in greatly reduced memory consumption.
This change does not affect most users’ code.
If you are directly traversing the computational graph by yourself or modifying the graph ad-hoc, you may have to update your code.
In most cases, it is enough to just change Variable
into VariableNode
in the code traversing the computational graph.
Parameter has to be an instance of Parameter class¶
Chainer v2 has a subclass of Variable
called Parameter
.
This class has a convenient interface for setting up a parameter variable registered to a Link.
You basically do not need to update your code because Link.add_param()
creates a Parameter
object in Chainer v2.
There is a new recommended way of registering parameters to a link in Chainer v2, though.
See New-style parameter registration APIs are added to Link for the recommended way of parameter registration.
Small changes to Variable¶
There are some changes on the interface and specification of methods.
- len(variable) returns the length of the first axis of the underlying array in Chainer v2. This is equivalent to len(variable.data). It differs from the behavior of Chainer v1, in which len returned the total number of elements in the underlying array.
- repr(variable) returns a NumPy-like text representation of the underlying array in Chainer v2. In Chainer v1, it just returned a string showing the name of the variable.
Function¶
The force_tuple option of split_axis is True by default¶
In Chainer v2, the force_tuple
argument of functions.split_axis()
is set to True
by default.
Therefore, it always returns a tuple regardless of the number of sections made after the split.
It was False
by default in Chainer v1.
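A minimal sketch of the v2 behavior (a single resulting section still comes back wrapped in a tuple):
x = Variable(np.arange(6, dtype=np.float32).reshape(2, 3))
ys = F.split_axis(x, 1, axis=0)  # force_tuple=True by default in v2
print(type(ys), len(ys))         # <class 'tuple'> 1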
Type check APIs are updated to enable lazy building of the error messages¶
In Chainer v2, the type check APIs are updated so that the overhead of checking types is greatly reduced. In order to achieve the overhead reduction, some APIs are changed.
If you have custom Function implementations that do type checking, you have to update your code. The following list shows which part has to be updated.
- Use utils.type_check.eval() instead of Expr.eval.
- Use utils.type_check.make_variable() to create a utils.type_check.Variable object instead of directly constructing it by yourself.
- Stop using the .name attribute of any expression.
Background of this change:
In Chainer v1, the type checking APIs build an abstract syntax tree (AST) based on each expression that tests some condition.
The AST is used to emit a friendly error message.
However, building an AST requires the construction of many Python objects, which adds a large Python overhead.
In Chainer v2, the Function.type_check_forward() method is called once or twice.
At the first call, the type checking APIs run in a light-weight mode, where they do not build an AST and just check the condition.
The second call is made only if some test fails, in which case the AST is built.
This change makes the ordinary path of type checking much faster, while keeping the friendly error messages.
Methods to release unneeded arrays are added¶
As written above, Chainer v2 introduced a new mechanism to reduce the memory consumption of each Function implementation.
In many cases, a Function implementation does not need some of its input arrays in its backward computation.
A new method called Function.retain_inputs() can be used to specify which input arrays are actually needed.
This method must not be called from outside of Function.forward().
Example
For example, consider the following simple addition function.
class AddFunction(chainer.Function):
def forward(self, inputs):
return inputs[0] + inputs[1],
def backward(self, inputs, grad_outputs):
return grad_outputs[0], grad_outputs[0]
It can be seen that the backward computation of this function does not use any of the inputs.
Therefore, specifying an empty tuple of indices to retain_inputs() will reduce the memory overhead.
class AddFunction(chainer.Function):
def forward(self, inputs):
self.retain_inputs(()) # does not retain both inputs
return inputs[0] + inputs[1],
def backward(self, inputs, grad_outputs):
return grad_outputs[0], grad_outputs[0]
In some cases, the function can (or has to) use the output arrays instead of the inputs in its backward computation.
In Chainer v1, we wrote code that stores the output arrays as attributes of the Function object and reuses them in the backward() method.
In Chainer v2, it is recommended to use Function.retain_outputs() to declare which outputs are required in the backward computation.
The retained output arrays can be accessed via Function.output_data.
Note
The existing Function implementations that store the output arrays in their own attributes will run correctly in Chainer v2; there is currently no extra memory overhead.
It is recommended to use retain_outputs(), though, so that we can incorporate more memory optimizations in the future.
Example
For example, consider the following simple implementation of the tanh function.
class TanhFunction(chainer.Function):
def forward(self, inputs):
xp = chainer.cuda.get_array_module(inputs[0])
self.y = xp.tanh(inputs[0])
return self.y,
def backward(self, inputs, grad_outputs):
one = self.y.dtype.type(1) # avoid type promotion
return grad_outputs[0] * (one - self.y * self.y),
We can use retain_outputs()
instead of preserving the output array by ourselves as follows.
class TanhFunction(chainer.Function):
def forward(self, inputs):
self.retain_outputs((0,))
xp = chainer.cuda.get_array_module(inputs[0])
return xp.tanh(inputs[0]),
def backward(self, inputs, grad_outputs):
y = self.output_data[0]
one = y.dtype.type(1) # avoid type promotion
return grad_outputs[0] * (one - y * y),
Link/Chain/ChainList¶
wscale option is removed from links¶
The wscale
option has been removed from links since Chainer v2.
If you are using the wscale option, you have to update your code.
The recommended way is to set the initializer explicitly.
Example
Consider the case of adding a Linear
link with the weight initialized by 0.5x of the default initialization.
# Chainer v1
linear = chainer.links.Linear(10, 5, wscale=0.5)
Note that the default initializer of the weight matrix of Linear is a normal distribution with standard deviation \(1 / \sqrt{\mathrm{fan\ in}}\).
Therefore, it can be fixed as follows.
# Chainer v2
linear = chainer.links.Linear(10, 5, initialW=chainer.initializers.Normal(0.5 / math.sqrt(10)))
Or, using the fact that initializers.HeNormal provides initialization with a normal distribution with standard deviation \(\mathrm{scale} \times \sqrt{2 / \mathrm{fan\ in}}\), the following code is also equivalent to the original.
# Chainer v2, using HeNormal
linear = chainer.links.Linear(10, 5, initialW=chainer.initializers.HeNormal(0.5 / math.sqrt(2)))
bias option is removed from links¶
In Chainer v2, the bias
option is removed from the following links: Linear
, Convolution2D
, Deconvolution2D
, and DilatedConvolution2D
.
The effect of this argument duplicated that of the initial_bias option. Use initial_bias instead.
The bias vector is enabled by default in N-dimensional convolution links¶
In Chainer v2, the bias parameter is enabled by default in ConvolutionND
and DeconvolutionND
.
It was unintentionally disabled by default in Chainer v1.
If you are using ConvolutionND or DeconvolutionND without specifying the initial_bias argument, you have to fix your code.
If you want to keep the old behavior (i.e., no bias vector is created by the link), pass nobias=True to the link at construction.
Otherwise, a bias vector will be created automatically.
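A minimal sketch of keeping the v1 behavior for a 3-dimensional convolution link (the channel sizes below are arbitrary examples):
# 3-D convolution with 16 input channels, 32 output channels, and no bias vector.
conv = chainer.links.ConvolutionND(3, 16, 32, ksize=3, nobias=True)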
init_weight function is removed¶
The chainer.initializers.init_weight function that was used for weight initialization has been removed since Chainer v2.
You have to update your code if you are using init_weight.
In most cases, the update is simple: pass an initializer to Parameter.
Example
Consider the following code that initializes a weight matrix randomly and a bias vector by zero.
# Chainer v1
class MyLink(chainer.Link):
def __init__(self):
super(MyLink, self).__init__(
W=(10, 5),
b=(5,),
)
chainer.initializers.init_weight(self.W, chainer.initializers.Normal(0.05))
self.b.data.fill(0)
...
This code should be fixed as follows (see the next topic for the use of Parameter
).
# Chainer v2
class MyLink(chainer.Link):
def __init__(self):
super(MyLink, self).__init__()
self.W = chainer.Parameter(chainer.initializers.Normal(0.05), (10, 5))
self.b = chainer.Parameter(0, (5,))
...
The order of arguments of GRU is changed¶
In Chainer v2, the first two arguments of GRU are the input size and the output size.
They were reversed in Chainer v1, causing an interface inconsistent with other links, including LSTM.
If you are using GRU, you have to update your code.
The update is done by simply flipping the first two arguments.
Example
Consider the following code that creates a GRU
link.
# Chainer v1
gru = chainer.links.GRU(20, 10)
It should be fixed into the following code.
# Chainer v2
gru = chainer.links.GRU(10, 20)
Note that if you were omitting the output size, the code works as is because GRU
supports the omitted input size.
# Chainer v1/v2
gru = chainer.links.GRU(20)
The default value of the forget bias for LSTM and StatelessLSTM is changed to 1¶
In Chainer v2, the default forget bias value of the LSTM and StatelessLSTM links is changed to 1.
This change is based on a paper reporting that using a large forget bias improves training performance.
The new behavior is also consistent with the implementation of BasicLSTMCell in TensorFlow.
It will improve most use cases of LSTMs, although this change may break the reproducibility of existing experiments.
If you want to keep the same initialization procedure, you have to update your code.
The change is simple: pass forget_bias_init=0 to LSTM and StatelessLSTM.
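A minimal sketch of keeping the v1 initialization (the layer sizes below are arbitrary examples):
# LSTM with 100 inputs and 50 units, keeping the old forget-bias value of 0.
lstm = chainer.links.LSTM(100, 50, forget_bias_init=0)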
The interfaces of GRU and LSTM are aligned¶
In Chainer v1, GRU was stateless, as opposed to the current implementation.
To align with the naming convention of the LSTM links, we have changed the naming convention from Chainer v2 so that the shorthand name points to the stateful link.
Use StatelessGRU for the stateless version, whose implementation is identical to that of chainer.links.GRU in v1.
Aliases of links in chainer.functions are removed¶
For compatibility reasons, there were some links that had aliases in the chainer.functions module.
These aliases have been removed in Chainer v2.
Use chainer.links instead.
Parameter link is removed¶
The chainer.links.Parameter
link is removed in Chainer v2.
This link existed in Chainer v1 only for the backward compatibility.
Use chainer.Parameter
instead (for the new Parameter
class, see Parameter has to be an instance of Parameter class).
New-style parameter registration APIs are added to Link¶
In Chainer v2, the Link.init_scope() method returns a context manager that automatically registers a Parameter object to the link when it is set as an attribute.
If you are using an IDE like PyCharm, it is recommended to use this new-style parameter registration so that the IDE can easily detect the existence of the parameter as an attribute.
It is also good practice to use the new-style API even if you are not using an IDE, especially if you are planning to make the code public.
Note
Existing code that uses the conventional way of registering parameters is still valid.
Example
For example, the following link initialization code
# Chainer v1
class MyLink(chainer.Link):
def __init__(self):
super(MyLink, self).__init__(
W=(10, 5),
b=(5,),
)
chainer.initializers.Normal(0.05)(self.W.data)
self.b.data.fill(0)
...
is recommended to be updated as follows.
# Chainer v2
class MyLink(chainer.Link):
def __init__(self):
super(MyLink, self).__init__()
with self.init_scope():
self.W = chainer.Parameter(chainer.initializers.Normal(0.05), (10, 5))
self.b = chainer.Parameter(0, (5,)) # initialize by zero
...
Note
To keep a Parameter
object as an attribute without registration, you can set the attribute without using the with self.init_scope():
block.
New-style child link registration APIs are added to Chain¶
Like Parameter, a Link object is also automatically registered to a Chain object when it is assigned to an attribute within an init_scope() scope.
If you are using an IDE like PyCharm, it is recommended to use the new-style child link registration so that the IDE can easily detect the existence of the child link as an attribute.
It is also good practice to use the new-style API even if you are not using an IDE, especially if you are planning to make the code public.
Note
The existing code that uses the conventional way of registering child links is still valid.
Example
For example, the following chain initialization code
# Chainer v1
class MyMLP(chainer.Chain):
def __init__(self):
super(MyMLP, self).__init__(
layer1=L.Linear(None, 20),
layer2=L.Linear(None, 30),
)
...
is recommended to be updated as follows.
# Chainer v2
class MyMLP(chainer.Chain):
def __init__(self):
super(MyMLP, self).__init__()
with self.init_scope():
self.layer1 = L.Linear(20)
self.layer2 = L.Linear(30)
Note that this example also demonstrates the use of new APIs with the omitted input size, explained below.
Note
To keep a Link object as an attribute without registering it, simply set the attribute outside of the with self.init_scope(): block.
The input-size placeholder of links are made optional¶
In Chainer v2, the input size of many links, including Linear
and Convolution2D
, is made optional.
In Chainer v1, we had to use None
as the placeholder to specify that the input size should be determined at the first iteration.
The placeholder can also be used in Chainer v2, although it is easier to just omit the input size.
See the previous item for the example of omitting the input size of Linear
.
Many other links, including Convolution2D, also support the omitted input size.
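For example, assuming an output size of 20, the following definitions are interchangeable in Chainer v2:
# Chainer v2: both defer the input size to the first forward computation
layer_a = L.Linear(None, 20)  # explicit placeholder (also valid in Chainer v1)
layer_b = L.Linear(20)        # input size omitted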
Optimizer¶
Deprecated methods of Optimizer are removed¶
The following methods are removed from Optimizer
.
These methods were already deprecated in past versions.
If you are using these methods, you have to update your code.
- zero_grads: use Link.zerograds() instead.
- compute_grads_norm: you can compute the gradient norm by iterating over the parameters with Link.params().
- clip_grads: use GradientClipping instead (see the sketch after this list).
- weight_decay: use WeightDecay instead.
- accumulate_grads: use Link.addgrads() instead.
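For example, a minimal sketch of the hook-based replacements, assuming model is a Link such as the MyMLP chain shown earlier and that the coefficient values are arbitrary:
# Chainer v2: hooks replace the removed weight_decay/clip_grads methods
optimizer = chainer.optimizers.SGD(lr=0.01)
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.WeightDecay(0.0005))
optimizer.add_hook(chainer.optimizer.GradientClipping(10.0))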
GradientMethod uses Link.cleargrads instead of Link.zerograds by default¶
In Chainer v2, GradientMethod clears the gradients before running backprop with Link.cleargrads().
This means that the gradient of each parameter is initialized to None instead of a zero array.
Note that all the optimizer implementations provided by Chainer are subclasses of GradientMethod, so this change affects all of them.
In most cases, you do not need to update your code.
If your code relies on the zeroing initialization, you have to either explicitly initialize the gradients to zero or pass False to GradientMethod.use_cleargrads().
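For example, a minimal sketch of restoring the v1 zeroing behavior for an existing optimizer instance (model is assumed to be a Link):
# Chainer v2: fall back to Link.zerograds(), so gradients start as zero arrays
optimizer = chainer.optimizers.SGD()
optimizer.setup(model)
optimizer.use_cleargrads(False)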
GradientMethod is redesigned to allow parameter-specific update rules¶
In Chainer v2, the new class UpdateRule
is used to define an update rule specific to each Parameter
object.
The UpdateRule
is set to each Parameter
object, and is used at each update step.
This object implements an update formula using the data and gradient arrays.
Each UpdateRule object has an enabled flag, which determines whether the update rule should be applied to that parameter on update.
By setting the flag to False, you can freeze the parameter.
There are also the convenient methods Link.enable_update() and Link.disable_update(), which configure the flags of all parameters under the link hierarchy.
In other frameworks, a similar feature is called layer freezing.
In Chainer v2, it is officially supported by these methods.
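For example, a minimal sketch of freezing one layer of the MyMLP chain shown earlier:
model = MyMLP()
model.layer1.disable_update()  # freeze all parameters of layer1
# ... train only the remaining layers ...
model.layer1.enable_update()   # resume updating layer1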
Each UpdateRule object can also hold its own hook functions, similar to Optimizer.
All the built-in hook functions except GradientClipping can also be used as hook functions of UpdateRule.
In most cases, you do not have to update your code, because each optimizer automatically sets up an appropriate UpdateRule object for each parameter.
If you are using a custom gradient-based optimizer implementation, you need to update the implementation. The following list shows what you have to do.
- Write a subclass of UpdateRule that implements the update rule.
- Rewrite your GradientMethod implementation. The new implementation only has to set up the update rule for each parameter in the target link.
You can see live examples in the optimizer implementations provided by Chainer.
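The following is a minimal sketch of such a pair; the class names are placeholders, the rule is CPU-only for brevity, and it assumes the create_update_rule()/update_core_cpu() overriding pattern used by Chainer's built-in optimizers (check the live examples for the authoritative signatures):
# A sketch of a custom gradient method under the v2 design.
class MySGDRule(chainer.optimizer.UpdateRule):
    """Update rule implementing plain SGD for a single parameter (CPU only)."""

    def __init__(self, lr=0.01):
        super(MySGDRule, self).__init__()
        self.lr = lr

    def update_core_cpu(self, param):
        # The rule works directly on the parameter's data and gradient arrays.
        grad = param.grad
        if grad is None:
            return
        param.data -= self.lr * grad


class MySGD(chainer.optimizer.GradientMethod):
    """Gradient method that only creates one update rule per parameter."""

    def __init__(self, lr=0.01):
        super(MySGD, self).__init__()
        self.lr = lr

    def create_update_rule(self):
        # Called for each parameter of the target link.
        return MySGDRule(self.lr)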
Serializer¶
None is serializable¶
In Chainer v2, all serializers support serializing and deserializing None values.
User code can rely on this feature, i.e., it can serialize and deserialize a None value with any given serializer.
This change only affects your code if it provides its own serializer implementation.
Trainer and Extension¶
Updater and Evaluator pass raw data arrays to the loss function¶
In Chainer v2, Updater
and Evaluator
pass raw data arrays to the loss function without wrapping them with Variable
.
You might need to update your code so that the loss function (in most cases, the model’s __call__
) accepts raw arrays.
Note that raw arrays can be directly passed to any Function
; they are automatically wrapped by Variable
.
For example, if the input is directly passed to a Function
object (or any function under chainer.functions
), you do not need to update the code.
Example
Consider the following code that obtains the shape of the input via Variable.data
.
# Chainer v1
class MyLink(chainer.Link):
def __call__(self, x):
shape = x.data.shape # valid if x is Variable, invalid if x is ndarray
...
It should be updated so that the link also accepts a raw array as the input.
In this case, we have Variable.shape
which is equivalent to data.shape
, so you can simply write as follows.
# Chainer v2
class MyLink(chainer.Link):
def __call__(self, x):
shape = x.shape # valid regardless of x being Variable or ndarray
...
trigger option is removed from snapshot and snapshot_object¶
In Chainer v2, the trigger
option is removed from the snapshot()
and snapshot_object()
extensions.
The effect of the option duplicated that of the trigger option of Trainer.extend.
If you are passing the trigger argument to these extensions, you have to update your code.
The update can be done by passing the value to the corresponding Trainer.extend call.
Example
Assume that trainer is an instance of Trainer, and suppose that you were adding a snapshot() extension as follows.
# Chainer v1
trainer.extend(chainer.training.extensions.snapshot(trigger=(1000, 'iteration')))
It should be updated as follows (note that this code also works with Chainer v1).
# Chainer v1/v2
trainer.extend(chainer.training.extensions.snapshot(), trigger=(1000, 'iteration'))
Extension.invoke_before_training is removed¶
In Chainer v2, the attribute invoke_before_training of Extension is removed.
Instead, the Extension.initialize method is added.
This method is called by Trainer.run before entering the training loop.
In Chainer v1, the extension itself was simply called before entering the training loop when invoke_before_training was True.
If you have a custom extension with invoke_before_training=True, you have to update the code.
Remove the invoke_before_training flag and override the initialize() method instead.
If you are using the make_extension()
decorator, you can set the initialize
function by passing the initializer
argument to make_extension()
.
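For example, a minimal sketch of both styles (the extension names and bodies here are placeholders):
# Chainer v2: override initialize() in an Extension subclass
class MyExtension(chainer.training.Extension):

    def initialize(self, trainer):
        ...  # formerly run once via invoke_before_training=True

    def __call__(self, trainer):
        ...

# Chainer v2: or pass the initializer argument to make_extension()
def setup_my_extension(trainer):
    ...

@chainer.training.make_extension(initializer=setup_my_extension)
def my_extension(trainer):
    ...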
The dump_graph extension dumps the valid graph only at its first invocation¶
In Chainer v2, the dump_graph()
extension dumps the valid computational graph only at its first invocation.
If you want to dump the graph more than once, you have to fix the code.
The easiest fix is setting the chainer.config.keep_graph_on_report
flag to True
.
Note that this fix cancels the improvement in memory consumption made in Chainer v2.
A more memory-efficient fix is to dump the graph without using an extension, e.g. by customizing the loss function or the updater.
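A minimal sketch of the first fix, which simply flips the global configuration flag before training starts:
# Keep the computational graph attached to reported variables so that
# dump_graph() can dump it at every invocation (uses more memory).
chainer.config.keep_graph_on_report = True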
Here is the background of this change.
In Chainer v2, the Reporter copies reported variables with the computational graph purged by default.
On the other hand, the dump_graph() extension requires the computational graph to be reachable from the reported variable.
In order to make the graph available, the dump_graph() extension turns on the chainer.config.keep_graph_on_report flag in its initializer (i.e., it turns the flag on before entering the training loop).
Since we also wanted to keep the memory efficiency, the dump_graph() extension turns the flag off after dumping the graph at its first invocation (strictly speaking, it restores the original value).
As a result, the computational graph is not available from the second invocation onward.
Because dump_graph() only restores the original flag value, you can keep the graph available at every invocation by changing that original value, as shown above.
Reporter¶
When a variable is reported, the variable is copied with the graph purged¶
In Chainer v2, when a Variable object is reported using the report() function (or directly using Reporter), the variable is copied without preserving the computational graph.
If your code depends on the reachability of the computational graph from the reported variable, you have to update your code.
The easiest way to do so is to set chainer.config.keep_graph_on_report to True; Chainer will then keep the computational graph reachable from the reported variable.
Examples possibly affected by this change include the following (not exhaustive):
- A custom extension that runs backprop from a reported variable. This clearly assumes the reachability of the computational graph from the reported variable.
- An extension that visualizes the computational graph from a reported variable. If you write such an extension yourself, you have to turn on the keep_graph_on_report flag. The dump_graph() extension is another such example; see the item above for details.
This change is made for memory performance reasons: with it, the memory used by the computational graph for training is released immediately, before the extensions are invoked.
Therefore, changing the behavior by overriding chainer.config.keep_graph_on_report may increase memory consumption.
It may even cause an out-of-memory error if the computational graph of the loss function consumes almost all the memory available in your environment and an extension that uses a certain amount of memory (e.g. Evaluator) runs afterward.
Other utilities¶
Some obsolete classes and functions are removed¶
The following classes and functions are removed in Chainer v2.
- chainer.Flag
- chainer.FunctionSet (use Chain or ChainList instead)
- chainer.cuda.init (it did nothing except for calling check_cuda_available())
- chainer.cuda.empty (use cupy.empty())
- chainer.cuda.empty_like (use cupy.empty_like())
- chainer.cuda.full (use cupy.full())
- chainer.cuda.full_like (use cupy.full_like())
- chainer.cuda.ones (use cupy.ones())
- chainer.cuda.ones_like (use cupy.ones_like())
- chainer.cuda.zeros (use cupy.zeros())
- chainer.cuda.zeros_like (use cupy.zeros_like())
Comparison with Other Frameworks¶
A table for quick comparison¶
This table compares Chainer with other actively developed deep learning frameworks. Content is current as of July 2017.
 |  | Chainer | PyTorch | TensorFlow | Theano-based | Caffe1/Caffe2 | Torch7 | MXNet | DyNet | PaddlePaddle | DL4J | CNTK | neon | Knet.jl | Darknet | Thinc
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Basics | Language | Python | Python | Python | Python | Python/C++/ MATLAB | LuaJIT | Python/others | Python/C++ | Python/C++ | Java | BrainScript/ Python/C++ | Python | Julia | C | Python |
Approach | define-by-run | define-by-run | symbolic autograd | symbolic autograd | static | static/ manual grads | symbolic autograd/ manual grads/ define-by-run [1] | define-by-run | symbolic autograd | static/ manual grads/ symbolic autograd [2] | static/ symbolic autograd | static/ symbolic autograd [3] | define-by-run | static | callback-based define-by-run | |
CPU backend package | NumPy | TH | Eigen | NumPy | TH | mshadow | Eigen | ND4J | NumPy | Julia | NumPy | |||||
GPU backend package | CuPy | THC | Eigen | libgpuarray | THC | mshadow | Eigen | ND4J | neon | KnetArrays | CuPy | |||||
Primary sponsor | Preferred Networks | MILA | Amazon/Apache | CMU | Baidu | Skymind | Microsoft | Intel Nervana | Koç University | Joe Redmon | Explosion AI | |||||
NNs | CNNs | full | full | full | full | full | full | full | partial | full | full | full | full | partial | full | none |
RNNs | full | full | full | full | partial | full | full | full | full | full | full | partial | partial | partial | partial | |
Reverse-mode autograd | Y | Y | Y | Y | torch-autograd | Y | Y | Y | Y | ngraph | Y | with closures | ||||
Forward-mode autograd | tensorflow-forward-ad | Y | ||||||||||||||
Higher-order grads | Y | Y | Y | Y | ||||||||||||
Variable-length loops | native | native | while_loop | scan | RNNs only | native | 2017 | native | RNNs only | none | dynamic axis | none | native | none | native | |
Different architectures per batch | native | native | fold | torch-autograd | MinPy | native | native | native | ||||||||
Performance | cuDNN support | full | full | partial | partial | full | full | full | partial | full | partial | full | N/A [4] | partial | ||
CPU/GPU generic backend | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | |||||
Multi-GPU data parallelism | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | |||
Multi-GPU model parallelism | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | ||||||
Multiprocessing [5] | full | partial | full | |||||||||||||
Distributed training | ChainerMN | THD | Y | 2017 | torch-distlearn | Y | Y | Spark | Y | Y | ||||||
Misc | Runtime debugging | debug mode, typechecking, pdb | pdb | tfdbg | Monitor | pdb | Java debuggers | cntk.debugging | Gallium.jl | gdb | pdb | |||||
Trainer abstraction | native | tnt | Blocks, Lasagne, Keras | native | torchnet | native | native | native | native | native | ||||||
Reporter abstraction | native | tnt | native | torchnet | native | native | native | |||||||||
Web interface | TensorBoard | DL4J-UI | Nervana Cloud | |||||||||||||
Graph compilation engine | 2017 | XLA | 2017 | NNVM | ngraph |
[1] Define-by-run is in development as of June 2017 and tracked in dmlc/mxnet#5705. It is also possible using the much slower MinPy extension.
[2] Symbolic autograd is in development as of June 2017 and tracked in deeplearning4j/nd4j#1750.
[3] Symbolic autograd is available only with the ngraph backend (experimental).
[4] Nervana provides kernels that are meant to compete with cuDNN.
[5] Multiprocessing provides a significant performance improvement only for frameworks that use Python at runtime.
Benchmarks¶
Benchmarks for convolutional networks can be found at convnet-benchmarks while some NLP benchmarks are at dynet-benchmark. Chainer wraps the latest available cuDNN kernels for CNNs and RNNs, so performance of most common networks that use these kernels is typically similar to that of other modern frameworks. As Chainer’s define-by-run approach means the user’s Python code is executed directly at runtime, particularly complex networks or those with very small tensor sizes may be slower than in static-graph frameworks.
License¶
Copyright (c) 2015 Preferred Infrastructure, Inc.
Copyright (c) 2015 Preferred Networks, Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.