Standard Function implementations¶
Chainer provides basic Function implementations in the chainer.functions package.
Non-parameterized functions are provided as plain Python functions. These can be used directly in forward computation without explicit handling of Function objects. On the other hand, parameterized functions should be used with explicit handling of Function objects.
Learnable connections¶

class chainer.functions.BinaryHierarchicalSoftmax(in_size, tree)[source]¶
Implementation of hierarchical softmax (HSM).
In natural language applications, the vocabulary size is often too large to use the softmax loss. Instead, hierarchical softmax uses a product of sigmoid functions. It costs only \(O(\log(n))\) time on average, where \(n\) is the vocabulary size.
At first, a user needs to prepare a binary tree in which each leaf corresponds to a word in the vocabulary. When a word \(x\) is given, exactly one path from the root of the tree to the leaf of the word exists. Let \(\mbox{path}(x) = ((e_1, b_1), \dots, (e_m, b_m))\) be the path of \(x\), where \(e_i\) is an index of the \(i\)-th internal node, and \(b_i \in \{-1, 1\}\) indicates the direction to move at the \(i\)-th internal node (-1 is left, and 1 is right). Then, the probability of \(x\) is given as below:
\[\begin{split}P(x) &= \prod_{(e_i, b_i) \in \mbox{path}(x)}P(b_i | e_i) \\ &= \prod_{(e_i, b_i) \in \mbox{path}(x)}\sigma(b_i x^\top w_{e_i}),\end{split}\]where \(\sigma(\cdot)\) is a sigmoid function, and \(w\) is a weight matrix.
This function costs \(O(\log(n))\) time as the average length of paths is \(O(\log(n))\), and \(O(n)\) memory as the number of internal nodes equals \(n - 1\).
Parameters:  in_size (int) – Dimension of input vectors.
 tree – A binary tree made with tuples like ((1, 2), 3).
See: Hierarchical Probabilistic Neural Network Language Model [Morin+, AISTAT2005].
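The path probability above can be sketched in NumPy. This only illustrates the formula, not the BinaryHierarchicalSoftmax API; hsm_probability, the toy path, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hsm_probability(x, W, path):
    """Probability of a word given its root-to-leaf path.

    path is a list of (e_i, b_i) pairs, where e_i indexes an internal
    node (a row of W) and b_i in {-1, +1} is the branching direction.
    """
    p = 1.0
    for e, b in path:
        p *= sigmoid(b * x.dot(W[e]))
    return p

rng = np.random.RandomState(0)
x = rng.randn(4)       # input vector (in_size = 4)
W = rng.randn(3, 4)    # one weight row per internal node

# Probabilities over the two branches of a node sum to 1,
# since sigmoid(z) + sigmoid(-z) == 1.
p_left = hsm_probability(x, W, [(0, -1)])
p_right = hsm_probability(x, W, [(0, 1)])
```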

class chainer.functions.Convolution2D(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, bias=0, nobias=False, use_cudnn=True)[source]¶
Two-dimensional convolution function.
The details of this function are described below the arguments description.
Parameters:  in_channels (int) – Number of channels of input arrays.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or (int, int)) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or (int, int)) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or (int, int)) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 wscale (float) – Scaling factor of the initial weight.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this function does not use the bias term.
 use_cudnn (bool) – If True, then this function uses CuDNN if available.
This function holds at most two parameter arrays: W and b, which indicate the filter weight and the bias vector, respectively.
The filter weight has four dimensions \((c_O, c_I, k_H, k_W)\), which indicate the number of output channels, the number of input channels, and the height and width of the kernels, respectively. The filter weight is initialized with i.i.d. Gaussian random samples, each of which has zero mean and deviation \(\sqrt{1/(c_I k_H k_W)}\) by default. The deviation is scaled by wscale if specified.
The bias vector is of size \(c_O\). Each element of it is initialized by the bias argument. If the nobias argument is set to True, then this function does not hold the bias parameter.
The two-dimensional convolution function is defined as follows. Let \(X\) be the input tensor of dimensions \((n, c_I, h, w)\), where \(n\) is the batch size, and \((h, w)\) is the spatial size of the input image. Then the Convolution2D function computes correlations between filters and patches of size \((k_H, k_W)\) in \(X\). Note that correlation here is equivalent to the inner product between expanded vectors. Patches are extracted at positions shifted by multiples of stride from the first position -pad for each spatial axis. The rightmost (or bottommost) patches do not run over the padded spatial size.
Let \((s_Y, s_X)\) be the stride of filter application, and \((p_H, p_W)\) the spatial padding size. Then, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= (h + 2p_H - k_H) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W) / s_X + 1.\end{split}\]
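The output-size equations can be checked with a few lines of Python (conv_output_size is a hypothetical helper for illustration, not part of chainer.functions):

```python
def conv_output_size(h, w, k_h, k_w, s_y, s_x, p_h, p_w):
    # Integer division matches the implicit floor in the equations above.
    h_o = (h + 2 * p_h - k_h) // s_y + 1
    w_o = (w + 2 * p_w - k_w) // s_x + 1
    return h_o, w_o

# A 224x224 input with 7x7 kernels, stride 2 and padding 3 yields 112x112.
size = conv_output_size(224, 224, 7, 7, 2, 2, 3, 3)
```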

class chainer.functions.EmbedID(in_size, out_size)[source]¶
Efficient linear function for one-hot input.
This is a parameterized function to embed the given discrete identifiers (e.g. words) into a continuous vector space. This function just holds embedding vectors for all identifiers as one large learnable matrix W. The identifiers are directly used as indexes of the matrix W.
Parameters:  in_size (int) – Number of distinct identifiers.
 out_size (int) – Size of embedding vectors.
Note
This function is non-differentiable with respect to the input identifiers.
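The lookup this function describes amounts to indexing rows of W, which equals multiplication by one-hot vectors; a NumPy sketch (the matrix and identifiers are hypothetical, not the Chainer API):

```python
import numpy as np

in_size, out_size = 5, 3    # vocabulary size, embedding dimension
W = np.arange(in_size * out_size, dtype=np.float32).reshape(in_size, out_size)

ids = np.array([3, 0, 3])   # a minibatch of discrete identifiers
embedded = W[ids]           # row lookup: same result as one-hot vectors @ W
```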

class chainer.functions.Linear(in_size, out_size, wscale=1, bias=0, nobias=False)[source]¶
Implementation of a linear function (a.k.a. fully-connected layer or affine transformation).
This function holds a weight matrix W and a bias vector b.
The weight matrix W has shape (out_size, in_size). This matrix is initialized with i.i.d. Gaussian samples, each of which has zero mean and deviation \(\sqrt{1/\text{in_size}}\). The deviation is scaled by the factor wscale if specified.
The bias vector b is of size out_size. Each element is initialized with the bias value. If the nobias argument is set to True, then this function does not hold a bias vector.
Let \(X\) be an input matrix, and \(W, b\) the weight matrix and the bias vector, respectively. Then, the output matrix \(Y\) is computed by \(Y = XW^\top + b\), where the addition by \(b\) is broadcasted across the minibatch.
Parameters:  in_size (int) – Dimension of input vectors.
 out_size (int) – Dimension of output vectors.
Note
This function accepts an input variable of a non-matrix array. In this case, the leading dimension is treated as the batch dimension, and the other dimensions are reduced to one dimension.
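The forward computation, including the flattening of non-matrix inputs described in the note, can be sketched as follows (linear and the random shapes are hypothetical illustrations, not the Chainer implementation):

```python
import numpy as np

def linear(x, W, b):
    """Y = X W^T + b; non-matrix inputs are flattened past the batch axis."""
    x = x.reshape(len(x), -1)      # leading dimension = batch dimension
    return x.dot(W.T) + b

rng = np.random.RandomState(0)
W = rng.randn(2, 12)               # shape (out_size, in_size)
b = np.zeros(2)
x = rng.randn(4, 3, 2, 2)          # non-matrix input, flattened to (4, 12)
y = linear(x, W, b)
```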
Array manipulation functions¶

chainer.functions.concat(xs, axis=1)[source]¶
Concatenates given variables along an axis.
Parameters:  xs (tuple of Variables) – Variables to be concatenated.
 axis (int) – Axis along which the input arrays are concatenated.
Returns: Output variable.
Return type: Variable
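The underlying array operation corresponds to NumPy's concatenate; along the default axis 1, all other axes must match:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.arange(4.0).reshape(2, 2)
y = np.concatenate((a, b), axis=1)  # shapes (2, 3) and (2, 2) -> (2, 5)
```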

chainer.functions.copy(x, dst)[source]¶
Copies the input variable onto the specified device.
This function copies the array of the input variable onto the device specified by dst if the original array is on the GPU, and otherwise just copies the array within host memory.
Parameters:  x (Variable) – Variable to be copied.
 dst – Target device specifier.
Returns: Output variable.
Return type: Variable

chainer.functions.dropout(x, ratio=0.5, train=True)[source]¶
Drops elements of input variable randomly.
This function drops input elements randomly with probability ratio and scales the remaining elements by factor 1 / (1 - ratio). In testing mode, it does nothing and just returns x.
Parameters:  x (Variable) – Input variable.
 ratio (float) – Dropout ratio.
 train (bool) – If True, executes dropout. Otherwise, does nothing.
Returns: Output variable.
Return type: Variable
See the paper by G. Hinton: Improving neural networks by preventing co-adaptation of feature detectors.
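The scaling rule above ("inverted dropout") can be sketched in NumPy; this dropout is a hypothetical stand-in for illustration, not the Chainer implementation:

```python
import numpy as np

def dropout(x, ratio=0.5, train=True, rng=np.random):
    if not train:
        return x                          # testing mode: identity
    mask = rng.rand(*x.shape) >= ratio    # keep each element w.p. 1 - ratio
    return x * mask / (1.0 - ratio)       # rescale the survivors

x = np.ones(1000)
y = dropout(x, ratio=0.5, rng=np.random.RandomState(0))
```

Scaling the survivors by 1 / (1 - ratio) keeps the expected value of each element unchanged, which is why testing mode can simply return x.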
Activation functions¶

chainer.functions.leaky_relu(x, slope=0.2)[source]¶
Leaky Rectified Linear Unit function.
This function is expressed as \(f(x) = \max(x, ax)\), where \(a\) is a configurable slope value.
Parameters:  x (Variable) – Input variable.
 slope (float) – Slope value \(a\).
Returns: Output variable.
Return type: Variable

chainer.functions.lstm(c_prev, x)[source]¶
Long Short-Term Memory units as an activation function.
This function implements LSTM units with forget gates. Let \(c_{\text{prev}}\) be the previous cell state and \(x\) the incoming signal. First, the incoming signal \(x\) is split along the second dimension into four arrays \(a, i, f, o\) of the same shape. Then, it computes the outputs as:
\[\begin{split}c &= \tanh(a) \text{sigmoid}(i) + c_{\text{prev}} \text{sigmoid}(f), \\ h &= \tanh(c) \text{sigmoid}(o).\end{split}\]This function outputs these two arrays as a tuple of two variables.
Parameters:  c_prev (Variable) – Variable that holds the previous cell state.
 x (Variable) – Variable that holds the incoming signal. Its second dimension must be four times the size of the cell state.
Returns: Two Variable objects c and h. c is the updated cell state. h indicates the outgoing signal.
Return type: tuple
See the original paper proposing LSTM with forget gates: Long Short-Term Memory in Recurrent Neural Networks.

class chainer.functions.PReLU(shape=(), init=0.25)[source]¶
Parametric ReLU function.
The PReLU function is written elementwise as \(f(x) = \max(x, ax)\), where \(a\) is a learnable parameter array.
When the PReLU function is combined with two-dimensional convolution, the elements of parameter \(a\) are typically shared across different positions of the same filter map. In order to support such usage, this function accepts a parameter array whose shape matches the leading dimensions of the input arrays, excluding the batch dimension.
Parameters:  shape (tuple of ints) – Shape of the parameter array.
 init (float) – Initial parameter value.
See the details in the paper: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
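A minimal NumPy sketch of the elementwise computation with a per-channel slope (prelu and its inputs are hypothetical illustrations, not the Chainer API):

```python
import numpy as np

def prelu(x, a):
    """Elementwise max(x, a * x); `a` broadcasts over the batch axis."""
    return np.maximum(x, a * x)

a = np.full((3,), 0.25)            # one learnable slope per channel
x = np.array([[-4.0, 0.0, 2.0]])
y = prelu(x, a)                    # negative inputs are scaled by 0.25
```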

chainer.functions.relu(x, use_cudnn=True)[source]¶
Rectified Linear Unit function \(f(x)=\max(0, x)\).
Parameters:  x (Variable) – Input variable.
 use_cudnn (bool) – If True, then this function uses CuDNN if available.
Returns: Output variable.
Return type: Variable

chainer.functions.sigmoid(x, use_cudnn=True)[source]¶
Elementwise sigmoid logistic function \(f(x)=(1 + \exp(-x))^{-1}\).
Parameters:  x (Variable) – Input variable.
 use_cudnn (bool) – If True, then this function uses CuDNN if available.
Returns: Output variable.
Return type: Variable

chainer.functions.softmax(x, use_cudnn=True)[source]¶
Channel-wise softmax function.
This function only accepts a two-dimensional input array and computes its softmax along the second axis. For each index \(i, j\) of the input matrix \(x\), it computes \(f_{ij}(x)={\exp(x_{ij}) \over \sum_{j'} \exp(x_{ij'})}\).
Parameters:  x (Variable) – Input variable.
 use_cudnn (bool) – If True, then this function uses CuDNN if available.
Returns: Output variable.
Return type: Variable
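A numerically stable NumPy sketch of the row-wise computation; shifting each row by its maximum does not change the result, since the formula is invariant to adding a constant to a row:

```python
import numpy as np

def softmax(x):
    """Softmax of a 2-D array along the second axis."""
    e = np.exp(x - x.max(axis=1, keepdims=True))   # shift for stability
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
y = softmax(x)   # each row sums to 1; a constant row gives a uniform row
```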
Pooling functions¶

chainer.functions.average_pooling_2d(x, ksize, stride=None, pad=0, use_cudnn=True)[source]¶
Spatial average pooling function.
This function acts similarly to Convolution2D, but it computes the average of the input spatial patch for each channel, without any parameter, instead of computing the inner products.
Parameters:  x (Variable) – Input variable.
 ksize (int or (int, int)) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
 stride (int or (int, int) or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or (int, int)) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
 use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.
Returns: Output variable.
Return type: Variable
Note
This function currently does not support the cover_all mode that max_pooling_2d() does. Average pooling runs in non-cover-all mode.

chainer.functions.max_pooling_2d(x, ksize, stride=None, pad=0, cover_all=True, use_cudnn=True)[source]¶
Spatial max pooling function.
This function acts similarly to Convolution2D, but it computes the maximum of the input spatial patch for each channel, without any parameter, instead of computing the inner products.
Parameters:  x (Variable) – Input variable.
 ksize (int or (int, int)) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
 stride (int or (int, int) or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or (int, int)) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
 cover_all (bool) – If True, all spatial locations are pooled into some output pixels. It may make the output size larger.
 use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.
Returns: Output variable.
Return type: Variable
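A naive single-channel sketch of the patch-wise maximum (no padding, non-cover_all mode; the helper is a hypothetical illustration, not the Chainer implementation):

```python
import numpy as np

def max_pooling_2d(x, ksize, stride):
    """Max pooling over a 2-D array; patches must fit inside the input."""
    h, w = x.shape
    h_o = (h - ksize) // stride + 1
    w_o = (w - ksize) // stride + 1
    y = np.empty((h_o, w_o), dtype=x.dtype)
    for i in range(h_o):
        for j in range(w_o):
            y[i, j] = x[i * stride:i * stride + ksize,
                        j * stride:j * stride + ksize].max()
    return y

x = np.arange(16.0).reshape(4, 4)
y = max_pooling_2d(x, ksize=2, stride=2)   # 2x2 output of patch maxima
```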
Normalization functions¶

class chainer.functions.BatchNormalization(size, decay=0.9, eps=1e05)[source]¶
Batch normalization on outputs of linear or convolution functions.
Parameters:  size (int or tuple of ints) – Size (or shape) of the channel dimensions.
 decay (float) – Decay rate of the moving average.
 eps (float) – Epsilon value for numerical stability.
See: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

__call__(x, test=False, finetune=False)[source]¶
Invokes the forward propagation of BatchNormalization.
BatchNormalization accepts additional arguments, which control three different running modes.
Parameters:  x (Variable) – An input variable.
 test (bool) – If True, BatchNormalization runs in testing mode; it normalizes the input using precomputed statistics.
 finetune (bool) – If True, BatchNormalization runs in fine-tuning mode; it accumulates the input array to compute population statistics for normalization, and normalizes the input using batch statistics.
If test and finetune are both False, then BatchNormalization runs in training mode; it computes moving averages of the mean and variance for evaluation during training, and normalizes the input using batch statistics.


chainer.functions.local_response_normalization(x, n=5, k=2, alpha=0.0001, beta=0.75)[source]¶
Local response normalization across neighboring channels.
This function implements normalization across channels. Let \(x\) be an input image with \(N\) channels. Then, this function computes an output image \(y\) by the following formula:
\[y_i = {x_i \over \left( k + \alpha \sum_{j=\max(1, i - n/2)}^{\min(N, i + n/2)} x_j^2 \right)^\beta}.\]
Parameters:  x (Variable) – Input variable.
 n (int) – Normalization window width.
 k (float) – Smoothing parameter.
 alpha (float) – Normalizer scaling parameter.
 beta (float) – Normalizer power parameter.
Returns: Output variable.
Return type: Variable
See: Sec. 3.3 of ImageNet Classification with Deep Convolutional Neural Networks
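The per-channel formula can be sketched for a single pixel position; 0-based indexing replaces the 1-based indices of the formula, and the helper is a hypothetical illustration, not the Chainer implementation:

```python
import numpy as np

def local_response_normalization(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """x holds one value per channel; windows are clipped at the edges."""
    N = len(x)
    y = np.empty_like(x)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N, i + n // 2 + 1)
        y[i] = x[i] / (k + alpha * np.sum(x[lo:hi] ** 2)) ** beta
    return y

x = np.ones(8)
y = local_response_normalization(x)   # edge channels see smaller windows
```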
Loss, evaluation and aggregation¶

chainer.functions.accuracy(y, t)[source]¶
Computes multi-class classification accuracy of the minibatch.
Parameters:  y (Variable) – Variable holding the prediction scores.
 t (Variable) – Variable holding the ground-truth labels.
Returns: A variable holding a scalar array of the accuracy.
Return type: Variable
Note
This function is non-differentiable.
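The computation amounts to comparing the arg-max of each score row with the label (a NumPy sketch, not the Chainer implementation):

```python
import numpy as np

def accuracy(y, t):
    """Fraction of examples whose highest-scoring class matches the label."""
    return (y.argmax(axis=1) == t).mean()

y = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.3, 0.7]])
t = np.array([1, 0, 0])
acc = accuracy(y, t)   # two of three predictions are correct
```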

chainer.functions.mean_squared_error(x0, x1)[source]¶
Mean squared error function.
This function computes mean squared error between two variables. The mean is taken over the minibatch. Note that the error is not scaled by 1/2.
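A sketch of the computation under one reading of "mean over the minibatch": the squared error is summed per example and then averaged over the batch. The normalization choice is an assumption for illustration, not taken from the source.

```python
import numpy as np

def mean_squared_error(x0, x1):
    """Per-example squared error averaged over the batch; no 1/2 factor."""
    # NOTE: normalizing by batch size (not element count) is an assumption.
    d = (x0 - x1).reshape(len(x0), -1)
    return (d * d).sum(axis=1).mean()

x0 = np.array([[1.0, 2.0], [3.0, 4.0]])
x1 = np.zeros((2, 2))
loss = mean_squared_error(x0, x1)   # (5 + 25) / 2 = 15
```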

chainer.functions.softmax_cross_entropy(x, t, use_cudnn=True)[source]¶
Computes cross entropy loss on the softmax of the prediction using the ground-truth label vector.
Parameters:  x (Variable) – Variable holding the prediction scores.
 t (Variable) – Variable holding the ground-truth labels.
Returns: A variable holding a scalar array of the cross entropy loss.
Return type: Variable
Note
This function is differentiable only with respect to x.
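The loss can be sketched with a numerically stable log-softmax; this illustrates the formula only, and averaging over the minibatch is an assumption, not taken from the source:

```python
import numpy as np

def softmax_cross_entropy(x, t):
    """Mean over the batch of -log softmax(x) at the true labels (assumed)."""
    x = x - x.max(axis=1, keepdims=True)            # stable log-softmax
    log_p = x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(t)), t].mean()

x = np.array([[10.0, 0.0],
              [0.0, 10.0]])
t = np.array([0, 1])
loss = softmax_cross_entropy(x, t)   # near zero: scores match the labels
```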
Reusable subnetwork of complex architectures¶

class chainer.functions.Inception(in_channels, out1, proj3, out3, proj5, out5, proj_pool)[source]¶
Inception module of GoogLeNet.
It applies four different functions to the input array and concatenates their outputs along the channel dimension. Three of them are 2D convolutions of sizes 1x1, 3x3 and 5x5. The 3x3 and 5x5 convolution paths have 1x1 convolutions (called projections) ahead of them. The other path consists of a 1x1 convolution (projection) and 3x3 max pooling.
The output array has the same spatial size as the input. In order to satisfy this, the Inception module uses appropriate padding for each convolution and pooling.
See: Going Deeper with Convolutions.
Parameters:  in_channels (int) – Number of channels of input arrays.
 out1 (int) – Output size of 1x1 convolution path.
 proj3 (int) – Projection size of 3x3 convolution path.
 out3 (int) – Output size of 3x3 convolution path.
 proj5 (int) – Projection size of 5x5 convolution path.
 out5 (int) – Output size of 5x5 convolution path.
 proj_pool (int) – Projection size of max pooling path.
Returns: Output variable. Its array has the same spatial size and the same minibatch size as the input array. The channel dimension has size out1 + out3 + out5 + proj_pool.
Return type: Variable
Note
This function inserts the full computation graph of the Inception module behind the input array. This function itself is not inserted into the computation graph.
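The channel-wise concatenation of the four paths can be sketched with placeholder path outputs (the shapes and channel counts are hypothetical):

```python
import numpy as np

# Hypothetical path outputs sharing minibatch size 2 and spatial size 4x4,
# with out1=2, out3=4, out5=1 and proj_pool=3 channels respectively.
paths = [np.zeros((2, c, 4, 4)) for c in (2, 4, 1, 3)]
y = np.concatenate(paths, axis=1)   # channel dimension: 2 + 4 + 1 + 3 = 10
```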