Standard Function implementations¶
Chainer provides basic Function
implementations in the
chainer.functions
package. Most of them are wrapped by plain Python
functions, which users should use.
Note
As of v1.5, the concept of parameterized functions is gone, and they are
replaced by corresponding Link
implementations. They are
still put in the functions
namespace for backward
compatibility, though it is strongly recommended to use them via the
chainer.links
package.
Activation functions¶
clipped_relu¶
crelu¶

chainer.functions.
crelu
(x, axis=1)[source]¶ Concatenated Rectified Linear Unit function.
This function is expressed as \(f(x) = (\max(0, x), \max(0, -x))\), where the two output values are concatenated along an axis.
See: http://arxiv.org/abs/1603.05201
Parameters: Returns: Output variable.
Return type:
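As a rough illustration of the definition above, the following NumPy sketch concatenates the positive part of x and the positive part of -x along an axis. The helper name crelu_reference is hypothetical and not part of Chainer; it only mirrors the formula and does no automatic differentiation.

```python
import numpy as np


def crelu_reference(x, axis=1):
    """Hypothetical NumPy reference: concatenate relu(x) and relu(-x)."""
    x = np.asarray(x)
    return np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)], axis=axis)


x = np.array([[-1.0, 2.0], [3.0, -4.0]], dtype=np.float32)
y = crelu_reference(x)  # the size of the chosen axis doubles
```

Note that the length of the chosen axis doubles, so a layer that follows CReLU has to expect twice as many channels.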
elu¶

chainer.functions.
elu
(x, alpha=1.0)[source]¶ Exponential Linear Unit function.
This function is expressed as
\[\begin{split}f(x) = \left \{ \begin{array}{ll} x & {\rm if}~ x \ge 0 \\ \alpha (\exp(x) - 1) & {\rm if}~ x < 0, \end{array} \right.\end{split}\]where \(\alpha\) is a parameter. See: http://arxiv.org/abs/1511.07289
Parameters: Returns: Output variable.
Return type:
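The piecewise definition can be checked with a small NumPy sketch. The name elu_reference is hypothetical and only reproduces the formula; it is not the library implementation.

```python
import numpy as np


def elu_reference(x, alpha=1.0):
    """Hypothetical NumPy reference for ELU: x if x >= 0, else alpha * (exp(x) - 1)."""
    x = np.asarray(x, dtype=np.float64)
    # expm1 is evaluated only on the non-positive part to avoid overflow for large x.
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0)))


values = elu_reference([-1.0, 0.0, 2.0])
```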
hard_sigmoid¶

chainer.functions.
hard_sigmoid
(x)[source]¶ Element-wise hard-sigmoid function.
This function is defined as
\[\begin{split}f(x) = \left \{ \begin{array}{ll} 0 & {\rm if}~ x < -2.5 \\ 0.2 x + 0.5 & {\rm if}~ -2.5 < x < 2.5 \\ 1 & {\rm if}~ 2.5 < x. \end{array} \right.\end{split}\]Parameters: x (Variable) – Input variable. Returns: Output variable. Return type: Variable
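Since the middle branch 0.2x + 0.5 reaches 0 at x = -2.5 and 1 at x = 2.5, the three cases amount to clipping a line to the interval [0, 1]. A minimal NumPy sketch, with the hypothetical helper name hard_sigmoid_reference:

```python
import numpy as np


def hard_sigmoid_reference(x):
    """Hypothetical NumPy reference: clip 0.2 * x + 0.5 to the range [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)


values = hard_sigmoid_reference([-3.0, 0.0, 1.0, 3.0])
```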
leaky_relu¶
log_softmax¶

chainer.functions.
log_softmax
(x, use_cudnn=True)[source]¶ Channel-wise log-softmax function.
This function computes the logarithm of the softmax along the second axis. Let \(x = (x_1, x_2, \dots, x_d)^{\top}\) be the d-dimensional index array and \(f(x_1, \dots, x_d)\) be the corresponding input array. For each index \(x\) in the input array, it computes the logarithm of the probability \(\log p(x)\) defined as
\[p(x) = {\exp(f(x_1, x_2, \dots, x_d)) \over \sum_{x'_2} \exp(f(x_1, x'_2, \dots, x_d))}.\]
This method is theoretically equivalent to log(softmax(x)) but is more stable.
Note
log(softmax(x)) may cause underflow when x is too small, because softmax(x) may return 0. The log_softmax method is more stable.
Parameters: Returns: Output variable.
Return type: See also
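The stability remark can be illustrated with NumPy. The sketch below uses the common max-subtraction (log-sum-exp) trick; this is an assumption about how a stable version may be computed, not a description of the library's internals, and log_softmax_reference is a hypothetical name.

```python
import numpy as np


def log_softmax_reference(x, axis=1):
    """Hypothetical stable log-softmax using the log-sum-exp trick."""
    x = np.asarray(x, dtype=np.float64)
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


x = np.array([[0.0, -2000.0]])

# Naive version: exp(-2000) underflows to 0, so the logarithm becomes -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.exp(x) / np.exp(x).sum(axis=1, keepdims=True))

stable = log_softmax_reference(x)  # stays finite
```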
lstm¶

chainer.functions.
lstm
(c_prev, x)[source]¶ Long Short-Term Memory units as an activation function.
This function implements LSTM units with forget gates. Let the previous cell state be \(c_{\text{prev}}\) and the incoming signal be \(x\).
First, the incoming signal \(x\) is split into four arrays \(a, i, f, o\) of the same shape along the second axis. This means that the second axis of \(x\) must have 4 times the length of that of \(c_{\text{prev}}\).
The split input signals correspond to:
 \(a\) : sources of cell input
 \(i\) : sources of input gate
 \(f\) : sources of forget gate
 \(o\) : sources of output gate
Second, it computes outputs as:
\[\begin{split}c &= \tanh(a) \text{sigmoid}(i) + c_{\text{prev}} \text{sigmoid}(f), \\ h &= \tanh(c) \text{sigmoid}(o).\end{split}\]These are returned as a tuple of two variables.
This function supports variable length inputs. The mini-batch size of the current input must be equal to or smaller than that of the previous one. When the mini-batch size of x is smaller than that of c, this function only updates c[0:len(x)] and doesn’t change the rest of c, c[len(x):]. So, please sort input sequences in descending order of length before applying the function.
Parameters: Returns: Two Variable objects c and h. c is the updated cell state. h indicates the outgoing signal.
Return type: See the original paper proposing LSTM with forget gates: Long Short-Term Memory in Recurrent Neural Networks.
Example
Assume that y is the current input signal, c is the previous cell state, and h is the previous output signal from an lstm function. Each of y, c and h has n_units channels. The most typical preparation of x is:
>>> import numpy as np
>>> import chainer
>>> import chainer.functions as F
>>> n_units = 100
>>> y = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> h = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> c = chainer.Variable(np.zeros((1, n_units), 'f'))
>>> model = chainer.Chain(w=F.Linear(n_units, 4 * n_units),
...                       v=F.Linear(n_units, 4 * n_units),)
>>> x = model.w(y) + model.v(h)
>>> c, h = F.lstm(c, x)
This corresponds to calculating the input sources \(a, i, f, o\) from the current input y and the previous output h. Different parameters are used for different kinds of input sources.
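To make the update equations for c and h concrete, here is a minimal NumPy sketch of a single forward step. The name lstm_step_reference is hypothetical. The sketch assumes that the four gate sources occupy contiguous quarters of the second axis, which the text above does not specify and which may differ from the library's actual memory layout. It also ignores the variable-length batch handling.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def lstm_step_reference(c_prev, x):
    """Hypothetical single LSTM step following the equations for c and h."""
    # Assumption: a, i, f, o are contiguous quarters of axis 1.
    a, i, f, o = np.split(x, 4, axis=1)
    c = np.tanh(a) * sigmoid(i) + c_prev * sigmoid(f)
    h = np.tanh(c) * sigmoid(o)
    return c, h


n_units = 3
c_prev = np.zeros((2, n_units))
x = np.zeros((2, 4 * n_units))
x[:, :n_units] = 1.0  # cell-input sources a = 1, all gate sources at 0
c, h = lstm_step_reference(c_prev, x)
```

With all gate sources at zero, every sigmoid evaluates to 0.5, so c equals 0.5 * tanh(1) and h equals 0.5 * tanh(c).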
maxout¶

chainer.functions.
maxout
(x, pool_size, axis=1)[source]¶ Maxout activation function.
It accepts an input tensor x, reshapes the axis dimension (say the size being M * pool_size) into two dimensions (M, pool_size), and takes the maximum along the new pool_size dimension. The output of this function has the same shape as x except that the axis dimension is transformed from M * pool_size to M.
Typically, x is the output of a linear layer or a convolution layer. The following is an example where we use maxout() in combination with a Linear link.
>>> in_size, out_size, pool_size = 100, 100, 100
>>> l = L.Linear(in_size, out_size * pool_size)
>>> x = chainer.Variable(np.zeros((1, in_size), 'f'))  # prepare data
>>> x = l(x)
>>> y = F.maxout(x, pool_size)
Parameters: x (Variable) – Input variable. Its first dimension is assumed to be the minibatch dimension. The other dimensions are treated as one concatenated dimension. Returns: Output variable. Return type: Variable See also
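The reshape-then-maximum description can be sketched in NumPy as follows. The helper maxout_reference is hypothetical; it assumes a non-negative axis whose length is divisible by pool_size.

```python
import numpy as np


def maxout_reference(x, pool_size, axis=1):
    """Hypothetical NumPy reference: split `axis` into (M, pool_size), then max."""
    x = np.asarray(x)
    m = x.shape[axis] // pool_size
    shape = x.shape[:axis] + (m, pool_size) + x.shape[axis + 1:]
    # The pool_size dimension sits right after `axis` in the reshaped array.
    return x.reshape(shape).max(axis=axis + 1)


x = np.arange(12).reshape(2, 6)
y = maxout_reference(x, pool_size=3)
```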
prelu¶

chainer.functions.
prelu
(x, W)[source]¶ Parametric ReLU function.
It accepts two arguments: an input x and a weight array W, and computes the output as \(PReLU(x) = \max(x, W*x)\), where \(*\) is an element-wise multiplication for each sample in the batch.
When the PReLU function is combined with two-dimensional convolution, the elements of parameter \(W\) are typically shared across the same filter of different pixels. In order to support such usage, this function supports the shape of parameter array that indicates leading dimensions of input arrays except the batch dimension.
For example, if \(W\) has the shape \((2, 3, 4)\), \(x\) must have the shape \((B, 2, 3, 4, S1, ..., SN)\), where B is the batch size and the number of trailing S’s is an arbitrary non-negative integer.
Parameters: Returns: Output variable
Return type: See also
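The shape rule for W can be illustrated with NumPy broadcasting. In this sketch (prelu_reference is a hypothetical name), W is padded with a leading batch axis and trailing singleton axes so that it lines up with the leading non-batch dimensions of x.

```python
import numpy as np


def prelu_reference(x, W):
    """Hypothetical NumPy reference for max(x, W * x) with W broadcast over x."""
    x = np.asarray(x, dtype=np.float64)
    W = np.asarray(W, dtype=np.float64)
    trailing = x.ndim - 1 - W.ndim  # number of trailing S dimensions
    W_b = W.reshape((1,) + W.shape + (1,) * trailing)
    return np.maximum(x, W_b * x)


x = -np.ones((2, 3, 5))          # shape (B, 3, S1)
W = np.array([0.1, 0.2, 0.5])    # one slope per channel
out = prelu_reference(x, W)
```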
relu¶
sigmoid¶
slstm¶

chainer.functions.
slstm
(c_prev1, c_prev2, x1, x2)[source]¶ S-LSTM units as an activation function.
This function implements the S-LSTM unit. It is an extension of the LSTM unit applied to tree structures. The function is applied to binary trees, in which each node has two child nodes. It takes four arguments: previous cell states \(c_1\) and \(c_2\), and incoming signals \(x_1\) and \(x_2\).
First, both input signals \(x_1\) and \(x_2\) are split into eight arrays in total, \(a_1, i_1, f_1, o_1\) and \(a_2, i_2, f_2, o_2\). They have the same shape along the second axis. This means that the second axis of \(x_1\) and \(x_2\) must have 4 times the length of that of \(c_{1 \text{prev}}\) and \(c_{2 \text{prev}}\).
The split input signals correspond to:
 \(a_i\) : sources of cell input
 \(i_i\) : sources of input gate
 \(f_i\) : sources of forget gate
 \(o_i\) : sources of output gate
It computes outputs as:
\[\begin{split}c &= \tanh(a_1 + a_2) \sigma(i_1 + i_2) + c_{1 \text{prev}} \sigma(f_1) + c_{2 \text{prev}} \sigma(f_2), \\ h &= \tanh(c) \sigma(o_1 + o_2),\end{split}\]where \(\sigma\) is the elementwise sigmoid function. The function returns \(c\) and \(h\) as a tuple.
Parameters:  c_prev1 (Variable) – Variable that holds the previous cell state of the first child node. The cell state should be a zero array or the output of the previous call of LSTM.
 c_prev2 (Variable) – Variable that holds the previous cell state of the second child node.
 x1 (Variable) – Variable that holds the incoming signal from the first child node. Its second dimension must be four times that of the cell state.
 x2 (Variable) – Variable that holds the incoming signal from the second child node.
Returns: Two Variable objects c and h. c is the cell state. h indicates the outgoing signal.
Return type: See detail in paper: Long Short-Term Memory Over Tree Structures.
softmax¶

chainer.functions.
softmax
(x, use_cudnn=True)[source]¶ Channel-wise softmax function.
This function computes the softmax along the second axis. Let \(x = (x_1, x_2, \dots, x_d)^{\top}\) be the d-dimensional index array and \(f(x)\) be the d-dimensional input array. For each index \(x\) of the input array \(f(x)\), it computes the probability \(p(x)\) defined as \(p(x) = {\exp(f(x)) \over \sum_{x'_2} \exp(f(x_1, x'_2, \dots, x_d))}\).
Parameters: Returns: Output variable.
Return type:
Array manipulations¶
broadcast¶
broadcast_to¶
cast¶
concat¶
copy¶

chainer.functions.
copy
(x, dst)[source]¶ Copies the input variable onto the specified device.
This function copies the array of the input variable onto the device specified by dst. When dst == -1, it copies the array onto the host memory. This function supports copies from host to device, from device to device and from device to host.
Parameters:  x (Variable) – Variable to be copied.
 dst – Target device specifier.
Returns: Output variable.
Return type:
dstack¶
expand_dims¶
flatten¶
get_item¶
hstack¶
permutate¶

chainer.functions.
permutate
(x, indices, axis=0, inv=False)[source]¶ Permutates a given variable along an axis.
This function permutates x with the given indices. That means y[i] = x[indices[i]] for all i. Note that this result is the same as y = x.take(indices). indices must be a permutation of [0, 1, ..., len(x) - 1].
When inv is True, indices is treated as its inverse. That means y[indices[i]] = x[i].
Parameters: Returns: Output variable.
Return type:
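The two index conventions are easy to confuse, so a small NumPy sketch may help. Here permutate_reference is a hypothetical helper; it relies on the fact that the inverse of a permutation can be obtained with argsort.

```python
import numpy as np


def permutate_reference(x, indices, axis=0, inv=False):
    """Hypothetical NumPy reference for permutation along an axis."""
    x = np.asarray(x)
    indices = np.asarray(indices)
    if inv:
        # y[indices[i]] = x[i] is the same as taking with the inverse permutation.
        indices = np.argsort(indices)
    return np.take(x, indices, axis=axis)


x = np.array([10, 20, 30])
forward = permutate_reference(x, [2, 0, 1])             # y[i] = x[indices[i]]
inverse = permutate_reference(x, [2, 0, 1], inv=True)   # y[indices[i]] = x[i]
```

Applying the forward permutation and then the inverse one with the same indices returns the original array.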
reshape¶
rollaxis¶
select_item¶
separate¶

chainer.functions.
separate
(x, axis=0)[source]¶ Separates an array along a given axis.
This function separates an array along a given axis. For example, suppose the shape of an array is (2, 3, 4). When it separates the array with axis=1, it returns three (2, 4) arrays.
This function is an inverse of chainer.functions.stack().
Parameters:  x (chainer.Variable) – Variable to be separated.
 axis (int) – Axis along which variables are separated.
Returns: Output variables.
Return type: tuple of chainer.Variable
See also
split_axis¶

chainer.functions.
split_axis
(x, indices_or_sections, axis, force_tuple=False)[source]¶ Splits given variables along an axis.
Parameters:  x (tuple of Variables) – Variables to be split.
 indices_or_sections (int or 1D array) – If this argument is an integer, N, the array will be divided into N equal arrays along axis. If it is a 1D array of sorted integers, it indicates the positions where the array is split.
 axis (int) – Axis that the input array is split along.
 force_tuple (bool) – If True, this method returns a tuple even when the number of outputs is one.
Returns: Return type: Note
This function raises ValueError if at least one of the outputs is split to zero-size (i.e. the axis-th value of its shape is zero).
squeeze¶

chainer.functions.
squeeze
(x, axis=None)[source]¶ Removes dimensions of size one from the shape of an ndarray.
Parameters:  x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input data.
 axis (None or int or tuple of ints) – A subset of the single-dimensional entries in the shape to remove. If None is supplied, all of them are removed. The dimension index starts at zero. If an axis with dimension greater than one is selected, an error is raised.
Returns: Variable whose dimensions of size 1 are removed.
Return type:
stack¶
swapaxes¶
tile¶

chainer.functions.
tile
(x, reps)[source]¶ Construct an array by tiling a given array.
Parameters:  x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input data.
 reps (int or tuple of ints) – The number of times x is replicated along each axis.
Returns: Variable holding the tiled array.
Return type:
transpose¶

chainer.functions.
transpose
(x, axes=None)[source]¶ Permute the dimensions of an input variable without copy.
Parameters:  x (Variable) – Input variable.
 axes (tuple of ints) – By default, reverse the dimensions, otherwise permute the axes according to the values given.
Returns: Variable whose axes are permuted.
Return type:
transpose_sequence¶

chainer.functions.
transpose_sequence
(xs)[source]¶ Transpose a list of Variables.
This function transposes a list of Variables and returns a list of Variables. For example, if a user gives [(0, 1, 2, 3), (4, 5), (6)], the function returns [(0, 4, 6), (1, 5), (2), (3)]. Note that the given list needs to be sorted by the length of each Variable in descending order, as in the example.
Parameters: xs (list of chainer.Variable) – Variables to transpose. Returns: Transposed list. Return type: tuple or Variable
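The example in the description can be reproduced with plain Python lists. The helper transpose_sequence_reference below is hypothetical and works on lists rather than Variables; it assumes the input is already sorted from longest to shortest.

```python
def transpose_sequence_reference(xs):
    """Hypothetical list-based reference for transposing ragged sequences."""
    if not xs:
        return []
    # Time step t collects the t-th element of every sequence that is long enough.
    return [[seq[t] for seq in xs if t < len(seq)] for t in range(len(xs[0]))]


result = transpose_sequence_reference([[0, 1, 2, 3], [4, 5], [6]])
```

Because the sequences are sorted by length, each output row is no longer than the previous one, which is the layout that functions consuming variable-length mini-batches expect.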
vstack¶
Neural network connections¶
bilinear¶

chainer.functions.
bilinear
(e1, e2, W, V1=None, V2=None, b=None)[source]¶ Applies a bilinear function based on given parameters.
This is a building block of Neural Tensor Network (see the reference paper below). It takes two input variables and one or four parameters, and outputs one variable.
To be precise, denote six input arrays mathematically by \(e^1\in \mathbb{R}^{I\cdot J}\), \(e^2\in \mathbb{R}^{I\cdot K}\), \(W\in \mathbb{R}^{J \cdot K \cdot L}\), \(V^1\in \mathbb{R}^{J \cdot L}\), \(V^2\in \mathbb{R}^{K \cdot L}\), and \(b\in \mathbb{R}^{L}\), where \(I\) is minibatch size. In this document, we call \(V^1\), \(V^2\), and \(b\) linear parameters.
The output of forward propagation is calculated as
\[y_{il} = \sum_{jk} e^1_{ij} e^2_{ik} W_{jkl} + \ \sum_{j} e^1_{ij} V^1_{jl} + \sum_{k} e^2_{ik} V^2_{kl} + b_{l}.\]Note that V1, V2, b are optional. If these are not given, then this function omits the last three terms in the above equation.
Note
This function accepts an input variable e1 or e2 of a non-matrix array. In this case, the leading dimension is treated as the batch dimension, and the other dimensions are reduced to one dimension.
Note
In the original paper, \(J\) and \(K\) must be equal and the author denotes \([V^1 V^2]\) (concatenation of matrices) by \(V\).
Parameters: Returns: Output variable.
Return type:  See:
 Reasoning With Neural Tensor Networks for Knowledge Base Completion [Socher+, NIPS2013].
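The forward formula for \(y_{il}\) maps directly onto an Einstein summation. The following NumPy sketch is only a reference for the arithmetic; bilinear_reference is a hypothetical name, and the sketch assumes that the linear parameters V1, V2 and b are either all given or all omitted.

```python
import numpy as np


def bilinear_reference(e1, e2, W, V1=None, V2=None, b=None):
    """Hypothetical NumPy reference for the bilinear forward computation."""
    # Sum over j and k of e1[i, j] * e2[i, k] * W[j, k, l].
    y = np.einsum("ij,ik,jkl->il", e1, e2, W)
    if V1 is not None:
        y = y + e1 @ V1 + e2 @ V2 + b
    return y


I, J, K, L = 2, 3, 4, 5
e1, e2 = np.ones((I, J)), np.ones((I, K))
W = np.ones((J, K, L))
y_plain = bilinear_reference(e1, e2, W)
y_full = bilinear_reference(e1, e2, W, np.ones((J, L)), np.ones((K, L)), np.ones(L))
```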
convolution_2d¶

chainer.functions.
convolution_2d
(x, W, b=None, stride=1, pad=0, use_cudnn=True, cover_all=False, deterministic=False)[source]¶ Two-dimensional convolution function.
This is an implementation of two-dimensional convolution in ConvNets. It takes three variables: the input image x, the filter weight W, and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(k_H\) and \(k_W\) are the height and width of the filters, respectively.
Parameters:  x (Variable) – Input variable of shape \((n, c_I, h, w)\).
 W (Variable) – Weight variable of shape \((c_O, c_I, k_H, k_W)\).
 b (Variable) – Bias variable of length \(c_O\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 cover_all (bool) – If True, all spatial locations are convoluted into some output pixels. It may make the output size larger.
 deterministic (bool) – The output of this function can be non-deterministic when it uses cuDNN. If this option is True, then it forces cuDNN to use a deterministic algorithm. This option is only available for cuDNN version >= v4.
Returns: Output variable.
Return type: The two-dimensional convolution function is defined as follows. The Convolution2D function computes correlations between filters and patches of size \((k_H, k_W)\) in x. Note that correlation here is equivalent to the inner product between expanded vectors. Patches are extracted at positions shifted by multiples of stride from the first position -pad for each spatial axis. The right-most (or bottom-most) patches do not run over the padded spatial size.
Let \((s_Y, s_X)\) be the stride of filter application, and \((p_H, p_W)\) the spatial padding size. Then, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= (h + 2p_H - k_H) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W) / s_X + 1.\end{split}\]
If the bias vector is given, then it is added to all spatial locations of the output of convolution.
See also
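As a quick worked check of the output-size equations, the sketch below evaluates them for one spatial axis. The helper conv_outsize is a hypothetical name, and the division is assumed to round down (floor division), which the formula itself does not state. The cover_all=True case is not covered.

```python
def conv_outsize(size, k, s, p):
    """Hypothetical helper: output length (size + 2p - k) / s + 1 for one axis."""
    # Assumption: the division rounds down.
    return (size + 2 * p - k) // s + 1


same_size = conv_outsize(32, k=3, s=1, p=1)    # 3x3 filter, padding 1 keeps the size
halved = conv_outsize(224, k=7, s=2, p=3)      # 7x7 filter, stride 2 halves the size
```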
convolution_nd¶

chainer.functions.
convolution_nd
(x, W, b=None, stride=1, pad=0, use_cudnn=True, cover_all=False)[source]¶ N-dimensional convolution function.
This is an implementation of N-dimensional convolution, which generalizes two-dimensional convolution in ConvNets. It takes three variables: the input x, the filter weight W and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(N\) is the number of spatial dimensions.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(d_1, d_2, ..., d_N\) are the size of each axis of the input’s spatial dimensions, respectively.
 \(k_1, k_2, ..., k_N\) are the size of each axis of the filters, respectively.
Parameters:  x (Variable) – Input variable of shape \((n, c_I, d_1, d_2, ..., d_N)\).
 W (Variable) – Weight variable of shape \((c_O, c_I, k_1, k_2, ..., k_N)\).
 b (Variable) – One-dimensional bias variable with length \(c_O\) (optional).
 stride (int or tuple of ints) – Stride of filter applications \((s_1, s_2, ..., s_N)\). stride=s is equivalent to (s, s, ..., s).
 pad (int or tuple of ints) – Spatial padding width for input arrays \((p_1, p_2, ..., p_N)\). pad=p is equivalent to (p, p, ..., p).
 use_cudnn (bool) – If True, then this function uses cuDNN if available. See below for the exact conditions.
 cover_all (bool) – If True, all spatial locations are convoluted into some output pixels. It may make the output size larger. cover_all needs to be False if you want to use cuDNN.
Returns: Output variable.
Return type: This function uses cuDNN implementation for its forward and backward computation if ALL of the following conditions are satisfied:
 cuda.cudnn_enabled is True
 use_cudnn is True
 The number of spatial dimensions is more than one.
 cover_all is False
 The input’s dtype is equal to the filter weight’s.
 The dtype is FP32, FP64 or FP16 (FP16 requires cuDNN version v3 or later)
See also
deconvolution_2d¶

chainer.functions.
deconvolution_2d
(x, W, b=None, stride=1, pad=0, outsize=None, use_cudnn=True, deterministic=False)[source]¶ Two-dimensional deconvolution function.
This is an implementation of two-dimensional deconvolution. It takes three variables: the input image x, the filter weight W, and the bias vector b.
Parameters:  x (Variable) – Input variable of shape \((n, c_I, h, w)\).
 W (Variable) – Weight variable of shape \((c_I, c_O, k_H, k_W)\).
 b (Variable) – Bias variable of length \(c_O\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 outsize (tuple) – Expected output size of the deconvolutional operation. It should be a pair of height and width \((out_H, out_W)\). Default value is None and the outsize is estimated by input size, stride and pad.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 deterministic (bool) – The output of this function can be non-deterministic when it uses cuDNN. If this option is True, then it forces cuDNN to use a deterministic algorithm. This option is only available for cuDNN version >= v4.
The filter weight has four dimensions \((c_I, c_O, k_H, k_W)\) which indicate the number of input channels, output channels, height and width of the kernels, respectively.
The bias vector is of size \(c_O\).
Let \(X\) be the input tensor of dimensions \((n, c_I, h, w)\), \((s_Y, s_X)\) the stride of filter application, and \((p_H, p_W)\) the spatial padding size. Then, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= s_Y (h - 1) + k_H - 2p_H,\\ w_O &= s_X (w - 1) + k_W - 2p_W.\end{split}\]
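The deconvolution output size is the inverse of the convolution output size when the stride divides evenly. The sketch below checks this for one axis; both helper names are hypothetical, and the convolution formula is assumed to use floor division.

```python
def deconv_outsize(size, k, s, p):
    """Hypothetical helper: output length s * (size - 1) + k - 2p for one axis."""
    return s * (size - 1) + k - 2 * p


def conv_outsize(size, k, s, p):
    """Hypothetical helper for the matching convolution (floor division assumed)."""
    return (size + 2 * p - k) // s + 1


upsampled = deconv_outsize(16, k=4, s=2, p=1)        # 16 -> 32
round_trip = conv_outsize(upsampled, k=4, s=2, p=1)  # back to 16
```

When the convolution discards a remainder, several input sizes map to the same output size, which is why the outsize argument exists to pick one explicitly.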
deconvolution_nd¶

chainer.functions.
deconvolution_nd
(x, W, b=None, stride=1, pad=0, outsize=None, use_cudnn=True)[source]¶ N-dimensional deconvolution function.
This is an implementation of N-dimensional deconvolution, which generalizes the two-dimensional one. It takes three variables: the input x, the filter weight W, and the bias vector b.
Parameters:  x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input data of shape \((n, c_I, d_1, d_2, ..., d_N)\).
 W (chainer.Variable or numpy.ndarray or cupy.ndarray) – Weight data of shape \((c_I, c_O, k_1, k_2, ..., k_N)\).
 b (chainer.Variable or numpy.ndarray or cupy.ndarray) – Bias vector of length \(c_O\) (optional).
 stride (int or tuple of ints) – Stride of filter applications \((s_1, s_2, ..., s_N)\). stride=s is equivalent to (s, s, ..., s).
 pad (int or tuple of ints) – Spatial padding size for input arrays \((p_1, p_2, ..., p_N)\). pad=p is equivalent to (p, p, ..., p).
 outsize (tuple of ints) – Expected output size of the deconvolutional operation. It should be a tuple of ints \((out_1, out_2, ..., out_N)\). Default value is None and the outsize is estimated by input size, stride and pad.
 use_cudnn (bool) – If True, then this function uses cuDNN if available. Note that cuDNN only supports deconvolution operations with more than one spatial dimension.
Returns: Output variable.
Return type: The filter weight has the following dimensions \((c_I, c_O, k_1, k_2, ..., k_N)\), which indicate the number of input channels, the number of output channels and the filter’s spatial sizes, respectively.
The one-dimensional bias vector is of size \(c_O\).
Let \(X\) be the input tensor of dimensions \((n, c_I, d_1, d_2, ..., d_N)\), \((s_1, s_2, ..., s_N)\) the stride of filter applications, and \((p_1, p_2, ..., p_N)\) the spatial padding size. Then the output size \((out_1, out_2, ..., out_N)\) is determined by the following equations:
\[\begin{split}out_1 &= s_1 (d_1 - 1) + k_1 - 2 p_1,\\ out_2 &= s_2 (d_2 - 1) + k_2 - 2 p_2,\\ ...,\\ out_N &= s_N (d_N - 1) + k_N - 2 p_N.\end{split}\]
See also
links.DeconvolutionND, deconvolution_2d()
dilated_convolution_2d¶

chainer.functions.
dilated_convolution_2d
(x, W, b=None, stride=1, pad=0, dilate=1, use_cudnn=True, cover_all=False)[source]¶ Two-dimensional dilated convolution function.
This is an implementation of two-dimensional dilated convolution in ConvNets. It takes three variables: the input image x, the filter weight W, and the bias vector b.
Notation: here is a notation for dimensionalities.
 \(n\) is the batch size.
 \(c_I\) and \(c_O\) are the number of the input and output channels, respectively.
 \(h\) and \(w\) are the height and width of the input image, respectively.
 \(k_H\) and \(k_W\) are the height and width of the filters, respectively.
Parameters:  x (Variable) – Input variable of shape \((n, c_I, h, w)\).
 W (Variable) – Weight variable of shape \((c_O, c_I, k_H, k_W)\).
 b (Variable) – Bias variable of length \(c_O\) (optional).
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 dilate (int or pair of ints) – Dilation factor of filter applications. dilate=d and dilate=(d, d) are equivalent.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 cover_all (bool) – If True, all spatial locations are convoluted into some output pixels. It may make the output size larger.
Returns: Output variable.
Return type: The two-dimensional dilated convolution function is defined as follows. The DilatedConvolution2D function computes correlations between filters and patches of size \((k_H, k_W)\) in x. Note that correlation here is equivalent to the inner product between expanded vectors. Patches are extracted at intervals of the dilation factor and at positions shifted by multiples of stride from the first position -pad for each spatial axis. The right-most (or bottom-most) patches do not run over the padded spatial size.
Let \((s_Y, s_X)\) be the stride of filter application, \((p_H, p_W)\) the spatial padding size, and \((d_Y, d_X)\) the dilation factor of filter application. Then, the output size \((h_O, w_O)\) is determined by the following equations:
\[\begin{split}h_O &= (h + 2p_H - k_H - (k_H - 1) * (d_Y - 1)) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W - (k_W - 1) * (d_X - 1)) / s_X + 1.\end{split}\]
If the bias vector is given, then it is added to all spatial locations of the output of convolution.
See also
DilatedConvolution2D
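A dilated filter of size k with dilation d covers k + (k - 1)(d - 1) input positions, which is where the extra term in the output-size equations comes from. The sketch below evaluates the formula for one axis; dilated_conv_outsize is a hypothetical name and the division is assumed to round down.

```python
def dilated_conv_outsize(size, k, s, p, d):
    """Hypothetical helper for the dilated output length along one axis."""
    # Assumption: the division rounds down.
    return (size + 2 * p - k - (k - 1) * (d - 1)) // s + 1


plain = dilated_conv_outsize(32, k=3, s=1, p=1, d=1)    # d=1 is ordinary convolution
dilated = dilated_conv_outsize(32, k=3, s=1, p=2, d=2)  # effective filter size is 5
strided = dilated_conv_outsize(10, k=3, s=2, p=0, d=2)
```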
embed_id¶

chainer.functions.
embed_id
(x, W, ignore_label=None)[source]¶ Efficient linear function for one-hot input.
This function implements so-called word embedding. It takes two arguments: a set of IDs (words) x as a \(B\)-dimensional integer vector, and a set of all ID (word) embeddings W as a \(V \times d\) float32 matrix. It outputs a \(B \times d\) matrix whose i-th row is the x[i]-th row of W.
This function is only differentiable on the input W.
.Parameters: Returns: Output variable.
Return type: See also
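The phrase "efficient linear function for one-hot input" can be made concrete: looking up rows of W gives the same result as multiplying a one-hot matrix by W, without building the one-hot matrix. A NumPy sketch of this equivalence follows; the variable names are illustrative and ignore_label handling is left out.

```python
import numpy as np

V, d = 4, 3
W = np.arange(V * d, dtype=np.float32).reshape(V, d)  # one embedding per row
x = np.array([2, 0], dtype=np.int32)                  # B = 2 word IDs

looked_up = W[x]                           # row lookup, shape (B, d)
one_hot = np.eye(V, dtype=np.float32)[x]   # explicit one-hot input, shape (B, V)
via_matmul = one_hot @ W                   # the equivalent linear function
```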
linear¶
n_step_lstm¶

chainer.functions.
n_step_lstm
(n_layers, dropout_ratio, hx, cx, ws, bs, xs, train=True, use_cudnn=True)[source]¶ Stacked Long Short-Term Memory function for sequence inputs.
This function calculates stacked LSTM with sequences. This function gets an initial hidden state \(h_0\), an initial cell state \(c_0\), an input sequence \(x\), weight matrices \(W\), and bias vectors \(b\). This function calculates hidden states \(h_t\) and cell states \(c_t\) for each time \(t\) from input \(x_t\).
\[\begin{split}i_t &= \sigma(W_0 x_t + W_4 h_{t-1} + b_0 + b_4) \\ f_t &= \sigma(W_1 x_t + W_5 h_{t-1} + b_1 + b_5) \\ o_t &= \sigma(W_2 x_t + W_6 h_{t-1} + b_2 + b_6) \\ a_t &= \tanh(W_3 x_t + W_7 h_{t-1} + b_3 + b_7) \\ c_t &= f_t \cdot c_{t-1} + i_t \cdot a_t \\ h_t &= o_t \cdot \tanh(c_t)\end{split}\]
As the function accepts a sequence, it calculates \(h_t\) for all \(t\) with one call. Eight weight matrices and eight bias vectors are required for each layer. So, when \(S\) layers exist, you need to prepare \(8S\) weight matrices and \(8S\) bias vectors.
If the number of layers n_layers is greater than \(1\), the input of the k-th layer is the hidden state h_t of the (k-1)-th layer. Note that all input variables except those of the first layer may have a different shape from the first layer.
Parameters:  n_layers (int) – Number of layers.
 dropout_ratio (float) – Dropout ratio.
 hx (chainer.Variable) – Variable holding stacked hidden states. Its shape is (S, B, N) where S is the number of layers and is equal to n_layers, B is the mini-batch size, and N is the dimension of hidden units.
 cx (chainer.Variable) – Variable holding stacked cell states. It has the same shape as hx.
 ws (list of list of chainer.Variable) – Weight matrices. ws[i] represents weights for the i-th layer. Each ws[i] is a list containing eight matrices. ws[i][j] corresponds to W_j in the equation. Only ws[0][j] where 0 <= j < 4 has (I, N) shape as they are multiplied with input variables. All other matrices have (N, N) shape.
 bs (list of list of chainer.Variable) – Bias vectors. bs[i] represents biases for the i-th layer. Each bs[i] is a list containing eight vectors. bs[i][j] corresponds to b_j in the equation. The shape of each vector is (N,) where N is the dimension of hidden units.
 xs (list of chainer.Variable) – A list of Variable holding input values. Each element xs[t] holds the input value for time t. Its shape is (B_t, I), where B_t is the mini-batch size for time t, and I is the size of input units. Note that this function supports variable length sequences. When sequences have different lengths, sort the sequences in descending order by length, and transpose the sorted sequences. transpose_sequence() transposes a list of Variable() holding sequences. So xs needs to satisfy xs[t].shape[0] >= xs[t + 1].shape[0].
 train (bool) – If True, this function executes dropout.
 use_cudnn (bool) – If True, this function uses cuDNN if available.
Returns: This function returns a tuple containing three elements, hy, cy and ys.
 hy is the updated hidden states, whose shape is the same as hx.
 cy is the updated cell states, whose shape is the same as cx.
 ys is a list of Variable. Each element ys[t] holds the hidden states of the last layer corresponding to an input xs[t]. Its shape is (B_t, N) where B_t is the mini-batch size for time t, and N is the size of hidden units. Note that B_t is the same as the mini-batch size of xs[t].
Return type: See also
Evaluation functions¶
accuracy¶
binary_accuracy¶
classification_summary¶

chainer.functions.
classification_summary
(y, t, label_num=None, beta=1.0, ignore_label=-1)[source]¶ Calculates Precision, Recall, F beta Score, and support.
This function calculates the following quantities for each class.
 Precision: \(\frac{\mathrm{tp}}{\mathrm{tp} + \mathrm{fp}}\)
 Recall: \(\frac{\mathrm{tp}}{\mathrm{tp} + \mathrm{fn}}\)
 F beta Score: The weighted harmonic average of Precision and Recall.
 Support: The number of instances of each ground truth label.
Here, tp, fp, and fn stand for the number of true positives, false positives, and false negatives, respectively.
label_num specifies the number of classes, that is, each value in t must be an integer in the range of [0, label_num). If label_num is None, this function regards label_num as the maximum of t plus one.
ignore_label determines which instances should be ignored. Specifically, instances with the given label are not taken into account for calculating the above quantities. By default, it is set to -1 so that all instances are taken into consideration, as labels are supposed to be non-negative integers. Setting ignore_label to a non-negative integer less than label_num is illegal and yields undefined behavior. In the current implementation, it raises RuntimeWarning and the ignore_label-th entries in the output arrays do not contain correct quantities.
Parameters:  y (Variable) – Variable holding a vector of scores.
 t (Variable) – Variable holding a vector of ground truth labels.
 label_num (int) – The number of classes.
 beta (float) – The parameter which determines the weight of precision in the F-beta score.
 ignore_label (int) – Instances with this label are ignored.
Returns: 4-tuple of chainer.Variable of size (label_num,). Each element represents precision, recall, F beta score, and support of this mini-batch.
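The per-class quantities can be sketched in NumPy as follows. This is a simplified reference under several assumptions: it takes predicted labels directly instead of a score vector, it uses the standard definitions with false negatives (fn) in the recall denominator and F-beta = (1 + beta^2) P R / (beta^2 P + R), it returns 0 where a denominator is zero, and it ignores ignore_label. The name classification_summary_reference is hypothetical.

```python
import numpy as np


def classification_summary_reference(pred, t, label_num, beta=1.0):
    """Hypothetical per-class precision, recall, F-beta and support."""
    pred, t = np.asarray(pred), np.asarray(t)
    precision = np.zeros(label_num)
    recall = np.zeros(label_num)
    fbeta = np.zeros(label_num)
    support = np.zeros(label_num, dtype=int)
    for c in range(label_num):
        tp = np.sum((pred == c) & (t == c))
        fp = np.sum((pred == c) & (t != c))
        fn = np.sum((pred != c) & (t == c))
        support[c] = tp + fn
        if tp + fp > 0:
            precision[c] = tp / (tp + fp)
        if tp + fn > 0:
            recall[c] = tp / (tp + fn)
        denom = beta ** 2 * precision[c] + recall[c]
        if denom > 0:
            fbeta[c] = (1 + beta ** 2) * precision[c] * recall[c] / denom
    return precision, recall, fbeta, support


p, r, f, s = classification_summary_reference([0, 1, 1, 2], [0, 1, 2, 2], label_num=3)
```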
r2_score¶

chainer.functions.
r2_score
(pred, true, sample_weight=None, multioutput='uniform_average')[source]¶ Computes the R^2 (coefficient of determination) regression score.
Parameters:  pred (Variable) – Variable holding a vector, matrix or tensor of estimated target values.
 true (Variable) – Variable holding a vector, matrix or tensor of correct target values.
 sample_weight – This argument is for compatibility with scikit-learn’s implementation of r2_score. The current implementation admits None only.
 multioutput (string) – [‘uniform_average’, ‘raw_values’]. If ‘uniform_average’, this function returns an average of the R^2 scores of multiple outputs. If ‘raw_values’, this function returns a set of R^2 scores of multiple outputs.
Returns: A Variable holding a scalar array of the R^2 score if ‘multioutput’ is ‘uniform_average’ or a vector of R^2 scores if ‘multioutput’ is ‘raw_values’.
Return type: Note
This function is non-differentiable.
Loss functions¶
bernoulli_nll¶

chainer.functions.
bernoulli_nll
(x, y)[source]¶ Computes the negative log-likelihood of a Bernoulli distribution.
This function calculates the negative log-likelihood of a Bernoulli distribution.
\[-\log B(x; p) = -\sum_i \{x_i \log(p_i) + (1 - x_i)\log(1 - p_i)\},\]where \(p = \sigma(y)\), \(\sigma(\cdot)\) is a sigmoid function, and \(B(x; p)\) is a Bernoulli distribution.
Note
As this function uses a sigmoid function, you can pass the result of a fully-connected layer (that means Linear) to this function directly.
Parameters: Returns: A variable representing the negative log-likelihood.
Return type:
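As a rough illustration of the formula, the following plain-Python sketch applies the sigmoid to y and sums the negative log-likelihood terms. The function names are hypothetical, and the code is a direct transcription of the equation rather than Chainer's numerically stabilized implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bernoulli_nll_sketch(x, y):
    """-sum_i [x_i log p_i + (1 - x_i) log(1 - p_i)] with p = sigmoid(y)."""
    total = 0.0
    for x_i, y_i in zip(x, y):
        p_i = sigmoid(y_i)
        total -= x_i * math.log(p_i) + (1.0 - x_i) * math.log(1.0 - p_i)
    return total


# With y = 0 every p_i is 0.5, so each element contributes log(2).
nll = bernoulli_nll_sketch([1.0, 0.0], [0.0, 0.0])
```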
black_out¶

chainer.functions.
black_out
(x, t, W, samples)[source]¶ BlackOut loss function.
BlackOut loss function is defined as
\[-\log(p(t)) - \sum_{s \in S} \log(1 - p(s)),\]where \(t\) is the correct label, \(S\) is a set of negative examples and \(p(\cdot)\) is the likelihood of a given label. \(p\) is defined as
\[p(y) = \frac{\exp(W_y^\top x)}{ \sum_{s \in samples} \exp(W_s^\top x)}.\]
Parameters:
Returns: Loss value.
Return type:
See: BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
See also
connectionist_temporal_classification¶

chainer.functions.
connectionist_temporal_classification
(x, t, blank_symbol, input_length=None, label_length=None)[source]¶ Connectionist Temporal Classification loss function.
Connectionist Temporal Classification (CTC) [Graves2006] is a loss function for sequence labeling where the alignment between the inputs and the target is unknown. See also [Graves2012].
Parameters:  x (sequence of Variable) – RNN output at each time step. x must be a list of Variable s. Each element of x, x[i], is a Variable representing the output of the RNN at time i.
 t (Variable) – Expected label sequence.
 blank_symbol (int) – Index of blank_symbol. This value must be non-negative.
 input_length (Variable) – Length of the valid sequence for each element of minibatch x (optional). If input_length is omitted, all of x is regarded as valid input.
 label_length (Variable) – Length of the valid sequence for each element of minibatch t (optional). If label_length is omitted, all of t is regarded as valid input.
Returns: A variable holding a scalar value of the CTC loss.
Return type:
Note
You need to pass x without applying an activation function (e.g. the softmax function), because this function applies the softmax function to x before calculating the CTC loss to avoid numerical limitations. You also need to apply the softmax function to the forwarded values before you decode them.
Note
This function is differentiable only by x.
Note
This function supports (batch, sequence, 1-dimensional input) data.
[Graves2006] Alex Graves, Santiago Fernandez, Faustino Gomez, Jurgen Schmidhuber, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
[Graves2012] Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks
contrastive¶

chainer.functions.
contrastive
(x0, x1, y, margin=1)[source]¶ Computes contrastive loss.
It takes a pair of variables and a label as inputs. The label is 1 when those two input variables are similar, or 0 when they are dissimilar. Let \(N\) and \(K\) denote the minibatch size and the dimension of the input variables, respectively. The shape of both input variables should be (N, K).
\[L = \frac{1}{2N} \left( \sum_{n=1}^N y_n d_n^2 + (1 - y_n) \max ({\rm margin} - d_n, 0)^2 \right)\]where \(d_n = \| {\bf x_0}_n - {\bf x_1}_n \|_2\). \(N\) denotes the minibatch size. The input variables, x0 and x1, have \(N\) vectors, and each vector is K-dimensional. Therefore, \({\bf x_0}_n\) and \({\bf x_1}_n\) are the \(n\)-th K-dimensional vectors of x0 and x1.
Parameters:  x0 (Variable) – The first input variable. The shape should be (N, K), where N denotes the minibatch size, and K denotes the dimension of x0.
 x1 (Variable) – The second input variable. The shape should be the same as x0.
 y (Variable) – Labels. All values should be 0 or 1. The shape should be (N,), where N denotes the minibatch size.
 margin (float) – A parameter for contrastive loss. It should be a positive value.
Returns: A variable holding a scalar that is the loss value calculated by the above equation.
Return type:
Note
This cost can be used to train siamese networks. See Learning a Similarity Metric Discriminatively, with Application to Face Verification for details.
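The contrastive loss equation can be checked by hand with a small plain-Python sketch. The function name is hypothetical and the inputs are plain lists rather than Variable objects; it only mirrors the formula and is not Chainer's implementation.

```python
import math


def contrastive_loss_sketch(x0, x1, y, margin=1.0):
    """Mean over the minibatch of y*d^2 + (1-y)*max(margin-d, 0)^2, halved."""
    total = 0.0
    for a, b, y_n in zip(x0, x1, y):
        # Euclidean distance between the n-th pair of vectors.
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        total += y_n * d ** 2 + (1 - y_n) * max(margin - d, 0.0) ** 2
    return total / (2.0 * len(y))


x0 = [[0.0, 0.0], [0.0, 0.0]]
x1 = [[3.0, 4.0], [0.6, 0.8]]
# Pair 0 is "similar" (d = 5), pair 1 is "dissimilar" (d = 1, margin = 2).
loss = contrastive_loss_sketch(x0, x1, [1, 0], margin=2.0)
```

The similar pair contributes 25, the dissimilar pair contributes 1, and the sum is divided by 2N = 4.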
crf1d¶

chainer.functions.
crf1d
(cost, xs, ys)[source]¶ Calculates the negative log-likelihood of a linear-chain CRF.
It takes a transition cost matrix, a sequence of costs, and a sequence of labels. Let \(c_{st}\) be a transition cost from a label \(s\) to a label \(t\), \(x_{it}\) be a cost of a label \(t\) at position \(i\), and \(y_i\) be an expected label at position \(i\). The negative log-likelihood of a linear-chain CRF is defined as
\[L = -\left( \sum_{i=1}^l x_{iy_i} + \sum_{i=1}^{l-1} c_{y_i y_{i+1}} - {\log(Z)} \right) ,\]where \(l\) is the length of the input sequence and \(Z\) is the normalizing constant called the partition function.
Note
When you want to calculate the negative log-likelihood of sequences which have different lengths, sort the sequences in descending order of length and transpose the sequences. For example, suppose you have three input sequences:
>>> import numpy as np
>>> import chainer
>>> import chainer.functions as F
>>> a1 = a2 = a3 = a4 = np.random.uniform(-1, 1, 3).astype('f')
>>> b1 = b2 = b3 = np.random.uniform(-1, 1, 3).astype('f')
>>> c1 = c2 = np.random.uniform(-1, 1, 3).astype('f')
>>> a = [a1, a2, a3, a4]
>>> b = [b1, b2, b3]
>>> c = [c1, c2]
where a1 and all other variables are arrays with (K,) shape. Make a transpose of the sequences:
>>> x1 = np.stack([a1, b1, c1])
>>> x2 = np.stack([a2, b2, c2])
>>> x3 = np.stack([a3, b3])
>>> x4 = np.stack([a4])
and make a list of the arrays:
>>> xs = [x1, x2, x3, x4]
You need to make label sequences in the same fashion. And then, call the function:
>>> cost = chainer.Variable(
...     np.random.uniform(-1, 1, (3, 3)).astype('f'))
>>> ys = [np.zeros(x.shape[0:1], dtype='i') for x in xs]
>>> loss = F.crf1d(cost, xs, ys)
It calculates the sum of the negative log-likelihoods of the three sequences.
Parameters:  cost (Variable) – A \(K \times K\) matrix which holds the transition cost between two labels, where \(K\) is the number of labels.
 xs (list of Variable) – Input vector for each label. len(xs) denotes the length of the sequence, and each Variable holds a \(B \times K\) matrix, where \(B\) is the minibatch size and \(K\) is the number of labels. Note that the \(B\) s in all the variables are not necessarily the same, i.e., it accepts input sequences with different lengths.
 ys (list of Variable) – Expected output labels. It needs to have the same length as xs. Each Variable holds a \(B\) integer vector. When x in xs has a different \(B\), the corresponding y has the same \(B\). In other words, ys must satisfy ys[i].shape == xs[i].shape[0:1] for all i.
Returns: A variable holding the average negative log-likelihood of the input sequences.
Return type:
Note
See details in the original paper: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.
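The sort-and-transpose preparation described in the note above can be sketched in plain Python. The helper transpose_sequences is hypothetical (it is not part of Chainer) and uses string placeholders instead of arrays, purely to show how the per-time-step batches shrink as shorter sequences end.

```python
def transpose_sequences(seqs):
    """Turn per-sequence lists into per-time-step batches.

    seqs must already be sorted by length, longest first.
    """
    if any(len(a) < len(b) for a, b in zip(seqs, seqs[1:])):
        raise ValueError("sequences must be sorted in descending order of length")
    steps = []
    for i in range(len(seqs[0])):
        # Only sequences that are still running at step i contribute.
        steps.append([s[i] for s in seqs if len(s) > i])
    return steps


a = ["a1", "a2", "a3", "a4"]
b = ["b1", "b2", "b3"]
c = ["c1", "c2"]
xs = transpose_sequences([a, b, c])
```

With real data, each inner list would be stacked into one array per time step, and the label sequences would be transposed in the same way.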

chainer.functions.
argmax_crf1d
(cost, xs)[source]¶ Computes a state that maximizes a joint probability of the given CRF.
Parameters:  cost (Variable) – A \(K \times K\) matrix which holds the transition cost between two labels, where \(K\) is the number of labels.
 xs (list of Variable) – Input vector for each label. len(xs) denotes the length of the sequence, and each Variable holds a \(B \times K\) matrix, where \(B\) is the minibatch size and \(K\) is the number of labels. Note that the \(B\) s in all the variables are not necessarily the same, i.e., it accepts input sequences with different lengths.
Returns: A tuple of a Variable object s and a list ps. The shape of s is (B,), where B is the minibatch size. The i-th element of s, s[i], represents the log-likelihood of the i-th data. ps is a list of numpy.ndarray or cupy.ndarray, and denotes the state that maximizes the joint probability. len(ps) is equal to len(xs), and the shape of each ps[i] is the minibatch size of the corresponding xs[i]. That is, ps[i].shape == xs[i].shape[0:1].
Return type:
cross_covariance¶

chainer.functions.
cross_covariance
(y, z)[source]¶ Computes the sum-squared cross-covariance penalty between y and z.
Parameters:
Returns: A variable holding a scalar of the cross-covariance loss.
Return type:
Note
This cost can be used to disentangle variables. See http://arxiv.org/abs/1412.6583v3 for details.
gaussian_kl_divergence¶

chainer.functions.
gaussian_kl_divergence
(mean, ln_var)[source]¶ Computes the KL-divergence of Gaussian variables from the standard one.
Given two variables, mean representing \(\mu\) and ln_var representing \(\log(\sigma^2)\), this function returns a variable representing the KL-divergence between the given multi-dimensional Gaussian \(N(\mu, S)\) and the standard Gaussian \(N(0, I)\)
\[D_{\mathbf{KL}}(N(\mu, S) \| N(0, I)),\]where \(S\) is a diagonal matrix such that \(S_{ii} = \sigma_i^2\) and \(I\) is an identity matrix.
Parameters:
Returns: A variable representing the KL-divergence between the given Gaussian distribution and the standard Gaussian.
Return type:
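For a diagonal Gaussian, this KL-divergence has the well-known closed form 0.5 * sum(mu_i^2 + sigma_i^2 - log(sigma_i^2) - 1). The sketch below evaluates that closed form in plain Python; the function name is hypothetical and it is not claimed to match Chainer's code path.

```python
import math


def gaussian_kl_sketch(mean, ln_var):
    """KL(N(mean, diag(exp(ln_var))) || N(0, I)) via the closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, ln_var)
    )


kl_standard = gaussian_kl_sketch([0.0, 0.0], [0.0, 0.0])  # identical distributions
kl_shifted = gaussian_kl_sketch([1.0, 2.0], [0.0, 0.0])   # unit variance, shifted mean
```

With unit variance the divergence reduces to half the squared norm of the mean.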
gaussian_nll¶

chainer.functions.
gaussian_nll
(x, mean, ln_var)[source]¶ Computes the negative log-likelihood of a Gaussian distribution.
Given two variables, mean representing \(\mu\) and ln_var representing \(\log(\sigma^2)\), this function returns the negative log-likelihood of \(x\) on a Gaussian distribution \(N(\mu, S)\),
\[-\log N(x; \mu, \sigma^2) = \log\left(\sqrt{(2\pi)^D |S|}\right) + \frac{1}{2}(x - \mu)^\top S^{-1}(x - \mu),\]where \(D\) is the dimension of \(x\) and \(S\) is a diagonal matrix where \(S_{ii} = \sigma_i^2\).
Parameters:
Returns: A variable representing the negative log-likelihood.
Return type:
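Because \(S\) is diagonal, the log-determinant is the sum of ln_var and the quadratic form is a sum of per-element terms. The following plain-Python sketch (hypothetical function name, plain lists as inputs) evaluates the expression element by element.

```python
import math


def gaussian_nll_sketch(x, mean, ln_var):
    """0.5 * sum_i [log(2*pi) + ln_var_i + (x_i - mean_i)^2 / exp(ln_var_i)]."""
    return 0.5 * sum(
        math.log(2.0 * math.pi) + lv + (x_i - m) ** 2 * math.exp(-lv)
        for x_i, m, lv in zip(x, mean, ln_var)
    )


# A single point at the mean of a unit-variance Gaussian.
nll_at_mean = gaussian_nll_sketch([0.0], [0.0], [0.0])
# Moving one standard deviation away adds exactly 0.5.
nll_one_sigma = gaussian_nll_sketch([1.0], [0.0], [0.0])
```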
hinge¶

chainer.functions.
hinge
(x, t, norm='L1')[source]¶ Computes the hinge loss for a one-of-many classification task.
\[L = \frac{1}{N} \sum_{n=1}^N \sum_{k=1}^K \left[ \max(0, 1 - \delta\{l_n = k\} t_{nk}) \right]^p\]where \(N\) denotes the batch size, \(K\) is the number of classes of interest,
\[\begin{split}\delta \{ {\rm condition} \} = \left \{ \begin{array}{cc} 1 & {\rm if~condition} \\ -1 & {\rm otherwise,} \end{array} \right.\end{split}\]and
\[\begin{split}p = \left \{ \begin{array}{cc} 1 & {\rm if~norm} = {\rm 'L1'} \\ 2 & {\rm if~norm} = {\rm 'L2'.} \end{array} \right.\end{split}\]
Parameters:
Returns: A variable object holding a scalar array of the hinge loss \(L\).
Return type:
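A small worked instance may help when reading the hinge formula. In this plain-Python sketch, scores plays the role of \(t_{nk}\) and labels the role of \(l_n\); the function and argument names are hypothetical and the code simply transcribes the equation.

```python
def hinge_loss_sketch(scores, labels, norm="L1"):
    """Average over the batch of sum_k max(0, 1 - sign * score)^p."""
    p = {"L1": 1, "L2": 2}[norm]
    total = 0.0
    for row, label in zip(scores, labels):
        for k, score in enumerate(row):
            # delta is +1 for the true class and -1 for every other class.
            sign = 1.0 if k == label else -1.0
            total += max(0.0, 1.0 - sign * score) ** p
    return total / len(labels)


scores = [[2.0, -0.5], [0.5, 0.0]]
labels = [0, 1]
loss_l1 = hinge_loss_sketch(scores, labels, "L1")
loss_l2 = hinge_loss_sketch(scores, labels, "L2")
```

Only the true-class score of the first sample clears the margin; the other three terms contribute 0.5, 1.5 and 1.0 before the power and the average are applied.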
huber_loss¶

chainer.functions.
huber_loss
(x, t, delta)[source]¶ Loss function which is less sensitive to outliers in data than MSE.
\[a = x - t\]and
\[\begin{split}L_{\delta}(a) = \left \{ \begin{array}{cc} \frac{1}{2} a^2 & {\rm if~|a| \leq \delta} \\ \delta (|a| - \frac{1}{2} \delta) & {\rm otherwise,} \end{array} \right.\end{split}\]
Parameters:
Returns: A variable object holding a scalar array of the Huber loss \(L_{\delta}\).
Return type:
See: Huber loss - Wikipedia.
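The two branches of the Huber loss can be seen in a minimal elementwise sketch. The function name is hypothetical; it evaluates the standard Huber definition for a single residual a = x - t and does not cover how Chainer reduces over elements.

```python
def huber_sketch(a, delta):
    """Quadratic near zero, linear once |a| exceeds delta."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)


small = huber_sketch(0.5, 1.0)    # quadratic branch
large = huber_sketch(-3.0, 1.0)   # linear branch
edge = huber_sketch(1.0, 1.0)     # both branches agree at |a| == delta
```

For comparison, a squared-error term 0.5 * a^2 at a = -3 would be 4.5, so the outlier is penalized less.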
mean_absolute_error¶
mean_squared_error¶
negative_sampling¶

chainer.functions.
negative_sampling
(x, t, W, sampler, sample_size)[source]¶ Negative sampling loss function.
In natural language processing, especially language modeling, the number of words in a vocabulary can be very large. Therefore, you need to spend a lot of time calculating the gradient of the embedding matrix.
By using the negative sampling trick you only need to calculate the gradient for a few sampled negative examples.
The objective function is below:
\[f(x, p) = \log \sigma(x^\top w_p) + k E_{i \sim P(i)}[\log \sigma(- x^\top w_i)],\]where \(\sigma(\cdot)\) is a sigmoid function, \(w_i\) is the weight vector for the word \(i\), and \(p\) is a positive example. It is approximated with \(k\) examples \(N\) sampled from probability \(P(i)\), like this:
\[f(x, p) \approx \log \sigma(x^\top w_p) + \sum_{n \in N} \log \sigma(-x^\top w_n).\]Each sample of \(N\) is drawn from the word distribution \(P(w)\). This is calculated as \(P(w) = \frac{1}{Z} c(w)^\alpha\), where \(c(w)\) is the unigram count of the word \(w\), \(\alpha\) is a hyper-parameter, and \(Z\) is the normalization constant.
Parameters:  x (Variable) – Batch of input vectors.
 t (Variable) – Vector of ground truth labels.
 W (Variable) – Weight matrix.
 sampler (FunctionType) – Sampling function. It takes a shape and returns an integer array of that shape. Each element of this array is a sample from the word distribution. A WalkerAlias object built with the power distribution of word frequency is recommended.
 sample_size (int) – Number of samples.
See: Distributed Representations of Words and Phrases and their Compositionality
See also
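The approximated objective and the word distribution \(P(w)\) can be sketched in plain Python as follows. All names are hypothetical, the negative samples are passed in explicitly instead of being drawn by a sampler, and the value alpha = 0.5 is only an illustrative choice.

```python
import math


def log_sigmoid(v):
    return -math.log1p(math.exp(-v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_objective(x, w_pos, w_negs):
    """log sigma(x.w_p) + sum over sampled negatives of log sigma(-x.w_n)."""
    value = log_sigmoid(dot(x, w_pos))
    for w_n in w_negs:
        value += log_sigmoid(-dot(x, w_n))
    return value


def unigram_power_distribution(counts, alpha):
    """P(w) = c(w)^alpha / Z for a list of unigram counts."""
    weights = [c ** alpha for c in counts]
    z = sum(weights)
    return [w / z for w in weights]


# Orthogonal vectors give dot products of 0, so each term is log(0.5).
objective = negative_sampling_objective([1.0, 0.0], [0.0, 1.0], [[0.0, 2.0]])
probs = unigram_power_distribution([1, 16], alpha=0.5)
```

Raising the counts to a power below one flattens the distribution: the raw counts 1 and 16 become sampling weights 1 and 4.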
sigmoid_cross_entropy¶

chainer.functions.
sigmoid_cross_entropy
(x, t, use_cudnn=True, normalize=True)[source]¶ Computes cross entropy loss for pre-sigmoid activations.
Parameters:  x (Variable) – A variable object holding a matrix whose (i, j)-th element indicates the unnormalized log probability of the j-th unit at the i-th example.
 t (Variable) – Variable holding an int32 vector of ground truth labels. If t[i] == -1, the corresponding x[i] is ignored. The loss is zero if all ground truth labels are -1.
 normalize (bool) – Variable holding a boolean value which determines the normalization constant. If true, this function normalizes the cross entropy loss across all instances. Otherwise, it only normalizes along the batch size.
Returns: A variable object holding a scalar array of the cross entropy loss.
Return type:
Note
This function is differentiable only by x.
softmax_cross_entropy¶

chainer.functions.
softmax_cross_entropy
(x, t, use_cudnn=True, normalize=True, cache_score=True, class_weight=None)[source]¶ Computes cross entropy loss for pre-softmax activations.
Parameters:  x (Variable) – Variable holding a multidimensional array whose elements indicate unnormalized log probabilities: the first axis of the variable represents the number of samples, and the second axis represents the number of classes. While this function computes a usual softmax cross entropy if the number of dimensions is equal to 2, it computes a cross entropy of the replicated softmax if the number of dimensions is greater than 2.
 t (Variable) – Variable holding an int32 vector of ground truth labels. If t[i] == -1, the corresponding x[i] is ignored.
 normalize (bool) – If True, this function normalizes the cross entropy loss across all instances. If False, it only normalizes along the batch size.
 cache_score (bool) – When it is True, the function stores the result of the forward computation to use it in the backward computation. This reduces computational cost but consumes more memory.
 class_weight (numpy.ndarray or cupy.ndarray) – An array that contains constant weights that will be multiplied with the loss values along the second dimension. The shape of this array should be (x.shape[1],). If this is not None, each class weight class_weight[i] is actually multiplied to y[:, i], that is, the corresponding log-softmax output of x, which has the same shape as x, before calculating the actual loss value.
Returns: A variable holding a scalar array of the cross entropy loss.
Return type:
Note
This function is differentiable only by x.
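The interaction between ignored labels and normalization can be sketched for the 2-dimensional case in plain Python. This is an illustration under stated assumptions, not Chainer's implementation: the function name is hypothetical, the ignored label value is taken to be -1, and "normalizing across all instances" is assumed to mean dividing by the number of non-ignored samples.

```python
import math


def softmax_cross_entropy_sketch(x, t, ignore_label=-1):
    """Mean of -log softmax(x)[label] over rows whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(x, t):
        if label == ignore_label:
            continue  # ignored rows contribute neither loss nor count
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]
        count += 1
    return total / max(count, 1)


loss = softmax_cross_entropy_sketch([[0.0, 0.0], [5.0, 1.0]], [1, -1])
all_ignored = softmax_cross_entropy_sketch([[0.0, 0.0]], [-1])
```

Only the first row counts here; its two equal scores give a probability of 0.5 for the true class.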
triplet¶

chainer.functions.
triplet
(anchor, positive, negative, margin=0.2)[source]¶ Computes triplet loss.
It takes a triplet of variables as inputs, \(a\), \(p\) and \(n\): anchor, positive example and negative example respectively. The triplet defines a relative similarity between samples. Let \(N\) and \(K\) denote minibatch size and the dimension of input variables, respectively. The shape of all input variables should be \((N, K)\).
\[L(a, p, n) = \frac{1}{N} \left( \sum_{i=1}^N \max \{d(a_i, p_i) - d(a_i, n_i) + {\rm margin}, 0\} \right)\]where \(d(x_i, y_i) = \| {\bf x}_i - {\bf y}_i \|_2^2\).
Parameters:  anchor (Variable) – The anchor example variable. The shape should be \((N, K)\), where \(N\) denotes the minibatch size, and \(K\) denotes the dimension of the anchor.
 positive (Variable) – The positive example variable. The shape should be the same as anchor.
 negative (Variable) – The negative example variable. The shape should be the same as anchor.
 margin (float) – A parameter for triplet loss. It should be a positive value.
Returns: A variable holding a scalar that is the loss value calculated by the above equation.
Return type:
Note
This cost can be used to train triplet networks. See Learning Fine-grained Image Similarity with Deep Ranking for details.
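The triplet loss equation can be traced with a short plain-Python sketch. The function name is hypothetical and the inputs are plain nested lists; the code follows the formula with squared Euclidean distances and is not Chainer's implementation.

```python
def triplet_loss_sketch(anchor, positive, negative, margin=0.2):
    """Mean over the batch of max(d(a, p) - d(a, n) + margin, 0)."""

    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))

    total = 0.0
    for a, p, n in zip(anchor, positive, negative):
        total += max(sq_dist(a, p) - sq_dist(a, n) + margin, 0.0)
    return total / len(anchor)


anchor = [[0.0, 0.0], [0.0, 0.0]]
positive = [[1.0, 0.0], [0.0, 1.0]]
negative = [[0.0, 3.0], [1.0, 0.0]]
loss = triplet_loss_sketch(anchor, positive, negative, margin=0.5)
```

The first triplet already satisfies the margin (the negative is far away) and contributes zero; the second has equal distances and contributes exactly the margin.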
Mathematical functions¶
arccos¶
arcsin¶
arctan¶
argmax¶
argmin¶
batch_inv¶

chainer.functions.
batch_inv
(a)[source]¶ Computes the inverse of a batch of square matrices.
Parameters: a (Variable) – Input array to compute the inverse for. The shape of the array should be (m, n, n), where m is the number of matrices in the batch, and n is the dimensionality of a square matrix.
Returns: Inverse of every matrix in the batch of matrices.
Return type: Variable
batch_l2_norm_squared¶

chainer.functions.
batch_l2_norm_squared
(x)[source]¶ L2 norm (a.k.a. Euclidean norm) squared.
This function implements the square of L2 norm on a vector. No reduction along batch axis is done.
Parameters: x (Variable) – Input variable. The first dimension is assumed to be the minibatch dimension. If x has more than two dimensions, all but the first dimension are flattened to one dimension.
Returns: Two dimensional output variable.
Return type: Variable
batch_matmul¶

chainer.functions.
batch_matmul
(a, b, transa=False, transb=False)[source]¶ Computes the batch matrix multiplications of two sets of arrays.
Parameters:  a (Variable) – The left operand of the batch matrix multiplications. A 2D array of shape (B, N) is considered as B \(N \times 1\) matrices. A 3D array of shape (B, M, N) is considered as B \(M \times N\) matrices.
 b (Variable) – The right operand of the batch matrix multiplications. Its array is treated as matrices in the same way as a’s array.
 transa (bool) – If True, transpose each matrix in a.
 transb (bool) – If True, transpose each matrix in b.
Returns: The result of the batch matrix multiplications as a 3D array.
Return type:
bias¶

chainer.functions.
bias
(x, y, axis=1)[source]¶ Elementwise summation with broadcasting.
Computes an elementwise summation of two input variables, with the shape of the latter variable broadcasted to match the shape of the former. axis is the first axis of the first variable along which the second variable is applied.
The term “broadcasting” here comes from Caffe’s bias layer, so the “broadcasting” with the following arguments:
x : 100 x 3 x 40 x 60
y : 3 x 40
axis : 1
is equivalent to the following numpy broadcasting:
x : 100 x 3 x 40 x 60
y : 1 x 3 x 40 x 1
Note how axis indicates to which axis of x we apply y.
Parameters:
Returns: Output variable.
Return type:
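The shape rule in the example above amounts to padding the shape of y with ones on both sides. A minimal sketch of that bookkeeping, with a hypothetical helper name and plain tuples as shapes:

```python
def broadcast_shape_for(x_shape, y_shape, axis=1):
    """Shape that y is viewed as before NumPy-style broadcasting against x."""
    if tuple(x_shape[axis:axis + len(y_shape)]) != tuple(y_shape):
        raise ValueError("y does not match x along the given axes")
    lead = (1,) * axis                                   # axes before `axis`
    trail = (1,) * (len(x_shape) - axis - len(y_shape))  # axes after y ends
    return lead + tuple(y_shape) + trail


shape = broadcast_shape_for((100, 3, 40, 60), (3, 40), axis=1)
```

With axis=0 the same y would have to match the leading axes of x instead, which fails for this example because 100 x 3 is not 3 x 40.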
ceil¶
clip¶
cosh¶
floor¶
inv¶
linear_interpolate¶
log10¶
log2¶
logsumexp¶
matmul¶

chainer.functions.
matmul
(a, b, transa=False, transb=False)[source]¶ Computes the matrix multiplication of two arrays.
Parameters:  a (Variable) – The left operand of the matrix multiplication. A 1D array of shape (N,) is considered as an \(N \times 1\) matrix. A 2D array of shape (M, N) is considered as an \(M \times N\) matrix.
 b (Variable) – The right operand of the matrix multiplication. Its array is treated as a matrix in the same way as a’s array.
 transa (bool) – If True, transpose a.
 transb (bool) – If True, transpose b.
Returns: The result of the matrix multiplication as a 2D array.
Return type:
max¶
maximum¶
min¶
minimum¶
rsqrt¶
scale¶

chainer.functions.
scale
(x, y, axis=1)[source]¶ Elementwise product with broadcasting.
Computes an elementwise product of two input variables, with the shape of the latter variable broadcasted to match the shape of the former. axis is the first axis of the first variable along which the second variable is applied.
The term “broadcasting” here comes from Caffe’s scale layer, so the “broadcasting” with the following arguments:
x : 100 x 3 x 40 x 60
y : 3 x 40
axis : 1
is equivalent to the following numpy broadcasting:
x : 100 x 3 x 40 x 60
y : 1 x 3 x 40 x 1
Note how axis indicates to which axis of x we apply y.
Parameters:
Returns: Output variable.
Return type:
sinh¶
sqrt¶
square¶

chainer.functions.
square
(x)[source]¶ Elementwise square function.
\[y_i = x_i ^ 2.\]
Parameters: x (chainer.Variable or numpy.ndarray or cupy.ndarray) – Input variable.
Returns: Output variable.
Return type: Variable
squared_difference¶
Noise injections¶
dropout¶

chainer.functions.
dropout
(x, ratio=0.5, train=True)[source]¶ Drops elements of input variable randomly.
This function drops input elements randomly with probability ratio and scales the remaining elements by factor 1 / (1 - ratio). In testing mode, it does nothing and just returns x.
Parameters:
Returns: Output variable.
Return type:
See the paper by G. Hinton: Improving neural networks by preventing co-adaptation of feature detectors.
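The drop-and-rescale behavior can be sketched on a plain list. The function name and the rng argument are hypothetical additions for reproducibility; the sketch only mirrors the description above and is not Chainer's implementation.

```python
import random


def dropout_sketch(x, ratio=0.5, train=True, rng=random):
    """Zero each element with probability `ratio`; rescale the survivors."""
    if not train:
        return list(x)  # testing mode: the input passes through unchanged
    scale = 1.0 / (1.0 - ratio)
    return [0.0 if rng.random() < ratio else v * scale for v in x]


rng = random.Random(0)  # seeded generator so repeated runs give the same mask
y = dropout_sketch([1.0] * 8, ratio=0.5, rng=rng)
y_test = dropout_sketch([1.0, 3.0], train=False)
```

With ratio=0.5 every surviving element is doubled, which keeps the expected value of each output equal to its input.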
Normalization functions¶
batch_normalization¶

chainer.functions.
batch_normalization
(x, gamma, beta, eps=2e-05, running_mean=None, running_var=None, decay=0.9, use_cudnn=True)[source]¶ Batch normalization function.
It takes the input variable x and two parameter variables gamma and beta. The input must have the batch size and the features (or channels) as the first two dimensions of its shape. The input can have more than two dimensions, where the remaining dimensions are considered as spatial dimensions, which are considered as a part of the batch size. That is, the total batch size will be considered to be the product of all dimensions except the second dimension.
Note: If this function is called, it will not be possible to access the updated running mean and variance statistics, because they are members of the function object, which cannot be accessed by the caller. If it is desired to access the updated running statistics, it is necessary to get a new instance of the function object, call the object, and then access the running_mean and/or running_var attributes. See the corresponding Link class for an example of how to do this.
Parameters:  x (Variable) – Input variable.
 gamma (Variable) – Scaling parameter of normalized data.
 beta (Variable) – Shifting parameter of scaled normalized data.
 eps (float) – Epsilon value for numerical stability.
 running_mean (array) – Running average of the mean. This is a running average of the mean over several minibatches using the decay parameter. If None, the running average is not computed. If this is None, then running_var must also be None.
 running_var (array) – Running average of the variance. This is a running average of the variance over several minibatches using the decay parameter. If None, the running average is not computed. If this is None, then running_mean must also be None.
 decay (float) – Decay rate of moving average. It is used during training.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
See: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
See also
links.BatchNormalization
fixed_batch_normalization¶

chainer.functions.
fixed_batch_normalization
(x, gamma, beta, mean, var, eps=2e-05, use_cudnn=True)[source]¶ Batch normalization function with fixed statistics.
This is a variant of batch normalization, where the mean and variance statistics are given by the caller as fixed variables. This is used in the testing mode of the batch normalization layer, where batch statistics cannot be used for prediction consistency.
Parameters:  x (Variable) – Input variable.
 gamma (Variable) – Scaling parameter of normalized data.
 beta (Variable) – Shifting parameter of scaled normalized data.
 mean (Variable) – Shifting parameter of input.
 var (Variable) – Square of scaling parameter of input.
 eps (float) – Epsilon value for numerical stability.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
See also
functions.batch_normalization(), links.BatchNormalization
local_response_normalization¶

chainer.functions.
local_response_normalization
(x, n=5, k=2, alpha=0.0001, beta=0.75)[source]¶ Local response normalization across neighboring channels.
This function implements normalization across channels. Let \(x\) be an input image with \(N\) channels. Then, this function computes an output image \(y\) by the following formula:
\[y_i = {x_i \over \left( k + \alpha \sum_{j=\max(1, i - n/2)}^{\min(N, i + n/2)} x_j^2 \right)^\beta}.\]
Parameters:
Returns: Output variable.
Return type:
See: Section 3.3 of ImageNet Classification with Deep Convolutional Neural Networks
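For a single spatial position, the formula reduces to a sliding window over the channel vector. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes that n/2 means integer division and uses zero-based channel indices, neither of which is stated in the text above.

```python
def lrn_sketch(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its neighbors."""
    half = n // 2  # assumption: n/2 is taken as integer division
    out = []
    for i, x_i in enumerate(x):
        lo = max(0, i - half)
        hi = min(len(x) - 1, i + half)
        window_sum = sum(v * v for v in x[lo:hi + 1])
        out.append(x_i / (k + alpha * window_sum) ** beta)
    return out


# Simple constants make the arithmetic easy to follow by hand.
y = lrn_sketch([1.0, 2.0, 3.0], n=3, k=1.0, alpha=1.0, beta=1.0)
```

The windows are truncated at the first and last channel, so the three denominators are 1 + 5, 1 + 14 and 1 + 13.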
normalize¶

chainer.functions.
normalize
(x, eps=1e-05)[source]¶ L2 normalization (division by the Euclidean norm).
This function implements L2 normalization on a 1D vector. No reduction is done along the batch axis. Let \(x\) be an input vector of dimension \((N, K)\), where \(N\) and \(K\) denote the minibatch size and the dimension of the input variable. Then, this function computes an output vector \(y\) by the following equation:
\[y_i = {x_i \over \| x_i \|_2}\]\(eps\) is used to avoid division by zero when \(x_i=0\).
Parameters:
Returns: Two dimensional output variable, the same shape as \(x\).
Return type:
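Row-wise L2 normalization can be sketched in plain Python as below. The function name is hypothetical, and adding eps to the norm is an assumption: the text only says that eps guards against division by zero, not where it enters the computation.

```python
import math


def l2_normalize_sketch(x, eps=1e-5):
    """Divide every row by its Euclidean norm (plus eps, by assumption)."""
    out = []
    for row in x:
        norm = math.sqrt(sum(v * v for v in row)) + eps
        out.append([v / norm for v in row])
    return out


y = l2_normalize_sketch([[3.0, 4.0], [0.0, 0.0]])
```

The first row has norm 5 and maps to approximately (0.6, 0.8); the all-zero row stays zero instead of raising a division error.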
Spatial pooling¶
average_pooling_2d¶

chainer.functions.
average_pooling_2d
(x, ksize, stride=None, pad=0, use_cudnn=True)[source]¶ Spatial average pooling function.
This function acts similarly to Convolution2D, but it computes the average of the input spatial patch for each channel without any parameter instead of computing the inner products.
Parameters:  x (Variable) – Input variable.
 ksize (int or pair of ints) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or pair of ints) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable.
Return type:
Note
This function currently does not support cover_all mode as max_pooling_2d() does. Average pooling runs in non-cover-all mode.
average_pooling_nd¶

chainer.functions.
average_pooling_nd
(x, ksize, stride=None, pad=0, use_cudnn=True)[source]¶ N-dimensionally spatial average pooling function.
This function provides an N-dimensionally generalized version of average_pooling_2d(). This acts similarly to ConvolutionND, but it computes the average of the input spatial patch for each channel without any parameter instead of computing the inner products.
Parameters:  x (Variable) – Input variable.
 ksize (int or tuple of ints) – Size of pooling window. ksize=k and ksize=(k, k, ..., k) are equivalent.
 stride (int or tuple of ints or None) – Stride of pooling applications. stride=s and stride=(s, s, ..., s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or tuple of ints) – Spatial padding width for the input array. pad=p and pad=(p, p, ..., p) are equivalent.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation. cuDNN supports more than one-dimensional pooling.
Returns: Output variable.
Return type:
Note
This function currently does not support cover_all mode as max_pooling_nd() does. Average pooling runs in non-cover-all mode.
max_pooling_2d¶

chainer.functions.
max_pooling_2d
(x, ksize, stride=None, pad=0, cover_all=True, use_cudnn=True)[source]¶ Spatial max pooling function.
This function acts similarly to Convolution2D, but it computes the maximum of the input spatial patch for each channel without any parameter instead of computing the inner products.
Parameters:  x (Variable) – Input variable.
 ksize (int or pair of ints) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or pair of ints) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
 cover_all (bool) – If True, all spatial locations are pooled into some output pixels. It may make the output size larger.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable.
Return type:
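The effect of cover_all on the output size can be illustrated with the usual sliding-window size arithmetic. The helper pooled_size below is hypothetical and its formulas are an assumption based on the common convolution output-size rule; they are not quoted from Chainer.

```python
def pooled_size(size, k, s, p, cover_all=True):
    """Output length along one axis for window k, stride s, padding p."""
    if cover_all:
        # Round up so that a final, partially filled window is kept.
        return (size + 2 * p - k + s - 1) // s + 1
    return (size + 2 * p - k) // s + 1


covered = pooled_size(6, k=3, s=2, p=0, cover_all=True)
plain = pooled_size(6, k=3, s=2, p=0, cover_all=False)
```

For an input of length 6, windows starting at 0 and 2 leave the last element unpooled; the cover-all variant adds a third window starting at 4 so that every input location contributes to some output.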
max_pooling_nd¶

chainer.functions.
max_pooling_nd
(x, ksize, stride=None, pad=0, cover_all=True, use_cudnn=True)[source]¶ N-dimensionally spatial max pooling function.
This function provides an N-dimensionally generalized version of max_pooling_2d(). This acts similarly to ConvolutionND, but it computes the maximum of the input spatial patch for each channel without any parameter instead of computing the inner products.
Parameters:  x (Variable) – Input variable.
 ksize (int or tuple of ints) – Size of pooling window. ksize=k and ksize=(k, k, ..., k) are equivalent.
 stride (int or tuple of ints or None) – Stride of pooling applications. stride=s and stride=(s, s, ..., s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or tuple of ints) – Spatial padding width for the input array. pad=p and pad=(p, p, ..., p) are equivalent.
 cover_all (bool) – If True, all spatial locations are pooled into some output pixels. It may make the output size larger.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation. cuDNN supports more than one-dimensional pooling.
Returns: Output variable.
Return type:
roi_pooling_2d¶

chainer.functions.
roi_pooling_2d
(x, rois, outh, outw, spatial_scale)[source]¶ Spatial Region of Interest (ROI) pooling function.
This function acts similarly to MaxPooling2D, but it computes the maximum of the input spatial patch for each channel within the region of interest.
Parameters:  x (Variable) – Input variable. The shape is expected to be 4-dimensional: (n: batch, c: channel, h: height, w: width).
 rois (Variable) – Input roi variable. The shape is expected to be (n: data size, 5), and each datum is set as below: (batch_index, x_min, y_min, x_max, y_max).
 outh (int) – Height of the output image after pooling.
 outw (int) – Width of the output image after pooling.
 spatial_scale (float) – Scale by which the roi is resized.
Returns: Output variable.
Return type:
See the original paper proposing ROIPooling: Fast R-CNN.
spatial_pyramid_pooling_2d¶

chainer.functions.
spatial_pyramid_pooling_2d
(x, pyramid_height, pooling_class, use_cudnn=True)[source]¶ Spatial pyramid pooling function.
It outputs a fixedlength vector regardless of input feature map size.
It performs pooling operation to the input 4Darray
x
with different kernel sizes and padding sizes, and then flattens all dimensions except first dimension of all pooling results, and finally concatenates them along second dimension.At \(i\)th pyramid level, the kernel size \((k_h^{(i)}, k_w^{(i)})\) and padding size \((p_h^{(i)}, p_w^{(i)})\) of pooling operation are calculated as below:
\[\begin{split}k_h^{(i)} &= \lceil b_h / 2^i \rceil, \\ k_w^{(i)} &= \lceil b_w / 2^i \rceil, \\ p_h^{(i)} &= (2^i k_h^{(i)}  b_h) / 2, \\ p_w^{(i)} &= (2^i k_w^{(i)}  b_w) / 2,\end{split}\]where \(\lceil \cdot \rceil\) denotes the ceiling function, and \(b_h, b_w\) are height and width of input variable
x
, respectively. Note that index of pyramid level \(i\) is zerobased.See detail in paper: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition.
Parameters:
 x (Variable) – Input variable. The shape of x should be (batchsize, # of channels, height, width).
 pyramid_height (int) – Number of pyramid levels.
 pooling_class (MaxPooling2D or AveragePooling2D) – Only the MaxPooling2D class is available for now.
 use_cudnn (bool) – If True and cuDNN is enabled, then this function uses cuDNN as the core implementation.
Returns: Output variable. The shape of the output variable will be \((batchsize, c \sum_{h=0}^{H-1} 2^{2h}, 1, 1)\), where \(c\) is the number of channels of the input variable x and \(H\) is the number of pyramid levels.
Return type: Variable
Note
This function uses pooling classes as components to perform spatial pyramid pooling. For now it supports only MaxPooling2D as the elemental pooling operator.
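The kernel and padding formulas above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not Chainer's implementation: the helper names `pyramid_params` and `output_length` are hypothetical, and rounding the padding down to an integer is an assumption, since the formula itself does not state how fractional values are handled.

```python
def pyramid_params(b_h, b_w, pyramid_height):
    """Return (k_h, k_w, p_h, p_w) for each zero-based pyramid level."""
    params = []
    for i in range(pyramid_height):
        scale = 2 ** i
        # Integer ceiling division: ceil(b / 2^i)
        k_h = -(-b_h // scale)
        k_w = -(-b_w // scale)
        # Assumption: padding is rounded down to an integer
        p_h = (scale * k_h - b_h) // 2
        p_w = (scale * k_w - b_w) // 2
        params.append((k_h, k_w, p_h, p_w))
    return params


def output_length(channels, pyramid_height):
    """Second dimension of the output: c * sum_{h=0}^{H-1} 2^(2h)."""
    return channels * sum(2 ** (2 * h) for h in range(pyramid_height))


levels = pyramid_params(13, 13, 3)
# levels == [(13, 13, 0, 0), (7, 7, 0, 0), (4, 4, 1, 1)]
length = output_length(256, 3)
# length == 256 * (1 + 4 + 16) == 5376
```

For a 13 x 13 feature map and three levels, each level produces 1, 4, and 16 bins per channel, which is why the output length does not depend on the input size.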
unpooling_2d¶

chainer.functions.unpooling_2d(x, ksize, stride=None, pad=0, outsize=None, cover_all=True)
Inverse operation of pooling for 2d array.
This function acts similarly to Deconvolution2D, but it spreads the values of the input 2d array without any parameters instead of computing inner products.
Parameters:
 x (Variable) – Input variable.
 ksize (int or pair of ints) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
 stride (int, pair of ints or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
 pad (int or pair of ints) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
 outsize (None or pair of ints) – Expected output size (height, width) of the array after the operation. If None, the size (height or width) is estimated from the size of the input array in the first batch with get_deconv_outsize(). If outsize is not None, the result of applying get_conv_outsize() to outsize must be equal to the shape of the 2d array in the input batch x.
 cover_all (bool) – If True, the output size may be smaller than the size obtained when cover_all is False. This flag serves to align the behavior with that of the pooling functions that can cover all input locations; see max_pooling_2d() and convolution_2d().
Returns: Output variable.
Return type: Variable
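The idea of spreading values without parameters can be illustrated with NumPy alone. The sketch below is not Chainer's implementation: it covers only the simplest case of non-overlapping windows (stride equal to ksize, no padding), and the helper name `unpool_repeat` is hypothetical. Chainer's output size rules (outsize, cover_all) are not modeled here.

```python
import numpy as np


def unpool_repeat(x, k):
    """Copy every element of a (N, C, H, W) array into a k-by-k block."""
    # Repeat along height, then along width
    return x.repeat(k, axis=2).repeat(k, axis=3)


x = np.array([[[[1., 2.],
                [3., 4.]]]], dtype=np.float32)
y = unpool_repeat(x, 2)
# y[0, 0] is
# [[1., 1., 2., 2.],
#  [1., 1., 2., 2.],
#  [3., 3., 4., 4.],
#  [3., 3., 4., 4.]]
```

Unlike a deconvolution, nothing is learned here: each output block is just a copy of one input value.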
upsampling_2d¶

chainer.functions.upsampling_2d(x, indexes, ksize, stride=None, pad=0, outsize=None, cover_all=True)
Upsampling using pooling indices.
This function produces an upsampled image using pooling indices.
Example
Note that you need to specify use_cudnn=False when you create the MaxPooling2D object, because if cuDNN is used for max pooling, indexes is never created or stored in the MaxPooling2D object.
    >>> p = F.MaxPooling2D(2, 2, use_cudnn=False)
    >>> x = np.arange(1, 37).reshape(1, 1, 6, 6).astype('f')
    >>> x = chainer.Variable(x)
    >>> x.data
    array([[[[  1.,   2.,   3.,   4.,   5.,   6.],
             [  7.,   8.,   9.,  10.,  11.,  12.],
             [ 13.,  14.,  15.,  16.,  17.,  18.],
             [ 19.,  20.,  21.,  22.,  23.,  24.],
             [ 25.,  26.,  27.,  28.,  29.,  30.],
             [ 31.,  32.,  33.,  34.,  35.,  36.]]]], dtype=float32)
This is the original x before max pooling.
    >>> pooled_x = p(x)
    >>> pooled_x.data
    array([[[[  8.,  10.,  12.],
             [ 20.,  22.,  24.],
             [ 32.,  34.,  36.]]]], dtype=float32)
This is the output of the max pooling operation. upsampling_2d needs the indexes array stored in the max pooling object p.
    >>> upsampled_x = F.upsampling_2d(
    ...     pooled_x, p.indexes, p.kh, p.sy, p.ph, x.shape[2:])
    >>> upsampled_x.shape
    (1, 1, 6, 6)
    >>> upsampled_x.data
    array([[[[  0.,   0.,   0.,   0.,   0.,   0.],
             [  0.,   8.,   0.,  10.,   0.,  12.],
             [  0.,   0.,   0.,   0.,   0.,   0.],
             [  0.,  20.,   0.,  22.,   0.,  24.],
             [  0.,   0.,   0.,   0.,   0.,   0.],
             [  0.,  32.,   0.,  34.,   0.,  36.]]]], dtype=float32)
Parameters:
 x (Variable) – Input variable.
 indexes (numpy.ndarray or cupy.ndarray) – Index array that was used to calculate x with MaxPooling2D.
 ksize (int or (int, int)) – ksize attribute of the MaxPooling2D object that was used to calculate x.
 stride (int or (int, int)) – stride attribute of the MaxPooling2D object that was used to calculate x.
 pad (int or (int, int)) – pad attribute of the MaxPooling2D object that was used to calculate x.
 outsize ((int, int)) – Expected output size (height, width).
 cover_all (bool) – Whether cover_all was used in the MaxPooling2D object or not.
Returns: Output variable.
Return type: Variable
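The role of the pooling indices in the example above can be reproduced with NumPy alone. This is a sketch under stated assumptions, not Chainer's code: it handles a single 2D array with non-overlapping k x k windows and no padding, the helper names are hypothetical, and storing each index as a flat position inside its window is an assumption made for this illustration.

```python
import numpy as np


def max_pool_with_indexes(x, k):
    """Non-overlapping k x k max pooling that also returns argmax positions."""
    h, w = x.shape
    # Rearrange into (h // k, w // k, k * k): one flattened window per output cell
    windows = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    return windows.max(axis=2), windows.argmax(axis=2)


def upsample_with_indexes(pooled, indexes, k):
    """Write each pooled value back to the position it was taken from."""
    ph, pw = pooled.shape
    out = np.zeros((ph * k, pw * k), dtype=pooled.dtype)
    rows, cols = np.meshgrid(np.arange(ph), np.arange(pw), indexing='ij')
    # Flat window index -> (row offset, column offset) inside the window
    out[rows * k + indexes // k, cols * k + indexes % k] = pooled
    return out


x = np.arange(1, 37, dtype=np.float32).reshape(6, 6)
pooled, indexes = max_pool_with_indexes(x, 2)
upsampled = upsample_with_indexes(pooled, indexes, 2)
```

With this input, `pooled` holds the same 3 x 3 maxima as the example, and `upsampled` is zero everywhere except at the positions where those maxima originally sat.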
Utility functions¶
forget¶

chainer.functions.forget(func, *xs)
Call a function without storing internal results.
On forward propagation, Chainer stores all internal results of Function on a computational graph, as they are required on backward propagation. These results consume too much memory when the internal results are too large. This method forgets such internal results on forward propagation, and still supports backpropagation with recalculation.
In forward propagation, this method calls the given function with the given variables without creating a computational graph. That means no internal results are stored. In backward propagation, this method calls the given function again to create a computational graph to execute backpropagation.
This method reduces internal memory usage. In exchange, it requires more calculation time, as it calls the function twice.
Example
Let f be a function defined as:
    >>> def f(a, b):
    ...     return a + b + a
and let x and y be Variable:
    >>> x = chainer.Variable(np.random.uniform(-1, 1, 5).astype('f'))
    >>> y = chainer.Variable(np.random.uniform(-1, 1, 5).astype('f'))
When z is calculated as z = f(x, y), its internal result x + y is stored in memory. If you instead call f with forget():
    >>> z = F.forget(f, x, y)
the internal result x + y is forgotten.
Note
The method does not support functions behaving randomly, such as dropout() and negative_sampling(). This is because the first results of these functions differ from the second ones.
Parameters:
Returns: A variable func returns. If it returns a tuple, the method returns a tuple too.
Return type:
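The trade of memory for recomputation described above can be sketched without Chainer. The following is a hand-written illustration, not how forget() is implemented: the class `Forgetful` and both helper functions are hypothetical, and the function y = (a + b) * a is used instead of a + b + a because its gradient actually depends on the intermediate result a + b.

```python
import numpy as np


def f_forward(a, b):
    """Return the output and the intermediate result that backward needs."""
    s = a + b  # intermediate result
    return s * a, s


def f_backward(a, s, gy):
    """Gradients of y = (a + b) * a with respect to a and b."""
    return gy * (s + a), gy * a


class Forgetful:
    """Hypothetical helper: keeps only the inputs and recomputes on backward."""

    def forward(self, a, b):
        self.inputs = (a, b)
        y, _ = f_forward(a, b)  # the intermediate result is discarded here
        return y

    def backward(self, gy):
        a, b = self.inputs
        _, s = f_forward(a, b)  # second forward call rebuilds the intermediate
        return f_backward(a, s, gy)


a = np.array([1., 2.])
b = np.array([3., 4.])
node = Forgetful()
y = node.forward(a, b)
ga, gb = node.backward(np.ones_like(y))
```

The forward function runs twice, once in `forward` and once in `backward`, which is the extra calculation time mentioned above. It also shows why a random function would break this scheme: the second call would rebuild a different intermediate result.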