Standard Link implementations
Chainer provides many Link implementations in the chainer.links package.
Note
Some of the links are originally defined in the chainer.functions
namespace. They are still left in the namespace for backward compatibility,
though it is strongly recommended to use them via the chainer.links
package.
Learnable connections
Bias

class chainer.links.Bias(axis=1, shape=None)
Broadcasted elementwise summation with learnable parameters.
Computes an elementwise summation as the bias() function does, except that its second input is a learnable bias parameter \(b\) held by the link.
Parameters:
 axis (int) – The first axis of the first input of the bias() function along which its second input is applied.
 shape (tuple of ints) – Shape of the learnable bias parameter. If None, this link does not have learnable parameters, so an explicit bias needs to be given as the second input of its __call__ method.
See also
See bias() for details.
Variables:
 b (Variable) – Bias parameter if shape is given. Otherwise, no attributes.
Bilinear

class chainer.links.Bilinear(left_size, right_size, out_size, nobias=False, initialW=None, initial_bias=None)
Bilinear layer that performs tensor multiplication.
Bilinear is a primitive link that wraps the bilinear() function. It holds parameters W, V1, V2, and b corresponding to the arguments of bilinear().
Parameters:
 left_size (int) – Dimension of input vector \(e^1\) (\(J\)).
 right_size (int) – Dimension of input vector \(e^2\) (\(K\)).
 out_size (int) – Dimension of output vector \(y\) (\(L\)).
 nobias (bool) – If True, parameters V1, V2, and b are omitted.
 initialW (3D numpy array) – Initial value of \(W\). The shape of this argument must be (left_size, right_size, out_size). If None, \(W\) is initialized by a centered Gaussian distribution properly scaled according to the dimensions of inputs and outputs. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (tuple) – Initial values of \(V^1\), \(V^2\) and \(b\). The length of this argument must be 3. The elements of this tuple must have the shapes (left_size, out_size), (right_size, out_size), and (out_size,), respectively. If None, \(V^1\) and \(V^2\) are initialized by scaled centered Gaussian distributions and \(b\) is set to \(0\). May also be a tuple of callables that take numpy.ndarray or cupy.ndarray and edit its value.
See also
See chainer.functions.bilinear() for details.
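As a rough illustration of what the parameters W, V1, V2 and b contribute, the following sketch evaluates a bilinear form for a single sample using plain Python lists. It assumes the usual definition, a tensor term over W plus two linear terms and a bias; the helper name bilinear and its list-based layout are hypothetical and are not Chainer's implementation.

```python
def bilinear(e1, e2, W, V1=None, V2=None, b=None):
    """Bilinear form for one sample, written with plain Python lists.

    e1 has length J, e2 has length K, W is a J x K x L nested list.
    V1 (J x L), V2 (K x L) and b (length L) are optional, which mirrors
    the effect of nobias=True.
    """
    J, K, L = len(e1), len(e2), len(W[0][0])
    y = []
    for l in range(L):
        # Tensor term: sum over j, k of e1[j] * e2[k] * W[j][k][l]
        acc = sum(e1[j] * e2[k] * W[j][k][l]
                  for j in range(J) for k in range(K))
        if V1 is not None:
            # Linear terms and bias
            acc += sum(e1[j] * V1[j][l] for j in range(J))
            acc += sum(e2[k] * V2[k][l] for k in range(K))
            acc += b[l]
        y.append(acc)
    return y


e1, e2 = [1.0, 2.0], [3.0]
W = [[[1.0]], [[0.5]]]                      # shape (J=2, K=1, L=1)
V1, V2, b = [[1.0], [1.0]], [[2.0]], [0.5]
y = bilinear(e1, e2, W, V1, V2, b)
y_nobias = bilinear(e1, e2, W)
```

The shapes used here follow the ones required of initialW and initial_bias above.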
Convolution2D

class chainer.links.Convolution2D(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, bias=0, nobias=False, use_cudnn=True, initialW=None, initial_bias=None, deterministic=False)
Two-dimensional convolutional layer.
This link wraps the convolution_2d() function and holds the filter weight and bias vector as parameters.
Parameters:
 in_channels (int) – Number of channels of input arrays. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 wscale (float) – Scaling factor of the initial weight.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this link does not use the bias term.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
 initialW (4D array) – Initial weight value. If None, then this function uses wscale to initialize the weight. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then this function uses bias to initialize the bias vector. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 deterministic (bool) – The output of this link can be non-deterministic when it uses cuDNN. If this option is True, then it forces cuDNN to use a deterministic algorithm. This option is only available for cuDNN version >= v4.
See also
See chainer.functions.convolution_2d() for the definition of two-dimensional convolution.
ConvolutionND

class chainer.links.ConvolutionND(ndim, in_channels, out_channels, ksize, stride=1, pad=0, initialW=None, initial_bias=None, use_cudnn=True, cover_all=False)
N-dimensional convolution layer.
This link wraps the convolution_nd() function and holds the filter weight and bias vector as parameters.
Parameters:
 ndim (int) – Number of spatial dimensions.
 in_channels (int) – Number of channels of input arrays.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or tuple of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k, ..., k) are equivalent.
 stride (int or tuple of ints) – Stride of filter application. stride=s and stride=(s, s, ..., s) are equivalent.
 pad (int or tuple of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p, ..., p) are equivalent.
 initialW – Value used to initialize the filter weight. May be an initializer instance or another value that the init_weight() helper function can take.
 initial_bias – Value used to initialize the bias vector. May be an initializer instance or another value, except None, that the init_weight() helper function can take. If None is given, this link does not use the bias vector.
 use_cudnn (bool) – If True, then this link uses cuDNN if available. See convolution_nd() for the exact conditions of cuDNN availability.
 cover_all (bool) – If True, all spatial locations are convoluted into some output pixels. It may make the output size larger. cover_all needs to be False if you want to use cuDNN.
See also
See convolution_nd() for the definition of N-dimensional convolution. See convolution_2d() for the definition of two-dimensional convolution.
Deconvolution2D

class chainer.links.Deconvolution2D(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, bias=0, nobias=False, outsize=None, use_cudnn=True, initialW=None, initial_bias=None, deterministic=False)
Two-dimensional deconvolution function.
This link wraps the deconvolution_2d() function and holds the filter weight and bias vector as parameters.
Parameters:
 in_channels (int) – Number of channels of input arrays. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 wscale (float) – Scaling factor of the initial weight.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this function does not use the bias term.
 outsize (tuple) – Expected output size of the deconvolutional operation. It should be a pair of height and width \((out_H, out_W)\). The default value is None, in which case the output size is estimated from the input size, stride and pad.
 use_cudnn (bool) – If True, then this function uses cuDNN if available.
 initialW (4D array) – Initial weight value. If None, then this function uses wscale to initialize the weight. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then this function uses bias to initialize the bias vector. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 deterministic (bool) – The output of this link can be non-deterministic when it uses cuDNN. If this option is True, then it forces cuDNN to use a deterministic algorithm. This option is only available for cuDNN version >= v4.
The filter weight has four dimensions \((c_I, c_O, k_H, k_W)\), which indicate the number of input channels, output channels, height and width of the kernels, respectively. The filter weight is initialized with i.i.d. Gaussian random samples, each of which has zero mean and deviation \(\sqrt{1/(c_I k_H k_W)}\) by default. The deviation is scaled by wscale if specified.
The bias vector is of size \(c_O\). Its elements are initialized by the bias argument. If the nobias argument is set to True, then this function does not hold the bias parameter.
See also
See chainer.functions.deconvolution_2d() for the definition of two-dimensional convolution.
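When outsize is None, the output size is estimated from the input size, stride and pad. The sketch below shows the usual arithmetic for that estimate and checks that it inverts the corresponding convolution size. Both helper names are hypothetical and the formulas are the standard ones, assumed here rather than quoted from this page.

```python
def conv_outsize(size, k, s, p):
    """Output length of a convolution along one axis."""
    return (size + 2 * p - k) // s + 1


def deconv_outsize(size, k, s, p):
    """Estimated output length of a deconvolution along one axis."""
    return s * (size - 1) + k - 2 * p


in_hw = (4, 4)
ksize, stride, pad = 4, 2, 1

# Estimate (out_H, out_W) the way outsize=None would be resolved in this sketch.
out_hw = tuple(deconv_outsize(n, ksize, stride, pad) for n in in_hw)

# Convolving the estimated size with the same settings recovers the input size.
round_trip = tuple(conv_outsize(n, ksize, stride, pad) for n in out_hw)
```

Several output sizes can map to the same convolution result when the stride is larger than 1, which is one reason an explicit outsize can be useful.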
DeconvolutionND

class chainer.links.DeconvolutionND(ndim, in_channels, out_channels, ksize, stride=1, pad=0, outsize=None, initialW=None, initial_bias=0, use_cudnn=True)
N-dimensional deconvolution function.
This link wraps the deconvolution_nd() function and holds the filter weight and bias vector as its parameters.
Parameters:
 ndim (int) – Number of spatial dimensions.
 in_channels (int) – Number of channels of input arrays.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or tuple of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k, ..., k) are equivalent.
 stride (int or tuple of ints) – Stride of filter application. stride=s and stride=(s, s, ..., s) are equivalent.
 pad (int or tuple of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p, ..., p) are equivalent.
 outsize (tuple of ints) – Expected output size of the deconvolutional operation. It should be a tuple of ints that represents the output size of each dimension. The default value is None, in which case the output size is estimated from the input size, stride and pad.
 initialW – Value used to initialize the filter weight. May be an initializer instance or another value that the init_weight() function can take.
 initial_bias – Value used to initialize the bias vector. May be an initializer instance or another value, except None, that the init_weight() function can take. If None is supplied, this link does not use the bias vector.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
DilatedConvolution2D

class chainer.links.DilatedConvolution2D(in_channels, out_channels, ksize, stride=1, pad=0, dilate=1, wscale=1, bias=0, nobias=False, use_cudnn=True, initialW=None, initial_bias=None)
Two-dimensional dilated convolutional layer.
This link wraps the dilated_convolution_2d() function and holds the filter weight and bias vector as parameters.
Parameters:
 in_channels (int) – Number of channels of input arrays. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_channels (int) – Number of channels of output arrays.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
 dilate (int or pair of ints) – Dilation factor of filter applications. dilate=d and dilate=(d, d) are equivalent.
 wscale (float) – Scaling factor of the initial weight.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this link does not use the bias term.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
 initialW (4D array) – Initial weight value. If None, then this function uses wscale to initialize the weight. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then this function uses bias to initialize the bias vector. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
See also
See chainer.functions.dilated_convolution_2d() for the definition of two-dimensional dilated convolution.
EmbedID

class chainer.links.EmbedID(in_size, out_size, initialW=None, ignore_label=None)
Efficient linear layer for one-hot input.
This is a link that wraps the embed_id() function. This link holds the ID (word) embedding matrix W as a parameter.
Parameters:
 in_size (int) – Number of different identifiers (a.k.a. vocabulary size).
 out_size (int) – Size of embedding vector.
 initialW (2D array) – Initial weight value. If None, then the matrix is initialized from the standard normal distribution. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 ignore_label (int or None) – If ignore_label is an int value, the i-th column of the return value is filled with 0.
Variables:
 W (Variable) – Embedding parameter matrix.
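The phrase "efficient linear layer for one-hot input" can be illustrated with a short sketch: looking up a row of W gives the same result as multiplying a one-hot vector by W, without materializing the one-hot vector. The function below is a hypothetical stand-in written with plain lists, and its handling of ignore_label (emitting a zero vector for matching IDs) is an assumption based on the parameter description above.

```python
def embed_id(ids, W, ignore_label=None):
    """Return one embedding vector per ID by indexing rows of W."""
    out_size = len(W[0])
    return [[0.0] * out_size if i == ignore_label else list(W[i])
            for i in ids]


def one_hot_times_matrix(index, W):
    """Reference computation: explicit one-hot vector multiplied by W."""
    one_hot = [1.0 if j == index else 0.0 for j in range(len(W))]
    return [sum(one_hot[j] * W[j][k] for j in range(len(W)))
            for k in range(len(W[0]))]


W = [[0.1, 0.2],    # embedding of ID 0
     [0.3, 0.4],    # embedding of ID 1
     [0.5, 0.6]]    # embedding of ID 2

vectors = embed_id([2, 0, -1], W, ignore_label=-1)
reference = one_hot_times_matrix(2, W)
```

The lookup costs time proportional to out_size per ID, whereas the explicit product costs in_size times out_size.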
GRU

class chainer.links.GRU(n_units, n_inputs=None, init=None, inner_init=None, bias_init=0)
Stateless Gated Recurrent Unit function (GRU).
The GRU function has six parameters \(W_r\), \(W_z\), \(W\), \(U_r\), \(U_z\), and \(U\). All these parameters are \(n \times n\) matrices, where \(n\) is the dimension of hidden vectors.
Given two inputs, a previous hidden vector \(h\) and an input vector \(x\), GRU returns the next hidden vector \(h'\) defined as
\[\begin{split}r &= \sigma(W_r x + U_r h), \\ z &= \sigma(W_z x + U_z h), \\ \bar{h} &= \tanh(W x + U (r \odot h)), \\ h' &= (1 - z) \odot h + z \odot \bar{h},\end{split}\]
where \(\sigma\) is the sigmoid function, and \(\odot\) is the elementwise product.
GRU does not hold the value of the hidden vector \(h\), so it is stateless. Use StatefulGRU as a stateful GRU.
See:
 On the Properties of Neural Machine Translation: Encoder-Decoder Approaches [Cho+, SSST2014].
 Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling [Chung+, NIPS2014 DLWorkshop].
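The update equations of the GRU can be traced with a minimal sketch that uses plain Python lists for vectors and matrices. The names gru_step, matvec and sigmoid are hypothetical helpers and bias terms are omitted, so this illustrates the formulas only, not the link's actual implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x, h, W_r, U_r, W_z, U_z, W, U):
    """One stateless GRU step: returns h' given the input x and previous h."""
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h))]
    rh = [ri * hi for ri, hi in zip(r, h)]                    # r (.) h
    h_bar = [math.tanh(a + b)
             for a, b in zip(matvec(W, x), matvec(U, rh))]
    # h' = (1 - z) (.) h + z (.) h_bar
    return [(1.0 - zi) * hi + zi * hb for zi, hi, hb in zip(z, h, h_bar)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x, h = [1.0, 2.0], [0.4, -0.8]
# With all-zero weights, z = 0.5 and h_bar = 0, so h' is half of h.
h_next = gru_step(x, h, zeros, zeros, zeros, zeros, zeros, zeros)
```

Because the function takes h as an argument and returns h' without storing it, the caller is responsible for carrying the state between time steps, which is what "stateless" means here.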
Highway

class chainer.links.Highway(in_out_size, nobias=False, activate=<function relu>, init_Wh=None, init_Wt=None, init_bh=None, init_bt=-1)
Highway module.
In a highway network, two gates are added to the ordinary non-linear transformation (\(H(x) = activate(W_h x + b_h)\)). One gate is the transform gate \(T(x) = \sigma(W_t x + b_t)\), and the other is the carry gate \(C(x)\). For simplicity, the author defined \(C = 1 - T\). The Highway module returns \(y\) defined as
\[y = activate(W_h x + b_h) \odot \sigma(W_t x + b_t) + x \odot (1 - \sigma(W_t x + b_t))\]
The output array has the same spatial size as the input. In order to satisfy this, \(W_h\) and \(W_t\) must be square matrices.
Parameters:
 in_out_size (int) – Dimension of input and output vectors.
 nobias (bool) – If True, then this function does not use the bias.
 activate – Activation function of plain array. \(tanh\) is also available.
 init_Wh (2D array) – Initial weight value of plain array. If None, a default initialization is used. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 init_bh (1D array) – Initial bias value of plain array. If None, the bias is initialized to a zero vector. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 init_Wt (2D array) – Initial weight value of transform array. If None, a default initialization is used. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 init_bt (1D array) – Initial bias value of transform array. The default value is a vector filled with -1. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value. A negative value is recommended by the author of the paper (e.g. -1, -3, ...).
See:
 Highway Networks.
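The highway formula can be followed step by step with a small sketch using plain Python lists. The helper names are hypothetical, and the code only mirrors the equation for \(y\) given above; it also shows why a strongly negative transform bias makes the module start out close to the identity.

```python
import math


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def highway(x, W_h, b_h, W_t, b_t, activate=relu):
    H = [activate(v) for v in affine(W_h, x, b_h)]   # plain transformation
    T = [sigmoid(v) for v in affine(W_t, x, b_t)]    # transform gate
    # Carry gate is C = 1 - T.
    return [h * t + xi * (1.0 - t) for h, t, xi in zip(H, T, x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]

# T = 0.5 everywhere: the output mixes H(x) = 3 and x in equal parts.
y_mixed = highway(x, zeros, [3.0, 3.0], zeros, [0.0, 0.0])

# A very negative transform bias closes the gate, so y is approximately x.
y_carry = highway(x, zeros, [3.0, 3.0], zeros, [-100.0, -100.0])
```

W_h and W_t are square here (2 x 2), which is what keeps the output the same size as the input.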
Inception

class chainer.links.Inception(in_channels, out1, proj3, out3, proj5, out5, proj_pool, conv_init=None, bias_init=None)
Inception module of GoogLeNet.
It applies four different functions to the input array and concatenates their outputs along the channel dimension. Three of them are 2D convolutions of sizes 1x1, 3x3 and 5x5. Convolution paths of 3x3 and 5x5 sizes have 1x1 convolutions (called projections) ahead of them. The other path consists of 1x1 convolution (projection) and 3x3 max pooling.
The output array has the same spatial size as the input. In order to satisfy this, the Inception module uses appropriate padding for each convolution and pooling.
See: Going Deeper with Convolutions.
Parameters:
 in_channels (int) – Number of channels of input arrays.
 out1 (int) – Output size of 1x1 convolution path.
 proj3 (int) – Projection size of 3x3 convolution path.
 out3 (int) – Output size of 3x3 convolution path.
 proj5 (int) – Projection size of 5x5 convolution path.
 out5 (int) – Output size of 5x5 convolution path.
 proj_pool (int) – Projection size of max pooling path.
 conv_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the convolution matrix weights. May be None to use default initialization.
 bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the convolution bias weights. May be None to use default initialization.
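Since the four paths are concatenated along the channel dimension and the spatial size is preserved, the output shape follows directly from the constructor arguments. The helper below is a hypothetical sketch of that bookkeeping; the example numbers are arbitrary and only illustrate the arithmetic.

```python
def inception_out_shape(in_shape, out1, out3, out5, proj_pool):
    """Output shape (N, C, H, W) of an Inception-style module.

    The projection sizes proj3 and proj5 only affect intermediate arrays,
    so they do not appear in the final channel count.
    """
    n, _, h, w = in_shape
    channels = out1 + out3 + out5 + proj_pool
    return (n, channels, h, w)


out_shape = inception_out_shape((8, 192, 28, 28),
                                out1=64, out3=128, out5=32, proj_pool=32)
```

This is handy when stacking modules, because the in_channels of the next module must equal the summed channel count of the previous one.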
InceptionBN

class chainer.links.InceptionBN(in_channels, out1, proj3, out3, proj33, out33, pooltype, proj_pool=None, stride=1, conv_init=None, dtype=<type 'numpy.float32'>)
Inception module of the new GoogLeNet with BatchNormalization.
This chain acts like Inception, except that InceptionBN uses BatchNormalization on top of each convolution, the 5x5 convolution path is replaced by two consecutive 3x3 convolution applications, and the pooling method is configurable.
See: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Parameters:
 in_channels (int) – Number of channels of input arrays.
 out1 (int) – Output size of the 1x1 convolution path.
 proj3 (int) – Projection size of the single 3x3 convolution path.
 out3 (int) – Output size of the single 3x3 convolution path.
 proj33 (int) – Projection size of the double 3x3 convolutions path.
 out33 (int) – Output size of the double 3x3 convolutions path.
 pooltype (str) – Pooling type. It must be either 'max' or 'avg'.
 proj_pool (bool) – If True, do projection in the pooling path.
 stride (int) – Stride parameter of the last convolution of each path.
 conv_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the convolution matrix weights. May be None to use default initialization.
 dtype (numpy.dtype) – Type to use in BatchNormalization.
Variables:
 train (bool) – If True, then batch normalization layers are used in training mode. If False, they are used in testing mode.
Linear

class chainer.links.Linear(in_size, out_size, wscale=1, bias=0, nobias=False, initialW=None, initial_bias=None)
Linear layer (a.k.a. fully-connected layer).
This is a link that wraps the linear() function, and holds a weight matrix W and optionally a bias vector b as parameters.
The weight matrix W is initialized with i.i.d. Gaussian samples, each of which has zero mean and deviation \(\sqrt{1/\text{in_size}}\). The bias vector b is of size out_size. Each element is initialized with the bias value. If the nobias argument is set to True, then this link does not hold a bias vector.
Parameters:
 in_size (int) – Dimension of input vectors. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_size (int) – Dimension of output vectors.
 wscale (float) – Scaling factor of the weight matrix.
 bias (float) – Initial bias value.
 nobias (bool) – If True, then this function does not use the bias.
 initialW (2D array) – Initial weight value. If None, then this function uses wscale to initialize the weight. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
 initial_bias (1D array) – Initial bias value. If None, then this function uses bias to initialize the bias vector. May also be a callable that takes numpy.ndarray or cupy.ndarray and edits its value.
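The default initialization described above can be sketched with the standard library alone. The helpers init_linear and linear are hypothetical, the (out_size, in_size) layout of W is an assumption, and wscale is taken to multiply the standard deviation.

```python
import math
import random


def init_linear(in_size, out_size, wscale=1.0, bias=0.0, seed=0):
    """Draw W from N(0, (wscale * sqrt(1 / in_size))^2) and fill b with bias."""
    rng = random.Random(seed)
    std = wscale * math.sqrt(1.0 / in_size)
    W = [[rng.gauss(0.0, std) for _ in range(in_size)]
         for _ in range(out_size)]
    b = [bias] * out_size
    return W, b


def linear(x, W, b):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


W, b = init_linear(in_size=100, out_size=200)

# Empirical standard deviation of the sampled weights (expected: about 0.1).
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))

y = linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, 0.5, 0.5])
```

Scaling the deviation by \(\sqrt{1/\text{in_size}}\) keeps the variance of each output roughly independent of the input dimension.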
LSTM

class chainer.links.LSTM(in_size, out_size, **kwargs)
Fully-connected LSTM layer.
This is a fully-connected LSTM layer as a chain. Unlike the lstm() function, which is defined as a stateless activation function, this chain holds upward and lateral connections as child links.
It also maintains states, including the cell state and the output at the previous time step. Therefore, it can be used as a stateful LSTM.
This link supports variable length inputs. The minibatch size of the current input must be equal to or smaller than that of the previous one. The minibatch size of c and h is determined as that of the first input x. When the minibatch size of the i-th input is smaller than that of the previous input, this link only updates c[0:len(x)] and h[0:len(x)] and doesn't change the rest of c and h. So, please sort input sequences in descending order of lengths before applying the function.
Parameters:
 in_size (int) – Dimension of input vectors. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_size (int) – Dimensionality of output vectors.
 lateral_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the lateral connections. May be None to use default initialization.
 upward_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the upward connections. May be None to use default initialization.
 bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the biases of the cell input, input gate and output gate of the upward connection. May be a scalar, in which case the bias is initialized by this value. May be None to use default initialization.
 forget_bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the biases of the forget gate of the upward connection. May be a scalar, in which case the bias is initialized by this value. May be None to use default initialization.
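The requirement to sort sequences in descending order of length can be made concrete with a small sketch. It builds the per-time-step minibatches that would be fed to a stateful recurrent link and shows that their sizes never increase. The function name is hypothetical and no Chainer API is used.

```python
def make_time_major_batches(sequences):
    """Sort sequences by length (longest first) and slice them per time step.

    Returns the sorted sequences and a list where entry t holds the t-th
    element of every sequence that is still running at step t.
    """
    ordered = sorted(sequences, key=len, reverse=True)
    max_len = len(ordered[0]) if ordered else 0
    steps = []
    for t in range(max_len):
        # Longer sequences come first, so the active ones form a prefix,
        # matching the c[0:len(x)] and h[0:len(x)] update described above.
        steps.append([seq[t] for seq in ordered if len(seq) > t])
    return ordered, steps


sequences = [[1, 2], [3, 4, 5], [6]]
ordered, steps = make_time_major_batches(sequences)
batch_sizes = [len(batch) for batch in steps]
```

Without the sort, a short sequence could sit in front of a longer one, and the prefix update would then overwrite the wrong rows of the state.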
MLPConvolution2D

class chainer.links.MLPConvolution2D(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, activation=<function relu>, use_cudnn=True, conv_init=None, bias_init=None)
Two-dimensional MLP convolution layer of Network in Network.
This is an “mlpconv” layer from the Network in Network paper. This layer is a two-dimensional convolution layer followed by 1x1 convolution layers and interleaved activation functions.
Note that it does not apply the activation function to the output of the last 1x1 convolution layer.
Parameters:
 in_channels (int or None) – Number of channels of input arrays. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 out_channels (tuple of ints) – Tuple of number of channels. The i-th integer indicates the number of filters of the i-th convolution.
 ksize (int or pair of ints) – Size of filters (a.k.a. kernels) of the first convolution layer. ksize=k and ksize=(k, k) are equivalent.
 stride (int or pair of ints) – Stride of filter applications at the first convolution layer. stride=s and stride=(s, s) are equivalent.
 pad (int or pair of ints) – Spatial padding width for input arrays at the first convolution layer. pad=p and pad=(p, p) are equivalent.
 activation (function) – Activation function for internal hidden units. Note that this function is not applied to the output of this link.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
 conv_init – An initializer of weight matrices passed to the convolution layers.
 bias_init – An initializer of bias vectors passed to the convolution layers.
See: Network in Network.
Variables:
 activation (function) – Activation function.
NStepLSTM

class chainer.links.NStepLSTM(n_layers, in_size, out_size, dropout, use_cudnn=True)
Stacked LSTM for sequences.
This link is a stacked version of LSTM for sequences. It calculates the hidden and cell states of all layers at end-of-string, and all hidden states of the last layer for each time step.
Unlike chainer.functions.n_step_lstm(), this function automatically sorts inputs in descending order by length and transposes the sequence. Users just need to call the link with a list of chainer.Variable holding sequences.
Scale

class chainer.links.Scale(axis=1, W_shape=None, bias_term=False, bias_shape=None)
Broadcasted elementwise product with learnable parameters.
Computes an elementwise product as the scale() function does, except that its second input is a learnable weight parameter \(W\) held by the link.
Parameters:
 axis (int) – The first axis of the first input of the scale() function along which its second input is applied.
 W_shape (tuple of ints) – Shape of the learnable weight parameter. If None, this link does not have a learnable weight parameter, so an explicit weight needs to be given as the second input of its __call__ method.
 bias_term (bool) – Whether to also learn a bias (equivalent to Scale link + Bias link).
 bias_shape (tuple of ints) – Shape of the learnable bias. If W_shape is None, this should be given to determine the shape. Otherwise, the bias has the same shape W_shape as the weight parameter and bias_shape is ignored.
See also
See scale() for details.
StatefulGRU

class chainer.links.StatefulGRU(in_size, out_size, init=None, inner_init=None, bias_init=0)
Stateful Gated Recurrent Unit function (GRU).
The Stateful GRU function has six parameters \(W_r\), \(W_z\), \(W\), \(U_r\), \(U_z\), and \(U\). All these parameters are \(n \times n\) matrices, where \(n\) is the dimension of hidden vectors.
Given an input vector \(x\), Stateful GRU returns the next hidden vector \(h'\) defined as
\[\begin{split}r &= \sigma(W_r x + U_r h), \\ z &= \sigma(W_z x + U_z h), \\ \bar{h} &= \tanh(W x + U (r \odot h)), \\ h' &= (1 - z) \odot h + z \odot \bar{h},\end{split}\]
where \(h\) is the current hidden vector.
As the name indicates, StatefulGRU is stateful, meaning that it also holds the next hidden vector \(h'\) as a state. Use GRU as a stateless version of GRU.
Parameters:
 in_size (int) – Dimension of input vector \(x\).
 out_size (int) – Dimension of hidden vector \(h\).
 init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the GRU’s input units (\(W\)). May be None to use default initialization.
 inner_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value. It is used for initialization of the GRU’s inner recurrent units (\(U\)). May be None to use default initialization.
 bias_init – A callable or scalar used to initialize the bias values for both the GRU’s inner and input units. May be None to use default initialization.
Variables:
 h (Variable) – Hidden vector that indicates the state of StatefulGRU.
See also
GRU
StatefulPeepholeLSTM

class chainer.links.StatefulPeepholeLSTM(in_size, out_size)
Fully-connected LSTM layer with peephole connections.
This is a fully-connected LSTM layer with peephole connections as a chain. Unlike the LSTM link, this chain holds peep_i, peep_f and peep_o as child links besides upward and lateral.
Given an input vector \(x\), Peephole returns the next hidden vector \(h'\) defined as
\[\begin{split}a &= \tanh(upward x + lateral h), \\ i &= \sigma(upward x + lateral h + peep_i c), \\ f &= \sigma(upward x + lateral h + peep_f c), \\ c' &= a \odot i + f \odot c, \\ o &= \sigma(upward x + lateral h + peep_o c'), \\ h' &= o \tanh(c'),\end{split}\]
where \(\sigma\) is the sigmoid function, \(\odot\) is the elementwise product, \(c\) is the current cell state, \(c'\) is the next cell state and \(h\) is the current hidden vector.
Variables:
 upward (Linear) – Linear layer of upward connections.
 lateral (Linear) – Linear layer of lateral connections.
 peep_i (Linear) – Linear layer of peephole connections to the input gate.
 peep_f (Linear) – Linear layer of peephole connections to the forget gate.
 peep_o (Linear) – Linear layer of peephole connections to the output gate.
 c (Variable) – Cell states of LSTM units.
 h (Variable) – Output at the current time step.
StatelessLSTM

class chainer.links.StatelessLSTM(in_size, out_size, lateral_init=None, upward_init=None, bias_init=0, forget_bias_init=0)
Stateless LSTM layer.
This is a fully-connected LSTM layer as a chain. Unlike the lstm() function, this chain holds upward and lateral connections as child links. This link doesn’t keep cell and hidden states.
Variables:
 upward (chainer.links.Linear) – Linear layer of upward connections.
 lateral (chainer.links.Linear) – Linear layer of lateral connections.
Activation/loss/normalization functions with parameters
BatchNormalization

class chainer.links.BatchNormalization(size, decay=0.9, eps=2e-05, dtype=<type 'numpy.float32'>, use_gamma=True, use_beta=True, initial_gamma=None, initial_beta=None, use_cudnn=True)
Batch normalization layer on outputs of linear or convolution functions.
This link wraps the batch_normalization() and fixed_batch_normalization() functions.
It runs in three modes: training mode, fine-tuning mode, and testing mode.
In training mode, it normalizes the input by batch statistics. It also maintains approximated population statistics by moving averages, which can be used for instant evaluation in testing mode.
In fine-tuning mode, it accumulates the input to compute population statistics. In order to correctly compute the population statistics, a user must use this mode to feed mini-batches running through the whole training dataset.
In testing mode, it uses pre-computed population statistics to normalize the input variable. The population statistics are approximate if they were computed in training mode, or accurate if they were correctly computed in fine-tuning mode.
Parameters:
 size (int or tuple of ints) – Size (or shape) of channel dimensions.
 decay (float) – Decay rate of moving average. It is used on training.
 eps (float) – Epsilon value for numerical stability.
 dtype (numpy.dtype) – Type to use in computing.
 use_gamma (bool) – If True, use the scaling parameter. Otherwise, use unit (1), which has no effect.
 use_beta (bool) – If True, use the shifting parameter. Otherwise, use unit (0), which has no effect.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
See: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Variables:
 gamma (Variable) – Scaling parameter.
 beta (Variable) – Shifting parameter.
 avg_mean (Variable) – Population mean.
 avg_var (Variable) – Population variance.
 N (int) – Count of batches given for fine-tuning.
 decay (float) – Decay rate of moving average. It is used on training.
 eps (float) – Epsilon value for numerical stability. This value is added to the batch variances.
 use_cudnn (bool) – If True, then this link uses cuDNN if available.
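The training-mode behaviour, normalizing by batch statistics while tracking moving averages for later use in testing mode, can be sketched for a single channel as follows. The function names are hypothetical and the moving-average update is a simplified assumption (for instance, it applies no unbiased-variance correction), so the exact numbers may differ from the link's.

```python
import math


def batch_norm_train(xs, gamma, beta, avg_mean, avg_var,
                     decay=0.9, eps=2e-5):
    """Normalize one channel by batch statistics and update moving averages."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    ys = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]
    # Exponential moving averages approximate the population statistics.
    new_avg_mean = decay * avg_mean + (1.0 - decay) * mean
    new_avg_var = decay * avg_var + (1.0 - decay) * var
    return ys, new_avg_mean, new_avg_var


def batch_norm_test(xs, gamma, beta, avg_mean, avg_var, eps=2e-5):
    """Testing mode: normalize with the stored population statistics."""
    return [gamma * (x - avg_mean) / math.sqrt(avg_var + eps) + beta
            for x in xs]


xs = [1.0, 2.0, 3.0, 4.0]
ys, avg_mean, avg_var = batch_norm_train(xs, gamma=1.0, beta=0.0,
                                         avg_mean=0.0, avg_var=1.0)
ys_test = batch_norm_test(xs, 1.0, 0.0, avg_mean, avg_var)
```

After a single batch the moving averages are still dominated by their initial values, which is why statistics gathered only in training mode are described above as approximate.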
LayerNormalization¶

class
chainer.links.
LayerNormalization
(size=None, eps=1e06, initial_gamma=None, initial_beta=None)[source]¶ Layer normalization layer on outputs of linear functions.
This link implements a “layer normalization” layer which normalizes the input units by statistics that are computed along the second axis, scales and shifts them. Parameter initialization will be deferred until the first forward data pass at which time the size will be determined.
Parameters:  size (int) – Size of input units. If None, parameter initialization will be deferred until the first forward data pass, at which time the size will be determined.
 eps (float) – Epsilon value for numerical stability of normalization.
 initial_gamma (Initializer) – Initializer for scaling vector. If None, then the vector is filled by 1. If a scalar, the vector is filled by it. If numpy.ndarray, the vector is set by it.
 initial_beta (Initializer) – Initializer for shifting vector. If None, then the vector is filled by 0. If a scalar, the vector is filled by it. If numpy.ndarray, the vector is set by it.
See: Layer Normalization
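The normalization described for LayerNormalization can be sketched in plain Python. This is a minimal illustration rather than the library's code: the function name `layer_normalize` is hypothetical, inputs are plain lists of rows, and adding `eps` to the variance before the square root is an assumption about where the stabilizer enters.

```python
import math


def layer_normalize(x, gamma, beta, eps=1e-6):
    """Normalize each row of `x` along the second axis, then scale and shift."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)  # assumed placement of eps
        out.append([g * (v - mean) * inv_std + b
                    for v, g, b in zip(row, gamma, beta)])
    return out


x = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
y = layer_normalize(x, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
```

Each row is normalized independently of the other rows in the minibatch, which is the main contrast with batch normalization.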
BinaryHierarchicalSoftmax¶

class
chainer.links.
BinaryHierarchicalSoftmax
(in_size, tree)[source]¶ Hierarchical softmax layer over binary tree.
In natural language applications, the vocabulary is often too large to use softmax loss directly. Instead, the hierarchical softmax uses a product of sigmoid functions. It costs only \(O(\log(n))\) time on average, where \(n\) is the vocabulary size.
First, the user needs to prepare a binary tree in which each leaf corresponds to a word in the vocabulary. When a word \(x\) is given, exactly one path from the root of the tree to the leaf of the word exists. Let \(\mbox{path}(x) = ((e_1, b_1), \dots, (e_m, b_m))\) be the path of \(x\), where \(e_i\) is the index of the \(i\)-th internal node, and \(b_i \in \{-1, 1\}\) indicates the direction to move at the \(i\)-th internal node (-1 is left, and 1 is right). Then, the probability of \(x\) is given as below:
\[\begin{split}P(x) &= \prod_{(e_i, b_i) \in \mbox{path}(x)}P(b_i | e_i) \\ &= \prod_{(e_i, b_i) \in \mbox{path}(x)}\sigma(b_i x^\top w_{e_i}),\end{split}\]
where \(\sigma(\cdot)\) is a sigmoid function, and \(w\) is a weight matrix.
This function costs \(O(\log(n))\) time, as the average length of paths is \(O(\log(n))\), and \(O(n)\) memory, as the number of internal nodes equals \(n - 1\).
Parameters:  in_size (int) – Dimension of input vectors.
 tree – A binary tree made with tuples like ((1, 2), 3).
Variables: W (Variable) – Weight parameter matrix.
See: Hierarchical Probabilistic Neural Network Language Model [Morin+, AISTAT2005].
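The path-product formula above can be checked with a small plain-Python sketch. The helper names `collect_paths` and `leaf_probability`, the depth-first numbering of internal nodes, and the example weights are all assumptions made for illustration; they are not taken from the library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def collect_paths(tree):
    """Map each leaf to its path ((e_1, b_1), ..., (e_m, b_m)).

    Internal nodes are numbered in depth-first order (an assumption);
    b = -1 means "go left" and b = 1 means "go right".
    """
    paths = {}
    next_index = [0]

    def walk(node, prefix):
        if not isinstance(node, tuple):
            paths[node] = prefix          # reached a leaf (a word id)
            return
        e = next_index[0]
        next_index[0] += 1
        left, right = node
        walk(left, prefix + ((e, -1),))
        walk(right, prefix + ((e, 1),))

    walk(tree, ())
    return paths


def leaf_probability(x, path, w):
    """P(leaf) as the product of sigmoid(b_i * x^T w_{e_i}) along the path."""
    p = 1.0
    for e, b in path:
        score = sum(xi * wi for xi, wi in zip(x, w[e]))
        p *= sigmoid(b * score)
    return p


tree = ((3, 1), (2, 0))                          # 4 leaves, 3 internal nodes
paths = collect_paths(tree)
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]       # one weight row per internal node
x = [1.0, 2.0]
probs = {leaf: leaf_probability(x, path, w) for leaf, path in paths.items()}
total = sum(probs.values())
```

Because \(\sigma(z) + \sigma(-z) = 1\) at every internal node, the leaf probabilities sum to one without any explicit normalization over the vocabulary.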

static
create_huffman_tree
(word_counts)[source]¶ Makes a Huffman tree from a dictionary containing word counts.
This method creates a binary Huffman tree, which is required for BinaryHierarchicalSoftmax. For example, {0: 8, 1: 5, 2: 6, 3: 4} is converted to ((3, 1), (2, 0)).
Parameters: word_counts (dict of int key and int or float values) – Dictionary representing counts of words.
Returns: Binary Huffman tree with tuples and keys of word_counts.
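One way to build such a tree is the standard heap-based Huffman construction, sketched below. This is an illustration under assumptions and not necessarily how create_huffman_tree is implemented: the function name `build_huffman_tree` and the tie-breaking rule (insertion order) are hypothetical.

```python
import heapq


def build_huffman_tree(word_counts):
    """Build a nested-tuple Huffman tree from a {word: count} dictionary."""
    if not word_counts:
        raise ValueError("word_counts must not be empty")
    # Each heap entry is (count, unique_id, subtree); the unique id breaks ties
    # so that subtrees themselves are never compared.
    heap = [(count, i, word) for i, (word, count) in enumerate(word_counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        count1, _, tree1 = heapq.heappop(heap)   # smallest count
        count2, _, tree2 = heapq.heappop(heap)   # second smallest count
        heapq.heappush(heap, (count1 + count2, next_id, (tree1, tree2)))
        next_id += 1
    return heap[0][2]


tree = build_huffman_tree({0: 8, 1: 5, 2: 6, 3: 4})
```

For the counts in the example above, the two rarest words (3 and 1) are merged first, then words 2 and 0, which gives the tree ((3, 1), (2, 0)).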
BlackOut¶

class
chainer.links.
BlackOut
(in_size, counts, sample_size)[source]¶ BlackOut loss layer.
See also
black_out() for more detail.
Variables: W (Variable) – Weight parameter matrix.
CRF1d¶

class
chainer.links.
CRF1d
(n_label)[source]¶ Linear-chain conditional random field loss layer.
This link wraps the crf1d() function. It holds a transition cost matrix as a parameter.
Parameters: n_label (int) – Number of labels.
See also
crf1d() for more detail.
Variables: cost (Variable) – Transition cost parameter.
argmax
(xs)[source]¶ Computes a state that maximizes a joint probability.
Parameters: xs (list of Variable) – Input vector for each label.
Returns: A tuple of Variable representing each log-likelihood and a list representing the argmax path.
Return type: tuple
See also
See crf1d_argmax() for more detail.
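Finding the best label sequence under a transition matrix is usually done with Viterbi decoding. The sketch below handles a single sequence in plain Python and is only an illustration: the function name `viterbi` is hypothetical, `cost[i][j]` is assumed to be a score for moving from label i to label j (larger is better), and batching is ignored.

```python
def viterbi(scores, cost):
    """Return (best_total_score, best_path) for one sequence.

    scores[t][k]: score of label k at position t.
    cost[i][j]:   transition score from label i to label j (assumed convention).
    """
    n_label = len(cost)
    alpha = list(scores[0])            # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(scores)):
        new_alpha, pointers = [], []
        for j in range(n_label):
            best_i = max(range(n_label), key=lambda i: alpha[i] + cost[i][j])
            pointers.append(best_i)
            new_alpha.append(alpha[best_i] + cost[best_i][j] + scores[t][j])
        alpha = new_alpha
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(range(n_label), key=lambda k: alpha[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return alpha[last], path


scores = [[1, 0], [0, 1], [1, 0]]
sticky_score, sticky_path = viterbi(scores, cost=[[0, -5], [-5, 0]])
free_score, free_path = viterbi(scores, cost=[[0, 0], [0, 0]])
```

With a strong penalty on switching labels the decoder keeps a single label throughout, while with zero transition scores it simply follows the per-position maximum.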

PReLU¶

class
chainer.links.
PReLU
(shape=(), init=0.25)[source]¶ Parametric ReLU function as a link.
Parameters:  shape (tuple of ints) – Shape of the parameter array.
 init (float) – Initial parameter value.
See the paper for details: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
Variables: W (Variable) – Coefficient of parametric ReLU.
Maxout¶

class
chainer.links.
Maxout
(in_size, out_size, pool_size, wscale=1, initialW=None, initial_bias=0)[source]¶ Fully-connected maxout layer.
Let M, P and N be an input dimension, a pool size, and an output dimension, respectively. For an input vector \(x\) of size M, it computes
\[Y_{i} = \mathrm{max}_{j} (W_{ij\cdot}x + b_{ij}).\]
Here \(W\) is a weight tensor of shape (M, P, N), \(b\) an optional bias vector of shape (M, P) and \(W_{ij\cdot}\) is a sub-vector extracted from \(W\) by fixing first and second dimensions to \(i\) and \(j\), respectively. Minibatch dimension is omitted in the above equation.
As for the actual implementation, this chain has a Linear link with a (M * P, N) weight matrix and an optional M * P dimensional bias vector.
Parameters:  in_size (int) – Dimension of input vectors.
 out_size (int) – Dimension of output vectors.
 pool_size (int) – Number of channels.
 wscale (float) – Scaling factor of the weight matrix.
 initialW (3D array or None) – Initial weight value. If None, then this function uses wscale to initialize.
 initial_bias (2D array, float or None) – Initial bias value. If it is float, initial bias is filled with this value. If None, bias is omitted.
Variables: linear (Link) – The Linear link that performs affine transformation.
See also
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-13) (pp. 1319-1327).
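The "linear layer followed by a max over each pool" structure can be sketched for a single input vector in plain Python. This is an illustration, not the link's implementation: the function name `maxout` is hypothetical, and the row layout (row `i * pool_size + j` holds piece j of output unit i) is an assumption.

```python
def maxout(x, weight, bias, pool_size):
    """Apply an affine map, then take the max over each group of `pool_size` units.

    weight: list of (out_size * pool_size) rows, each of length in_size.
    bias:   list of (out_size * pool_size) values.
    """
    # Affine transformation: one pre-activation per (output unit, pool piece).
    pre = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weight, bias)]
    out_size = len(pre) // pool_size
    # Maximum over the pool_size pieces that belong to each output unit.
    return [max(pre[i * pool_size:(i + 1) * pool_size]) for i in range(out_size)]


x = [1.0, -2.0]
weight = [[1, 0], [0, 1],     # two pieces for output unit 0
          [-1, 0], [0, -1]]   # two pieces for output unit 1
bias = [0.0, 0.0, 0.0, 0.0]
y = maxout(x, weight, bias, pool_size=2)
```

Here the pre-activations are 1, -2, -1 and 2, so the two output units take the values 1 and 2.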
NegativeSampling¶

class
chainer.links.
NegativeSampling
(in_size, counts, sample_size, power=0.75)[source]¶ Negative sampling loss layer.
This link wraps the negative_sampling() function. It holds the weight matrix as a parameter. It also builds a sampler internally given a list of word counts.
See also
negative_sampling() for more detail.
Variables: W (Variable) – Weight parameter matrix.
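A sampler built from word counts typically draws word ids with probability proportional to the count raised to `power`. The sketch below shows that idea in plain Python; it is an assumption about the sampler's behaviour, and the name `make_sampler` and its `seed` argument are hypothetical.

```python
import random


def make_sampler(counts, power=0.75, seed=None):
    """Return (probabilities, sample_fn) for the smoothed unigram distribution."""
    weights = [c ** power for c in counts]   # flatten the raw frequencies
    total = sum(weights)
    probs = [w / total for w in weights]
    rng = random.Random(seed)

    def sample(sample_size):
        # Draw `sample_size` word ids according to `probs`.
        return rng.choices(range(len(probs)), weights=probs, k=sample_size)

    return probs, sample


probs, sample = make_sampler([16, 1], power=0.75, seed=0)
negatives = sample(5)
```

Raising counts to a power below 1 gives rare words a larger share than their raw frequency would: with counts 16 and 1, the rarer word is drawn with probability 1/9 instead of 1/17.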
Machine learning models¶
Classifier¶

class
chainer.links.
Classifier
(predictor, lossfun=<function softmax_cross_entropy>, accfun=<function accuracy>)[source]¶ A simple classifier model.
This is an example of a chain that wraps another chain. It computes the loss and accuracy based on a given input/label pair.
Variables:  predictor (Link) – Predictor network.
 lossfun (function) – Loss function.
 accfun (function) – Function that computes accuracy.
 y (Variable) – Prediction for the last minibatch.
 loss (Variable) – Loss value for the last minibatch.
 accuracy (Variable) – Accuracy for the last minibatch.
 compute_accuracy (bool) – If True, compute accuracy on the forward computation. The default value is True.
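The wrapping pattern can be sketched without any framework. The class below is a hypothetical stand-in (`ClassifierSketch`) that mirrors the attributes listed above using plain lists; the loss and accuracy helpers are simplified assumptions, not the functions from chainer.functions.

```python
import math


def softmax_cross_entropy(logits, labels):
    """Mean negative log-probability of the correct label."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[label]
    return total / len(labels)


def accuracy(logits, labels):
    """Fraction of rows whose largest entry is at the labelled index."""
    hits = sum(1 for row, label in zip(logits, labels)
               if max(range(len(row)), key=row.__getitem__) == label)
    return hits / len(labels)


class ClassifierSketch:
    """Wraps a predictor and records y, loss and accuracy for the last minibatch."""

    def __init__(self, predictor, lossfun=softmax_cross_entropy, accfun=accuracy):
        self.predictor = predictor
        self.lossfun = lossfun
        self.accfun = accfun
        self.compute_accuracy = True
        self.y = self.loss = self.accuracy = None

    def __call__(self, x, t):
        self.y = self.predictor(x)
        self.loss = self.lossfun(self.y, t)
        if self.compute_accuracy:
            self.accuracy = self.accfun(self.y, t)
        return self.loss


# A trivial predictor that returns its input as the logits.
model = ClassifierSketch(lambda batch: batch)
x = [[2.0, 0.0], [0.0, 3.0]]
loss = model(x, [0, 0])
```

The wrapper returns only the loss, so it can be handed to an optimizer, while the prediction and accuracy stay available as attributes for reporting.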
Pretrained models¶
Pretrained models are mainly used to achieve good performance with a small
dataset, or to extract a semantic feature vector. Although CaffeFunction
automatically loads a pretrained model released as a caffemodel,
the following link models provide an interface for automatically converting
caffemodels, and easily extracting semantic feature vectors.
For example, to extract the feature vectors with VGG16Layers, which is
a common pretrained model in the field of image recognition,
users need to write the following few lines:
from chainer.links import VGG16Layers
from PIL import Image
model = VGG16Layers()
img = Image.open("path/to/image.jpg")
feature = model.extract([img], layers=["fc7"])["fc7"]
where fc7 denotes a layer before the last fully-connected layer.
Unlike the usual links, these classes automatically load all the
parameters from the pretrained models during initialization.
VGG16Layers¶

class
chainer.links.
VGG16Layers
(pretrained_model='auto')[source]¶ A pretrained CNN model with 16 layers provided by VGG team [1].
During initialization, this chain model automatically downloads the pretrained caffemodel, converts it to a chainer model, stores it in your local directory, and initializes all the parameters with it. This model would be useful when you want to extract a semantic feature vector from a given image, or fine-tune the model on a different dataset. Note that this pretrained model is released under Creative Commons Attribution License.
If you want to manually convert the pretrained caffemodel to a chainer model that can be specified in the constructor, please use the convert_caffemodel_to_npz classmethod instead.
[1] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition
Parameters: pretrained_model (str) – the destination of the pretrained chainer model serialized as a .npz file. If this argument is specified as auto, it automatically downloads the caffemodel from the internet. Note that in this case the converted chainer model is stored in the $CHAINER_DATASET_ROOT/pfnet/chainer/models directory, where $CHAINER_DATASET_ROOT is set to $HOME/.chainer/dataset unless you specify another value as an environment variable. The converted chainer model is automatically reused from the second time onward. If the argument is specified as None, the parameters are not initialized from the pretrained model, but by the default initializer used in the original paper, i.e., chainer.initializers.Normal(scale=0.01).
Variables: available_layers (list of str) – The list of available layer names used by __call__ and extract methods.
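The cache location described above can be resolved with a few lines of standard-library Python. This is a sketch of the documented rule only: the helper name `pretrained_model_dir` is hypothetical, and using `os.path.expanduser("~")` as a stand-in for `$HOME` is an assumption.

```python
import os


def pretrained_model_dir(environ=None):
    """Return the directory where converted models are expected to be stored."""
    environ = os.environ if environ is None else environ
    # Default root when CHAINER_DATASET_ROOT is not set.
    default_root = os.path.join(os.path.expanduser("~"), ".chainer", "dataset")
    root = environ.get("CHAINER_DATASET_ROOT", default_root)
    return os.path.join(root, "pfnet", "chainer", "models")


custom = pretrained_model_dir({"CHAINER_DATASET_ROOT": "/tmp/data"})
default = pretrained_model_dir({})
```

Setting the environment variable before the first use is a simple way to keep the downloaded and converted files on a disk with enough space.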
classmethod
convert_caffemodel_to_npz
(path_caffemodel, path_npz)[source]¶ Converts a pretrained caffemodel to a chainer model.
Parameters:

extract
(images, layers=['fc7'], size=(224, 224), test=True, volatile=OFF)[source]¶ Extracts all the feature maps of given images.
Unlike calling __call__ directly, this method accepts images as input and automatically transforms them into a proper variable. That is, it can be seen as a shortcut that implicitly calls the prepare and __call__ functions.
Parameters:  images (iterable of PIL.Image or numpy.ndarray) – Input images.
 layers (list of str) – The list of layer names you want to extract.
 size (pair of ints) – The resolution of resized images used as an input of CNN. The given images are not resized if this argument is None, but the resolutions of all the images should be the same.
 test (bool) – If True, dropout runs in test mode.
 volatile (Flag) – Volatility flag used for input variables.
Returns: A dictionary in which the key contains the layer name and the value contains the corresponding feature map variable.
Return type: Dictionary of ~chainer.Variable

predict
(images, oversample=True)[source]¶ Computes all the probabilities of given images.
Parameters:  images (iterable of PIL.Image or numpy.ndarray) – Input images.
 oversample (bool) – If True, it averages results across center, corners, and mirrors. Otherwise, it uses only the center.
Returns: Output that contains the class probabilities of given images.
Return type:


chainer.links.model.vision.vgg.
prepare
(image, size=(224, 224))[source]¶ Converts the given image to the numpy array for VGG models.
Note that you have to call this method before __call__ because the pretrained vgg model requires resizing the given image, converting RGB to BGR, subtracting the mean, and permuting the dimensions before calling.
Parameters:  image (PIL.Image or numpy.ndarray) – Input image. If an input is numpy.ndarray, its shape must be (height, width), (height, width, channels), or (channels, height, width), and the order of the channels must be RGB.
 size (pair of ints) – Size of converted images. If None, the given image is not resized.
Returns: The converted output array.
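The conversion steps other than resizing can be sketched with NumPy. This is an illustration and not the library's prepare function: the name `prepare_sketch` is hypothetical, the input is assumed to be a (height, width, channels) RGB array, and the per-channel BGR mean values are the commonly quoted ones for VGG models, used here as an assumption.

```python
import numpy as np


def prepare_sketch(image, mean_bgr=(103.939, 116.779, 123.68)):
    """Convert an RGB (height, width, channels) array to a BGR (channels, height, width) array."""
    x = np.asarray(image, dtype=np.float32)
    x = x[:, :, ::-1]                                 # RGB -> BGR
    x = x - np.asarray(mean_bgr, dtype=np.float32)    # subtract the per-channel mean
    return np.ascontiguousarray(x.transpose(2, 0, 1))  # HWC -> CHW


# A pure red 4x5 image: only the R channel is non-zero.
img = np.zeros((4, 5, 3), dtype=np.uint8)
img[..., 0] = 255
out = prepare_sketch(img)
```

After the conversion the red values end up in the last channel plane, since the channel order has been reversed to BGR.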
ResNet50Layers¶

class
chainer.links.
ResNet50Layers
(pretrained_model='auto')[source]¶ A pretrained CNN model with 50 layers provided by MSRA [1].
When you specify the path of the pretrained chainer model serialized as a .npz file in the constructor, this chain model automatically initializes all the parameters with it. This model would be useful when you want to extract a semantic feature vector per image, or fine-tune the model on a different dataset. Note that unlike VGG16Layers, it does not automatically download a pretrained caffemodel. This caffemodel can be downloaded at GitHub.
If you want to manually convert the pretrained caffemodel to a chainer model that can be specified in the constructor, please use the convert_caffemodel_to_npz classmethod instead.
[1] K. He et al., Deep Residual Learning for Image Recognition
Parameters: pretrained_model (str) – the destination of the pretrained chainer model serialized as a .npz file. If this argument is specified as auto, it automatically loads and converts the caffemodel from $CHAINER_DATASET_ROOT/pfnet/chainer/models/ResNet-50-model.caffemodel, where $CHAINER_DATASET_ROOT is set to $HOME/.chainer/dataset unless you specify another value as an environment variable. Note that in this case the converted chainer model is stored in the same directory and automatically reused from the second time onward. If the argument is specified as None, the parameters are not initialized from the pretrained model, but by the default initializer used in the original paper, i.e., chainer.initializers.HeNormal(scale=1.0).
Variables: available_layers (list of str) – The list of available layer names used by __call__ and extract methods.
classmethod
convert_caffemodel_to_npz
(path_caffemodel, path_npz)[source]¶ Converts a pretrained caffemodel to a chainer model.
Parameters:

extract
(images, layers=['pool5'], size=(224, 224), test=True, volatile=OFF)[source]¶ Extracts all the feature maps of given images.
Unlike calling __call__ directly, this method accepts images as input and automatically transforms them into a proper variable. That is, it can be seen as a shortcut that implicitly calls the prepare and __call__ functions.
Parameters:  images (iterable of PIL.Image or numpy.ndarray) – Input images.
 layers (list of str) – The list of layer names you want to extract.
 size (pair of ints) – The resolution of resized images used as an input of CNN. The given images are not resized if this argument is None, but the resolutions of all the images should be the same.
 test (bool) – If True, BatchNormalization runs in test mode.
 volatile (Flag) – Volatility flag used for input variables.
Returns: A dictionary in which the key contains the layer name and the value contains the corresponding feature map variable.
Return type: Dictionary of ~chainer.Variable

predict
(images, oversample=True)[source]¶ Computes all the probabilities of given images.
Parameters:  images (iterable of PIL.Image or numpy.ndarray) – Input images.
 oversample (bool) – If True, it averages results across center, corners, and mirrors. Otherwise, it uses only the center.
Returns: Output that contains the class probabilities of given images.
Return type:


chainer.links.model.vision.resnet.
prepare
(image, size=(224, 224))[source]¶ Converts the given image to the numpy array for ResNets.
Note that you have to call this method before __call__ because the pretrained resnet model requires resizing the given image, converting RGB to BGR, subtracting the mean, and permuting the dimensions before calling.
Parameters:  image (PIL.Image or numpy.ndarray) – Input image. If an input is numpy.ndarray, its shape must be (height, width), (height, width, channels), or (channels, height, width), and the order of the channels must be RGB.
 size (pair of ints) – Size of converted images. If None, the given image is not resized.
Returns: The converted output array.