# Standard Function implementations¶

Chainer provides basic Function implementations in the chainer.functions package.

Non-parameterized functions are provided as plain Python functions. These can be used directly in forward computation without explicit handling of Function objects. On the other hand, parameterized functions should be used with explicit handling of Function objects.

## Learnable connections¶

class chainer.functions.BinaryHierarchicalSoftmax(in_size, tree)[source]

Implementation of hierarchical softmax (HSM).

In natural language applications, the vocabulary size is too large to use the softmax loss. Instead, the hierarchical softmax uses a product of sigmoid functions. It costs only $$O(\log(n))$$ time on average, where $$n$$ is the vocabulary size.

First, a user needs to prepare a binary tree in which each leaf corresponds to a word in the vocabulary. When a word $$x$$ is given, exactly one path exists from the root of the tree to the leaf of the word. Let $$\mbox{path}(x) = ((e_1, b_1), \dots, (e_m, b_m))$$ be the path of $$x$$, where $$e_i$$ is the index of the $$i$$-th internal node, and $$b_i \in \{-1, 1\}$$ indicates the direction to move at the $$i$$-th internal node (-1 is left, and 1 is right). Then, the probability of $$x$$ is given as below:

$\begin{split}P(x) &= \prod_{(e_i, b_i) \in \mbox{path}(x)}P(b_i | e_i) \\ &= \prod_{(e_i, b_i) \in \mbox{path}(x)}\sigma(b_i x^\top w_{e_i}),\end{split}$

where $$\sigma(\cdot)$$ is a sigmoid function, and $$w$$ is a weight matrix.

This function costs $$O(\log(n))$$ time on average, as the average path length is $$O(\log(n))$$, and $$O(n)$$ memory, as the number of internal nodes equals $$n - 1$$.

Parameters:

- in_size (int) – Dimension of input vectors.
- tree – A binary tree made with tuples like ((1, 2), 3).

See: Hierarchical Probabilistic Neural Network Language Model [Morin+, AISTAT2005].
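The product of sigmoids above can be sketched in NumPy (a minimal illustration, not the Chainer implementation; the path, weight matrix, and input vector are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hsm_probability(x, W, path):
    """P(x) as the product of sigmoids along the path ((e_1, b_1), ...)."""
    p = 1.0
    for e, b in path:
        p *= sigmoid(b * x.dot(W[e]))
    return p

x = np.array([0.5, -0.2])           # input vector
W = np.zeros((3, 2))                # weight rows for 3 internal nodes
path = [(0, 1), (2, -1)]            # root -> right, then node 2 -> left
print(hsm_probability(x, W, path))  # each factor is sigmoid(0) = 0.5, so 0.25
```

Only the internal nodes on the path are touched, which is what gives the $$O(\log(n))$$ average cost.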

class chainer.functions.Convolution2D(in_channels, out_channels, ksize, stride=1, pad=0, wscale=1, bias=0, nobias=False, use_cudnn=True)[source]

Two-dimensional convolution function.

The details of this function are described below the argument descriptions.

Parameters:

- in_channels (int) – Number of channels of input arrays.
- out_channels (int) – Number of channels of output arrays.
- ksize (int or (int, int)) – Size of filters (a.k.a. kernels). ksize=k and ksize=(k, k) are equivalent.
- stride (int or (int, int)) – Stride of filter applications. stride=s and stride=(s, s) are equivalent.
- pad (int or (int, int)) – Spatial padding width for input arrays. pad=p and pad=(p, p) are equivalent.
- wscale (float) – Scaling factor of the initial weight.
- bias (float) – Initial bias value.
- nobias (bool) – If True, then this function does not use the bias term.
- use_cudnn (bool) – If True, then this function uses CuDNN if available.

This function holds at most two parameter arrays: W and b, which indicate the filter weight and the bias vector, respectively.

The filter weight has four dimensions $$(c_O, c_I, k_H, k_W)$$ which indicate the number of output channels, the number of input channels, height and width of the kernels, respectively. The filter weight is initialized with i.i.d. Gaussian random samples, each of which has zero mean and deviation $$\sqrt{1/(c_I k_H k_W)}$$ by default. The deviation is scaled by wscale if specified.

The bias vector is of size $$c_O$$. Each element of it is initialized by bias argument. If nobias argument is set to True, then this function does not hold the bias parameter.

The two-dimensional convolution function is defined as follows. Let $$X$$ be the input tensor of dimensions $$(n, c_I, h, w)$$, where $$n$$ is the batch size, and $$(h, w)$$ is spatial size of the input image. Then the Convolution2D function computes correlations between filters and patches of size $$(k_H, k_W)$$ in $$X$$. Note that correlation here is equivalent to the inner product between expanded vectors. Patches are extracted at positions shifted by multiples of stride from the first position -pad for each spatial axis. The right-most (or bottom-most) patches do not run over the padded spatial size.

Let $$(s_Y, s_X)$$ be the stride of filter application, and $$(p_H, p_W)$$ the spatial padding size. Then, the output size $$(h_O, w_O)$$ is determined by the following equations:

$\begin{split}h_O &= (h + 2p_H - k_H) / s_Y + 1,\\ w_O &= (w + 2p_W - k_W) / s_X + 1.\end{split}$
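The output-size equations can be checked with a small helper (a sketch using floor division, mirroring the int-or-pair convention of the ksize/stride/pad arguments; the function name is made up):

```python
def conv2d_output_size(h, w, k, s=1, p=0):
    """Output spatial size of a 2D convolution; k, s, p may each be an
    int or an (int, int) pair, like the ksize/stride/pad arguments."""
    kh, kw = (k, k) if isinstance(k, int) else k
    sy, sx = (s, s) if isinstance(s, int) else s
    ph, pw = (p, p) if isinstance(p, int) else p
    h_o = (h + 2 * ph - kh) // sy + 1
    w_o = (w + 2 * pw - kw) // sx + 1
    return h_o, w_o

print(conv2d_output_size(28, 28, 5, s=1, p=2))  # (28, 28): 'same'-size output
```

For example, a 5x5 kernel with padding 2 and stride 1 preserves the spatial size, while stride 2 roughly halves it.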
class chainer.functions.EmbedID(in_size, out_size)[source]

Efficient linear function for one-hot input.

This is a parameterized function to embed the given discrete identifier (e.g. word) into a continuous vector space. This function just holds embedding vectors for all identifiers as one large matrix W, which is learnable. The identifiers are directly used as indexes of the matrix W.

Parameters:

- in_size (int) – Number of different identifiers (a.k.a. vocabulary size).
- out_size (int) – Size of embedding vector.

Note

This function is non-differentiable with respect to the input identifiers.
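The forward pass of EmbedID amounts to a row lookup into W, which is equivalent to multiplying a one-hot matrix by W. A NumPy sketch (the concrete sizes are made up):

```python
import numpy as np

in_size, out_size = 5, 3            # vocabulary size, embedding size
W = np.arange(in_size * out_size, dtype=np.float32).reshape(in_size, out_size)

ids = np.array([0, 3, 3])           # a batch of identifiers
embedded = W[ids]                   # row lookup: one embedding per identifier

# Equivalent one-hot formulation (much less efficient):
onehot = np.eye(in_size, dtype=np.float32)[ids]
print(embedded.shape)               # (3, 3)
```

The lookup avoids materializing the one-hot vectors, which is why this is described as an efficient linear function for one-hot input.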

class chainer.functions.Linear(in_size, out_size, wscale=1, bias=0, nobias=False)[source]

Implementation of a linear function (a.k.a. fully-connected layer or affine transformation).

This function holds a weight matrix W and a bias vector b.

The weight matrix W has shape (out_size, in_size). This matrix is initialized with i.i.d. Gaussian samples, each of which has zero mean and deviation $$\sqrt{1/\text{in_size}}$$. The deviation is scaled by factor wscale if specified.

The bias vector b is of size out_size. Each element is initialized with the bias value. If nobias argument is set to True, then this function does not hold a bias vector.

Let $$X$$ be an input matrix, and $$W, b$$ the weight matrix and the bias vector, respectively. Then, the output matrix $$Y$$ is computed by $$Y = XW^\top + b$$, where the addition by $$b$$ is broadcasted across the minibatch.

Parameters:

- in_size (int) – Dimension of input vectors.
- out_size (int) – Dimension of output vectors.
- wscale (float) – Scaling factor of the weight matrix.
- bias (float) – Initial bias value.
- nobias (bool) – If True, then this function does not use the bias.

Note

This function accepts an input variable of a non-matrix array. In this case, the leading dimension is treated as the batch dimension, and the other dimensions are reduced to one dimension.
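The computation $$Y = XW^\top + b$$ and the default initialization can be sketched in NumPy (an illustration, not the Chainer implementation; the sizes are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
in_size, out_size, batch = 4, 2, 3

# Default init: zero-mean Gaussian with deviation sqrt(1 / in_size).
W = rng.randn(out_size, in_size) * np.sqrt(1.0 / in_size)
b = np.zeros(out_size)

X = rng.randn(batch, in_size)
Y = X.dot(W.T) + b      # the addition of b broadcasts across the minibatch
print(Y.shape)          # (3, 2)
```

For a non-matrix input, the leading dimension is kept as the batch dimension and the rest is flattened, i.e. an array of shape (batch, 2, 2) is treated as shape (batch, 4).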

class chainer.functions.Parameter(array)[source]

Function that outputs its weight array.

This is a parameterized function that takes no input and returns a variable holding a shallow copy of the parameter array.

Parameters: array – Initial parameter array.

## Array manipulation functions¶

chainer.functions.concat(xs, axis=1)[source]

Concatenates given variables along an axis.

Parameters:

- xs (tuple of Variables) – Variables to be concatenated.
- axis (int) – Axis along which the input arrays are concatenated.

Returns: Output variable.
Return type: Variable
chainer.functions.copy(x, dst)[source]

Copies the input variable onto the specified device.

This function copies the array of the input variable onto the device specified by dst if the original array is on the GPU, and otherwise just copies the array within host memory.

Parameters:

- x (Variable) – Variable to be copied.
- dst – Target device specifier.

Returns: Output variable.
Return type: Variable
chainer.functions.dropout(x, ratio=0.5, train=True)[source]

Drops elements of input variable randomly.

This function drops input elements randomly with probability ratio and scales the remaining elements by factor 1 / (1 - ratio). In testing mode, it does nothing and just returns x.

Parameters:

- x (Variable) – Input variable.
- ratio (float) – Dropout ratio.
- train (bool) – If True, executes dropout. Otherwise, does nothing.

Returns: Output variable.
Return type: Variable

See the paper by G. Hinton: Improving neural networks by preventing co-adaptation of feature detectors.
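The drop-and-rescale behavior can be sketched in NumPy (an illustration, not the Chainer implementation):

```python
import numpy as np

def dropout(x, ratio=0.5, train=True, rng=np.random):
    """Zero out elements with probability `ratio`; scale survivors by
    1 / (1 - ratio) so the expected value is unchanged."""
    if not train:
        return x                             # test mode: identity
    mask = rng.rand(*x.shape) >= ratio       # keep with probability 1 - ratio
    return x * mask / (1.0 - ratio)

x = np.ones((2, 4))
y = dropout(x, ratio=0.5)                    # entries are either 0.0 or 2.0
```

Scaling at training time (inverted dropout) means no extra scaling is needed at test time.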

chainer.functions.identity(*inputs)[source]

Just returns input variables.

chainer.functions.reshape(x, shape)[source]

Reshapes an input variable without copy.

Parameters:

- x (Variable) – Input variable.
- shape (tuple of ints) – Target shape.

Returns: Variable that holds a reshaped version of the input variable.
Return type: Variable

## Activation functions¶

chainer.functions.exp(x)[source]

Elementwise exponential function.

chainer.functions.leaky_relu(x, slope=0.2)[source]

Leaky Rectified Linear Unit function.

This function is expressed as $$f(x) = \max(x, ax)$$, where $$a$$ is a configurable slope value.

Parameters:

- x (Variable) – Input variable.
- slope (float) – Slope value $$a$$.

Returns: Output variable.
Return type: Variable
chainer.functions.log(x)[source]

Elementwise natural logarithm function.

chainer.functions.lstm(c_prev, x)[source]

Long Short-Term Memory units as an activation function.

This function implements LSTM units with forget gates. Let $$c_{\text{prev}}$$ be the previous cell state and $$x$$ the incoming signal. First, the incoming signal $$x$$ is split along the second dimension into four arrays $$a, i, f, o$$ of the same shape. Second, it computes the outputs as:

$\begin{split}c &= \tanh(a) \text{sigmoid}(i) + c_{\text{prev}} \text{sigmoid}(f), \\ h &= \tanh(c) \text{sigmoid}(o).\end{split}$

This function outputs these two arrays as a tuple of two variables.

Parameters:

- c_prev (Variable) – Variable that holds the previous cell state. The cell state should be a zero array or the output of the previous call of LSTM.
- x (Variable) – Variable that holds the incoming signal. Its second dimension must be four times the second dimension of the cell state.

Returns: Two Variable objects c and h. c is the updated cell state. h indicates the outgoing signal.
Return type: tuple

See the original paper proposing LSTM with forget gates: Long Short-Term Memory in Recurrent Neural Networks.
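The two equations can be sketched in NumPy as a single step (an illustration on made-up shapes, not the Chainer implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(c_prev, x):
    """One LSTM step; x packs (a, i, f, o) along the second dimension."""
    a, i, f, o = np.split(x, 4, axis=1)
    c = np.tanh(a) * sigmoid(i) + c_prev * sigmoid(f)
    h = np.tanh(c) * sigmoid(o)
    return c, h

n, units = 2, 3
c0 = np.zeros((n, units))
x = np.zeros((n, 4 * units))    # second dimension is 4x the cell state's
c1, h1 = lstm(c0, x)
print(c1.shape, h1.shape)       # (2, 3) (2, 3)
```

Note that the split order (a, i, f, o) is fixed by the implementation, so the incoming linear layer must produce its outputs in that layout.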

class chainer.functions.PReLU(shape=(), init=0.25)[source]

Parametric ReLU function.

PReLU function is written in elementwise equation as $$PReLU(x) = \max(x, ax)$$, where $$a$$ is a parameter array.

When the PReLU function is combined with two-dimensional convolution, the elements of the parameter $$a$$ are typically shared across the same filter at different pixels. In order to support such usage, the shape of the parameter array may correspond to any leading dimensions of the input arrays, excluding the batch dimension.

Parameters:

- shape (tuple of ints) – Shape of the parameter array.
- init (float) – Initial parameter value.
chainer.functions.relu(x, use_cudnn=True)[source]

Rectified Linear Unit function $$f(x)=\max(0, x)$$.

Parameters:

- x (Variable) – Input variable.
- use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.

Returns: Output variable.
Return type: Variable
chainer.functions.sigmoid(x, use_cudnn=True)[source]

Elementwise sigmoid logistic function $$f(x)=(1 + \exp(-x))^{-1}$$.

Parameters:

- x (Variable) – Input variable.
- use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.

Returns: Output variable.
Return type: Variable
chainer.functions.softmax(x, use_cudnn=True)[source]

Channelwise softmax function.

This function only accepts a two-dimensional input array, and computes its softmax along the second axis. For each index $$i, j$$ of the input matrix $$x$$, it computes $$f_{ij}(x)={\exp(x_{ij}) \over \sum_{j'} \exp(x_{ij'})}$$.

Parameters:

- x (Variable) – Input variable.
- use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.

Returns: Output variable.
Return type: Variable
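The softmax formula can be sketched in NumPy, with the usual max-subtraction for numerical stability (an illustration, not the Chainer implementation):

```python
import numpy as np

def softmax(x):
    """Softmax along the second axis of a 2D array; subtracting the
    row-wise max leaves the result unchanged but avoids overflow."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
p = softmax(x)
print(p.sum())   # sums to 1 (up to rounding)
```
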
chainer.functions.tanh(x, use_cudnn=True)[source]

Elementwise hyperbolic tangent function.

Parameters:

- x (Variable) – Input variable.
- use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.

Returns: Output variable.
Return type: Variable

## Pooling functions¶

chainer.functions.average_pooling_2d(x, ksize, stride=None, pad=0, use_cudnn=True)[source]

Spatial average pooling function.

This function acts similarly to Convolution2D, but instead of computing inner products it computes the average of each input spatial patch for each channel, without any parameters.

Parameters:

- x (Variable) – Input variable.
- ksize (int or (int, int)) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
- stride (int or (int, int) or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
- pad (int or (int, int)) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
- use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.

Returns: Output variable.
Return type: Variable

Note

This function currently does not support the cover_all mode of max_pooling_2d(); average pooling runs in non-cover-all mode.

chainer.functions.max_pooling_2d(x, ksize, stride=None, pad=0, cover_all=True, use_cudnn=True)[source]

Spatial max pooling function.

This function acts similarly to Convolution2D, but instead of computing inner products it computes the maximum of each input spatial patch for each channel, without any parameters.

Parameters:

- x (Variable) – Input variable.
- ksize (int or (int, int)) – Size of pooling window. ksize=k and ksize=(k, k) are equivalent.
- stride (int or (int, int) or None) – Stride of pooling applications. stride=s and stride=(s, s) are equivalent. If None is specified, then it uses the same stride as the pooling window size.
- pad (int or (int, int)) – Spatial padding width for the input array. pad=p and pad=(p, p) are equivalent.
- cover_all (bool) – If True, all spatial locations are pooled into some output pixels. It may make the output size larger.
- use_cudnn (bool) – If True and CuDNN is enabled, then this function uses CuDNN as the core implementation.

Returns: Output variable.
Return type: Variable
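A NumPy sketch of non-overlapping max pooling over a single-channel array (the default stride equals the window size, as above; padding and cover_all are omitted for brevity):

```python
import numpy as np

def max_pooling_2d(x, k):
    """k x k max pooling over an (h, w) array with stride k and no padding.
    Reshapes into (h//k, k, w//k, k) blocks and takes the block maxima."""
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]           # drop rows/cols that don't fit
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pooling_2d(x, 2))   # block maxima: 5, 7, 13, 15
```

Average pooling is the same sketch with .mean(axis=(1, 3)) in place of the max.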

## Normalization functions¶

class chainer.functions.BatchNormalization(size, decay=0.9, eps=1e-05)[source]

Batch normalization on outputs of linear or convolution functions.

Parameters:

- size (int or tuple of ints) – Size (or shape) of channel dimensions.
- decay (float) – Decay rate of moving average.
- eps (float) – Epsilon value for numerical stability.
__call__(x, test=False, finetune=False)[source]

Invokes the forward propagation of BatchNormalization.

BatchNormalization accepts additional arguments, which control three different running modes.

Parameters:

- x (Variable) – An input variable.
- test (bool) – If True, BatchNormalization runs in testing mode; it normalizes the input using precomputed statistics.
- finetune (bool) – If True, BatchNormalization runs in finetuning mode; it accumulates the input array to compute population statistics for normalization, and normalizes the input using batch statistics.

If test and finetune are both False, then BatchNormalization runs in training mode; it computes moving averages of mean and variance for evaluation during training, and normalizes the input using batch statistics.
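The training-mode normalization can be sketched in NumPy (statistics only; the moving averages kept for testing mode are omitted, and gamma/beta stand for a per-channel scale and shift, an assumption of this sketch):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each channel over the batch axis using batch statistics,
    then apply a per-channel scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.RandomState(0).randn(8, 3) * 5 + 2   # batch of 8, 3 channels
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
```

After normalization, each channel of y has approximately zero mean and unit variance over the batch.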

chainer.functions.local_response_normalization(x, n=5, k=2, alpha=0.0001, beta=0.75)[source]

Local response normalization across neighboring channels.

This function implements normalization across channels. Let $$x$$ be an input image with $$N$$ channels. Then, this function computes an output image $$y$$ by the following formula:

$y_i = {x_i \over \left( k + \alpha \sum_{j=\max(1, i - n/2)}^{\min(N, i + n/2)} x_j^2 \right)^\beta}.$
Parameters:

- x (Variable) – Input variable.
- n (int) – Normalization window width.
- k (float) – Smoothing parameter.
- alpha (float) – Normalizer scaling parameter.
- beta (float) – Normalizer power parameter.

Returns: Output variable.
Return type: Variable

See: Sec. 3.3 of ImageNet Classification with Deep Convolutional Neural Networks.
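The formula can be sketched in NumPy for a single spatial position (an illustration; the 0-based window bounds are a translation of the 1-based formula above and are an assumption of this sketch):

```python
import numpy as np

def lrn(x, n=5, k=2, alpha=1e-4, beta=0.75):
    """Local response normalization of a (channels,) vector: each element
    is divided by a power of a sum of squares over neighboring channels."""
    N = len(x)
    y = np.empty_like(x)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N, i + n // 2 + 1)
        y[i] = x[i] / (k + alpha * np.sum(x[lo:hi] ** 2)) ** beta
    return y

y = lrn(np.ones(8))   # every channel is damped, since the denominator > 1
```
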

## Loss, evaluation and aggregation¶

chainer.functions.accuracy(y, t)[source]

Computes multiclass classification accuracy of the minibatch.

Parameters:

- y (Variable) – Variable holding a matrix whose (i, j)-th element indicates the score of the class j at the i-th example.
- t (Variable) – Variable holding an int32 vector of groundtruth labels.

Returns: A variable holding a scalar array of the accuracy.
Return type: Variable

Note

This function is non-differentiable.

chainer.functions.mean_squared_error(x0, x1)[source]

Mean squared error function.

This function computes mean squared error between two variables. The mean is taken over the minibatch. Note that the error is not scaled by 1/2.
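A quick numerical check of the definition (note the absence of a 1/2 factor; in this sketch the mean is taken over every element of the minibatch):

```python
import numpy as np

x0 = np.array([[1.0, 2.0], [3.0, 4.0]])
x1 = np.array([[1.0, 2.0], [3.0, 2.0]])

# Only one element differs, by 2; squared error 4 averaged over 4 elements.
mse = ((x0 - x1) ** 2).mean()
print(mse)   # 1.0
```
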

chainer.functions.softmax_cross_entropy(x, t, use_cudnn=True)[source]

Computes cross entropy loss on softmax of the prediction using the groundtruth label vector.

Parameters:

- x (Variable) – Variable holding a matrix whose (i, j)-th element indicates unnormalized log probability of the class j at the i-th example.
- t (Variable) – Variable holding an int32 vector of groundtruth labels.

Returns: A variable holding a scalar array of the cross entropy loss.
Return type: Variable

Note

This function is differentiable only with respect to x.
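The forward computation can be sketched in NumPy using the log-sum-exp trick (an illustration, not the Chainer implementation; gradients are omitted):

```python
import numpy as np

def softmax_cross_entropy(x, t):
    """Mean cross entropy of softmax(x) against integer labels t.
    log softmax is computed as x - logsumexp(x) for stability."""
    m = x.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(x - m).sum(axis=1, keepdims=True))
    log_p = x - log_z                        # log softmax
    return -log_p[np.arange(len(t)), t].mean()

x = np.zeros((2, 3))                         # uniform scores over 3 classes
t = np.array([0, 2])
print(softmax_cross_entropy(x, t))           # ≈ log(3) ≈ 1.0986
```

Computing log softmax directly from the unnormalized log probabilities avoids the underflow that taking log(softmax(x)) in two steps can cause.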

chainer.functions.sum(x)[source]

Computes sum of all elements.

## Reusable subnetwork of complex architectures¶

class chainer.functions.Inception(in_channels, out1, proj3, out3, proj5, out5, proj_pool)[source]

Inception module of GoogLeNet.

It applies four different functions to the input array and concatenates their outputs along the channel dimension. Three of them are 2D convolutions of sizes 1x1, 3x3 and 5x5. Convolution paths of 3x3 and 5x5 sizes have 1x1 convolutions (called projections) ahead of them. The other path consists of 1x1 convolution (projection) and 3x3 max pooling.

The output array has the same spatial size as the input. In order to satisfy this, the Inception module uses appropriate padding for each convolution and pooling.

Parameters:

- in_channels (int) – Number of channels of input arrays.
- out1 (int) – Output size of 1x1 convolution path.
- proj3 (int) – Projection size of 3x3 convolution path.
- out3 (int) – Output size of 3x3 convolution path.
- proj5 (int) – Projection size of 5x5 convolution path.
- out5 (int) – Output size of 5x5 convolution path.
- proj_pool (int) – Projection size of max pooling path.

Returns: Output variable. Its array has the same spatial size and the same minibatch size as the input array. The channel dimension has size out1 + out3 + out5 + proj_pool.
Return type: Variable

Note

This function inserts the full computation graph of the Inception module behind the input array. This function itself is not inserted into the computation graph.