Optimizer¶

class
chainer.
Optimizer
[source]¶ Base class of all numerical optimizers.
This class provides basic features for all optimization methods. It optimizes parameters of a target link. The target link is registered via the
setup()
method, and then theupdate()
method updates its parameters based on a given loss function.Each optimizer implementation must be defined as a child class of Optimizer. It must override
update()
method. An optimizer can use internal states each of which is tied to one of the parameters. State is a dictionary of serializable values (typically arrays of size same as the corresponding parameters). In order to use state dictionaries, the optimizer must overrideinit_state()
method (or its CPU/GPU versions,init_state_cpu()
andinit_state_gpu()
).If the optimizer is based on single gradient computation (like most firstorder methods), then it should inherit
GradientMethod
, which adds some features dedicated for the first order methods.Optimizer instance also supports hook functions. Hook function is registered by the
add_hook()
method. Each hook function is called in registration order in advance of the actual parameter update.Variables:  target – Target link object. It is set by the
setup()
method.  t – Number of update steps. It must be incremented by the
update()
method.  epoch – Current epoch. It is incremented by the
new_epoch()
method.

accumulate_grads
(grads)[source]¶ Accumulates gradients from other source.
This method just adds given gradient arrays to gradients that this optimizer holds. It is typically used in dataparallel optimization, where gradients for different shards are computed in parallel and aggregated by this method. This method correctly treats multiple GPU devices.
Parameters: grads (Iterable) – Iterable of gradient arrays to be accumulated. Deprecated since version v1.5: Use the
chainer.Link.addgrads()
method of the target link instead.

add_hook
(hook, name=None)[source]¶ Registers a hook function.
Hook function is typically called right after the gradient computation, though the timing depends on the optimization method.
Parameters:

clip_grads
(maxnorm)[source]¶ Clips the norm of whole gradients up to the threshold.
Parameters: maxnorm (float) – Threshold of gradient L2 norm. Deprecated since version v1.5: Use the
GradientClipping
hook function instead.

compute_grads_norm
()[source]¶ Computes the norm of whole gradients.
Returns: L2 norm of whole gradients, i.e. square root of sum of square of all gradient elements. Return type: float Warning
This method returns a CPUcomputed value, which means that this method synchronizes between CPU and GPU if at least one of the gradients reside on the GPU.
Deprecated since version v1.5.

init_state
(param, state)[source]¶ Initializes the optimizer state corresponding to the parameter.
This method should add needed items to the
state
dictionary. Each optimizer implementation that uses its own states should override this method or CPU/GPU dedicated versions (init_state_cpu()
andinit_state_gpu()
).Parameters: See also

init_state_cpu
(param, state)[source]¶ Initializes the optimizer state on CPU.
This method is called from
init_state()
by default.Parameters:  param (Variable) – Parameter variable. Its data array is
of type
numpy.ndarray
.  state (dict) – State dictionary.
See also
 param (Variable) – Parameter variable. Its data array is
of type

init_state_gpu
(param, state)[source]¶ Initializes the optimizer state on GPU.
This method is called from
init_state()
by default.Parameters:  param (Variable) – Parameter variable. Its data array is
of type
cupy.ndarray
.  state (dict) – State dictionary.
See also
 param (Variable) – Parameter variable. Its data array is
of type

new_epoch
()[source]¶ Starts a new epoch.
This method increments the
epoch
count. Note that if the optimizer depends on the epoch count, then user should call this method appropriately at the beginning of each epoch.

prepare
()[source]¶ Prepares for an update.
This method initializes missing optimizer states (e.g. for newly added parameters after the set up), and copies arrays in each state dictionary to CPU or GPU according to the corresponding parameter array.

remove_hook
(name)[source]¶ Removes a hook function.
Parameters: name (str) – Registered name of the hook function to remove.

serialize
(serializer)[source]¶ Serializes or deserializes the optimizer.
It only saves or loads the following things:
 Optimizer states
 Global states (
t
andepoch
)
It does not saves nor loads the parameters of the target link. They should be separately saved or loaded.
Parameters: serializer (AbstractSerializer) – Serializer or deserializer object.

setup
(link)[source]¶ Sets a target link and initializes the optimizer states.
Given link is set to the
target
attribute. It also prepares the optimizer state dictionaries corresponding to all parameters in the link hierarchy. The existing states are discarded.Parameters: link (Link) – Target link object.

update
(lossfun=None, *args, **kwds)[source]¶ Updates the parameters and optimizer states.
This method updates the parameters of the target link and corresponding optimizer states. The behavior of this method is different for the cases either
lossfun
is given or not.If
lossfun
is given, then this method initializes the gradients by zeros, calls it with given extra arguments, and calls thebackward()
method of its output to compute the gradients. The implementation might calllossfun
more than once.If
lossfun
is not given, then this method assumes that the gradients of all parameters are already computed. An implementation that requires multiple gradient computations might raise an error on this case.In both cases, this method invokes the update procedure for all parameters.
Parameters:  lossfun (function) – Loss function. It accepts arbitrary arguments
and returns one
Variable
object that represents the loss (or objective) value. This argument can be omitted for single gradientbased methods. In this case, this method assumes gradient arrays computed.  kwds (args,) – Arguments for the loss function.
 lossfun (function) – Loss function. It accepts arbitrary arguments
and returns one

weight_decay
(decay)[source]¶ Applies weight decay to the parameter/gradient pairs.
Parameters: decay (float) – Coefficient of weight decay. Deprecated since version v1.5: Use the
WeightDecay
hook function instead.

zero_grads
()[source]¶ Fills all gradient arrays by zeros.
Deprecated since version v1.5: Use the
chainer.Link.cleargrads()
method for the target link instead.
 target – Target link object. It is set by the

class
chainer.
GradientMethod
[source]¶ Base class of all single gradientbased optimizers.
This is an extension of the
Optimizer
class. Typical gradient methods that just require the gradient at the current parameter vector on an update can be implemented as its child class.An implementation of a gradient method must override the following methods:
init_state()
or bothinit_state_cpu()
andinit_state_gpu()
update_one()
or bothupdate_one_cpu()
andupdate_one_gpu()
Note
It is recommended to call
use_cleargrads()
after creating aGradientMethod
object for efficiency.
reallocate_cleared_grads
()[source]¶ Reallocate gradients cleared by
cleargrad()
.This method allocates arrays for all gradients which have
None
. This method is called before and after every optimizer hook. If an inheriting optimizer does not require this allocation, the optimizer can override this method with a blank function.

update
(lossfun=None, *args, **kwds)[source]¶ Updates parameters based on a loss function or computed gradients.
This method runs in two ways.
 If
lossfun
is given, then use it as a loss function to compute gradients.  Otherwise, this method assumes that the gradients are already computed.
In both cases, the computed gradients are used to update parameters. The actual update routines are defined by the
update_one()
method (or its CPU/GPU versions,update_one_cpu()
andupdate_one_gpu()
). If

update_one
(param, state)[source]¶ Updates a parameter based on the corresponding gradient and state.
This method calls appropriate one from
update_param_cpu()
orupdate_param_gpu()
.Parameters:

use_cleargrads
(use=True)[source]¶ Enables or disables use of
cleargrads()
in update.Parameters: use (bool) – If True
, this function enables use of cleargrads. IfFalse
, disables use of cleargrads (zerograds is used).Note
Note that
update()
callszerograds()
by default for backward compatibility. It is recommended to call this method before first call of update because cleargrads is more efficient than zerograds.
Hook functions¶

class
chainer.optimizer.
WeightDecay
(rate)[source]¶ Optimizer hook function for weight decay regularization.
This hook function adds a scaled parameter to the corresponding gradient. It can be used as a regularization.
Parameters: rate (float) – Coefficient for the weight decay. Variables: rate (float) – Coefficient for the weight decay.

class
chainer.optimizer.
Lasso
(rate)[source]¶ Optimizer hook function for Lasso regularization.
This hook function adds a scaled parameter to the sign of each weight. It can be used as a regularization.
Parameters: rate (float) – Coefficient for the weight decay. Variables: rate (float) – Coefficient for the weight decay.

class
chainer.optimizer.
GradientClipping
(threshold)[source]¶ Optimizer hook function for gradient clipping.
This hook function scales all gradient arrays to fit to the defined L2 norm threshold.
Parameters: threshold (float) – L2 norm threshold. Variables: threshold (float) – L2 norm threshold of gradient norm.

class
chainer.optimizer.
GradientNoise
(eta, noise_func=<function exponential_decay_noise>)[source]¶ Optimizer hook function for adding gradient noise.
This hook function simply adds noise generated by the
noise_func
to the gradient. By default it adds timedependent annealed Gaussian noise to the gradient at every training step:\[g_t \leftarrow g_t + N(0, \sigma_t^2)\]where
\[\sigma_t^2 = \frac{\eta}{(1+t)^\gamma}\]with \(\eta\) selected from {0.01, 0.3, 1.0} and \(\gamma = 0.55\).
Parameters:  eta (float) – Parameter that defines the scale of the noise, which for the default noise function is recommended to be either 0.01, 0.3 or 1.0.
 noise_func (function) – Noise generating function which by default is given by Adding Gradient Noise Improves Learning for Very Deep Networks.