Optimizer

class chainer.Optimizer
Base class of all numerical optimizers.
This class provides basic features for all optimization methods. It optimizes the parameters of a target link. The target link is registered via the setup() method, and then the update() method updates its parameters based on a given loss function.
Each optimizer implementation must be defined as a child class of Optimizer. It must override the update() method.
If the optimizer is based on a single gradient computation (like most first-order methods), then it should inherit from GradientMethod, which adds some features dedicated to first-order methods, including support for UpdateRule.
An Optimizer instance also supports hook functions. A hook function is registered by the add_hook() method. Each hook function is called in registration order before the actual parameter update. If the hook function has an attribute call_for_each_param and its value is True, the hook function is used as a hook function of all update rules (i.e., it is invoked for every parameter by passing the corresponding update rule and the parameter).
Variables:
 target – Target link object. It is set by the setup() method.
 t – Number of update steps. It must be incremented by the update() method.
 epoch – Current epoch. It is incremented by the new_epoch() method.

add_hook(hook, name=None)
Registers a hook function.
A hook function is typically called right after the gradient computation, though the timing depends on the optimization method.
Parameters:
 hook (function) – Hook function. If hook.call_for_each_param is true, this hook function is called for each parameter by passing the update rule and the parameter. Otherwise, this hook function is called only once per iteration by passing the optimizer.
 name (str) – Name of the registration. If omitted, hook.name is used by default.
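The dispatch rule described for hook functions can be sketched with a small standalone toy. ToyOptimizer, PerParamHook and once_hook are hypothetical names invented for this illustration; this is not the Chainer implementation, only a minimal model of registration order, the hook.name fallback, and the call_for_each_param switch.

```python
# Standalone toy model of hook dispatch; ToyOptimizer is a hypothetical
# stand-in and does not reproduce chainer.Optimizer.
class ToyOptimizer:
    def __init__(self, rules_and_params):
        # Each entry is a (update_rule, parameter) pair; plain strings here.
        self.rules_and_params = rules_and_params
        self._hooks = {}  # dicts preserve insertion order (Python 3.7+)

    def add_hook(self, hook, name=None):
        if name is None:
            name = hook.name  # fall back to the hook's own name attribute
        self._hooks[name] = hook

    def remove_hook(self, name):
        del self._hooks[name]

    def call_hooks(self):
        # Hooks run in registration order, before the parameter update.
        for hook in self._hooks.values():
            if getattr(hook, 'call_for_each_param', False):
                for rule, param in self.rules_and_params:
                    hook(rule, param)  # once per parameter
            else:
                hook(self)  # once per iteration, receiving the optimizer


calls = []


class PerParamHook:
    name = 'per_param'
    call_for_each_param = True

    def __call__(self, rule, param):
        calls.append(('per_param', param))


def once_hook(opt):
    calls.append(('once', None))


opt = ToyOptimizer([('rule_w', 'w'), ('rule_b', 'b')])
opt.add_hook(once_hook, name='once')
opt.add_hook(PerParamHook())  # name taken from PerParamHook.name
opt.call_hooks()
```

After call_hooks(), the calls list records one entry for once_hook followed by one entry per parameter for PerParamHook, which mirrors the ordering rule stated above.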

new_epoch()
Starts a new epoch.
This method increments the epoch count. Note that if the optimizer depends on the epoch count, the user should call this method at the beginning of each epoch.

remove_hook(name)
Removes a hook function.
Parameters: name (str) – Registered name of the hook function to remove.

serialize(serializer)
Serializes or deserializes the optimizer.
It only saves or loads the following things:
 Optimizer states
 Global states (t and epoch)
It neither saves nor loads the parameters of the target link. They should be saved or loaded separately.
Parameters: serializer (AbstractSerializer) – Serializer or deserializer object.

setup(link)
Sets a target link and initializes the optimizer states.
The given link is set to the target attribute. This method also prepares the optimizer state dictionaries corresponding to all parameters in the link hierarchy. The existing states are discarded.
Parameters: link (Link) – Target link object.

update(lossfun=None, *args, **kwds)
Updates the parameters.
This method updates the parameters of the target link. Its behavior depends on whether lossfun is given.
If lossfun is given, this method typically clears the gradients, calls the loss function with the given extra arguments, and calls the backward() method of its output to compute the gradients. The actual implementation might call lossfun more than once.
If lossfun is not given, then this method assumes that the gradients of all parameters are already computed. An implementation that requires multiple gradient computations might raise an error in this case.
In both cases, this method invokes the update procedure for all parameters.
Parameters:
 lossfun (function) – Loss function. It accepts arbitrary arguments and returns one Variable object that represents the loss (or objective) value. This argument can be omitted for single gradient-based methods. In this case, this method assumes that the gradient arrays are already computed.
 args, kwds – Arguments for the loss function.

class chainer.GradientMethod
Base class of all single gradient-based optimizers.
This is an extension of the Optimizer class. Typical gradient methods that only require the gradient at the current parameter vector on each update can be implemented as its child class.
This class uses UpdateRule to manage the update rule of each parameter. A child class of GradientMethod should override create_update_rule() to create the default update rule of each parameter.
This class also provides hyperparam, which is the hyperparameter used as the default configuration of each update rule. All built-in gradient method implementations also provide proxy properties that act as aliases to the attributes of hyperparam. It is recommended to provide such an alias for each attribute. This can be done by adding just one line per attribute using HyperparameterProxy.
Variables: hyperparam (Hyperparameter) – The hyperparameter of the gradient method. It is used as the default configuration of each update rule (i.e., the hyperparameter of each update rule refers to this hyperparameter as its parent).

create_update_rule()
Creates a new update rule object.
This method creates an update rule object. It is called by setup() to set up an update rule for each parameter. Each implementation of the gradient method should override this method to provide the default update rule implementation.
Returns: Update rule object.
Return type: UpdateRule

reallocate_cleared_grads()
Reallocates gradients cleared by cleargrad().
This method allocates arrays for all gradients that are None. It is called before and after every optimizer hook. If an inheriting optimizer does not require this allocation, it can override this method with a blank function.

update(lossfun=None, *args, **kwds)
Updates parameters based on a loss function or computed gradients.
This method runs in one of two ways.
 If lossfun is given, then it is used as a loss function to compute gradients.
 Otherwise, this method assumes that the gradients are already computed.
In both cases, the computed gradients are used to update parameters. The actual update routines are defined by the update rule of each parameter.
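The two modes of update() can be illustrated with a standalone toy. ToyParam, ToyLoss and toy_update are hypothetical names, and the plain gradient-descent step with a fixed learning rate lr is an assumption standing in for whatever update rule each parameter actually has; none of this is Chainer code.

```python
# Standalone toy of the two update modes; not the Chainer implementation.
class ToyParam:
    def __init__(self, value):
        self.value = value
        self.grad = None


class ToyLoss:
    """Squared distance of one parameter from a target value."""

    def __init__(self, param, target):
        self.param = param
        self.target = target
        self.value = (param.value - target) ** 2

    def backward(self):
        # d/dw (w - target)^2 = 2 * (w - target)
        self.param.grad = 2.0 * (self.param.value - self.target)


def toy_update(params, lr, lossfun=None, *args, **kwds):
    if lossfun is not None:
        # Mode 1: clear gradients, evaluate the loss, backpropagate.
        for p in params:
            p.grad = None
        loss = lossfun(*args, **kwds)
        loss.backward()
    # Mode 2 (and the tail of mode 1): gradients are assumed to be present.
    for p in params:
        if p.grad is not None:
            p.value -= lr * p.grad  # assumed rule: plain gradient descent


w = ToyParam(1.0)

# Mode 1: pass a loss function plus its extra argument (the target).
toy_update([w], 0.25, lambda target: ToyLoss(w, target), 3.0)
value_after_lossfun = w.value

# Mode 2: compute the gradient by hand, then call without lossfun.
w.grad = 2.0 * (w.value - 3.0)
toy_update([w], 0.25)
value_after_manual = w.value
```

In both calls the same final loop applies the step; only the origin of the gradient differs.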

use_cleargrads(use=True)
Enables or disables use of cleargrads() in update.
Parameters: use (bool) – If True, this function enables use of cleargrads. If False, it disables use of cleargrads (zerograds is used instead).
Deprecated since version v2.0: Note that update() calls cleargrads() by default. cleargrads() is more efficient than zerograds(), so one does not have to call use_cleargrads(). This method remains for backward compatibility.

Hook functions

class chainer.optimizer.WeightDecay(rate)
Optimizer/UpdateRule hook function for weight decay regularization.
This hook function adds a scaled parameter to the corresponding gradient. It can be used as a regularization.
Parameters: rate (float) – Coefficient for the weight decay.
Variables: rate (float) – Coefficient for the weight decay.
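As a minimal sketch of the arithmetic, assuming parameters and gradients are plain lists of floats, weight decay computes g + rate * p for each entry. The function name weight_decay is hypothetical, and the sketch returns a new list, whereas a real hook would be expected to modify gradient arrays in place.

```python
import math


def weight_decay(params, grads, rate):
    """Return g + rate * p for each (parameter, gradient) pair."""
    return [g + rate * p for p, g in zip(params, grads)]


params = [1.0, -2.0, 0.5]
grads = [0.1, 0.1, 0.1]
decayed = weight_decay(params, grads, rate=0.5)
# Large-magnitude parameters receive a larger pull toward zero.
```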

class chainer.optimizer.Lasso(rate)
Optimizer/UpdateRule hook function for Lasso regularization.
This hook function adds the scaled sign of each weight to the corresponding gradient. It can be used as a regularization.
Parameters: rate (float) – Coefficient for the weight decay.
Variables: rate (float) – Coefficient for the weight decay.
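Read as an L1 penalty, Lasso regularization contributes rate * sign(w) to the gradient of each weight w. The sketch below assumes that reading and uses plain lists of floats; the names sign and lasso are hypothetical and this is not the library code.

```python
import math


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lasso(params, grads, rate):
    """Return g + rate * sign(p) for each (parameter, gradient) pair."""
    return [g + rate * sign(p) for p, g in zip(params, grads)]


params = [1.0, -2.0, 0.0]
grads = [0.1, 0.1, 0.1]
penalized = lasso(params, grads, rate=0.5)
# Unlike weight decay, the added term has the same magnitude for every
# nonzero weight; a weight that is exactly zero is left untouched.
```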

class chainer.optimizer.GradientClipping(threshold)
Optimizer hook function for gradient clipping.
This hook function scales all gradient arrays so that their L2 norm fits within the given threshold.
Parameters: threshold (float) – L2 norm threshold.
Variables: threshold (float) – Threshold on the L2 norm of the gradients.
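A minimal sketch of norm-based clipping follows, assuming the L2 norm is taken jointly over all gradient arrays and that gradients already within the threshold are left unchanged. The function name clip_gradients and the list-of-lists representation are hypothetical choices for illustration.

```python
import math


def clip_gradients(grad_arrays, threshold):
    """Scale all gradient arrays by one common factor so that their joint
    L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for arr in grad_arrays for g in arr))
    if norm == 0.0 or norm <= threshold:
        # Nothing to clip; return copies unchanged.
        return [list(arr) for arr in grad_arrays]
    rate = threshold / norm
    return [[g * rate for g in arr] for arr in grad_arrays]


grad_arrays = [[3.0, 0.0], [0.0, 4.0]]  # joint L2 norm is 5
clipped = clip_gradients(grad_arrays, threshold=1.0)
clipped_norm = math.sqrt(sum(g * g for arr in clipped for g in arr))
```

Because a single factor is applied to every array, the direction of the overall gradient is preserved and only its length changes.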

class chainer.optimizer.GradientNoise(eta, noise_func=<function exponential_decay_noise>)
Optimizer/UpdateRule hook function for adding gradient noise.
This hook function simply adds noise generated by noise_func to the gradient. By default it adds time-dependent annealed Gaussian noise to the gradient at every training step:
\[g_t \leftarrow g_t + N(0, \sigma_t^2)\]
where
\[\sigma_t^2 = \frac{\eta}{(1+t)^\gamma}\]
with \(\eta\) selected from {0.01, 0.3, 1.0} and \(\gamma = 0.55\).
Parameters:
 eta (float) – Parameter that defines the scale of the noise, which for the default noise function is recommended to be either 0.01, 0.3 or 1.0.
 noise_func (function) – Noise generating function, which by default is the one proposed in Adding Gradient Noise Improves Learning for Very Deep Networks.
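The annealing schedule above can be sketched with the standard library alone. The names noise_std and add_gradient_noise are hypothetical, gradients are plain lists of floats, and the step counter t is assumed to start at 0; this illustrates the formula and is not the code behind exponential_decay_noise.

```python
import math
import random


def noise_std(eta, t, gamma=0.55):
    """Standard deviation sigma_t, where sigma_t^2 = eta / (1 + t)^gamma."""
    return math.sqrt(eta / (1 + t) ** gamma)


def add_gradient_noise(grads, eta, t, rng):
    """Return g + N(0, sigma_t^2) for each gradient entry."""
    std = noise_std(eta, t)
    return [g + rng.gauss(0.0, std) for g in grads]


rng = random.Random(0)  # seeded only to make the sketch repeatable
grads = [0.5, -0.25, 0.0]
noisy_early = add_gradient_noise(grads, eta=0.3, t=0, rng=rng)
noisy_late = add_gradient_noise(grads, eta=0.3, t=10000, rng=rng)
# noise_std shrinks as t grows, so late updates are perturbed less on average.
```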