Optimizer

class chainer.Optimizer[source]

Base class of all numerical optimizers.

This class provides basic features for all optimization methods. It optimizes parameters of a target link. The target link is registered via the setup() method, and then the update() method updates its parameters based on a given loss function.

Each optimizer implementation must be defined as a child class of Optimizer. It must override the update() method. An optimizer can use internal states, each of which is tied to one of the parameters. A state is a dictionary of serializable values (typically arrays of the same size as the corresponding parameter). In order to use state dictionaries, the optimizer must override the init_state() method (or its CPU/GPU versions, init_state_cpu() and init_state_gpu()).

If the optimizer is based on a single gradient computation (as most first-order methods are), then it should inherit from GradientMethod, which adds some features dedicated to first-order methods.

An Optimizer instance also supports hook functions. A hook function is registered by the add_hook() method. Each hook function is called in registration order before the actual parameter update.

Variables:
  • target – Target link object. It is set by the setup() method.
  • t – Number of update steps. It must be incremented by the update() method.
  • epoch – Current epoch. It is incremented by the new_epoch() method.
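
For example, a typical training step with a concrete optimizer such as chainer.optimizers.SGD looks like the following sketch (the model definition, mini-batch arrays, and hyperparameter values below are illustrative placeholders, not part of this API):

    import numpy as np

    import chainer
    import chainer.functions as F
    import chainer.links as L
    from chainer import optimizers

    # A toy target link and a random mini-batch (placeholders).
    model = L.Linear(784, 10)
    x = np.random.rand(32, 784).astype(np.float32)
    t = np.random.randint(0, 10, size=32).astype(np.int32)

    optimizer = optimizers.SGD(lr=0.01)
    optimizer.setup(model)  # registers the target link

    def lossfun(x, t):
        y = model(x)
        return F.softmax_cross_entropy(y, t)

    # One update step: gradients are computed from lossfun and applied.
    optimizer.update(lossfun, x, t)
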
accumulate_grads(grads)[source]

Accumulates gradients from other sources.

This method just adds the given gradient arrays to the gradients that this optimizer holds. It is typically used in data-parallel optimization, where gradients for different shards are computed in parallel and aggregated by this method. This method correctly handles multiple GPU devices.

Parameters:grads (Iterable) – Iterable of gradient arrays to be accumulated.

Deprecated since version v1.5: Use the chainer.Link.addgrads() method of the target link instead.

add_hook(hook, name=None)[source]

Registers a hook function.

The hook function is typically called right after the gradient computation, though the exact timing depends on the optimization method.

Parameters:
  • hook (function) – Hook function. It accepts the optimizer object.
  • name (str) – Name of the registration. If omitted, hook.name is used by default.
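
For instance, a built-in hook such as WeightDecay, or any callable that accepts the optimizer object, can be registered as follows (a sketch; the coefficient and the custom hook are illustrative, and optimizer is assumed to be already set up):

    import chainer

    # Built-in hook function for weight decay regularization.
    optimizer.add_hook(chainer.optimizer.WeightDecay(0.0005), name='weight_decay')

    # A custom hook is any callable that accepts the optimizer object.
    def report_step(opt):
        print('update step:', opt.t)

    optimizer.add_hook(report_step, name='report_step')
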
call_hooks()[source]

Invokes hook functions in registration order.

clip_grads(maxnorm)[source]

Clips the norm of the whole gradient vector to the given threshold.

Parameters:maxnorm (float) – Threshold of gradient L2 norm.

Deprecated since version v1.5: Use the GradientClipping hook function instead.

compute_grads_norm()[source]

Computes the norm of whole gradients.

Returns:L2 norm of the whole gradient, i.e. the square root of the sum of squares of all gradient elements.
Return type:float

Warning

This method returns a CPU-computed value, which means that this method synchronizes between CPU and GPU if at least one of the gradients resides on the GPU.

Deprecated since version v1.5.

init_state(param, state)[source]

Initializes the optimizer state corresponding to the parameter.

This method should add needed items to the state dictionary. Each optimizer implementation that uses its own states should override this method or CPU/GPU dedicated versions (init_state_cpu() and init_state_gpu()).

Parameters:
  • param (Variable) – Parameter variable.
  • state (dict) – State dictionary.
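
For example, a momentum-style optimizer could allocate a velocity array per parameter by overriding this method. The following sketch uses a hypothetical class name; the state array is allocated with NumPy or CuPy depending on where the parameter lives:

    import chainer
    from chainer import cuda

    class MyMomentumSGD(chainer.GradientMethod):

        def init_state(self, param, state):
            # Allocate one state array ('v') with the same shape, dtype and
            # array module (NumPy or CuPy) as the parameter it belongs to.
            xp = cuda.get_array_module(param.data)
            state['v'] = xp.zeros_like(param.data)
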
init_state_cpu(param, state)[source]

Initializes the optimizer state on CPU.

This method is called from init_state() by default.

Parameters:
  • param (Variable) – Parameter variable.
  • state (dict) – State dictionary.

See also

init_state()

init_state_gpu(param, state)[source]

Initializes the optimizer state on GPU.

This method is called from init_state() by default.

Parameters:
  • param (Variable) – Parameter variable.
  • state (dict) – State dictionary.

See also

init_state()

new_epoch()[source]

Starts a new epoch.

This method increments the epoch count. Note that if the optimizer depends on the epoch count, then the user should call this method at the beginning of each epoch.
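
A minimal sketch of such a loop (optimizer, lossfun, and train_batches are placeholders):

    for epoch in range(20):
        optimizer.new_epoch()  # advance the epoch counter once per pass
        for x, t in train_batches:
            optimizer.update(lossfun, x, t)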

prepare()[source]

Prepares for an update.

This method initializes missing optimizer states (e.g. for parameters added after setup), and copies the arrays in each state dictionary to CPU or GPU according to the device of the corresponding parameter array.

remove_hook(name)[source]

Removes a hook function.

Parameters:name (str) – Registered name of the hook function to remove.
serialize(serializer)[source]

Serializes or deserializes the optimizer.

It only saves or loads the following things:

  • Optimizer states
  • Global states (t and epoch)

It does not save or load the parameters of the target link. They should be saved or loaded separately.

Parameters:serializer (AbstractSerializer) – Serializer or deserializer object.
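
In practice, serialization is usually driven through chainer.serializers, saving the target link and the optimizer separately (a sketch; the file names are illustrative):

    from chainer import serializers

    # Save the model parameters and the optimizer states to separate files.
    serializers.save_npz('model.npz', model)
    serializers.save_npz('optimizer.npz', optimizer)

    # Later, restore both before resuming training.
    serializers.load_npz('model.npz', model)
    serializers.load_npz('optimizer.npz', optimizer)
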
setup(link)[source]

Sets a target link and initializes the optimizer states.

The given link is set to the target attribute. This method also prepares the optimizer state dictionaries corresponding to all parameters in the link hierarchy. Any existing states are discarded.

Parameters:link (Link) – Target link object.
update(lossfun=None, *args, **kwds)[source]

Updates the parameters and optimizer states.

This method updates the parameters of the target link and the corresponding optimizer states. The behavior of this method differs depending on whether lossfun is given or not.

If lossfun is given, this method initializes the gradients to zero, calls the loss function with the given extra arguments, and calls the backward() method of its output to compute the gradients. The implementation might call lossfun more than once.

If lossfun is not given, this method assumes that the gradients of all parameters are already computed. An implementation that requires multiple gradient computations might raise an error in this case.

In both cases, this method invokes the update procedure for all parameters.

Parameters:
  • lossfun (function) – Loss function. It accepts arbitrary arguments and returns one Variable object that represents the loss (or objective) value. This argument can be omitted for single-gradient-based methods; in that case, this method assumes that the gradient arrays are already computed.
  • args, kwds – Arguments for the loss function.
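
Both calling styles are sketched below, reusing the placeholder names from the earlier sketch; the second form is useful when gradients are computed or accumulated manually:

    # Style 1: pass a loss function; gradients are zeroed, computed and applied.
    optimizer.update(lossfun, x, t)

    # Style 2: compute the gradients yourself, then call update() with no arguments.
    model.cleargrads()
    loss = lossfun(x, t)
    loss.backward()
    optimizer.update()
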
weight_decay(decay)[source]

Applies weight decay to the parameter/gradient pairs.

Parameters:decay (float) – Coefficient of weight decay.

Deprecated since version v1.5: Use the WeightDecay hook function instead.

zero_grads()[source]

Fills all gradient arrays with zeros.

Deprecated since version v1.5: Use the chainer.Link.cleargrads() method for the target link instead.

class chainer.GradientMethod[source]

Base class of all single gradient-based optimizers.

This is an extension of the Optimizer class. Typical gradient methods that require only the gradient at the current parameter vector to perform an update can be implemented as child classes of this class.

An implementation of a gradient method must override the update_one() method or its CPU/GPU versions, update_one_cpu() and update_one_gpu().

Note

It is recommended to call use_cleargrads() after creating a GradientMethod object for efficiency.

call_hooks()[source]

Invokes hook functions in registration order.

reallocate_cleared_grads()[source]

Reallocates gradients cleared by cleargrad().

This method allocates arrays for all gradients that are None. It is called before and after every optimizer hook. If an inheriting optimizer does not require this allocation, it can override this method with an empty function.

update(lossfun=None, *args, **kwds)[source]

Updates parameters based on a loss function or computed gradients.

This method runs in two ways.

  • If lossfun is given, then use it as a loss function to compute gradients.
  • Otherwise, this method assumes that the gradients are already computed.

In both cases, the computed gradients are used to update parameters. The actual update routines are defined by the update_one() method (or its CPU/GPU versions, update_one_cpu() and update_one_gpu()).

update_one(param, state)[source]

Updates a parameter based on the corresponding gradient and state.

This method calls the appropriate one of update_one_cpu() and update_one_gpu().

Parameters:
  • param (Variable) – Parameter variable.
  • state (dict) – State dictionary.
update_one_cpu(param, state)[source]

Updates a parameter on CPU.

Parameters:
  • param (Variable) – Parameter variable.
  • state (dict) – State dictionary.
update_one_gpu(param, state)[source]

Updates a parameter on GPU.

Parameters:
  • param (Variable) – Parameter variable.
  • state (dict) – State dictionary.
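
Putting these pieces together, a plain SGD-like gradient method could be sketched as follows (MySGD is a hypothetical name; the GPU branch uses an element-wise kernel via chainer.cuda.elementwise):

    import chainer
    from chainer import cuda

    class MySGD(chainer.GradientMethod):

        def __init__(self, lr=0.01):
            super(MySGD, self).__init__()
            self.lr = lr

        def update_one_cpu(self, param, state):
            # Plain gradient descent step on a NumPy array.
            param.data -= self.lr * param.grad

        def update_one_gpu(self, param, state):
            # Element-wise CUDA kernel for parameters living on the GPU.
            cuda.elementwise(
                'T grad, T lr', 'T param',
                'param -= lr * grad',
                'my_sgd')(param.grad, self.lr, param.data)
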
use_cleargrads(use=True)[source]

Enables or disables the use of cleargrads() in update().

Parameters:use (bool) – If True, this function enables use of cleargrads(); if False, it disables it (zerograds() is used instead).

Note

Note that update() calls zerograds() by default for backward compatibility. It is recommended to call this method before the first call of update() because cleargrads() is more efficient than zerograds().
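
A minimal sketch of the recommended pattern (optimizer and model as in the earlier sketch):

    optimizer = optimizers.SGD(lr=0.01)
    optimizer.setup(model)
    optimizer.use_cleargrads()  # prefer cleargrads() over zerograds() in update()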

Hook functions

class chainer.optimizer.WeightDecay(rate)[source]

Optimizer hook function for weight decay regularization.

This hook function adds a scaled parameter value to the corresponding gradient. It can be used as L2 regularization.

Parameters:rate (float) – Coefficient for the weight decay.
Variables:rate (float) – Coefficient for the weight decay.
class chainer.optimizer.Lasso(rate)[source]

Optimizer hook function for Lasso regularization.

This hook function adds the scaled sign of each parameter value to the corresponding gradient. It can be used as L1 regularization.

Parameters:rate (float) – Coefficient of the L1 regularization term.
Variables:rate (float) – Coefficient of the L1 regularization term.
class chainer.optimizer.GradientClipping(threshold)[source]

Optimizer hook function for gradient clipping.

This hook function scales all gradient arrays so that the overall gradient fits within the defined L2 norm threshold.

Parameters:threshold (float) – L2 norm threshold.
Variables:threshold (float) – L2 norm threshold of the gradients.
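
These hook functions are typically attached right after setup(); for example (a sketch, with illustrative coefficients and an already set-up optimizer):

    import chainer

    optimizer.add_hook(chainer.optimizer.WeightDecay(0.0005))
    optimizer.add_hook(chainer.optimizer.GradientClipping(10.0))
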
class chainer.optimizer.GradientNoise(eta, noise_func=<function exponential_decay_noise>)[source]

Optimizer hook function for adding gradient noise.

This hook function simply adds noise generated by the noise_func to the gradient. By default it adds time-dependent annealed Gaussian noise to the gradient at every training step:

\[g_t \leftarrow g_t + N(0, \sigma_t^2)\]

where

\[\sigma_t^2 = \frac{\eta}{(1+t)^\gamma}\]

with \(\eta\) selected from {0.01, 0.3, 1.0} and \(\gamma = 0.55\).

Parameters: