# Optimizer¶

class chainer.Optimizer[source]

Base class of all numerical optimizers.

Optimizer is set up with references to parameters and gradients; then, on every call of update(), it updates the parameters based on the corresponding gradients. Optimizer implementations must override the update_one() method, which updates one parameter array using the corresponding gradient array.

Optimizer can optionally use a state for each parameter/gradient pair. The state is initialized by the init_state() method at setup time.
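To make the contract above concrete, here is a minimal sketch of the update() / update_one() protocol using plain NumPy. This is not Chainer's actual source: `OptimizerSketch` and `SGDSketch` are illustrative stand-ins that mimic the interface described in this section.

```python
import numpy as np

# Illustrative stand-in for the Optimizer base class, mimicking the
# setup() / update() / update_one() contract described above.
class OptimizerSketch:
    def setup(self, params_grads):
        params, grads = params_grads
        self.t = 0
        self.tuples = [(p, g, self.init_state(p, g))
                       for p, g in zip(params, grads)]

    def init_state(self, param, grad):
        return None  # no per-parameter state by default

    def update(self):
        self.t += 1  # t is incremented before update_one() is called
        for param, grad, state in self.tuples:
            self.update_one(param, grad, state)

    def update_one(self, param, grad, state):
        raise NotImplementedError

# A plain-SGD subclass: subtract lr * grad from each parameter in place.
class SGDSketch(OptimizerSketch):
    def __init__(self, lr=0.01):
        self.lr = lr

    def update_one(self, param, grad, state):
        param -= self.lr * grad

param = np.array([1.0, 2.0])
grad = np.array([0.5, 0.5])
opt = SGDSketch(lr=0.1)
opt.setup(((param,), (grad,)))
opt.update()
# param is now [0.95, 1.95]
```

Note that update_one() mutates the parameter array in place, which is why setup() keeps references rather than copies.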

t

Number of update steps (int). It can be used in update_one() implementations; t is incremented before update_one() is called.

accumulate_grads(grads)[source]

Adds the given gradient arrays to the gradients that this optimizer holds. It is typically used in data-parallel optimization, where gradients for different shards are computed in parallel and aggregated by this method. This method correctly handles multiple GPU devices.
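The CPU side of this aggregation can be sketched with NumPy as follows. This is an illustration of the described behavior only; the real method also handles gradients that live on different GPU devices.

```python
import numpy as np

# Sketch of accumulate_grads() on CPU: add each given gradient array
# to the corresponding gradient held by the optimizer, in place.
held_grads = [np.array([1.0, 2.0]), np.array([0.5])]
shard_grads = [np.array([0.1, 0.1]), np.array([0.5])]

for g, g_src in zip(held_grads, shard_grads):
    g += g_src  # in-place accumulation

# held_grads is now [array([1.1, 2.1]), array([1.0])]
```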

clip_grads(maxnorm)[source]

Clips the overall L2 norm of the gradients so that it does not exceed the given threshold.

Parameters: maxnorm (float) – Threshold of gradient L2 norm.

See also: compute_grads_norm(), which this method uses to compute the gradient norm to be clipped.
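The clipping rule can be sketched as follows: compute the L2 norm over all gradient elements, and rescale every gradient when the norm exceeds maxnorm. The function name is illustrative, not Chainer's code.

```python
import numpy as np

# Sketch of gradient norm clipping: treat all gradient arrays as one
# flat vector, compute its L2 norm, and rescale in place if needed.
def clip_grads_sketch(grads, maxnorm):
    sqnorm = sum(float((g * g).sum()) for g in grads)
    norm = np.sqrt(sqnorm)
    if norm > maxnorm:
        ratio = maxnorm / norm
        for g in grads:
            g *= ratio  # in-place rescaling
    return norm

grads = [np.array([3.0, 0.0]), np.array([4.0])]
clip_grads_sketch(grads, maxnorm=1.0)  # total norm was 5.0
# gradients are scaled by 1/5: [0.6, 0.0] and [0.8]
```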
compute_grads_norm()[source]

Computes the norm of whole gradients.

Returns: L2 norm of the whole gradients, i.e. the square root of the sum of squares of all gradient elements.

Return type: float

Warning

This method returns a CPU-computed value, which means that it synchronizes the CPU and GPU if at least one of the gradients resides on the GPU.

init_state(param, grad)[source]

Returns initial state corresponding to the given parameter and gradient.

Default implementation delegates the procedure to init_state_cpu() or init_state_gpu() depending on the type of param.

Parameters:
    param – Parameter array.
    grad – Gradient array corresponding to param.

Returns: Initial state value.

Warning

On every call of update_one(), the state value is passed by value and the method updates its content in place, so the state must be a reference (a mutable object). In particular, one cannot use a value of a built-in numeric type. If the state is a single scalar value, it is recommended to use a scalar array, i.e. an ndarray with shape ().
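The warning above can be demonstrated with a zero-dimensional NumPy array, which behaves as a mutable scalar; the helper function here is purely illustrative.

```python
import numpy as np

# A built-in float is immutable, so in-place updates inside
# update_one() would be lost. A scalar ndarray (shape ()) works.
state = np.array(0.0)       # scalar array, shape ()

def bump(state):
    state += 1.0            # in-place update, visible to the caller

bump(state)
# float(state) == 1.0, whereas a plain Python float would be unchanged
```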
init_state_cpu(param, grad)[source]

Returns initial state corresponding to the given CPU parameter and gradient.

Parameters:
    param (ndarray) – Parameter array.
    grad (ndarray) – Gradient array.

Returns: Initial state value.
init_state_gpu(param, grad)[source]

Returns initial state corresponding to the given GPU parameter and gradient.

Parameters:
    param (GPUArray) – Parameter array.
    grad (GPUArray) – Gradient array.

Returns: Initial state value.
setup(params_grads)[source]

Sets up the optimizer with references to the given parameters and gradients, and initializes the state for each pair.

Parameters: params_grads – Tuple (pair) of two tuples. The first element is a tuple of parameter arrays, and the second is a tuple of the corresponding gradient arrays. The return value of the FunctionSet.collect_parameters() method can be used.
update()[source]

Updates all parameters and states.

This method first increments the t attribute, then calls update_one() for each parameter/gradient/state tuple.

update_one(param, grad, state)[source]

Updates one parameter array and its state using the corresponding gradient array.

The default implementation delegates the procedure to update_one_cpu() or update_one_gpu() depending on the type of the parameter array. Optimizer implementations must override either these type-specific methods or the update_one() method directly.

Parameters:
    param – Parameter array.
    grad – Gradient array.
    state – State value.
update_one_cpu(param, grad, state)[source]

Updates one parameter array and its state using the corresponding gradient array on CPU.

Parameters:
    param (ndarray) – Parameter array.
    grad (ndarray) – Gradient array.
    state – State value.
update_one_gpu(param, grad, state)[source]

Updates one parameter array and its state using the corresponding gradient array on GPU.

Parameters:
    param (GPUArray) – Parameter array.
    grad (GPUArray) – Gradient array.
    state – State value.
weight_decay(decay)[source]

Applies weight decay (a.k.a. L2 or Tikhonov regularization) of the given scale to the current gradients.

Parameters: decay (float) – Coefficient of weight decay.
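L2 weight decay amounts to adding decay * param to each gradient before the update step. A minimal NumPy sketch of that rule (illustrative, not Chainer's code):

```python
import numpy as np

# Sketch of weight_decay(): add decay * param to each held gradient,
# in place, so the subsequent update() pulls parameters toward zero.
params = [np.array([1.0, -2.0])]
grads = [np.array([0.1, 0.1])]
decay = 0.01

for p, g in zip(params, grads):
    g += decay * p

# grads[0] is now [0.11, 0.08]
```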
zero_grads()[source]

Fills all gradient arrays with zeros.

This method should be called before backprop takes place, since gradients are accumulated during backprop.
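Conceptually, this is an in-place zero fill of every held gradient buffer; a NumPy sketch:

```python
import numpy as np

# Sketch of zero_grads(): fill every held gradient array with zeros
# in place, so the next backprop accumulates into clean buffers.
grads = [np.array([0.3, -0.2]), np.array([1.5])]
for g in grads:
    g.fill(0)
```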