chainer.optimizers.Adam

class chainer.optimizers.Adam(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, eta=1.0, weight_decay_rate=0, amsgrad=False, adabound=False, final_lr=0.1, gamma=0.001)[source]

Adam optimizer.

See: Adam: A Method for Stochastic Optimization

Modified for proper weight decay (also called AdamW). AdamW introduces the additional parameters eta and weight_decay_rate, which can be used to scale the learning rate and to decouple the weight decay rate from alpha, as shown in the paper below.

Note that with the default values eta = 1 and weight_decay_rate = 0, this implementation is identical to the standard Adam method.

See: Fixing Weight Decay Regularization in Adam

The flag amsgrad enables the AMSGrad variant of Adam from the paper: On the Convergence of Adam and Beyond

The flag adabound enables the AdaBound variant of Adam from the paper: Adaptive Gradient Methods with Dynamic Bound of Learning Rate

If both amsgrad and adabound are True, the optimizer is equivalent to AMSBound proposed in the AdaBound paper.

Parameters
  • alpha (float) – Coefficient of learning rate.

  • beta1 (float) – Exponential decay rate of the first order moment.

  • beta2 (float) – Exponential decay rate of the second order moment.

  • eps (float) – Small value for numerical stability.

  • eta (float) – Schedule multiplier; can be used for warm restarts.

  • weight_decay_rate (float) – Weight decay rate.

  • amsgrad (bool) – Whether to use AMSGrad variant of Adam.

  • adabound (bool) – Whether to use the AdaBound variant of Adam.

  • final_lr (float) – Final (SGD) learning rate in AdaBound.

  • gamma (float) – Convergence speed of the bound functions in AdaBound.
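
The parameters above combine into a single update rule. The following NumPy sketch reconstructs that rule from the parameter list (plain Adam when eta = 1 and weight_decay_rate = 0; the AMSGrad and AdaBound branches follow the cited papers, and the exact form of Chainer's kernels may differ slightly):

```python
import numpy as np

def adam_step(param, grad, m, v, vhat, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
              eta=1.0, weight_decay_rate=0.0,
              amsgrad=False, adabound=False, final_lr=0.1, gamma=0.001):
    """One in-place Adam/AdamW/AMSGrad/AdaBound step for step count t (>= 1)."""
    m += (1 - beta1) * (grad - m)          # first-moment EMA
    v += (1 - beta2) * (grad * grad - v)   # second-moment EMA
    if amsgrad:
        np.maximum(vhat, v, out=vhat)      # keep the running max of v
        v_used = vhat
    else:
        v_used = v
    # Bias-corrected step size (the alpha_t attribute below).
    alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    if adabound:
        # Clip the per-element learning rate into a band that converges
        # to final_lr at a speed set by gamma, as in the AdaBound paper.
        lower = final_lr * (1 - 1 / (gamma * t + 1))
        upper = final_lr * (1 + 1 / (gamma * t))
        step = np.clip(alpha_t / (np.sqrt(v_used) + eps), lower, upper) * m
    else:
        step = alpha_t * m / (np.sqrt(v_used) + eps)
    # Decoupled (AdamW-style) weight decay, scaled by the schedule
    # multiplier eta.
    param -= eta * (step + weight_decay_rate * param)
    return param
```

Setting both amsgrad=True and adabound=True in this sketch corresponds to the AMSBound combination mentioned above.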

Methods

add_hook(hook, name=None, timing='auto')[source]

Registers a hook function.

The hook function is typically called right after the gradient computation, though the exact timing depends on the optimization method and on the timing attribute.

Parameters
  • hook (callable) – Hook function. If hook.call_for_each_param is True, this hook function is called for each parameter by passing the update rule and the parameter. Otherwise, this hook function is called only once per iteration by passing the optimizer.

  • name (str) – Name of the registration. If omitted, hook.name is used by default.

  • timing (str) – Specifies when the hook is called. If ‘auto’, the timing property of the hook decides the timing. If ‘pre’, the hook is called before any updates. If ‘post’, the hook is called after any updates.

call_hooks(timing='pre')[source]

Invokes hook functions in registration order.

check_nan_in_grads()[source]

Checks if there is a NaN in the gradients when dynamic loss scaling is used.

create_update_rule()[source]

Creates a new update rule object.

This method creates an update rule object. It is called by setup() to set up an update rule of each parameter. Each implementation of the gradient method should override this method to provide the default update rule implementation.

Returns

Update rule object.

Return type

UpdateRule

is_safe_to_update()[source]
loss_scaling(interval=1000, scale=None)[source]

Configures the loss scaling algorithm.

Parameters
  • interval (int) – Number of iterations until the scaling factor is doubled. This is effective when “dynamic” loss scaling is used.

  • scale (float) – Loss scaling factor. If None, “dynamic” loss scaling is used, otherwise “static” loss scaling is used.
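
The “dynamic” policy can be sketched in pure Python (the doubling after interval clean iterations follows the parameter description above; halving the scale when an overflow is detected is an assumption about the back-off rule, not necessarily Chainer's exact constant):

```python
def next_scale(scale, good_steps, interval=1000, overflowed=False):
    """One iteration of a dynamic loss-scaling policy.

    Returns the (possibly updated) scale and the count of consecutive
    iterations without overflow.
    """
    if overflowed:
        return scale / 2.0, 0      # assumed back-off: halve and reset
    good_steps += 1
    if good_steps >= interval:
        return scale * 2.0, 0      # doubled after `interval` clean steps
    return scale, good_steps
```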

new_epoch(auto=False)[source]

Starts a new epoch.

This method increments the epoch count. Note that if the optimizer depends on the epoch count, the user should call this method at the beginning of each epoch.

Parameters

auto (bool) – Should be True if this method is called by an updater. In this case, use_auto_new_epoch should be set to True by the updater.

reallocate_cleared_grads()[source]

Reallocates gradients cleared by cleargrad().

This method allocates arrays for all gradients that are None. It is called before and after every optimizer hook. If an inheriting optimizer does not require this allocation, it can override this method with a blank function.

remove_hook(name)[source]

Removes a hook function.

Parameters

name (str) – Registered name of the hook function to remove.

serialize(serializer)[source]

Serializes or deserializes the optimizer.

It only saves or loads the following things:

  • Optimizer states

  • Global states (t and epoch)

It does not save or load the parameters of the target link; they should be saved or loaded separately.

Parameters

serializer (AbstractSerializer) – Serializer or deserializer object.

set_loss_scale(loss_scale)[source]

Sets the loss scaling factor.

setup(link)[source]

Sets a target link and initializes the optimizer states.

Given link is set to the target attribute. It also prepares the optimizer state dictionaries corresponding to all parameters in the link hierarchy. The existing states are discarded.

Parameters

link (Link) – Target link object.

Returns

The optimizer instance.

Note

As of v4.0.0, this function returns the optimizer instance itself so that you can instantiate and setup the optimizer in one line, e.g., optimizer = SomeOptimizer().setup(link).

update(lossfun=None, *args, **kwds)[source]

Updates parameters based on a loss function or computed gradients.

This method runs in two ways.

  • If lossfun is given, then it is used as a loss function to compute gradients.

  • Otherwise, this method assumes that the gradients are already computed.

In both cases, the computed gradients are used to update parameters. The actual update routines are defined by the update rule of each parameter.

update_loss_scale()[source]
use_cleargrads(use=True)[source]

Enables or disables use of cleargrads() in update.

Parameters

use (bool) – If True, this function enables use of cleargrads. If False, disables use of cleargrads (zerograds is used).

Deprecated since version v2.0: Note that update() calls cleargrads() by default. cleargrads() is more efficient than zerograds(), so one does not have to call use_cleargrads(). This method remains for backward compatibility.

use_fp32_update(flag=True)[source]

Enables parameter updates in fp32.

Attributes

adabound

Alias to self.hyperparam.adabound

alpha

Alias to self.hyperparam.alpha

alpha_t
amsgrad

Alias to self.hyperparam.amsgrad

beta1

Alias to self.hyperparam.beta1

beta2

Alias to self.hyperparam.beta2

epoch = 0
eps

Alias to self.hyperparam.eps

eta

Alias to self.hyperparam.eta

final_lr

Alias to self.hyperparam.final_lr

gamma

Alias to self.hyperparam.gamma

lr
t = 0
target = None
use_auto_new_epoch = False
weight_decay_rate

Alias to self.hyperparam.weight_decay_rate