chainer.optimizer_hooks.GradientLARS

class chainer.optimizer_hooks.GradientLARS(threshold=0.01, weight_decay=0.0, eps=1e-09)[source]

Optimizer/UpdateRule hook function for layer-wise adaptive rate scaling.

See: Large Batch Training of Convolutional Networks.

See: Convergence Analysis of Gradient Descent Algorithms with Proportional Updates.

This hook function scales all gradient arrays to fit the weight norm.

In <https://arxiv.org/abs/1708.03888>, the update rule is

\[\begin{split}v_{t+1} &= m * v_t + \gamma * \lambda * (\nabla L(w_t) + \beta w_t), \\ w_{t+1} &= w_{t} - v_{t+1},\end{split}\]

where

  • \(\gamma\) : learning_rate
  • \(m\) : momentum
  • \(\beta\) : weight_decay
  • \(\eta\) : lars_coefficient
  • \(\lambda\): local_lr \(=\eta * \frac{\|w_t\|}{\|\nabla L(w_t)\| + \beta * \|w_t\|}\).

As \(lr\) in chainer.optimizers.SGD or chainer.optimizers.MomentumSGD corresponds to \(\gamma * \eta\), we define \(clip\_rate\) as \(\frac{\|w_t\|}{\|\nabla L(w_t)\| + \beta * \|w_t\|}\) and reformulate the aforementioned formula as \(v_{t+1} = m * v_t + lr * clip\_rate * (\nabla L(w_t) + \beta w_t)\), which is how the hook is implemented. Therefore, you do not need to set lars_coefficient yourself.
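For example, the hook can be registered on a MomentumSGD optimizer as follows. This is a minimal sketch; the model definition and the hyperparameter values (lr, momentum, threshold, weight_decay) are illustrative assumptions, not recommendations.

    import chainer
    import chainer.functions as F
    import chainer.links as L
    from chainer import optimizers, optimizer_hooks

    # A small model used only for illustration.
    model = chainer.Sequential(
        L.Linear(None, 100),
        F.relu,
        L.Linear(100, 10),
    )

    # lr here corresponds to gamma * eta in the formulation above.
    optimizer = optimizers.MomentumSGD(lr=0.1, momentum=0.9)
    optimizer.setup(model)

    # Register the hook; each parameter's gradient is rescaled by clip_rate
    # before the MomentumSGD update is applied.
    optimizer.add_hook(
        optimizer_hooks.GradientLARS(threshold=0.01, weight_decay=0.0005))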

Parameters:
  • threshold (float) – If the weight norm is greater than threshold, this function scales all gradient arrays to fit the weight norm, as illustrated in the sketch after this list. (See <https://arxiv.org/abs/1801.03137>)
  • weight_decay (float) – Coefficient for the weight decay.
  • eps (float) – Small value for numerical stability. (See <https://arxiv.org/abs/1801.03137>)

Variables:
  • timing (string) – Specifies when this hook should be called by the Optimizer/UpdateRule. Valid values are ‘pre’ (before any updates) and ‘post’ (after any updates).
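The following NumPy sketch shows how threshold, weight_decay, and eps enter the per-parameter scaling described above. It is an illustration derived from the documented formula, not Chainer's actual implementation, and the function name lars_scaled_grad is hypothetical.

    import numpy as np

    def lars_scaled_grad(w, grad, threshold=0.01, weight_decay=0.0, eps=1e-09):
        # Illustrative only: rescale the gradient of a single parameter array.
        p_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(grad)
        # clip_rate = ||w|| / (||grad L(w)|| + beta * ||w||), eps for stability.
        clip_rate = p_norm / (g_norm + weight_decay * p_norm + eps)
        # Scale only when the weight norm exceeds threshold.
        rate = clip_rate if p_norm > threshold else 1.0
        return rate * (grad + weight_decay * w)

    w = np.random.randn(1000).astype(np.float32)
    grad = 0.001 * np.random.randn(1000).astype(np.float32)
    scaled_grad = lars_scaled_grad(w, grad, weight_decay=0.0005)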

Methods

__call__(rule, param)[source]

Call self as a function.

Attributes

call_for_each_param = True
name = 'GradientLARS'
timing = 'pre'