class chainer.optimizer_hooks.GradientLARS(threshold=0.01, weight_decay=0.0, eps=1e-09)[source]

Optimizer/UpdateRule hook function for layer wise adaptive rate scaling.

See: Large Batch Training of Convolutional Networks.

See: Convergence Analysis of Gradient Descent Algorithms with Proportional Updates.

This hook function scales all gradient arrays to fit to the weight norm.

In <>,

\[\begin{split}v_{t+1} &= m * v_t + \gamma * \lambda * (\nabla L(w_t) + \beta w_t), \\ w_{t+1} &= w_{t} - v_{t+1},\end{split}\]


  • \(\gamma\) : learning_rate

  • \(m\) : momentum

  • \(\beta\) : weight_decay

  • \(\eta\) : lars_coeeficient

  • \(\lambda\): local_lr \(=\eta * \frac{\|w_t\|}{\|\nabla L(w_t)\| + \beta * \|w_t\|}\).

As \(lr\) in chainer.optimizers.SGD or chainer.optimizers.MomentumSGD corresponds to \(\gamma * \eta\), we define \(clip\_rate\) as \(\frac{\|w_t\|}{\|\nabla L(w_t)\| + \beta * \|w_t\|}\) and reformulate the aforementioned formula as: \(v_{t+1} = m * v_t + lr * clip\_rate * (\nabla L(w_t) + \beta w_t)\) and implement in this way. So you do not set lars_coeeficient.

  • threashold (float) – If weight norm is more than threshold, this function scales all gradient arrays to fit weight norm. (See <>)

  • weight_decay (float) – Coefficient for the weight decay.

  • eps (float) – Small value for the numerical stability. (See <>)

  • timing (string) – Specifies when this hook should be called by the Optimizer/UpdateRule. Valid values are ‘pre’ (before any updates) and ‘post’ (after any updates).

  • call_for_each_param (bool) – Specifies if this hook is called for each parameter (True) or only once (False) by an optimizer to which this hook is registered. This function does not expect users to switch the value from default one, which is True.


__call__(rule, param)[source]

Call self as a function.


call_for_each_param = True
name = 'GradientLARS'
timing = 'pre'