Learning Rate Scheduler#


Poly#

Scala:

val lrScheduler = Poly(power=0.5, maxIteration=1000)

Python:

lr_scheduler = Poly(power=0.5, max_iteration=1000, bigdl_type="float")

A learning rate decay policy, where the effective learning rate follows a polynomial decay and reaches zero at max_iteration. Calculation: base_lr * (1 - iter/maxIteration) ^ power

power coefficient of decay, as in the calculation formula above

maxIteration the iteration at which the learning rate reaches zero
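
As a quick sanity check (plain Scala, independent of BigDL), the formula reproduces the rates printed in the example below. Note that the example prints negative values: SGD reports currentRate with the sign of the descent step.

// Poly(power = 3, maxIteration = 100) with base_lr = 0.1:
// rate(iter) = base_lr * (1 - iter / maxIteration) ^ power
val baseLr = 0.1
val rates = (0 to 2).map(iter => baseLr * math.pow(1.0 - iter / 100.0, 3))
// ~ Vector(0.1, 0.0970299, 0.0941192)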

Scala example:

import com.intel.analytics.bigdl.dllib.optim.SGD._
import com.intel.analytics.bigdl.dllib.optim._
import com.intel.analytics.bigdl.dllib.tensor.{Storage, Tensor}
import com.intel.analytics.bigdl.dllib.tensor.TensorNumericMath.TensorNumeric.NumericFloat
import com.intel.analytics.bigdl.dllib.utils.T

val optimMethod = new SGD[Double](0.1)
optimMethod.learningRateSchedule = Poly(3, 100)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.1
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.0970299

Python example:

optim_method = SGD(learningrate=0.1, leaningrate_schedule=Poly(3, 100))

Default#

It is the default learning rate schedule. At each iteration n, the learning rate is updated as l_{n + 1} = l / (1 + n * learning_rate_decay), where l is the initial learning rate and learning_rate_decay is the decay passed to SGD (its second constructor argument in the Scala example below).
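
A quick check of the formula (plain Scala) matches the rates printed in the example below, up to the sign convention:

// Default with l = 0.1 and learning_rate_decay = 0.1:
// rate(n) = l / (1 + n * learning_rate_decay)
val rates = (0 to 2).map(n => 0.1 / (1 + n * 0.1))
// ~ Vector(0.1, 0.0909090909, 0.0833333333)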

Scala:

val lrScheduler = Default()

Python:

lr_scheduler = Default()

Scala example:

val optimMethod = new SGD[Double](0.1, 0.1)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.1
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.09090909090909091
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.08333333333333334

Python example:

optim_method = SGD(leaningrate_schedule=Default())

NaturalExp#

A learning rate schedule which rescales the learning rate by exp(-gamma * iter / decay_step), analogous to TensorFlow's natural_exp_decay.

decay_step how often to apply decay

gamma the decay rate. e.g. 0.96
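
The rates printed in the example below follow directly from the formula; a quick check in plain Scala:

// NaturalExp(decayStep = 1, gamma = 1) with base_lr = 0.1:
// rate(iter) = base_lr * exp(-gamma * iter / decayStep)
val rates = (0 to 2).map(iter => 0.1 * math.exp(-1.0 * iter))
// ~ Vector(0.1, 0.0367879441, 0.0135335283)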

Scala:

val learningRateScheduler = NaturalExp(1, 1)

Scala example:

val optimMethod = new SGD[Double](0.1)
optimMethod.learningRateSchedule = NaturalExp(1, 1)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
val state = T("epoch" -> 0, "evalCounter" -> 0)
optimMethod.state = state
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.1

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.036787944117144235

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.013533528323661271

Exponential#

A learning rate schedule, which rescale the learning rate by lr_{n + 1} = lr * decayRate ^ (iter / decayStep)

decayStep the interval (in iterations) between learning rate decays

decayRate decay rate

stairCase if true, iter / decayStep is an integer division and the decayed learning rate follows a staircase function.
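
With the default stairCase = false the exponent is fractional, which is what the Scala example below shows; a quick check in plain Scala:

// Exponential(decayStep = 10, decayRate = 0.96) with base_lr = 0.05:
// rate(iter) = base_lr * decayRate ^ (iter / decayStep)
val rates = (0 to 1).map(iter => 0.05 * math.pow(0.96, iter / 10.0))
// ~ Vector(0.05, 0.0497963061)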

Scala:

val learningRateSchedule = Exponential(10, 0.96)

Python:

exponential = Exponential(100, 0.1)

Scala example:

val optimMethod = new SGD[Double](0.05)
optimMethod.learningRateSchedule = Exponential(10, 0.96)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
val state = T("epoch" -> 0, "evalCounter" -> 0)
optimMethod.state = state
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.05

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.049796306069892535

Python example:

optim_method = SGD(leaningrate_schedule=Exponential(100, 0.1))

Plateau#

Plateau is the learning rate schedule when a metric has stopped improving. Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. It monitors a quantity and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.

monitor quantity to be monitored; can be “Loss” or “score”

factor factor by which the learning rate will be reduced. new_lr = lr * factor

patience number of epochs with no improvement after which learning rate will be reduced.

mode one of {min, max}. In min mode, lr will be reduced when the quantity monitored has stopped decreasing; in max mode it will be reduced when the quantity monitored has stopped increasing

epsilon threshold for measuring the new optimum, to only focus on significant changes.

cooldown number of epochs to wait before resuming normal operation after lr has been reduced.

minLr lower bound on the learning rate.
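
The decision made at the end of each epoch can be summarized by the following minimal sketch for min mode (a simplified illustration with hypothetical names, not BigDL's implementation):

// Hypothetical sketch of the Plateau rule in "min" mode: returns the
// (possibly reduced) lr, the best value seen so far, and the wait counter.
def plateauStep(lr: Double, best: Double, current: Double, wait: Int,
                factor: Double, patience: Int, epsilon: Double, minLr: Double)
    : (Double, Double, Int) = {
  if (current < best - epsilon) (lr, current, 0)       // significant improvement: reset
  else if (wait + 1 >= patience) (math.max(lr * factor, minLr), best, 0) // stagnated: reduce
  else (lr, best, wait + 1)                            // keep waiting
}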

Scala:

val learningRateSchedule = Plateau(monitor="score", factor=0.1, patience=10, mode="min", epsilon=1e-4f, cooldown=0, minLr=0)

Python:

plateau = Plateau("score", factor=0.1, patience=10, mode="min", epsilon=1e-4, cooldown=0, minLr=0)

Scala example:

val optimMethod = new SGD[Double](0.05)
optimMethod.learningRateSchedule = Plateau(monitor="score", factor=0.1, patience=10, mode="min", epsilon=1e-4f, cooldown=0, minLr=0)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
val state = T("epoch" -> 0, "evalCounter" -> 0)
optimMethod.state = state
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)


optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)

Python example:

optim_method = SGD(leaningrate_schedule=Plateau("score"))

Warmup#

A gradual warm-up policy, where the effective learning rate increases by delta after each iteration. Calculation: base_lr + delta * iteration

delta increase amount after each iteration
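
A quick check of the formula (plain Scala) explains the rates printed in the example below, which stacks Warmup with Poly through a SequentialSchedule since a warm-up phase is normally followed by a decay phase:

// Warmup(delta = 0.3) with base_lr = 0.1:
// rate(iter) = base_lr + delta * iter
val rates = (0 to 2).map(iter => 0.1 + 0.3 * iter)
// ~ Vector(0.1, 0.4, 0.7)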

Scala:

val learningRateSchedule = Warmup(delta = 0.05)

Python:

warmup = Warmup(delta=0.05)

Scala example:

val lrSchedules = new SequentialSchedule(100)
lrSchedules.add(Warmup(0.3), 3).add(Poly(3, 100), 100)
val optimMethod = new SGD[Double](learningRate = 0.1, learningRateSchedule = lrSchedules)

def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.1

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.4

Python example:

optimMethod = SGD(leaningrate_schedule=Warmup(0.05))

SequentialSchedule#

A learning rate scheduler that stacks several learning rate schedules and applies them in sequence; each schedule added with add runs for the number of iterations given as add's second argument.

iterationPerEpoch the number of iterations per epoch
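
The rates printed in the example below can be reconstructed by hand (plain Scala): Warmup(0.3) runs for 3 iterations starting from base_lr = 0.1, and then, as the printed output suggests, Poly(3, 100) takes over from the warmed-up rate and restarts its own iteration count:

// Phase 1: 3 iterations of Warmup(0.3) from base_lr = 0.1
val warm = (0 to 2).map(i => 0.1 + 0.3 * i)          // ~ 0.1, 0.4, 0.7
// Phase 2: Poly(3, 100) starting from the hand-off rate
val handoff = 0.1 + 0.3 * 3                          // ~ 1.0
val poly = (0 to 1).map(i => handoff * math.pow(1.0 - i / 100.0, 3))  // ~ 1.0, 0.970299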

Scala:

val learningRateSchedule = SequentialSchedule(iterationPerEpoch=100)

Python:

sequentialSchedule = SequentialSchedule(iteration_per_epoch=5)

Scala example:

val lrSchedules = new SequentialSchedule(100)
lrSchedules.add(Warmup(0.3), 3).add(Poly(3, 100), 100)
val optimMethod = new SGD[Double](learningRate = 0.1, learningRateSchedule = lrSchedules)

def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.1

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.4

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.7

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-1.0

optimMethod.optimize(feval, x)
> print(optimMethod.learningRateSchedule.currentRate)
-0.9702989999999999

Python example:

sequentialSchedule = SequentialSchedule(5)
poly = Poly(0.5, 2)
sequentialSchedule.add(poly, 5)

EpochDecay#

Scala:

def decay(epoch: Int): Double =
  if (epoch == 1) 2.0 else if (epoch == 2) 1.0 else 0.0

val learningRateSchedule = EpochDecay(decay)

It is an epoch decay learning rate schedule. Each epoch, the learning rate is recomputed from the initial rate through a user-supplied function of the number of run epochs: l = base_lr * 0.1 ^ decayType(epoch)

decayType a function taking the number of run epochs as its argument
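
A quick check in plain Scala reproduces the rates asserted in the example below (up to floating point rounding and the sign convention):

// EpochDecay with base_lr = 1000: rate(epoch) = base_lr * 0.1 ^ decay(epoch)
def decay(epoch: Int): Double =
  if (epoch == 1) 2.0 else if (epoch == 2) 1.0 else 0.0
val rates = (1 to 3).map(e => 1000 * math.pow(0.1, decay(e)))
// ~ Vector(10.0, 100.0, 1000.0)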

Scala example:

def decay(epoch: Int): Double =
  if (epoch == 1) 2.0 else if (epoch == 2) 1.0 else 0.0

val optimMethod = new SGD[Double](1000)
optimMethod.learningRateSchedule = EpochDecay(decay)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
val state = T("epoch" -> 0)
for(e <- 1 to 3) {
  state("epoch") = e
  optimMethod.state = state
  optimMethod.optimize(feval, x)
  if(e <= 1) {
    assert(optimMethod.learningRateSchedule.currentRate == -10)
  } else if (e <= 2) {
    assert(optimMethod.learningRateSchedule.currentRate == -100)
  } else {
    assert(optimMethod.learningRateSchedule.currentRate == -1000)
  }
}

Regime#

A structure to specify hyper parameters by start epoch and end epoch. It usually works with [[EpochSchedule]].

startEpoch start epoch

endEpoch end epoch

config config table contains hyper parameters

EpochSchedule#

A learning rate schedule which configures the learning rate according to pre-defined [[Regime]]s. If the running epoch is within the interval [r.startEpoch, r.endEpoch] of a regime r, then the learning rate will take the “learningRate” value in r.config.

regimes an array of pre-defined [[Regime]].
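
The lookup EpochSchedule performs each epoch amounts to an interval search; a minimal sketch with a simplified config type (not BigDL's implementation):

// Hypothetical regime lookup: pick the regime whose epoch interval
// contains the running epoch, if any.
case class SimpleRegime(startEpoch: Int, endEpoch: Int, config: Map[String, Double])
def regimeFor(regimes: Seq[SimpleRegime], epoch: Int): Option[SimpleRegime] =
  regimes.find(r => epoch >= r.startEpoch && epoch <= r.endEpoch)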

Scala:

val regimes: Array[Regime] = Array(
  Regime(1, 3, T("learningRate" -> 1e-2, "weightDecay" -> 2e-4)),
  Regime(4, 7, T("learningRate" -> 5e-3, "weightDecay" -> 2e-4)),
  Regime(8, 10, T("learningRate" -> 1e-3, "weightDecay" -> 0.0))
)
val learningRateScheduler = EpochSchedule(regimes)

Scala example:

val regimes: Array[Regime] = Array(
  Regime(1, 3, T("learningRate" -> 1e-2, "weightDecay" -> 2e-4)),
  Regime(4, 7, T("learningRate" -> 5e-3, "weightDecay" -> 2e-4)),
  Regime(8, 10, T("learningRate" -> 1e-3, "weightDecay" -> 0.0))
)

val state = T("epoch" -> 0)
val optimMethod = new SGD[Double](0.1)
optimMethod.learningRateSchedule = EpochSchedule(regimes)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
for(e <- 1 to 10) {
  state("epoch") = e
  optimMethod.state = state
  optimMethod.optimize(feval, x)
  if(e <= 3) {
    assert(optimMethod.learningRateSchedule.currentRate==-1e-2)
    assert(optimMethod.weightDecay==2e-4)
  } else if (e <= 7) {
    assert(optimMethod.learningRateSchedule.currentRate==-5e-3)
    assert(optimMethod.weightDecay==2e-4)
  } else if (e <= 10) {
    assert(optimMethod.learningRateSchedule.currentRate==-1e-3)
    assert(optimMethod.weightDecay==0.0)
  }
}

EpochStep#

A learning rate schedule which rescales the learning rate by gamma every stepSize epochs.

stepSize the number of epochs between learning rate updates

gamma the rescale factor
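
EpochStep(1, 0.5) therefore halves the rate every epoch, which is what the assertion in the example below checks; a quick check in plain Scala:

// EpochStep(stepSize = 1, gamma = 0.5) with base_lr = 0.1:
// rate(epoch) = base_lr * gamma ^ floor(epoch / stepSize)
val rates = (1 to 3).map(e => 0.1 * math.pow(0.5, e / 1))
// ~ Vector(0.05, 0.025, 0.0125)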

Scala:

val learningRateScheduler = EpochStep(1, 0.5)

Scala example:

val optimMethod = new SGD[Double](0.1)
optimMethod.learningRateSchedule = EpochStep(1, 0.5)
def feval(x: Tensor[Double]): (Double, Tensor[Double]) = {
  (0.1, Tensor[Double](Storage(Array(1.0, 1.0))))
}
val x = Tensor[Double](Storage(Array(10.0, 10.0)))
val state = T("epoch" -> 0)
for(e <- 1 to 10) {
  state("epoch") = e
  optimMethod.state = state
  optimMethod.optimize(feval, x)
  assert(optimMethod.learningRateSchedule.currentRate==(-0.1 * Math.pow(0.5, e)))
}