Momentum, RMSprop, and Adam

Gradient Descent with Momentum

Vdw = 0
Vdb = 0
On iteration t:
  Compute dW, db on the current mini-batch
  Vdw = beta*Vdw + (1-beta)*dW
  Vdb = beta*Vdb + (1-beta)*db
  W = W - alpha*Vdw
  b = b - alpha*Vdb 
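
The update above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `momentum_step`, the default values of `alpha` and `beta`, and the toy quadratic objective are all assumptions made for the example.

```python
import numpy as np

def momentum_step(W, b, dW, db, Vdw, Vdb, alpha=0.1, beta=0.9):
    """Apply one momentum update and return the new (W, b, Vdw, Vdb)."""
    # Exponentially weighted averages of the gradients.
    Vdw = beta * Vdw + (1 - beta) * dW
    Vdb = beta * Vdb + (1 - beta) * db
    # Step along the averaged gradients instead of the raw ones.
    W = W - alpha * Vdw
    b = b - alpha * Vdb
    return W, b, Vdw, Vdb

# Toy objective (assumed for illustration): 0.5*||W||^2 + 0.5*b^2,
# so the gradients are simply dW = W and db = b.
W, b = np.array([1.0, -2.0]), 3.0
Vdw, Vdb = np.zeros_like(W), 0.0
for t in range(500):
    dW, db = W, b
    W, b, Vdw, Vdb = momentum_step(W, b, dW, db, Vdw, Vdb)
```

In a real training loop, `dW` and `db` would come from backpropagation on the current mini-batch rather than from a closed-form gradient.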

RMSprop

Sdw = 0
Sdb = 0
On iteration t:
  Compute dW, db on the current mini-batch
  Sdw = beta*Sdw + (1-beta)*dW^2
  Sdb = beta*Sdb + (1-beta)*db^2
  W = W - alpha*dW/sqrt(Sdw + epsilon)
  b = b - alpha*db/sqrt(Sdb + epsilon)
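
A NumPy sketch of the same update, under a few assumptions: the name `rmsprop_step` and the default hyperparameters are chosen for illustration, the squares are taken element-wise, and `epsilon` sits inside the square root exactly as in the pseudocode above. The short demo suggests why RMSprop helps: two coordinates whose gradients differ by four orders of magnitude end up taking nearly the same step size.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, Sdw, Sdb, alpha=0.001, beta=0.9, epsilon=1e-8):
    """Apply one RMSprop update and return the new (W, b, Sdw, Sdb)."""
    # Exponentially weighted averages of the element-wise squared gradients.
    Sdw = beta * Sdw + (1 - beta) * dW**2
    Sdb = beta * Sdb + (1 - beta) * db**2
    # Scale each gradient by the root of its running squared magnitude.
    W = W - alpha * dW / np.sqrt(Sdw + epsilon)
    b = b - alpha * db / np.sqrt(Sdb + epsilon)
    return W, b, Sdw, Sdb

# Demo with made-up gradients of very different scales.
W0 = np.array([1.0, 1.0])
dW = np.array([100.0, 0.01])
W1, b1, Sdw, Sdb = rmsprop_step(W0, 0.0, dW, 0.0, np.zeros(2), 0.0)
step = W0 - W1  # per-coordinate step actually taken
```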

Adam

Vdw = 0
Vdb = 0
Sdw = 0
Sdb = 0
On iteration t (t = 1, 2, ...):
  Compute dW, db on the current mini-batch
  Vdw = beta_1*Vdw + (1-beta_1)*dW
  Vdb = beta_1*Vdb + (1-beta_1)*db
  Sdw = beta_2*Sdw + (1-beta_2)*dW^2
  Sdb = beta_2*Sdb + (1-beta_2)*db^2
  Vdw_corrected = Vdw/(1-beta_1^t)
  Vdb_corrected = Vdb/(1-beta_1^t)
  Sdw_corrected = Sdw/(1-beta_2^t)
  Sdb_corrected = Sdb/(1-beta_2^t)
  W = W - alpha*Vdw_corrected/(sqrt(Sdw_corrected) + epsilon)
  b = b - alpha*Vdb_corrected/(sqrt(Sdb_corrected) + epsilon)
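
The Adam steps above can be sketched in NumPy as follows. The function name `adam_step`, the `state` tuple, the default hyperparameters, and the toy quadratic objective are assumptions made for this example. Note that `t` has to start at 1: with `t = 0` the bias-correction denominators `1 - beta_1**t` and `1 - beta_2**t` would be zero.

```python
import numpy as np

def adam_step(W, b, dW, db, state, t,
              alpha=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """Apply one Adam update (t counts from 1); return (W, b, new_state)."""
    Vdw, Vdb, Sdw, Sdb = state
    # Momentum-style averages of the gradients.
    Vdw = beta_1 * Vdw + (1 - beta_1) * dW
    Vdb = beta_1 * Vdb + (1 - beta_1) * db
    # RMSprop-style averages of the element-wise squared gradients.
    Sdw = beta_2 * Sdw + (1 - beta_2) * dW**2
    Sdb = beta_2 * Sdb + (1 - beta_2) * db**2
    # Bias correction compensates for the zero initialization.
    Vdw_c, Vdb_c = Vdw / (1 - beta_1**t), Vdb / (1 - beta_1**t)
    Sdw_c, Sdb_c = Sdw / (1 - beta_2**t), Sdb / (1 - beta_2**t)
    W = W - alpha * Vdw_c / (np.sqrt(Sdw_c) + epsilon)
    b = b - alpha * Vdb_c / (np.sqrt(Sdb_c) + epsilon)
    return W, b, (Vdw, Vdb, Sdw, Sdb)

def loss(W, b):
    # Toy objective (assumed for illustration); gradients are dW = W, db = b.
    return 0.5 * (np.sum(W**2) + b**2)

W, b = np.array([1.0, -2.0]), 3.0
state = (np.zeros_like(W), 0.0, np.zeros_like(W), 0.0)
initial_loss = loss(W, b)
for t in range(1, 101):
    W, b, state = adam_step(W, b, W, b, state, t)
final_loss = loss(W, b)
```

On the very first iteration the corrected averages equal the raw gradient and its square, so each parameter moves by roughly `alpha` in the direction opposite its gradient, regardless of the gradient's magnitude.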