Momentum, RMSprop, and Adam
Published:
Gradient Descent with Momentum
Vdw = 0
Vdb = 0
On iteration t:
Compute dW, db on the current mini-batch
Vdw = beta*Vdw + (1-beta)*dW
Vdb = beta*Vdb + (1-beta)*db
W = W - alpha*Vdw
b = b - alpha*Vdb
RMSprop
Sdw = 0
Sdb = 0
On iteration t:
Compute dW, db on the current mini-batch
Sdw = beta*Sdw + (1-beta)*dW^2
Sdb = beta*Sdb + (1-beta)*db^2
W = W - alpha*dW/sqrt(Sdw + epsilon)
b = b - alpha*db/sqrt(Sdb + epsilon)
Adam
Vdw = 0
Vdb = 0
Sdw = 0
Sdb = 0
On iteration t:
Compute dW, db on the current mini-batch
Vdw = beta_1*Vdw + (1-beta_1)*dW
Vdb = beta_1*Vdb + (1-beta_1)*db
Sdw = beta_2*Sdw + (1-beta_2)*dW^2
Sdb = beta_2*Sdb + (1-beta_2)*db^2
Vdw_corrected = Vdw/(1-beta_1^t)
Vdb_corrected = Vdb/(1-beta_1^t)
Sdw_corrected = Sdw/(1-beta_2^t)
Sdb_corrected = Sdb/(1-beta_2^t)
W = W - alpha*Vdw_corrected/(sqrt(Sdw_corrected) + epsilon)
b = b - alpha*Vdb_corrected/(sqrt(Sdb_corrected) + epsilon)