A technique used when minimizing a function using gradient descent that helps avoid local minima.

One of the biggest problems when using gradient descent to minimize a function is that the optimization process may get stuck in a local minimum. Momentum is a technique that tries to mitigate that by allowing the optimization to "gather speed" and jump over small "hills" in the function.

The formula for momentum used in a gradient descent optimization builds on the simple iterative update rule. In plain gradient descent, we define the update of the weights at step `t` as:

`Δw(t) = -η ∇f(w(t))`

where `η` is the learning rate and `∇f(w(t))` is the gradient of the function being minimized, evaluated at the current weights.
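This update rule can be sketched in Python as follows; the quadratic example function `f(w) = w²`, the function name, and the parameter values are assumptions chosen for illustration:

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Plain gradient descent: at each step, apply the update Δw = -η ∇f(w)."""
    for _ in range(steps):
        w = w - eta * grad(w)  # Δw(t) = -η ∇f(w(t))
    return w

# Example: minimize f(w) = w**2, whose gradient is 2*w, starting from w = 5.
w_min = gradient_descent(lambda w: 2.0 * w, w=5.0)
print(w_min)  # approaches the minimum at w = 0
```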

If we want to add momentum, we add an additional term: the weight update from the previous step multiplied by a decay factor. The decay factor `α` is a number between 0 and 1 that defines how much of the previous update is taken into account. Setting `α = 0` applies no momentum at all, while values of `α` close to 1 apply a lot of momentum.

A useful analogy is a ball rolling down a hill. If the hill is steep, the ball will accelerate (we update the weights more). This will help the ball jump over small local minima and continue down the hill (to a smaller value of the function). More momentum means a heavier ball with higher inertia.

The final formula of gradient descent optimization with momentum becomes:

`Δw(t) = -η ∇f(w(t)) + α Δw(t-1)`
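Gradient descent with momentum can be sketched in Python like this; the function name and the values `η = 0.1` and `α = 0.9` are assumptions for illustration, not canonical settings:

```python
def gradient_descent_momentum(grad, w, eta=0.1, alpha=0.9, steps=500):
    """Gradient descent with momentum: Δw(t) = -η ∇f(w(t)) + α Δw(t-1)."""
    delta = 0.0  # Δw(t-1), the previous update (zero before the first step)
    for _ in range(steps):
        # Current gradient step plus the decayed previous update.
        delta = -eta * grad(w) + alpha * delta
        w = w + delta
    return w

# Example: minimize f(w) = w**2 (gradient 2*w) starting from w = 5.
w_min = gradient_descent_momentum(lambda w: 2.0 * w, w=5.0)
print(w_min)  # overshoots at first, then settles near the minimum at w = 0
```

Note that the accumulated `delta` lets the optimizer keep moving even where the current gradient is small, which is what allows it to roll over small "hills" in the function.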
