Yogi Optimizer

The Yogi optimizer is an adaptive gradient optimization algorithm designed to address the convergence limitations of the widely used Adam optimizer, particularly in nonconvex settings. Introduced in the research paper "Adaptive Methods for Nonconvex Optimization," Yogi aims to make training deep learning models more stable by controlling the increase of the effective learning rate.

The Core Problem: Why Yogi?

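Here $g_t$ is the gradient at time $t$, $\beta_2$ is a decay rate, and $v_t$ is Adam's running average of squared gradients. The problem arises when the gradients are sparse, with occasional large values among many small ones. Adam moves the running average toward each new squared gradient by an amount proportional to the gap between them, so when that gap is large the average changes aggressively. In particular, after a large gradient, a run of small gradients makes $v_t$ decay quickly, and the effective step size grows just as quickly. In some cases, this prevents the algorithm from regulating the effective step size correctly, leading to sub-optimal convergence.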

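Yogi was proposed by Manzil Zaheer, Sashank J. Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar in their 2018 paper, "Adaptive Methods for Nonconvex Optimization." It builds on the earlier analysis by Reddi, Kale, and Kumar in "On the Convergence of Adam and Beyond," and was born out of a critical observation: while Adam works well on many problems, its effective learning rate can increase rapidly when the running average of past squared gradients decays, leading to non-convergent behavior or "forgetting" in deep neural networks.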

Training GANs is a balancing act. The discriminator and generator often produce wildly fluctuating gradient magnitudes. Some practitioners report that Yogi reduces mode collapse and produces higher-quality samples, which they attribute to the optimizer not "forgetting" rare gradient features.

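    import torch_optimizer as optim  # assumption: optim.Yogi comes from the third-party torch-optimizer package

    model = MyNeuralNet()  # placeholder for any torch.nn.Module
    optimizer = optim.Yogi(
        model.parameters(),
        lr=0.01,
        betas=(0.9, 0.999),
        eps=1e-3,
        initial_accumulator=1e-6,
    )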

Early optimization algorithms, like Stochastic Gradient Descent (SGD), functioned like a hiker running down a mountain. They calculated the slope (gradient) of the terrain and took a step in the downward direction. However, this hiker had no memory. If the terrain was noisy or rugged, the hiker might bounce around erratically.

Adam is the default choice for most deep learning practitioners because it works well "out of the box." However, researchers identified a theoretical flaw in Adam’s update rule regarding the second-moment estimate (the running average of squared gradients, often loosely called the variance).

Yogi ensures that parameters with large gradients receive smaller learning rates, while those with smaller gradients receive larger ones. This enables more efficient exploration of the loss landscape.

Stability in Non-Convex Landscapes: It uses a bias-corrected second-moment estimate.
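The per-parameter scaling described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: it follows the update as written in the original paper and omits bias correction, and the helper name `yogi_step` and the gradient values are hypothetical.

```python
import math

def yogi_step(params, grads, m, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-3):
    """Apply one Yogi update in place to lists of floats (no bias correction)."""
    for i, g in enumerate(grads):
        g2 = g * g
        # First moment: the same exponential moving average as in Adam.
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        # Second moment: additive update; sign(v - g^2) is -1, 0, or 1.
        sign = (v[i] > g2) - (v[i] < g2)
        v[i] = v[i] - (1.0 - beta2) * sign * g2
        # Parameter update with a per-coordinate effective learning rate.
        params[i] -= lr * m[i] / (math.sqrt(v[i]) + eps)
    return params

params = [1.0, 1.0]
m = [0.0, 0.0]
v = [1e-6, 1e-6]  # small initial accumulator (assumed starting value)

for _ in range(100):
    grads = [10.0, 0.1]  # hypothetical: coordinate 0 sees a 100x larger gradient
    yogi_step(params, grads, m, v)

# Effective learning rate per coordinate after 100 steps.
rates = [0.01 / (math.sqrt(vi) + 1e-3) for vi in v]
print(rates)
```

The coordinate with the larger gradient accumulates a larger $v$ and therefore ends up with the smaller effective learning rate.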

The result? Yogi maintains a much more stable effective learning rate, even when faced with outlying, noisy gradients.
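To make the contrast concrete, here is a small sketch comparing how the two second-moment trackers react when a large accumulated value is followed by a long run of small gradients. The function names, the starting value `v0=1.0`, and the gradient size are assumptions chosen for illustration, and bias correction is left out of both rules.

```python
import math

def adam_v(v_prev, g, beta2=0.999):
    """Adam: exponential moving average of squared gradients."""
    return beta2 * v_prev + (1.0 - beta2) * g * g

def yogi_v(v_prev, g, beta2=0.999):
    """Yogi: additive update whose size depends only on g**2."""
    g2 = g * g
    sign = (v_prev > g2) - (v_prev < g2)  # sign(v_prev - g2) as -1, 0, or 1
    return v_prev - (1.0 - beta2) * sign * g2

def effective_lr(v, lr=0.01, eps=1e-3):
    """Step-size multiplier applied to the first-moment estimate."""
    return lr / (math.sqrt(v) + eps)

def simulate(update, v0=1.0, small_grad=0.01, steps=2000):
    """Feed a constant small gradient into a tracker that starts large."""
    v = v0
    for _ in range(steps):
        v = update(v, small_grad)
    return v

adam_final = simulate(adam_v)
yogi_final = simulate(yogi_v)
print("Adam:", adam_final, effective_lr(adam_final))
print("Yogi:", yogi_final, effective_lr(yogi_final))
```

In this setup Adam's accumulator decays geometrically, so its effective learning rate more than doubles, while Yogi's accumulator shrinks by a fixed small amount per step and its effective learning rate barely moves.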

Here $g_t$ is the current gradient. If you unroll this recursion, $v_t$ is essentially an exponential moving average of squared gradients.
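As a worked step, the unrolling can be written out explicitly. Assuming the accumulator starts at $v_0 = 0$ (an assumption made here for simplicity), repeatedly substituting the recursion $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ into itself gives:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i} g_i^2$$

Each past squared gradient $g_i^2$ is weighted by $\beta_2^{\,t-i}$, so its influence shrinks geometrically with age.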

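A common pitfall is copying epsilon from Adam. The example configuration in this article uses eps=1e-3 for Yogi, which is far larger than the 1e-8 commonly used as Adam's default.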

To understand Yogi, we must first understand the problem it solves. Training a neural network is essentially an optimization problem. The goal is to find a set of parameters (weights) that minimize a specific "loss function"—a mathematical representation of how wrong the model’s predictions are compared to reality.

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
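For comparison, it can help to rewrite Adam's update as a change in $v$ and set it next to the rule given in the Yogi paper. The Adam line below is just the equation above rearranged. The Yogi line is stated from the paper without bias correction, which individual implementations may add.

$$\text{Adam:}\quad v_t - v_{t-1} = -(1 - \beta_2)\,(v_{t-1} - g_t^2)$$

$$\text{Yogi:}\quad v_t - v_{t-1} = -(1 - \beta_2)\,\operatorname{sign}(v_{t-1} - g_t^2)\, g_t^2$$

In Adam the size of the change scales with the gap $v_{t-1} - g_t^2$, so a large accumulated $v_{t-1}$ can collapse quickly. In Yogi the size of the change depends only on $g_t^2$, and the gap contributes only its sign.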