Improving Generalization Performance by Switching from Adam to SGD (arXiv)

On many hard tasks, such as object recognition on CIFAR-100 or ImageNet, machine translation, or language modeling, SGD generalizes better than Adam. In a recent blog post, I outlined several approaches that try to mitigate this. Researchers from Salesforce Research propose a simple method that switches from Adam to SGD once a triggering condition is met, which helps with generalization.
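To make the idea concrete, here is a minimal pure-Python sketch of the switching pattern on a toy quadratic. It is not the paper's implementation. The summary above does not spell out the triggering condition, so the `trigger` callback and the loss threshold used here are hypothetical placeholders, as are the toy loss, the learning rates, and all function names.

```python
import math

TARGET = 3.0  # Assumption: toy problem, minimize f(w) = sum((w_i - TARGET)^2).


def loss(w):
    return sum((wi - TARGET) ** 2 for wi in w)


def grad(w):
    # Analytic gradient of the toy loss.
    return [2.0 * (wi - TARGET) for wi in w]


def adam_step(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update with bias-corrected moment estimates.
    state["t"] += 1
    t = state["t"]
    state["m"] = [b1 * m + (1 - b1) * gi for m, gi in zip(state["m"], g)]
    state["v"] = [b2 * v + (1 - b2) * gi * gi for v, gi in zip(state["v"], g)]
    m_hat = [m / (1 - b1 ** t) for m in state["m"]]
    v_hat = [v / (1 - b2 ** t) for v in state["v"]]
    return [wi - lr * mh / (math.sqrt(vh) + eps)
            for wi, mh, vh in zip(w, m_hat, v_hat)]


def sgd_step(w, g, lr=0.1):
    # Plain SGD update (no momentum).
    return [wi - lr * gi for wi, gi in zip(w, g)]


def train(w, steps, trigger):
    """Run Adam until `trigger(step, w)` first returns True, then use SGD.

    `trigger` is a hypothetical stand-in for the paper's switching criterion.
    Returns the final weights and the step at which the switch happened
    (None if the trigger never fired).
    """
    state = {"t": 0, "m": [0.0] * len(w), "v": [0.0] * len(w)}
    switched_at = None
    for step in range(steps):
        g = grad(w)
        if switched_at is None and trigger(step, w):
            switched_at = step  # The switch is one-way: never back to Adam.
        if switched_at is None:
            w = adam_step(w, g, state)
        else:
            w = sgd_step(w, g)
    return w, switched_at


# Illustrative trigger only: switch once the toy loss drops below 1.0.
final_w, switched_at = train([0.0, 0.0], steps=200,
                             trigger=lambda step, w: loss(w) < 1.0)
print("switched at step:", switched_at, "final loss:", loss(final_w))
```

The design point the sketch tries to capture is that the switch happens once and is driven by a condition checked during training rather than by a hand-tuned schedule. In a real setup, the SGD learning rate after the switch would also need to be chosen with care; the fixed value here is only for illustration.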


