Recommended texts
Optimization
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Barzilai, J., & Borwein, J. M. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141-148.
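For orientation, the two optimization papers above can be summarized by their core updates (a condensed sketch in the papers' notation; see the references for the full algorithms). Adam maintains exponential moving averages of the gradient and its elementwise square, with bias correction:
\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
\]
\[
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
\]
where $g_t$ is the stochastic gradient and $\beta_1, \beta_2, \alpha, \epsilon$ are hyperparameters. The Barzilai-Borwein method chooses a step size from two successive iterates,
\[
\alpha_k = \frac{s_{k-1}^\top s_{k-1}}{s_{k-1}^\top y_{k-1}}
\quad\text{or}\quad
\alpha_k = \frac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}},
\]
with $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$, mimicking a quasi-Newton scaling using only two points.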
Implicit regularization
Neyshabur, B. (2017). Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953.
Smith, S. L., Dherin, B., Barrett, D. G., & De, S. (2021). On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176.
HaoChen, J. Z., Wei, C., Lee, J. D., & Ma, T. (2021, July). Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory (pp. 2315-2357). PMLR.
Prechelt, L. (2002). Early stopping-but when? In Neural Networks: Tricks of the Trade (pp. 55-69). Berlin, Heidelberg: Springer.
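As a brief orientation to the implicit-regularization results above (an informal paraphrase, not the papers' exact statements): backward error analysis shows that finite-step stochastic gradient descent approximately follows gradient flow on a modified loss. For SGD with learning rate $\epsilon$ and $m$ minibatch losses $\hat L_k$ per epoch, Smith et al. (2021) derive, to leading order,
\[
\tilde L(\theta) \approx L(\theta) + \frac{\epsilon}{4m}\sum_{k=1}^{m}\bigl\|\nabla \hat L_k(\theta)\bigr\|^2,
\]
so the implicit penalty discourages large minibatch gradient norms; HaoChen et al. (2021) show that the shape of the noise covariance, not just its scale, determines which minima this bias prefers.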
The relevance of Bayesianism
Wilson, A. G., & Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33, 4697-4708.
Mandt, S., Hoffman, M. D., & Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(134), 1-35.
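Again as an informal paraphrase of the cited result: Mandt et al. (2017) analyze constant-step-size SGD near a minimum as an Ornstein-Uhlenbeck process whose stationary distribution is Gaussian, with covariance $\Sigma$ solving (up to their notation)
\[
\Sigma A^\top + A \Sigma = \frac{\epsilon}{S}\, C,
\]
where $A$ is the Hessian of the loss at the optimum, $C$ the gradient-noise covariance, $\epsilon$ the learning rate, and $S$ the minibatch size; tuning $\epsilon$ and $S$ then lets this Gaussian approximate a Bayesian posterior.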