How to train your neural net

notes on machine learning and stochastic optimization

Author

Ben Powell for York MathSoc.

Published

November 26, 2025

Context


Predictive ability has exceeded theoretical understanding.

Models have huge numbers of parameters, a wide variety of architectures and are trained on colossal data sets.

Statisticians have been left behind, sifting through the rubble in a world that no longer makes sense…

KD6-3.7 in Blade Runner 2049.

Neural networks

Neural networks are functions built by composing multiple ‘layer’ functions, such as \[ \begin{align*} x_{l+1} = \sigma(W_l x_l + b_l) \end{align*} \] We will refer to the set of all of a model’s parameters, arranged into a vector, as \(\theta\).

Typically, the output of the last layer is intended to be a prediction \(x_L=\hat y\) from which we measure the distance to an observable quantity, e.g. \((y-\hat y)^2\).

Some types of layer function have become famous, e.g. convolutional layers, residual layers, transformer layers. They are stitched together to form an ‘architecture’. The residual layer, which serves to add a bit onto its argument, looks a bit like this \[ \begin{align*} x_{l+1} = x_l + V_l^T\sigma(W_l x_l + b_l). \end{align*} \]
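For concreteness, here is a minimal R sketch of these two layer types, assuming a ReLU activation and small, randomly initialized weights (the width, activation and initialization are illustrative choices only).

Code
# Minimal sketch of a dense layer and a residual layer (illustrative only).
set.seed(1)
sigma <- function(z) pmax(z, 0)               # ReLU activation (assumed)
d <- 4                                        # layer width (assumed)
W <- matrix(rnorm(d * d, sd = 0.5), d, d)     # weights W_l
b <- rnorm(d, sd = 0.1)                       # biases  b_l
V <- matrix(rnorm(d * d, sd = 0.5), d, d)     # weights V_l for the residual layer

dense_layer    <- function(x) drop(sigma(W %*% x + b))            # x_{l+1} = sigma(W_l x_l + b_l)
residual_layer <- function(x) x + drop(t(V) %*% sigma(W %*% x + b)) # x_{l+1} = x_l + V_l^T sigma(W_l x_l + b_l)

x <- rnorm(d)
dense_layer(x)
residual_layer(x)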

It is fun to think about how layer functions and architectures ought to be interpreted.

Historically, statisticians have been concerned with identifying and interpreting model parameters. In the context of machine learning, attention is focused entirely on prediction.

Diagram for visualizing a neural network produced by Ashish Patel (https://github.com/ashishpatel26)

The objective function

Typically our objective function is a negative log-likelihood function \[ \begin{align} f(\theta) = -\sum_i l(\theta;D_i) \end{align} \] which takes the form of a sum, whose terms correspond to notionally independent observations.

Often a ‘regularization’ or ‘penalty’ term is added. Some people identify this with the negative log of a Bayesian prior density.

For a regression problem we might have \[ \begin{align} D_i=\{x_i,y_i\} && l(\theta;D_i) = \log p(y_i \mid x_i,\theta) \end{align} \]

where \(x_i\) and \(y_i\) are covariate and response variables, respectively.
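As a concrete example, here is a minimal R sketch of this objective for a linear-Gaussian regression model, i.e. assuming \(p(y_i \mid x_i,\theta)\) is normal with mean \(\theta^T x_i\) and unit variance (the model and the simulated data are illustrative assumptions).

Code
# Negative log-likelihood f(theta) = -sum_i log p(y_i | x_i, theta)
# for an assumed linear-Gaussian model y_i ~ N(theta^T x_i, 1).
set.seed(1)
n_obs <- 100
x <- cbind(1, rnorm(n_obs))                 # covariates (with an intercept column)
theta_true <- c(1, 2)
y <- drop(x %*% theta_true) + rnorm(n_obs)  # simulated responses

f <- function(theta) -sum(dnorm(y, mean = drop(x %*% theta), sd = 1, log = TRUE))

f(theta_true)   # small objective value near the data-generating parameters
f(c(0, 0))      # larger objective value at a poor parameter setting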

Often \(D=\{D_1,D_2,\ldots\}\) is too big to fit in RAM. The production of very large data sets of images served as the catalyst for the most recent AI/ML activity.

The parameter vector \(\theta\), of dimension \(d_\theta\), is also thought to be very big. We assume vectors of length \(d_\theta\) can fit in RAM but matrices of size \(d_\theta \times d_\theta\) cannot.

Melted computer.

Derivatives of the objective function

First derivatives (gradients) point in directions in which the objective function increases. They are orthogonal to contours of the objective function. \[ \begin{align} g(\theta) = \frac{\partial}{\partial \theta} f(\theta) = - \sum_i \frac{\partial}{\partial \theta} l(\theta;D_i) \end{align} \] Second derivatives (hessians) describe the extent to which parameters are identified by data. \[ \begin{align} H(\theta) = \frac{\partial^2}{\partial \theta^2} f(\theta) = -\sum_i \frac{\partial^2}{\partial \theta^2} l(\theta;D_i) \end{align} \] The rank of a hessian (locally) describes the number of (linear combinations of) identifiable parameters. Ideas of robustness and parsimony incline us to prefer simpler models, which can be associated with low-rank hessians.

Functions whose hessians are positive semi-definite everywhere are ‘convex’ (positive definite hessians give strict convexity). There exists extensive theory for convex functions and their optimization.
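As a small numerical illustration, here is a sketch that approximates gradients and hessians by finite differences for a toy objective with one unidentified parameter direction (the toy function and step sizes are arbitrary choices).

Code
# Finite-difference gradient and hessian for a toy objective in which only
# the sum theta_1 + theta_2 is identified: f(theta) = (theta_1 + theta_2)^2.
f_toy <- function(theta) (theta[1] + theta[2])^2

num_grad <- function(f, theta, h = 1e-5) {
  sapply(seq_along(theta), function(j) {
    e <- replace(numeric(length(theta)), j, h)
    (f(theta + e) - f(theta - e)) / (2 * h)
  })
}
num_hess <- function(f, theta, h = 1e-4) {
  d <- length(theta)
  H <- matrix(0, d, d)
  for (j in 1:d) {
    e <- replace(numeric(d), j, h)
    H[, j] <- (num_grad(f, theta + e, h) - num_grad(f, theta - e, h)) / (2 * h)
  }
  (H + t(H)) / 2   # symmetrize
}

theta0 <- c(1, -0.5)
num_grad(f_toy, theta0)                 # proportional to (1, 1)
eigen(num_hess(f_toy, theta0))$values   # one eigenvalue is (numerically) zero: rank 1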


Many recent papers show that the machine learning/AI community is interested in the behaviour of gradients and hessians for their models.

Some notes on optimization

Gradient descent

Gradient descent is simple but still extremely useful. \[ \begin{align} \hat \theta_{t+1} \gets \hat \theta_{t} - \delta g(\hat \theta_{t}) \end{align} \] Let’s take a look at this algorithm in the context of a quadratic objective function \[ \begin{align} f(\theta) = \frac{1}{2} (\theta-\theta^*)^T H (\theta-\theta^*), && g = H (\theta-\theta^*). \end{align} \]

Code
n <- 32
xcoords <- seq(-1, 1, length = n) * 3
ycoords <- seq(-1, 1, length = n) * 2
grd <- as.matrix(expand.grid(xcoords, ycoords))
# Quadratic objective with a rotated hessian H (eigenvalues 4 and 1) and optimum at 0
theta <- pi / 4
Q <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
H <- Q %*% diag(c(4, 1)) %*% t(Q)
Hinv <- solve(H)
l_fun <- function(x) 0.5 * sum(x * (H %*% x))   # objective f
g_fun <- function(x) H %*% x                    # gradient g
z <- matrix(apply(grd, 1, l_fun), n, n)         # objective values on the grid, for contours

maxit <- 16
delta <- 0.2
xhatmat <- matrix(NA, maxit, 2)   # store the iterates for plotting
x0 <- c(-3, -1)
xhat <- x0
for (it in 1:maxit) {
  deltait <- delta / it^0.6       # decaying step size
  xhatmat[it, ] <- xhat
  g <- g_fun(xhat)                # + rnorm(2) to add gradient noise
  xhat <- xhat - deltait * g      # gradient descent step
}
contour(xcoords, ycoords, z, asp = 1)
points(0, 0, pch = 4)
points(xhatmat[, 1], xhatmat[, 2], col = 2, type = "o")

Tracking the iterations of the gradient descent algorithm.

Newton optimization

The Newton step \[ \begin{align} \hat \theta_{t+1} \gets \hat \theta_{t} - H^{-1} \, g(\hat \theta_{t}) \end{align} \] jumps straight to the optimum for a (precisely measured) quadratic objective function.

Code
xhatmat <- matrix(NA, maxit, 2)
xhat <- x0
delta <- 1
for (it in 1:maxit) {
  xhatmat[it, ] <- xhat
  g <- g_fun(xhat)
  xhat <- xhat - delta * Hinv %*% g   # Newton step: rescale the gradient by the inverse hessian
}
contour(xcoords, ycoords, z, asp = 1)
points(0, 0, pch = 4)
points(xhatmat[, 1], xhatmat[, 2], col = 2, type = "o")

Tracking the iterations of the Newton optimizer.

Newton optimization is really great, but

  • Full Newton step deals badly with noisy gradients.
  • \(H\) might not be invertible or positive definite.
  • \(H\) is potentially very big.
  • \(H\) might not exist.

Pensive heads.

Optimization with momentum

Introduce a velocity variable \(v\) and write

\[ \begin{align} v_t & \gets v_{t-1} - \delta \left[ \gamma v_{t-1} + g(\theta_t) \right],\\ \theta_{t+1} & \gets \theta_t + \delta v_t \end{align} \]

The gradient is understood as a force acting on an object with velocity \(v\). The quantity \(\gamma\) acts like a friction coefficient.

Momentum is a very popular and useful idea, and it is a distinguishing feature of algorithms like Adam, Adagrad, … It insulates us against rapidly changing gradients (which arise mostly from stochasticity in batch selection for gradient computation).

Code
xhatmat <- matrix(NA, maxit, 2)
xhat <- x0
v <- c(0, 0)      # velocity
gamma <- 0.4      # friction coefficient
delta <- 0.4
for (it in 1:maxit) {
  deltait <- delta / it^0.6
  xhatmat[it, ] <- xhat
  g <- g_fun(xhat)                     # + rnorm(2) to add gradient noise
  v <- v - deltait * (gamma * v + g)   # gradient acts as a force on the velocity
  xhat <- xhat + deltait * v
}
contour(xcoords, ycoords, z, asp = 1)
points(0, 0, pch = 4)
points(xhatmat[, 1], xhatmat[, 2], col = 2, type = "o")

Tracking a momentum-augmented gradient descent algorithm with an arbitrary friction coefficient.

Specifying \(\gamma=2H^{1/2}\) causes the parameter vector to feel more friction in directions with high curvature. This slows the parameter vector down if gradients are changing rapidly. The choice is informed by consideration of the critically damped harmonic oscillator. \[ \begin{align} v_{t} & \gets v_{t-1} - \delta \left[ 2H^{1/2} v_{t-1} + g(\theta_t) \right]\\ \theta_{t+1} & \gets \theta_{t} + \delta v_t \end{align} \] Note that the recursion above is stable iff \(\text{eig}(\delta^2 H) \subset (0,1)\).

Code
eigH <- eigen(H)
Hsqrt <- eigH$vectors %*% diag(eigH$values^0.5) %*% t(eigH$vectors)   # H^{1/2}
xhatmat <- matrix(NA, maxit, 2)
xhat <- x0
v <- c(0, 0)
delta <- (1 / sum(diag(H)))^0.5   # step size chosen with the stability condition in mind
for (it in 1:maxit) {
  deltait <- delta / it^0.6
  xhatmat[it, ] <- xhat
  g <- g_fun(xhat)
  v <- v - deltait * (2 * Hsqrt %*% v + g)   # friction coefficient gamma = 2 H^{1/2}
  xhat <- xhat + deltait * v
}
contour(xcoords, ycoords, z, asp = 1)
points(0, 0, pch = 4)
points(xhatmat[, 1], xhatmat[, 2], col = 2, type = "o")

Tracking momentum-augmented gradient descent with the sqrt hessian as the friction coefficient.

Quasi-Newton optimization methods

Quasi-Newton methods involve approximating the hessian (and/or its inverse) from observed changes in gradient (see, for example, BFGS). The justification for these methods begins with consideration of the Taylor expansion of a function’s gradient \[\begin{align} y_t \approx H s_t \end{align}\] where \[\begin{align} y_t = g(\theta_{t})-g(\theta_{t-1}), && s_t= \theta_{t}-\theta_{t-1} \end{align}\] which is known as the secant equation. Most quasi-Newton methods have finite memory versions that avoid computing or storing the full hessian.

For our purposes it is useful to entertain the rank-one approximation \[\begin{align} \hat H = aa^T, && \hat H^{1/2} = (a^Ta)^{-1/2} aa^T \end{align}\] and to adjust \(a\) at each iteration using \[\begin{align} a \gets a + \delta \left [ (y_t-\hat Hs_t) s_t^T a + s_t (y_t-\hat Hs_t)^T a \right]. \end{align}\]
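A minimal R sketch of this rank-one update, reusing g_fun from the quadratic example above; the probing steps, step sizes and iteration count are arbitrary illustrative choices, and \(aa^T\) can at best capture curvature in the directions that have been explored.

Code
# Fit a rank-one approximation H_hat = a a^T to the curvature of the quadratic
# example, using observed gradient changes y_t and steps s_t (cf. the secant equation).
set.seed(2)
a <- c(1, 0)                       # initial guess for a (arbitrary)
delta_a <- 0.02                    # learning rate for a (arbitrary)
theta_old <- c(-3, -1)
g_old <- g_fun(theta_old)
for (it in 1:2000) {
  theta_new <- theta_old + rnorm(2, sd = 0.5)   # random probing steps, for illustration
  s <- theta_new - theta_old
  y <- c(g_fun(theta_new) - g_old)
  r <- y - sum(a * s) * a                       # residual y_t - H_hat s_t
  a <- a + delta_a * (r * sum(s * a) + s * sum(r * a))
  theta_old <- theta_new
  g_old <- g_fun(theta_new)
}
a %*% t(a)   # rank-one summary of the curvature; compare with H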

Note that

  • We avoid inversion of the hessian.
  • We only really need to know about the curvature in the direction of the velocity.
  • Repeated evaluation of the gradients serves to average out the noise… what noise?
Code
xhatmat <- matrix(NA, maxit, 2)
xhat <- x0
v <- c(0, 0)
delta <- 0.1
for (it in 1:maxit) {
  xhatmat[it, ] <- xhat
  g <- g_fun(xhat)
  if (it > 1) {
    y <- c(g - g_old)                                # observed change in gradient, y_t
    d <- y * (delta * sum(y * v) / sum(y * y))^0.5   # curvature-informed damping along y_t
    v <- v - d
  }
  v <- v - delta * g
  g_old <- g
  xhat <- xhat + v   # the step s_t is the current velocity
}
contour(xcoords, ycoords, z, asp = 1)
points(0, 0, pch = 4)
points(xhatmat[, 1], xhatmat[, 2], col = 2, type = "o")

Stochastic optimization

Consider now the SDE known as the equation for ‘kinetic Langevin dynamics’ \[ \begin{align} dv_t =& - \gamma v_t dt - M^{-1} \nabla U(\theta_t) dt + \sqrt{2\gamma M^{-1}} dW_t \\ d\theta_t =& v_t dt, \end{align} \] with \(W_t\) a Brownian motion, or Wiener process. It can be shown that this Markov process has stationary distribution proportional to \(\exp(-U(\theta)-v^TMv/2)\).

There exist multiple ways to discretize an SDE; the Euler-Maruyama method is arguably the simplest. It tells us how to construct a discrete-time Markov process whose behaviour converges, in the limit as the step size \(\delta \rightarrow 0\), to the continuous-time process described by the SDE.

We arrive at… \[ \begin{align} v_{t} \gets & v_{t-1} - \delta \left[ \gamma v_{t-1} +M^{-1} \nabla U(\theta_{t}) + \sqrt{2\gamma \delta^{-1}M^{-1}} \, \epsilon_{t} \right] \\ \theta_{t+1} \gets & \theta_{t} + \delta v_{t} \end{align} \] where the \(\epsilon_t\) are unit normal random variables.
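A minimal R sketch of this update for the quadratic objective, reusing H, g_fun, Hinv and x0 from the earlier code; here M = I, \(\nabla U = \) g_fun, and the values of \(\gamma\), \(\delta\) and the number of steps are arbitrary illustrative choices.

Code
# Euler-Maruyama discretization of kinetic Langevin dynamics, with M = I (assumed)
# and nabla U = g_fun, so the target density is proportional to exp(-f(theta)).
set.seed(3)
gamma <- 1
delta <- 0.05
n_steps <- 5000
theta <- x0
v <- c(0, 0)
samples <- matrix(NA, n_steps, 2)
for (it in 1:n_steps) {
  eps <- rnorm(2)   # unit normal shocks
  v <- v - delta * (gamma * v + g_fun(theta) + sqrt(2 * gamma / delta) * eps)
  theta <- theta + delta * v
  samples[it, ] <- theta
}
colMeans(samples)   # roughly 0, the optimum
var(samples)        # roughly Hinv, the curvature-determined spread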

Compare this with the noisy form of our update equations \[ \begin{align} v_{t} \gets & v_{t-1} - \delta \left[ \gamma v_{t-1} + g_{t} + e_t \right] \\ \theta_{t+1} \gets & \theta_{t} + \delta v_{t} \end{align} \] We see that the expressions coincide if \[ \begin{align} M^{-1} \nabla U(\theta_{t}) = g, && \sqrt{2\delta^{-1} \gamma M^{-1}} \, \epsilon_{t} = e_t \end{align} \] It follows that, in the limit of small \(\delta\), we are actually sampling from a distribution whose negative log-density has gradient \[ \begin{align} \nabla U = M g = \frac{2}{\delta} \, \gamma \, \text{var}^{-1}(e) \, g \end{align} \] i.e. the distribution of points the optimizer visits is affected by the gradients, the noise, the friction coefficient and the step size.

We note that:

  • Two functions whose gradients are the same up to multiplication by a positive definite matrix have the same optima. The premultiplying matrix may affect which optima we head to and how.
  • Large \(M\) means that the peaks of the distribution are very high. Random behaviour is suppressed.
  • Small step size \(\delta\) and small error variance increase \(M\).
  • Randomness is suppressed further if we align \(\gamma\) with \(\text{var}(e)\).
  • If we set \(M=I\), then our noisy optimizer samples from the posterior density for \(\theta\).

Is the suppression of random behaviour desirable?

  • Noise can knock us out of local minima.
  • Noise can steer us away from very narrow minima, which sometimes correspond to non-robust parameter settings.
  • Randomly searching for ‘good’ parameter values is slow.

Is optimizing the optimizer a good idea?

  • Overfitting can be avoided if we optimize the model badly or only for a short time (i.e. early stopping).
Code
eigH <- eigen(H)
Hsqrt <- eigH$vectors %*% diag(eigH$values^0.5) %*% t(eigH$vectors)
maxit <- 512
xhatmat <- matrix(NA, maxit, 2)
xhat <- x0
v <- c(0, 0)
delta <- 0.3
for (it in 1:maxit) {
  deltait <- delta / it^0   # constant step size
  xhatmat[it, ] <- xhat
  g <- g_fun(xhat) + rnorm(2, 0, 0.5)        # noisy gradient
  v <- v - deltait * (2 * Hsqrt %*% v + g)
  xhat <- xhat + deltait * v
}
contour(xcoords, ycoords, z, asp = 1)
points(0, 0, pch = 4)
points(xhatmat[, 1], xhatmat[, 2], col = rgb(1, 0, 0, 1 / maxit^0.2), type = "o")

Tracking momentum-augmented gradient descent with noisy gradients.

Future work

Theoretical work

Push calculations out of the ‘parameter space’ and into the ‘data space’.

  • The variance of the gradients affects the speed and stability of convergence.
    • Can/should we select batches of negatively correlated data to reduce the variance of their cumulative gradient?
  • We ought to understand where the low rank of the hessians comes from.
    • Project gradients onto a subset of reference gradients?

Progress in AI/ML will (probably) be made both at the computational and conceptual levels.

Applied work

Bioimaging project

Resources

Visualizing the loss landscape