How to #4: Multiple Shooting Training with DiffeqFlux.jl

Adam Davis
5 min read · Feb 1, 2023
Jackson Pollock — Number 1, 1949

“There is no accident, just as there is no beginning and no end” — J. Pollock

In this part 4 of the How To series, I’m diving deep into the topic of multiple shooting. In multiple shooting, the training data is split into overlapping intervals, and the model is trained on all intervals in tandem while penalizing mismatches at the overlaps. This can be extremely helpful when the data shows visual evidence of oscillation. Here it is fit to the classic Hare and Lynx population dataset to see if we can capture the oscillation and predict what the population of each species could be over time.

DiffeqFlux is an extremely useful library for fitting neural networks embedded in differential equation solvers, which in turn approximate data. Multiple shooting is part of DiffeqFlux and works with these solvers. It is helpful because it allows the optimizer to escape local minima and to approximate data that is nonlinear or noisy.

DATASET:

This started with needing a better way to approximate data using Julia. I had read the attached paper by Steven Frank on using multiple shooting to forecast future population levels. He wrote his own loss functions and most of the code, using DiffeqFlux only to solve the model. I am using an example of multiple shooting for a simple ODE, but instead of generating data from a known ODE, I’m using the Hare and Lynx dataset, which is attached at the end of this post.

Populations from 1845 to 1935

The populations don’t always follow the same rules, so there is a bit of noise in the data. Smoothing is a great way to deal with some of that noise. I followed Steven’s lead in first taking the log of the data and de-meaning it. Each series is then fit with a cubic spline for a little more smoothness. The model is trained on the first 60 years/time steps of the data and tested later against the last 30 years, which form the test set.
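A minimal sketch of that preprocessing, assuming the raw counts live in a vector called `hare` (the same steps would apply to `lynx`) and using DataInterpolations.jl for the spline — neither of which is confirmed by the original code:

```julia
using DataInterpolations, Statistics

# `hare` is assumed: raw yearly counts for one species.
log_hare = log.(hare)                    # log-transform
log_hare = log_hare .- mean(log_hare)    # de-mean, centering the series at zero

# Fit a cubic spline so the series can be sampled at finer time steps.
years = collect(0.0:1.0:(length(log_hare) - 1))
spline = CubicSpline(log_hare, years)

tsteps = 0.0:0.25:59.0        # 3-month steps across the 60 training years
smooth_hare = spline.(tsteps)
```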

Not all functions will be shown here as they are rather long. A link to the code is at the end as well as the dataset and Mr. Frank’s paper.

An important aspect of DiffeqFlux to remember is that you don’t need a known ODE to generate your data. If you already have data, it can be handled directly in the loss function.
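Below is a minimal sketch of such a loss using DiffeqFlux’s documented `multiple_shoot` helper. Here `ode_data` (the smoothed 2×N training matrix) and `prob_node` (the neural ODE problem, defined in the next snippet) are assumed names, and the `group_size` and `continuity_term` values are guesses rather than the author’s exact settings:

```julia
using DiffEqFlux, OrdinaryDiffEq

group_size = 5          # points per shooting interval (my guess)
continuity_term = 10.0  # penalty weight at the interval joins (also a guess)

# Pointwise loss between the data in a group and that group's prediction.
loss_function(data, pred) = sum(abs2, data .- pred)

# multiple_shoot splits ode_data/tsteps into overlapping groups, solves the
# ODE problem on each group, and adds continuity_term times the mismatch
# where consecutive groups overlap.
function loss_multiple_shoot(p)
    loss, _ = multiple_shoot(p, ode_data, tsteps, prob_node, loss_function,
                             AutoTsit5(Rosenbrock23()), group_size;
                             continuity_term = continuity_term)
    return loss
end
```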

Above, the group_size is the number of data points assigned to each separate shooting interval in the multiple shooting objective. The continuity_term is the weight of the penalty applied to the mismatch between adjacent intervals at each step of the optimization; it is smaller in this case because there are many groups, and many small penalties could add up to one large one and keep the model from converging optimally. The solver used is AutoTsit5(Rosenbrock23()), with the default tolerances left as-is.

The neural network architecture:
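A sketch of a plausible architecture using Lux.jl; the layer widths and activation are my guesses, and `ode_data` and `tsteps` come from the earlier snippets:

```julia
using Lux, ComponentArrays, OrdinaryDiffEq, Random

rng = Random.default_rng()

# Two inputs and two outputs: the (log-scaled, smoothed) hare and lynx levels.
nn = Lux.Chain(Lux.Dense(2, 32, tanh),
               Lux.Dense(32, 32, tanh),
               Lux.Dense(32, 2))
p_init, st = Lux.setup(rng, nn)

# The ODE right-hand side is the network itself; u0 is the first data point.
dudt(u, p, t) = first(nn(u, p, st))
u0 = ode_data[:, 1]
prob_node = ODEProblem(dudt, u0, (tsteps[1], tsteps[end]), ComponentArray(p_init))
```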

The nn is initialized with random parameters that will be trained later on. The training functions are as follows:
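A minimal sketch of such a training loop with Optimization.jl; the Adam learning rate and iteration count are assumptions:

```julia
using Optimization, OptimizationOptimisers, Zygote

# Differentiate the multiple shooting loss with Zygote.
adtype = Optimization.AutoZygote()
optf = OptimizationFunction((p, _) -> loss_multiple_shoot(p), adtype)
optprob = OptimizationProblem(optf, ComponentArray(p_init))

res = Optimization.solve(optprob, Adam(0.05); maxiters = 300)
p_opt = res.u   # the parameters that reached the minimum training loss
```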

This optimization problem and solver train the model.

RESULTS:

The results of the solver can be visualized as follows.
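A minimal sketch, assuming Plots.jl and the names from the earlier snippets; `saveat = tsteps` stands in for what the author calls “set_tsteps”:

```julia
using Plots

# Re-solve the neural ODE over the training window with the trained
# parameters, saving the prediction at every data time step.
prob_opt = remake(prob_node; p = p_opt)
pred = solve(prob_opt, AutoTsit5(Rosenbrock23()); saveat = tsteps)

plot(tsteps, ode_data[1, :]; label = "Hare (data)")
plot!(pred.t, Array(pred)[1, :]; label = "Hare (NN)")
```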

The solver is run with the optimal parameters that produced the minimum loss during training. The “set_tsteps” term tells the new solver to save the prediction at every time step, i.e., each year for which data was provided.

The training set for the model is the first sixty years, solved at 3-month timesteps.

NN approximation of training set for Hare
NN approximation for Lynx

The solver was able to approximate the data fairly well. It is not perfect, but many more iterations might also help the training along. As for the predicted levels:

Trained and forecasted Hare
Trained and forecasted Lynx

We can see that the model reproduced a fairly accurate approximation of the next thirty years of data. This used a roughly 66/33 percent train/test split to allow a longer prediction window. It is visible in the first 60 years that the data is not uniform: in certain years the populations rise faster, fall slower, or lack the sharp drop-off seen in other years. Additional data, perhaps on climate or notable local events, might help explain why the populations don’t always line up together. The model learns the general ebb and flow of the data. After training, the optimal parameters for the nn are passed to a solver to make the predictions shown above.
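A sketch of how that forecast might be produced, reusing the names from the earlier snippets; the extended span (90 years total) and save points are my guesses:

```julia
# Keep the trained parameters fixed and extend the integration span past the
# 60 training years to cover the 30 held-out test years.
prob_forecast = remake(prob_node; p = p_opt, tspan = (0.0, 89.0))
forecast = solve(prob_forecast, AutoTsit5(Rosenbrock23());
                 saveat = 0.0:0.25:89.0)

plot(0.0:0.25:89.0, Array(forecast)[1, :]; label = "Hare (trained + forecast)")
```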

CONCLUSION:

This was an example of how to use common multiple shooting algorithms fit to custom data. We will not always know that a relationship is present in our data; often we would like to find out whether there is one. Using neural networks to solve for differential equations can greatly improve the accuracy of our approximations. The algorithms and solvers here would suit other data as well. The first step is to know the data well enough to decide whether this is the right procedure to use. After that, the sky’s the limit.
