SDE model calibration through neural posterior estimation

Introduction

Last time, I discussed how the differentiability of JAX programs facilitates model calibration when calculating derivatives by hand would be extremely tedious, for instance when the model involves the numerical integration of a stochastic differential equation. While this method provides point estimates of the parameters that minimize a given loss function, it does not quantify the uncertainty of those estimates. Getting such error bars is important to assess the quality of the calibration, and to decide whether the model actually explains the observed data. In the example of option pricing, this concretely means that we can detect mispricing for statistical arbitrage purposes.

In this post, I will be exploring a machine learning approach to this problem.

Method

Mathematically, a model of a data-generating process can be roughly seen as a mapping

\[ F: \theta \mapsto X \]

from parameters \(\theta\) to random variables \(X\) that we can potentially observe. We choose random variables rather than real numbers to account for uncertainty, both epistemic and aleatoric. In the Heston model example, \(\theta\) represents the list of parameters of the system of stochastic differential equations, and \(X\) the option prices. The main source of uncertainty is the realization of the Brownian motion in the SDE.

Given a loss function \(L\) that penalizes discrepancies between the predictions \(F(\theta)\) and the observed data \(x_{\mathrm{observed}}\), calibration can be done as an optimization problem,

\[ \hat{\theta} = \mathrm{argmin}_\theta \, \mathbb{E}\left[ L\left( F(\theta), x_{\mathrm{observed}}\right)\right], \]

where the expectation is taken over the random variable \(X = F(\theta)\). In practice, it can be estimated with Monte Carlo methods.

If we want parameter estimates with quantified uncertainty, ideally we should aim for a conditional distribution \(P(\theta \vert x_{\mathrm{observed}})\). In Bayesian statistics, such a distribution can be constructed by conditioning a joint distribution \(P(\theta, X)\) on \(X\) and plugging the data in,

\[ P(\theta \vert x_{\mathrm{observed}}) = P(\theta \vert X)\vert_{X = x_{\mathrm{observed}}}. \]

Unsurprisingly, the conditioning

\[ P(\theta \vert X) = \frac{P(\theta, X)}{P(X)} = \frac{P(\theta)P(X \vert \theta)}{P(X)} \]

is the tricky part, and not just because of the high-dimensional integration involved in the computation of the marginal \(P(X)\), but also because a model \(F: \theta \mapsto X\) very often has an intractable likelihood function. This happens when the density of the random variable \(X=F(\theta)\) is hard (or even impossible) to derive. On the other hand, we can usually draw samples from this random variable, and use that to generate synthetic data. Concretely, we draw parameters from the prior, \(\theta_i \sim P(\theta)\), and simulate observations from the model of the data-generating process, \(x_i \sim X_i = F(\theta_i)\), for a certain number of times \(i=1,\dots,N\), so that \((\theta_i, x_i) \sim P(\theta, X)\) are a sample of the joint distribution. This assumes that one can easily sample from the prior \(P(\theta)\).

A family of algorithms in simulation-based inference then leverage conditional density estimation to directly estimate the posterior density from this synthetic data, for instance with conditional masked autoencoders. A distribution \(q_{nn}(\theta \vert X)\) parameterized by such a neural network can be trained on the synthetic data through maximization of the log-likelihood to become a good approximation of the desired posterior distribution.

Implementation

For the Heston model, we can reuse the code from the previous post to generate synthetic observations, but any implementation of a sampler of \(x_i \sim X_i = F(\theta_i)\) would work.

Code

from diffrax import (diffeqsolve, ControlTerm, Euler, MultiTerm,
ODETerm, SaveAt, VirtualBrownianTree)

def drift(t, y, args):
    S, V = y
    r, V0, k, theta, xi, rho = args
    d_y = jnp.array([r * S, k * (theta - V)])
    return d_y

def diffusion(t, y, args):
    S, V = y
    V = jnp.maximum(V, 0.001)
    r, V0, k, theta, vol_of_vol, rho = args
    mixing = jnp.array([[1., 0.],
                        [rho, jnp.sqrt(1-rho**2)]])
    scale = jnp.diag(jnp.array([jnp.sqrt(V)*S,
                                jnp.sqrt(V)*vol_of_vol]))
    return scale @ mixing

def simulate_heston_process(args, saveat, t0, t1, key, dataset_size=100):
    @jit
    @vmap
    def get_data(key):
        bm_key, vol_key = jr.split(key, 2)

        bm = VirtualBrownianTree(t0, t1, tol=1e-3, shape=(2,), key=bm_key)
        drift_term = ODETerm(drift)
        diffusion_term = ControlTerm(diffusion, bm)
        terms = MultiTerm(drift_term, diffusion_term)
        solver = Euler()
        dt0 = 0.005

        r, V0, k, theta, sigma, rho = args
        y0 = jnp.array([1.0, V0])

        sol = diffeqsolve(
            terms, solver, t0, t1, dt0, y0, saveat=saveat, args=args
        )
        return sol.ts, sol.ys


    data_key = jr.split(key, dataset_size)

    ts, ys = get_data(data_key)
    return ts, ys

@jit
def compute_call_price(spot_price, args, strikes,
                       maturities, key, sample_size=512):

    ts, ys = simulate_heston_process(args,
                                     SaveAt(ts=maturities),
                                     0.,
                                     2.,
                                     key,
                                     dataset_size=sample_size)

    spot_price_at_maturity = spot_price*ys[...,0]
    r, V0, k, theta, vol_of_vol, rho = args
    discount = jnp.exp(-r*maturities)

    simulated_payoffs = jnp.maximum(
        spot_price_at_maturity[:,None] - strikes[...,None], 0)

    simulated_calls = discount * jnp.mean(simulated_payoffs,
                                          axis=0)
    return simulated_calls

We will reuse the example of the previous post. For more flexibility, this market data could be introduced as further conditional parameters in the autoregressive network, but we will keep it simple for the sake of the example.

spot_price = 123.28
strikes = jnp.array([115., 120., 125., 130., 135., 140., 145])

maturity_dates = [datetime.strptime(s, '%Y;%m;%d') for s in
                  ['2015;8;21', '2015;9;18', '2015;12;18',
                   '2016;4;15', '2016;6;17', '2017;1;20']]
settle_date =datetime.strptime('2015;7;10', '%Y;%m;%d')
maturities = jnp.array([(md.date() - settle_date.date()).days/365
                        for md in maturity_dates]) #time to maturity in years

market_calls = jnp.array([
 [9.95, 10.63, 12.84, 15.1, 15.95, 20.0],
 [6.3, 7.2, 9.9, 12.3, 13.57, 17.5],
 [3.6, 4.55, 7.3, 9.7, 11.15, 15.2],
 [1.82, 2.68, 5.3, 7.7, 9.0, 13.2],
 [0.82, 1.45, 3.7, 5.85, 7.2, 11.27],
 [0.36, 0.77, 2.5, 4.48, 5.76, 9.65],
 [0.15, 0.38, 1.7, 3.44, 4.54, 8.1]])

The workflow discussed here does not leverage the differentiability of the JAX implementation (some more involved variations do in order to guide a more efficient sampling of synthetic data), but the jax.vmap function makes it very easy to parallelize the generation of samples.

Specifying the prior distribution of the parameters can be done with the JAX backend of TensorFlow-Probability.

prior_dist = tfd_jax.JointDistributionSequential([
    # rate r
    tfd_jax.Normal(0.0, 0.05, name='prior_rate'),
    # initial volatility V0
    tfd_jax.LogNormal(jnp.log(0.05), 1., name='prior_V0'),
    # reversion rate k
    tfd_jax.Normal(.5, .2, name='prior_k'),
    # long variance theta
    tfd_jax.LogNormal(jnp.log(0.05), 1., name='prior_theta'),
    # volatility of volatility vol_of_vol
    tfd_jax.LogNormal(jnp.log(0.2), 1., name='prior_vol_of_vol'),
    # correlation rho
    tfd_jax.Normal(0., 0.7, name='prior_rho')
])

The generation of parameters and observations is wrapped in a single function along with a filter to remove the failed simulations. Sometimes the diffrax integration fails to converge.

def simulate_training_data(target_training_size, key):
    # target_training_size is the number of parameters drawn, and the number
    # of corresponding observations generated. Some of the simulations will
    # fail and will have to be filtered out.

    param_key, data_key = jr.split(key)

    # draw the parameters from the prior
    simulated_params = prior_dist.sample(seed=param_key,
                                         sample_shape=(target_training_size,))

    # simulate the observations from the model, with a vmap on the
    # parameter argument
    simulated_prices = vmap(compute_call_price,
                            in_axes=(None, 0, None, None, None))(
                                spot_price,
                                simulated_params,
                                strikes,
                                maturities,
                                data_key)

    # collect the parameters and synthetic observations into arrays
    simulated_params = jnp.transpose(jnp.array(simulated_params))
    simulated_prices = jnp.array(simulated_prices).reshape(
      target_training_size, -1)

    # filter out the failed SDE integrations
    filter = ~jnp.any(jnp.isnan(simulated_prices), axis=1)
    simulated_params = simulated_params[filter]
    simulated_prices = simulated_prices[filter]

    return simulated_params, simulated_prices


seed = 123
key = jr.PRNGKey(seed)

theta, X = simulate_training_data(20000, key)

Masked autoencoders are also implemented in TensorFlow-Probability, but only in the original version with a TensorFlow backend. This will require to convert the JAX arrays to TensorFlow tensors, and back if we want to reuse JAX on the output of the learned surrogate model.

An autoregressive autoencoder is not constrained by default, so the transformed distribution would have the six-dimensional space \(\mathbb{R}^6\) as a support. Some of the parameters are strictly positive though, therefore chaining the autoregressive flow with constraining bijectors is necessary to obtain the proper support. Note that passing conditional inputs to nested bijectors requires naming them and specifying it in the bijector_kwargs arguments of the distribution methods (such as sample and log_prob).

# Conditional density estimation with MADE.
made = tfb.AutoregressiveNetwork(
  params=2,
  hidden_units=[32, 32],
  event_shape=(6,),
  conditional=True,
  kernel_initializer=tfk.initializers.VarianceScaling(0.01, seed=123),
  conditional_event_shape=(42,)
)

bijs = [tfb.Identity(),
        tfb.Exp(),
        tfb.Identity(),
        tfb.Exp(),
        tfb.Exp(),
        tfb.Identity()]

constraining_bij = tfb.Blockwise(bijs, block_sizes=6*[1],
                                   name='constraining_bij')

# Name the autoregressive flow to properly pass the
# conditional arguments:
maf = tfb.MaskedAutoregressiveFlow(made, name='maf')

# Transformed distribution parameterized by a neural network,
# q_nn(theta | X).
parameterized_distribution = tfd.TransformedDistribution(
  distribution=tfd.Sample(tfd.Normal(loc=0., scale=1.),
                          sample_shape=[6]),
  bijector=tfb.Chain([constraining_bij, maf]))

The training of the autoregressive flow can be implemented with keras, with a loss function defined as the negative log-likelihood of the conditional distribution.

# Construct and fit a model.

# Placeholder for the observations X.
X_ = tfkl.Input(shape=(X.shape[-1],), dtype=tf.float32)

# Placeholder for the parameters theta.
theta_ = tfkl.Input(shape=(theta.shape[-1],), dtype=tf.float32)

# Conditional likelihood q_nn(theta | X).
log_prob_ = parameterized_distribution.log_prob(
  theta_, bijector_kwargs={'maf': {'conditional_input': X_}})

# Putting everything together.
model = tfk.Model([theta_, X_], log_prob_)

model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.0001),
              loss=lambda _, log_prob: -log_prob)

batch_size = 200

theta = tf.convert_to_tensor(theta)
X = tf.convert_to_tensor(X)

history = model.fit(x=[theta, X],
                    y=np.zeros((X.shape[0], 0), dtype=np.float32),
                    batch_size=batch_size,
                    epochs=70,
                    steps_per_epoch=X.shape[0] // batch_size,
                    shuffle=True,
                    verbose=True)

Once the surrogate distribution is trained to emulate posteriors with the synthetic data, it can be conditioned on the observed data to generate a sample of the approximate posterior of interest.

posterior_params = parameterized_distribution.sample(
    5000,
    bijector_kwargs={'maf': {'conditional_input': market_calls.reshape(-1)}}
    )

Exploring marginal density plots reveals how well the model fits the data. Tighter marginals mean better fit.

/usr/local/lib/python3.10/dist-packages/arviz/utils.py:184: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  numba_fn = numba.jit(**self.kwargs)(self.function)

Finally, these parameters drawn from the approximate posterior are passed to the simulator model to generate a sample of the posterior predictive distribution of call option prices.

def generate_posterior_predictive_prices(posterior_params, key):
    posterior_prices = vmap(compute_call_price,
                            in_axes=(None, 0, None, None, None))(
                                spot_price,
                                jnp.array(posterior_params),
                                strikes,
                                maturities,
                                key)
    filter = ~jnp.any(jnp.isnan(posterior_prices), axis=(1,2))
    posterior_prices = posterior_prices[filter]
    return posterior_prices

This sample can be used to compute a Monte Carlo estimate of the corresponding cumulative distribution function, with which we can assess how well the individual prices are explained by the calibrated model.

Posterior predictive quantiles of the observed prices:
[[0.998 0.827 0.629 0.639 0.719 0.844]
 [0.978 0.716 0.611 0.648 0.785 0.847]
 [0.92  0.701 0.536 0.599 0.781 0.842]
 [0.876 0.649 0.53  0.604 0.773 0.846]
 [0.819 0.607 0.497 0.532 0.768 0.832]
 [0.817 0.642 0.463 0.498 0.777 0.827]
 [0.888 0.711 0.448 0.484 0.773 0.812]]

Ideally, the values of this matrix should be close to 0.5. When they are close to 0 or 1, it means that the observed prices fall in the tails of the posterior predictive distribution and can be considered as outliers under the probabilistic pricing model.

Discussion

Machine learning models used to be criticized for their poor performance when it comes to probability estimation or quantification of uncertainty. This was arguably an obstacle to widespread adoption of machine learning in applications where quantifying risks with well-calibrated models is important, such as finance, healthcare, or engineering. Recent research in generative AI has led to the development of machine learning models of valid probability distributions, namely to generate samples from these distributions. They have unlocked powerful applications in scientific computing in problems of simulation-based inference. The example presented in this post illustrates the benefits that these probability-aware machine learning algorithms can bring to industry.

Firstly, we saw that paired with simulators, they allow to put error bars on inferred parameters, which makes it easier to assess model calibration and detect outliers in complex distributions.

Secondly, neural posterior estimation keeps the simulator model and the inference neural network separated, making it possible to increase the complexity of the model when it fails to capture certain phenomena in the data-generating process. All you have to do then is increase the dimensionality of the autoregressive flow and generate more synthetic data. In the option-pricing example, standard methods revolve around a geometric brownian motion to keep computations tractable, even though more complex processes such as the Heston model capture more characteristics of actual market prices. With a decoupled approach, there is no reason to favor a more simple model if the bias it introduces is too important.

Thirdly, in situations with sparse data, due to the Bayesian nature of neural posterior estimation, one can introduce prior knowledge about the parameters in the prior distribution, which would regularize the inference even if there weren’t enough data points for standard methods.

Lastly, the computationally expensive part of neural posterior estimation is the generation of synthetic data and the learning of the surrogate posterior \(q_{nn}(\theta \vert X)\), which both can be done in advance, and at inference time, once the data is observed, all that remains to do is to call the autoregressive flow once in parallel to generate a sample of the posterior. This is in general faster than a gradient-descent-based estimation of model parameters.