site_id | n_pats | n_pdevs | |
---|---|---|---|
0 | site_0 | 9 | 15 |
1 | site_1 | 10 | 14 |
2 | site_2 | 15 | 18 |
3 | site_3 | 9 | 10 |
4 | site_4 | 5 | 4 |
Deviations from the approved protocol of a clinical trial are common but need to be reported by investigators to the trial sponsor for review. Failing to do so can put patients at risk and affect the scientific quality of the trial, so detecting protocol deviation underreporting is part of the standard quality assurance activities. Investigator sites that report more deviations than their peers should also raise an alarm as this might indicate underlying quality issues. A statistical tool that quantifies the risk of over- and underreporting of protocol deviations could significantly improve quality activities, much like in the case of adverse event reporting.
The number of protocol deviations \(n_{pdevs}\) reported by an investigator site should depend linearly on the number of patients \(n_{pats}\) enrolled at that site, as more patients mean more chances of deviations, so these are the minimal attributes that should get collected prior to the analysis. The following table is a sample of the data we will use as an example.
A quick glance at the full dataset reveals there is indeed a relationship that looks linear, so a potential approach would be to build a regression model \(n_{pdevs} \sim \theta \cdot n_{pats}\) and quantify how every observation of \(n_{pdevs}\) deviates from its estimation.
From that scatterplot, it also appears that the residuals of a regression would not be iid. Rather, their variance would grow as the number of patients increases. This rules out least square regression, as it assumes iid normal residuals. Since we are dealing with count data, working with the Poisson distribution is a natural approach,
\[n_{pdevs} \vert n_{pats} \sim Poi(\lambda(n_{pats})),\]
and we can set \(\lambda(n_{pats}) = \theta \cdot n_{pats}\) to reflect our assumption of a linear relationship between \(n_{pdevs}\) and \(n_{pats}\). Note that a regular Poisson regression would fail to capture that linear relationship due to its exponential link function.
In this model, we immediately have \(E\left[n_{pdevs} \vert n_{pats}\right] = Var\left[n_{pdevs} \vert n_{pats}\right] = \theta \cdot n_{pats}\), which seems to reproduce the increasing spread of \(n_{pdevs}\).
We can infer the value of \(\theta\) through maximum likelihood estimation and use the resulting conditional Poisson model at each site to compute the cumulative distribution function (CDF) of the observed numbers of protocol deviations.
Code
def loss(par, n_pat, n_dev):
= tf.math.exp(par[0])
theta = tfd.Poisson(n_pat * theta)
dist return -tf.reduce_sum(dist.log_prob(n_dev))
def compute_cdf(par, n_pat, n_dev):
= tf.math.exp(par[0])
theta = tfd.Poisson(n_pat * theta)
dist return dist.cdf(n_dev)
@tf.function
def loss_and_gradient(par, n_pat, n_dev):
return tfp.math.value_and_gradient(lambda par: loss(par, n_pat, n_dev), par)
def fit(n_pat, n_dev):
= 2*tf.ones(1)
init = tfp.optimizer.lbfgs_minimize(
opt lambda par: loss_and_gradient(par, n_pat, n_dev), init, max_iterations=1000
)return opt
= tf.constant(data['n_pats'], dtype=tf.float32)
n_pats = tf.constant(data['n_pdevs'], dtype=tf.float32)
n_pdev
= fit(n_pats, n_pdev)
mle
#print(f"converged: {mle.converged}")
#print(f"iterations: {mle.num_iterations}")
= np.linspace(0, 40)
x = mle.position
par = np.exp(par[0]) * x
y
= compute_cdf(par, n_pats, n_pdev) cdfs
These CDF values are concentrated around 0 and 1, which makes this approach quite impractical and suggests that the variance of the model is lower than the variance of the data.
The low variance can be increased by treating \(\lambda(n_{pats})\) as a random function, rather than a deterministic one. So we assume that \(\lambda(n_{pats})\) is drawn from a gamma distribution, \(\lambda(n_{pats}) \sim \Gamma(\alpha, \beta)\), where the rate parameter \(\beta\) is inversely proportional to the expected number of protocol deviations from a given site, \(\beta = \beta_{pat} / n_{pats}\), in order to ensure linearity in \(n_{pats}\). In this context, maximum likelihood estimation would be a nightmare to implement (because of the rate parameters of the Poisson distribution) and probably not very stable, so it is best to turn to Bayesian inference via MCMC algorithms. We thus pick gamma priors for \(\alpha\) and \(\beta_{pat}\) with a shape parameters of 2 to prevent the corresponding Markov chains from drifting too close to zero, where pathological behaviors seem to occur with more permissive priors in this model.
Code
= tf.constant(data['site_id'])
sites = tf.constant(data['n_pats'], dtype=tf.float32)
n_pats = tf.constant(data['n_pdevs'], dtype=tf.float32)
n_pdev
= tfd.JointDistributionSequential([
mdl_pd #alpha
2, 2, name='alpha'),
tfd.Gamma(#beta_pt
2, 2, name='beta_pt'),
tfd.Gamma(#pdev rates for each patient
lambda beta_pt, alpha: tfd.Independent(
/ n_pats[tf.newaxis,...]),
tfd.Gamma(alpha[...,tf.newaxis], beta_pt[...,tf.newaxis] =1
reinterpreted_batch_ndims
),#observed pdevs
lambda rates: tfd.Independent(tfd.Poisson(rates), reinterpreted_batch_ndims=1)
])
We can sample the posterior distribution of this model with a Hamiltonian Monte Carlo algorithm and assess the convergence of the Markov chains before computing the posterior probabilities of interest.
Code
= tf.dtypes.float32
dtype = 5
nchain =3000
burnin=10000
num_steps= mdl_pd.sample(nchain)
alpha0, beta_pt0, rates0, _ = [alpha0, beta_pt0, rates0]
init_state = [tf.cast(i, dtype=dtype) for i in [0.01, 0.01, 0.01]]
step_size = lambda *init_state: mdl_pd.log_prob(
target_log_prob_fn list(init_state) + [tf.cast(n_pdev, dtype=dtype)])
= 3*[tfb.Exp()]
unconstraining_bijectors
@tf.function(autograph=False, experimental_compile=True)
def run_chain(init_state, step_size, target_log_prob_fn, unconstraining_bijectors,
=num_steps, burnin=burnin):
num_steps
def trace_fn(_, pkr):
return (
pkr.inner_results.inner_results.is_accepted
)
= tfp.mcmc.TransformedTransitionKernel(
kernel =tfp.mcmc.HamiltonianMonteCarlo(
inner_kernel
target_log_prob_fn,=3,
num_leapfrog_steps=step_size),
step_size=unconstraining_bijectors)
bijector
= tfp.mcmc.SimpleStepSizeAdaptation(
hmc =kernel,
inner_kernel=burnin
num_adaptation_steps
)
# Sampling from the chain.
= tfp.mcmc.sample_chain(
[alpha, beta_pt, rates], is_accepted =num_steps,
num_results=burnin,
num_burnin_steps=init_state,
current_state=hmc,
kernel=trace_fn)
trace_fnreturn alpha, beta_pt, rates, is_accepted
= run_chain(
alpha, beta_pt, rates, is_accepted
init_state, step_size, target_log_prob_fn, unconstraining_bijectors)
= alpha[burnin:,:]
alpha_ = tf.reshape(alpha_, [alpha_.shape[0]*alpha_.shape[1]])
alpha_ = beta_pt[burnin:,:]
beta_pt_ = tf.reshape(beta_pt_, [beta_pt_.shape[0]*beta_pt_.shape[1]])
beta_pt_ = rates[burnin:,:]
rates_ = tf.reshape(rates_, [rates_.shape[0]*rates_.shape[1], rates_.shape[2]])
rates_
= tfd.Gamma(alpha_[:,tf.newaxis], beta_pt_[:, tf.newaxis] / n_pats[tf.newaxis,...])
rates_dist_ = rates_dist_.cdf(rates_)
rates_cdf_
= {}
posterior 'alpha'] = tf.transpose(alpha[burnin:, :]).numpy()
posterior['beta_pt'] = tf.transpose(beta_pt[burnin:, :]).numpy()
posterior['rate0'] = tf.transpose(rates[burnin:, :, 0])
posterior['rate1'] = tf.transpose(rates[burnin:, :, 1])
posterior['rate2'] = tf.transpose(rates[burnin:, :, 2])
posterior[
= az.from_dict(posterior=posterior)
az_trace
print(f'MCMC acceptance rate: {is_accepted.numpy().mean()}')
az.plot_trace(az_trace) plt.show()
MCMC acceptance rate: 0.73832
Given a Markov chain sample \((\hat\alpha, \hat\beta_{pat}, (\hat\lambda_i)_{i=1,\dots,N})\), where \(i\) indexes the investigator sites, we can evaluate the CDF of \(\Gamma(\hat\alpha, \hat\beta_{pat} / n_{pats, i})\) at \(\hat\lambda_i\) and average these quantities along the whole Markov chain to obtain an indicator of over- and underreporting. This indicator corresponds to the rate tail area of the inferred Poisson rates under their posterior predictive distribution. Low values mean a risk of underreporting, and high values a risk of overreporting (see the last column of the following sample table).
site | n_pats | n_pdev | mean_pdev_rate | std_pdev_rate | rate_tail_area | |
---|---|---|---|---|---|---|
0 | site_0 | 9 | 15 | 15.51 | 3.82 | 0.42 |
1 | site_1 | 10 | 14 | 14.56 | 3.72 | 0.36 |
2 | site_2 | 15 | 18 | 18.73 | 4.28 | 0.31 |
3 | site_3 | 9 | 10 | 10.77 | 3.19 | 0.30 |
4 | site_4 | 5 | 4 | 4.86 | 2.08 | 0.24 |
5 | site_5 | 7 | 19 | 18.98 | 4.26 | 0.61 |
6 | site_6 | 25 | 31 | 31.74 | 5.59 | 0.32 |
7 | site_7 | 2 | 11 | 9.82 | 2.83 | 0.82 |
8 | site_8 | 6 | 7 | 7.75 | 2.71 | 0.32 |
9 | site_9 | 17 | 27 | 27.65 | 5.17 | 0.40 |
The distribution of the rate tail areas looks more convenient than in the first simple model. Not only did we add variance with a mixture model, but we also assess the inferred Poisson parameters rather than the observations, and the former are shrunk by their prior.
This metric also seems to agree with the intuition of what underreporting and overreporting should look like.
We can set thresholds for over- and underreporting alerts at .8 and .2 respectively to illustrate how an auditor could use this model to select investigator sites to focus on.
This method illustrates the potential of Bayesian modeling to supercharge a regression analysis toolbox with a variety of likelihood functions that can capture the intricacies of the data generating process and inference methods that are more flexible in quantifying uncertainties than the standard GLM methods. These properties are particularly helpful in practical tasks such as anomaly detection or risk management that combine expert insights with quantitative modeling.