Most users of surveil may not need to know much of anything about MCMC analysis. Many statistical software users do not know how estimates are obtained. However, there are a few topics that are important for surveil will benefit from having some familiarity with.
The following topics will be introduced here:
McElreath (2016) is helpful for learning Bayesian analysis using MCMC. Other books that discuss MCMC include MacKay (2003) and Gelman et al. (2014); some may also want to see Stan Development Team (2022) and Vehtari et al. (2021).
The discussion will make use of a model of U.S. cancer incidence
rates, ages 10-14 from the cancer
data:
data(cancer)
cancer2 <- subset(cancer, Age == "10-14")
fit <- stan_rw(cancer2, time = Year, iter = 1500, chains = 4, refresh = 0)
#> Distribution: normal
#> Distribution: normal
#> [1] "Setting normal prior(s) for eta_1: "
#> location scale
#> -6 5
#> [1] "\nSetting half-normal prior for sigma: "
#> location scale
#> 0 1
The iter
argument determines how many MCMC samples will
be drawn from the model. The first half of the samples will always be
discarded as warmup and the remaining samples will be used for
inference. The chains
argument determines how many
independent MCMC chains will be used. The code above uses
iter=1500
and chains=4
which means there will
be 3,000 MCMC samples returned for inference.
We can examine the MCMC samples themselves to see how the values
provided by the print
method were calculated. The
stanfit
object created by Stan contains all of the MCMC
samples:
Each row of the matrix phi
is a sample from the joint
posterior probability distribution of parameters, and each column
represents a parameter. In this case, each parameter is an annual cancer
incidence rate. The values in any given column are samples from the
marginal probability distribution for that parameter.
We can visualize the samples using histograms. This code first takes
the first column from phi
, which represents the cancer
incidence rate in 1999 (the first observed year); the rates are
multiplied by 10,000 for ease of reading:
phi <- 10e3 * phi
phi_1999 <- phi[,1]
head(phi_1999)
#> [1] 1.224993 1.215863 1.269613 1.229944 1.288228 1.252372
These distributions are typically summarized by calculating their mean and tail-area quantiles. The mean is taken as the parameter estimate and the tail-area quantiles provide a credible interval:
If we repeated these steps for all years (for each column in
phi
) we would obtain the same information that was provided
by print(fit)
.
In order to make inferences about quantities that involve more than
one parameter, we need to use samples from the joint probability
distribution. We can visualize the joint distribution using the
plot
method:
Each line is a visual depiction of a row from the matrix of MCMC
samples phi
.
When calculating quantities of interest, like annual percent change, cumulative change, or a measure of inequality, one must always use the joint distribution. This means that the quantity of interest needs to be calculated by working row-wise. To illustrate how this is done, the following code creates a new quantity: the difference between risk in 2017 (the 19th year) and 1999 (the first observed year):
The mean of the probability distribution for the change in the incidence rate per 10,000 from 1999 to 2017 is:
Cancer incidence increased by 0.26 per 10,000 from 1999 to 2017 for kids ages 10-14. The cumulative percent change from 1999 to 2017 can also be calculated this way:
The rate increase is a cumulative change of about 21% or about 1.2% annually.
An important difference between sampling with an MCMC algorithm and
sampling from a pseudo random number generator (like rnorm
)
is that MCMC produces samples that are (often) correlated with one
another. This means that in the process of drawing MCMC samples each
sample tends to be similar to the previous sample.
This means that for any number of MCMC samples, there is less information than would be provided by the same number of independently drawn samples. As a result, it often requires more MCMC samples to obtain an estimate of any given prevision.
Stan Development Team (2022) write:
Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. The higher the ESS the better… For final results, we recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400.
To examine the effective sample size (ESS), you can start by printing
the fit$samples
object. The ESS is reported in the column
labeled n_eff
:
print(samples)
#> Inference for Stan model: RW.
#> 4 chains, each with iter=1500; warmup=750; thin=1;
#> post-warmup draws per chain=750, total post-warmup draws=3000.
#>
#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#> rate[1,1] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2715 1
#> rate[1,2] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3058 1
#> rate[1,3] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2535 1
#> rate[1,4] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3245 1
#> rate[1,5] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3387 1
#> rate[1,6] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3051 1
#> rate[1,7] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3217 1
#> rate[1,8] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3074 1
#> rate[1,9] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3165 1
#> rate[1,10] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2971 1
#> rate[1,11] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3196 1
#> rate[1,12] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3553 1
#> rate[1,13] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3337 1
#> rate[1,14] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2445 1
#> rate[1,15] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3402 1
#> rate[1,16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3150 1
#> rate[1,17] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3389 1
#> rate[1,18] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2772 1
#> rate[1,19] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3236 1
#> sigma[1] 0.02 0.00 0.01 0.01 0.02 0.02 0.03 0.04 869 1
#> log_lik[1,1] -5.26 0.01 0.59 -6.87 -5.43 -5.04 -4.87 -4.82 1973 1
#> log_lik[1,2] -5.26 0.01 0.55 -6.75 -5.42 -5.05 -4.88 -4.84 2436 1
#> log_lik[1,3] -5.86 0.02 0.96 -8.35 -6.29 -5.61 -5.13 -4.88 2216 1
#> log_lik[1,4] -5.24 0.01 0.50 -6.59 -5.39 -5.05 -4.91 -4.87 1959 1
#> log_lik[1,5] -5.24 0.01 0.47 -6.51 -5.39 -5.06 -4.92 -4.88 2376 1
#> log_lik[1,6] -5.44 0.01 0.67 -7.32 -5.67 -5.19 -4.98 -4.90 2357 1
#> log_lik[1,7] -5.29 0.01 0.52 -6.79 -5.47 -5.10 -4.93 -4.88 2504 1
#> log_lik[1,8] -5.29 0.01 0.50 -6.76 -5.45 -5.10 -4.94 -4.90 2544 1
#> log_lik[1,9] -5.22 0.01 0.45 -6.48 -5.32 -5.05 -4.92 -4.88 2111 1
#> log_lik[1,10] -5.20 0.01 0.44 -6.48 -5.31 -5.03 -4.91 -4.88 1746 1
#> log_lik[1,11] -5.19 0.01 0.44 -6.39 -5.27 -5.02 -4.92 -4.89 1540 1
#> log_lik[1,12] -5.28 0.01 0.50 -6.73 -5.41 -5.09 -4.95 -4.91 2161 1
#> log_lik[1,13] -5.22 0.01 0.44 -6.47 -5.31 -5.05 -4.94 -4.91 1575 1
#> log_lik[1,14] -5.22 0.01 0.42 -6.40 -5.34 -5.06 -4.95 -4.92 1544 1
#> log_lik[1,15] -5.20 0.01 0.40 -6.36 -5.29 -5.05 -4.96 -4.93 1767 1
#> log_lik[1,16] -5.41 0.01 0.60 -7.15 -5.58 -5.20 -5.01 -4.95 1976 1
#> log_lik[1,17] -5.24 0.01 0.42 -6.40 -5.33 -5.08 -4.98 -4.95 1405 1
#> log_lik[1,18] -5.80 0.02 0.90 -8.18 -6.23 -5.51 -5.12 -4.97 2427 1
#> log_lik[1,19] -5.49 0.02 0.74 -7.58 -5.70 -5.22 -5.01 -4.94 2232 1
#> lp__ -25.29 0.13 3.64 -33.49 -27.54 -24.91 -22.71 -19.07 818 1
#>
#> Samples were drawn using NUTS(diag_e) at Tue Nov 5 05:00:50 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at
#> convergence, Rhat=1).
All of the rate parameters have ESS over 2,000 which is much more than adequate for inference.
The rate parameters are most important, since inference is almost
always focused on them. sigma
is the scale parameter for
the model; the log_lik
parameters are the log-likelihood of
each observation. The lp__
parameter is always generated by
Stan, it is the log-probability of the model.
If Stan is warning that Bulk ESS or tail area ESS is low and possibly
unreliable, the first thing you can do is print results and inspect the
ESS values. Depending on one’s purpose, the warning may be overly
cautious. Otherwise you can simply increase the number of samples drawn
using the iter
argument to stan_rw
.
A second important difference between sampling with MCMC and sampling
with functions like rnorm
is that with MCMC we have to
find the target distribution. The reason rstan (and
surveil) samples from multiple, independent MCMC chains by
default is that this allows us to check that they have all converged on
the same distribution. If one or more chains does not resemble the
others then there may be a convergence failure (Gelman et al.
2014). When chains have converged, the Rhat statistics will
all equal about 1.
Stan Development Team (2022) write:
We recommend running at least four chains by default and in general only fully trust the sample if R-hat is less than 1.01.
The default setting in surveil uses four MCMC chains, consistent with the above recommendation.
The Rhat statistic can be examined by printing the Stanfit object
print(fit$samples)
. We can see above that they are all
equal to about 1.00.
We could also visualize the Rhats all at once using
rstan::stan_rhat
:
Sometimes, MCMC algorithms are unable to provide unbiased samples that will converge on the target distribution given sufficient time. Stan’s MCMC algorithm will issue a warning when it encounters a region of the probability model that it seems unable to explore. These warnings of “divergent transitions” should not be ignored, as they indicate that something may have gone wrong (Betancourt 2017a, 2017b). They will be printed to the console just after the model finishes sampling. Note that other MCMC platforms are simply unable to identify this bias (they will not issue a warning like Stan’s divergent transitions, even if the problem is present).
If the number of divergent transitions is small (such as a handful out of 5,000 samples), and the ESS and Rhat are both looking good, then you may determine that the amount of bias potentially introduced is acceptable for your purposes.
If there is a large number of divergent transitions, this is a serious warning. You may have made a mistake with your input data, or it could be that the model is just not very appropriate for the data.
If you receive a divergent transition warning from a surveil model, here are some things to consider:
Draw more samples: if you also have low ESS, the divergent
transitions may disappear by increasing the number of iterations
(lengthening Stan’s adaptation or warm-up period). However, the default
value of iter = 3000
should generally be
sufficient.
Raise adapt_delta
: you can also control a tuning
parameter related to divergent transitions. Simply raising the
adapt_delta
value to, say, 0.99 or 0.999 if need be, may be
sufficient; e.g.,
stan_rw(data, time = Year, control = list(adapt_delta = 0.99, max_treedepth = 13))
.
(When raising adapt_delta
, raising
max_treedepth
may improve sampling speed.) Raising
adapt_delta
ever closer to 1 (e.g., to .9999) is generally
not helpful.
There may be times when you are not able to prevent divergent
transitions. This may occur when the population at risk is small and
observations are particularly noisy; introducing the covariance
structure (using core = TRUE
) may make these issues more
difficult to address. If you are working with multiple age groups, it
may be the case that the age groups with very low case counts are the
only source of the problem. To determine if that is the case, you can
run the model without that age group or run a model for only that age
group.