--- title: "MCMC with surveil" output: rmarkdown::html_vignette header-includes: - \usepackage{amsmath} vignette: > %\VignetteIndexEntry{MCMC with surveil} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bib.bib link-citations: yes nocite: | @donegan_2022 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", results = "hold", collapse = TRUE, eval = TRUE, fig.pos = 'h', fig.width = 5, fig.height = 3.5, fig.align = 'center' ) ``` Most users of *surveil* may not need to know much of anything about MCMC analysis. Many statistical software users do not know how estimates are obtained. However, there are a few topics that are important for *surveil* will benefit from having some familiarity with. The following topics will be introduced here: - Summarizing probability distributions - Effective sample size (ESS) - The Rhat statistic - Divergent transitions @mcelreath_2016 is helpful for learning Bayesian analysis using MCMC. Other books that discuss MCMC include @mackay_2003 and @gelman_2014; some may also want to see @stan-dev_2022 and @vehtari_2021. ## Example model ```{r message = FALSE} library(surveil) ``` The discussion will make use of a model of U.S. cancer incidence rates, ages 10-14 from the `cancer` data: ```{r} data(cancer) cancer2 <- subset(cancer, Age == "10-14") fit <- stan_rw(cancer2, time = Year, iter = 1500, chains = 4, refresh = 0) ``` The `iter` argument determines how many MCMC samples will be drawn from the model. The first half of the samples will always be discarded as warmup and the remaining samples will be used for inference. The `chains` argument determines how many independent MCMC chains will be used. The code above uses `iter=1500` and `chains=4` which means there will be 3,000 MCMC samples returned for inference. ## MCMC analysis We can examine the MCMC samples themselves to see how the values provided by the `print` method were calculated. The `stanfit` object created by Stan contains all of the MCMC samples: ```{r} samples <- fit$samples class(samples) ``` ```{r} phi <- as.matrix(samples, pars = "rate") dim(phi) ``` Each row of the matrix `phi` is a sample from the joint posterior probability distribution of parameters, and each column represents a parameter. In this case, each parameter is an annual cancer incidence rate. The values in any given column are samples from the marginal probability distribution for that parameter. We can visualize the samples using histograms. This code first takes the first column from `phi`, which represents the cancer incidence rate in 1999 (the first observed year); the rates are multiplied by 10,000 for ease of reading: ```{r} phi <- 10e3 * phi phi_1999 <- phi[,1] head(phi_1999) ``` ```{r} hist(phi_1999, xlab = "Rate per 10,000, 1999", main = NA) ``` These distributions are typically summarized by calculating their mean and tail-area quantiles. The mean is taken as the parameter estimate and the tail-area quantiles provide a credible interval: ```{r} mean(phi_1999) ``` ```{r} quantile(phi_1999, probs = c(0.025, 0.975)) ``` If we repeated these steps for all years (for each column in `phi`) we would obtain the same information that was provided by `print(fit)`. In order to make inferences about quantities that involve more than one parameter, we need to use samples from the joint probability distribution. We can visualize the joint distribution using the `plot` method: ```{r} plot(fit, style = "lines", scale = 10e3) ``` Each line is a visual depiction of a row from the matrix of MCMC samples `phi`. When calculating quantities of interest, like annual percent change, cumulative change, or a measure of inequality, one must always use the joint distribution. This means that the quantity of interest needs to be calculated by working row-wise. To illustrate how this is done, the following code creates a new quantity: the difference between risk in 2017 (the $19^{th}$ year) and 1999 (the first observed year): ```{r} phi_2017 <- phi[,19] diff <- (phi_2017 - phi_1999) hist(diff, xlab = "Rate difference", main = NA) ``` The mean of the probability distribution for the change in the incidence rate per 10,000 from 1999 to 2017 is: ```{r} mean(diff) ``` Cancer incidence increased by 0.26 per 10,000 from 1999 to 2017 for kids ages 10-14. The cumulative percent change from 1999 to 2017 can also be calculated this way: ```{r} 100 * mean( diff / phi_1999 ) ``` The rate increase is a cumulative change of about 21\% or about 1.2\% annually. ## Monte Carlo standard errors An important difference between sampling with an MCMC algorithm and sampling from a pseudo random number generator (like `rnorm`) is that MCMC produces samples that are (often) correlated with one another. This means that in the process of drawing MCMC samples each sample tends to be similar to the previous sample. This means that for any number of MCMC samples, there is less information than would be provided by the same number of independently drawn samples. As a result, it often requires more MCMC samples to obtain an estimate of any given prevision. @stan-dev_2022 write: > Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. The higher the ESS the better... For final results, we recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400. To examine the effective sample size (ESS), you can start by printing the `fit$samples` object. The ESS is reported in the column labeled `n_eff`: ```{r} print(samples) ``` All of the rate parameters have ESS over 2,000 which is much more than adequate for inference. The rate parameters are most important, since inference is almost always focused on them. `sigma` is the scale parameter for the model; the `log_lik` parameters are the log-likelihood of each observation. The `lp__` parameter is always generated by Stan, it is the log-probability of the model. If Stan is warning that Bulk ESS or tail area ESS is low and possibly unreliable, the first thing you can do is print results and inspect the ESS values. Depending on one's purpose, the warning may be overly cautious. Otherwise you can simply increase the number of samples drawn using the `iter` argument to `stan_rw`. ## Rhat A second important difference between sampling with MCMC and sampling with functions like `rnorm` is that with MCMC we have to *find* the target distribution. The reason *rstan* (and *surveil*) samples from multiple, independent MCMC chains by default is that this allows us to check that they have all converged on the same distribution. If one or more chains does not resemble the others then there may be a convergence failure [@gelman_2014]. When chains have converged, the Rhat statistics will all equal about 1. @stan-dev_2022 write: > We recommend running at least four chains by default and in general only fully trust the sample if R-hat is less than 1.01. The default setting in *surveil* uses four MCMC chains, consistent with the above recommendation. The Rhat statistic can be examined by printing the Stanfit object `print(fit$samples)`. We can see above that they are all equal to about 1.00. We could also visualize the Rhats all at once using `rstan::stan_rhat`: ```{r fig.width = 4, fig.height = 3.5, eval = FALSE} # plot not shown rstan::stan_rhat(samples) ``` ## Divergent transitions Sometimes, MCMC algorithms are unable to provide unbiased samples that will converge on the target distribution given sufficient time. Stan's MCMC algorithm will issue a warning when it encounters a region of the probability model that it seems unable to explore. These warnings of "divergent transitions" should not be ignored, as they indicate that something may have gone wrong [@betancourt_2017b;@betancourt_2017]. They will be printed to the console just after the model finishes sampling. Note that other MCMC platforms are simply unable to identify this bias (they will not issue a warning like Stan's divergent transitions, even if the problem is present). If the number of divergent transitions is small (such as a handful out of 5,000 samples), and the ESS and Rhat are both looking good, then you may determine that the amount of bias potentially introduced is acceptable for your purposes. If there is a large number of divergent transitions, this is a serious warning. You may have made a mistake with your input data, or it could be that the model is just not very appropriate for the data. If you receive a divergent transition warning from a *surveil* model, here are some things to consider: 1. Draw more samples: if you also have low ESS, the divergent transitions may disappear by increasing the number of iterations (lengthening Stan's adaptation or warm-up period). However, the default value of `iter = 3000` should generally be sufficient. 2. Raise `adapt_delta`: you can also control a tuning parameter related to divergent transitions. Simply raising the `adapt_delta` value to, say, 0.99 or 0.999 if need be, may be sufficient; e.g., `stan_rw(data, time = Year, control = list(adapt_delta = 0.99, max_treedepth = 13))`. (When raising `adapt_delta`, raising `max_treedepth` may improve sampling speed.) Raising `adapt_delta` ever closer to 1 (e.g., to .9999) is generally not helpful. There may be times when you are not able to prevent divergent transitions. This may occur when the population at risk is small and observations are particularly noisy; introducing the covariance structure (using `core = TRUE`) may make these issues more difficult to address. If you are working with multiple age groups, it may be the case that the age groups with very low case counts are the only source of the problem. To determine if that is the case, you can run the model without that age group or run a model for only that age group. ## References