--- title: "Using surveil for public health research" output: rmarkdown::html_vignette header-includes: - \usepackage{amsmath} vignette: > %\VignetteIndexEntry{Using surveil for public health research} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: bib.bib link-citations: yes --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", results = "hold", collapse = TRUE, eval = TRUE, fig.pos = 'h', fig.align = 'center' ) ``` This vignette demonstrates basic usage of *surveil* for public health research. The package is designed for routine time trend analysis, namely for time trends in disease incidence rates or mortality rates. Models were built using the [Stan](https://mc-stan.org/) modeling language for Bayesian inference with Markov chain Monte Carlo (MCMC), but users only need to be familiar with the R language. The package also contains special methods for age-standardization, printing and plotting model results, and for measuring and visualizing health inequalities. For age-standardization see `vignette("age-standardization")`. For discussion and demonstration analysis see @donegan_2022. ## Getting started ```{r message=FALSE, warning=FALSE, eval=T} library(surveil) library(ggplot2) library(knitr) ``` To use the models provided by *surveil,* the surveillance data minimally must contain case counts, population at risk estimates, and a discrete time period variable. The data may also include one or more grouping variables, such as race-ethnicity. Time periods should consist of equally spaced intervals. This vignette analyzes colorectal cancer incidence data by race-ethnicity, year, and Texas MSA for ages 50-79 (data obtained from CDC Wonder). The race-ethnicity grouping includes (non-Hispanic) black, (non-Hispanic) white, and Hispanic, and the MSAs include those centered on the cities of Austin, Dallas, Houston, and San Antonio. ```{r eval=T} head(msa) |> kable(booktabs = TRUE, caption = "Glimpse of colorectal cancer incidence data (CDC Wonder)") ``` The primary function in *surveil* is `stan_rw`, which fits random walk models to surveillance data. The function is expects the user to provide a `data.frame` with specific column names. There must be one column named `Count` containing case counts, and another column named `Population`, containing the sizes of the populations at risk. The user must provide the name of the column containing the time period (the default is `time = Year`, to match CDC Wonder data). Optionally, one can provide a grouping factor. For the MSA data printed above, the grouping column is Race and the time column is Year. ## Preparing the data We will demonstrate using aggregated CRC cases across Texas's top four MSAs. The `msa` data from CDC Wonder already has the necessary format (column names and contents), but these data are dis-aggregated by MSA. So for this analysis, we first group the data by year and race, and then combine cases across MSAs. The following code chunk aggregates the data by year and race-ethnicity: ```{r message = FALSE, warn = FALSE, eval = T} msa2 <- aggregate(cbind(Count, Population) ~ Year + Race, data = msa, FUN = sum) ``` The following code provides a glimpse of the aggregated data: ```{r eval = T} head(msa2) |> kable(booktabs = TRUE, caption = "Glimpse of aggregated Texas metropolitan CRC cases, by race and year") ``` ## Model specification ### The basics The base *surveil* model is specified as follows. 
## Model specification

### The basics

The base *surveil* model is specified as follows. The Poisson model is used as the likelihood: the probability of observing a given number of cases, $y_t$, conditional on a given level of risk, $e^{\phi_t}$, and a known population at risk, $p_t$, is:
$$y_t \sim \text{Pois}(p_t \cdot e^{\phi_t})$$
where $t$ indexes the time period.

Next, we need a model for the log-rates, $\phi_t$. The first-difference prior states that our expectation for the log-rate at any time is its previous value, and we assign a normal probability distribution to deviations from the previous value [@clayton_1996]. This is also known as the random-walk prior:
$$\phi_t \sim \text{Gau}(\phi_{t-1}, \tau^2)$$
This places higher probability on a smooth trend through time, specifically implying that underlying disease risk tends to have less variation than crude incidence.

The log-risk for time $t=1$ has no previous value to anchor its expectation; thus, we assign a prior probability distribution directly to $\phi_1$. For this prior, *surveil* uses a normal distribution. The scale parameter, $\tau$, also requires a prior distribution, and again *surveil* uses a normal model, which is diffuse relative to the log incidence rates.

### Binomial model

In addition to the Poisson model, a binomial model is available:
$$y_t \sim \text{Binom}(p_t, \; g^{-1}(\phi_t))$$
where $g$ is the logit function and $g^{-1}(x) = \frac{\exp(x)}{1 + \exp(x)}$ is the inverse-logit function. If the binomial model is used, the rest of the model remains as stated above. The Poisson model is typically preferred for rare events (such as rates below .01); otherwise, the binomial model is usually more appropriate. The remainder of this vignette proceeds with the Poisson model only.

## Fitting the model

The time series model is fit by passing surveillance data to the `stan_rw` function. Here, `Year` and `Race` indicate the appropriate time and grouping columns in the `msa2` data frame.

```{r}
fit <- stan_rw(msa2, time = Year, group = Race, iter = 1e3)
```

The `iter = 1e3` argument controls how long the MCMC sampling continues (in this case, 1,000 iterations per chain: 500 warmup, then 500 more for inference). The default is 3,000, which is more than sufficient for this example model. By default, there are four independent MCMC chains, each contributing 500 post-warmup samples here (for a total of 2,000 MCMC samples used for the estimates). To speed things up, we could take advantage of parallel processing using the `cores` argument (e.g., add `cores = 4`) to run the four chains simultaneously. You can suppress the messages seen above by adding `refresh = 0`.

## Printing results

The `print` method will print the estimates with 95\% credible intervals to the console; adding `scale = 100e3` will display rates per 100,000:

```{r}
print(fit, scale = 100e3)
```

This information is also stored in a data frame, `fit$summary`:

```{r}
head(fit$summary)
```

The `fit$summary` object can be used to create custom plots and tables.
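For instance, the following is a minimal sketch of a custom table. The column names used here (`mean`, `lwr_2.5`, `upr_97.5`) are assumptions for the sake of illustration; check them against the output of `head(fit$summary)` above and adjust as needed.

```{r eval = FALSE}
# Scale the point estimates and credible intervals per 100,000.
# The column names below are assumed for illustration; verify them
# against head(fit$summary) before running.
est_cols <- c("mean", "lwr_2.5", "upr_97.5")
tbl <- fit$summary
tbl[est_cols] <- tbl[est_cols] * 100e3
kable(tbl, booktabs = TRUE, digits = 1,
      caption = "CRC incidence per 100,000: model estimates")
```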
## Visualizing results

If we call `plot` on a fitted *surveil* model, we get a `ggplot` object depicting risk estimates with 95\% credible intervals:

```{r fig.width = 4.75, fig.height = 3.5}
plot(fit, scale = 100e3)
```

The crude incidence rates (observed values) are also plotted here as points. The `plot` method has a number of optional arguments that control its appearance. For example, the `base_size` argument controls the size of labels. The size of the points for the crude rates can be adjusted using `size`, and `size = 0` removes them altogether. We can also use `ggplot` to add custom modifications:

```{r fig.width = 7, fig.height = 3.5}
fig <- plot(fit, scale = 100e3, base_size = 11, size = 0)
fig +
  theme(legend.position = "right") +
  labs(title = "CRC incidence per 100,000",
       subtitle = "Texas MSAs, 50-79 y.o.")
```

The plot method has a `style` argument that controls how uncertainty is represented. The default, `style = "mean_qi"`, shows the mean of the posterior distribution as the estimate and adds shading to depict the 95\% credible interval (as above). The alternative, `style = "lines"`, plots MCMC samples from the joint probability distribution of the estimates:

```{r fig.width = 4.5, fig.height = 3.5}
plot(fit, scale = 100e3, base_size = 11, style = "lines")
```

By default, `M = 250` samples are plotted. The `style` option is available for all of the *surveil* plot methods; it is sometimes helpful for visualizing multiple groups when their credible intervals overlap.

## Percent change

The `apc` method calculates percent change by period and cumulatively over time:

```{r}
fit_pc <- apc(fit)
```

The object returned by `apc` contains two data frames. The first contains estimates of percent change from the previous period:

```{r}
head(fit_pc$apc)
```

Those estimates typically have high uncertainty. The second data frame contains estimates of cumulative percent change (since the first observed period):

```{r}
head(fit_pc$cpc)
```

Each value in the `cpc` column is an estimate of the difference in incidence rates between that year and the first year (in this case, 1999), expressed as a percent of the first year's rate. The `lwr` and `upr` columns are the lower and upper bounds of the 95\% credible intervals for the estimates. This information can also be plotted:

```{r fig.width = 7, fig.height = 3.5}
plot(fit_pc, cumulative = TRUE)
```

If desired, the average annual percent change from the first period can be calculated by dividing the cumulative percent change (CPC) by the appropriate number of periods: the last year minus the first year, or $2017 - 1999 = 18$ in this example. The CPC from 1999 to 2017 for whites is about $-31$, for an average annual percent change of about $-31/18 = -1.72$. Credible intervals for the average annual percent change can be obtained from the CPC table by the same method.

## MCMC diagnostics

If you do not see any warnings printed to the R console at the end of the model fitting process, then you can simply move forward with the analysis. If there is a warning, it may say that the effective sample size is low or that the R-hat values are large. For a crash course on MCMC analysis with *surveil*, including MCMC diagnostics, see `vignette("surveil-mcmc")`.

In short, there are two key diagnostics to watch: the effective MCMC sample size (ESS) and R-hat. For all parameters of interest, the ESS should be at least 400 or so; to increase it, use the `iter` argument to draw more samples. The R-hat values should all be very near to one, such as within the range $1 \pm .02$; for the simple models used here, large R-hat values can often be fixed simply by drawing more samples. You can find both diagnostics by printing the results with `print(fit$samples)` and inspecting the columns `n_eff` (for ESS) and `Rhat`. The MCMC sampling is generally fast and trouble-free when the case counts are not too small, as in this example model. When the incidence rates are based on small counts (less than about 20), the sampling often proceeds more slowly and a higher `iter` value may be needed.
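To illustrate, the chunk below prints the diagnostics and then extracts the two columns of interest programmatically. The extraction step is a sketch that assumes `fit$samples` is a *rstan* `stanfit` object, as the `print` call suggests, and that the *rstan* package is installed.

```{r eval = FALSE}
# Print the full parameter summary; scan the n_eff and Rhat columns.
print(fit$samples)

# A sketch of programmatic extraction, assuming fit$samples is a
# stanfit object (requires the rstan package).
library(rstan)
diagnostics <- summary(fit$samples)$summary[, c("n_eff", "Rhat")]
head(diagnostics)
```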
## Multiple time series

In most applications, the base model specification described above will be entirely sufficient. However, *surveil* provides an option to add a correlation structure to the model when multiple groups are modeled together. For correlated trends, this can increase the precision of the estimates.

The log-rates for $k$ populations, $\boldsymbol \phi_t$, are assigned a multivariate normal distribution [@brandt_2007]:
$$\boldsymbol \phi_t \sim \text{Gau}(\boldsymbol \phi_{t-1}, \boldsymbol \Sigma),$$
where $\boldsymbol \Sigma$ is a $k \times k$ covariance matrix. The covariance matrix can be decomposed into a diagonal matrix containing scale parameters for each variable, $\boldsymbol \Delta = \text{diag}(\tau_1, \dots, \tau_k)$, and a symmetric correlation matrix, $\boldsymbol \Omega$ [@stan_2021]:
$$\boldsymbol \Sigma = \boldsymbol \Delta \boldsymbol \Omega \boldsymbol \Delta$$

When the correlation structure is added to the model, a prior distribution is also required for the correlation matrix. *surveil* uses the LKJ model, which has a single shape parameter, $\eta$ [@stan_2021]. If $\eta = 1$, the LKJ model places uniform prior probability on all $k \times k$ correlation matrices; as $\eta$ increases above one, it expresses ever greater skepticism towards large correlations. When $\eta < 1$, the LKJ model becomes 'concave', expressing skepticism towards correlations of zero.

To add the correlation structure to the model, add `cor = TRUE` to the `stan_rw` call, as in:

```{r eval = FALSE}
fit <- stan_rw(msa2, time = Year, group = Race, cor = TRUE)
```

## References