This vignette demonstrates basic usage of surveil for public health research. The package is designed for routine time trend analysis, namely for time trends in disease incidence rates or mortality rates. Models were built using the Stan modeling language for Bayesian inference with Markov chain Monte Carlo (MCMC), but users only need to be familiar with the R language.
The package also contains special methods for age-standardization,
printing and plotting model results, and for measuring and visualizing
health inequalities. For age-standardization see
vignette("age-standardization"). For discussion and
demonstration analysis see Donegan et al. (2022).
To use the models provided by surveil, the surveillance data minimally must contain case counts, population at risk estimates, and a discrete time period variable. The data may also include one or more grouping variables, such as race-ethnicity. Time periods should consist of equally spaced intervals.
This vignette analyzes colorectal cancer incidence data by race-ethnicity, year, and Texas MSA for ages 50-79 (data obtained from CDC Wonder). The race-ethnicity grouping includes (non-Hispanic) black, (non-Hispanic) white, and Hispanic, and the MSAs include those centered on the cities of Austin, Dallas, Houston, and San Antonio.
head(msa) |>
kable(booktabs = TRUE,
caption = "Glimpse of colorectal cancer incidence data (CDC Wonder)") | Year | Race | MSA | Count | Population |
|---|---|---|---|---|
| 1999 | Black or African American | Austin-Round Rock, TX | 28 | 14421 |
| 2000 | Black or African American | Austin-Round Rock, TX | 16 | 15215 |
| 2001 | Black or African American | Austin-Round Rock, TX | 22 | 16000 |
| 2002 | Black or African American | Austin-Round Rock, TX | 24 | 16694 |
| 2003 | Black or African American | Austin-Round Rock, TX | 34 | 17513 |
| 2004 | Black or African American | Austin-Round Rock, TX | 26 | 18429 |
The primary function in surveil is stan_rw,
which fits random walk models to surveillance data. The function is
expects the user to provide a data.frame with specific
column names. There must be one column named Count
containing case counts, and another column named
Population, containing the sizes of the populations at
risk. The user must provide the name of the column containing the time
period (the default is time = Year, to match CDC Wonder
data). Optionally, one can provide a grouping factor. For the MSA data
printed above, the grouping column is Race and the time column is
Year.
We will demonstrate using aggregated CRC cases across Texas’s top
four MSAs. The msa data from CDC Wonder already has the
necessary format (column names and contents), but these data are
dis-aggregated by MSA. So for this analysis, we first group the data by
year and race, and then combine cases across MSAs.
The following code chunk aggregates the data by year and race-ethnicity:
The following code provides a glimpse of the aggregated data:
head(msa2) |>
kable(booktabs = TRUE,
caption = "Glimpse of aggregated Texas metropolitan CRC cases, by race and year")| Year | Race | Count | Population |
|---|---|---|---|
| 1999 | Black or African American | 471 | 270430 |
| 2000 | Black or African American | 455 | 283280 |
| 2001 | Black or African American | 505 | 298287 |
| 2002 | Black or African American | 539 | 313133 |
| 2003 | Black or African American | 546 | 329481 |
| 2004 | Black or African American | 602 | 346886 |
The base surveil model is specified as follows. The Poisson model is used as the likelihood: the probability of observing a given number of cases, \(y_t\), conditional on a given level of risk, \(e^{\phi_t}\), and known population at risk, \(p_t\), is: \[y_t \sim \text{Pois}(p_t \cdot e^{\phi_t})\] where \(t\) indexes the time period.
Next, we need a model for the log-rates, \({\phi_t}\). The first-difference prior states that our expectation for the log-rate at any time is its previous value, and we assign a normal probability distribution to deviations from the previous value (Clayton 1996). This is also known as the random-walk prior: \[\phi_t \sim \text{Gau}(\phi_{t-1}, \tau^2)\] This places higher probability on a smooth trend through time, specifically implying that underlying disease risk tends to have less variation than crude incidence.
The log-risk for time \(t=1\) has no previous value to anchor its expectation; thus, we assign a prior probability distribution directly to \(\phi_1\). For this prior, surveil uses a normal distribution. The scale parameter, \(\tau\), also requires a prior distribution, and again surveil uses a normal model which is diffuse relative to the log incidence rates.
In addition to the Poisson model, the binomial model is also available: \[y_t \sim \text{Binom}(p_t \cdot g^{-1}(\phi_t))\] where \(g\) is the logit function and \(g^{-1}(x) = \frac{exp(x)}{1 + exp(x)}\) (the inverse-logit function). If the binomial model is used the rest of the model remains the same as stated above. The Poisson model is typically preferred for rare events (such as rates below .01), otherwise the binomial model is usually more appropriate. The remainder of this vignette will proceed using the Poisson model only.
The time series model is fit by passing surveillance data to the
stan_rw function. Here, Year and
Race indicate the appropriate time and grouping columns in
the msa2 data frame.
fit <- stan_rw(msa2, time = Year, group = Race, iter = 1e3)
#> Distribution: normal
#> Distribution: normal
#> [1] "Setting normal prior(s) for eta_1: "
#> location scale
#> -6 5
#> [1] "\nSetting half-normal prior for sigma: "
#> location scale
#> 0 1
#>
#> SAMPLING FOR MODEL 'RW' NOW (CHAIN 1).
#> Chain 1:
#> Chain 1: Gradient evaluation took 2.4e-05 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1:
#> Chain 1:
#> Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 1:
#> Chain 1: Elapsed Time: 0.232 seconds (Warm-up)
#> Chain 1: 0.134 seconds (Sampling)
#> Chain 1: 0.366 seconds (Total)
#> Chain 1:
#>
#> SAMPLING FOR MODEL 'RW' NOW (CHAIN 2).
#> Chain 2:
#> Chain 2: Gradient evaluation took 1.7e-05 seconds
#> Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
#> Chain 2: Adjust your expectations accordingly!
#> Chain 2:
#> Chain 2:
#> Chain 2: Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 2: Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 2: Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 2:
#> Chain 2: Elapsed Time: 0.197 seconds (Warm-up)
#> Chain 2: 0.135 seconds (Sampling)
#> Chain 2: 0.332 seconds (Total)
#> Chain 2:
#>
#> SAMPLING FOR MODEL 'RW' NOW (CHAIN 3).
#> Chain 3:
#> Chain 3: Gradient evaluation took 1.6e-05 seconds
#> Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
#> Chain 3: Adjust your expectations accordingly!
#> Chain 3:
#> Chain 3:
#> Chain 3: Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 3: Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 3: Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 3:
#> Chain 3: Elapsed Time: 0.204 seconds (Warm-up)
#> Chain 3: 0.135 seconds (Sampling)
#> Chain 3: 0.339 seconds (Total)
#> Chain 3:
#>
#> SAMPLING FOR MODEL 'RW' NOW (CHAIN 4).
#> Chain 4:
#> Chain 4: Gradient evaluation took 1.6e-05 seconds
#> Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
#> Chain 4: Adjust your expectations accordingly!
#> Chain 4:
#> Chain 4:
#> Chain 4: Iteration: 1 / 1000 [ 0%] (Warmup)
#> Chain 4: Iteration: 501 / 1000 [ 50%] (Sampling)
#> Chain 4: Iteration: 1000 / 1000 [100%] (Sampling)
#> Chain 4:
#> Chain 4: Elapsed Time: 0.211 seconds (Warm-up)
#> Chain 4: 0.137 seconds (Sampling)
#> Chain 4: 0.348 seconds (Total)
#> Chain 4:The iter = 1e3 line controls how long the MCMC sampling
continues for (in this case, 1,000 samples: 500 warmup, then 500 more
for inference). The default is 3,000, which is more than sufficient for
this example model. By default, there are four independent MCMC chains
each with 500 post-warmup samples (for a total of 2,000 MCMC samples
used for the estimates).
To speed things up, we could take advantage of parallel processing
using the cores argument (e.g., add cores = 4)
to run on 4 cores simultaneously. You can suppress the messages seen
above by adding refresh = 0.
The print method will print the estimates with 95%
credible intervals to the console; adding scale = 100e3
will display rates per 100,000:
print(fit, scale = 100e3)
#> Summary of surveil model results
#> Time periods: 19
#> Grouping variable: Race
#> Correlation matrix: FALSE
#> time Race mean lwr_2.5 upr_97.5
#> 1 1999 Black or African American 170.14364 158.62821 182.32728
#> 2 2000 Black or African American 166.26954 156.16961 176.98914
#> 3 2001 Black or African American 168.23229 158.64832 178.49477
#> 4 2002 Black or African American 169.16627 159.79307 179.17709
#> 5 2003 Black or African American 166.92901 157.50245 176.79463
#> 6 2004 Black or African American 166.82640 157.26297 177.57028
#> 7 2005 Black or African American 159.08627 150.39060 168.23051
#> 8 2006 Black or African American 154.70793 146.18862 163.62480
#> 9 2007 Black or African American 152.72069 144.59461 161.04462
#> 10 2008 Black or African American 149.61671 141.99479 158.03988
#> 11 2009 Black or African American 143.40588 135.94234 150.87167
#> 12 2010 Black or African American 138.74971 131.86189 146.45688
#> 13 2011 Black or African American 130.74027 123.61305 137.46380
#> 14 2012 Black or African American 125.18072 118.02952 132.14708
#> 15 2013 Black or African American 124.84318 118.26559 131.55424
#> 16 2014 Black or African American 126.21306 119.94310 132.64049
#> 17 2015 Black or African American 122.42803 116.22610 128.52661
#> 18 2016 Black or African American 122.06587 115.82318 128.49530
#> 19 2017 Black or African American 122.64021 116.36778 129.45460
#> 20 1999 Hispanic 101.50984 94.52306 108.63147
#> 21 2000 Hispanic 104.25901 98.38856 111.29326
#> 22 2001 Hispanic 102.52074 97.25622 108.30015
#> 23 2002 Hispanic 101.54715 96.11654 107.00279
#> 24 2003 Hispanic 100.53345 95.34331 105.68613
#> 25 2004 Hispanic 100.96668 95.89772 106.57276
#> 26 2005 Hispanic 98.64733 93.91431 103.53021
#> 27 2006 Hispanic 96.94852 92.35645 101.95613
#> 28 2007 Hispanic 94.23183 89.88880 98.87838
#> 29 2008 Hispanic 90.55361 85.85319 95.05414
#> 30 2009 Hispanic 88.86973 84.40469 93.09830
#> 31 2010 Hispanic 87.97037 83.55279 92.08502
#> 32 2011 Hispanic 87.43144 83.23500 91.44089
#> 33 2012 Hispanic 87.85006 83.97717 91.94725
#> 34 2013 Hispanic 87.27053 83.49179 91.31350
#> 35 2014 Hispanic 85.78938 81.78136 89.57786
#> 36 2015 Hispanic 85.84513 82.18574 89.64746
#> 37 2016 Hispanic 84.80758 80.90412 88.41607
#> 38 2017 Hispanic 85.82921 81.74868 90.09033
#> 39 1999 White 135.19494 130.05829 140.25422
#> 40 2000 White 136.49892 131.87584 141.24643
#> 41 2001 White 134.11213 129.77656 138.65486
#> 42 2002 White 130.24673 125.77987 134.75586
#> 43 2003 White 127.42196 123.08160 131.97841
#> 44 2004 White 120.00115 115.84869 124.27417
#> 45 2005 White 115.07869 111.28642 118.81980
#> 46 2006 White 109.65981 105.82927 113.50030
#> 47 2007 White 108.18801 104.50091 111.78770
#> 48 2008 White 103.44555 99.88939 107.14136
#> 49 2009 White 98.89758 95.42351 102.56414
#> 50 2010 White 96.15333 92.52126 99.78514
#> 51 2011 White 93.66377 90.42204 96.89019
#> 52 2012 White 91.09801 88.00664 94.05506
#> 53 2013 White 90.10620 87.11315 93.21516
#> 54 2014 White 91.47537 88.30101 94.57411
#> 55 2015 White 93.09773 89.89343 96.39915
#> 56 2016 White 89.09286 86.03619 92.08186
#> 57 2017 White 92.31675 89.04630 95.77808This information is also stored in a data frame,
fit$summary:
head(fit$summary)
#> time mean lwr_2.5 upr_97.5 Race Year Count
#> 1 1999 0.001701436 0.001586282 0.001823273 Black or African American 1999 471
#> 2 2000 0.001662695 0.001561696 0.001769891 Black or African American 2000 455
#> 3 2001 0.001682323 0.001586483 0.001784948 Black or African American 2001 505
#> 4 2002 0.001691663 0.001597931 0.001791771 Black or African American 2002 539
#> 5 2003 0.001669290 0.001575024 0.001767946 Black or African American 2003 546
#> 6 2004 0.001668264 0.001572630 0.001775703 Black or African American 2004 602
#> Population Crude
#> 1 270430 0.001741671
#> 2 283280 0.001606185
#> 3 298287 0.001693000
#> 4 313133 0.001721313
#> 5 329481 0.001657152
#> 6 346886 0.001735440The fit$summary object can be used to create custom
plots and tables.
If we call plot on a fitted surveil model, we
get a ggplot object depicting risk estimates with 95%
credible intervals:
The crude incidence rates (observed values) are also plotted here as
points.
The plot method has a number of optional arguments that
control its appearance. For example, the base_size argument
controls the size of labels. The size of the points for the crude rates
can be adjusted using size, and size = 0
removes them altogether. We can also use ggplot to add
custom modifications:
fig <- plot(fit, scale = 100e3, base_size = 11, size = 0)
#> Plotted rates are per 100,000
fig +
theme(legend.position = "right") +
labs(title = "CRC incidence per 100,000",
subtitle = "Texas MSAs, 50-79 y.o.")The plot method has a style argument that controls how
uncertainty is represented. The default, style = "mean_qi",
shows the mean of the posterior distribution as the estimate and adds
shading to depict the 95% credible interval (as above). The alternative,
style = "lines", plots MCMC samples from the joint
probability distribution for the estimates:
By default, M = 250 samples are plotted. The
style option is available for all of the surveil
plot methods. This style is sometimes helpful for visualizing multiple
groups when their credible intervals overlap.
The apc method calculates percent change by period and
cumulatively over time:
The object returned by apc contains two data frames. The
first contains estimates of percent change from the previous period:
head(fit_pc$apc)
#> time group apc lwr upr
#> 1 1999 Black or African American 0.00000000 0.000000 0.000000
#> 2 2000 Black or African American -2.19676740 -9.894800 4.565991
#> 3 2001 Black or African American 1.24879791 -5.239272 8.885292
#> 4 2002 Black or African American 0.60936909 -5.409300 7.553814
#> 5 2003 Black or African American -1.26682710 -7.644652 5.010945
#> 6 2004 Black or African American -0.01111352 -6.292749 6.814294Those estimates typically have high uncertainty.
The second data frame contains estimates of cumulative percent change (since the first observed period):
head(fit_pc$cpc)
#> time group cpc lwr upr
#> 1 1999 Black or African American 0.0000000 0.000000 0.000000
#> 2 2000 Black or African American -2.1967674 -9.894800 4.565991
#> 3 2001 Black or African American -1.0125987 -9.267085 7.020618
#> 4 2002 Black or African American -0.4490743 -8.702235 8.559414
#> 5 2003 Black or African American -1.7669684 -10.261507 6.949238
#> 6 2004 Black or African American -1.8338821 -10.195059 6.648326Each value in the cpc column is an estimate of the
difference in incidence rates between that year and the first year (in
this case, 1999) expressed as a percent of the first year’s rate. The
lwr and upr columns are the lower and upper
bounds of the 95% credible intervals for the estimates.
This information can also be plotted:
If desired, the average annual percent change from the first period can be calculated by dividing the cumulative percent change (CPC) by the appropriate number of periods. For example, the CPC from 1999 to 2017 for whites is about -31 for an average annual percent change of about \(-31/18 = -1.72\). The credible intervals for the average annual percent change can also be obtained from the CPC table using the same method. For this example, the correct denominator is \(2017-1999=18\) (generally: the last year minus the first year).
If you do not see any warnings printed to the R console at the end of
the model fitting process then you can simply move forward with the
analysis. If there is a warning, it may say that the effective sample
size is low or that the R-hat values are large. For a crash course on
MCMC analysis with surveil, including MCMC diagnostics, see the
vignette on the topic vignette("surveil-mcmc").
A quick and dirty summary is that you want to watch two key
diagnostics, which are effective MCMC sample size (ESS) and R-hat. For
all your parameters of interest, ESS should be at least 400 or so. If
you want to increase the ESS, use the iter argument to draw
more samples. The R-hat values should all be pretty near to one, such as
within the range \(1 \pm .02\). For the
simple models we are using here, large R-hat values can often be fixed
just by drawing more samples.
You can find these diagnostics by printing results
print(fit$samples) and seeing the columns
n_eff (for ESS) and Rhat.
The MCMC sampling is generally fast and without trouble when the
numbers are not too small (as in this example model). When the incidence
rates are based on small numbers (like counts less than 20), the
sampling often proceeds more slowly and a higher iter value
may be needed.
In most applications, the base model specification described above will be entirely sufficient. However, surveil provides an option for users to add a correlation structure to the model when multiple groups are modeled together. For correlated trends, this can increase the precision of estimates.
The log-rates for \(k\) populations, \(\boldsymbol \phi_t\), are assigned a multivariate normal distribution (Brandt and Williams 2007): \[\boldsymbol \phi_t \sim \text{Gau}(\boldsymbol \phi_{t-1}, \boldsymbol \Sigma),\] where \(\boldsymbol \Sigma\) is a \(k \times k\) covariance matrix.
The covariance matrix can be decomposed into a diagonal matrix containing scale parameters for each variable, \(\boldsymbol \Delta = diag(\tau_1,\dots \tau_k)\), and a symmetric correlation matrix, \(\boldsymbol \Omega\) (Stan Development Team 2021): \[\boldsymbol \Sigma = \boldsymbol \Delta \boldsymbol \Omega \boldsymbol \Delta\] When the correlation structure is added to the model, then a prior distribution is also required for the correlation matrix. surveil uses the LKJ model, which has a single shape parameter, \(\eta\) (Stan Development Team 2021). If \(\eta=1\), the LKJ model will place uniform prior probability on any \(k \times k\) correlation matrix; as \(\eta\) increases from one, it expresses ever greater skepticism towards large correlations. When \(\eta <1\), the LKJ model becomes ‘concave’—expressing skepticism towards correlations of zero.
If we wanted to add a correlation structure to the model, we would
add cor = TRUE to stan_rw, as in: