Package 'surveil' reference manual

Title:	Time Series Models for Disease Surveillance
Description:	Fits time trend models for routine disease surveillance tasks and returns probability distributions for a variety of quantities of interest, including age-standardized rates, period and cumulative percent change, and measures of health inequality. The models are appropriate for count data such as disease incidence and mortality data, employing a Poisson or binomial likelihood and the first-difference (random-walk) prior for unknown risk. Optionally add a covariance matrix for multiple, correlated time series models. Inference is completed using Markov chain Monte Carlo via the Stan modeling language. References: Donegan, Hughes, and Lee (2022) <doi:10.2196/34589>; Stan Development Team (2021) <https://mc-stan.org>; Theil (1972, ISBN:0-444-10378-3).
Authors:	Connor Donegan [aut, cre]
Maintainer:	Connor Donegan <[email protected]>
License:	GPL (>= 3)
Version:	0.3.0
Built:	2025-03-05 04:50:53 UTC
Source:	https://github.com/connordonegan/surveil

The 'surveil' package

Description

Fits time series models for routine disease surveillance tasks and returns probability distributions for a variety of quantities of interest, including measures of health inequality, period and cumulative percent change, and age-standardized rates. Calculates Theil's index to measure inequality among multiple groups, and can be extended to measure inequality across multiple groups nested within geographies. Inference is completed using Markov chain Monte Carlo via the Stan modeling language. The models are appropriate for disease incidence and mortality data, employing a Poisson or binomial likelihood and first-difference (random-walk) prior for unknown risk, and optional covariance matrix for multiple correlated time series models.

Author(s)

Maintainer: Connor Donegan [email protected] (ORCID)

References

Brandt P, Williams JT. Multiple time series models. Thousand Oaks, CA: SAGE Publications, 2007. ISBN:9781412906562

Clayton DG. Generalized linear mixed models. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. Boca Raton, FL: CRC Press, 1996. p. 275-302. ISBN:9780412055515

Conceicao P, Galbraith JK, Bradford P. The Theil Index in sequences of nested and hierarchic grouping structures: implications for the measurement of inequality through time, with data aggregated at different levels of industrial classification. Eastern Economic Journal 2001;27(4):491-514.

Donegan C, Hughes AE, and Lee SC (2022). Colorectal Cancer Incidence, Inequalities, and Prevention Priorities in Urban Texas: Surveillance Study With the "surveil" Software Package. JMIR Public Health & Surveillance 8(8):e34589. doi:10.2196/34589

Stan Development Team (2020). RStan: the R interface to Stan. R package version 2.21.2. https://mc-stan.org

Theil H. Statistical decomposition analysis. Amsterdam, The Netherlands: North-Holland Publishing Company, 1972. ISBN:0444103783

Annual and cumulative percent change

Description

Summarize annual and cumulative percent change in risk

Usage

apc(x)

## S3 method for class 'surveil'
apc(x)

## S3 method for class 'stand_surveil'
apc(x)
apc(x)

## S3 method for class 'surveil'
apc(x)

## S3 method for class 'stand_surveil'
apc(x)

Arguments

`x`	A fitted `surviel` model, or standardized rates (a `stand_surveil` object).

Value

An apc (list) object containing the following data frames:

apc: A data frame containing a summary of the posterior distribution for period-specific percent change. This contains the posterior mean (apc) 95 percent credible intervals (lwr and upr bounds).
cpc: A data frame containing a summary of the posterior distribution for the cumulative percent change in risk at each time period. This contains the posterior mean (cpc) and 95 percent credible interval (lwr and upr bounds).
apc_samples: MCMC samples from the posterior distribution for period percent change
cpc_samples: MCMC samples from the posterior distribution for cumulative percent change

Examples

data(cancer)
cancer2 <- subset(cancer, grepl("55|60|65|70", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
                iter = 900) # low iter for speed only
 x <- apc(fit)
 print(x)
 plot(x, cumulative = TRUE)
data(cancer)
cancer2 <- subset(cancer, grepl("55|60|65|70", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
                iter = 900) # low iter for speed only
 x <- apc(fit)
 print(x)
 plot(x, cumulative = TRUE)

US cancer incidence by age, 1999-2017

Description

Annual cancer cases (all sites) by age group for the United States.

Usage

cancer
cancer

Format

A data frame with the following columns:

Year: Year of diagnosis
Age: Age group
Count: Number of cancer cases
Population: Age-specific population estimates

Source

United States Cancer Statistics - Incidence: 1999 - 2017, WONDER Online Database. United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2020. Accessed at http://wonder.cdc.gov/cancer-v2017.html on Oct 6, 2021 12:38:09 PM

Examples

data(cancer)
head(cancer)
data(cancer)
head(cancer)

Measures of pairwise inequality

Description

Calculate pairwise measures of health inequality from a fitted surveil time series model, with credible intervals and MCMC samples. Calculates absolute and fractional rate differences (RD and population attributable risk (PAR)), rate ratios, and excess cases.

Usage

group_diff(x, target, reference)

## S3 method for class 'surveil'
group_diff(x, target, reference)

## S3 method for class 'list'
group_diff(x, ...)
group_diff(x, target, reference)

## S3 method for class 'surveil'
group_diff(x, target, reference)

## S3 method for class 'list'
group_diff(x, ...)

Arguments

`x`	Either a fitted `surveil` time series model, or a list of two `stand_surveil` objects (i.e., `surveil` models with age-standardized rates, as returned by `standardize`). If `x` is a list of `stand_surveil` objects, see details below and note that the models must contain the same number of MCMC samples—to ensure this is the case, when using `stan_rw` set `iter` and `chains` to the same values for each of the two models.
`target`	The name (character string) of the disadvantaged group that is the target of inference. If `x` is a list of `stand_surveil` objects, the `target` argument is ignored and the first listed model will serve as the `target` group.
`reference`	The name (character string) of the reference group to which `target` will be compared. If `x` is a list of `stand_surveil` objects, the `reference` argument is ignored and the second listed model will serve as the `reference` group.
`...`	Additional arguments (not used).

Details

Comparing incidence rates

For the following calculations, the terms reference and target refer to incidence rates for the respective groups; p is the size of the target population. (Target is the group that is the 'target' of our inferences, so that it is the numerator in rate ratios, etc.) The following measures are calculated by group_diff:

# rate difference
RD = target - reference
# population attributable fraction
PAR = RD/target = (RR - 1)/RR
# rate ratio
RR = target/reference
# excess cases
EC = RD * p

As the math communicates, the PAR is the rate difference expressed as a fraction of total risk for the target population. This could also be read as the fraction of risk in the target population that would have been removed had the target rate equaled the reference rate (Menvielle et al. 2017).

Comparing age-standardized rates

If the user provides a list of stand_surveil objects with age-standardized rates (instead of a single surveil model), then the exact calculations will be completed as follows. The RR is simply the ratio of age-standardized rates, and the rate difference is similarly the difference between age-standardized rates. However, excess cases is calculated for each age group separately, and the total excess cases across all age groups is returned. Similarly, the attributable risk is calculated by taking the total excess cases across all age groups per year and dividing by the total risk (i.e., by the sum of the whole number of cases across all age groups). Cumulative excess cases is the sum of the time-period specific total number of excess cases. (Notice that the PAR is not equal to (RR-1)/RR when the PAR is derived from a number of age-specific rates and the RR is based on age-standardized rates.)

Value

A list, also of class "surveil_diff", with the following elements:

summary: A tibble with a summary of posterior distributions (mean and 95 percent cred. intervals) for the target group incidence rate, the RD, RR, PAR, and excess cases.
cumulative_cases: Summary of the posterior distribution for the cumulative number of excess cases and the PAR (mean and 95 percent cred. intervals)
groups: Character string with target and reference population names
samples: A data frame of MCMC samples for each quantity of interest (target and reference rates, RD, RR, PAR, and EC, as well as Trend_Cases = Rate * Population). Indexed by time.
cum_samples: MCMC samples of the cumulative number of excess cases.

Author(s)

Connor Donegan ([email protected])

Source

Menvielle, G, Kulhanaova, I, Machenbach, JP. Assessing the impact of a public health intervention to reduce social inequalities in cancer. In: Vaccarella, S, Lortet-Tieulent, J, Saracci, R, Conway, D, Straif, K, Wild, CP, editors. Reducing Social Inequalities in Cancer: Evidence and Priorities for Research. Geneva, Switzerland: WHO Press, 2017:185-192.

Examples

data(msa)
dat = subset(msa, grepl("Houston", MSA) & grepl("Black|White", Race))
fit <- stan_rw(dat, time = Year, group = Race,
               iter = 1e3) # low iter for speed only
gd <- group_diff(fit, "Black or African American", "White")
print(gd, scale = 100e3)
plot(gd, scale = 100e3)

data(msa)
dat = subset(msa, grepl("Houston", MSA) & grepl("Black|White", Race))
fit <- stan_rw(dat, time = Year, group = Race,
               iter = 1e3) # low iter for speed only
gd <- group_diff(fit, "Black or African American", "White")
print(gd, scale = 100e3)
plot(gd, scale = 100e3)

Colorectal cancer incidence by Texas MSA, 1999-2017, ages 50-79

Description

Annual counts of colorectal cancer (cancer of colon or rectum), ages 50-79, for Texas's top four metropolitan statistical areas (MSAs), with population at risk estimates, by race-ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic/Latino).

Usage

msa
msa

Format

A tibble with the following attributes:

Year: Year of diagnosis
Race: Race-ethnicity designation
MSA: Metropolitan statistical area
Count: Number of CRC cases
Population: Age-specific population estimate

Source

United States Cancer Statistics–Incidence: 1999-2017, WONDER Online Database. United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2020. Accessed at http://wonder.cdc.gov/cancer-v2017.html on Nov 9, 2020 2:59:24 PM.

Examples


data(msa)
head(msa)
data(msa)
head(msa)

Methods for fitted `surveil` models

Description

Print and plot methods for surveil model results

Usage

## S3 method for class 'surveil'
print(x, scale = 1, ...)

## S3 method for class 'surveil'
plot(
  x,
  scale = 1,
  style = c("mean_qi", "lines"),
  facet = FALSE,
  facet_scales = c("fixed", "free"),
  ncol = NULL,
  base_size = 14,
  palette = "Dark2",
  M = 250,
  alpha,
  lwd,
  fill = "gray80",
  size = 1.5,
  ...
)

## S3 method for class 'list'
plot(
  x,
  scale = 1,
  style = c("mean_qi", "lines"),
  facet = FALSE,
  ncol,
  facet_scales = c("fixed", "free"),
  M = 250,
  base_size = 14,
  palette = "Dark2",
  fill = "gray80",
  size = 1.5,
  alpha,
  lwd,
  ...
)
## S3 method for class 'surveil'
print(x, scale = 1, ...)

## S3 method for class 'surveil'
plot(
  x,
  scale = 1,
  style = c("mean_qi", "lines"),
  facet = FALSE,
  facet_scales = c("fixed", "free"),
  ncol = NULL,
  base_size = 14,
  palette = "Dark2",
  M = 250,
  alpha,
  lwd,
  fill = "gray80",
  size = 1.5,
  ...
)

## S3 method for class 'list'
plot(
  x,
  scale = 1,
  style = c("mean_qi", "lines"),
  facet = FALSE,
  ncol,
  facet_scales = c("fixed", "free"),
  M = 250,
  base_size = 14,
  palette = "Dark2",
  fill = "gray80",
  size = 1.5,
  alpha,
  lwd,
  ...
)

Arguments

`x`	A fitted `surveil` model, or a list of `stand_surveil` objects (as produced by `standardize`).
`scale`	Scale the rates by this amount; e.g., `scale = 100e3` will print rates per 100,000 at risk.
`...`	For the plot method, additional arguments will be passed to '`theme`; for the print method, additional arguments will be passed to `print.data.frame`.
`style`	If `style = "mean_qi"`, then the posterior mean and 95 percent credible interval will be plotted; if `style = "lines"`, then `M` samples from the joint probability distribution of the annual rates will be plotted.
`facet`	If `facet = TRUE`, `facet_wrap` will be used instead of differentiating by line color.
`facet_scales`	When `facet = TRUE`, this argument controls behavior of the scales for each sub-plot. See the `scales` argument to `facet_wrap`.
`ncol`	Number of columns for the plotting device; optional and only used if `facet = TRUE`. If `ncol = 1`, the three plots will be aligned vertically in one column; if `ncol = 3` they will b aligned horizontally in one row. Defaults to `ncol = NULL` to allow `facet_wrap` to automatically determine the number of columns.
`base_size`	Passed to `theme_classic()` to control size of plot components (text).
`palette`	For multiple groups, choose the color palette. For a list of options, see `scale_color_brewer`. The default is `palette = "Dark2"`. Not used if `facet = TRUE`.
`M`	If `style = "lines"`, then `M` is the number of samples from the posterior distribution that will be plotted; the default is `M = 250`.
`alpha`	Numeric value from zero to one. When `style = "lines"`, this controls transparency of lines; passed to `geom_line`. For 'style = "mean_qi", this controls the transparency of the shaded credible interval; passed to `geom_ribbon`.
`lwd`	Numeric value indicating linewidth. Passed to `geom_line`
`fill`	Color for the shaded credible intervals; only used when `style = "mean_qi"`.
`size`	Positive numeric value. For `style = "mean_qi"`, this controls the size of the points representing crude rates. To exclude these points from the plot altogether, use `size = 0`.

Value

The plot method returns a ggplot object; the print method returns nothing but prints a summary of results to the R console. If x is a list of stand_surveil objects, the plotted lines will be labeled using the names returned by names(x); if elements of the list are not named, plotted lines will simply be numbered.

Author(s)

Connor Donegan ([email protected])

Examples

data(msa)
dat <- aggregate(cbind(Count, Population) ~ Year + Race, data = msa, FUN = sum)
fit <- stan_rw(dat, time = Year, group = Race,
               iter = 1e3, chains = 2) # for speed only, use 4 chains

print(fit)
plot(fit, facet = TRUE, scale = 100e3)

 ## as a ggplot, you can customize the output
library(ggplot2)
plot(fit) + theme_bw()

data(msa)
dat <- aggregate(cbind(Count, Population) ~ Year + Race, data = msa, FUN = sum)
fit <- stan_rw(dat, time = Year, group = Race,
               iter = 1e3, chains = 2) # for speed only, use 4 chains

print(fit)
plot(fit, facet = TRUE, scale = 100e3)

 ## as a ggplot, you can customize the output
library(ggplot2)
plot(fit) + theme_bw()

Methods for Theil's index

Description

Printing and plotting methods for Theil's inequality index

Usage

## S3 method for class 'theil'
plot(
  x,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "black",
  alpha,
  lwd,
  base_size = 14,
  scale = 100,
  labels = x$summary$time,
  ...
)

## S3 method for class 'theil_list'
plot(
  x,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "black",
  alpha,
  lwd,
  between_title = "Between",
  within_title = "Within",
  total_title = "Total",
  scale = 100,
  plot = TRUE,
  ncol = 3,
  base_size = 14,
  ...
)

## S3 method for class 'theil'
print(x, scale = 100, digits = 3, ...)

## S3 method for class 'theil_list'
print(x, scale = 100, digits = 3, ...)
## S3 method for class 'theil'
plot(
  x,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "black",
  alpha,
  lwd,
  base_size = 14,
  scale = 100,
  labels = x$summary$time,
  ...
)

## S3 method for class 'theil_list'
plot(
  x,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "black",
  alpha,
  lwd,
  between_title = "Between",
  within_title = "Within",
  total_title = "Total",
  scale = 100,
  plot = TRUE,
  ncol = 3,
  base_size = 14,
  ...
)

## S3 method for class 'theil'
print(x, scale = 100, digits = 3, ...)

## S3 method for class 'theil_list'
print(x, scale = 100, digits = 3, ...)

Arguments

`x`	An object of class `⁠thiel' or ⁠`theil_list`⁠, as returned by calling ⁠`theil`⁠on a list of fitted⁠`surveil' models
`style`	If `style = "mean_qi"`, then the posterior mean and 95 percent credible interval will be plotted; if `style = "lines"`, then `M` samples from the joint probability distribution will be plotted.
`M`	If `style = "lines"`, then `M` is the number of samples from the posterior distribution that will be plotted; the default is `M = 250`.
`col`	Line color
`fill`	Fill color
`alpha`	For `style = "mean_qi"`, this controls the transparency for the credible interval (passed to `geom_ribbon`) and defaults to `alpha = 0.5`; for `style = "lines"`, this controls the transparency of the lines and defaults to `alpha = 0.7`.
`lwd`	Line width; for `style = "mean_qi"`, the default is `lwd = 1`; for `style = "lines"`, the default is `lwd = 0.05`.
`base_size`	Passed to `theme_classic` to control size of plot elements (e.g., text)
`scale`	Scale Theil's index by `scale`
`labels`	x-axis labels (time periods)
`...`	additional arguments
`between_title`	Plot title for the between geography component of Theil's T; defaults to "Between".
`within_title`	Plot title for the within geography component of Theil's T; defaults to "Within".
`total_title`	Plot title for Theil's index; defaults to "Total".
`plot`	If `FALSE`, return a list of `ggplot`s. Not used when `style = "lines"`.
`ncol`	Number of columns for the plotting device. If `ncol = 1`, the three plots will be aligned vertically in one column; if `ncol = 3` they will b aligned horizontally in one row.
`digits`	number of digits to print (passed to `print.data.frame`)

Value

plot.theil

The plot method returns an object of class ggplot.

plot.theil_list

If style = "lines", the plot method for theil_list objects returns a ggplot with facets for each component of inequality (between-areas, within-areas, and total). For style = "mean_qi", the plot method returns either a list of plots (all of class ggplot) or, when plot = TRUE, it will draw them to current plotting device using grid.arrange.

print.theil

The print returns nothing and method prints a summary of results to the R console.

Methods for APC objects

Description

Methods for APC objects

Usage

## S3 method for class 'apc'
print(x, digits = 1, max = 20, ...)

## S3 method for class 'apc'
plot(
  x,
  cumulative = FALSE,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "black",
  alpha,
  lwd,
  base_size = 14,
  ...
)
## S3 method for class 'apc'
print(x, digits = 1, max = 20, ...)

## S3 method for class 'apc'
plot(
  x,
  cumulative = FALSE,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "black",
  alpha,
  lwd,
  base_size = 14,
  ...
)

Arguments

`x`	An `apc` object returned by `apc`
`digits`	Print this many digits (passed to `print.data.frame`)
`max`	Maximum number of time periods (rows) to print
`...`	additional arguments; for the print argument, these will be passed to `print.data.frame`. For the plot method, these will be passed to `theme`.
`cumulative`	Plot cumulative percent change? Defaults to `cumulative = FALSE`
`style`	If `style = "mean_qi"`, then the posterior mean and 95 percent credible interval will be plotted; if `style = "lines"`, then `M` samples from the joint probability distribution will be plotted.
`M`	If `style = "lines"`, then `M` is the number of samples from the posterior distribution that will be plotted; the default is `M = 250`.
`col`	Line color
`fill`	Fill color for the 95 percent credible interval
`alpha`	For `style = "mean_qi"`, this controls the transparency for the credible interval (passed to `geom_ribbon`) and defaults to `alpha = 0.5`; for `style = "lines"`, this controls the transparency of the lines and defaults to `alpha = 0.7`.
`lwd`	Line width
`base_size`	Size of plot attributes, passed to '`theme_classic`

Value

print

The print method does not have a return value, but prints a summary of results to the R console.

Plot

The plot method returns a ggplot.

Methods for age-standardized rates

Description

Print and plot methods for stand_surveil (standardized rates obtained from a fitted surveil model)

Usage

## S3 method for class 'stand_surveil'
print(x, scale = 1, digits = 3, ...)

## S3 method for class 'stand_surveil'
plot(
  x,
  scale = 1,
  style = c("mean_qi", "lines"),
  M = 250,
  base_size = 14,
  col = "black",
  fill = "gray80",
  alpha,
  lwd,
  ...
)
## S3 method for class 'stand_surveil'
print(x, scale = 1, digits = 3, ...)

## S3 method for class 'stand_surveil'
plot(
  x,
  scale = 1,
  style = c("mean_qi", "lines"),
  M = 250,
  base_size = 14,
  col = "black",
  fill = "gray80",
  alpha,
  lwd,
  ...
)

Arguments

`x`	An object of `stand_surveil` obtained by calling `standardize` on a fitted `surveil` model
`scale`	Scale the rates by this amount; e.g., `scale = 100e3` will print rates per 100,000 at risk.
`digits`	Number of digits to print
`...`	additional arguments
`style`	If `style = "mean_qi"`, then the posterior means and 95 percent credible intervals will be plotted; if `style = "lines"`, then `M` samples from the joint posterior distribution will be plotted.
`M`	Number of samples to plot when `style = "lines"`
`base_size`	Passed to `theme_classic()` to control size of plot components (text).
`col`	Line color
`fill`	Fill color for the 95 percent credible intervals
`alpha`	For `style = "mean_qi"`, this controls the transparency for the credible interval (passed to `geom_ribbon`) and defaults to `alpha = 0.5`; for `style = "lines"`, this controls the transparency of the lines and defaults to `alpha = 0.7`.
`lwd`	Line width; for `style = "mean_qi"`, the default is `lwd = 1`; for `style = "lines"`, the default is `lwd = 0.05`.

Details

Calling standardize on a fitted surveil model will create a new object that contains the surveil model results as well standardized rates. This new stand_surveil object has its own methods for printing and plotting.

print.stand_surveil

Any additional arguments (...) will be passed to print.data.frame

plot.stand_surveil

Any additional arguments (...) will be passed to 'theme.

Value

print.stand_surveil

The print method returns nothing but prints a summary of results to the console.

plot.stand_surveil

The plot method returns an object of class ggplot.

Prior distributions

Description

Prior distributions

Usage

normal(location = 0, scale, k = 1)

lkj(eta)
normal(location = 0, scale, k = 1)

lkj(eta)

Arguments

`location`	Location parameter (numeric)
`scale`	Scale parameter (positive numeric)
`k`	Optional; number of groups for which priors are needed. This is a shortcut to avoid using the `rep` function to repeat the same prior for each group, as in: `⁠normal(location = rep(0, times = 3), scale = rep(1, times = 3)⁠`. To provide distinct priors for each group, simply specify them individually, as in `normal(location = c(-5, -6, -8), scale = c(2, 2, 2))`.
`eta`	The shape parameter for the LKJ prior

Details

The prior distribution functions are used to set the values of prior parameters.

Users can control the values of the parameters, but the distribution (model) itself is fixed. The first log-rate (eta[t], t=1) and the scale parameters (sigma) are assigned Gaussian (normal) prior distribution. (The scale parameter, sigma, is constrained to be positive, making it a half-normal prior.) For correlated time series, the correlation matrix is assigned the LKJ prior.

Parameterizations

For details on how any distribution is parameterized, see the Stan Language Functions Reference document: https://mc-stan.org/users/documentation/.

LKJ prior

The LKJ prior for correlation matrix has a single parameter, eta (eta > 0). If eta=1, then you are placing a uniform prior on any K-by-K correlation matrix. For eta > 1, there is a higher probability on the identity matrix, such that as eta increases beyond 1, you are expressing greater skepticism towards large correlations. If 0 < eta < 1, then you will be expressing skepticism towards correlations of zero and favoring non-zero correlations. See Stan documentation: https://mc-stan.org/docs/2_27/functions-reference/lkj-correlation.html.

Value

An object of class prior which will be used internally by surveil to set parameters of prior distributions.

Source

Stan Development Team. Stan Functions Reference Version 2.27. https://mc-stan.org/docs/2_27/functions-reference/lkj-correlation.html

Examples


# note there are three groups in the data, each requires a prior
prior <- list()
prior$eta_1 <- normal(location = -6, scale = 4, k = 3)
## by default, location = 0
prior$sigma <- normal(scale = 1, k = 3)
prior$omega <- lkj(2)


dfw <- msa[grep("Dallas", msa$MSA), ]
fit <- stan_rw(dfw, time = Year, group = Race, prior = prior,
               chains = 2, iter = 900) # for speed only
plot(fit)


# note there are three groups in the data, each requires a prior
prior <- list()
prior$eta_1 <- normal(location = -6, scale = 4, k = 3)
## by default, location = 0
prior$sigma <- normal(scale = 1, k = 3)
prior$omega <- lkj(2)


dfw <- msa[grep("Dallas", msa$MSA), ]
fit <- stan_rw(dfw, time = Year, group = Race, prior = prior,
               chains = 2, iter = 900) # for speed only
plot(fit)

Time series models for mortality and disease incidence

Description

Model time-varying incidence rates given a time series of case (or death) counts and population at risk.

Usage

stan_rw(
  data,
  group,
  time,
  cor = FALSE,
  family = poisson(),
  prior = list(),
  chains = 4,
  cores = 1,
  iter = 3000,
  refresh = 1500,
  control = list(adapt_delta = 0.98),
  ...
)
stan_rw(
  data,
  group,
  time,
  cor = FALSE,
  family = poisson(),
  prior = list(),
  chains = 4,
  cores = 1,
  iter = 3000,
  refresh = 1500,
  control = list(adapt_delta = 0.98),
  ...
)

Arguments

`data`	A `data.frame` containing the following columns: Count Number of cases or deaths; this column must be named 'Count'. Population Size of population at risk; this column must be named 'Population'. time Time period indicator. (Provide the (unquoted) column name using the `time` argument.) group Optional grouping variable. (Provide the (unquoted) column name using the `group` argument.)
`group`	If `data` is aggregated by demographic group, provide the (unquoted) name of the column in `data` containing the grouping structure, such as age brackets or race-ethnicity. E.g., if `data` has column names `Year`, `Race`, `Cases`, and `Population`, then you would provide `group = Race`.
`time`	Specify the (unquoted) name of the time variable in `data`, as in `time = Year`. This variable must be numeric-alike (i.e., `as.numeric(data$time)` will not fail).
`cor`	For correlated random walks use `cor = TRUE`; default value is `FALSE`. Note this only applies when the `group` argument is used.
`family`	The default specification is a Poisson model with log link function (`family = poisson()`). For a Binomial model with logit link function, use `family = binomial()`.
`prior`	Optionally provide a named `list` with prior parameters. If any of the following items are missing, default priors will be assigned and printed to the console. eta_1 The first value of log-risk in each series must be assigned a Gaussian prior probability distribution. Provide the location and scale parameters for each demographic group in a list, where each parameter is a `k`-length vector. For example, with `k=2` demographic groups, the following code will assign priors of `normal(-6.5, 5)` to the starting values of both series: `⁠prior = list(eta_1 = normal(location = -6.5, scale = 5, k = 2)⁠`. Note, `eta` is the log-rate, so centering the prior for `eta_1` on `-6.5` is similar to centering the prior rate on `⁠exp(-6.5)*100,000 = 150⁠` cases per 100,000 person-years at risk. Note, however, that the translation from log-rate to rate is non-linear. sigma Each demographic group has a scale parameter assigned to its log-rate. This is the scale of the annual deviations from the previous year's log-rate. The scale parameters are assigned independent half-normal prior distributions (these `half` normal distributions are restricted to be positive-valued only). omega If `cor = TRUE`, an LKJ prior is assigned to the correlation matrix, Omega.
`chains`	Number of independent MCMC chains to initiate (passed to `sampling`).
`cores`	The number of cores to use when executing the Markov chains in parallel (passed to `sampling`).
`iter`	Total number of MCMC iterations. Warmup draws are automatically half of `iter`.
`refresh`	How often to print the MCMC sampling progress to the console.
`control`	A named list of parameters to control Stan's sampling behavior. The most common parameters to control are `adapt_delta`, which may be raised to address divergent transitions, and `max_treedepth`. For example, `control = list(adapt_delta = 0.99, max_treedepth = 13)`, may be a reasonable specification to address a divergent transitions or maximum treedepth warning from Stan.
`...`	Other arguments passed to `sampling`.

Details

By default, the models have Poisson likelihoods for the case counts, with log link function. Alternatively, a Binomial model with logit link function can be specified using the family argument (family = binomial()).

For time t = 1,...n, the models assign Poisson probability distribution to the case counts, given log-risk eta and population at tirks P; the log-risk is modeled using the first-difference (or random-walk) prior:

 y_t ~ Poisson(p_t * exp(eta_t))
 eta_t ~ Normal(eta_{t-1}, sigma)
 eta_1 ~ Normal(-6, 5) (-Inf, 0)
 sigma ~ Normal(0, 1) (0, Inf)

This style of model has been discussed in Bayesian (bio)statistics for quite some time. See Clayton (1996).

The above model can be used for multiple distinct groups; in that case, each group will have its own independent time series model.

It is also possible to add a correlation structure to that set of models. Let Y_t be a k-length vector of observations for each of k groups at time t (the capital letter now indicates a vector), then:

 Y_t ~ Poisson(P_t * exp(Eta_t))
 Eta_t ~ MVNormal(Eta_{t-1}, Sigma)
 Eta_1 ~ Normal(-6, 5)  (-Inf, 0)
 Sigma = diag(sigma) * Omega * diag(sigma)
 Omega ~ LKJ(2)
 sigma ~ Normal(0, 1) (0, Inf)

where Omega is a correlation matrix and diag(sigma) is a diagonal matrix with scale parameters on the diagonal. This was adopted from Brandt and Williams (2007); for the LKJ prior, see the Stan Users Guide and Reference Manual.

If the binomial model is used instead of the Poisson, then the first line of the model specifications will be:

 y_t ~ binomial(P_t, inverse_logit(eta_t))

All else is remains the same. The logit function is log(r/(1-r)), where r is a rate between zero and one; the inverse-logit function is exp(x)/(1 + exp(x)).

Value

The function returns a list, also of class surveil, containing the following elements:

summary: A data.frame with posterior means and 95 percent credible intervals, as well as the raw data (Count, Population, time period, grouping variable if any, and crude rates).
samples: A stanfit object returned by sampling. This contains the MCMC samples from the posterior distribution of the fitted model.
cor: Logical value indicating if the model included a correlation structure.
time: A list containing the name of the time-period column in the user-provided data and a data.frame of observed time periods and their index.
group: If a grouping variable was used, this will be a list containing the name of the grouping variable and a data.frame with group labels and index values.
family: The user-provided family argument.

Author(s)

Connor Donegan ([email protected])

Source

Brandt P and Williams JT. Multiple time series models. Thousand Oaks, CA: SAGE Publications, 2007.

Clayton, DG. Generalized linear mixed models. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Boca Raton, FL: CRC Press, 1996. p. 275-302.

Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, 2.28. 2021. https://mc-stan.org

Examples

data(msa)
dat <- aggregate(cbind(Count, Population) ~ Year, data = msa, FUN = sum)

fit <- stan_rw(dat, time = Year)

## print summary of results
print(fit)
print(fit$summary)

## plot time trends (rates per 10,000)
plot(fit, scale = 10e3)
plot(fit, style = 'lines', scale = 10e3)

## Summary with MCMC diagnostics (n_eff, Rhat; from Rstan)
print(fit$samples)

## cumulative percent change
fit_pc <- apc(fit)
print(fit_pc$cpc)
plot(fit_pc, cumulative = TRUE)


## age-specific rates 
data(cancer)
cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
               chains = 3, iter = 1e3) # for speed only

## plot trends 
plot(fit, scale = 10e3)

## age-standardized rates
data(standard)
fit_stands <- standardize(fit,
                          label = standard$age,
                          standard_pop = standard$standard_pop)
print(fit_stands)
plot(fit_stands)

## percent change for age-standardized rates
fit_stands_apc <- apc(fit_stands)
plot(fit_stands_apc)
print(fit_stands_apc)


data(msa)
dat <- aggregate(cbind(Count, Population) ~ Year, data = msa, FUN = sum)

fit <- stan_rw(dat, time = Year)

## print summary of results
print(fit)
print(fit$summary)

## plot time trends (rates per 10,000)
plot(fit, scale = 10e3)
plot(fit, style = 'lines', scale = 10e3)

## Summary with MCMC diagnostics (n_eff, Rhat; from Rstan)
print(fit$samples)

## cumulative percent change
fit_pc <- apc(fit)
print(fit_pc$cpc)
plot(fit_pc, cumulative = TRUE)


## age-specific rates 
data(cancer)
cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
               chains = 3, iter = 1e3) # for speed only

## plot trends 
plot(fit, scale = 10e3)

## age-standardized rates
data(standard)
fit_stands <- standardize(fit,
                          label = standard$age,
                          standard_pop = standard$standard_pop)
print(fit_stands)
plot(fit_stands)

## percent change for age-standardized rates
fit_stands_apc <- apc(fit_stands)
plot(fit_stands_apc)
print(fit_stands_apc)

2000 U.S. standard million population

Description

2000 U.S. standard million population

Usage

standard
standard

Format

An object of class data.frame with 19 rows and 3 columns.

Source

National Cancer Institute. Standard Populations - 19 Age Groups. Accessed at https://seer.cancer.gov/stdpopulations/stdpop.19ages.html on Oct. 8, 2021.

Examples

data(standard)
head(standard)
data(standard)
head(standard)

Age-standardized rates

Description

Convert surveil model results to age standardized rates using a fixed age distribution

Usage

standardize(x, label, standard_pop)
standardize(x, label, standard_pop)

Arguments

`x`	A fitted `surveil` model
`label`	Labels (character strings) for the age groups that correspond to the values of `stand_pop`. All levels of the grouping variable used to fit the model (i.e., `names(x$data$cases)`) must be present in `label`.
`standard_pop`	Standard population values corresponding to the age groups specified by `label`.

Value

A list, also of class "stand_surveil", containing the entire contents of the user-provided surveil model plus the following:

standard_summary: summary data frame of standardized rates (means and 95 percent credible intervals)
standard_samples: a data frame of Markov chain Monte Carlo (MCMC) samples from the posterior probability distribution for the standardized rates
standard_label: user-provided age-group labels
standard_pop: user-provided standardized population sizes (ordered as standard_label)

Examples

data(cancer)
data(standard)

head(standard)
head(cancer)

cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
              chains = 2, iter = 1e3) # for speed only

stands <- standardize(fit,
                      label = standard$age,
                      standard_pop = standard$standard_pop)
print(stands)
plot(stands, style = "lines")

data(cancer)
data(standard)

head(standard)
head(cancer)

cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age))
fit <- stan_rw(cancer2, time = Year, group = Age,
              chains = 2, iter = 1e3) # for speed only

stands <- standardize(fit,
                      label = standard$age,
                      standard_pop = standard$standard_pop)
print(stands)
plot(stands, style = "lines")

Methods for `surveil_diff` objects

Description

Methods for surveil_diff objects

print surveil_diff objects for analyses of inequality

Usage

## S3 method for class 'surveil_diff'
plot(
  x,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "gray80",
  lwd,
  alpha,
  plot = TRUE,
  scale = 1e+05,
  PAR = TRUE,
  ncol = 3,
  base_size = 14,
  ...
)

## S3 method for class 'surveil_diff'
print(x, scale = 1, ...)
## S3 method for class 'surveil_diff'
plot(
  x,
  style = c("mean_qi", "lines"),
  M = 250,
  col = "black",
  fill = "gray80",
  lwd,
  alpha,
  plot = TRUE,
  scale = 1e+05,
  PAR = TRUE,
  ncol = 3,
  base_size = 14,
  ...
)

## S3 method for class 'surveil_diff'
print(x, scale = 1, ...)

Arguments

`x`	Object of class `surveil_diff`, as returned by calling `group_diff` on a fitted `surveil` model
`style`	If `style = "mean_qi"`, then the posterior mean and 95 percent credible interval will be plotted; if `style = "lines"`, then `M` samples from the joint probability distribution of the annual rates will be plotted.
`M`	If `style = "lines"`, then `M` is the number of samples from the posterior distribution that will be plotted; the default is `M = 250`.
`col`	Line color
`fill`	Fill color for credible intervals, passed to `geom_ribbon`
`lwd`	Linewidth
`alpha`	transparency; for `⁠style = "mean_qi", controls the credible interval shading; for ⁠`style = "lines"', this is applied to the lines
`plot`	If `plot = FALSE`, a list of `ggplot`s will be returned
`scale`	Print rates and rate differences as per `scale` at risk, e.g., per 10,000 at risk.
`PAR`	Return population attributable risk? IF `FALSE`, then the rate ratio will be used instead of PAR.
`ncol`	Number of columns for the plotting device. If `ncol = 1`, the three plots will be aligned vertically in one column; if `ncol = 3` they will b aligned horizontally in one row.
`base_size`	Passed to `theme_classic` to control size of plot elements (e.g., text)
`...`	additional print arguments

Value

plot.surveil_diff

By default or whenever plot = TRUE, the plot method draws a series of plots to the current plotting device using grid.arrange. If plot = FALSE, then a list of ggplots is returned.

print.surveil_diff

The print method returns nothing and prints a summary of results to the console.

Theil's inequality index

Description

Theil's entropy-based index of inequality

Usage

theil(x)

theil2(Count, Population, rates, total = TRUE)

## S3 method for class 'surveil'
theil(x)

## S3 method for class 'list'
theil(x)
theil(x)

theil2(Count, Population, rates, total = TRUE)

## S3 method for class 'surveil'
theil(x)

## S3 method for class 'list'
theil(x)

Arguments

`x`	A fitted `surveil` model, from `stan_rw`; or, a list of fitted `surveil` models, where each model represents a different geographic area (e.g., states).
`Count`	Case counts, integers
`Population`	Population at risk, integers
`rates`	If `Count` is not provided, then `rates` must be provided (`Count = rates * Population`).
`total`	If `total = TRUE`, Theil's index will be returned. Each unit contributes to Theil's index; if `total = FALSE`, all of the elements that sum to Theil's index will be returned.

Details

Theil's index is a good index of inequality in disease and mortality burdens when multiple groups are being considered. It provides a summary measure of inequality across a set of demographic groups that may be tracked over time (and/or space). Also, it is interesting because it is additive, and thus admits of simple decompositions.

The index measures discrepancies between a population's share of the disease burden, omega, and their share of the population, eta. A situation of zero inequality would imply that each population's share of cases is equal to its population share, or, omega=eta. Each population's contribution to total inequality is calculated as:

             T_i = omega_i * [log(omega_i/eta_i)],

the log-ratio of case-share to population-share, weighted by their share of cases. Theil's index for all areas is the sum of each area's T_i:

             T = sum_(i=1)^n T_i.

Theil's T is thus a weighted mean of log-ratios of case shares to population shares, where each log-ratio (which we may describe as a raw inequality score) is weighted by its share of total cases. The index has a minimum of zero and a maximum of log(N), where N is the number of units (e.g., number of states).

Theil's index, which is based on Shannon's information theory, can be extended to measure inequality across multiple groups nested within non-overlapping geographies (e.g., states).

Value

theil2

If total = TRUE (the default), theil2 returns Theil's index as a numeric value. Else, theil2 returns a vector of values that sum to Theil's index.

theil.surveil

A named list with the following elements:

summary: A data.frame summarizing the posterior probability distribution for Theil's T, including the mean and 95 percent credible interval for each time period
samples: A data.frame with MCMC samples for Theil's T

theil.list

A list (also of class theil_list) containing a summary data frame and a tbl_df containing MCMC samples for Theil's index at each time period.

The summary data frame includes the following columns:

time: time period
Theil: Posterior mean for Theil's index; equal to the sum of Theil_between and Theil_within.
Theil_between: The between-areas component to Theil's inequality index
Theil_within: The within-areas component to Theil's inequality index

Additional columns contain the upper and lower limits of the 95 percent credible intervals for each component of Theil's index.

The data frame of samples contains the following columns:

time: Time period indicator
.draw: An id for each MCMC sample; note that samples are from the joint distribution
Theil_between: The between-geographies component of Theil's index
Theil_within: The within-geographies component of Theil's index
Theil: Theil's inequality index (T = Between + Within)

Source

Conceicao, P. and P. Ferreira (2000). The young person's guide to the Theil Index: Suggesting intuitive interpretations and exploring analytical applications. University of Texas Inequality Project. UTIP Working Paper Number 14. Accessed May 1, 2021 from https://utip.gov.utexas.edu/papers.html

Conceicao, P, Galbraith, JK, Bradford, P. (2001). The Theil Index in sequences of nested and hierarchic grouping structures: implications for the measurement of inequality through time, with data aggregated at different levels of industrial classification. Eastern Economic Journal. 27(4): 491-514.

Theil, Henri (1972). Statistical Decomposition Analysis. Amsterdam, The Netherlands and London, UK: North-Holland Publishing Company.

Shannon, Claude E. and Weaver, Warren (1963). The Mathematical Theory of Communication. Urbana and Chicago, USA: University if Illinois Press.

Examples

 houston <- subset(msa, grepl("Houston", MSA))
 fit <- stan_rw(houston, time = Year, group = Race,
               chains = 2, iter = 900) # for speed only
 theil_dfw <- theil(fit)
 plot(theil_dfw)


Count <- c(10, 12, 3, 111)
Pop <- c(1000, 1200, 4000, 9000)
theil2(Count, Pop)
theil2(Count, Pop, total = FALSE)

houston <- subset(msa, grepl("Houston", MSA))
 fit <- stan_rw(houston, time = Year, group = Race,
               chains = 2, iter = 900) # for speed only
 theil_dfw <- theil(fit)
 plot(theil_dfw)


Count <- c(10, 12, 3, 111)
Pop <- c(1000, 1200, 4000, 9000)
theil2(Count, Pop)
theil2(Count, Pop, total = FALSE)

Widely Applicable Information Criteria

Description

Widely Application Information Criteria (WAIC) for model comparison

Usage

waic(fit, pointwise = FALSE, digits = 2)
waic(fit, pointwise = FALSE, digits = 2)

Arguments

`fit`	An `surveil` object
`pointwise`	Logical (defaults to `FALSE`); if `pointwise = TRUE`, a vector of values for each observation will be returned.
`digits`	Round results to this many digits.

Value

A vector of length 3 with WAIC, a rough measure of the effective number of parameters estimated by the model Eff_pars, and log predictive density Lpd. If pointwise = TRUE, results are returned in a data.frame.

Source

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely application information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571-3594.

Examples


data(msa)
austin_w <- subset(msa, grepl("Austin", MSA) & Race == "White")
fit <- stan_rw(austin_w, time = Year)
waic(fit)
 
data(msa)
austin_w <- subset(msa, grepl("Austin", MSA) & Race == "White")
fit <- stan_rw(austin_w, time = Year)
waic(fit)

Package 'surveil'

Help Index

The 'surveil' package

Description

Author(s)

References

See Also

Annual and cumulative percent change

Description

Usage

Arguments

Value

See Also

Examples

US cancer incidence by age, 1999-2017

Description

Usage

Format

Source

Examples

Measures of pairwise inequality

Description

Usage

Arguments

Details

Comparing incidence rates

Comparing age-standardized rates

Value

Author(s)

Source

See Also

Examples

Colorectal cancer incidence by Texas MSA, 1999-2017, ages 50-79

Description

Usage

Format

Source

Examples

Methods for fitted surveil models

Description

Usage

Arguments

Value

Author(s)

See Also

Examples

Methods for Theil's index

Description

Usage

Arguments

Value

plot.theil

plot.theil_list

print.theil

See Also

Methods for APC objects

Description

Usage

Arguments

Value

print

Plot

See Also

Methods for age-standardized rates

Description

Usage

Arguments

Details

print.stand_surveil

plot.stand_surveil

Value

print.stand_surveil

plot.stand_surveil

See Also

Prior distributions

Description

Usage

Arguments

Details

Parameterizations

Methods for fitted `surveil` models

Methods for `surveil_diff` objects