Title: | Time Series Models for Disease Surveillance |
---|---|
Description: | Fits time trend models for routine disease surveillance tasks and returns probability distributions for a variety of quantities of interest, including age-standardized rates, period and cumulative percent change, and measures of health inequality. The models are appropriate for count data such as disease incidence and mortality data, employing a Poisson or binomial likelihood and the first-difference (random-walk) prior for unknown risk. Optionally add a covariance matrix for multiple, correlated time series models. Inference is completed using Markov chain Monte Carlo via the Stan modeling language. References: Donegan, Hughes, and Lee (2022) <doi:10.2196/34589>; Stan Development Team (2021) <https://mc-stan.org>; Theil (1972, ISBN:0-444-10378-3). |
Authors: | Connor Donegan [aut, cre] |
Maintainer: | Connor Donegan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.3.0 |
Built: | 2025-02-03 04:57:45 UTC |
Source: | https://github.com/connordonegan/surveil |
Fits time series models for routine disease surveillance tasks and returns probability distributions for a variety of quantities of interest, including measures of health inequality, period and cumulative percent change, and age-standardized rates. Calculates Theil's index to measure inequality among multiple groups, and can be extended to measure inequality across multiple groups nested within geographies. Inference is completed using Markov chain Monte Carlo via the Stan modeling language. The models are appropriate for disease incidence and mortality data, employing a Poisson or binomial likelihood and first-difference (random-walk) prior for unknown risk, and optional covariance matrix for multiple correlated time series models.
Maintainer: Connor Donegan [email protected] (ORCID)
Brandt P, Williams JT. Multiple time series models. Thousand Oaks, CA: SAGE Publications, 2007. ISBN:9781412906562
Clayton DG. Generalized linear mixed models. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. Boca Raton, FL: CRC Press, 1996. p. 275-302. ISBN:9780412055515
Conceicao P, Galbraith JK, Bradford P. The Theil Index in sequences of nested and hierarchic grouping structures: implications for the measurement of inequality through time, with data aggregated at different levels of industrial classification. Eastern Economic Journal 2001;27(4):491-514.
Donegan C, Hughes AE, and Lee SC (2022). Colorectal Cancer Incidence, Inequalities, and Prevention Priorities in Urban Texas: Surveillance Study With the "surveil" Software Package. JMIR Public Health & Surveillance 8(8):e34589. doi:10.2196/34589
Stan Development Team (2020). RStan: the R interface to Stan. R package version 2.21.2. https://mc-stan.org
Theil H. Statistical decomposition analysis. Amsterdam, The Netherlands: North-Holland Publishing Company, 1972. ISBN:0444103783
Useful links:
Summarize annual and cumulative percent change in risk
apc(x) ## S3 method for class 'surveil' apc(x) ## S3 method for class 'stand_surveil' apc(x)
apc(x) ## S3 method for class 'surveil' apc(x) ## S3 method for class 'stand_surveil' apc(x)
x |
A fitted |
An apc
(list) object containing the following data frames:
A data frame containing a summary of the posterior distribution for period-specific percent change. This contains the posterior mean (apc
) 95 percent credible intervals (lwr
and upr
bounds).
A data frame containing a summary of the posterior distribution for the cumulative percent change in risk at each time period. This contains the posterior mean (cpc
) and 95 percent credible interval (lwr
and upr
bounds).
MCMC samples from the posterior distribution for period percent change
MCMC samples from the posterior distribution for cumulative percent change
plot.apc
print.apc
stan_rw
standardize
data(cancer) cancer2 <- subset(cancer, grepl("55|60|65|70", Age)) fit <- stan_rw(cancer2, time = Year, group = Age, iter = 900) # low iter for speed only x <- apc(fit) print(x) plot(x, cumulative = TRUE)
data(cancer) cancer2 <- subset(cancer, grepl("55|60|65|70", Age)) fit <- stan_rw(cancer2, time = Year, group = Age, iter = 900) # low iter for speed only x <- apc(fit) print(x) plot(x, cumulative = TRUE)
Annual cancer cases (all sites) by age group for the United States.
cancer
cancer
A data frame with the following columns:
Year of diagnosis
Age group
Number of cancer cases
Age-specific population estimates
United States Cancer Statistics - Incidence: 1999 - 2017, WONDER Online Database. United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2020. Accessed at http://wonder.cdc.gov/cancer-v2017.html on Oct 6, 2021 12:38:09 PM
data(cancer) head(cancer)
data(cancer) head(cancer)
Calculate pairwise measures of health inequality from a fitted surveil
time series model, with credible intervals and MCMC samples. Calculates absolute and fractional rate differences (RD and population attributable risk (PAR)), rate ratios, and excess cases.
group_diff(x, target, reference) ## S3 method for class 'surveil' group_diff(x, target, reference) ## S3 method for class 'list' group_diff(x, ...)
group_diff(x, target, reference) ## S3 method for class 'surveil' group_diff(x, target, reference) ## S3 method for class 'list' group_diff(x, ...)
x |
Either a fitted |
target |
The name (character string) of the disadvantaged group that is the target of inference. If |
reference |
The name (character string) of the reference group to which |
... |
Additional arguments (not used). |
For the following calculations, the terms reference
and target
refer to incidence rates for the respective groups; p
is the size of the target population. (Target is the group that is the 'target' of our inferences, so that it is the numerator in rate ratios, etc.) The following measures are calculated by group_diff
:
# rate difference RD = target - reference # population attributable fraction PAR = RD/target = (RR - 1)/RR # rate ratio RR = target/reference # excess cases EC = RD * p
As the math communicates, the PAR is the rate difference expressed as a fraction of total risk for the target population. This could also be read as the fraction of risk in the target population that would have been removed had the target rate equaled the reference rate (Menvielle et al. 2017).
If the user provides a list of stand_surveil
objects with age-standardized rates (instead of a single surveil
model), then the exact calculations will be completed as follows. The RR is simply the ratio of age-standardized rates, and the rate difference is similarly the difference between age-standardized rates. However, excess cases is calculated for each age group separately, and the total excess cases across all age groups is returned. Similarly, the attributable risk is calculated by taking the total excess cases across all age groups per year and dividing by the total risk (i.e., by the sum of the whole number of cases across all age groups). Cumulative excess cases is the sum of the time-period specific total number of excess cases. (Notice that the PAR is not equal to (RR-1)/RR when the PAR is derived from a number of age-specific rates and the RR is based on age-standardized rates.)
A list, also of class "surveil_diff", with the following elements:
A tibble with a summary of posterior distributions (mean and 95 percent cred. intervals) for the target group incidence rate, the RD, RR, PAR, and excess cases.
Summary of the posterior distribution for the cumulative number of excess cases and the PAR (mean and 95 percent cred. intervals)
Character string with target and reference population names
A data frame of MCMC samples for each quantity of interest (target and reference rates, RD, RR, PAR, and EC, as well as Trend_Cases = Rate * Population
). Indexed by time.
MCMC samples of the cumulative number of excess cases.
Connor Donegan ([email protected])
Menvielle, G, Kulhanaova, I, Machenbach, JP. Assessing the impact of a public health intervention to reduce social inequalities in cancer. In: Vaccarella, S, Lortet-Tieulent, J, Saracci, R, Conway, D, Straif, K, Wild, CP, editors. Reducing Social Inequalities in Cancer: Evidence and Priorities for Research. Geneva, Switzerland: WHO Press, 2017:185-192.
plot.surveil_diff
print.surveil_diff
theil
data(msa) dat = subset(msa, grepl("Houston", MSA) & grepl("Black|White", Race)) fit <- stan_rw(dat, time = Year, group = Race, iter = 1e3) # low iter for speed only gd <- group_diff(fit, "Black or African American", "White") print(gd, scale = 100e3) plot(gd, scale = 100e3)
data(msa) dat = subset(msa, grepl("Houston", MSA) & grepl("Black|White", Race)) fit <- stan_rw(dat, time = Year, group = Race, iter = 1e3) # low iter for speed only gd <- group_diff(fit, "Black or African American", "White") print(gd, scale = 100e3) plot(gd, scale = 100e3)
Annual counts of colorectal cancer (cancer of colon or rectum), ages 50-79, for Texas's top four metropolitan statistical areas (MSAs), with population at risk estimates, by race-ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic/Latino).
msa
msa
A tibble
with the following attributes:
Year of diagnosis
Race-ethnicity designation
Metropolitan statistical area
Number of CRC cases
Age-specific population estimate
United States Cancer Statistics–Incidence: 1999-2017, WONDER Online Database. United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; 2020. Accessed at http://wonder.cdc.gov/cancer-v2017.html on Nov 9, 2020 2:59:24 PM.
data(msa) head(msa)
data(msa) head(msa)
surveil
modelsPrint and plot methods for surveil
model results
## S3 method for class 'surveil' print(x, scale = 1, ...) ## S3 method for class 'surveil' plot( x, scale = 1, style = c("mean_qi", "lines"), facet = FALSE, facet_scales = c("fixed", "free"), ncol = NULL, base_size = 14, palette = "Dark2", M = 250, alpha, lwd, fill = "gray80", size = 1.5, ... ) ## S3 method for class 'list' plot( x, scale = 1, style = c("mean_qi", "lines"), facet = FALSE, ncol, facet_scales = c("fixed", "free"), M = 250, base_size = 14, palette = "Dark2", fill = "gray80", size = 1.5, alpha, lwd, ... )
## S3 method for class 'surveil' print(x, scale = 1, ...) ## S3 method for class 'surveil' plot( x, scale = 1, style = c("mean_qi", "lines"), facet = FALSE, facet_scales = c("fixed", "free"), ncol = NULL, base_size = 14, palette = "Dark2", M = 250, alpha, lwd, fill = "gray80", size = 1.5, ... ) ## S3 method for class 'list' plot( x, scale = 1, style = c("mean_qi", "lines"), facet = FALSE, ncol, facet_scales = c("fixed", "free"), M = 250, base_size = 14, palette = "Dark2", fill = "gray80", size = 1.5, alpha, lwd, ... )
x |
A fitted |
scale |
Scale the rates by this amount; e.g., |
... |
For the plot method, additional arguments will be passed to ' |
style |
If |
facet |
If |
facet_scales |
When |
ncol |
Number of columns for the plotting device; optional and only used if |
base_size |
Passed to |
palette |
For multiple groups, choose the color palette. For a list of options, see |
M |
If |
alpha |
Numeric value from zero to one. When |
lwd |
Numeric value indicating linewidth. Passed to |
fill |
Color for the shaded credible intervals; only used when |
size |
Positive numeric value. For |
The plot method returns a ggplot
object; the print method returns nothing but prints a summary of results to the R console. If x
is a list of stand_surveil
objects, the plotted lines will be labeled using the names returned by names(x)
; if elements of the list are not named, plotted lines will simply be numbered.
Connor Donegan ([email protected])
data(msa) dat <- aggregate(cbind(Count, Population) ~ Year + Race, data = msa, FUN = sum) fit <- stan_rw(dat, time = Year, group = Race, iter = 1e3, chains = 2) # for speed only, use 4 chains print(fit) plot(fit, facet = TRUE, scale = 100e3) ## as a ggplot, you can customize the output library(ggplot2) plot(fit) + theme_bw()
data(msa) dat <- aggregate(cbind(Count, Population) ~ Year + Race, data = msa, FUN = sum) fit <- stan_rw(dat, time = Year, group = Race, iter = 1e3, chains = 2) # for speed only, use 4 chains print(fit) plot(fit, facet = TRUE, scale = 100e3) ## as a ggplot, you can customize the output library(ggplot2) plot(fit) + theme_bw()
Printing and plotting methods for Theil's inequality index
## S3 method for class 'theil' plot( x, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "black", alpha, lwd, base_size = 14, scale = 100, labels = x$summary$time, ... ) ## S3 method for class 'theil_list' plot( x, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "black", alpha, lwd, between_title = "Between", within_title = "Within", total_title = "Total", scale = 100, plot = TRUE, ncol = 3, base_size = 14, ... ) ## S3 method for class 'theil' print(x, scale = 100, digits = 3, ...) ## S3 method for class 'theil_list' print(x, scale = 100, digits = 3, ...)
## S3 method for class 'theil' plot( x, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "black", alpha, lwd, base_size = 14, scale = 100, labels = x$summary$time, ... ) ## S3 method for class 'theil_list' plot( x, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "black", alpha, lwd, between_title = "Between", within_title = "Within", total_title = "Total", scale = 100, plot = TRUE, ncol = 3, base_size = 14, ... ) ## S3 method for class 'theil' print(x, scale = 100, digits = 3, ...) ## S3 method for class 'theil_list' print(x, scale = 100, digits = 3, ...)
x |
An object of class |
style |
If |
M |
If |
col |
Line color |
fill |
Fill color |
alpha |
For |
lwd |
Line width; for |
base_size |
Passed to |
scale |
Scale Theil's index by |
labels |
x-axis labels (time periods) |
... |
additional arguments |
between_title |
Plot title for the between geography component of Theil's T; defaults to "Between". |
within_title |
Plot title for the within geography component of Theil's T; defaults to "Within". |
total_title |
Plot title for Theil's index; defaults to "Total". |
plot |
If |
ncol |
Number of columns for the plotting device. If |
digits |
number of digits to print (passed to |
The plot method returns an object of class ggplot
.
If style = "lines"
, the plot method for theil_list
objects returns a ggplot
with facets for each component of inequality (between-areas, within-areas, and total). For style = "mean_qi"
, the plot method returns either a list of plots (all of class ggplot
) or, when plot = TRUE
, it will draw them to current plotting device using grid.arrange
.
The print returns nothing and method prints a summary of results to the R console.
Methods for APC objects
## S3 method for class 'apc' print(x, digits = 1, max = 20, ...) ## S3 method for class 'apc' plot( x, cumulative = FALSE, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "black", alpha, lwd, base_size = 14, ... )
## S3 method for class 'apc' print(x, digits = 1, max = 20, ...) ## S3 method for class 'apc' plot( x, cumulative = FALSE, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "black", alpha, lwd, base_size = 14, ... )
x |
An |
digits |
Print this many digits (passed to |
max |
Maximum number of time periods (rows) to print |
... |
additional arguments; for the print argument, these will be passed to |
cumulative |
Plot cumulative percent change? Defaults to |
style |
If |
M |
If |
col |
Line color |
fill |
Fill color for the 95 percent credible interval |
alpha |
For |
lwd |
Line width |
base_size |
Size of plot attributes, passed to ' |
The print method does not have a return value, but prints a summary of results to the R console.
The plot method returns a ggplot
.
Print and plot methods for stand_surveil
(standardized rates obtained from a fitted surveil
model)
## S3 method for class 'stand_surveil' print(x, scale = 1, digits = 3, ...) ## S3 method for class 'stand_surveil' plot( x, scale = 1, style = c("mean_qi", "lines"), M = 250, base_size = 14, col = "black", fill = "gray80", alpha, lwd, ... )
## S3 method for class 'stand_surveil' print(x, scale = 1, digits = 3, ...) ## S3 method for class 'stand_surveil' plot( x, scale = 1, style = c("mean_qi", "lines"), M = 250, base_size = 14, col = "black", fill = "gray80", alpha, lwd, ... )
x |
An object of |
scale |
Scale the rates by this amount; e.g., |
digits |
Number of digits to print |
... |
additional arguments |
style |
If |
M |
Number of samples to plot when |
base_size |
Passed to |
col |
Line color |
fill |
Fill color for the 95 percent credible intervals |
alpha |
For |
lwd |
Line width; for |
Calling standardize
on a fitted surveil
model will create a new object that contains the surveil
model results as well standardized rates. This new stand_surveil
object has its own methods for printing and plotting.
Any additional arguments (...
) will be passed to print.data.frame
Any additional arguments (...
) will be passed to 'theme
.
The print method returns nothing but prints a summary of results to the console.
The plot method returns an object of class ggplot
.
Prior distributions
normal(location = 0, scale, k = 1) lkj(eta)
normal(location = 0, scale, k = 1) lkj(eta)
location |
Location parameter (numeric) |
scale |
Scale parameter (positive numeric) |
k |
Optional; number of groups for which priors are needed. This is a shortcut to avoid using the |
eta |
The shape parameter for the LKJ prior |
The prior distribution functions are used to set the values of prior parameters.
Users can control the values of the parameters, but the distribution (model) itself is fixed. The first log-rate (eta[t]
, t=1
) and the scale parameters (sigma) are assigned Gaussian (normal
) prior distribution. (The scale parameter, sigma, is constrained to be positive, making it a half-normal prior.) For correlated time series, the correlation matrix is assigned the LKJ prior.
For details on how any distribution is parameterized, see the Stan Language Functions Reference document: https://mc-stan.org/users/documentation/.
The LKJ prior for correlation matrix has a single parameter, eta (eta > 0). If eta=1
, then you are placing a uniform prior on any K-by-K correlation matrix. For eta > 1, there is a higher probability on the identity matrix, such that as eta increases beyond 1, you are expressing greater skepticism towards large correlations. If 0 < eta < 1, then you will be expressing skepticism towards correlations of zero and favoring non-zero correlations. See Stan documentation: https://mc-stan.org/docs/2_27/functions-reference/lkj-correlation.html.
An object of class prior
which will be used internally by surveil to set parameters of prior distributions.
Stan Development Team. Stan Functions Reference Version 2.27. https://mc-stan.org/docs/2_27/functions-reference/lkj-correlation.html
# note there are three groups in the data, each requires a prior prior <- list() prior$eta_1 <- normal(location = -6, scale = 4, k = 3) ## by default, location = 0 prior$sigma <- normal(scale = 1, k = 3) prior$omega <- lkj(2) dfw <- msa[grep("Dallas", msa$MSA), ] fit <- stan_rw(dfw, time = Year, group = Race, prior = prior, chains = 2, iter = 900) # for speed only plot(fit)
# note there are three groups in the data, each requires a prior prior <- list() prior$eta_1 <- normal(location = -6, scale = 4, k = 3) ## by default, location = 0 prior$sigma <- normal(scale = 1, k = 3) prior$omega <- lkj(2) dfw <- msa[grep("Dallas", msa$MSA), ] fit <- stan_rw(dfw, time = Year, group = Race, prior = prior, chains = 2, iter = 900) # for speed only plot(fit)
Model time-varying incidence rates given a time series of case (or death) counts and population at risk.
stan_rw( data, group, time, cor = FALSE, family = poisson(), prior = list(), chains = 4, cores = 1, iter = 3000, refresh = 1500, control = list(adapt_delta = 0.98), ... )
stan_rw( data, group, time, cor = FALSE, family = poisson(), prior = list(), chains = 4, cores = 1, iter = 3000, refresh = 1500, control = list(adapt_delta = 0.98), ... )
data |
A
|
group |
If |
time |
Specify the (unquoted) name of the time variable in |
cor |
For correlated random walks use |
family |
The default specification is a Poisson model with log link function ( |
prior |
Optionally provide a named
|
chains |
Number of independent MCMC chains to initiate (passed to |
cores |
The number of cores to use when executing the Markov chains in parallel (passed to |
iter |
Total number of MCMC iterations. Warmup draws are automatically half of |
refresh |
How often to print the MCMC sampling progress to the console. |
control |
A named list of parameters to control Stan's sampling behavior. The most common parameters to control are |
... |
Other arguments passed to |
By default, the models have Poisson likelihoods for the case counts, with log link function. Alternatively, a Binomial model with logit link function can be specified using the family
argument (family = binomial()
).
For time t = 1,...n, the models assign Poisson probability distribution to the case counts, given log-risk eta
and population at tirks P; the log-risk is modeled using the first-difference (or random-walk) prior:
y_t ~ Poisson(p_t * exp(eta_t)) eta_t ~ Normal(eta_{t-1}, sigma) eta_1 ~ Normal(-6, 5) (-Inf, 0) sigma ~ Normal(0, 1) (0, Inf)
This style of model has been discussed in Bayesian (bio)statistics for quite some time. See Clayton (1996).
The above model can be used for multiple distinct groups; in that case, each group will have its own independent time series model.
It is also possible to add a correlation structure to that set of models. Let Y_t
be a k-length vector of observations for each of k groups at time t (the capital letter now indicates a vector), then:
Y_t ~ Poisson(P_t * exp(Eta_t)) Eta_t ~ MVNormal(Eta_{t-1}, Sigma) Eta_1 ~ Normal(-6, 5) (-Inf, 0) Sigma = diag(sigma) * Omega * diag(sigma) Omega ~ LKJ(2) sigma ~ Normal(0, 1) (0, Inf)
where Omega
is a correlation matrix and diag(sigma)
is a diagonal matrix with scale parameters on the diagonal. This was adopted from Brandt and Williams (2007); for the LKJ prior, see the Stan Users Guide and Reference Manual.
If the binomial model is used instead of the Poisson, then the first line of the model specifications will be:
y_t ~ binomial(P_t, inverse_logit(eta_t))
All else is remains the same. The logit function is log(r/(1-r))
, where r
is a rate between zero and one; the inverse-logit function is exp(x)/(1 + exp(x))
.
The function returns a list, also of class surveil
, containing the following elements:
A data.frame
with posterior means and 95 percent credible intervals, as well as the raw data (Count, Population, time period, grouping variable if any, and crude rates).
A stanfit
object returned by sampling
. This contains the MCMC samples from the posterior distribution of the fitted model.
Logical value indicating if the model included a correlation structure.
A list containing the name of the time-period column in the user-provided data and a data.frame
of observed time periods and their index.
If a grouping variable was used, this will be a list containing the name of the grouping variable and a data.frame
with group labels and index values.
The user-provided family
argument.
Connor Donegan ([email protected])
Brandt P and Williams JT. Multiple time series models. Thousand Oaks, CA: SAGE Publications, 2007.
Clayton, DG. Generalized linear mixed models. In: Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Boca Raton, FL: CRC Press, 1996. p. 275-302.
Donegan C, Hughes AE, and Lee SC (2022). Colorectal Cancer Incidence, Inequalities, and Prevention Priorities in Urban Texas: Surveillance Study With the "surveil" Software Package. JMIR Public Health & Surveillance 8(8):e34589. doi:10.2196/34589
Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, 2.28. 2021. https://mc-stan.org
vignette("demonstration", package = "surveil")
vignette("age-standardization", package = "surveil")
apc
standardize
data(msa) dat <- aggregate(cbind(Count, Population) ~ Year, data = msa, FUN = sum) fit <- stan_rw(dat, time = Year) ## print summary of results print(fit) print(fit$summary) ## plot time trends (rates per 10,000) plot(fit, scale = 10e3) plot(fit, style = 'lines', scale = 10e3) ## Summary with MCMC diagnostics (n_eff, Rhat; from Rstan) print(fit$samples) ## cumulative percent change fit_pc <- apc(fit) print(fit_pc$cpc) plot(fit_pc, cumulative = TRUE) ## age-specific rates data(cancer) cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age)) fit <- stan_rw(cancer2, time = Year, group = Age, chains = 3, iter = 1e3) # for speed only ## plot trends plot(fit, scale = 10e3) ## age-standardized rates data(standard) fit_stands <- standardize(fit, label = standard$age, standard_pop = standard$standard_pop) print(fit_stands) plot(fit_stands) ## percent change for age-standardized rates fit_stands_apc <- apc(fit_stands) plot(fit_stands_apc) print(fit_stands_apc)
data(msa) dat <- aggregate(cbind(Count, Population) ~ Year, data = msa, FUN = sum) fit <- stan_rw(dat, time = Year) ## print summary of results print(fit) print(fit$summary) ## plot time trends (rates per 10,000) plot(fit, scale = 10e3) plot(fit, style = 'lines', scale = 10e3) ## Summary with MCMC diagnostics (n_eff, Rhat; from Rstan) print(fit$samples) ## cumulative percent change fit_pc <- apc(fit) print(fit_pc$cpc) plot(fit_pc, cumulative = TRUE) ## age-specific rates data(cancer) cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age)) fit <- stan_rw(cancer2, time = Year, group = Age, chains = 3, iter = 1e3) # for speed only ## plot trends plot(fit, scale = 10e3) ## age-standardized rates data(standard) fit_stands <- standardize(fit, label = standard$age, standard_pop = standard$standard_pop) print(fit_stands) plot(fit_stands) ## percent change for age-standardized rates fit_stands_apc <- apc(fit_stands) plot(fit_stands_apc) print(fit_stands_apc)
2000 U.S. standard million population
standard
standard
An object of class data.frame
with 19 rows and 3 columns.
National Cancer Institute. Standard Populations - 19 Age Groups. Accessed at https://seer.cancer.gov/stdpopulations/stdpop.19ages.html on Oct. 8, 2021.
data(standard) head(standard)
data(standard) head(standard)
Convert surveil
model results to age standardized rates using a fixed age distribution
standardize(x, label, standard_pop)
standardize(x, label, standard_pop)
x |
A fitted |
label |
Labels (character strings) for the age groups that correspond to the values of |
standard_pop |
Standard population values corresponding to the age groups specified by |
A list, also of class "stand_surveil", containing the entire contents of the user-provided surveil
model plus the following:
summary data frame of standardized rates (means and 95 percent credible intervals)
a data frame of Markov chain Monte Carlo (MCMC) samples from the posterior probability distribution for the standardized rates
user-provided age-group labels
user-provided standardized population sizes (ordered as standard_label
)
vignette("age-standardization", package = "surveil")
stan_rw
plot.stand_surveil
print.stand_surveil
data(cancer) data(standard) head(standard) head(cancer) cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age)) fit <- stan_rw(cancer2, time = Year, group = Age, chains = 2, iter = 1e3) # for speed only stands <- standardize(fit, label = standard$age, standard_pop = standard$standard_pop) print(stands) plot(stands, style = "lines")
data(cancer) data(standard) head(standard) head(cancer) cancer2 <- subset(cancer, grepl("55-59|60-64|65-69", Age)) fit <- stan_rw(cancer2, time = Year, group = Age, chains = 2, iter = 1e3) # for speed only stands <- standardize(fit, label = standard$age, standard_pop = standard$standard_pop) print(stands) plot(stands, style = "lines")
surveil_diff
objectsMethods for surveil_diff
objects
print surveil_diff objects for analyses of inequality
## S3 method for class 'surveil_diff' plot( x, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "gray80", lwd, alpha, plot = TRUE, scale = 1e+05, PAR = TRUE, ncol = 3, base_size = 14, ... ) ## S3 method for class 'surveil_diff' print(x, scale = 1, ...)
## S3 method for class 'surveil_diff' plot( x, style = c("mean_qi", "lines"), M = 250, col = "black", fill = "gray80", lwd, alpha, plot = TRUE, scale = 1e+05, PAR = TRUE, ncol = 3, base_size = 14, ... ) ## S3 method for class 'surveil_diff' print(x, scale = 1, ...)
x |
Object of class |
style |
If |
M |
If |
col |
Line color |
fill |
Fill color for credible intervals, passed to |
lwd |
Linewidth |
alpha |
transparency; for |
plot |
If |
scale |
Print rates and rate differences as per |
PAR |
Return population attributable risk? IF |
ncol |
Number of columns for the plotting device. If |
base_size |
Passed to |
... |
additional print arguments |
By default or whenever plot = TRUE
, the plot method draws a series of plots to the current plotting device using grid.arrange
. If plot = FALSE
, then a list of ggplot
s is returned.
The print method returns nothing and prints a summary of results to the console.
Theil's entropy-based index of inequality
theil(x) theil2(Count, Population, rates, total = TRUE) ## S3 method for class 'surveil' theil(x) ## S3 method for class 'list' theil(x)
theil(x) theil2(Count, Population, rates, total = TRUE) ## S3 method for class 'surveil' theil(x) ## S3 method for class 'list' theil(x)
x |
A fitted |
Count |
Case counts, integers |
Population |
Population at risk, integers |
rates |
If |
total |
If |
Theil's index is a good index of inequality in disease and mortality burdens when multiple groups are being considered. It provides a summary measure of inequality across a set of demographic groups that may be tracked over time (and/or space). Also, it is interesting because it is additive, and thus admits of simple decompositions.
The index measures discrepancies between a population's share of the disease burden, omega
, and their share of the population, eta
. A situation of zero inequality would imply that each population's share of cases is equal to its population share, or, omega=eta
. Each population's contribution to total inequality is calculated as:
T_i = omega_i * [log(omega_i/eta_i)],
the log-ratio of case-share to population-share, weighted by their share of cases. Theil's index for all areas is the sum of each area's T_i:
T = sum_(i=1)^n T_i.
Theil's T is thus a weighted mean of log-ratios of case shares to population shares, where each log-ratio (which we may describe as a raw inequality score) is weighted by its share of total cases. The index has a minimum of zero and a maximum of log(N)
, where N
is the number of units (e.g., number of states).
Theil's index, which is based on Shannon's information theory, can be extended to measure inequality across multiple groups nested within non-overlapping geographies (e.g., states).
If total = TRUE
(the default), theil2
returns Theil's index as a numeric value. Else, theil2
returns a vector of values that sum to Theil's index.
A named list with the following elements:
A data.frame
summarizing the posterior probability distribution for Theil's T, including the mean and 95 percent credible interval for each time period
A data.frame
with MCMC samples for Theil's T
A list (also of class theil_list
) containing a summary data frame and a tbl_df
containing MCMC samples for Theil's index at each time period.
The summary data frame includes the following columns:
time period
Posterior mean for Theil's index; equal to the sum of Theil_between
and Theil_within
.
The between-areas component to Theil's inequality index
The within-areas component to Theil's inequality index
Additional columns contain the upper and lower limits of the 95 percent credible intervals for each component of Theil's index.
The data frame of samples contains the following columns:
Time period indicator
An id for each MCMC sample; note that samples are from the joint distribution
The between-geographies component of Theil's index
The within-geographies component of Theil's index
Theil's inequality index (T = Between + Within)
.
Conceicao, P. and P. Ferreira (2000). The young person's guide to the Theil Index: Suggesting intuitive interpretations and exploring analytical applications. University of Texas Inequality Project. UTIP Working Paper Number 14. Accessed May 1, 2021 from https://utip.gov.utexas.edu/papers.html
Conceicao, P, Galbraith, JK, Bradford, P. (2001). The Theil Index in sequences of nested and hierarchic grouping structures: implications for the measurement of inequality through time, with data aggregated at different levels of industrial classification. Eastern Economic Journal. 27(4): 491-514.
Theil, Henri (1972). Statistical Decomposition Analysis. Amsterdam, The Netherlands and London, UK: North-Holland Publishing Company.
Shannon, Claude E. and Weaver, Warren (1963). The Mathematical Theory of Communication. Urbana and Chicago, USA: University if Illinois Press.
plot.theil
print.theil
plot.theil_list
houston <- subset(msa, grepl("Houston", MSA)) fit <- stan_rw(houston, time = Year, group = Race, chains = 2, iter = 900) # for speed only theil_dfw <- theil(fit) plot(theil_dfw) Count <- c(10, 12, 3, 111) Pop <- c(1000, 1200, 4000, 9000) theil2(Count, Pop) theil2(Count, Pop, total = FALSE)
houston <- subset(msa, grepl("Houston", MSA)) fit <- stan_rw(houston, time = Year, group = Race, chains = 2, iter = 900) # for speed only theil_dfw <- theil(fit) plot(theil_dfw) Count <- c(10, 12, 3, 111) Pop <- c(1000, 1200, 4000, 9000) theil2(Count, Pop) theil2(Count, Pop, total = FALSE)
Widely Application Information Criteria (WAIC) for model comparison
waic(fit, pointwise = FALSE, digits = 2)
waic(fit, pointwise = FALSE, digits = 2)
fit |
An |
pointwise |
Logical (defaults to |
digits |
Round results to this many digits. |
A vector of length 3 with WAIC
, a rough measure of the effective number of parameters estimated by the model Eff_pars
, and log predictive density Lpd
. If pointwise = TRUE
, results are returned in a data.frame
.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely application information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571-3594.
data(msa) austin_w <- subset(msa, grepl("Austin", MSA) & Race == "White") fit <- stan_rw(austin_w, time = Year) waic(fit)
data(msa) austin_w <- subset(msa, grepl("Austin", MSA) & Race == "White") fit <- stan_rw(austin_w, time = Year) waic(fit)