Title: | Bayesian Spatial Analysis |
---|---|
Description: | For spatial data analysis; provides exploratory spatial analysis tools, spatial regression, spatial econometric, and disease mapping models, model diagnostics, and special methods for inference with small area survey data (e.g., the America Community Survey (ACS)) and censored population health monitoring data. Models are pre-specified using the Stan programming language, a platform for Bayesian inference using Markov chain Monte Carlo (MCMC). References: Carpenter et al. (2017) <doi:10.18637/jss.v076.i01>; Donegan (2021) <doi:10.31219/osf.io/3ey65>; Donegan (2022) <doi:10.21105/joss.04716>; Donegan, Chun and Hughes (2020) <doi:10.1016/j.spasta.2020.100450>; Donegan, Chun and Griffith (2021) <doi:10.3390/ijerph18136856>; Morris et al. (2019) <doi:10.1016/j.sste.2019.100301>. |
Authors: | Connor Donegan [aut, cre] |
Maintainer: | Connor Donegan <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.8.1 |
Built: | 2025-02-14 06:01:08 UTC |
Source: | https://github.com/connordonegan/geostan |
Bayesian spatial modeling powered by Stan. geostan provides access to a variety of hierarchical spatial models using the R formula interface, supporting a complete spatial analysis workflow with a suite of spatial analysis tools. It is designed primarily for public health and social science research but is generally applicable to modeling areal data. Unique features of the package include its spatial measurement error model (for inference with small area estimates such as those from the American Community Survey), its fast proper conditional autoregressive (CAR) and simultaneous autoregressive (SAR) models, and its eigenvector spatial filtering (ESF) models. The package also supports spatial regression with raster layers.
Maintainer: Connor Donegan [email protected] (ORCID)
Other contributors:
Mitzi Morris [contributor]
Amy Tims [contributor]
Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., Riddell, A., 2017. Stan: A probabilistic programming language. Journal of statistical software 76. doi:10.18637/jss.v076.i01.
Donegan, C., Y. Chun and A. E. Hughes (2020). Bayesian estimation of spatial filters with Moran’s Eigenvectors and hierarchical shrinkage priors. Spatial Statistics. doi:10.1016/j.spasta.2020.100450 (open access: doi:10.31219/osf.io/fah3z).
Donegan, Connor and Chun, Yongwan and Griffith, Daniel A. (2021). Modeling community health with areal data: Bayesian inference with survey standard errors and spatial structure. Int. J. Env. Res. and Public Health 18 (13): 6856. doi:10.3390/ijerph18136856. Supplementary material: https://github.com/ConnorDonegan/survey-HBM.
Donegan, Connor (2021). Building spatial conditional autoregressive models in the Stan programming language. OSF Preprints. doi:10.31219/osf.io/3ey65.
Donegan, Connor (2022) geostan: An R package for Bayesian spatial analysis. The Journal of Open Source Software. 7, no. 79: 4716. doi:10.21105/joss.04716.
Gabry, J., Goodrich, B. and Lysy, M. (2020). rstantools: Tools for developers of R packages interfacing with Stan. R package version 2.1.1 https://mc-stan.org/rstantools/.
Morris, M., Wheeler-Martin, K., Simpson, D., Mooney, S. J., Gelman, A., & DiMaggio, C. (2019). Bayesian hierarchical spatial models: Implementing the Besag York Mollié model in stan. Spatial and spatio-temporal epidemiology, 31, 100301. doi:10.1016/j.sste.2019.100301.
Stan Development Team (2019). RStan: the R interface to Stan. R package version 2.19.2. https://mc-stan.org
Useful links:
Report bugs at https://github.com/ConnorDonegan/geostan/issues
Creates a list of connected nodes following the graph representation of a spatial connectivity matrix.
edges(C, unique_pairs_only = TRUE, shape)
edges(C, unique_pairs_only = TRUE, shape)
C |
A connectivity matrix where connection between two nodes is indicated by non-zero entries. |
unique_pairs_only |
By default, only unique pairs of nodes (i, j) will be included in the output. |
shape |
Optional spatial object (geometry) to which |
This is used internally for stan_icar
, can be helpful for creating the scaling factor for BYM2 models fit with stan_icar
, and can be used for visualizing a spatial connectivity matrix.
If shape
is missing, this returns a data.frame
with three columns. The first two columns (node1
and node2
) contain the indices of connected pairs of nodes; only unique pairs of nodes are included (unless unique_pairs_only = FALSE
). The third column (weight
) contains the corresponding matrix element, C[node1, node2]
.
If shape
is provided, the results are joined to an sf
object so the connections can be visualized.
shape2mat
, prep_icar_data
, stan_icar
data(sentencing) C <- shape2mat(sentencing) nbs <- edges(C) head(nbs) ## similar to: head(Matrix::summary(C)) head(Matrix::summary(shape2mat(georgia, "W"))) ## add geometry for plotting library(sf) E <- edges(C, shape = sentencing) g1 = st_geometry(E) g2 = st_geometry(sentencing) plot(g1, lwd = .2) plot(g2, add = TRUE)
data(sentencing) C <- shape2mat(sentencing) nbs <- edges(C) head(nbs) ## similar to: head(Matrix::summary(C)) head(Matrix::summary(shape2mat(georgia, "W"))) ## add geometry for plotting library(sf) E <- edges(C, shape = sentencing) g1 = st_geometry(E) g2 = st_geometry(sentencing) plot(g1, lwd = .2) plot(g2, add = TRUE)
Expected value for the Moran coefficient of model residuals under the null hypothesis of no spatial autocorrelation.
expected_mc(X, C)
expected_mc(X, C)
X |
model matrix, including column of ones. |
C |
Connectivity matrix. |
Returns a numeric value.
Chun, Yongwan and Griffith, Daniel A. (2013). Spatial statistics and geostatistics. Sage, p. 18.
data(georgia) C <- shape2mat(georgia) X <- model.matrix(~ college, georgia) expected_mc(X, C)
data(georgia) C <- shape2mat(georgia) X <- model.matrix(~ college, georgia) expected_mc(X, C)
Prepare data for spatial filtering
make_EV(C, nsa = FALSE, threshold = 0.2, values = FALSE)
make_EV(C, nsa = FALSE, threshold = 0.2, values = FALSE)
C |
A binary spatial weights matrix. See |
nsa |
Logical. Default of |
threshold |
Defaults to |
values |
Should eigenvalues be returned also? Defaults to |
Returns a set of eigenvectors related to the Moran coefficient (MC), limited to those eigenvectors with |MC| > threshold
if nsa = TRUE
or MC > threshold
if nsa = FALSE
, optionally with corresponding eigenvalues.
A data.frame
of eigenvectors for spatial filtering. If values=TRUE
then a named list is returned with elements eigenvectors
and eigenvalues
.
Daniel Griffith and Yongwan Chun. 2014. "Spatial Autocorrelation and Spatial Filtering." in M. M. Fischer and P. Nijkamp (eds.), Handbook of Regional Science. Springer.
library(ggplot2) data(georgia) C <- shape2mat(georgia, style = "B") EV <- make_EV(C) head(EV) ggplot(georgia) + geom_sf(aes(fill = EV[,1])) + scale_fill_gradient2()
library(ggplot2) data(georgia) C <- shape2mat(georgia, style = "B") EV <- make_EV(C) head(EV) ggplot(georgia) + geom_sf(aes(fill = EV[,1])) + scale_fill_gradient2()
The Moran coefficient, a measure of spatial autocorrelation (also known as Global Moran's I)
mc(x, w, digits = 3, warn = TRUE, na.rm = FALSE)
mc(x, w, digits = 3, warn = TRUE, na.rm = FALSE)
x |
Numeric vector of input values, length n. |
w |
An n x n spatial connectivity matrix. See |
digits |
Number of digits to round results to. |
warn |
If |
na.rm |
If |
The formula for the Moran coefficient (MC) is
where is the number of observations and
is the sum of all values in the spatial connectivity matrix
, i.e., the sum of all row-sums:
.
If any observations with no neighbors are found (i.e. any(Matrix::rowSums(w) == 0)
) they will be dropped automatically and a message will print stating how many were dropped. (The alternative would be for those observations to have a spatial lage of zero, but zero is not a neutral value.)
The Moran coefficient, a numeric value.
Chun, Yongwan, and Daniel A. Griffith. Spatial Statistics and Geostatistics: Theory and Applications for Geographic Information Science and Technology. Sage, 2013.
Cliff, Andrew David, and J. Keith Ord. Spatial processes: models & applications. Taylor & Francis, 1981.
moran_plot, lisa, aple, gr, lg
library(sf) data(georgia) w <- shape2mat(georgia, style = "W") x <- georgia$ICE mc(x, w)
library(sf) data(georgia) w <- shape2mat(georgia, style = "W") x <- georgia$ICE mc(x, w)
Plots a set of values against their spatially lagged values and gives the Moran coefficient as a measure of spatial autocorrelation.
moran_plot( x, w, xlab = "x (centered)", ylab = "Spatial Lag", pch = 20, col = "darkred", size = 2, alpha = 1, lwd = 0.5, na.rm = FALSE )
moran_plot( x, w, xlab = "x (centered)", ylab = "Spatial Lag", pch = 20, col = "darkred", size = 2, alpha = 1, lwd = 0.5, na.rm = FALSE )
x |
A numeric vector of length n. |
w |
An n x n spatial connectivity matrix. |
xlab |
Label for the x-axis. |
ylab |
Label for the y-axis. |
pch |
Symbol type. |
col |
Symbol color. |
size |
Symbol size. |
alpha |
Symbol transparency. |
lwd |
Width of the regression line. |
na.rm |
If |
For details on the symbol parameters see the documentation for geom_point.
If any observations with no neighbors are found (i.e. any(Matrix::rowSums(w) == 0)
) they will be dropped automatically and a message will print stating how many were dropped.
Returns a gg
plot, a scatter plot with x
on the horizontal and its spatially lagged values on the vertical axis (i.e. a Moran scatter plot).
Anselin, Luc. "Local indicators of spatial association—LISA." Geographical analysis 27, no. 2 (1995): 93-115.
data(georgia) x <- georgia$income w <- shape2mat(georgia, "W") moran_plot(x, w)
data(georgia) x <- georgia$income w <- shape2mat(georgia, "W") moran_plot(x, w)
Draw samples from the posterior predictive distribution of a fitted geostan
model.
posterior_predict( object, S, summary = FALSE, width = 0.95, approx = TRUE, K = 20, preserve_order = FALSE, seed )
posterior_predict( object, S, summary = FALSE, width = 0.95, approx = TRUE, K = 20, preserve_order = FALSE, seed )
object |
A |
S |
Optional; number of samples to take from the posterior distribution. The default, and maximum, is the total number of samples stored in the model. |
summary |
Should the predictive distribution be summarized by its means and central quantile intervals? If |
width |
Only used if |
approx |
For SAR models only; |
K |
For SAR models only; number of matrix powers to for the matrix inverse approximation (used when |
preserve_order |
If |
seed |
A single integer value to be used in a call to |
This method returns samples from the posterior predictive distribution of the model (at the observed values of covariates, etc.). The predictions incorporate uncertainty of all parameter values (used to calculate the expected value of the model, for example) plus the error term (the model's description of the amount of variability of observations around the expected value). If the model includes measurement error in the covariates, this source of uncertainty (about ) is passed into the posterior predictive distribution as well.
For SAR models (and all other models), the observed outcomes are not used to formulate the posterior predictive distribution. The posterior predictive distribution for the SLM (see stan_sar) is given by
The SDLM is the same but includes spatially-lagged covariates in . The
approx = FALSE
method for SAR models requires a call to Matrix::solve(I - rho * W)
for each MCMC sample; the approx = TRUE
method uses an approximation based on matrix powers (LeSage and Pace 2009). The approximation will deteriorate if is not near zero, so use with care.
A matrix of size S x N containing samples from the posterior predictive distribution, where S is the number of samples drawn and N is the number of observations. If summary = TRUE
, a data.frame
with N rows and 3 columns is returned (with column names mu
, lwr
, and upr
).
LeSage, James, & Robert kelley Pace (2009). Introduction to Spatial Econometrics. Chapman and Hall/CRC.
Gelman, A., J. B.Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, & D. B. Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). CRC Press.
McElreath, Richard (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, Ch. 3.
E <- sentencing$expected_sents sentencing$log_E <- log(E) fit <- stan_glm(sents ~ offset(log_E), re = ~ name, data = sentencing, family = poisson(), chains = 2, iter = 600) # for speed only yrep <- posterior_predict(fit, S = 65) plot(density(yrep[1,] / E )) for (i in 2:nrow(yrep)) lines(density(yrep[i,] / E), col = 'gray30') lines(density(sentencing$sents / E), col = 'darkred', lwd = 2) sars <- prep_sar_data2(row = 9, col = 9) W <- sars$W y <- sim_sar(rho = .9, w = W) fit <- stan_sar(y ~ 1, data = data.frame(y=y), sar = sars, iter = 650, quiet = TRUE) yrep <- posterior_predict(fit, S = 15)
E <- sentencing$expected_sents sentencing$log_E <- log(E) fit <- stan_glm(sents ~ offset(log_E), re = ~ name, data = sentencing, family = poisson(), chains = 2, iter = 600) # for speed only yrep <- posterior_predict(fit, S = 65) plot(density(yrep[1,] / E )) for (i in 2:nrow(yrep)) lines(density(yrep[i,] / E), col = 'gray30') lines(density(sentencing$sents / E), col = 'darkred', lwd = 2) sars <- prep_sar_data2(row = 9, col = 9) W <- sars$W y <- sim_sar(rho = .9, w = W) fit <- stan_sar(y ~ 1, data = data.frame(y=y), sar = sars, iter = 650, quiet = TRUE) yrep <- posterior_predict(fit, S = 15)
Simple features (sf
) with historic (1910) county boundaries of Florida with aggregated state prison sentencing counts and census data.
Sentencing and population counts are aggregates over the period 1905-1910, where populations were interpolated linearly between decennial censuses of 1900 and 1910.
sentencing
sentencing
Simple features (sf
)/data.frame
with the following attributes:
County name
White population total for years 1905-1910
Black population total for years 1905-1910
Number of state prison sentences, 1905-1910
Binary indicator for inclusion in the plantation belt
Percent of land area in agriculture, 1910
Expected sentences given demographic information and state level sentencing rates by race
Standardized incident ratio (observed/expected sentences)
Donegan, Connor. "The Making of Florida's 'Criminal Class': Race, Modernity and the Convict Leasing Program." Florida Historical Quarterly 97.4 (2019): 408-434. https://osf.io/2wj7s/.
Mullen, Lincoln A. and Bratt, Jordon. "USABoundaries: Historical and Contemporary Boundaries of the United States of America," Journal of Open Source Software 3, no. 23 (2018): 314, doi:10.21105/joss.00314.
data(sentencing) print(sentencing)
data(sentencing) print(sentencing)
Creates sparse matrix representations of spatial connectivity structures
shape2mat( shape, style = c("B", "W"), queen, method = c("queen", "rook", "knn"), k = 1, longlat = NULL, snap = sqrt(.Machine$double.eps), t = 1, st.style = c("contemp", "lag"), quiet = FALSE )
shape2mat( shape, style = c("B", "W"), queen, method = c("queen", "rook", "knn"), k = 1, longlat = NULL, snap = sqrt(.Machine$double.eps), t = 1, st.style = c("contemp", "lag"), quiet = FALSE )
shape |
An object of class |
style |
What kind of coding scheme should be used to create the spatial connectivity matrix? Defaults to "B" for binary; use "W" for row-standardized weights. |
queen |
Deprecated: use the ‘method’ argument instead. This option is passed to |
method |
Method for determining neighbors: queen, rook, or k-nearest neighbors. See Details for more information. |
k |
Number of neighbors to select for k-nearest neighbor method. Passed to |
longlat |
If longlat = TRUE, Great Circle (rather than Euclidean) distances are used; great circle circle distances account for curvature of the Earth. |
snap |
Passed to |
t |
Number of time periods. Only the binary coding scheme is available for space-time connectivity matrices. |
st.style |
For space-time data, what type of space-time connectivity structure should be used? Options are "lag" for the lagged specification and "contemp" (the default) for contemporaneous specification (see Details). |
quiet |
If |
The method argument currently has three options. The queen contiguity condition defines neighbors as polygons that share at least one point with one another. The rook condition requires that they share a line or border with one another. K-nearest neighbors is based on distance between centroids. All methods are implemented using the spdep package and then converted to sparse matrix format.
Alternatively, one can use spdep directly to create a listw
object and then convert that to a sparse matrix using as(listw, 'CsparseMatrix')
for use with geostan.
Haining and Li (Ch. 4) provide a helpful discussion of spatial connectivity matrices (Ch. 4).
The space-time connectivity matrix can be used for eigenvector space-time filtering (stan_esf
. The 'lagged' space-time structure connects each observation to its own past (one period lagged) value and the past value of its neighbors. The 'contemporaneous' specification links each observation to its neighbors and to its own in situ past (one period lagged) value (Griffith 2012, p. 23).
A spatial connectivity matrix in sparse matrix format. Binary matrices are of class ngCMatrix
, row-standardized are of class dgCMatrix
, created by sparseMatrix
.
Bivand, Roger S. and Pebesma, Edzer and Gomez-Rubio, Virgilio (2013). Applied spatial data analysis with R, Second edition. Springer, NY. https://asdar-book.org/
Griffith, Daniel A. (2012). Space, time, and space-time eigenvector filter specifications that account for autocorrelation. Estadística Espanola, 54(177), 7-34.
Haining, Robert P. and Li, Guangquan (2020). Modelling Spatial and Spatial-Temporal Data: A Bayesian Approach. CRC Press.
data(georgia) ## binary adjacency matrix C <- shape2mat(georgia, "B", method = 'rook') ## number of neighbors per observation summary( n_nbs(C) ) head(Matrix::summary(C)) ## row-standardized matrix W <- shape2mat(georgia, "W", method = 'rook') ## summary of weights E <- edges(W, unique_pairs_only = FALSE) summary(E$weight) ## space-time matricies ## for eigenvector space-time filtering ## if you have multiple years with same geometry/geography, ## provide the geometry (for a single year!) and number of years \code{t} Cst <- shape2mat(georgia, t = 5) dim(Cst) EVst <- make_EV(Cst) dim(EVst)
data(georgia) ## binary adjacency matrix C <- shape2mat(georgia, "B", method = 'rook') ## number of neighbors per observation summary( n_nbs(C) ) head(Matrix::summary(C)) ## row-standardized matrix W <- shape2mat(georgia, "W", method = 'rook') ## summary of weights E <- edges(W, unique_pairs_only = FALSE) summary(E$weight) ## space-time matricies ## for eigenvector space-time filtering ## if you have multiple years with same geometry/geography, ## provide the geometry (for a single year!) and number of years \code{t} Cst <- shape2mat(georgia, t = 5) dim(Cst) EVst <- make_EV(Cst) dim(EVst)
Fit a spatial regression model using eigenvector spatial filtering (ESF).
stan_esf( formula, slx, re, data, C, EV = make_EV(C, nsa = nsa, threshold = threshold), nsa = FALSE, threshold = 0.25, family = gaussian(), prior = NULL, ME = NULL, centerx = FALSE, censor_point, prior_only = FALSE, chains = 4, iter = 2000, refresh = 500, keep_all = FALSE, slim = FALSE, drop = NULL, pars = NULL, control = NULL, quiet = FALSE, ... )
stan_esf( formula, slx, re, data, C, EV = make_EV(C, nsa = nsa, threshold = threshold), nsa = FALSE, threshold = 0.25, family = gaussian(), prior = NULL, ME = NULL, centerx = FALSE, censor_point, prior_only = FALSE, chains = 4, iter = 2000, refresh = 500, keep_all = FALSE, slim = FALSE, drop = NULL, pars = NULL, control = NULL, quiet = FALSE, ... )
formula |
A model formula, following the R |
slx |
Formula to specify any spatially-lagged covariates. As in, |
re |
To include a varying intercept (or "random effects") term, alpha_re ~ N(0, alpha_tau) alpha_tau ~ Student_t(d.f., location, scale). |
data |
A |
C |
Spatial connectivity matrix. This will be used to calculate eigenvectors if |
EV |
A matrix of eigenvectors from any (transformed) connectivity matrix, presumably spatial or network-based (see |
nsa |
Include eigenvectors representing negative spatial autocorrelation? Defaults to |
threshold |
Eigenvectors with standardized Moran coefficient values below this |
family |
The likelihood function for the outcome variable. Current options are |
prior |
A named list of parameters for prior distributions (see
.
|
ME |
To model observational uncertainty (i.e. measurement or sampling error) in any or all of the covariates, provide a list of data as constructed by the |
centerx |
To center predictors on their mean values, use |
censor_point |
Integer value indicating the maximum censored value; this argument is for modeling censored (suppressed) outcome data, typically disease case counts or deaths. For example, the US Centers for Disease Control and Prevention censors (does not report) death counts that are nine or fewer, so if you're using CDC WONDER mortality data you could provide |
prior_only |
Draw samples from the prior distributions of parameters only. |
chains |
Number of MCMC chains to estimate. Default |
iter |
Number of samples per chain. Default |
refresh |
Stan will print the progress of the sampler every |
keep_all |
If |
slim |
If |
drop |
Provide a vector of character strings to specify the names of any parameters that you do not want MCMC samples for. Dropping parameters in this way can improve sampling speed and reduce memory usage. The following parameter vectors can potentially be dropped from ESF models:
If |
pars |
Optional; specify any additional parameters you'd like stored from the Stan model. |
control |
A named list of parameters to control the sampler's behavior. See stan for details. |
quiet |
By default, any prior distributions that have not been assigned by the user are printed to the console. If |
... |
Other arguments passed to sampling. |
Eigenvector spatial filtering (ESF) is a method for spatial regression analysis. ESF is extensively covered in Griffith et al. (2019). This function implements the methodology introduced in Donegan et al. (2020), which uses Piironen and Vehtari's (2017) regularized horseshoe prior.
By adding a spatial filter to a regression model, spatial autocorrelation patterns are shifted from the residuals to the spatial filter. ESF models take the spectral decomposition of a transformed spatial connectivity matrix, . The resulting eigenvectors,
, are mutually orthogonal and uncorrelated map patterns (at various scales, 'local' to 'regional' to 'global'). The spatial filter equals
where
is a vector of coefficients.
ESF decomposes the data into a global mean, , global patterns contributed by covariates
, spatial trends
, and residual variation. Thus, for
family=gaussian()
,
An ESF component can be incorporated into the linear predictor of any generalized linear model. For example, using stan_esf
with family = poisson()
and adding a 'random effects' term for each spatial unit (via the re
argument) will produce a model that resembles the BYM model (combining spatially structured and spatially-unstructured components).
The spatial.geostan_fit method will return .
The model can also be extended to the space-time domain; see shape2mat to specify a space-time connectivity matrix.
The coefficients are assigned the regularized horseshoe prior (Piironen and Vehtari, 2017), resulting in a relatively sparse model specification. In addition, numerous eigenvectors are automatically dropped because they represent trace amounts of spatial autocorrelation (this is controlled by the
threshold
argument). By default, stan_esf
will drop all eigenvectors representing negative spatial autocorrelation patterns. You can change this behavior using the nsa
argument.
The ESF models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), missing outcome data, and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for stan_glm.
An object of class class geostan_fit
(a list) containing:
Summaries of the main parameters of interest; a data frame
Residual spatial autocorrelation as measured by the Moran coefficient.
a data frame containing the model data
A matrix of eigenvectors created with w
and geostan::make_EV
The spatial weights matrix used to construct EV
the user-provided or default family
argument used to fit the model
The model formula provided by the user (not including ESF component)
The slx
formula
A list containing re
, the random effects (varying intercepts) formula if provided, and
data
a data frame with columns id
, the grouping variable, and idx
, the index values assigned to each group.
Prior specifications.
If covariates are centered internally (centerx = TRUE
), then x_center
is a numeric vector of the values on which covariates were centered.
The ME
data list, if one was provided by the user for measurement error models.
A data frame with the name of the spatial component parameter ("esf") and method ("ESF")
an object of class stanfit
returned by rstan::stan
Connor Donegan, [email protected]
Chun, Y., D. A. Griffith, M. Lee and P. Sinha (2016). Eigenvector selection with stepwise regression techniques to construct eigenvector spatial filters. Journal of Geographical Systems, 18(1), 67-85. doi:10.1007/s10109-015-0225-3.
Dray, S., P. Legendre & P. R. Peres-Neto (2006). Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modeling, 196(3-4), 483-493.
Donegan, C., Y. Chun and A. E. Hughes (2020). Bayesian estimation of spatial filters with Moran’s Eigenvectors and hierarchical shrinkage priors. Spatial Statistics. doi:10.1016/j.spasta.2020.100450 (open access: doi:10.31219/osf.io/fah3z).
Griffith, Daniel A., and P. R. Peres-Neto (2006). Spatial modeling in ecology: the flexibility of eigenfunction spatial analyses. Ecology 87(10), 2603-2613.
Griffith, D., and Y. Chun (2014). Spatial autocorrelation and spatial filtering, Handbook of Regional Science. Fischer, MM and Nijkamp, P. eds.
Griffith, D., Chun, Y. and Li, B. (2019). Spatial Regression Analysis Using Eigenvector Spatial Filtering. Elsevier.
Piironen, J and A. Vehtari (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. In Electronic Journal of Statistics, 11(2):5018-5051.
data(sentencing) # spatial weights matrix with binary coding scheme C <- shape2mat(sentencing, style = "B", quiet = TRUE) # expected number of sentences log_e <- log(sentencing$expected_sents) # fit spatial Poisson model with ESF + unstructured 'random effects' fit.esf <- stan_esf(sents ~ offset(log_e), re = ~ name, family = poisson(), data = sentencing, C = C, chains = 2, iter = 800) # for speed only # spatial diagnostics sp_diag(fit.esf, sentencing) # plot marginal posterior distributions of beta_ev (eigenvector coefficients) plot(fit.esf, pars = "beta_ev") # calculate log-standardized incidence ratios (SIR) # # SIR = observed/exected number of cases # in this case, prison sentences library(ggplot2) library(sf) f <- fitted(fit.esf, rates = FALSE)$mean SSR <- f / sentencing$expected_sents log.SSR <- log( SSR, base = 2 ) # map the log-SSRs ggplot(sentencing) + geom_sf(aes(fill = log.SSR)) + scale_fill_gradient2( midpoint = 0, name = NULL, breaks = seq(-3, 3, by = 0.5) ) + labs(title = "Log-Standardized Sentencing Ratios", subtitle = "log( Fitted/Expected ), base 2" ) + theme_void()
data(sentencing) # spatial weights matrix with binary coding scheme C <- shape2mat(sentencing, style = "B", quiet = TRUE) # expected number of sentences log_e <- log(sentencing$expected_sents) # fit spatial Poisson model with ESF + unstructured 'random effects' fit.esf <- stan_esf(sents ~ offset(log_e), re = ~ name, family = poisson(), data = sentencing, C = C, chains = 2, iter = 800) # for speed only # spatial diagnostics sp_diag(fit.esf, sentencing) # plot marginal posterior distributions of beta_ev (eigenvector coefficients) plot(fit.esf, pars = "beta_ev") # calculate log-standardized incidence ratios (SIR) # # SIR = observed/exected number of cases # in this case, prison sentences library(ggplot2) library(sf) f <- fitted(fit.esf, rates = FALSE)$mean SSR <- f / sentencing$expected_sents log.SSR <- log( SSR, base = 2 ) # map the log-SSRs ggplot(sentencing) + geom_sf(aes(fill = log.SSR)) + scale_fill_gradient2( midpoint = 0, name = NULL, breaks = seq(-3, 3, by = 0.5) ) + labs(title = "Log-Standardized Sentencing Ratios", subtitle = "log( Fitted/Expected ), base 2" ) + theme_void()
The intrinsic conditional auto-regressive (ICAR) model for spatial count data. Options include the BYM model, the BYM2 model, and a solo ICAR term.
stan_icar( formula, slx, re, data, C, family = poisson(), type = c("icar", "bym", "bym2"), scale_factor = NULL, prior = NULL, ME = NULL, centerx = FALSE, censor_point, prior_only = FALSE, chains = 4, iter = 2000, refresh = 500, keep_all = FALSE, slim = FALSE, drop = NULL, pars = NULL, control = NULL, quiet = FALSE, ... )
stan_icar( formula, slx, re, data, C, family = poisson(), type = c("icar", "bym", "bym2"), scale_factor = NULL, prior = NULL, ME = NULL, centerx = FALSE, censor_point, prior_only = FALSE, chains = 4, iter = 2000, refresh = 500, keep_all = FALSE, slim = FALSE, drop = NULL, pars = NULL, control = NULL, quiet = FALSE, ... )
formula |
A model formula, following the R formula syntax. Binomial models can be specified by setting the left hand side of the equation to a data frame of successes and failures, as in |
slx |
Formula to specify any spatially-lagged covariates. As in, |
re |
To include a varying intercept (or "random effects") term, alpha_re ~ N(0, alpha_tau) alpha_tau ~ Student_t(d.f., location, scale). Before using this term, read the |
data |
A |
C |
Spatial connectivity matrix which will be used to construct an edge list for the ICAR model, and to calculate residual spatial autocorrelation as well as any user specified |
family |
The likelihood function for the outcome variable. Current options are |
type |
Defaults to "icar" (partial pooling of neighboring observations through parameter |
scale_factor |
For the BYM2 model, optional. If missing, this will be set to a vector of ones. See |
prior |
A named list of parameters for prior distributions (see
.
|
ME |
To model observational uncertainty (i.e. measurement or sampling error) in any or all of the covariates, provide a list of data as constructed by the |
centerx |
To center predictors on their mean values, use |
censor_point |
Integer value indicating the maximum censored value; this argument is for modeling censored (suppressed) outcome data, typically disease case counts or deaths. For example, the US Centers for Disease Control and Prevention censors (does not report) death counts that are nine or fewer, so if you're using CDC WONDER mortality data you could provide |
prior_only |
Draw samples from the prior distributions of parameters only. |
chains |
Number of MCMC chains to estimate. |
iter |
Number of samples per chain. . |
refresh |
Stan will print the progress of the sampler every |
keep_all |
If |
slim |
If |
drop |
Provide a vector of character strings to specify the names of any parameters that you do not want MCMC samples for. Dropping parameters in this way can improve sampling speed and reduce memory usage. The following parameter vectors can potentially be dropped from ICAR models:
If |
pars |
Optional; specify any additional parameters you'd like stored from the Stan model. |
control |
A named list of parameters to control the sampler's behavior. See |
quiet |
Controls (most) automatic printing to the console. By default, any prior distributions that have not been assigned by the user are printed to the console. If |
... |
Other arguments passed to sampling. |
The intrinsic conditional autoregressive (ICAR) model for spatial data was introduced by Besag et al. (1991). The Stan code for the ICAR component of the model and the BYM2 option is from Morris et al. (2019) with adjustments to enable non-binary weights and disconnected graph structures (see Freni-Sterrantino (2018) and Donegan (2021)).
The exact specification depends on the type
argument.
For Poisson models for count data, y, the basic model specification (type = "icar"
) is:
where contains an intercept and potentially covariates. The spatial trend
has a mean of zero and a single scale parameter
(which user's will see printed as the parameter named
spatial_scale
).
The ICAR prior model is a CAR model that has a spatial autocorrelation parameter equal to 1 (see stan_car). Thus the ICAR prior places high probability on a very smooth spatially (or temporally) varying mean. This is rarely sufficient to model the amount of variation present in social and health data. For this reason, the BYM model is typically employed.
Often, an observational-level random effect term, theta
, is added to capture (heterogeneous or unstructured) deviations from . The combined term is referred to as a convolution term:
This is known as the BYM model (Besag et al. 1991), and can be specified using type = "bym"
:
The model is named after Besag, York, and Mollié (1991).
Riebler et al. (2016) introduce a variation on the BYM model (type = "bym2"
). This specification combines and
using a mixing parameter
that controls the proportion of the variation that is attributable to the spatially autocorrelated term
rather than the spatially unstructured term
. The terms share a single scale parameter
:
The terms ,
are standard normal deviates,
is restricted to values between zero and one, and
is the 'scale_factor' (a constant term provided by the user). By default, the 'scale_factor' is equal to one, so that it does nothing. Riebler et al. (2016) argue that the interpretation or meaning of the scale of the ICAR model depends on the graph structure of the connectivity matrix
. This implies that the same prior distribution assigned to
will differ in its implications if
is changed; in other words, the priors are not transportable across models, and models that use the same nominal prior actually have different priors assigned to
.
Borrowing R
code from Morris (2017) and following Freni-Sterrantino et al. (2018), the following R
code can be used to create the 'scale_factor' for the BYM2 model (note, this requires the INLA R package), given a spatial adjacency matrix,
:
## create a list of data for stan_icar icar.data <- geostan::prep_icar_data(C) ## calculate scale_factor for each of k connected group of nodes k <- icar.data$k scale_factor <- vector(mode = "numeric", length = k) for (j in 1:k) { g.idx <- which(icar.data$comp_id == j) if (length(g.idx) == 1) { scale_factor[j] <- 1 next } Cg <- C[g.idx, g.idx] scale_factor[j] <- scale_c(Cg) }
This code adjusts for 'islands' or areas with zero neighbors, and it also handles disconnected graph structures (see Donegan and Morris 2021). Following Freni-Sterrantino (2018), disconnected components of the graph structure are given their own intercept term; however, this value is added to automatically inside the Stan model. Therefore, the user never needs to make any adjustments for this term. (To avoid complications from using a disconnected graph structure, you can apply a proper CAR model instead of the ICAR:
stan_car
).
Note, the code above requires the scale_c
function; it has package dependencies that are not included in geostan
. To use scale_c
, you have to load the following R
function:
#' compute scaling factor for adjacency matrix, accounting for differences in spatial connectivity #' #' @param C connectivity matrix #' #' @details #' #' Requires the following packages: #' #' library(Matrix) #' library(INLA); #' library(spdep) #' library(igraph) #' #' @source Morris (2017) #' scale_c <- function(C) { geometric_mean <- function(x) exp(mean(log(x))) N = dim(C)[1] Q = Diagonal(N, rowSums(C)) - C Q_pert = Q + Diagonal(N) * max(diag(Q)) * sqrt(.Machine$double.eps) Q_inv = inla.qinv(Q_pert, constr=list(A = matrix(1,1,N),e=0)) scaling_factor <- geometric_mean(Matrix::diag(Q_inv)) return(scaling_factor) }
The ICAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), missing outcome data, and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for stan_glm.
An object of class class geostan_fit
(a list) containing:
Summaries of the main parameters of interest; a data frame
Residual spatial autocorrelation as measured by the Moran coefficient.
an object of class stanfit
returned by rstan::stan
a data frame containing the model data
The edge list representing all unique sets of neighbors and the weight attached to each pair (i.e., their corresponding element in the connectivity matrix C
Spatial connectivity matrix
the user-provided or default family
argument used to fit the model
The model formula provided by the user (not including ICAR component)
The slx
formula
A list with two name elements, formula
and Data
, containing the formula re
and a data frame with columns id
(the grouping variable) and idx
(the index values assigned to each group).
Prior specifications.
If covariates are centered internally (centerx = TRUE
), then x_center
is a numeric vector of the values on which covariates were centered.
A data frame with the name of the spatial parameter ("phi"
if type = "icar"
else "convolution"
) and method (toupper(type)
).
Connor Donegan, [email protected]
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological), 36(2), 192-225.
Besag, J., York, J., and Mollié, A. (1991). Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43(1), 1-20.
Donegan, Connor and Morris, Mitzi (2021). Flexible functions for ICAR, BYM, and BYM2 models in Stan. Code repository. https://github.com/ConnorDonegan/Stan-IAR
Donegan, Connor (2021b). Building spatial conditional autoregressive (CAR) models in the Stan programming language. OSF Preprints. doi:10.31219/osf.io/3ey65.
Freni-Sterrantino, Anna, Massimo Ventrucci, and Håvard Rue (2018). A Note on Intrinsic Conditional Autoregressive Models for Disconnected Graphs. Spatial and Spatio-Temporal Epidemiology, 26: 25–34.
Morris, Mitzi (2017). Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data. https://mc-stan.org/users/documentation/case-studies/icar_stan.html
Morris, M., Wheeler-Martin, K., Simpson, D., Mooney, S. J., Gelman, A., & DiMaggio, C. (2019). Bayesian hierarchical spatial models: Implementing the Besag York Mollié model in stan. Spatial and spatio-temporal epidemiology, 31, 100301.
Riebler, A., Sorbye, S. H., Simpson, D., & Rue, H. (2016). An intuitive Bayesian spatial model for disease mapping that accounts for scaling. Statistical Methods in Medical Research, 25(4), 1145-1165.
shape2mat, stan_car, stan_esf, stan_glm, prep_icar_data
data(sentencing) C <- shape2mat(sentencing, "B") log_e <- log(sentencing$expected_sents) fit.bym <- stan_icar(sents ~ offset(log_e), family = poisson(), data = sentencing, type = "bym", C = C, chains = 2, iter = 800) # for speed only # spatial diagnostics sp_diag(fit.bym, sentencing) # check effective sample size and convergence library(rstan) rstan::stan_ess(fit.bym$stanfit) rstan::stan_rhat(fit.bym$stanfit)
data(sentencing) C <- shape2mat(sentencing, "B") log_e <- log(sentencing$expected_sents) fit.bym <- stan_icar(sents ~ offset(log_e), family = poisson(), data = sentencing, type = "bym", C = C, chains = 2, iter = 800) # for speed only # spatial diagnostics sp_diag(fit.bym, sentencing) # check effective sample size and convergence library(rstan) rstan::stan_ess(fit.bym$stanfit) rstan::stan_rhat(fit.bym$stanfit)
Deviance Information Criteria (DIC) and Widely Application Information Criteria (WAIC) for model comparison.
waic(object, pointwise = FALSE, digits = 2) dic(object, digits = 1)
waic(object, pointwise = FALSE, digits = 2) dic(object, digits = 1)
object |
A fitted |
pointwise |
Logical (defaults to |
digits |
Round results to this many digits. |
WAIC (widely applicable information criteria) and DIC (deviance information criteria) are used for model comparison. They are based on theories of out-of-sample predictive accuracy. The DIC is implemented with penalty term defined as 1/2 times the posterior variance of the deviance (Spiegelhatler et al. 2014).
The limitations of these methods include that DIC is less robust than WAIC and that WAIC is not strictly valid for autocorrelated data (viz. geostan's spatial models).
For both DIC and WAIC, lower values indicate better models.
WAIC returns a vector of length 3 with the WAIC
value, a penalty term which measures the effective number of parameters estimated by the model Eff_pars
, and log predictive density Lpd
. If pointwise = TRUE
, results are returned in a data.frame
.
DIC returns a vector of length 2: the DIC value and the penalty term (which is part of the DIC calculation).
D. Spiegelhatler, N. G. Best, B. P. Carlin and G. Linde (2014) The Deviance Information Criterion: 12 Years on. J. Royal Statistical Society Series B: Stat Methodology. 76(3): 485-493.
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely application information criterion in singular learning theory. Journal of Machine Learning Research 11, 3571-3594.
data(georgia) fit <- stan_glm(log(rate.male) ~ 1, data = georgia, iter=600, chains = 2, quiet = TRUE) fit2 <- stan_glm(log(rate.male) ~ log(income), data = georgia, centerx = TRUE, iter=600, chains = 2, quiet = TRUE) dic(fit) dic(fit2) waic(fit) waic(fit2)
data(georgia) fit <- stan_glm(log(rate.male) ~ 1, data = georgia, iter=600, chains = 2, quiet = TRUE) fit2 <- stan_glm(log(rate.male) ~ log(income), data = georgia, centerx = TRUE, iter=600, chains = 2, quiet = TRUE) dic(fit) dic(fit2) waic(fit) waic(fit2)