This vignette introduces users to the spatial measurement error (ME) models implemented in the geostan package (Donegan, Chun, and Griffith 2021). Variations on this methodology have been examined previously by multiple authors (Bernadinelli et al. 1997; Kang, Liu, and Cressie 2009; Logan et al. 2020; Xia and Carlin 1998).
These models are designed for spatial models that use survey estimates as covariates. They are designed for use with American Community Survey (ACS) data and other large, government-backed surveys, although other data types may be appropriate as well. A premise of this methodology is that the survey includes a systematic spatial sampling design (i.e., the sampling procedure was stratified by the areal unit of interest, whether they be block groups, counties, or states).
In a (spatial) regression analysis, measurement/sampling error in covariates can lead to model parameter estimates that are overly-confident and prone to bias. This can lead to under- or over-estimation of population risks and needs, mainly because the noisy survey-based covariate passes directly into predictive values. This may impact real communities and service providers (see Bazuin and Fraser 2013). Examining the standard errors or margins of error of small-area survey estimates is an important first step that can help one determine whether a measurement error model should be considered.
These models account for sampling error which does not include any potential sources of systematic bias or recording errors that may be present in survey estimates.
The measurement error (ME) models implemented in geostan are hierarchical Bayesian models (HBMs) that incorporate two sources of information:
The model treats the true value x as an unknown parameter or a latent variable. The goal is to obtain a probability distribution for x. That information can then be incorporated into any of geostan’s regression or disease mapping models.
The sampling distribution for the model states that the true value xi is probably within two standard errors of the estimate, or zi = xi + ei, ei ∼ Normal(0, si2), where zi is the estimate, si is its standard error, and xi is the true value.
Our background knowledge on x is the second component of the model. This is simply the knowledge that social variables tend to be spatially patterned: relatively extreme values are not implausible but they tend to be clustered together. This information is encoded into a prior probability distribution for the unknown values x. Our prior distribution is the spatial conditional autoregressive (CAR) model (see Donegan 2022 Table 1, WCAR specification).
If your initial model for the outcome variable y is an ordinary linear model with a covariate x, as in y = α + β * x + ϵ, then adding our spatial ME model would result in the following specification:
In this model x is an unobserved model parameter, z is our survey estimate of x, s is the standard error of the estimate, and the prior for the vector x is a spatial CAR model.
If we were to apply the spatial ME model to a covariate within a hierarchical Poisson model then we would write:
where u is some kind of spatial component or ‘random effects’ vector.
If x is a rate variable—e.g., the poverty rate—it can be important to use a logit-transformation before applying the CAR prior (Logan et al. 2020) as follows:
The CAR model has three scalar parameters (mean μ, spatial dependence ρ, and scale τ) that require prior probability distributions; geostan uses the following by default:
The default prior for ρ is uniform across its entire support (determined by the extreme eigenvalues of the spatial weights matrix).
For a sense of how all of this information gets combined in the model, consider two hypothetical survey estimates.
The first has a small standard error—it is a reliable data point—and it is also similar to its surrounding counties. In that case, the probability distribution for xi is going to look similar to the raw estimate and its standard error. Now consider a second county, where the estimate has a large standard error—it is unreliable—and the estimate is also dissimilar from its neighbors. In this case, we ought to consider it quite probable that the estimate xi is far from the truth. This means that the probability distribution for xi that comes from our model may be quite different from the raw estimate. That shift is often referred to as “shrinkage,” although the shift may be in either direction (up or down).
When we add a spatial ME model to a spatial regression model, the model results will automatically average over the uncertain inferences we make about x. The inferential biases that follow from ignoring measurement/observational uncertainty will be addressed, meaning that the probability distributions for the parameters will reflect the information we have on data quality.
From the R console, load the geostan package.
The line data(georgia)
loads the georgia
data set from the geostan package into your working environment. You can
learn more about the data by entering ?georgia
to the R
console.
Users can add a spatial ME model to any geostan model. A list of data
for the ME models can be prepared using the prep_me_data
function. The function requires at least two inputs from the user:
se
A data frame with survey standard errors.car_parts
A list of data created by the
prep_car_data
function.It also has optional inputs:
prior
A list of prior distributions for the CAR model
parameters.logit
A vector of logical values (TRUE
or
FALSE
) specifying if the variate should be logit
transformed. Only use this for rates (proportions between 0 and 1).The default prior distributions are designed to be fairly vague relative to ACS variables (including percentages from 0-100), assuming that magnitudes such as income have been log-transformed already.
The logit-transformation is often required for skewed rates, such as
the poverty rate. The user does not do this transformation on
their own (i.e., before passing the data to the model), simply use the
logit
argument in prep_me_data
.1
As an example, we will be using estimates of the percent covered by health insurance in Georgia counties. We are going to convert this to a rate so that we can use the logit transform:
data(georgia)
georgia$insurance <- georgia$insurance / 100
georgia$insurance.se <- georgia$insurance.se / 100
Now we need to gather the standard errors
georgia$insurance.se
into a data frame. The column name for
the standard errors must match the name of the variable it refers to
(“insurance”):
If we intended to model additional survey-based covariates, we would simply add their standard errors to this data frame as another column.
We create a binary spatial connectivity matrix, and then pass it to
prep_car_data
to prepare it for the CAR model:
## Range of permissible rho values: -1.661, 1
Now we use prep_me_data
to combine these items. We are
going to use the logit-transform, since we are working with rates:
Now we are ready to begin modeling the data.
The following code chunk sets up a spatial model for male county
mortality rates (ages 55-64) using the insurance
variable
as a covariate (without the spatial ME model):
fit <- stan_car(deaths.male ~ offset(log(pop.at.risk.male)) + insurance,
centerx = TRUE,
data = georgia,
family = poisson(),
car_parts = cars)
To add the spatial ME model to the above specification, we simply
provide our ME_list
to the ME
argument:
fit <- stan_car(deaths.male ~ offset(log(pop.at.risk.male)) + insurance,
centerx = TRUE,
ME = ME_list,
family = poisson(),
data = georgia,
car_parts = cars,
iter = 650, # for demo speed
quiet = TRUE,
)
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
## Spatial Model Results
## Formula: deaths.male ~ offset(log(pop.at.risk.male)) + insurance
## Likelihood: poisson
## Link: log
## Spatial method: CAR
## Residual Moran Coefficient: -0.0004461538
## Observations: 159
## Data models (ME): insurance
## Data model (ME prior): CAR (auto-normal)
## Inference for Stan model: foundation.
## 4 chains, each with iter=650; warmup=325; thin=1;
## post-warmup draws per chain=325, total post-warmup draws=1300.
##
## mean se_mean sd 2.5% 20% 50% 80% 97.5% n_eff Rhat
## intercept -4.164 0.002 0.055 -4.276 -4.205 -4.165 -4.126 -4.057 889 0.999
## insurance -2.747 0.024 0.808 -4.309 -3.411 -2.747 -2.064 -1.204 1156 1.000
## car_rho 0.863 0.002 0.088 0.654 0.794 0.882 0.939 0.983 1263 0.998
## car_scale 0.450 0.001 0.033 0.391 0.421 0.448 0.475 0.520 986 1.003
##
## Samples were drawn using NUTS(diag_e) at Fri Feb 14 06:01:00 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The CAR models in geostan can be highly efficient in terms of MCMC
sampling, but the required number of iterations will depend on many
characteristics of the model and data. Often the default
iter = 2000
is more than sufficient (with
cores = 4
). To speed up sampling with multi-core
processing, use cores = 4
(to sample 4 chains in
parallel).
In the following section, methods for examining MCMC samples of the
modeled covariate values x will be illustrated. Note
that the process of storing MCMC samples for x can become computationally
burdensome if you have multiple covariates and moderately large N. If
you do not need the samples of x you can use the
drop
argument, as in:
fit2 <- stan_car(deaths.male ~ offset(log(pop.at.risk.male)) + insurance,
centerx = TRUE,
ME = ME_list,
drop = 'x_true',
family = poisson(),
data = georgia,
car_parts = cars
)
Using slim = TRUE
may be faster (particularly for larger
data sets), but will prevent the collection of MCMC samples for
quantities of interest such as fitted values, pointwise log-likelihood
values (required for WAIC), as well as modeled covariates x.
The me_diag
provides some useful diagnostics for
understanding how the raw estimates compare with the modeled values.
First, it provides a point-interval scatter plot that compares the raw survey estimates (on the horizontal axis) to the modeled values (on the vertical axis). The (posterior) probability distributions for the modeled values are represented by their 95% credible intervals. The intervals provide a sense of the quality of the ACS data—if the credible intervals on the modeled value are large, this tells us that the data are not particularly reliable.
The me_diag
function also provides more direct visual
depictions of the difference between the raw survey estimates zi an the
modeled values xi: δi = zi − xi.
We want to look for any strong spatial pattern in these δi values,
because that would be an indication of a bias. However, the magnitude of
the δi
value is important to consider—there may be a pattern, but if the amount
of shrinkage is very small, that pattern may not matter. (The model
returns M samples from the
posterior distribution of each xi; or, indexing
by MCMC sample xim
(m is an index, not an
exponent). The reported δi values is the
MCMC mean $\delta_i = \frac{1}{M} \sum_m z_i
- x_i^m$.)
Two figures are provided to evaluate patterns in δi: first, a Moran scatter plot and, second, a map.
In this case, the results do not look too concerning insofar as there are no conspicuous patterns. However, a number of the credible intervals on the modeled values are large, which indicates low data reliability. The fact that some of the δi values are substantial also points to low data quality for those estimates.
It can be important to understand that the ME models are not
independent from the rest of the model. When the modeled covariate x enters a regression
relationship, that regression relationship can impact the probability
distribution for x
itself. (This is a feature of all Bayesian ME models). To examine the ME
model in isolation, you can set the prior_only
argument to
TRUE
(see ?stan_glm
or any of the spatial
models).
geostan consists of pre-compiled Stan models, and users can always access the Markov chain Monte Carlo (MCMC) samples returned by Stan. When extracted as a matrix of samples (as below), each row of the matrix represents a draw from the joint probability distribution for all model parameters, and each column consists of samples from the marginal distribution of each parameter.
The ME models return samples for xi as well as
the model parameters μ
(“mu_x_true”), ρ
(“car_rho_x_true”), and τ
(“sigma_x_true”). We can access these using as.matrix
(also
as.array
and as.data.frame
).
## [1] 1300 1
## parameters
## iterations mu_x_true[1]
## [1,] 1.806760
## [2,] 1.760781
## [3,] 1.783774
## [4,] 1.799491
## [5,] 1.734753
## [6,] 1.829441
## [1] 1.792787
We can visualize these using plot
or print a
summary:
## Inference for Stan model: foundation.
## 4 chains, each with iter=650; warmup=325; thin=1;
## post-warmup draws per chain=325, total post-warmup draws=1300.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu_x_true[1] 1.79 0 0.09 1.63 1.75 1.79 1.83 1.98 433 1
## car_rho_x_true[1] 0.91 0 0.07 0.75 0.88 0.92 0.96 0.99 977 1
## sigma_x_true[1] 0.48 0 0.03 0.42 0.46 0.48 0.51 0.56 1094 1
##
## Samples were drawn using NUTS(diag_e) at Fri Feb 14 06:01:00 2025.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
To extract samples from the joint probability distribution for x, use the generic parameter name “x_true”:
## [1] 1300 159
If we wanted to calculate the mean of each of these marginal
distributions (one for every xi), we could
use apply
with MARGIN = 2
to summarize by
column:
## x_insurance[1] x_insurance[2] x_insurance[3] x_insurance[4] x_insurance[5]
## 0.8366963 0.8003596 0.8517316 0.8520667 0.9179845
## x_insurance[6]
## 0.8702509
The vector x.mu
contains estimates (posterior means) for
xi. We
might want to use these to plot the residuals or fitted values against
the predictor:
res <- resid(fit)$mean
plot(x.mu, res,
xlab = 'Insurance rate',
ylab = 'Residual',
type = 'n',
bty = 'L')
abline(h = 0)
points(x.mu, res,
col = 4,
pch = 20)
If the ME
list doesn’t have a slot with
car_parts
, geostan will automatically use a non-spatial
Student’s t model instead of the CAR model: p(x|ℳ) = Student(x|ν, μ, σ),
with degrees of freedom ν,
mean μ, and scale σ.
To model multiple covariates, simply add them to the data frame of standard errors:
georgia$college <- georgia$college / 100
georgia$college.se <- georgia$college.se / 100
se = data.frame(insurance = georgia$insurance.se,
college = georgia$college.se)
ME <- prep_me_data(se = se, logit = c(TRUE, TRUE))
fit <- stan_glm(deaths.male ~ offset(log(pop.at.risk.male)) + insurance + college,
ME = ME,
re = ~ GEOID,
family = poisson(),
data = georgia,
iter = 700
)
Results can be examined one at a time usign
me_diag(fit, 'insurance', georgia)
and
me_diag(fit, 'college', georgia)
.
Income and other magnitudes are often log transformed. In that case,
the survey standard errors also need to be transformed. The
se_log
function provide approximate standard errors for
this purpose.
When using se_log
, pass in the variable itself
(untransformed) and its standard errors. Here is a full
workflow for a spatial mortality model with covariate measurement
error:
data(georgia)
# income in $1,000s
georgia$income <- georgia$income / 1e3
georgia$income.se <- georgia$income.se /1e3
# create log income
georgia$log_income <- log( georgia$income )
# create SEs for log income
log_income_se <- se_log( georgia$income, georgia$income.se )
# prepare spatial CAR data
C <- shape2mat(georgia, "B")
cars <- prep_car_data(C)
# prepare ME data
se <- data.frame( log_income = log_income_se )
ME <- prep_me_data( se = se, car_parts = cars )
# fit model
fit <- stan_car(deaths.male ~ offset(log(pop.at.risk.male)) + log_income,
ME = ME,
car_parts = cars,
family = poisson(),
data = georgia,
cores = 4
)
# check ME model
me_diag(fit, 'log_income', georgia)
# check disease model
sp_diag(fit, georgia)
# coefficient estimates
plot(fit)
# plot the income-mortality gradient
values <- seq( min(georgia$log_income), max(georgia$log_income), length.out = 100 )
new_data <- data.frame(pop.at.risk.male = 1,
log_income = values)
preds <- predict(fit, new_data, type = 'response')
preds$Income <- exp( preds$log_income )
preds$Mortality <- preds$mean * 100e3
plot(preds$Income, preds$Mortality,
type = 'l',
xlab = "Income",
ylab = "Deaths per 100,000")
There is a third optional argument, bounds
,
but it is rarely needed. It will forces the values of the latent
parameter vector to remain between the given bounds. For example, rates
must be between 0 and 1, but this is enforced automatically by logit
transform (which should be used in such cases). If it is needed for some
reason, the bounds
argument will apply to all of the
modeled covariates.↩︎