Motivation for using autoMrP

Multilevel regression with post-stratification (MrP) is the current gold standard for small-area estimation. A common application is, for example, estimating public approval for the use of force to secure oil in each U.S. state using survey data. Surveys are often representative at the national level but not at regional levels, especially in smaller regions with only a few respondents in a survey. MrP solves this problem in two ways. First, in a multilevel model, partial pooling ensures that information from within states and from across states is used in estimating state-level effects. The number of observations within a group and the amount of variation within a group determine the weighting of the group-level mean and the grand mean (Gelman and Hill, p. 253). Second, the state-level estimates from the multilevel model are post-stratified based on census information. For instance, we could estimate a multilevel model with random effects for age, gender, and race - note that we can only include random effects for characteristics that we have census information on. To post-stratify, we find how common each age-gender-race combination is in a particular state. The prediction for such an “ideal type” (age-gender-race combination) is then weighted by that combination’s proportion in the state. It has been demonstrated that MrP is often more accurate than alternatives such as disaggregation (cf. Warshaw and Rodden, 2012; Lax and Phillips, 2009).
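To make the post-stratification step concrete, here is a minimal sketch in R with hypothetical numbers for one state with four ideal types (in reality there are many more cells):

# hypothetical model predictions for the ideal types of one state
cell_preds <- c(0.35, 0.42, 0.28, 0.51)
# hypothetical census proportions of those ideal types within the state
cell_props <- c(0.40, 0.25, 0.20, 0.15)
# the state-level estimate is the census-weighted average of the cell predictions
sum(cell_props * cell_preds)
## [1] 0.3775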

The autoMrP package makes two contributions. First, it is - to our knowledge - the only package that implements MrP in R and thereby makes this technique much more accessible. Second, autoMrP uses machine learning techniques to further improve upon the standard MrP model’s prediction accuracy.

The machine learning approach to MrP relaxes functional form assumptions that otherwise have to be made, and it implements regularization more comprehensively, thereby reducing the risk of over-fitting. In autoMrP with machine learning, we use cross-validation to tune five classifiers: (i) multilevel regression with best subset selection of the context-level variables, (ii) multilevel regression with principal components as context-level variables, (iii) multilevel regression with L1 regularization (lasso), (iv) gradient boosting, and (v) support vector machine. We then use ensemble Bayesian model averaging (EBMA) to combine the five classifiers’ post-stratified state-level predictions into one overall prediction. We demonstrate that this approach outperforms standard MrP in Broniecki, Wüest, and Leemann (2020), “Improving Multilevel Regression with Post-Stratification Through Machine Learning (autoMrP),” forthcoming in the Journal of Politics (final pre-print version).

autoMrP tutorial

In the following, we demonstrate how to use autoMrP in R.

We start by installing autoMrP from CRAN:

install.packages("autoMrP")

Next, we attach the package and inspect a survey item that is included in the package as well as the corresponding census data.

The survey item asked “Would you approve of the use of U.S. military troops in order to ensure the supply of oil?” Our sample includes 1,500 respondents, mimicking the size of a typical survey. The dependent variable is named YES and is 1 if an individual supports the use of force to secure oil and 0 otherwise.

The three individual-level variables that we also have census information on are:

  • L1x1 - Age group (four categories: 1 = 18-29; 2 = 30-44; 3 = 45-64; 4 = 65+)
  • L1x2 - Education level (four categories: 1 = < high school; 2 = high school graduate; 3 = some college; 4 = college graduate)
  • L1x3 - Gender-race combination (six categories: 1 = white male; 2 = black male; 3 = hispanic male; 4 = white female; 5 = black female; 6 = hispanic female)

We can also estimate random effects for:

  • state - the U.S. state
  • region - the U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)

In addition, we have six context-level variables. These enter the model as fixed effects that vary across U.S. states but are constant across individuals within a state:

  • L2.x1 - Normalized state-level share of votes for the Republican candidate in the previous presidential election
  • L2.x2 - Normalized state-level percentage of Evangelical Protestant or Mormon respondents
  • L2.x3 - Normalized state-level percentage of the population living in urban areas
  • L2.x4 - Normalized state-level unemployment rate
  • L2.x5 - Normalized state-level share of Hispanics
  • L2.x6 - Normalized state-level share of Whites
# attach the package
library(autoMrP)

# inspect the documentation for the survey item and the census data
?survey_item
?census

The standard MrP model

Estimating a standard MrP model is easy with autoMrP. We estimate a model that includes all individual-level predictors - L1x1, L1x2, and L1x3 - as well as a random effect for the state. The standard multilevel model protects against over-fitting on the individual level to some degree by means of partial pooling. On the context level, however, there is no equivalent regularization scheme. We know that we can improve the performance of MrP models by including context-level variables, but we also know that we can easily over-fit on the context level. As a first model, we therefore do not include any context-level information.

# fit the model
m1 <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = "",
  L2.unit = "state",
  survey = survey_item,
  census = census,
  bin.proportion = "proportion",
  best.subset = FALSE,
  pca = FALSE,
  lasso = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)
## Starting post-stratification
## 
##  Skipping EBMA (only 1 classifier selected)

By default, autoMrP will apply machine learning (ML). To estimate the standard MrP model, we need to set the arguments for the ML classifiers - best.subset, pca, lasso, gb, and svm - to FALSE. At the same time, the argument mrp controls whether the standard MrP model is estimated; it defaults to FALSE, so we need to set it to TRUE explicitly.

It makes sense to take a few minutes to understand the structure of the census data - this will make it easier to work with your own data. The census data is organized such that each observation corresponds to an ideal type. For instance, row 1 is a white male who did not graduate from high school; row 2 is a white male who did graduate from high school, and so forth. The census data also needs a column with information on the (absolute or relative) frequency of each ideal type in a state. Here, the column proportion is the relative frequency of each ideal type within a state, and it is needed for post-stratification. The argument bin.proportion = "proportion" points to the column name of that census variable. Alternatively, one can include a column with the absolute frequency of each ideal type within a state. In that case, the argument bin.proportion must be replaced by the argument bin.size, set to the name of the census column that contains the absolute number of ideal type occurrences in a state. In addition, the census data must contain the columns with the context-level data if these are to be used in the MrP model.
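A quick way to inspect this structure directly in R:

# look at the first ideal types in the census data
head(census)
# the proportion column holds each ideal type's relative frequency within its state
summary(census$proportion)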

The auto_MrP() function returns a list with three elements. The first element is called ebma and contains the state-level estimates from the ensemble based on Bayesian model averaging - this element is empty if we run the standard MrP model only. The second element is called classifiers and contains the state-level estimates from each of the classifiers. Here, our only classifier is the standard multilevel model with post-stratification. The third element is called weights and is also empty if only one classifier is run. Otherwise, it contains the weight of each classifier in the final ensemble.

# auto_MrP() returns a list with 3 elements
m1
## $ebma
## [1] "EBMA step skipped (only 1 classifier run)"
## attr(,"class")
## [1] "autoMrP"   "ensemble"  "character"
## 
## $classifiers
## # A tibble: 48 x 2
##    state   mrp
##    <fct> <dbl>
##  1 AL    0.390
##  2 AR    0.348
##  3 AZ    0.384
##  4 CA    0.325
##  5 CO    0.390
##  6 CT    0.384
##  7 DE    0.373
##  8 FL    0.393
##  9 GA    0.388
## 10 IA    0.365
## # ... with 38 more rows
## 
## $weights
## [1] "EBMA step skipped (only 1 classifier run)"
## attr(,"class")
## [1] "autoMrP"   "weights"   "character"
## 
## attr(,"class")
## [1] "autoMrP" "list"

For our second MrP model, we include all context-level variables available in the data and, in addition, estimate random effects for the region.

# fit the model
m2 <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  L2.x.scale = FALSE,
  survey = survey_item,
  census = census,
  bin.proportion = "proportion",
  best.subset = FALSE,
  pca = FALSE,
  lasso = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)
## Starting post-stratification
## 
##  Skipping EBMA (only 1 classifier selected)

Important Note: The argument L2.x.scale controls whether the context-level variables are normalized. It defaults to TRUE, and this option should not be changed unless the context-level variables have already been normalized (check both the survey and the census data to verify this). We recommend normalizing context-level variables when estimating the standard MrP model, and context-level variables must be normalized if the machine learning classifiers are used because some of the ML classifiers are not scale-invariant.
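If your own context-level variables are not yet normalized, the simplest option is to leave L2.x.scale = TRUE. For reference, here is a minimal sketch of normalizing manually instead, assuming a hypothetical unscaled column L2_raw that takes one distinct value per state in both data sets - the key point is that survey and census must be transformed identically:

# distinct state-level values of the hypothetical unscaled variable L2_raw
state_vals <- unique(census$L2_raw)
# z-score both data sets with the same mean and standard deviation
survey_item$L2_raw <- (survey_item$L2_raw - mean(state_vals)) / sd(state_vals)
census$L2_raw <- (census$L2_raw - mean(state_vals)) / sd(state_vals)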

The decision whether or not to include context-level information and which context-level variables to include can be very consequential with respect to the state-level estimates that we obtain. Let’s illustrate this using our state-level estimates from m1 and m2.

library(dplyr)
library(ggplot2)

# plot data
plot_data <- tibble(
  state = m1$classifiers$state,
  m1 = m1$classifiers$mrp,
  m2 = m2$classifiers$mrp) %>%
  arrange(m1) %>%
  mutate(rank = row_number()) %>%
  mutate(rank = as.factor(rank))

# y axis tick labels
ylabs <- as.character(pull(.data = plot_data, var = "state"))

# plot
# plot
ggplot(data = plot_data, aes(y = rank)) +
  geom_point(aes(x = m1, color = "m1")) +
  geom_point(aes(x = m2, color = "m2")) +
  labs(x = "Estimates") +
  scale_y_discrete(labels = ylabs, name = "States") +
  guides(color = guide_legend(title = "MrP Model"))
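Beyond eyeballing the plot, a quick numeric check of how far the two specifications diverge, using the plot_data object built above:

# average absolute difference and correlation between the two sets of estimates
mean(abs(plot_data$m1 - plot_data$m2))
cor(plot_data$m1, plot_data$m2)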

Clearly, the models return very different state-level estimates and we do not have a good way of knowing which specification is correct. autoMrP uses machine learning to approach this problem in a disciplined way.

autoMrP with Machine Learning

We will generate an overall prediction using all five classifiers and all context-level variables. We accept the default auto_MrP() tuning parameters. In your own work, however, we suggest that you explicitly set the tuning parameters even if you accept the defaults, because we may change the defaults in the future. Tuning may take a significant amount of time; we therefore suggest that you use all available cores on your computer for tuning.

# detect max number of cores available
max_cores <- parallel::detectCores()

# autoMrP with ML
m3 <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  L2.x.scale = FALSE,
  survey = survey_item,
  census = census,
  bin.proportion = "proportion",
  cores = max_cores
)
## Starting multilevel regression with best subset selection classifier tuning
## Starting multilevel regression with L1 regularization tuning
## Starting multilevel regression with principal components as context level variables tuning
## Starting gradient tree boosting tuning
## Starting support vector machine tuning
## Starting post-stratification
## Starting bayesian ensemble model averaging tuning

The overall state-level estimates are returned as a tibble in the object ebma.

m3$ebma
## # A tibble: 48 x 2
##    state  ebma
##    <fct> <dbl>
##  1 AL    0.393
##  2 AR    0.337
##  3 AZ    0.406
##  4 CA    0.333
##  5 CO    0.394
##  6 CT    0.353
##  7 DE    0.342
##  8 FL    0.395
##  9 GA    0.394
## 10 IA    0.340
## # ... with 38 more rows

The estimates for the individual classifiers are returned in the object classifiers.

m3$classifiers
## # A tibble: 48 x 6
##    state best_subset lasso   pca    gb   svm
##    <fct>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 AL          0.407 0.405 0.407 0.411 0.324
##  2 AR          0.325 0.318 0.316 0.376 0.354
##  3 AZ          0.429 0.424 0.419 0.388 0.363
##  4 CA          0.342 0.329 0.320 0.322 0.353
##  5 CO          0.403 0.404 0.401 0.386 0.373
##  6 CT          0.344 0.355 0.354 0.337 0.378
##  7 DE          0.328 0.342 0.344 0.342 0.359
##  8 FL          0.403 0.408 0.408 0.378 0.374
##  9 GA          0.403 0.411 0.414 0.394 0.339
## 10 IA          0.334 0.330 0.331 0.356 0.349
## # ... with 38 more rows

The classifiers are combined into an ensemble as a weighted average. The weights of the individual classifiers are returned in the object m3$weights.

m3$weights
## best_subset       lasso         pca          gb         svm 
##   0.2055582   0.2075498   0.2114595   0.1983000   0.1771326 
## attr(,"class")
## [1] "autoMrP" "weights" "numeric"
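Because the ensemble is a weighted average of the classifiers, the weights should sum to (roughly) one, and the ebma estimates can be (approximately) reconstructed by hand:

# sanity check: the weights sum to (approximately) one
sum(m3$weights)

# reconstruct the ensemble estimates as the weighted average of the classifiers
w <- m3$weights[colnames(m3$classifiers)[-1]]
head(as.matrix(m3$classifiers[, -1]) %*% w)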

Uncertainty Estimates

You may have noticed that, so far, we have presented point estimates of our state-level predictions without uncertainty. Uncertainty is implemented via bootstrapping, and bootstrapping may take a significant amount of time when it is combined with the machine learning approach that we suggest. We recommend running the bootstrap overnight.

To carry out bootstrapping, the argument uncertainty = TRUE must be added. By default, auto_MrP() carries out 200 bootstrap iterations; the argument boot.iter controls this number.
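Conceptually, each bootstrap iteration re-estimates the model on a resampled version of the survey. A minimal sketch of the idea - this is an illustration, not autoMrP's exact internal implementation, which may, for example, resample within states:

# one bootstrap draw: resample respondents with replacement; the model would
# then be refit on boot_sample and the post-stratified estimates stored
boot_sample <- survey_item[sample(nrow(survey_item), replace = TRUE), ]

Combining bootstrapping with the full machine learning pipeline looks like this: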

# autoMrP with ML & bootstrapping
m3 <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  L2.x.scale = FALSE,
  survey = survey_item,
  census = census,
  bin.proportion = "proportion",
  uncertainty = TRUE,
  cores = max_cores
)

The code above would run for a significant amount of time. To illustrate bootstrapping, we will use it on our standard MrP model - this will run very quickly - and print the results including uncertainty estimates.

# fit the model
m <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = "",
  L2.unit = "state",
  survey = survey_item,
  census = census,
  bin.proportion = "proportion",
  best.subset = FALSE,
  pca = FALSE,
  lasso = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE,
  uncertainty = TRUE,
  cores = max_cores
)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

autoMrP returns a tibble in which each row is one state-level estimate from one bootstrap iteration (here, 48 states x 200 iterations = 9,600 rows).

m$classifiers
## # A tibble: 9,600 x 2
##    state   mrp
##    <fct> <dbl>
##  1 AL    0.428
##  2 AR    0.254
##  3 AZ    0.427
##  4 CA    0.319
##  5 CO    0.424
##  6 CT    0.383
##  7 DE    0.370
##  8 FL    0.404
##  9 GA    0.428
## 10 IA    0.367
## # ... with 9,590 more rows

To quickly summarize the results, autoMrP has plot and summary methods.

# we set the argument n = 48 to plot all rows of the table
# ci.lvl = 0.95 sets the lower and upper bound columns to
# the 95% confidence interval
summary(object = m, n = 48, ci.lvl = 0.95)
## 
##  #  mrp  estimates:
## 
## state      Min.   Lower bound   Median   Upper bound      Max
## ------  -------  ------------  -------  ------------  -------
## AL       0.3184        0.3513   0.4323        0.5274   0.5982
## AR       0.1695        0.2231   0.3059        0.3743   0.3894
## AZ       0.2620        0.3204   0.3973        0.5013   0.5735
## CA       0.1734        0.2135   0.2876        0.3435   0.3796
## CO       0.3030        0.3247   0.4140        0.5230   0.5675
## CT       0.2564        0.2916   0.3835        0.4571   0.5156
## DE       0.2680        0.2940   0.3803        0.4748   0.6032
## FL       0.3135        0.3484   0.4239        0.5111   0.5563
## GA       0.2712        0.3403   0.4240        0.5269   0.5649
## IA       0.1817        0.2397   0.3343        0.4034   0.4656
## ID       0.3218        0.3629   0.4400        0.5597   0.6641
## IL       0.1993        0.2378   0.3172        0.3935   0.4471
## IN       0.2763        0.2837   0.3777        0.4701   0.5036
## KS       0.2536        0.3191   0.4172        0.5117   0.6092
## KY       0.1992        0.2312   0.3315        0.4238   0.4723
## LA       0.1961        0.2220   0.3088        0.3770   0.4357
## MA       0.1694        0.2680   0.3573        0.4455   0.5014
## MD       0.2039        0.2304   0.3135        0.4017   0.4639
## ME       0.2224        0.2506   0.3319        0.4095   0.4308
## MI       0.2103        0.2443   0.3245        0.4050   0.4417
## MN       0.3033        0.3354   0.4082        0.4967   0.5962
## MO       0.3130        0.3349   0.4178        0.5315   0.5864
## MS       0.2747        0.3014   0.3868        0.4884   0.5588
## MT       0.2607        0.3127   0.3907        0.4846   0.5107
## NC       0.2523        0.2873   0.3632        0.4535   0.4771
## ND       0.2871        0.3011   0.3629        0.4390   0.4679
## NE       0.2404        0.3076   0.3864        0.4748   0.5204
## NH       0.1935        0.2823   0.3611        0.4370   0.4781
## NJ       0.2402        0.2947   0.3732        0.4515   0.5003
## NM       0.2813        0.3252   0.3869        0.4846   0.5934
## NV       0.2671        0.3133   0.3856        0.4587   0.4716
## NY       0.2018        0.2366   0.3181        0.3891   0.4320
## OH       0.2990        0.3299   0.4042        0.5086   0.5599
## OK       0.1980        0.2594   0.3549        0.4300   0.4457
## OR       0.2619        0.3141   0.3961        0.4970   0.5444
## PA       0.2617        0.2854   0.3664        0.4292   0.5062
## RI       0.2365        0.2667   0.3454        0.4086   0.4481
## SC       0.2639        0.3115   0.3894        0.4716   0.5039
## SD       0.2720        0.3209   0.4012        0.4998   0.6233
## TN       0.2691        0.2906   0.3680        0.4481   0.4866
## TX       0.3235        0.3455   0.4266        0.5114   0.5432
## UT       0.3378        0.3758   0.4536        0.5623   0.7058
## VA       0.2961        0.3341   0.4226        0.5137   0.5498
## VT       0.3014        0.3200   0.3829        0.4749   0.5323
## WA       0.2778        0.3034   0.4104        0.5126   0.5643
## WI       0.2055        0.2601   0.3523        0.4259   0.5071
## WV       0.1562        0.2641   0.3499        0.4157   0.4481
## WY       0.3310        0.3573   0.4363        0.5402   0.5783
plot(m)
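If you prefer to compute your own summaries from the bootstrap draws, here is a minimal dplyr sketch that produces quantities comparable to the summary table above:

# median and 95% interval per state, computed from the bootstrap draws
m$classifiers %>%
  group_by(state) %>%
  summarise(
    lower = quantile(mrp, 0.025),
    median = median(mrp),
    upper = quantile(mrp, 0.975)
  )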