An Introduction to BMIselect

Jungang Zou

2025-08-25

Introduction

The R package BMIselect (Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Datasets) provides a suite of Bayesian MI-LASSO for variable selection methods for multiply-imputed datasets. The package includes four Bayesian MI-LASSO models using shrinkage (Multi-Laplace, Horseshoe, ARD) and Spike-and-Slab (Spike-and-Laplace) priors, along with tools for model fitting via MCMC, four-step projection predictive variable selection, and hyperparameter calibration. Methods are suitable for continuous outcomes and both continuous and binary covariates under missing-at-random or missing-completely-at-random assumptions. See Zou, J., Wang, S. and Chen, Q. (2025), Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Data. ArXiv, 2211.00114. doi:10.48550/arXiv.2211.00114 for more details.

This vignette introduces the workflow of the package, including simulating multiply-imputed dataset, running variable selection methods and reporting the post-selection inference.

library(BMIselect)
seed = 12345
set.seed(seed)

Data Preparation

The functions of sim_A(), sim_B(), sim_C() are used to generate data based on the simulation section in the original paper. Here we use sim_A() to simulate a dataset of \(n=200\) samples with \(p=20\) covariates. The \(p=20\) covariates are independent standard normals. Among them, the \(1,2,5,11,12,15\)-th covariates are used to build outcome model, with coefficients are all ones. Missingness are generated based on MAR assumption. Multiple imputation is based on MICE. We generate five imputed sets.

The object from sim_A() is a list, containing the originally generated dataset data_O, the dataset imposed with missing values data_mis, the data with only complete cases data_CC, the five imputed dataset data_MI, the vector important to indicate if covariate is related to outcome or not, the covariance matrix covmat to generate covariates matrix, and the coefficients vector beta.

# Simulate data
data <- sim_A(
  n = 100,
  p = 20,
  type = "MAR",
  seed = seed,
  n_imp = 5
)

str(data, max.level = 1)
#> List of 7
#>  $ data_O   :List of 2
#>  $ data_mis :List of 2
#>  $ data_MI  :List of 2
#>  $ data_CC  :List of 2
#>  $ important: logi [1:20] TRUE TRUE FALSE FALSE TRUE FALSE ...
#>  $ covmat   : num [1:20, 1:20] 1 0 0 0 0 0 0 0 0 0 ...
#>  $ beta     : num [1:20] 1 1 0 0 1 0 0 0 0 0 ...

In this vignette, we focus on variable selection on the multiply-imputed set data_MI. The dataset contains a three-dimensional multiply-imputed covariates data_MI$X and a two-dimensional outcome matrix data_MI$Y. The dimensions of data_MI$X is \((D, n, p)\), where \(D\) is the number of imputations. Similarly, the dimension of data_MI$Y is \((D, n)\).

str(data$data_MI)
#> List of 2
#>  $ X: num [1:5, 1:100, 1:20] 0.547 0.547 0.547 0.547 0.547 ...
#>  $ Y: num [1:5, 1:100] -2.68 -2.68 -2.68 -2.68 -2.68 ...

Variable Selection Using Bayesian MI-LASSO

Our BMI_LASSO() function performs Bayesian variable selection using a two-stage process. The function BMI_LASSO() automatically perform these two-stage algorithm and report the post-selection inference.

In the first stage, we use Markov Chain Monte Carlo (MCMC) to fit a Bayesian group LASSO model. You can choose from the following prior types:

The first three priors are called shrinkage priors, which gently push the coefficient estimates toward zero without setting them exactly to zero (similar to ridge regression). In contrast, the Spike-and-Laplace model uses a spike-and-slab prior, which combines the coefficients with hidden on/off switches to more explicitly decide whether a variable should be included or not. However, variable selection is an issue for these four priors: shrinkage models get non-zero coefficients estimates for all variables, and Spike-and-Slab prior may select different variable sets (different on/off switches) across MCMC draws. Also, the post-selection estimation is also a classical issue for variable selection. This motivates us to propose the method in the second stage.

In the second stage, we apply a four-step projection predictive variable selection method to address the limitations above:

  1. Generate candidate variable subsets using different thresholds between 0 and 1 (specified by the parameter grid). The thresholds will be used by a scaled-neighborhood criterion (see more details in the paper). A candidate set excludes all variables when threshold is 0, and keeping all variables when threshold is 1.
  2. Fit submodels using each candidate subset. Each submodel attempts to approximate the full model as closely as possible.
  3. Evaluate each submodel using the Bayesian Information Criterion (BIC). Select the optimal submodel based on the smallest BIC value.
  4. Report the post-selection estimates from the optimal submodel.

See Zou, J., Wang, S. and Chen, Q. (2025), Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Data. ArXiv, 2211.00114. doi:10.48550/arXiv.2211.00114 for more details.

Example of Horseshoe model

Here is an example using "Horseshoe" model. We specify 4000 iterations for both burn-in period and sampling period.

bmilasso <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Horseshoe",
                 nburn = 4000, npost = 4000, output_verbose = TRUE, seed = seed)
#> Chain1: 1000/8000, burn-in 
#> Chain1: 2000/8000, burn-in 
#> Chain1: 3000/8000, burn-in 
#> Chain1: 4000/8000, burn-in 
#> Chain1: 4001/8000, sampling 
#> Chain1: 5000/8000, sampling 
#> Chain1: 6000/8000, sampling 
#> Chain1: 7000/8000, sampling 
#> Chain1: 8000/8000, sampling 
#> The maximum of rank normalized split-Rhat of pooled beta in the selected model is 1.01
#> Running time for 1 chain: 0.22 minutes

str(bmilasso, max.level = 1)
#> List of 7
#>  $ posterior             :List of 14
#>  $ select                : logi [1:8, 1:20] FALSE TRUE TRUE TRUE TRUE TRUE ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ best_select           : logi [1, 1:20] TRUE TRUE FALSE FALSE TRUE FALSE ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ posterior_best_models :List of 8
#>  $ bic_models            : num [1:2, 1:8] 2.52 0 1.62 28.55 1.63 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ summary_table_full    : drws_smm [22 × 10] (S3: draws_summary/tbl_df/tbl/data.frame)
#>   ..- attr(*, "num_args")= list()
#>  $ summary_table_selected: drws_smm [22 × 10] (S3: draws_summary/tbl_df/tbl/data.frame)
#>   ..- attr(*, "num_args")= list()

The returned object is a list, containing posterior draws, selected model, BIC for all submodels, and some summary tables. We first take a look at the three important objects: best_select, posterior_best_models, and summary_table_selected:

# selection vector of optimal (sub)model
bmilasso$best_select
#>      [,1] [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
#> 0.01 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
#>      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> 0.01 FALSE  TRUE FALSE FALSE FALSE FALSE FALSE


# posterior draws of coefficients, intercept and regression variance for the optimal (sub)model
str(bmilasso$posterior_best_models)
#> List of 8
#>  $ post_sigma2            : num [1:4000] 3.59 4.04 3.57 3.57 3.47 ...
#>  $ post_beta              : num [1:4000, 1:5, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#>  $ post_pool_beta         : num [1:20000, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#>  $ post_alpha             : NULL
#>  $ post_pool_alpha        : num(0) 
#>  $ post_beta_original     : num [1:4000, 1:5, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#>  $ post_pool_beta_original: num [1:20000, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#>  $ post_alpha_original    : num [1:4000, 1:5] -0.306 -0.222 -0.313 -0.248 -0.22 ...


# the summary table of post-selection inference
bmilasso$summary_table_selected
#> # A tibble: 22 × 10
#>    variable      mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
#>    <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
#>  1 intercept_… -0.239 -0.245 0.0734 0.0817 -0.353 -0.117  1.50     1.83     13.5
#>  2 beta_pool[…  1.12   1.12  0.199  0.198   0.790  1.45   1.01    70.4    1012. 
#>  3 beta_pool[…  0.710  0.709 0.204  0.204   0.380  1.05   1.01    80.5   15213. 
#>  4 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  5 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  6 beta_pool[…  0.852  0.851 0.188  0.188   0.545  1.16   1.01   467.    17776. 
#>  7 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  8 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  9 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#> 10 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#> # ℹ 12 more rows

If some researcher are interested in the collections of candidate submodels and their selection vectors/BIC, the objects select and bic_models contain the information:

# selection matrix of all (sub)models
bmilasso$select
#>       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
#> 0    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 0.01  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#> 0.32  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#> 0.75  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#> 0.76  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
#> 0.77  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#> 0.78  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#> 0.79  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> 0    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 0.01 FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#> 0.32 FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
#> 0.75 FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
#> 0.76 FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
#> 0.77 FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
#> 0.78 FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
#> 0.79  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE


# BICs and degrees of freedom
bmilasso$bic_models
#>            0      0.01      0.32     0.75      0.76     0.77      0.78
#> BIC 2.518341  1.624619  1.632042  1.64313  1.653488  1.67927  1.704209
#> df  0.000000 28.547059 31.997351 33.26740 34.427437 37.29881 39.530039
#>          0.79
#> BIC  1.737539
#> df  42.521079

In some cases, if users want to test the robustness of each prior, the Bayesian group LASSO model fitted from the first stage (before projection predictive variable selection procedure) is needed. Users can access this information from objects posterior and summary_table_full. Additionally, the posterior draws of some hierarchical parameters are also included in posterior.

# posterior draws of parameters of fitted full model in the first stage (before projection predictive variable selection procedure)
str(bmilasso$posterior)
#> List of 14
#>  $ post_lambda2           : num [1:4000, 1:20] 155 499 183 147 347 ...
#>  $ post_kappa             : num [1:4000, 1:20] 0.522 8.521 0.432 1.145 0.68 ...
#>  $ post_tau2              : num [1:4000] 0.0014 0.00164 0.00129 0.00151 0.00115 ...
#>  $ post_eta               : num [1:4000] 1097 3197 413 482 577 ...
#>  $ post_alpha             : num [1:4000, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ post_sigma2            : num [1:4000] 3.42 3.92 3.47 3.35 3.35 ...
#>  $ post_beta              : num [1:4000, 1:5, 1:20] 1.236 0.65 1.283 1.313 0.617 ...
#>  $ post_fitted_Y          : num [1:4000, 1:5, 1:100] -2.953 -0.608 -3.17 -1.881 -2.873 ...
#>  $ post_pool_beta         : num [1:20000, 1:20] 1.236 0.65 1.283 1.313 0.617 ...
#>  $ post_pool_fitted_Y     : num [1:20000, 1:100] -2.953 -0.608 -3.17 -1.881 -2.873 ...
#>  $ hat_matrix_proj        : num [1:5, 1:100, 1:100] 0.0791 0.0854 0.0863 0.0813 0.1049 ...
#>  $ post_beta_original     : num [1:4000, 1:5, 1:20] 1.225 0.645 1.271 1.301 0.612 ...
#>  $ post_pool_beta_original: num [1:20000, 1:20] 1.225 0.645 1.271 1.301 0.612 ...
#>  $ post_alpha_original    : num [1:4000, 1:5] -0.376 -0.321 -0.318 -0.247 -0.236 ...


# the summary table of coefficients, intercept, and regression variance from the fitted full model
bmilasso$summary_table_full
#> # A tibble: 22 × 10
#>    variable           mean   median     sd    mad      q5     q95  rhat ess_bulk
#>    <chr>             <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>
#>  1 intercept_pool -0.269   -2.71e-1 0.0883 0.0906 -0.412  -0.123   1.32     2.46
#>  2 beta_pool[1]    1.08     1.08e+0 0.201  0.201   0.753   1.42    1.01    69.1 
#>  3 beta_pool[2]    0.746    7.45e-1 0.210  0.211   0.405   1.09    1.02    55.7 
#>  4 beta_pool[3]   -0.00641 -1.23e-3 0.0733 0.0400 -0.133   0.105   1.00  5035.  
#>  5 beta_pool[4]    0.0131   3.52e-3 0.0779 0.0418 -0.0984  0.154   1.00 14713.  
#>  6 beta_pool[5]    0.884    8.83e-1 0.191  0.191   0.571   1.20    1.00  1264.  
#>  7 beta_pool[6]    0.0290   9.03e-3 0.0910 0.0512 -0.0901  0.205   1.00  2762.  
#>  8 beta_pool[7]   -0.00292 -7.27e-4 0.0797 0.0467 -0.137   0.123   1.00 13550.  
#>  9 beta_pool[8]   -0.0258  -9.59e-3 0.0900 0.0529 -0.194   0.0940  1.00  7663.  
#> 10 beta_pool[9]   -0.00692 -1.83e-3 0.0766 0.0437 -0.138   0.114   1.00 16188.  
#> # ℹ 12 more rows
#> # ℹ 1 more variable: ess_tail <dbl>

Hyerparameters Specification

In Bayesian MI-LASSO models, "Multi_Laplace" and "Spike_Laplace" priors have two hyperparameters, respectively:

Users can specify the hyperparameters in BMI_LASSO():

# specify hyperparameters for Multi-Laplace model
bmilasso_ML <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Multi_Laplace",
                 nburn = 4000, npost = 4000, seed = seed, h = 10, v = 0.5)
#> Chain 1: 1000 / 8000 (burn-in)
#> Chain 1: 2000 / 8000 (burn-in)
#> Chain 1: 3000 / 8000 (burn-in)
#> Chain 1: 4000 / 8000 (burn-in)
#> Chain 1: 4001 / 8000 (sampling)
#> Chain 1: 5000 / 8000 (sampling)
#> Chain 1: 6000 / 8000 (sampling)
#> Chain 1: 7000 / 8000 (sampling)
#> Chain 1: 8000 / 8000 (sampling)
#> The maximum of rank normalized split-Rhat of pooled beta in the selected model is 1.01
#> Running time for 1 chain: 0.21 minutes


# specify hyperparameters for Spike-Laplace model. To save time, we give an example without running
# bmilasso_SL <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Spike_Laplace",
#                  nburn = 4000, npost = 4000, seed = seed, a = 10, b = 1)

Since "Horseshoe" and "ARD" priors do not require any hyperparameter tuning, we recommend users start with these two options for simplicity and stability. For "Multi_Laplace" and "Spike_Laplace", we did a sensitivity analysis for these hyperparameters in our paper. Based on the results, we suggest avoiding very small values of v and b, as they may lead to unstable model fitting and variable selection. In practice, we recommend beginning with the default values provided in the function.

Parallel Computing for Multiple Chains

If multiple cores are available, parallel computing can be used for multiple chains. This function is based on R packages doParallel and foreach. Users can specify the parameter ncores for the number of cores when running multiple chains nchain.

# run 2 chains using 2 cores in parallel
start = Sys.time()
bmilasso_parallel <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Horseshoe",
                 nburn = 4000, npost = 4000, output_verbose = FALSE, nchains = 2, ncores = 2, seed = seed)
end = Sys.time()
print(paste("The running time for two chains in parallel:", difftime(end, start, units = "mins"), "mins"))
#> [1] "The running time for two chains in parallel: 0.228431232770284 mins"

The returned results from multiple chains are different from single chain. The summary tables summary_table_selected and summary_table_full are calculated by pooling posterior draws from multiple chains. For all other output objects, results from each chain are listed separately, one by one, so users can obtain and compare the posterior estimates of each chain individually.

# the posterior draws of optimal submodels of two chains
str(bmilasso_parallel$posterior_best_models)
#> List of 2
#>  $ :List of 8
#>   ..$ post_sigma2            : num [1:4000] 3.59 4.04 3.57 3.57 3.47 ...
#>   ..$ post_beta              : num [1:4000, 1:5, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#>   ..$ post_pool_beta         : num [1:20000, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#>   ..$ post_alpha             : NULL
#>   ..$ post_pool_alpha        : num(0) 
#>   ..$ post_beta_original     : num [1:4000, 1:5, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#>   ..$ post_pool_beta_original: num [1:20000, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#>   ..$ post_alpha_original    : num [1:4000, 1:5] -0.306 -0.222 -0.313 -0.248 -0.22 ...
#>  $ :List of 8
#>   ..$ post_sigma2            : num [1:4000] 3.76 3.96 3.68 3.91 3.85 ...
#>   ..$ post_beta              : num [1:4000, 1:5, 1:20] 1.091 0.886 0.654 0.767 1.3 ...
#>   ..$ post_pool_beta         : num [1:20000, 1:20] 1.091 0.886 0.654 0.767 1.3 ...
#>   ..$ post_alpha             : NULL
#>   ..$ post_pool_alpha        : num(0) 
#>   ..$ post_beta_original     : num [1:4000, 1:5, 1:20] 1.082 0.878 0.649 0.76 1.288 ...
#>   ..$ post_pool_beta_original: num [1:20000, 1:20] 1.082 0.878 0.649 0.76 1.288 ...
#>   ..$ post_alpha_original    : num [1:4000, 1:5] -0.275 -0.277 -0.189 -0.24 -0.232 ...

# summary table by pooling two chains
bmilasso_parallel$summary_table_selected
#> # A tibble: 22 × 10
#>    variable      mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
#>    <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
#>  1 intercept_… -0.240 -0.245 0.0735 0.0819 -0.353 -0.117  1.36     4.48     41.2
#>  2 beta_pool[…  1.12   1.12  0.199  0.198   0.792  1.45   1.00   156.     1186. 
#>  3 beta_pool[…  0.710  0.709 0.202  0.202   0.380  1.04   1.01   267.    30533. 
#>  4 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  5 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  6 beta_pool[…  0.851  0.850 0.186  0.187   0.545  1.16   1.00  1413.    36590. 
#>  7 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  8 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#>  9 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#> 10 beta_pool[…  0      0     0      0       0      0     NA       NA        NA  
#> # ℹ 12 more rows

Frequentists MI-LASSO

Although Bayesian MI-LASSO models have better performance than MI-LASSO, some users may be instereted in frequentists method. We also include the MI-LASSO method in the function MI_LASSO(), originally from https://www.columbia.edu/~qc2138/Downloads/software/MI-lasso.R. We further add a parameter ncores to allow parallel computing. We give an brief example and more details can be found in Chen, Q. and Wang, S. (2013). “Variable selection for multiply-imputed data with application to dioxin exposure study.” Statistics in Medicine, 32(21): 3646–3659.

# fit MI-LASSO
milasso = MI_LASSO(data$data_MI$X, data$data_MI$Y, ncores = 2)
#> Running time for 2 ncore: 0.04 minutes

str(milasso)
#> List of 2
#>  $ best       :List of 4
#>   ..$ coefficients: num [1:21, 1:5] -0.309 0.9 0.605 0 0 ...
#>   ..$ bic         : num 1.64
#>   ..$ varsel      : logi [1:20] TRUE TRUE FALSE FALSE TRUE FALSE ...
#>   ..$ lambda      : num 4.29
#>  $ lambda_path: num [1:101, 1:2] 0.125 0.134 0.144 0.154 0.165 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:2] "lamvec" ""

Discussion

Here are some discussions about this package:

  1. Currently, our package only supports continuous outcomes. Binary, categorical and ordinal outcomes are our research topic in the future.
  2. We support continuous and binary covariates. Selection for categorical covariates is tricky. Generally we transform a categorical variable into a set of binary dummy variables. Our package supports the selection on each dummy variable. For selection of the whole group, we will research it in the future.
  3. Similarly to other variable selection methods, our models are sensitive to multicollinearity. If some covariates are highly correlated, we suggest to do some pre-processing before using our methods.

Disclaimer

If you find any bug in this package, please contact me: .