The R package BMIselect (Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Datasets) provides a suite of Bayesian MI-LASSO methods for variable selection on multiply-imputed datasets. The package includes four Bayesian MI-LASSO models using shrinkage priors (Multi-Laplace, Horseshoe, ARD) and a Spike-and-Slab prior (Spike-and-Laplace), along with tools for model fitting via MCMC, four-step projection predictive variable selection, and hyperparameter calibration. The methods are suitable for continuous outcomes and both continuous and binary covariates under missing-at-random or missing-completely-at-random assumptions. See Zou, J., Wang, S. and Chen, Q. (2025), Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Data. ArXiv, 2211.00114. doi:10.48550/arXiv.2211.00114 for more details.
This vignette introduces the workflow of the package, including simulating a multiply-imputed dataset, running the variable selection methods, and reporting post-selection inference.
The functions sim_A(), sim_B(), and sim_C() generate data based on the simulation section of the original paper. Here we use sim_A() to simulate a dataset of \(n=100\) samples with \(p=20\) covariates. The \(p=20\) covariates are independent standard normals. Among them, the \(1,2,5,11,12,15\)-th covariates are used to build the outcome model, with coefficients all equal to one. Missingness is generated under the MAR assumption, and multiple imputation is performed with MICE. We generate five imputed sets.
The object from sim_A() is a list containing the originally generated dataset data_O, the dataset with missing values imposed data_mis, the complete-case data data_CC, the five imputed datasets data_MI, the vector important indicating whether each covariate is related to the outcome, the covariance matrix covmat used to generate the covariate matrix, and the coefficient vector beta.
# Simulate data
seed <- 1  # a fixed seed for reproducibility; the exact seed used to produce the output below is assumed
data <- sim_A(
  n = 100,
  p = 20,
  type = "MAR",
  seed = seed,
  n_imp = 5
)
str(data, max.level = 1)
#> List of 7
#> $ data_O :List of 2
#> $ data_mis :List of 2
#> $ data_MI :List of 2
#> $ data_CC :List of 2
#> $ important: logi [1:20] TRUE TRUE FALSE FALSE TRUE FALSE ...
#> $ covmat : num [1:20, 1:20] 1 0 0 0 0 0 0 0 0 0 ...
#> $ beta : num [1:20] 1 1 0 0 1 0 0 0 0 0 ...
In this vignette, we focus on variable selection on the multiply-imputed set data_MI. The dataset contains a three-dimensional multiply-imputed covariate array data_MI$X and a two-dimensional outcome matrix data_MI$Y. The dimensions of data_MI$X are \((D, n, p)\), where \(D\) is the number of imputations. Similarly, the dimensions of data_MI$Y are \((D, n)\).
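We can check these dimensions directly:
# check the array dimensions
dim(data$data_MI$X)  # (D, n, p): here 5 100 20
dim(data$data_MI$Y)  # (D, n): here 5 100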
Our BMI_LASSO() function performs Bayesian variable selection using a two-stage process: it automatically runs both stages and reports the post-selection inference.
In the first stage, we use Markov chain Monte Carlo (MCMC) to fit a Bayesian group LASSO model. You can choose from the following prior types:

- "Multi_Laplace"
- "Horseshoe"
- "ARD" (Automatic Relevance Determination)
- "Spike_Laplace"

The first three are shrinkage priors, which gently pull the coefficient estimates toward zero without setting them exactly to zero (similar to ridge regression). In contrast, the Spike-and-Laplace model uses a spike-and-slab prior, which pairs the coefficients with latent on/off indicators to decide explicitly whether a variable should be included. However, variable selection remains an issue for all four priors: shrinkage models produce non-zero coefficient estimates for every variable, and the spike-and-slab prior may select different variable sets (different on/off indicators) across MCMC draws. Post-selection estimation is also a classical issue in variable selection. This motivates the method in the second stage.
In the second stage, we apply a four-step projection predictive variable selection method to address the limitations above:

1. Generate a grid of selection thresholds between 0 and 1 (parameter grid). The thresholds are used by a scaled-neighborhood criterion (see the paper for details) to form candidate submodels: a candidate set excludes all variables when the threshold is 0 and keeps all variables when the threshold is 1.
2. Project the posterior draws of the fitted full model onto each candidate submodel.
3. Compute the BIC of each projected submodel.
4. Select the submodel with the smallest BIC as the optimal (sub)model and report its post-selection inference.

See Zou, J., Wang, S. and Chen, Q. (2025), Bayesian MI-LASSO for Variable Selection on Multiply-Imputed Data. ArXiv, 2211.00114. doi:10.48550/arXiv.2211.00114 for more details.
Here is an example using the "Horseshoe" model. We specify 4000 iterations for both the burn-in and sampling periods.
bmilasso <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Horseshoe",
                      nburn = 4000, npost = 4000, output_verbose = TRUE, seed = seed)
#> Chain1: 1000/8000, burn-in
#> Chain1: 2000/8000, burn-in
#> Chain1: 3000/8000, burn-in
#> Chain1: 4000/8000, burn-in
#> Chain1: 4001/8000, sampling
#> Chain1: 5000/8000, sampling
#> Chain1: 6000/8000, sampling
#> Chain1: 7000/8000, sampling
#> Chain1: 8000/8000, sampling
#> The maximum of rank normalized split-Rhat of pooled beta in the selected model is 1.01
#> Running time for 1 chain: 0.22 minutes
str(bmilasso, max.level = 1)
#> List of 7
#> $ posterior :List of 14
#> $ select : logi [1:8, 1:20] FALSE TRUE TRUE TRUE TRUE TRUE ...
#> ..- attr(*, "dimnames")=List of 2
#> $ best_select : logi [1, 1:20] TRUE TRUE FALSE FALSE TRUE FALSE ...
#> ..- attr(*, "dimnames")=List of 2
#> $ posterior_best_models :List of 8
#> $ bic_models : num [1:2, 1:8] 2.52 0 1.62 28.55 1.63 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ summary_table_full : drws_smm [22 × 10] (S3: draws_summary/tbl_df/tbl/data.frame)
#> ..- attr(*, "num_args")= list()
#> $ summary_table_selected: drws_smm [22 × 10] (S3: draws_summary/tbl_df/tbl/data.frame)
#> ..- attr(*, "num_args")= list()
The returned object is a list containing the posterior draws, the selected model, BICs for all submodels, and several summary tables. We first take a look at the three most important objects: best_select, posterior_best_models, and summary_table_selected:

- best_select: a logical vector indicating the selection from the optimal (sub)model. Its rowname is the corresponding threshold.
- posterior_best_models: posterior draws of the coefficients (beta), intercept (alpha), and regression variance (sigma2) for the optimal (sub)model. We also include the pooled posterior draws of the coefficients and intercept across imputations. If the parameter standardize of BMI_LASSO() is TRUE, the function normalizes each covariate and centers the outcome within each imputed set, and the model excludes the intercept when fitting. In that case, posterior_best_models contains the posterior draws of the estimates on both the normalized and the original scales.
- summary_table_selected: a summary table of the posterior estimates for the optimal (sub)model. If the parameter standardize of BMI_LASSO() is TRUE, we report the estimates on the original scale.

# selection vector of optimal (sub)model
bmilasso$best_select
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> 0.01 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> 0.01 FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# posterior draws of coefficients, intercept and regression variance for the optimal (sub)model
str(bmilasso$posterior_best_models)
#> List of 8
#> $ post_sigma2 : num [1:4000] 3.59 4.04 3.57 3.57 3.47 ...
#> $ post_beta : num [1:4000, 1:5, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#> $ post_pool_beta : num [1:20000, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#> $ post_alpha : NULL
#> $ post_pool_alpha : num(0)
#> $ post_beta_original : num [1:4000, 1:5, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#> $ post_pool_beta_original: num [1:20000, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#> $ post_alpha_original : num [1:4000, 1:5] -0.306 -0.222 -0.313 -0.248 -0.22 ...
# the summary table of post-selection inference
bmilasso$summary_table_selected
#> # A tibble: 22 × 10
#> variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 intercept_… -0.239 -0.245 0.0734 0.0817 -0.353 -0.117 1.50 1.83 13.5
#> 2 beta_pool[… 1.12 1.12 0.199 0.198 0.790 1.45 1.01 70.4 1012.
#> 3 beta_pool[… 0.710 0.709 0.204 0.204 0.380 1.05 1.01 80.5 15213.
#> 4 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 5 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 6 beta_pool[… 0.852 0.851 0.188 0.188 0.545 1.16 1.01 467. 17776.
#> 7 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 8 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 9 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 10 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> # ℹ 12 more rows
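For instance, the indices of the selected covariates can be read directly from best_select; under this simulation they match the truly important covariates:
# indices of covariates kept by the optimal submodel
which(bmilasso$best_select[1, ])
#> [1]  1  2  5 11 12 15
# compare with the ground truth from the simulation
all(which(bmilasso$best_select[1, ]) == which(data$important))
#> [1] TRUE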
If researchers are interested in the collection of candidate submodels and their selection vectors/BICs, the objects select and bic_models contain this information:

- select: a logical matrix. Each row is a selection vector, whose rowname indicates its threshold. We only keep unique selection rows.
- bic_models: a matrix containing the BICs and degrees of freedom. Each column is a submodel.

# selection matrix of all (sub)models
bmilasso$select
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> 0 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 0.01 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#> 0.32 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#> 0.75 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#> 0.76 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#> 0.77 TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
#> 0.78 TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> 0.79 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> 0 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 0.01 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> 0.32 FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
#> 0.75 FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
#> 0.76 FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
#> 0.77 FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
#> 0.78 FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
#> 0.79 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# BICs and degrees of freedom
bmilasso$bic_models
#> 0 0.01 0.32 0.75 0.76 0.77 0.78
#> BIC 2.518341 1.624619 1.632042 1.64313 1.653488 1.67927 1.704209
#> df 0.000000 28.547059 31.997351 33.26740 34.427437 37.29881 39.530039
#> 0.79
#> BIC 1.737539
#> df 42.521079
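To make the mapping from thresholds to submodels concrete, here is an illustrative sketch of how a scaled-neighborhood rule can turn the pooled posterior draws (stored in posterior, described below) into candidate sets, followed by picking the BIC-minimizing threshold. This is our own illustration rather than the package's internal code; the exact criterion is described in the paper.
# illustrative sketch only: keep covariate j at threshold thr when the posterior
# mass of beta_j within one posterior sd of zero falls below thr
draws <- bmilasso$posterior$post_pool_beta                    # 20000 x 20 pooled draws
p_near_zero <- apply(draws, 2, function(b) mean(abs(b) <= sd(b)))
grid <- as.numeric(rownames(bmilasso$select))                 # thresholds reported above
candidates <- t(sapply(grid, function(thr) p_near_zero < thr))
# the optimal submodel is the candidate minimizing BIC
colnames(bmilasso$bic_models)[which.min(bmilasso$bic_models["BIC", ])]
#> [1] "0.01"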
In some cases, users may want to examine the Bayesian group LASSO model fitted in the first stage (before the projection predictive variable selection procedure), for example to test the robustness of each prior. This information is available in the objects posterior and summary_table_full. Additionally, the posterior draws of some hierarchical parameters are included in posterior.
# posterior draws of parameters of fitted full model in the first stage (before projection predictive variable selection procedure)
str(bmilasso$posterior)
#> List of 14
#> $ post_lambda2 : num [1:4000, 1:20] 155 499 183 147 347 ...
#> $ post_kappa : num [1:4000, 1:20] 0.522 8.521 0.432 1.145 0.68 ...
#> $ post_tau2 : num [1:4000] 0.0014 0.00164 0.00129 0.00151 0.00115 ...
#> $ post_eta : num [1:4000] 1097 3197 413 482 577 ...
#> $ post_alpha : num [1:4000, 1:5] 0 0 0 0 0 0 0 0 0 0 ...
#> $ post_sigma2 : num [1:4000] 3.42 3.92 3.47 3.35 3.35 ...
#> $ post_beta : num [1:4000, 1:5, 1:20] 1.236 0.65 1.283 1.313 0.617 ...
#> $ post_fitted_Y : num [1:4000, 1:5, 1:100] -2.953 -0.608 -3.17 -1.881 -2.873 ...
#> $ post_pool_beta : num [1:20000, 1:20] 1.236 0.65 1.283 1.313 0.617 ...
#> $ post_pool_fitted_Y : num [1:20000, 1:100] -2.953 -0.608 -3.17 -1.881 -2.873 ...
#> $ hat_matrix_proj : num [1:5, 1:100, 1:100] 0.0791 0.0854 0.0863 0.0813 0.1049 ...
#> $ post_beta_original : num [1:4000, 1:5, 1:20] 1.225 0.645 1.271 1.301 0.612 ...
#> $ post_pool_beta_original: num [1:20000, 1:20] 1.225 0.645 1.271 1.301 0.612 ...
#> $ post_alpha_original : num [1:4000, 1:5] -0.376 -0.321 -0.318 -0.247 -0.236 ...
# the summary table of coefficients, intercept, and regression variance from the fitted full model
bmilasso$summary_table_full
#> # A tibble: 22 × 10
#> variable mean median sd mad q5 q95 rhat ess_bulk
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 intercept_pool -0.269 -2.71e-1 0.0883 0.0906 -0.412 -0.123 1.32 2.46
#> 2 beta_pool[1] 1.08 1.08e+0 0.201 0.201 0.753 1.42 1.01 69.1
#> 3 beta_pool[2] 0.746 7.45e-1 0.210 0.211 0.405 1.09 1.02 55.7
#> 4 beta_pool[3] -0.00641 -1.23e-3 0.0733 0.0400 -0.133 0.105 1.00 5035.
#> 5 beta_pool[4] 0.0131 3.52e-3 0.0779 0.0418 -0.0984 0.154 1.00 14713.
#> 6 beta_pool[5] 0.884 8.83e-1 0.191 0.191 0.571 1.20 1.00 1264.
#> 7 beta_pool[6] 0.0290 9.03e-3 0.0910 0.0512 -0.0901 0.205 1.00 2762.
#> 8 beta_pool[7] -0.00292 -7.27e-4 0.0797 0.0467 -0.137 0.123 1.00 13550.
#> 9 beta_pool[8] -0.0258 -9.59e-3 0.0900 0.0529 -0.194 0.0940 1.00 7663.
#> 10 beta_pool[9] -0.00692 -1.83e-3 0.0766 0.0437 -0.138 0.114 1.00 16188.
#> # ℹ 12 more rows
#> # ℹ 1 more variable: ess_tail <dbl>
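A quick way to see the effect of the projection step is to compare the pooled estimate of an excluded covariate before and after selection (variable names as printed in the tables above):
# the full model shrinks an unimportant coefficient toward zero...
subset(bmilasso$summary_table_full, variable == "beta_pool[3]")$mean      # about -0.006
# ...while the projected optimal submodel sets it exactly to zero
subset(bmilasso$summary_table_selected, variable == "beta_pool[3]")$mean  # exactly 0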
In Bayesian MI-LASSO models, the "Multi_Laplace" and "Spike_Laplace" priors each have two hyperparameters:

- "Multi_Laplace": h (shape) and v (scale) of the Gamma hyperprior, with h > 1 and v > 0. Default: h = 2 and v = (D+1)/(D * (h - 1)).
- "Spike_Laplace": a (shape) and b (scale) of the Gamma hyperprior, with a > 1 and b > 0. Default: a = 2 and b = (D+1)/(2D * (a - 1)).

Users can specify the hyperparameters in BMI_LASSO():
# specify hyperparameters for Multi-Laplace model
bmilasso_ML <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Multi_Laplace",
                         nburn = 4000, npost = 4000, seed = seed, h = 10, v = 0.5)
#> Chain 1: 1000 / 8000 (burn-in)
#> Chain 1: 2000 / 8000 (burn-in)
#> Chain 1: 3000 / 8000 (burn-in)
#> Chain 1: 4000 / 8000 (burn-in)
#> Chain 1: 4001 / 8000 (sampling)
#> Chain 1: 5000 / 8000 (sampling)
#> Chain 1: 6000 / 8000 (sampling)
#> Chain 1: 7000 / 8000 (sampling)
#> Chain 1: 8000 / 8000 (sampling)
#> The maximum of rank normalized split-Rhat of pooled beta in the selected model is 1.01
#> Running time for 1 chain: 0.21 minutes
# specify hyperparameters for the Spike-Laplace model; to save time, we show the call without running it
# bmilasso_SL <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Spike_Laplace",
#                          nburn = 4000, npost = 4000, seed = seed, a = 10, b = 1)
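For reference, with \(D=5\) imputations the defaults above evaluate as follows:
# default hyperparameter values for D = 5 imputations, computed by hand
D <- 5
(v_default <- (D + 1) / (D * (2 - 1)))      # Multi-Laplace scale v = 1.2
(b_default <- (D + 1) / (2 * D * (2 - 1)))  # Spike-Laplace scale b = 0.6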
Since "Horseshoe"
and "ARD"
priors do not
require any hyperparameter tuning, we recommend users start with these
two options for simplicity and stability. For
"Multi_Laplace"
and "Spike_Laplace"
, we did a
sensitivity analysis for these hyperparameters in our paper. Based on
the results, we suggest avoiding very small values of v
and
b
, as they may lead to unstable model fitting and variable
selection. In practice, we recommend beginning with the default values
provided in the function.
If multiple cores are available, parallel computing can be used to run multiple chains. This functionality is based on the R packages doParallel and foreach. Users can specify the parameter ncores for the number of cores when running multiple chains (nchains).
# run 2 chains using 2 cores in parallel
start <- Sys.time()
bmilasso_parallel <- BMI_LASSO(data$data_MI$X, data$data_MI$Y, model = "Horseshoe",
                               nburn = 4000, npost = 4000, output_verbose = FALSE,
                               nchains = 2, ncores = 2, seed = seed)
end <- Sys.time()
print(paste("The running time for two chains in parallel:", difftime(end, start, units = "mins"), "mins"))
#> [1] "The running time for two chains in parallel: 0.228431232770284 mins"
The results returned from multiple chains differ from the single-chain case. The summary tables summary_table_selected and summary_table_full are calculated by pooling posterior draws from all chains. For all other output objects, results from each chain are listed separately, one by one, so users can obtain and compare the posterior estimates of each chain individually.
# the posterior draws of optimal submodels of two chains
str(bmilasso_parallel$posterior_best_models)
#> List of 2
#> $ :List of 8
#> ..$ post_sigma2 : num [1:4000] 3.59 4.04 3.57 3.57 3.47 ...
#> ..$ post_beta : num [1:4000, 1:5, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#> ..$ post_pool_beta : num [1:20000, 1:20] 1.275 0.635 1.299 1.317 0.669 ...
#> ..$ post_alpha : NULL
#> ..$ post_pool_alpha : num(0)
#> ..$ post_beta_original : num [1:4000, 1:5, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#> ..$ post_pool_beta_original: num [1:20000, 1:20] 1.264 0.629 1.288 1.306 0.663 ...
#> ..$ post_alpha_original : num [1:4000, 1:5] -0.306 -0.222 -0.313 -0.248 -0.22 ...
#> $ :List of 8
#> ..$ post_sigma2 : num [1:4000] 3.76 3.96 3.68 3.91 3.85 ...
#> ..$ post_beta : num [1:4000, 1:5, 1:20] 1.091 0.886 0.654 0.767 1.3 ...
#> ..$ post_pool_beta : num [1:20000, 1:20] 1.091 0.886 0.654 0.767 1.3 ...
#> ..$ post_alpha : NULL
#> ..$ post_pool_alpha : num(0)
#> ..$ post_beta_original : num [1:4000, 1:5, 1:20] 1.082 0.878 0.649 0.76 1.288 ...
#> ..$ post_pool_beta_original: num [1:20000, 1:20] 1.082 0.878 0.649 0.76 1.288 ...
#> ..$ post_alpha_original : num [1:4000, 1:5] -0.275 -0.277 -0.189 -0.24 -0.232 ...
# summary table by pooling two chains
bmilasso_parallel$summary_table_selected
#> # A tibble: 22 × 10
#> variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 intercept_… -0.240 -0.245 0.0735 0.0819 -0.353 -0.117 1.36 4.48 41.2
#> 2 beta_pool[… 1.12 1.12 0.199 0.198 0.792 1.45 1.00 156. 1186.
#> 3 beta_pool[… 0.710 0.709 0.202 0.202 0.380 1.04 1.01 267. 30533.
#> 4 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 5 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 6 beta_pool[… 0.851 0.850 0.186 0.187 0.545 1.16 1.00 1413. 36590.
#> 7 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 8 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 9 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> 10 beta_pool[… 0 0 0 0 0 0 NA NA NA
#> # ℹ 12 more rows
Although Bayesian MI-LASSO models perform better than MI-LASSO, some users may be interested in the frequentist method. We therefore also include MI-LASSO in the function MI_LASSO(), originally from https://www.columbia.edu/~qc2138/Downloads/software/MI-lasso.R. We further add a parameter ncores to allow parallel computing. We give a brief example below; more details can be found in Chen, Q. and Wang, S. (2013). "Variable selection for multiply-imputed data with application to dioxin exposure study." Statistics in Medicine, 32(21): 3646–3659.
# fit MI-LASSO
milasso <- MI_LASSO(data$data_MI$X, data$data_MI$Y, ncores = 2)
#> Running time for 2 ncore: 0.04 minutes
str(milasso)
#> List of 2
#> $ best :List of 4
#> ..$ coefficients: num [1:21, 1:5] -0.309 0.9 0.605 0 0 ...
#> ..$ bic : num 1.64
#> ..$ varsel : logi [1:20] TRUE TRUE FALSE FALSE TRUE FALSE ...
#> ..$ lambda : num 4.29
#> $ lambda_path: num [1:101, 1:2] 0.125 0.134 0.144 0.154 0.165 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:2] "lamvec" ""
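The selected covariates can be extracted for comparison with the Bayesian result above; under this simulation the two selections are expected to agree:
# indices of covariates selected by MI-LASSO, for comparison with bmilasso$best_select
which(milasso$best$varsel)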
A final note about this package: if you find any bugs, please contact me at jungang.zou@gmail.com.