---
title: "Dealing with missing data in fitting prediction rule ensembles"
author: "Marjolein Fokkema"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Dealing with missing data in fitting prediction rule ensembles}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
bibliography: bib.bib
csl: bib_style.csl
---

## Introduction

To deal with missing data in parametric models, multiple imputation is the golden standard ([@schafer2002missing]). With GLMs, the models fitted on each imputed dataset can then be pooled. For non-parametric methods and specifically prediction rule ensembles, the jury is still out on how to best deal with missing data. There are several possible approaches:

* Listwise deletion. Although the default in `pre()`, it is certainly the least favorable option.

* Single imputation: Perform only a single imputation and fit a prediction rule ensemble on this single dataset. This is likely better than listwise deletion, but likely inferior to multiple imputation. The main advantage is that a PRE can be fit as usual.

* Multiple imputation followed by pooling. This approach takes multiple imputed datasets and fits a separate prediction rule ensemble to each of the imputed datasets, which are aggregated to form a final ensemble. The main disadvantage is that this yields an ensemble of ensembles, which will be (much) more complex and thus less interpretable than a single PRE. 

* Multiple imputation followed by stacking. This approach stacks the multiple imputed datasets and fits a single prediction rule ensemble to this dataset. Rule generation and estimation of the final model can be adjusted to counter the artificial inflation of sample size. The main advantage is that it yields a single PRE, while treating missing data in an optimal manner.   

An alternative, would be the Missing-In-Attributes approach for rule generation, followed by mean imputation for the final model. According to [@josse2019consistency], mean imputation and the Missing-In-Attributes approaches perform well from a prediction perspective and are computationally inexpensive. This will be implemented in future versions of package **`pre`**.

Below, we provide examples for the four approaches described above. 


## Example: Predicting wind speed

For the examples, we will be predicting Wind speeds using the `airquality` dataset (we focus on predicting the `wind` variable, because it does not have missing values, while variables `Ozone` and `Solar.R` do):

```{r, results='hide', message=FALSE, warning=FALSE, fig.width=5}
head(airquality)
nrow(airquality)
library("mice")
md.pattern(airquality, rotate.names = TRUE)
```

## Listwise Deletion

This option, although not recommended, is the default of function `pre()`:

```{r}
library("pre")
set.seed(43)
airq.ens <- pre(Wind ~., data = airquality)
airq.ens
```
With listwise deletion, only `r sum(complete.cases(airquality))` out of `r nrow(airquality)` observations are retained. We obtain a rather sparse ensemble.


## Single Imputation

Here we apply single imputation by replacing missing values with the mean: 

```{r}
imp0 <- airquality
imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE)
imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE)
set.seed(43)
airq.ens.imp0 <- pre(Wind ~., data = imp0)
airq.ens.imp0
```

We obtain a larger number of rules, and slightly lower cross-validated mean squared error. However, this model cannot really be compared with the listwise deletion model, because they are computed over different sets of observations.


## Multiple Imputation

We first perform multiple imputation by chained equations, using the predictive mean matching method. We generate 5 imputed datasets:

```{r, message=FALSE, warning=FALSE, fig.width=7}
set.seed(42)
imp <- mice(airquality, m = 5, printFlag = FALSE)
```

We create a `list` with imputed datasets:

```{r}
imp1 <- complete(imp, action = "all", include = FALSE)
```

To deal with imputed data, we have two options: *Stacking* the imputed datasets or *pooling* the resulting ensembles. The former is likely most beneficial for retaining interpretability and therefore an experimental (!) version has been implemented in package **`pre`**.


### Stacking imputed datasets

Function `mi_pre` implements a so-called *stacking* approach (see also [@wood2008should]), where imputed datasets are combined into one large dataset. In addition to adjustments of the sampling procedures, adjustments to observation weight are made to counter the artificial inflation of sample size. Function `mi_pre` takes a list of imputed datasets as input data, because it is assumed imputation has already been performed.

Although the option to use the fraction of complete data for computing observation weight is provided through argument `compl_frac` of function `mi_pre`, users are not advised to use it. For example, [@du2022variable] write: "An alternative weight specification, proposed in Wan et al. (2015), is $o_i = \frac{f_i}{D}$, where $f_i$ is the number of observed predictors out of the total number of predictors for subject $i$ [...] upweighting subjects with less missingness and downweighting subjects with more missingness can, in some sense, be viewed as making the optimization more like complete-case analysis, which might be problematic for Missing at Random (MAR) and Missing not at Random (MNAR) scenarios."

We fit a rule ensemble to the imputed data using stacking:

```{r}
set.seed(42)
airq.ens.sta <- mi_pre(Wind ~ . , data = imp1)
```

All `S3` methods for objects of class `pre` are also applicable to the ensembles resulting from application of function `mi_pre`, e.g.:

```{r, fig.width=4, fig.height=7}
summary(airq.ens.sta)
coefs <- coef(airq.ens.sta)
coefs[coefs$coefficient != 0, ]
```

Computation of partial dependence plots can become computationally prohibitive, especially with multiply-imputed data. To reduce computational load, function `mi_mean` computes the average imputed dataset, which can then be supplied to the `newdata` arguments of functions `singleplot` or `pairplot` to speed up computation of partial dependence plots:

```{r}
newdata <- mi_mean(imp1)
singleplot(airq.ens.sta, newdata = newdata, varname = "Ozone")
pairplot(airq.ens.sta, newdata = newdata, varnames = c("Ozone", "Solar.R"))
```


### Pooling

To apply pooling, we create a custom function that fits PREs to several datasets contained in a list:

```{r}
pre.agg <- function(datasets, ...) {
  result <- list()
  for (i in 1:length(datasets)) {
    result[[i]] <- pre(datasets[[i]], ...)
  }
  result
}
```

We apply the new function:

```{r}
set.seed(43)
airq.agg <- pre.agg(imp1, formula = Wind ~ .)
```

Note that we can use the ellipsis (`...`) to pass arguments to `pre` (see `?pre` for an overview of arguments that can be specified).

We now define `print`, `summary`, `predict` and `coef` methods to extract results from the fitted ensemble. Again, we can use the ellipsis (`...`) to pass arguments to the `print`, `summary`, `predict` and `coef` methods of function `pre` (see e.g., `?pre:::print.pre` for more info):

```{r, results ='hide'}
print.agg <- function(object, ...) {
  result <- list()
  sink("NULL")
  for (i in 1:length(object)) {
    result[[i]] <- print(object[[i]], ...)
  }
  sink()
  print(result)
}
print.agg(airq.agg) ## results suppressed for space considerations

summary.agg <- function(object, ...) {
  for (i in 1:length(object)) summary(object[[i]], ...)
}
summary.agg(airq.agg) ## results suppressed for space considerations
```

For averaging over predictions, there is only one option for continuous outcomes. For non-continuous outcomes, we can average over the linear predictor, or over the predicted values on the scale of the response. I am not sure which would be more appropriate; the resulting predicted values will not be identical, but highly correlated, though.

```{r}
predict.agg <- function(object, newdata, ...) {
  rowMeans(sapply(object, predict, newdata = newdata, ...))
}
agg_preds <- predict.agg(airq.agg, newdata = airquality[1:4, ])
agg_preds
```

Finally, the `coef` method should return the averaged / aggregated final PRE. That is, it returns: 

1) One averaged intercept; 

2) All rules and linear terms, with their coefficients scaled by the number of datasets; 

3) In the presence of identical rules and linear terms, it aggregates those rules and their coefficients into one rule / term, and adds together the scaled coefficients. 

Note that linear terms of the same predictors, which obtained different winsorizing points across imputed datasets will be retained seperately and will not be aggregated. Note also that the labels of rules and variables may overlap between different datasets (e.g., the label `rule 12` may appear multiple times in the aggregated ensemble, but each `rule 12` will have different conditions).

```{r}
coef.agg <- function(object, ...) {
  coefs <- coef(object[[1]], ...)
  coefs <- coefs[coefs$coefficient != 0, ]
  for (i in 2:length(object)) {
    coefs_tmp <- coef(object[[i]], ...)
    coefs <- rbind(coefs, coefs_tmp[coefs_tmp$coefficient != 0, ])
  }
  ## Divide coefficients by the number of datasets:
  coefs$coefficient <- coefs$coefficient / length(object)
  ## Identify identical rules:
  duplicates <- which(duplicated(coefs$description))
  for (i in duplicates) {
    first_match <- which(coefs$description == coefs$description[i])[1]
    ## Add the coefficients:
    coefs$coefficient[first_match] <- 
      coefs$coefficient[first_match] + coefs$coefficient[i]
  }
  ## Remove duplicates:
  if (length(duplicates) > 0) coefs <- coefs[-duplicates, ]
  ## Check if there are- duplicate linear terms left and repeat:
  duplicates <- which(duplicated(coefs$rule))
  for (i in duplicates) {
    first_match <- which(coefs$rule == coefs$rule[i])[1]
    coefs$coefficient[first_match] <- 
      coefs$coefficient[first_match] + coefs$coefficient[i]
  }
  if (length(duplicates) > 0) coefs <- coefs[-duplicates, ]
  ## Return results:
  coefs
}
coef.agg(airq.agg)
```

We have obtained a final ensemble of `r nrow(coef.agg(airq.agg))-1` terms.


## Comparing accuracy and sparsity

We compare performance using 10-fold cross validation. We evaluate predictive accuracy and the number of selected rules. We only evaluate accuracy for observations that have no missing values.

```{r, eval=FALSE, results='hide', message = FALSE, warning = FALSE}
k <- 10
set.seed(43)
fold_ids <- sample(1:k, size = nrow(airquality), replace = TRUE)

observed <- c()
for (i in 1:k) {
  ## Separate training and test data
  test <- airquality[fold_ids == i, ]
  test <- test[!is.na(test$Ozone), ]
  test <- test[!is.na(test$Solar.R), ]
  observed <- c(observed, test$Wind)
}  

preds <- data.frame(observed)
preds$LWD <- preds$SI <- preds$MI <- preds$observed
nterms <- matrix(nrow = k, ncol = 3)
colnames(nterms) <- c("LWD", "SI", "MI")
row <- 1

for (i in 1:k) {

  if (i > 1) row <- row + nrow(test)
  
  ## Separate training and test data
  train <- airquality[fold_ids != i, ]
  test <- airquality[fold_ids == i, ]
  test <- test[!is.na(test$Ozone), ]
  test <- test[!is.na(test$Solar.R), ]

  ## Fit and evaluate listwise deletion
  premod <- pre(Wind ~ ., data = train)
  preds$LWD[row:(row+nrow(test)-1)] <- predict(premod, newdata = test)
  tmp <- print(premod)
  nterms[i, "LWD"] <- nrow(tmp) - 1
  
  ## Fit and evaluate single imputation
  imp0 <- train
  imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE)
  imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE)
  premod.imp0 <- pre(Wind ~., data = imp0)
  imp1 <- test
  imp1$Solar.R[is.na(imp1$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE)
  imp1$Ozone[is.na(imp1$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE)
  preds$SI[row:(row+nrow(test)-1)] <- predict(premod.imp0, newdata = imp1)  
  tmp <- print(premod.imp0)
  nterms[i, "SI"] <- nrow(tmp) - 1
  
  ## Perform multiple imputation
  imp <- mice(train, m = 5)
  imp1 <- complete(imp, action = "all", include = FALSE)
  airq.agg <- pre.agg(imp1, formula = Wind ~ .)
  preds$MI[row:(row+nrow(test)-1)] <- predict.agg(airq.agg, newdata = test)
  nterms[i, "MI"] <- nrow(coef.agg(airq.agg)) - 1

}
```


```{r, echo=FALSE, eval=FALSE}
save(preds, nterms, file = "Missing_data_results.Rda")
```

```{r, echo=FALSE}
load("Missing_data_results.Rda")
```

```{r, fig.width=5, fig.height=5}
sapply(preds, function(x) mean((preds$observed - x)^2)) ## MSE
sapply(preds, function(x) sd((preds$observed - x)^2)/sqrt(nrow(preds))) ## SE of MSE
var(preds$observed) ## benchmark: Predict mean for all
```

Interestingly, we see that all three methods yield similar predictions and accuracy, and explain about 20% of variance in the response. Multiple imputation performed best, followed by listwise deletion, followed by single imputation. Taking into account the standard errors, however, these differences are not significant. Also, this simple evaluation on only a single dataset should not be taken too seriously. The better performance of multiple imputation does come at the cost of increased complexity: 

```{r, fig.width=4, fig.height=4}
boxplot(nterms, main = "Number of selected terms \nper missing-data method",
        cex.main = .8)
```

In line with findings of [@josse2019consistency], we expect MIA to work better for rules than mean imputation. In future versions of the package **`pre`**, we plan to implement MIA (for the rules) and combine it with mean imputation (for the linear terms).

## Session info

In case you obtained different results, these results were obtained using the following:

```{r}
sessionInfo()
```

## References

```{r, echo = FALSE}
# Austin et al. (2019) refer to Wood et al. (2008) for stacking with multiply imputed data:
# 
# "Stacked Imputed Datasets With Weighted Regressions (W1, W2, and W3)
# 
# This approach entails stacking the M imputed datasets into 1 large dataset and then conducting variable selection in this single stacked dataset. To account for the multiple observations for each subject, weights are incorporated into the regression model when conducting variable selection. Wood et al proposed 3 different sets of weights that could be used: W1: w=1/M, in which each subject is weighted using the reciprocal of the number of imputed datasets; W2: w=(1−f)/M, where f denotes the proportion of missing data across all variables; W3: wj=(1−fj)/M, where fj denotes the proportion of missing data for variable Xj. Using the third approach, a different set of weights is used when assessing the statistical significance of a given candidate predictor variable."
# 
# Austin, P. C., Lee, D. S., Ko, D. T., & White, I. R. (2019). Effect of variable selection strategy on the performance of prognostic models when using multiple imputation. Circulation: Cardiovascular Quality and Outcomes, 12(11), e005927.
# 
# Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed with multiply imputed data?. Statistics in medicine, 27(17), 3227-3246.

## Does glmnet allow for such an approach? Seems not, as weights are scaled automatically, see also https://stats.stackexchange.com/a/196615/173546
```