--- title: "`mice`: The imputation and nonresponse models" author: "Gerko Vink and Stef van Buuren" date: "Multiple Imputation in Practice" output: html_document: toc: true toc_depth: 5 toc_float: true number_sections: false --- --- This is the third exercise in the series. In this exercise we will focus on analyzing the relation between the data and the missingness. For non-`R` users: In `R` one can simply call the help function for a any specific function `func` by typing `help(func)`. E.g. `help(mice)` directs you to the help page of the `mice` function. All the best, [Gerko](https://www.gerkovink.com) and [Stef](http://www.stefvanbuuren.name) --- **1. Load `R` and load the packages `mice` and `dplyr`. Set the seed to `123`.** ```{r warning=FALSE, message=FALSE} library(mice) # Data imputation library(dplyr) # Data manipulation library(lattice) # Plotting device set.seed(123) ``` We choose seed value `123`. This is an arbitrary value; any value would be an equally good seed value. Fixing the random seed enables you (and others) to exactly replicate anything that involves random number generators. If you set the seed in your `R` instance to `123`, you will get the exact same results and plots as we present in this document if you follow the order and code of the exercises precisely. --- # The `boys` data set --- **2. The `boys` dataset is part of `mice`. It is a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study. Inspect the help for `boys` dataset and make yourself familiar with its contents.** To learn more about the contents of the data, use one of the two following help commands: ```{r} help(boys) ?boys ``` --- **3. Get an overview of the data. Find information about the size of the data, the variables measured and the amount of missingness.** The first 10 cases are: ```{r} head(boys, n = 10) ``` The last 10 cases are: ```{r} tail(boys, n = 10) ``` We now have a clear indication that the data are sorted. A simple evaluation ```{r} !is.unsorted(boys$age) ``` confirms this - `!is.unsorted()` evaluates the complement of `is.unsorted()`, so it tests whether the data are sorted. There is no `is.sorted` function in `R`. The dimensions of the `boys` data set are: ```{r} dim(boys) ``` We see that the `boys` data set has 748 cases over 9 variables. From those 9 variables ```{r} summary(boys) ``` function `summary()` informs us that testicular volume `tv` has the most missings, followed by the genital and pubic hair stages `gen` and `phb`, each with 503 missing cells. --- ## Missing data pattern --- **4. As we have seen before, the function `md.pattern()` can be used to display all different missing data patterns. How many different missing data patterns are present in the boys dataframe and which pattern occurs most frequently in the data?** ```{r} md.pattern(boys) ``` There are 13 patterns in total, with the pattern where `gen`, `phb` and `tv` are missing occuring the most. --- **5. How many patterns occur for which the variable `gen` (genital Tannerstage) is missing?** ```{r} mpat <- md.pattern(boys, plot = FALSE) sum(mpat[, "gen"] == 0) ``` Answer: 8 patterns (503 cases) --- ## Missingness relations --- **6. Let us focus more precisely on the missing data patterns. Does the missing data of `gen` depend on `age`? One could for example check this by making a histogram of `age` separately for the cases with known genital stages and for cases with missing genital stages.** To create said histogram in `R`, a missingness indicator for `gen` has to be created. A missingness indicator is a dummy variable with value `1` for observed values (in this case genital status) and `0` for missing values. Create a missingness indicator for `gen` by typing ```{r} R <- is.na(boys$gen) head(R, n = 100) tail(R, n = 100) length(R) ``` As we can see, the missingness indicator tells us for each of the 748 values in `gen` whether it is missing (`TRUE`) or observed (`FALSE`). A histogram can be made with the function `histogram()`. ```{r} histogram(boys$gen) ``` or, equivalently, one could use ```{r} histogram(~ gen, data = boys) ``` Writing the latter line of code for plots is more efficient than selecting every part of the `boys` data with the `boys$...` command, especially if plots become more advanced. The code for a conditional histogram of `age` given `R` is ```{r} histogram(~ age | R, data=boys) ``` The histogram shows that the missingness in `gen` is not equally distributed across `age`; or, equivalently, `age` seems to be differently distributed for observed and missing `gen`. --- ## Impute the set --- **7. Impute the `boys` dataset with mice using all default settings and name the `mids` (multiply imputed data set) object `imp`.** ```{r} imp <- mice(boys, print=FALSE) ``` --- ## Means comparison --- **8. Compare the means of the imputed data with the means of the incomplete data. One can use the function `complete()` with a `mids`-object as argument to obtain an imputed dataset. As default, the first imputed dataset will be given by this function.** ```{r} summary(boys) summary(complete(imp)) ``` Most means are roughly equal, except the mean of `tv`, which is much lower in the first imputed data set, when compared to the incomplete data. This makes sense because most genital measures are unobserved for the lower ages. When imputing these values, the means should decrease. Investigating univariate properties by using functions such as `summary()`, may not be ideal in the case of hundreds of variables. To extract just the information you need, for all imputed datasets, we can make use of the `with()` function. To obtain summaries for each imputed `tv` only, type ```{r} imp %>% with(summary(tv)) %>% summary() ``` And to obtain e.g. the means alone, run ```{r} imp %>% with(mean(tv)) %>% summary() ``` --- # The importance of the imputation model The `mammalsleep` dataset is part of `mice`. It contains the Allison and Cicchetti (1976) data for mammalian species. To learn more about this data, type ```{r} help(mammalsleep) ``` --- **9. Get an overview of the data.** Find information about the size of the data, the variables measured and the amount of missingness. ```{r} head(mammalsleep) summary(mammalsleep) str(mammalsleep) ``` As we have seen before, the function `md.pattern()` can be used to display all different missing data patterns. How many different missing data patterns are present in the `mammalsleep` dataframe and which pattern occurs most frequently in the data? ```{r} md.pattern(mammalsleep) ``` Answer: 8 patterns in total, with the pattern where everything is observed occuring the most (42 times). --- **10. Generate five imputed datasets with the default method `pmm`. Give the algorithm 10 iterations. ** ```{r} imp1 <- mice(mammalsleep, maxit = 10, print=F) ``` We ignore the `loggedEvents` for now: we'll consider that in a later exercise. To inspect the trace lines for assessing algorithmic convergence: ```{r} plot(imp1) ``` --- **11. Perform a regression analysis on the imputed dataset with `sws` as dependent variable and `log10(bw)` and `odi` as independent variables.** ```{r} fit1 <- with(imp1, lm(sws ~ log10(bw) + odi)) ``` --- **12. Pool the regression analysis and inspect the pooled analysis.** ```{r} est1 <- pool(fit1) est1 summary(est1) ``` The `fmi` and `lambda` are much too high. This is due to `species` being included in the imputation model. Because there are 62 species and mice automatically converts factors (categorical variables) to dummy variables, each species is modeled by its own imputation model. --- **13. Impute `mammalsleep` again, but now exclude `species` from the data.** ```{r, cache=FALSE} imp2 <- mice(mammalsleep[ , -1], maxit = 10, print = F) ``` --- **14. Compute and pool the regression analysis again. ** ```{r} fit2 <- with(imp2, lm(sws ~ log10(bw) + odi)) est2 <- pool(fit2) est2 summary(est2) ``` Note that the `fmi` and `lambda` have dramatically decreased. The imputation model has been greatly improved. --- **15. Plot the trace lines for the new imputations** ```{r} plot(imp2) ``` Even though the fraction of information missing due to nonresponse (fmi) and the relative increase in variance due to nonresponse (lambda) are nice and low, the convergence turns out to be a real problem. The reason is the structure in the data. Total sleep (`ts`) is the sum of paradoxical sleep (`ps`) and short wave sleep (`sws`). This relation is ignored in the imputations, but it is necessary to take this relation into account. `mice` offers a routine called *passive imputation*, which allows users to take transformations, combinations and recoded variables into account when imputing their data. We explain passive imputation in detail in the next exercise. --- # Conclusion We have seen that the practical execution of multiple imputation and pooling is straightforward with the `R` package `mice`. The package is designed to allow you to assess and control the imputations themselves, the convergence of the algorithm and the distributions and multivariate relations of the observed and imputed data. It is important to 'gain' this control as a user. After all, we are imputing values and we aim to properly adress the uncertainty about the missingness problem. --- **- End of exercise** ---