In this tutorial I want to show how you can use alluvial plots to
visualise model response in up to 4 dimensions. `easyalluvial`

generates
artificial data space using fixed values for unplotted variables or uses
the partial dependence plotting method. It is model agnostic but offers
some convenient wrappers for `caret`

models.

# Introduction

## Taking a peek

When building machine learning model we are usually faced with a trade-off between accurracy and interpretability. However even if we tend to lean towards accuracy and pick a modelling method that results in nearly uninterpretable models we can still make use of a bunch of model agnostic techniques that have been summarized in this excellent ebook Interpretable Machine Learning by Christoph Molnar.

Whithout getting to theoretical I personally always feel the urge to
simply take a peek simulate some data and see how the model reacts a
method described in Wickham H, Cook D, Hofmann H (2015) Visualizing
statistical models: Removing the blindfold. Statistical Analysis and
Data Mining 8(4)
<doi:10.1002/sam.11271>.
In order to simulate data we can generate a a vector with a sequence of
values over the entire range of a predictor variable of interest while
setting all the others to their median or mode and use this artificial
data space to obtain model predictions which we can plot against our
variable of interest. An R package that will do this for you is
(`plotmo`

)[https://cran.r-project.org/web/packages/plotmo/index.html].
Instead of ranging over 1 predictor variable we can create a data grid
using 2 predictor variables and plot the response as a third dimension.
However this is as far as you can go in a conventional plots. Alluvial
plots can line up much more than 3 dimensions on a plane next to each
other only limited by the number of flows as it will get too cluttered
when there are too many of them.

## Which variables to plot

When using conventional model response plotting beeing limited two 2 variables we can simply resolve this by generating many plots and look at them one by one. Alluvial plots require a bit more attention and cannot easily be screened and compared since visually there is so much going on. Therefore I do not recommend to brute force it by simply creating a lot of random combinations of predictor variables and multiple alluvial plots but instead to focus on those that have the highest calculated feature importance. Feature importance values are natively provided by most modelling packages. So the question is how many can we plot and it turns out 4 features will usually result in well interpretable plot.

## Generating the data space

```
suppressPackageStartupMessages( require(tidyverse) )
suppressPackageStartupMessages( require(easyalluvial) )
suppressPackageStartupMessages( require(mlbench) )
suppressPackageStartupMessages( require(randomForest) )
```

We start by creating a model

```
data('BostonHousing')
df = as_tibble( BostonHousing )
m = randomForest( lstat ~ ., df )
```

and looking at the importance features

```
imp = m$importance %>%
tidy_imp(df)
knitr::kable(imp)
```

vars | imp |
---|---|

medv | 8352.4165 |

rm | 3370.7073 |

indus | 2765.8359 |

age | 2716.4687 |

crim | 2134.6400 |

dis | 1819.3780 |

nox | 1696.9545 |

tax | 724.1518 |

b | 682.6062 |

ptratio | 340.1351 |

rad | 222.3688 |

chas | 131.2264 |

zn | 101.5024 |

When generating the data space we cannot screent and infinite amount of
values per variable. We want to create all possible combinations between
the values of the 4 variables we want to plot and an alluvial plot we
cannot distinguish more than 1000 flows I recommend to go with 5 values
which will result in 5 x 5 X 5 X 5 –> 625 combinations. That also
leaves some wiggeling room if one of the top 4 variables is a factor
with more than 5 levels. `get_data_space()`

will split the range of a
variable into 3 and picks the median of each split and add the variable
minimum and the maximum to the set.

```
dspace = get_data_space(df, imp
, degree = 4 # specifies the number of variables
, bins = 5 # the number of values per variable
)
knitr::kable( head(dspace, 10) )
```

medv | rm | indus | age | crim | dis | nox | tax | b | ptratio | rad | chas | zn |
---|---|---|---|---|---|---|---|---|---|---|---|---|

5 | 3.561 | 0.46 | 2.90 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 0.46 | 32.20 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 0.46 | 67.95 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 0.46 | 94.60 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 0.46 | 100.00 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 4.27 | 2.90 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 4.27 | 32.20 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 4.27 | 67.95 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 4.27 | 94.60 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

5 | 3.561 | 4.27 | 100.00 | 0.25651 | 3.20745 | 0.538 | 330 | 391.44 | 19.05 | 5 | 0 | 0 |

Total rows in dspace: 625

```
dspace %>%
summarise_all( ~ length( unique(.) ) ) %>%
knitr::kable()
```

medv | rm | indus | age | crim | dis | nox | tax | b | ptratio | rad | chas | zn |
---|---|---|---|---|---|---|---|---|---|---|---|---|

5 | 5 | 5 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

## Generating model response

```
pred = predict(m, newdata = dspace)
```

## Plotting

```
alluvial_model_response(pred, dspace, imp, degree = 4, bins = 5)
```

`easyalluvial`

allows you to build exploratory alluvial plots (sankey
diagrams) with a single line of code while automatically binning
numerical variables. In version `0.2.0`

marginal histograms improve the
visibility of the numerical variables. Further a method has been added
that creates model agnostic 4 dimensional partial dependence plots to
visualise model response.

# Introduction

I am happy to announce the release of `easyalluvial 0.2.0`

with some
exciting new features and some mayor changes compared to version `0.1.8`

Some improvements were made on the default plotting options which alter
the default plot output thus classify as mayor changes (see below).

```
suppressPackageStartupMessages( require(tidyverse) )
suppressPackageStartupMessages( require(easyalluvial) )
```

# Marginal Histograms

The automated binning of numerical variables hides their distribution. This new feature alleviates that.

```
p = alluvial_wide(mtcars2, max_variables = 9)
p_grid = add_marginal_histograms(p, mtcars2)
```

{{< wide-image src=“/images/marg_hits.png” title=”]” >}}

# 4 Dimensional Partial Dependence Plots

Techniques for explaining the predictions of machine learning models
have been discussed a lot lately. For an introduction into the topic I
can recommend this excellent ebook Interpretable Machine Learning by
Christoph Molnar.
Inspired by packages like `plotmo`

and `pdp`

which can make partial
dependence plots (PDP) of 2 feature variables against the response
variable I tested whether alluvial plots, which can basically line up an
unlimited amount of features on a 2 dimensional plane, can be used to
make PDPs with more than 2 feature variables. And basically it turns out
that you can use 4 features before things get to cluttered. And this is
how it looks like:

```
df = select(mtcars2, -ids)
train = caret::train( disp ~ .
, df
, method = 'rf'
, trControl = caret::trainControl( method = 'none' )
, importance = TRUE )
p = alluvial_model_response_caret(train, degree = 4, method = 'pdp')
p_grid = add_marginal_histograms(p, df, plot = F) %>%
add_imp_plot(p, df)
```

We see that the top 4 important features of the model have been selected
and that 5 values have been picked over the range of the numerical
variables which together with the levels of the categorical variable
have been used to construct a data grid of all possible combinations.
This data grid has been combined with the values of the remaining
features of each observation in the training data which then has been
used to generate model predictions which were consecutively averaged
(see the ebook for a better explanation of
PDPs).

{{< wide-image src=“/images/pdp.png” title=”]” >}}

#### Step-by-Step

- On the left we see the averaged model predictions that are generated by a specific combination of the 4 most important variables. You can find the combinations by tracing the coloured flows.
- The stratum label of the individual feature variables indicate the variable value and which fraction of the colored flows pass through it.
- On the right you see the feature importance of all variables and the proportion contributed by the plotted variables on the alluvial plot.
- On the top left you see how the distribution of the generated predictions compare to the distribution of the predicted variable (in this case disp) in the training data.
- The marginal histograms indicate the original distributions in the raining data and the lines indicate the location of the values picked for the data grid.

A more in-depth tutorial for this feature can be found on the project’s github page which will also be vailable on this blog in a few days.

If you are as enthusiastic about alluvial plots as me you will appreciate this plot because it can help you to get an intuitive mid-level understanding of how you’re model is making the predictions. Just be aware of a few limitations.

#### Limitations

- There is a loss of information when binning the numerical variables
- The combinations generated when making the grid might be outside the feature distribution space (generate combinations that are impossible)
- We only look at the combination of 4 features and disregard the others

To alleviate this you can reduce the complexity of the model by reducing features (take out correlating variables) or use additional model exploration methods such as classical PDPs, ALE plots, Shapely values, etc, …

#### We do not have to use `caret`

*Note: importance is calculated differently when using this
implementation of random forest.*

```
df = select(mtcars2, -ids)
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance
dspace = get_data_space(df, imp, degree = 4)
pred = get_pdp_predictions(df, imp
, .f_predict = randomForest:::predict.randomForest
, m
, degree = 4
, bins = 5)
p = alluvial_model_response(pred, dspace, imp, degree = 4, method = 'pdp')
p_grid = add_marginal_histograms(p, df, plot = F) %>%
add_imp_plot(p, df)
```

# Changes in Default Plotting Settings

Default colors have been changed. The first 7 colors of

`palette_qualitative()`

the function that provides the default colors have hand-picked for better contrast.The stratum fill color of the variable determining the flow fill color in

`alluvial_wide()`

hast been set to match the flow fill color.label text size can now be modified via

`stratum_label_size`

parameter. Labels have gotten slightly bigger by default.

# More changes

… NEWS.md