In this tutorial I want to show how you can use alluvial plots to visualise model responses in up to 4 dimensions. easyalluvial generates an artificial data space, using either fixed values for the unplotted variables or the partial dependence plotting method. It is model-agnostic but offers some convenient wrappers for caret models.
When building machine learning models we are usually faced with a trade-off between accuracy and interpretability.
easyalluvial allows you to build exploratory alluvial plots (Sankey diagrams) with a single line of code while automatically binning numerical variables. In version 0.2.0 marginal histograms improve the visibility of those numerical variables. Further, a method has been added that creates model-agnostic 4-dimensional partial dependence alluvial plots to visualise the response of statistical models.
I am happy to announce the release of easyalluvial 0.2.0, with some exciting new features and some minor changes compared to version 0.
A post on the tidymodels packages, covering their CRAN availability and scope: unified modelling syntax; statistical tests and model selection; resampling, feature engineering and performance metrics. The modelling walkthrough covers the data, the response variable lstat and its correlations with categorical variables, preprocessing with recipes, resampling with rsample, modelling with a caret wrapper, and assessing performance with yardstick (best-performing model per method, cv-performance, 1SE stats and plots).
An overview of easyalluvial's features: installation, wide-format data with alluvial_wide(), long-format data with alluvial_long(), handling of missing data, colors, connecting flows to observations in the original data, and ggplot2 manipulations.
Alluvial plots are a form of Sankey diagram and a great tool for exploring categorical data. They group categorical data into flows that can easily be traced in the diagram.
There has been a lot of discussion about Jupyter notebooks in the online channels I follow, and the point of this post is to bring those threads together.
Coming from R and being a heavy user of Rmarkdown files, Jupyter notebooks felt familiar right away but also a bit awkward. DataCamp compared the two feature by feature in a blog post at the end of 2016. It is a bit outdated, but skimming through it, most of it still holds true.
1 of 3: conda introduction 2 of 3: conda command line 3 of 3: conda jupyter Here we want to show how we can use R and Python in the same Jupyter notebook.
We first need to create a conda environment and install R, Python and Jupyter; then we activate that environment and run the jupyter notebook command. When creating a new notebook, you will automatically use the active conda environment as a kernel.
After installing the Anaconda distribution you can run the Navigator app, which allows you to create environments and manage the installed packages. As always with these tools, some commands will only work in the command line.
A condensed version of the official conda command-line documentation.
conda cheat sheet
In this series of posts we want to show how we can use conda environments for polyglot data science projects that use both R and Python.
Polyglot environments are environments that use more than one programming language. In data science, R and Python are the most popular languages, and most projects decide on using one or the other.
1 of 7: IDE 2 of 7: pandas 3 of 7: matplotlib and seaborn 4 of 7: plotly 5 of 7: scikitlearn 6 of 7: advanced scikitlearn 7 of 7: automated machine learning Automated Machine Learning We have seen in the previous post on advanced scikitlearn methods that using pipes in scikitlearn allows us to write fairly generalizable code; however, we still need to customize our modelling pipeline to the algorithms we want to use.
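To make that point concrete, here is a minimal sketch (not taken from the post itself; the data and candidate models are made up for illustration) of how a scikit-learn Pipeline keeps the surrounding code generic while the final estimator step still has to be chosen and configured per algorithm:

```python
# Sketch: one generic pipeline skeleton, several candidate estimators
# that each need their own algorithm-specific configuration.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # The scaffolding is identical for every model;
    # only the final pipeline step changes.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    pipe.fit(X, y)
    scores[name] = pipe.score(X, y)
```

Automated machine learning tools essentially take over this loop: they search over candidate steps and their hyperparameters instead of us writing them out by hand.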
Advanced scikitlearn In the last post we saw some advantages of scikitlearn, most notably the seamless integration of parallel processing. I was struggling a bit with the fact that scikitlearn only accepts numpy arrays as input, and I was missing the recipes package, which makes initial data transformation in R so much easier.
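The closest scikit-learn analogue to recipes-style preprocessing is probably ColumnTransformer, which lets you declare per-column transformations on a pandas DataFrame and hands the model a numeric matrix. A rough sketch (my own illustration with invented column names, not code from the post):

```python
# Sketch: declarative per-column preprocessing, loosely analogous to
# what the recipes package does in R.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [40_000, 82_000, 55_000, 91_000],
    "city": ["berlin", "hamburg", "berlin", "munich"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),   # scale numeric columns
    ("cat", OneHotEncoder(), ["city"]),             # one-hot encode categoricals
])

X = preprocess.fit_transform(df)  # 2 scaled + 3 one-hot columns
```

Like a recipe, the transformer is fitted once and can then be applied to new data, and it slots directly into a Pipeline in front of any estimator.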