Moving from R to python - 3/7 - matplotlib and seaborn
- 1 of 7: IDE
- 2 of 7: pandas
- 3 of 7: matplotlib and seaborn
- 4 of 7: plotly
- 5 of 7: scikitlearn
- 6 of 7: advanced scikitlearn
- 7 of 7: automated machine learning
In R I am used to working with a combination of ggplot2 and plotly. It seems that in python you have matplotlib, which is fully integrated into pandas, and you have seaborn, which provides some pretty default settings for most of matplotlib's standard graph types.
The main difference between matplotlib and ggplot2 is that matplotlib is optimised for wide formatted data tables while ggplot2 is optimised for data in the long format. In matplotlib we would iterate over every column that we want to add to our plot, while in ggplot2 we would define x and y measurements and then select a grouping or facetting variable.
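To make the difference concrete, here is a minimal sketch using a small made-up table (the column names `sample`, `height` and `width` are placeholders, not part of the iris data). It plots the wide table by looping over its columns and then reshapes it to long format with `melt`, which is the shape ggplot2 expects.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# made-up wide table: one row per sample, one column per measurement
wide = pd.DataFrame({
    "sample": ["a", "b", "c"],
    "height": [1.0, 2.0, 3.0],
    "width": [3.0, 2.5, 4.0],
})

# matplotlib style: one plotting call per column
fig, ax = plt.subplots()
for col in ["height", "width"]:
    ax.plot(wide["sample"], wide[col], marker="o", label=col)
ax.legend()

# ggplot2 style input: one row per (sample, variable) pair
long = wide.melt(id_vars="sample", var_name="variable", value_name="value")
```

In the long table the former column names end up in `variable`, which can then serve as a grouping or facetting variable.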
seaborn is built on top of matplotlib. It provides some pretty decent defaults for matplotlib and has a stunning example gallery. seaborn supports both long and wide format as input.
    import pandas as pd
    from matplotlib import pyplot as plt
    %matplotlib inline
    import seaborn as sns

    df = sns.load_dataset('iris')
seaborn's lmplot is in fact a scatter plot function; we just have to turn off the regression fit.
    sns.lmplot(x = 'petal_length', y = 'petal_width', data = df
               , hue = 'species'
               , fit_reg = False)
matplotlib is a lot more complicated: we have to add each species manually to an axes subplot object. This is very inconvenient.
    # old school
    ax = df.loc[ df['species'] == 'setosa', : ] \
        .plot.scatter( 'petal_length', 'petal_width'
                      , label = 'setosa'
                      , color = 'blue' )

    # functional indexing
    ax = df.query('species == "versicolor"') \
        .plot.scatter( 'petal_length', 'petal_width'
                      , label = 'versicolor'
                      , color = 'orange'
                      , ax = ax )

    ax = df.query('species == "virginica"') \
        .plot.scatter( 'petal_length', 'petal_width'
                      , label = 'virginica'
                      , color = 'green'
                      , ax = ax )
From wide format
Here we cannot use hue to assign groups to colors.
From long format
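    df_melt = df.melt(value_vars = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
                      , id_vars = 'species')

    # x and y are passed by keyword; recent seaborn versions no longer accept them positionally
    sns.boxplot(x = 'variable', y = 'value', data = df_melt, hue = 'species')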
We can easily overlay plots as follows. The problem, however, is that the plot order is a bit messed up and there is no option to change the color of the box outlines to black. To fix this we would probably need to iterate over the box outlines and set their color attribute to 'black', which is a bit of a pain in the ass.
    sns.violinplot(x = 'variable', y = 'value', data = df_melt
                   , hue = 'species'
                   , inner = None ## removes inner boxes
                   , zorder = 1 )

    sns.boxplot(x = 'variable', y = 'value', data = df_melt
                , hue = 'species'
                , palette = ['#FFFFFF','#FFFFFF','#FFFFFF']
                , saturation = 1
                , zorder = -1 ## send boxplot to background
                )
Factor Plots (facetting)
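    # factorplot was renamed to catplot in seaborn 0.9
    g = sns.catplot(x = 'variable', y = 'value', data = df_melt
                    , hue = 'species'
                    , col = 'species'
                    , kind = 'box' )

    # the returned object is a FacetGrid, not a single axes
    g.set_xticklabels(rotation = -45)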
Customize Plots with matplotlib
All seaborn plots can be tweaked and edited using matplotlib. For example, we can add a title and limit the range of the x-axis.
    sns.lmplot(x = 'petal_length', y = 'petal_width', data = df
               , hue = 'species'
               , fit_reg = False)

    plt.xlim(0,5)
    plt.title('Look at my custom plot')
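`lmplot` returns a `FacetGrid`, so the same tweaks can also be applied through that object instead of the global `plt` state. A minimal sketch, assuming a single-facet grid and a small made-up table (`toy`) in place of iris:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import pandas as pd
import seaborn as sns

# made-up stand-in for the iris data
toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5],
    "petal_width": [0.2, 0.2, 1.4, 1.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

g = sns.lmplot(x="petal_length", y="petal_width", data=toy,
               hue="species", fit_reg=False)

g.set(xlim=(0, 5))                        # limit the x-axis on every facet
g.ax.set_title("Look at my custom plot")  # g.ax works because there is only one facet
```

Working on the returned object avoids relying on whichever figure happens to be the current one.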
We can also fix the overlay plot from before.
    # instantiate axis and figure
    fig, ax = plt.subplots()

    ax = sns.violinplot(x = 'variable', y = 'value', data = df_melt
                        , hue = 'species'
                        , inner = None ## removes inner boxes
                        , ax = ax )

    ax = sns.boxplot(x = 'variable', y = 'value', data = df_melt
                     , hue = 'species'
                     , palette = ['#FFFFFF','#FFFFFF','#FFFFFF']
                     , saturation = 1
                     , ax = ax )

    # the boxes are drawn onto the axis as artist objects
    # (newer matplotlib/seaborn versions may keep them in ax.patches instead)
    for artist in ax.artists:
        artist.set_edgecolor('black')
        artist.set_zorder(1)

    # the caps and whiskers as line objects
    for line in ax.lines:
        line.set_color('black')

    # get legend handles and labels before drawing legend
    # use only 3 of them for legend
    handles, labels = ax.get_legend_handles_labels()

    plt.legend(handles[0:3], labels[0:3]
               , bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Getting multiple plots as output from one code chunk in markdown is a bit tricky; in jupyter notebooks it is not.
    sns.violinplot(x = 'variable', y = 'value', data = df_melt
                   , hue = 'species'
                   , inner = None ## removes inner boxes
                   )

    plt.show()

    sns.boxplot(x = 'variable', y = 'value', data = df_melt
                , hue = 'species'
                , saturation = 1 )

    plt.show()
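Another way around this, sketched here under the assumption that both plots may share one figure, is to create two axes with `plt.subplots` and hand one to each seaborn call via `ax`. The long-format table `toy_long` is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up long-format table: one row per measured value
toy_long = pd.DataFrame({
    "variable": ["petal_length"] * 4 + ["petal_width"] * 4,
    "value": [1.4, 1.3, 4.7, 4.5, 0.2, 0.3, 1.4, 1.5],
})

# one figure with two axes next to each other
fig, (ax_left, ax_right) = plt.subplots(ncols=2, figsize=(10, 4))

sns.violinplot(x="variable", y="value", data=toy_long, inner=None, ax=ax_left)
sns.boxplot(x="variable", y="value", data=toy_long, ax=ax_right)

fig.tight_layout()
```

A single figure like this comes out as one image, so the code chunk only has one output to deal with.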
Personally I find pure matplotlib very cumbersome. seaborn provides some nice defaults and supports the long data format, but if you want to plot something a bit more complicated than its showcase examples you get stuck tweaking the plots in matplotlib. There is a python version of ggplot, which I hear is quite popular, and a newer package called altair, which is also meant to work on the long format. However, there does not seem to be anything in the python world that beats pure ggplot2. I would rather keep using the original. In a later post I will show you how you can mix R and python code in a single jupyter notebook and how to pass variables between the two environments.