pandas vs. tidyverse

In base R, matrices and dataframes have row name indexes, which in my opinion are a bit annoying because they add another layer of complexity to your data transformations. You naturally have to keep your column names in order, and row names sort of make you care about maintaining sensible row labels as well. So in the tidyverse, row names have been declared a deprecated feature.

In pandas, on the other hand, indexing is taken to another level: you can have multiple layers of row and column indexes. As far as I can tell, most data reshaping methods such as melt(), stack(), unstack() and pivot() expect that you assign a unique index to each row.

The advantage of multi-indexing is apparently that it is easier to dice your data when you have more than 2 dimensions in your dataframe. However, I have been using R for almost two years now and never needed more than 2 dimensions in my dataframes.

Talking to data scientists who are well experienced with pandas, I learned that hardly anyone seems to be using multi-indexes.

The tidyverse is all about functional programming; pandas not so much. Nevertheless, there are some possibilities, which I will get into in this post.

pandas documentation

There is a comparison of the pandas and the R syntax in the pandas documentation.

Sample Data

When looking at pandas example code, the example usually starts by generating a dataframe with random values sampled from a normal distribution. Coming from R, I am used to working with actual sample data like the iris data set, which is part of base R. We can use packages such as pydataset or seaborn to get similar sample data sets in Python; here we load iris via seaborn.

import seaborn as sns
import pandas as pd
import numpy as np

df = sns.load_dataset('iris')

print( df.head() )
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Indexing

We have a numerical row index and the categorical variable species with three unique values. In R we would convert this to a factor variable; pandas has something similar (the Categorical dtype), but it does not seem to be very pythonic to me.

df.index
RangeIndex(start=0, stop=150, step=1)
df.species.unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
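
For reference, the pandas counterpart of an R factor is the Categorical dtype. A minimal sketch (leaving df itself unchanged):

species_cat = df['species'].astype('category')

species_cat.dtype
CategoricalDtype(categories=['setosa', 'versicolor', 'virginica'], ordered=False)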

Let’s move species to the index:

df.set_index('species'
             , inplace = True ## modifies existing object
             , append = True ## keeps existing index
             )

df.head()
           sepal_length  sepal_width  petal_length  petal_width
  species
0 setosa            5.1          3.5           1.4          0.2
1 setosa            4.9          3.0           1.4          0.2
2 setosa            4.7          3.2           1.3          0.2
3 setosa            4.6          3.1           1.5          0.2
4 setosa            5.0          3.6           1.4          0.2
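
As a quick aside, this is where the multi-index pays off: with species in the index we can dice the data by index level using xs().

df.xs( 'setosa', level = 'species' ).head()
   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2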

Let’s move species back to the columns:

df.reset_index( inplace = True
               , level = 'species' ## only reset Species
              )

df.head()
  species  sepal_length  sepal_width  petal_length  petal_width
0  setosa           5.1          3.5           1.4          0.2
1  setosa           4.9          3.0           1.4          0.2
2  setosa           4.7          3.2           1.3          0.2
3  setosa           4.6          3.1           1.5          0.2
4  setosa           5.0          3.6           1.4          0.2

Reshape Data

stack() and unstack()

Once we set the index, we can use stack() to bring the dataframe from wide to long format and unstack() to bring it back.

df.set_index('species'
             , inplace = True ## modifies existing object
             , append = True ## keeps existing index
             )

df_long = df.stack()

df_long.head()
   species              
0  setosa   sepal_length    5.1
            sepal_width     3.5
            petal_length    1.4
            petal_width     0.2
1  setosa   sepal_length    4.9
dtype: float64
df_wide = df_long.unstack()

df_wide.head()
           sepal_length  sepal_width  petal_length  petal_width
  species
0 setosa            5.1          3.5           1.4          0.2
1 setosa            4.9          3.0           1.4          0.2
2 setosa            4.7          3.2           1.3          0.2
3 setosa            4.6          3.1           1.5          0.2
4 setosa            5.0          3.6           1.4          0.2

melt() and pivot_table()

We can use melt() to select which columns we want to move into the long format. However, in the case of our dataset without an ID column, we lose the information about which value belongs to which row or individual plant, so we add one below.

df.reset_index( inplace = True
              , level = 'species') 

df_melt = df.melt( id_vars = 'species' )

df_melt.head()
  species      variable  value
0  setosa  sepal_length    5.1
1  setosa  sepal_length    4.9
2  setosa  sepal_length    4.7
3  setosa  sepal_length    4.6
4  setosa  sepal_length    5.0
df['ID'] = list( range( len(df) ) )

df.head()
  species  sepal_length  sepal_width  petal_length  petal_width  ID
0  setosa           5.1          3.5           1.4          0.2   0
1  setosa           4.9          3.0           1.4          0.2   1
2  setosa           4.7          3.2           1.3          0.2   2
3  setosa           4.6          3.1           1.5          0.2   3
4  setosa           5.0          3.6           1.4          0.2   4
df_melt = df.melt( id_vars = ['ID', 'species'] )

df_melt.head()
   ID species      variable  value
0   0  setosa  sepal_length    5.1
1   1  setosa  sepal_length    4.9
2   2  setosa  sepal_length    4.7
3   3  setosa  sepal_length    4.6
4   4  setosa  sepal_length    5.0
df_pivot_table = df_melt.pivot_table(columns='variable'
                                        , values ='value'
                                        , index = [ 'ID', 'species']
                                        )

df_pivot_table.head()
variable    petal_length  petal_width  sepal_length  sepal_width
ID species
0  setosa            1.4          0.2           5.1          3.5
1  setosa            1.4          0.2           4.9          3.0
2  setosa            1.3          0.2           4.7          3.2
3  setosa            1.5          0.2           4.6          3.1
4  setosa            1.4          0.2           5.0          3.6
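
For comparison, plain pivot() performs the same reshape whenever each index/column pair is unique, so no aggregation is needed. A minimal sketch; note that keying on ID alone means the species column is dropped:

df_pivot = df_melt.pivot( index = 'ID'
                         , columns = 'variable'
                         , values = 'value'
                         )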

Functional programming with pandas

Functional programming in R, for me, has two main components: the pipe operator %>%, which lets you string a number of functions together without having to come up with names for all the intermediate steps, and the purrr package, which iterates over columns in a dataframe, using them as arguments for a given function whose results are strung together to form a new column in the same dataframe.

Functional programming in Python is a bit frowned upon, and in general it seems to be preferred to build new columns using for loops. Nevertheless, functions like map, filter and apply exist, and I have no problem using them to a similar extent as I would in R.
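
For instance, a minimal sketch of the built-ins applied to a pandas Series (which they treat like any other iterable):

list( map( lambda x: x * 2, df['sepal_length'].head(3) ) )
[10.2, 9.8, 9.4]

list( filter( lambda x: x > 5, df['sepal_length'].head(3) ) )
[5.1]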

dot . and line breaks \ can be used similarly to the R pipe operator %>%

There are some implementations of pipes in Python, but to me they are really ugly because they require you to put the whole pipe sequence in parentheses, which kind of breaks the usual Python coding style of avoiding matching opening and closing brackets that span multiple lines. However, using the . operator and the line-continuation character \ we can do something very similar to pipes in R.

Let’s string together some commands we find in the R vs. dplyr comparison section of the pandas documentation.

df_trans = df.loc[ : , ['species', 'sepal_length', 'sepal_width']] \
    .query('sepal_length >= 4.7') \
    .groupby('species') \
    .mean() \
    .rename( axis = 'columns'
           , mapper = lambda x: 'mean_' + x ) \
    .assign( sepal_ratio_of_means = lambda x: x['mean_sepal_length'] / x['mean_sepal_width'] )
    
df_trans
            mean_sepal_length  mean_sepal_width  sepal_ratio_of_means
species
setosa               5.119512          3.504878              1.460682
versicolor           5.936000          2.770000              2.142960
virginica            6.588000          2.974000              2.215198

I think this reads really beautifully, much better than the awkward pipe operator in R. There are some drawbacks to this approach though:

  • there is no '.' placeholder that lets you reference the piped object from within the pipe (though .pipe(), sketched below, mitigates this)
  • we have to use the full df['column'] reference to call a column vector and cannot use the unquoted column name as in dplyr
  • if we want to reference a column that has only been created inside the pipe, we need to use a lambda function to reference it, or start a new pipe
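
A partial workaround for the first point is the built-in .pipe() method, which hands the piped dataframe to an arbitrary function. A minimal sketch:

df_ratio = df.loc[ : , ['species', 'sepal_length', 'sepal_width']] \
    .pipe( lambda x: x.assign( sepal_ratio = x['sepal_length'] / x['sepal_width'] ) )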

Using multiple columns to calculate a new one

Using vectorized calculations

df_trans = df \
    .assign( petal_ratio = df['petal_width'] / df['petal_length']
           , sepal_ratio = df['sepal_width'] / df['petal_length']
           ) \
    .assign( overall_ratio = lambda x: x['petal_ratio'] / x['sepal_ratio'] )
    
    
df_trans.head()
  species  sepal_length  sepal_width  petal_length  petal_width  ID  petal_ratio  sepal_ratio  overall_ratio
0  setosa           5.1          3.5           1.4          0.2   0     0.142857     2.500000       0.057143
1  setosa           4.9          3.0           1.4          0.2   1     0.142857     2.142857       0.066667
2  setosa           4.7          3.2           1.3          0.2   2     0.153846     2.461538       0.062500
3  setosa           4.6          3.1           1.5          0.2   3     0.133333     2.066667       0.064516
4  setosa           5.0          3.6           1.4          0.2   4     0.142857     2.571429       0.055556

Using non-vectorized calculations

These would be cases in which we would use either map() or pmap() from the purrr package.

single column

We select the Series we want to transform and use the apply() method.

def range_sepal_length(x):
    if x > 5:
        return 'high'
    elif x > 4.6:
        return 'medium'
    else:
        return 'low'

df_trans = df \
    .assign( sepal_length_ranges = df['sepal_length'].apply(range_sepal_length) )
    
df_trans.head()
  species  sepal_length  sepal_width  petal_length  petal_width  ID sepal_length_ranges
0  setosa           5.1          3.5           1.4          0.2   0                high
1  setosa           4.9          3.0           1.4          0.2   1              medium
2  setosa           4.7          3.2           1.3          0.2   2              medium
3  setosa           4.6          3.1           1.5          0.2   3                 low
4  setosa           5.0          3.6           1.4          0.2   4              medium

multiple columns

We select from the dataframe only the columns that we want to use as arguments in our function call. We use apply() again, but this time we set axis = 1 (rows), which means that apply() passes all values of a row as arguments to the function call.

df_trans = df \
    .assign( smallest_value = df.loc[:,['sepal_length','sepal_width','petal_length','petal_width']]\
                               .apply( min, axis = 1) )
    
df_trans.head()
  species  sepal_length  sepal_width  petal_length  petal_width  ID  smallest_value
0  setosa           5.1          3.5           1.4          0.2   0             0.2
1  setosa           4.9          3.0           1.4          0.2   1             0.2
2  setosa           4.7          3.2           1.3          0.2   2             0.2
3  setosa           4.6          3.1           1.5          0.2   3             0.2
4  setosa           5.0          3.6           1.4          0.2   4             0.2

Aggregating, Grouping, Summarizing

One of the most common things to do when working with tables is grouping and summarizing data. In dplyr, and also in SQL, we define our set of grouping columns and then define a set of new columns that contain aggregates of some of the old columns. There are two ways to do this in pandas:

  • use a single window function on all non-grouped dataframe columns (see example above)
  • use a set of different window functions on a limited set of columns (one window function per column)

One window function per column

We can either change existing columns or create new ones, which we first have to add via assign(). We can use whatever window function is implemented in pandas, passed as a string in a dictionary, or we can write lambda functions.

df_gr = df.loc[ : , ['species', 'sepal_length', 'sepal_width']]  \
    .assign( set_of_lengths = df['sepal_length'] ) \
    .groupby('species') \
    .aggregate( dict( sepal_length = 'mean'
                     , sepal_width = 'median'
                     , set_of_lengths = lambda x: set( round(x, 0) ) ) )
    
df_gr
            sepal_length  sepal_width        set_of_lengths
species
setosa             5.006          3.4       {4.0, 5.0, 6.0}
versicolor         5.936          2.8       {5.0, 6.0, 7.0}
virginica          6.588          3.0  {8.0, 5.0, 6.0, 7.0}

Note that the grouping column is automatically moved to the index; we can revert this as follows:

df_gr.reset_index()
      species  sepal_length  sepal_width        set_of_lengths
0      setosa         5.006          3.4       {4.0, 5.0, 6.0}
1  versicolor         5.936          2.8       {5.0, 6.0, 7.0}
2   virginica         6.588          3.0  {8.0, 5.0, 6.0, 7.0}

Other Data Transformation

There is a comparison of the pandas and the R syntax in the pandas documentation.

query()

query() can be used like filter() in R. Alternatively, we could use boolean indexing, but that gets a bit cumbersome if you want to filter on more than one column.
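
For comparison, here is the same filter as the query() call below, written with boolean indexing:

df[ (df['petal_width'] > 0.2) & (df['sepal_length'] < 5) ]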

However, the string expression does not tolerate '.' in column names, so we replace any dots with underscores first (the seaborn version of iris already uses underscores, so this is a no-op here).

better_columns = df.columns.str.replace( r'\.', '_', regex = True )

better_columns

df.columns = better_columns

df.head()

df.query('petal_width > 0.2 \
          & sepal_length < 5')
        species  sepal_length  sepal_width  petal_length  petal_width   ID
6        setosa           4.6          3.4           1.4          0.3    6
41       setosa           4.5          2.3           1.3          0.3   41
45       setosa           4.8          3.0           1.4          0.3   45
57   versicolor           4.9          2.4           3.3          1.0   57
106   virginica           4.9          2.5           4.5          1.7  106

transform()

transform() can be used like the group_by() + mutate() combination in R:

df = df.assign( max_petal_len_species = lambda x: x.groupby('species').transform('max')['petal_length'] )
df.head()
  species  sepal_length  sepal_width  petal_length  petal_width  ID  max_petal_len_species
0  setosa           5.1          3.5           1.4          0.2   0                    1.9
1  setosa           4.9          3.0           1.4          0.2   1                    1.9
2  setosa           4.7          3.2           1.3          0.2   2                    1.9
3  setosa           4.6          3.1           1.5          0.2   3                    1.9
4  setosa           5.0          3.6           1.4          0.2   4                    1.9
df.tail()
       species  sepal_length  sepal_width  petal_length  petal_width   ID  max_petal_len_species
145  virginica           6.7          3.0           5.2          2.3  145                    6.9
146  virginica           6.3          2.5           5.0          1.9  146                    6.9
147  virginica           6.5          3.0           5.2          2.0  147                    6.9
148  virginica           6.2          3.4           5.4          2.3  148                    6.9
149  virginica           5.9          3.0           5.1          1.8  149                    6.9
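
As a side note, we can also select the column before calling transform(), which computes the maximum for petal_length only and gives the same result:

df['max_petal_len_species'] = df.groupby('species')['petal_length'].transform('max')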

reindex()

reindex() can be used to expose implicitly missing values, mimicking the complete() function from the tidyr package. We can use it to write a Python version of complete():

def complete(df, cols):

    assert all( [ col in df.columns for col in cols] )

    ## move the columns we want to complete into the index
    df = df.set_index( cols, append = False )

    df_reset = df.reset_index()

    ## build the cross product of all unique values of the index
    ## columns via successive merges on a constant key
    for i, name in enumerate(df.index.names):

        if i == 0:
            df_cross = pd.DataFrame( { name : df_reset[name].unique() } ) \
                .assign( key = 0 )

        else:
            df_cross_name = pd.DataFrame( { name : df_reset[name].unique() } ) \
                .assign( key = 0 )

            df_cross = df_cross.merge(df_cross_name, on = 'key')

    df_cross = df_cross.drop('key', axis = 1) \
        .set_index( df.index.names )

    ## reindexing against the full cross product exposes the
    ## implicitly missing rows as NaN
    df = df.reindex( df_cross.index )

    return df

df = pd.DataFrame( dict( numbers = [1,1,2,2,3,3,4,5]
                        , letters = ['A','B','A','B','A','B','A','B']) ) \
    .assign( value = np.random.randn(8) )

df
   numbers letters     value
0        1       A -0.866225
1        1       B  0.086730
2        2       A -1.385008
3        2       B  1.729923
4        3       A  2.299604
5        3       B  0.271672
6        4       A  0.608976
7        5       B -0.055641
complete(df, ['numbers', 'letters'])
                    value
numbers letters
1       A       -0.866225
        B        0.086730
2       A       -1.385008
        B        1.729923
3       A        2.299604
        B        0.271672
4       A        0.608976
        B             NaN
5       A             NaN
        B       -0.055641