Advanced scikit-learn

In the last post, we saw some of the advantages of scikit-learn, most notably the seamless integration of parallel processing. I was struggling a bit with the fact that scikit-learn only accepts numpy arrays as input, and I was missing the recipes package, which makes initial data transformation in R so much easier. Then I stumbled upon sklearn-pandas, which seamlessly integrates pandas with sklearn without having to worry about numpy arrays, and it supports a pipe-based workflow, a sklearn feature I have not started to explore yet.

Apart from sklearn-pandas there are a number of projects that use the syntax and structure of scikit-learn; a collection of them can be found at

- http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html

- http://scikit-learn.org/stable/related_projects.html

sklearn-pandas

The core of this package is the DataFrameMapper class, which maps scikit-learn transformer classes to specific columns of a dataframe and outputs either a numpy array or a dataframe.

Additionally it provides a CategoricalImputer, which accepts categorical data; in the last post I had to write one of those myself.

import seaborn as sns
import numpy as np
import pandas as pd
import sklearn

from sklearn_pandas import DataFrameMapper, CategoricalImputer, gen_features

df = sns.load_dataset('titanic')

X = df.copy().drop(['alive','survived'], axis = 'columns')
y = df.survived

# we need to set up transformations for numerical and categorical columns
col_categorical = list( X.select_dtypes(exclude=np.number) )
col_numerical   = list( X.select_dtypes(include=np.number) )

# we need to convert to a list of lists
col_categorical = [ [x] for x in col_categorical ]
col_numerical   = [ [x] for x in col_numerical ]

# we can pass the classes either as a simple list, or as a list of dicts if we need to pass
# arguments as well
classes_categorical = [ CategoricalImputer, sklearn.preprocessing.LabelBinarizer ]
classes_numerical = [ {'class':sklearn.preprocessing.Imputer, 'strategy' : 'median'}
                    , sklearn.preprocessing.StandardScaler
                    ]

# now that we have defined the columns and the classes of transformers we can use gen_features
# in order to generate a list of tuples suitable for DataFrameMapper

feature_def = gen_features(
    columns = col_categorical
    , classes = classes_categorical
)

feature_def_numerical = gen_features(
    columns = col_numerical
    , classes = classes_numerical
)

feature_def.extend(feature_def_numerical)

# when constructing the mapper we can specify whether we want a dataframe or a numpy array as output

mapper_df = DataFrameMapper( feature_def , df_out = True )

mapper_np = DataFrameMapper( feature_def , df_out = False )

mapped_df = mapper_df.fit_transform( df.copy() )

mapped_np = mapper_np.fit_transform( df.copy() )

print( mapped_np[1:10,1:20] )

mapped_df.head(10)
[[1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0.]]
sex embarked_C embarked_Q embarked_S class_First class_Second class_Third who_child who_man who_woman ... deck_G embark_town_Cherbourg embark_town_Queenstown embark_town_Southampton alone pclass age sibsp parch fare
0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.827377 -0.565736 0.432793 -0.473674 -0.502445
1 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 -1.566107 0.663861 0.432793 -0.473674 0.786845
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 1.0 1.0 0.827377 -0.258337 -0.474545 -0.473674 -0.488854
3 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 -1.566107 0.433312 0.432793 -0.473674 0.420730
4 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 0.827377 0.433312 -0.474545 -0.473674 -0.486337
5 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 1.0 0.827377 -0.104637 -0.474545 -0.473674 -0.478116
6 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 1.0 1.0 -1.566107 1.893459 -0.474545 -0.473674 0.395814
7 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.827377 -2.102733 2.247470 0.767630 -0.224083
8 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 0.827377 -0.181487 -0.474545 2.008933 -0.424256
9 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 -0.369365 -1.180535 0.432793 -0.473674 -0.042956

10 rows × 27 columns

The results look really good; it's almost as convenient as recipes. However, if we wanted to apply a Box-Cox transformation on top of it we would have to write our own scikit-learn-like transformer. Such a transformer is supposed to be added in a future version, so it may not be worth the effort at the moment.
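To illustrate what such a custom transformer could look like, here is a minimal sketch of my own (not part of sklearn-pandas) that wraps scipy.stats.boxcox in the usual fit/transform interface; it assumes strictly positive numeric input.

import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    # illustrative sketch; assumes all values are strictly positive

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # estimate one lambda per column on the training data
        self.lambdas_ = [ stats.boxcox(X[:, i])[1] for i in range(X.shape[1]) ]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # apply the lambdas learned during fit
        return np.column_stack([ stats.boxcox(X[:, i], lmbda=self.lambdas_[i])
                                 for i in range(X.shape[1]) ])

A transformer like this could then simply be appended to the classes_numerical list that we pass to gen_features.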

To sparse or not to sparse

In the python data world, data is considered to be either sparse or dense, which refers to the number of zeros in a matrix (wiki). Sparse means that you have a lot of them, while dense means the opposite. There is no particular threshold, but we should be aware that some data transformations, like dummy encoding, make our data more sparse. A sparse matrix can be stored in a more memory-efficient format, similar to a compressed image file, and some algorithms can computationally leverage this format to reduce computing time. Lasso and gradient-boosting-style algorithms seem to be able to profit from the sparse format, while others (neural nets, knn) do not, and some, like random forest, require the regular dense format and will otherwise raise an error. We can use SciPy to convert matrices to a sparse format. We can measure the sparsity ratio as follows.

Sparsity Ratio

def sparsity_ratio(X):
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])
print('sparsity ratio original data:', round( sparsity_ratio(X), 2) )
print('sparsity ratio transformed data:', round( sparsity_ratio(mapped_np), 2) )
sparsity ratio original data: 0.17
sparsity ratio transformed data: 0.56

The transformations have resulted in a matrix with high sparsity, so we will test whether we might benefit from converting to a sparse matrix format.

from scipy import sparse
from time import time
from sklearn.tree import DecisionTreeClassifier


X_sparse = sparse.coo_matrix(mapped_np)

clf_sparse = DecisionTreeClassifier()
clf_dense = DecisionTreeClassifier()

t0 = time()
clf_sparse.fit(X_sparse, y)
print('exec time sparse:', round( time() - t0,3 ) )


t0 = time()
clf_dense.fit(mapped_np, y)
print('exec time dense :', round( time() - t0,3 ) )
exec time sparse: 0.019
exec time dense : 0.008

We can see that our decision tree classifier does not benefit from a sparse data format.
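If we wanted to check whether a linear, lasso-style model behaves differently, a sketch along the following lines could be used; SGDClassifier with an L1 penalty is my stand-in here, not something from the original comparison, and CSR is generally the preferred sparse layout for scikit-learn estimators.

from scipy import sparse
from time import time
from sklearn.linear_model import SGDClassifier

# CSR is usually the sparse format scikit-learn estimators work with internally
X_csr = sparse.csr_matrix(mapped_np)

clf_sparse_lin = SGDClassifier(penalty = 'l1')
clf_dense_lin  = SGDClassifier(penalty = 'l1')

t0 = time()
clf_sparse_lin.fit(X_csr, y)
print('exec time sparse:', round( time() - t0, 3 ) )

t0 = time()
clf_dense_lin.fit(mapped_np, y)
print('exec time dense :', round( time() - t0, 3 ) )

On a dataset this small the timings will not differ by much either way; the sparse format mostly pays off for large, very sparse matrices.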

Pipelines

Pipelines are constructs that chain scikit-learn preprocessing steps together, optionally attaching a classifier or regressor at the end.

We can then use the pipe as we would use a regular model: we can fit it and get predictions, obtain cross-validated performance scores or perform parameter tuning (see the sketch after the list below). This has a couple of advantages.

  • The code becomes more compact and readable
  • Instead of saving multiple transformers (scaling, boxcox ) we can simply store one to apply to future data
  • We can tune several steps of the pipeline in one go (for example feature selector + model tuning parameters)
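As a small illustration (my own sketch using the mapper defined above, not code from the original workflow), a pipe that chains the mapper and a classifier behaves like any single estimator: we can fit it, predict with it and cross-validate it.

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# chain the DataFrameMapper and a classifier into a single estimator
pipe_demo = Pipeline([
    ('mapper', mapper_np)
    , ('tree', DecisionTreeClassifier())
])

pipe_demo.fit(X, y)
print( pipe_demo.predict( X.head() ) )
print( cross_val_score(pipe_demo, X, y, cv = 5, scoring = 'roc_auc').mean() )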

We are going to construct two pipes, one for preprocessing and one for model fitting. It makes sense to separate these two because the first one contains a defined sequence of steps, while we are going to use the second pipe to tune certain parameters via cross validation.

When performing the cross validation, the transformers and estimators in the pipe will be applied after splitting the data into cross validation pairs. Cross validation is computationally expensive and we only want to use it for steps which are likely to introduce bias and can lead to overfitting, such as feature selection and hyperparameter tuning.

Preprocessing Pipeline

We are going to apply the sklearn-pandas dataframe mapper and a low variance feature filter.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import VarianceThreshold
from scipy import stats
import os


pipe_pre_process = sklearn.pipeline.Pipeline([
    ('mapper', mapper_np ) 
    , ('feats', VarianceThreshold() )
])


pipe_pre_process
Pipeline(memory=None,
     steps=[('mapper', DataFrameMapper(default=False, df_out=False,
        features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_la...h_std=True)])],
        input_df=False, sparse=False)), ('feats', VarianceThreshold(threshold=0.0))])
pipe_pre_process.named_steps
{'feats': VarianceThreshold(threshold=0.0),
 'mapper': DataFrameMapper(default=False, df_out=False,
         features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['class'], [CategoricalImp...es='NaN', strategy='median', verbose=0), StandardScaler(copy=True, with_mean=True, with_std=True)])],
         input_df=False, sparse=False)}

The parameters are stored in a nested dictionary and are named according to the following principle: step_name + '__' + argument.

pipe_pre_process.get_params()
{'feats': VarianceThreshold(threshold=0.0),
 'feats__threshold': 0.0,
 'mapper': DataFrameMapper(default=False, df_out=False,
         features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['class'], [CategoricalImp...es='NaN', strategy='median', verbose=0), StandardScaler(copy=True, with_mean=True, with_std=True)])],
         input_df=False, sparse=False),
 'mapper__default': False,
 'mapper__df_out': False,
 'mapper__features': [(['sex'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['embarked'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['class'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['who'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['adult_male'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['deck'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['embark_town'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['alone'],
   [CategoricalImputer(copy=True, missing_values='NaN'),
    LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]),
  (['pclass'],
   [Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
    StandardScaler(copy=True, with_mean=True, with_std=True)]),
  (['age'],
   [Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
    StandardScaler(copy=True, with_mean=True, with_std=True)]),
  (['sibsp'],
   [Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
    StandardScaler(copy=True, with_mean=True, with_std=True)]),
  (['parch'],
   [Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
    StandardScaler(copy=True, with_mean=True, with_std=True)]),
  (['fare'],
   [Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0),
    StandardScaler(copy=True, with_mean=True, with_std=True)])],
 'mapper__input_df': False,
 'mapper__sparse': False,
 'memory': None,
 'steps': [('mapper', DataFrameMapper(default=False, df_out=False,
           features=[(['sex'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['embarked'], [CategoricalImputer(copy=True, missing_values='NaN'), LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)]), (['class'], [CategoricalImp...es='NaN', strategy='median', verbose=0), StandardScaler(copy=True, with_mean=True, with_std=True)])],
           input_df=False, sparse=False)),
  ('feats', VarianceThreshold(threshold=0.0))]}

We can set a parameter

pipe_pre_process.set_params(feats__threshold = 0.05)
pipe_pre_process.named_steps.feats
VarianceThreshold(threshold=0.05)

Then we fit the preprocessing pipe to the data

# fit_transform fits the pipe and returns the transformed data in one step
X_proc = pipe_pre_process.fit_transform(X)

X_proc
array([[ 1.        ,  0.        ,  0.        , ...,  0.43279337,
        -0.47367361, -0.50244517],
       [ 0.        ,  1.        ,  0.        , ...,  0.43279337,
        -0.47367361,  0.78684529],
       [ 0.        ,  0.        ,  0.        , ..., -0.4745452 ,
        -0.47367361, -0.48885426],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.43279337,
         2.00893337, -0.17626324],
       [ 1.        ,  1.        ,  0.        , ..., -0.4745452 ,
        -0.47367361, -0.04438104],
       [ 1.        ,  0.        ,  1.        , ..., -0.4745452 ,
        -0.47367361, -0.49237783]])

We will be using the preprocessed data in another post, so we are saving it to disk. We store it in the feather format, a language-agnostic columnar format built on Apache Arrow, which is much faster for reading from and writing to disk.

import feather
import os

if not os.path.isdir('./data'):
    os.mkdir('./data')


df_feather = mapped_df.\
    assign( y = y )

feather.write_dataframe(df_feather, './data/mapped_df.feather')
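To load the data again later (in the follow-up post, for example), the feather package provides a matching reader; a short usage sketch:

# read the stored dataframe back from disk
df_restored = feather.read_dataframe('./data/mapped_df.feather')
df_restored.head()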

Modelling Pipeline

We will add a feature selection step, which chooses variables based on a univariate test such as a chi-square test (which we cannot use here because it does not accept negative values) or ANOVA, and then fit a decision tree.

pipe_mod = sklearn.pipeline.Pipeline([
    ('feats', sklearn.feature_selection.SelectKBest( k = 10) ) 
    , ('tree', sklearn.tree.DecisionTreeClassifier() )
])
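Before tuning, we could get a quick cross-validated baseline for the untuned pipe; this snippet is a sketch I am adding for illustration rather than part of the original tuning run.

from sklearn.model_selection import cross_val_score

# baseline ROC AUC with default parameters; SelectKBest and the tree
# are refit inside every training fold, so there is no leakage
scores_baseline = cross_val_score(pipe_mod, X_proc, y, cv = 5, scoring = 'roc_auc')
print( round( scores_baseline.mean(), 3 ) )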

We can apply the same '__' syntax we used for setting the parameters of the pipe when constructing the dictionary for the randomized hyperparameter search.

param_dist = dict( tree__min_samples_split = stats.randint(2,250)
                 , tree__min_samples_leaf = stats.randint(1,500)
                 , tree__min_impurity_decrease = stats.uniform(0,1)
                 , tree__max_features = stats.uniform(0,1)
                 , feats__score_func = [sklearn.feature_selection.f_classif ## Anova
                                       , sklearn.feature_selection.mutual_info_classif] ) ## nearest n

n_iter = 500

random_search = RandomizedSearchCV(pipe_mod
                                   , param_dist
                                   , n_iter = n_iter
                                   , scoring = 'roc_auc'
                                   , cv = RepeatedKFold( n_splits = 5, n_repeats = 3 )
                                   , verbose = True
                                   , n_jobs = 4 ## parallel processing
                                   , return_train_score = True
                                  )


random_search.fit(X = X_proc, y =  df.survived )
Fitting 15 folds for each of 500 candidates, totalling 7500 fits


[Parallel(n_jobs=4)]: Done  49 tasks      | elapsed:    5.8s
[Parallel(n_jobs=4)]: Done 740 tasks      | elapsed:   34.9s
[Parallel(n_jobs=4)]: Done 2005 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 3737 tasks      | elapsed:  2.6min
[Parallel(n_jobs=4)]: Done 5556 tasks      | elapsed:  4.3min
[Parallel(n_jobs=4)]: Done 7241 tasks      | elapsed:  5.9min
[Parallel(n_jobs=4)]: Done 7493 out of 7500 | elapsed:  6.1min remaining:    0.2s
[Parallel(n_jobs=4)]: Done 7500 out of 7500 | elapsed:  6.1min finished





RandomizedSearchCV(cv=<sklearn.model_selection._split.RepeatedKFold object at 0x000001916A6DC9B0>,
          error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('feats', SelectKBest(k=10, score_func=<function f_classif at 0x000001916A6ABF28>)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'))]),
          fit_params=None, iid=True, n_iter=500, n_jobs=4,
          param_distributions={'tree__min_impurity_decrease': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001916A6DC160>, 'feats__score_func': [<function f_classif at 0x000001916A6ABF28>, <function mutual_info_classif at 0x000001916A6CD2F0>], 'tree__max_features': <scipy.stats._distn_infrastru...tree__min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001916A6D9FD0>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='roc_auc', verbose=True)
random_search.best_estimator_
Pipeline(memory=None,
     steps=[('feats', SelectKBest(k=10,
      score_func=<function mutual_info_classif at 0x000001916A6CD2F0>)), ('tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=0.6995334988533182, max_leaf_nodes=None,
            min_impurity_decrease=0.00253...t=47, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))])
res = pd.DataFrame( random_search.cv_results_ )

res.sort_values('rank_test_score').head(10)
mean_fit_time mean_score_time mean_test_score mean_train_score param_feats__score_func param_tree__max_features param_tree__min_impurity_decrease param_tree__min_samples_leaf param_tree__min_samples_split params ... split7_test_score split7_train_score split8_test_score split8_train_score split9_test_score split9_train_score std_fit_time std_score_time std_test_score std_train_score
206 0.360021 0.001642 0.851727 0.862858 <function mutual_info_classif at 0x000001916A6... 0.699533 0.00253909 33 47 {'feats__score_func': <function mutual_info_cl... ... 0.772863 0.866083 0.847184 0.881330 0.867859 0.867060 0.031329 0.003972 0.036028 0.016553
88 0.004166 0.001308 0.829953 0.838292 <function f_classif at 0x000001916A6ABF28> 0.87121 0.0133541 89 210 {'feats__score_func': <function f_classif at 0... ... 0.736742 0.809940 0.845372 0.858085 0.798939 0.825771 0.006909 0.003953 0.034927 0.019111
194 0.298661 0.000000 0.811322 0.811516 <function mutual_info_classif at 0x000001916A6... 0.813777 0.0138638 124 79 {'feats__score_func': <function mutual_info_cl... ... 0.713745 0.779844 0.815275 0.821707 0.798939 0.825771 0.029102 0.000000 0.032071 0.018513
230 0.435979 0.003441 0.777916 0.776460 <function mutual_info_classif at 0x000001916A6... 0.407913 0.127602 3 215 {'feats__score_func': <function mutual_info_cl... ... 0.754735 0.786423 0.768997 0.766081 0.750393 0.787890 0.268832 0.005945 0.019844 0.009008
285 0.312650 0.001333 0.777642 0.779836 <function mutual_info_classif at 0x000001916A6... 0.95935 0.0960747 234 36 {'feats__score_func': <function mutual_info_cl... ... 0.713745 0.779844 0.779288 0.780731 0.750393 0.787890 0.020375 0.001885 0.025972 0.005002
347 0.003999 0.001600 0.777642 0.779836 <function f_classif at 0x000001916A6ABF28> 0.722103 0.0547579 104 224 {'feats__score_func': <function f_classif at 0... ... 0.713745 0.779844 0.779288 0.780731 0.750393 0.787890 0.002529 0.001959 0.025972 0.005002
263 0.004570 0.001416 0.777642 0.779836 <function f_classif at 0x000001916A6ABF28> 0.83097 0.0618748 211 14 {'feats__score_func': <function f_classif at 0... ... 0.713745 0.779844 0.779288 0.780731 0.750393 0.787890 0.006060 0.002616 0.025972 0.005002
459 0.005208 0.000000 0.777642 0.779836 <function f_classif at 0x000001916A6ABF28> 0.93767 0.0945291 11 96 {'feats__score_func': <function f_classif at 0... ... 0.713745 0.779844 0.779288 0.780731 0.750393 0.787890 0.009316 0.000000 0.025972 0.005002
224 0.002137 0.001575 0.773051 0.774232 <function f_classif at 0x000001916A6ABF28> 0.509988 0.0569047 74 12 {'feats__score_func': <function f_classif at 0... ... 0.754735 0.786423 0.779288 0.780731 0.750393 0.787890 0.001999 0.003991 0.021037 0.015873
472 0.004166 0.002083 0.771509 0.774650 <function f_classif at 0x000001916A6ABF28> 0.555193 0.0792222 173 195 {'feats__score_func': <function f_classif at 0... ... 0.713745 0.779844 0.779288 0.780731 0.750393 0.787890 0.006909 0.005311 0.027898 0.013231

10 rows × 45 columns

Summary

We can see that our maximum ROC AUC score of 0.86 is similar to what we obtained in the last post (0.85), where we took a more manual approach. However, using sklearn-pandas and pipes we were able to write code that is more generalizable and less dependent on the actual dataset. We have more or less written generalizable code for the preprocessing pipe; the code for the modelling pipe, however, is quite specific to the model that we used. If we wanted to train more models of a different type we would have to manually write pipes for them as well.