
Pipelines

Pipelines are a great tool for reusing the work you've already done on a dataset and applying it to new data. A Pipeline wraps up your data cleaning, your model, and any intermediate steps into one ordered operation.

Let's start with a very simple data frame to work on.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'age': [25,22,26,np.nan,30,35,40,42,43,44],
    'income': [40,37,42,60,58,70,62,85,120,95],
    'owns_car': [0,0,1,0,0,1,1,0,1,1]
})
df
age income owns_car
0 25.0 40 0
1 22.0 37 0
2 26.0 42 1
3 NaN 60 0
4 30.0 58 0
5 35.0 70 1
6 40.0 62 1
7 42.0 85 0
8 43.0 120 1
9 44.0 95 1

We will be predicting car ownership, so let's split the features from the target now.

features = ['age','income']
X = df[features]
y = df['owns_car']

To begin, let's go through some of the usual cleaning suspects to get this ready for a logistic regression. Specifically, let's bring in SimpleImputer and StandardScaler.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

First, let's deal with the NaN value in age. SimpleImputer will impute the mean age for that value. We then stuff the result back into a DataFrame so we can see what we've done.

imp = SimpleImputer(missing_values=np.nan,strategy='mean', copy=True)
imputed = pd.DataFrame(imp.fit_transform(X))
imputed.columns = X.columns
X = imputed
X
age income
0 25.000000 40.0
1 22.000000 37.0
2 26.000000 42.0
3 34.111111 60.0
4 30.000000 58.0
5 35.000000 70.0
6 40.000000 62.0
7 42.000000 85.0
8 43.000000 120.0
9 44.000000 95.0

Let's bring in StandardScaler to scale our features. Again we stuff it back into a DataFrame for ease of viewing.

scal = StandardScaler()
scaled = pd.DataFrame(scal.fit_transform(X))
scaled.columns = X.columns
X = scaled
X
age income
0 -1.189305e+00 -1.068765
1 -1.580906e+00 -1.187959
2 -1.058772e+00 -0.989303
3 9.274965e-16 -0.274144
4 -5.366378e-01 -0.353606
5 1.160298e-01 0.123166
6 7.686974e-01 -0.194682
7 1.029764e+00 0.719132
8 1.160298e+00 2.109719
9 1.290831e+00 1.116443

We now have clean data prepared for the vast majority of algorithms you may want to throw at it. We are going to fit a default logistic regression, using the features in X to predict car ownership.

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42, solver='lbfgs')
lr.fit(X,y)
lr.score(X,y)
0.8

Now we have a working model. Maybe not the most accurate, but it does run. If we wanted to apply this model to a new piece of data, we would need to run through each of those steps again, making the same transformations against the new data. That seems like a hassle. That is a hassle. Instead, below I am going to bring in Pipeline to solve this. First I am going to re-instantiate our original feature data as bf.

bf = pd.DataFrame({
    'age': [25,22,26,np.nan,30,35,40,42,43,44],
    'income': [40,37,42,60,58,70,62,85,120,95]
})

Now that we have some "new" raw data for the model to predict against, we will create the Pipeline. It consists of a list of tuples that it runs in order. The first half of each tuple is the name of the step; the second is the transformer or estimator it will apply.

from sklearn.pipeline import Pipeline
pipe = Pipeline([
        ('simple_imputer', SimpleImputer(missing_values=np.nan,strategy='mean')),
        ('scale', StandardScaler())
])
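
As a side note, those step names are not just labels. Pipeline exposes a named_steps attribute, so you can pull an individual step back out by the name you gave it, which becomes handy once the pipeline is fitted and you want to inspect a step's learned parameters. A quick sketch against the pipe defined above:

# Each step can be looked up by the name we assigned it
pipe.named_steps['simple_imputer']
pipe.named_steps['scale']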

With pipe now instantiated we can use it to transform bf to align with what our model needs. Make sure the fit happens against the original data and not against the new data; fitting on the new data would change the mean used for imputation and the scaling parameters used in our cleaning.

pipe.fit(df[features])
piped_bf = pipe.transform(bf)
piped_bf = pd.DataFrame(piped_bf)
piped_bf.columns = bf.columns
piped_bf
age income
0 -1.189305e+00 -1.068765
1 -1.580906e+00 -1.187959
2 -1.058772e+00 -0.989303
3 9.274965e-16 -0.274144
4 -5.366378e-01 -0.353606
5 1.160298e-01 0.123166
6 7.686974e-01 -0.194682
7 1.029764e+00 0.719132
8 1.160298e+00 2.109719
9 1.290831e+00 1.116443
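
As a quick sanity check, the transformed bf should match the X we built by hand earlier, since the pipeline was fit on the same original data:

# piped_bf should line up with the manually imputed-and-scaled X from before
np.allclose(piped_bf.values, X.values)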

Now we can use our original logistic regression to predict against the new data.

lr.predict(piped_bf)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

Now, what has been done above works, but it's not aligned with how we would actually get this going in production. We never did a train-test split, and a sample size of 10 is easy to demonstrate on but very unrealistic. Pipeline also gives us the ability to wrap the model itself inside the pipeline, and we should look at that.

Below I am going to lay out how I actually implement Pipeline. To do this I will be bringing in DataFrameMapper from sklearn_pandas. I won't get verbose on how to use the mapper, but I would highly suggest you take a look at its documentation, as it is a great tool. I am also going to resample our data up to a size that seems a bit more realistic for this sort of prediction.

from sklearn.utils import resample
cf = resample(df,n_samples=500_000,random_state=42)
cf = cf.reset_index()
cf.tail()
index age income owns_car
499995 3 NaN 60 0
499996 0 25.0 40 0
499997 8 43.0 120 1
499998 7 42.0 85 0
499999 7 42.0 85 0

Now that we have it resampled out to 500,000 rows we will train-test split. features is still defined from much earlier in the code, and the default train_test_split parameters will be used.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cf[features], cf['owns_car'], random_state=42)
X_train.head()
age income
359342 22.0 37
236051 30.0 58
452617 43.0 120
34245 25.0 40
373935 43.0 120

from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    (['age'], [SimpleImputer(missing_values=np.nan,strategy='mean'),StandardScaler()]),
    (['income'], StandardScaler())
])
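
One convenience worth noting: sklearn_pandas also supports a df_out flag on DataFrameMapper (in reasonably recent versions) that returns a labeled DataFrame instead of a bare array, which makes eyeballing the cleaned features easier. A sketch, assuming your installed version supports it:

# Same mapper, but asking for DataFrame output (requires df_out support in sklearn_pandas)
mapper_df = DataFrameMapper([
    (['age'], [SimpleImputer(missing_values=np.nan, strategy='mean'), StandardScaler()]),
    (['income'], StandardScaler())
], df_out=True)
mapper_df.fit_transform(X_train.copy()).head()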

Now that we have the mapper set up, we will make a new pipeline, pipe_2, and fill it with the mapper and the logistic regression. Pipeline always executes its steps in order, so the mapper will clean our data and the regression will then be applied. If we needed extra steps, say some PCA, we could just slot them in after the cleaning (a sketch of that follows the next code block).

pipe_2 = Pipeline([
    ('map', mapper),
    ('log', LogisticRegression(random_state=42, solver='lbfgs'))
])
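
To make that concrete, here is a hypothetical variant of pipe_2 with a PCA step slotted in between the cleaning and the model; the step name and n_components are just illustrative:

from sklearn.decomposition import PCA

pipe_with_pca = Pipeline([
    ('map', mapper),
    ('pca', PCA(n_components=2)),   # extra step runs after cleaning, before the model
    ('log', LogisticRegression(random_state=42, solver='lbfgs'))
])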

Now it's time to execute pipe_2. First we fit it against the training set; this establishes the cleaning parameters and fits the logistic model for us to predict from. Next we have pipe_2 predict against X_test and compare those predictions to y_test.

pipe_2.fit(X_train, y_train)
y_pred = pipe_2.predict(X_test)
comp = pd.DataFrame({
    'y_pred': y_pred,
     'y_test': y_test
})
comp.head(15)
y_pred y_test
104241 1 1
199676 1 1
140199 1 1
132814 0 0
408697 1 1
163280 0 0
215758 0 0
442316 1 0
6940 1 1
382310 1 1
472236 0 0
309086 0 0
230672 0 0
209236 1 0
102953 1 1
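
Eyeballing a comparison frame works, but the fitted pipeline can also score itself directly against the held-out data; this reports the accuracy of the final logistic regression step:

# Accuracy of the whole pipeline (cleaning + model) on the test set
pipe_2.score(X_test, y_test)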

Not perfect, but it's clearly working. Pipeline makes your workflow much more modular. If you need more steps, you slot them into the pipeline where you want them to happen. If you need a new model, you swap it out. Better yet, a Pipeline can contain other pipelines. Have a weird replace-NA, lambda, and regularization routine that has to fix a feature before it hits the mapper? No problem, wrap it up and throw it in the pipeline before the mapper, as sketched below.
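
As an illustration of that last point, here is a hypothetical pre-cleaning step built with sklearn's FunctionTransformer; the fill_weird_na function is made up for the example, but the pattern of wrapping a custom fix and slotting it in before the mapper is the point:

from sklearn.preprocessing import FunctionTransformer

def fill_weird_na(frame):
    # Hypothetical cleanup: swap a sentinel value for NaN so the mapper's imputer can handle it
    return frame.replace(-999, np.nan)

pipe_3 = Pipeline([
    ('pre_clean', FunctionTransformer(fill_weird_na, validate=False)),   # custom fix runs first
    ('map', mapper),
    ('log', LogisticRegression(random_state=42, solver='lbfgs'))
])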

As stated at the top, Pipeline helps wrap your whole process together. It's the box you can throw everything into, and hopefully it frees you up to be a bit more creative with the rest of your work. Throughout this I have shown you Pipeline, which requires a tuple of a name and an estimator for each step. There is another option in make_pipeline, which doesn't require the names; it generates them for you automatically. Most days you will end up using make_pipeline, as it is faster to write. But when you need more explicit naming and documentation, Pipeline will be the ticket.
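
For completeness, here is the earlier cleaning pipeline written with make_pipeline instead; the steps get auto-generated names (lower-cased class names like 'simpleimputer' and 'standardscaler'):

from sklearn.pipeline import make_pipeline

auto_pipe = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy='mean'),
    StandardScaler()
)
auto_pipe.steps   # step names are generated from the class names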