
Pipelines

Pipelines are a great tool for reusing the work you've already done on a dataset and applying it to new data. A Pipeline wraps up your data cleaning, your model, and any intermediate steps into one ordered operation.

Let's start with a very simple data frame to work on.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'age': [25,22,26,np.nan,30,35,40,42,43,44],
    'income': [40,37,42,60,58,70,62,85,120,95],
    'owns_car': [0,0,1,0,0,1,1,0,1,1]
})
df
age income owns_car
0 25.0 40 0
1 22.0 37 0
2 26.0 42 1
3 NaN 60 0
4 30.0 58 0
5 35.0 70 1
6 40.0 62 1
7 42.0 85 0
8 43.0 120 1
9 44.0 95 1

We will be predicting car ownership, so let's split the features from the target now.

features = ['age','income']
X = df[features]
y = df['owns_car']

To begin, let's go through some of the usual cleaning suspects to get this ready for a logistic regression. Specifically, let's bring in SimpleImputer and StandardScaler.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

First, let's deal with the NaN value in age. SimpleImputer will impute the mean age for that value. We then stuff the result back into a DataFrame so we can see what we've done.

imp = SimpleImputer(missing_values=np.nan,strategy='mean', copy=True)
imputed = pd.DataFrame(imp.fit_transform(X))
imputed.columns = X.columns
X = imputed
X
age income
0 25.000000 40.0
1 22.000000 37.0
2 26.000000 42.0
3 34.111111 60.0
4 30.000000 58.0
5 35.000000 70.0
6 40.000000 62.0
7 42.000000 85.0
8 43.000000 120.0
9 44.000000 95.0

Let's bring in StandardScaler to scale our features. Again we stuff it back into a DataFrame for ease of viewing.

scal = StandardScaler()
scaled = pd.DataFrame(scal.fit_transform(X))
scaled.columns = X.columns
X = scaled
X
age income
0 -1.189305e+00 -1.068765
1 -1.580906e+00 -1.187959
2 -1.058772e+00 -0.989303
3 9.274965e-16 -0.274144
4 -5.366378e-01 -0.353606
5 1.160298e-01 0.123166
6 7.686974e-01 -0.194682
7 1.029764e+00 0.719132
8 1.160298e+00 2.109719
9 1.290831e+00 1.116443

We now have clean data prepared for the vast majority of algorithms you may want to throw at it. We are going to fit a default logistic regression, using the features in X to predict car ownership.

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42, solver='lbfgs')
lr.fit(X,y)
lr.score(X,y)
0.8

Now we have a working model. Maybe not the most accurate, but it does run. If we wanted to apply this model to a new piece of data, we would need to run through each of those steps again, making the same transformations against the new data. That seems like a hassle. That is a hassle. Instead, below I am going to bring in Pipeline to solve this. First I am going to re-instantiate our original feature data as bf.

bf = pd.DataFrame({
    'age': [25,22,26,np.nan,30,35,40,42,43,44],
    'income': [40,37,42,60,58,70,62,85,120,95]
})

Now that we have some "new" raw data for the model to predict against, we will create the Pipeline. It consists of a list of tuples that it runs in order. The first half of each tuple is the name of the step; the second is the transformer or estimator it will apply.

from sklearn.pipeline import Pipeline
pipe = Pipeline([
        ('simple_imputer', SimpleImputer(missing_values=np.nan,strategy='mean')),
        ('scale', StandardScaler())
])
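
As a side note, those step names are not just labels. Pipeline exposes a named_steps attribute, so you can pull an individual step back out by the name you gave it, which becomes handy once the pipeline is fitted and you want to inspect a step's learned parameters. A quick sketch against the pipe defined above:

# Each step can be looked up by the name we assigned it
pipe.named_steps['simple_imputer']
pipe.named_steps['scale']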

With pipe now instantiated we can use it to transform bf to align with what our model needs. Make sure the fit happens against the original data and not against the new data; fitting on the new data would change the mean used for imputation and the scaling parameters used in our cleaning.

pipe.fit(df[features])
piped_bf = pipe.transform(bf)
piped_bf = pd.DataFrame(piped_bf)
piped_bf.columns = bf.columns
piped_bf
age income
0 -1.189305e+00 -1.068765
1 -1.580906e+00 -1.187959
2 -1.058772e+00 -0.989303
3 9.274965e-16 -0.274144
4 -5.366378e-01 -0.353606
5 1.160298e-01 0.123166
6 7.686974e-01 -0.194682
7 1.029764e+00 0.719132
8 1.160298e+00 2.109719
9 1.290831e+00 1.116443
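
As a quick sanity check, the transformed bf should match the X we built by hand earlier, since the pipeline was fit on the same original data:

# piped_bf should line up with the manually imputed-and-scaled X from before
np.allclose(piped_bf.values, X.values)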

Now we can use our original logistic regression to predict against the new data.

lr.predict(piped_bf)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

Now, what has been done above works, but it's not aligned with how we would actually get this going in production. We never did a train-test split, and a sample size of 10 is easy to demonstrate on but very unrealistic. Pipeline also gives us the ability to wrap the model itself inside the pipeline, and we should look at that.

Below I am going to lay out how I actually implement Pipeline. To do this I will be bringing in DataFrameMapper from sklearn_pandas. I won't get verbose on how to use the mapper, but I would highly suggest you take a look at its documentation, as it is a great tool. I am also going to resample our data up to a size that seems a bit more realistic for this sort of prediction.

from sklearn.utils import resample
cf = resample(df,n_samples=500_000,random_state=42)
cf = cf.reset_index()
cf.tail()
index age income owns_car
499995 3 NaN 60 0
499996 0 25.0 40 0
499997 8 43.0 120 1
499998 7 42.0 85 0
499999 7 42.0 85 0

Now that we have it resampled out to 500,000 rows we will train-test split. features is still defined from much earlier in the code, and the default train_test_split parameters will be used.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cf[features], cf['owns_car'], random_state=42)
X_train.head()
age income
359342 22.0 37
236051 30.0 58
452617 43.0 120
34245 25.0 40
373935 43.0 120

from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    (['age'], [SimpleImputer(missing_values=np.nan,strategy='mean'),StandardScaler()]),
    (['income'], StandardScaler())
])
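
One convenience worth noting: sklearn_pandas also supports a df_out flag on DataFrameMapper (in reasonably recent versions) that returns a labeled DataFrame instead of a bare array, which makes eyeballing the cleaned features easier. A sketch, assuming your installed version supports it:

# Same mapper, but asking for DataFrame output (requires df_out support in sklearn_pandas)
mapper_df = DataFrameMapper([
    (['age'], [SimpleImputer(missing_values=np.nan, strategy='mean'), StandardScaler()]),
    (['income'], StandardScaler())
], df_out=True)
mapper_df.fit_transform(X_train.copy()).head()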

Now that we have the mapper set up, we will make a new pipeline, pipe_2, and fill it with the mapper and the logistic regression. Pipeline always executes its steps in order, so the mapper will clean our data and the regression will then be applied. If we needed extra steps, say some PCA, we could just slot them in after the cleaning (a sketch of that follows the next code block).

pipe_2 = Pipeline([
    ('map', mapper),
    ('log', LogisticRegression(random_state=42, solver='lbfgs'))
])
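
To make that concrete, here is a hypothetical variant of pipe_2 with a PCA step slotted in between the cleaning and the model; the step name and n_components are just illustrative:

from sklearn.decomposition import PCA

pipe_with_pca = Pipeline([
    ('map', mapper),
    ('pca', PCA(n_components=2)),   # extra step runs after cleaning, before the model
    ('log', LogisticRegression(random_state=42, solver='lbfgs'))
])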

Now it's time to execute pipe_2. First we fit it against the training set; this establishes the cleaning parameters and fits the logistic model for us to predict from. Next we have pipe_2 predict against X_test and compare those predictions to y_test.

pipe_2.fit(X_train, y_train)
y_pred = pipe_2.predict(X_test)
comp = pd.DataFrame({
    'y_pred': y_pred,
     'y_test': y_test
})
comp.head(15)
y_pred y_test
104241 1 1
199676 1 1
140199 1 1
132814 0 0
408697 1 1
163280 0 0
215758 0 0
442316 1 0
6940 1 1
382310 1 1
472236 0 0
309086 0 0
230672 0 0
209236 1 0
102953 1 1
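
Eyeballing a comparison frame works, but the fitted pipeline can also score itself directly against the held-out data; this reports the accuracy of the final logistic regression step:

# Accuracy of the whole pipeline (cleaning + model) on the test set
pipe_2.score(X_test, y_test)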

Not perfect, but it's clearly working. Pipeline makes your workflow much more modular. If you need more steps, you slot them into the pipeline where you want them to happen. If you need a new model, you swap it out. Better yet, a Pipeline can contain other pipelines. Have a weird replace-NA, lambda, and regularization routine that has to fix a feature before it hits the mapper? No problem, wrap it up and throw it in the pipeline before the mapper, as sketched below.
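
As an illustration of that last point, here is a hypothetical pre-cleaning step built with sklearn's FunctionTransformer; the fill_weird_na function is made up for the example, but the pattern of wrapping a custom fix and slotting it in before the mapper is the point:

from sklearn.preprocessing import FunctionTransformer

def fill_weird_na(frame):
    # Hypothetical cleanup: swap a sentinel value for NaN so the mapper's imputer can handle it
    return frame.replace(-999, np.nan)

pipe_3 = Pipeline([
    ('pre_clean', FunctionTransformer(fill_weird_na, validate=False)),   # custom fix runs first
    ('map', mapper),
    ('log', LogisticRegression(random_state=42, solver='lbfgs'))
])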

As stated at the top, Pipeline helps wrap your whole process together. It's the box you can throw everything into, and hopefully it frees you up to be a bit more creative with the rest of your work. Throughout this I have shown you Pipeline, which requires a tuple of a name and an estimator for each step. There is another option in make_pipeline, which doesn't require the names; it generates them for you automatically. Most days you will end up using make_pipeline, as it is faster to write. But when you need more explicit naming and documentation, Pipeline will be the ticket.
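
For completeness, here is the earlier cleaning pipeline written with make_pipeline instead; the steps get auto-generated names (lower-cased class names like 'simpleimputer' and 'standardscaler'):

from sklearn.pipeline import make_pipeline

auto_pipe = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy='mean'),
    StandardScaler()
)
auto_pipe.steps   # step names are generated from the class names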