Streamline Modelling & Prediction In Scikit-learn With Pipelines



It’s always fascinating watching a well-orchestrated production line at work (think How It’s Made). Raw materials go in, magic happens, and something wonderful comes out the other end. Such elegant efficiency can, and does, exist in machine learning too.

A doughnut production line

Image by Neil T, Wikimedia Commons

Raw data is often messy and needs to be preprocessed. Most models expect input data to be numeric, with no missing values, and some also require it to be normalized. That’s already three steps: encoding categorical features, imputing missing values, and normalizing.

Pipelines “sequentially apply a list of transformations”. They allow you to pool all the preprocessing steps and a predictor into one object: a production line of sorts.
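
To make the idea concrete, here is a minimal sketch (not from the original example) of a two-step production line, a scaler feeding a regressor:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Steps run in order: the scaler is fitted and applied, then the regressor is fitted
pipe = make_pipeline(StandardScaler(), SVR())

Calling pipe.fit(X, y) fits each step in turn, and pipe.predict(X_new) pushes new data through the same fitted transformers before predicting.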

Pipelines:

  • Prevent data leakage:
    • During cross-validation, each transformer is refitted on the training folds only, so statistics from validation data never leak into the preprocessing (see the sketch after this list).
  • Are convenient and efficient:
    • You won’t have to nest several transformers or juggle intermediate variables after each transformation.
    • You won’t have to preprocess new data by hand when making predictions.
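
For instance, continuing the sketch above (with hypothetical X and y), cross-validating the whole pipeline refits the scaler inside every fold:

from sklearn.model_selection import cross_val_score

# Each fold refits StandardScaler on that fold's training portion only,
# so the validation portion never influences the scaling statistics.
scores = cross_val_score(pipe, X, y, cv=5)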

A Simple Example


Consider a case where we wish to fit a Support Vector Regressor on the Auto MPG dataset, which has both numeric and categorical features. (The imports and data-loading code are sketched after the step list below.)

One possible course of action would be to:

  1. One-hot encode categorical features
  2. Standardize numeric features
  3. Impute missing values
  4. Perform a grid search to find the best model
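
The snippets below assume the following setup. The imports are the ones the code needs; the data-loading step is a sketch, assuming seaborn’s bundled mpg dataset (the original example may have loaded the data differently):

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# "name" is a unique identifier, so it is dropped from the features
mpg = sns.load_dataset("mpg").drop(columns="name")
X = mpg.drop(columns="mpg")
y = mpg["mpg"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)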

I. Performing transformations manually

# Encode categorical columns and scale numeric ones, selecting them by dtype
column_transformer = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(), make_column_selector(dtype_exclude="number")),
        ("numeric", StandardScaler(), make_column_selector(dtype_include="number")),
    ]
)
# Fill any remaining missing values with the column mean
imputer = SimpleImputer(strategy="mean")

# Intermediate variables are needed to store transformed data
X_train_transformed = column_transformer.fit_transform(X_train)
X_test_transformed = column_transformer.transform(X_test)
X_train_imputed = imputer.fit_transform(X_train_transformed)
X_test_imputed = imputer.transform(X_test_transformed)

model = SVR()
params = {"C": np.logspace(1, 5, 5)}  # C from 10 to 100,000, log-spaced
best_model_without_pipeline = GridSearchCV(model, param_grid=params, cv=5, n_jobs=4)
best_model_without_pipeline.fit(X_train_imputed, y_train)
best_model_without_pipeline.score(X_test_imputed, y_test)
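
Note the subtle leakage here: the transformers were fitted on all of X_train before the grid search, so every internal validation fold has already influenced the scaling and imputation statistics. The pipeline version below avoids this, because GridSearchCV refits the entire pipeline within each fold.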

II. Using a pipeline

# This pipeline encapsulates all transformations & the model
model_pipeline = make_pipeline(
    ColumnTransformer(
        [
            ("categorical", OneHotEncoder(), make_column_selector(dtype_exclude="number")),
            ("numeric", StandardScaler(), make_column_selector(dtype_include="number")),
        ]
    ),
    SimpleImputer(strategy="mean"),
    SVR(),
)
params = {"svr__C": np.logspace(1, 5, 5)}  # "svr" is the step name auto-generated by make_pipeline
best_model_with_pipeline = GridSearchCV(model_pipeline, param_grid=params, cv=5, n_jobs=4)
best_model_with_pipeline.fit(X_train, y_train)  # No intermediate variables necessary
best_model_with_pipeline.score(X_test, y_test)
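
After fitting, the GridSearchCV object exposes the winning configuration and its cross-validated score, which you can inspect in either version:

# Hyperparameters chosen by the grid search, e.g. the selected svr__C
print(best_model_with_pipeline.best_params_)
# Mean cross-validated score of the best configuration
print(best_model_with_pipeline.best_score_)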

Making Predictions on New Data

# A single unseen sample; column names and order must match the training features
new_row = pd.DataFrame([(9, 250.0, 150.0, 4100, 11.6, 84, "europe")], columns=X.columns)

# With pipeline
prediction1 = best_model_with_pipeline.predict(new_row)

# Without a pipeline, new data must first be pushed through each fitted transformer by hand
new_row_transformed = imputer.transform(column_transformer.transform(new_row))
prediction2 = best_model_without_pipeline.predict(new_row_transformed)
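
A nice side effect is that the whole production line can be saved and reloaded as a single object. A minimal sketch using joblib (the file name is illustrative):

import joblib

# The fitted transformers and the model travel together in one file
joblib.dump(best_model_with_pipeline, "mpg_model.joblib")
reloaded_model = joblib.load("mpg_model.joblib")
prediction3 = reloaded_model.predict(new_row)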

Conclusion

This was meant to be a brief demonstration of the power of pipelines. You can find out how to make them even better in scikit-learn’s “Pipelines and composite estimators” user guide.