Machine Learning Pipeline Optimization with TPOT

Authors
Matthew Mayo

(This article originally appeared on KDNuggets.com. For more, visit https://www.kdnuggets.com/)

Let's revisit the automated machine learning project TPOT, and get back up to speed on using open source AutoML tools on our way to building a fully-automated prediction pipeline.

It's been a while since I've had a look at TPOT, the Tree-based Pipeline Optimization Tool. TPOT is a Python automated machine learning (AutoML) tool for optimizing machine learning pipelines through the use of genetic programming. We are told by the authors to consider it our "data science assistant."

The rationale for AutoML stems from this idea: if numerous machine learning models must be built, using a variety of algorithms and a number of differing hyperparameter configurations, then this model building can be automated, as can the comparison of model performance and accuracy.

I want to have a fresh look at TPOT to see if we can flesh out an actual fully-automated assistant for data scientists. What if we could expand on the functionality of TPOT and build an end-to-end prediction pipeline, one which we could point at a dataset and get predictions out the other end, with no intervention in between? Sure, other possible tools for this exist, but what better way to understand the machine learning pipeline process, and any particular resulting constructed pipeline, than building it ourselves and making the decisions as to what happens along the way?

The goal wouldn't necessarily be to cut the data scientist out of the loop altogether, but to provide a baseline, or a number of possible solutions, to compare hand-crafted machine learning pipelines against. While the assistant toils in the background, the data scientist can come up with more clever approaches to attempt. At the very least, the resulting prediction pipelines could be good starting points for a data scientist to manually tweak and intervene with after the fact, with much of the rote work taken care of on her behalf.

An AutoML "solution" could include the tasks of data preprocessing, feature engineering, algorithm selection, algorithm architecture search, and hyperparameter tuning, or some subset or variation of these distinct tasks. Thus, automated machine learning can now be thought of as anything from solely performing a single task, such as automated feature engineering, all the way through to a fully-automated pipeline, from data preprocessing, to feature engineering, to algorithm selection, and so on. So why not build something that does it all?

Anyhow, the first step of this plan is to refamiliarize ourselves with TPOT, the project that will eventually be at the center of our fully-automated prediction pipeline optimizer. TPOT is an open source Python tool which "automatically creates and optimizes machine learning pipelines using genetic programming." It works in tandem with Scikit-learn, describing itself as a Scikit-learn wrapper, and aims to simplify the machine learning process by way of an AutoML approach based on genetic programming. The end result is automated hyperparameter selection, modeling with a variety of algorithms, and exploration of numerous feature representations, all leading to iterative model building and model evaluation.

Figure: Aspects of a machine learning pipeline automated by TPOT (source)

We will take a look at something a little more involved than the simple yet perfectly useful example script that can be found in the TPOT repository. The code should be straightforward and fairly easy to follow, so I won't go over it with a fine-toothed comb.

import timeit
from tpot import TPOTClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits, load_iris
from sklearn import metrics
 
def main():
    """Run TPOT optimizer"""

    # define dataset
    dataset = 'iris'
    #dataset = 'digits'

    random_state = 42
    train_size = 0.75
    test_size = 1.0 - train_size
    checkpoint_folder = './tpot_checkpoints'
    output_folder = './tpot_output'
    search_iters = 3
    verbosity = 0
    generations = 5
    population_size = 50
    n_jobs = -1
    times = []
    best_pipes = []
    scores = []
    ds = ''

    # load and split dataset
    if dataset == 'iris':
        ds = load_iris()
    elif dataset == 'digits':
        ds = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(ds.data,
                                                        ds.target,
                                                        train_size=train_size,
                                                        test_size=test_size,
                                                        random_state=random_state)

    # define scoring metric and model evaluation method
    scoring = 'accuracy'
    cv = ('stratified k-fold cross-validation',
          StratifiedKFold(n_splits=10,
                          shuffle=True,
                          random_state=random_state))

    # define search
    tpot = TPOTClassifier(cv=cv[1],
                          scoring=scoring,
                          verbosity=verbosity,
                          random_state=random_state,
                          n_jobs=n_jobs,
                          generations=generations,
                          population_size=population_size,
                          periodic_checkpoint_folder=checkpoint_folder)

    print(f'Optimizing prediction pipeline for the {dataset} dataset with {cv[0]} using the {scoring} scoring metric')

    # pipeline optimization iterations
    for i in range(search_iters):
        print(f'\nPipeline optimization iteration: {i}')
        start_time = timeit.default_timer()
        tpot.fit(X_train, y_train)
        elapsed = timeit.default_timer() - start_time
        score = tpot.score(X_test, y_test)
        best_pipes.append(tpot.fitted_pipeline_)
        tpot.export(f'{output_folder}/tpot_{dataset}_pipeline_{i}.py')
        print(f'>>> elapsed time: {elapsed} seconds')
        print(f'>>> pipeline score on test data: {score}')

    # check if pipelines are the same
    result = True
    first_pipe = str(best_pipes[0])
    for pipe in best_pipes:
        if first_pipe != str(pipe):
            result = False
    if result:
        print("\nAll best pipelines were the same:\n")
        print(best_pipes[0])
    else:
        print('\nBest pipelines:\n')
        print(*best_pipes, sep='\n\n')

if __name__ == "__main__":
    main()

Here is an example output from running our optimization script:

Optimizing prediction pipeline for the iris dataset with stratified k-fold cross-validation using the accuracy scoring metric

Pipeline optimization iteration: 0
>>> elapsed time: 135.48434898200503 seconds
>>> pipeline score on test data: 1.0

Pipeline optimization iteration: 1
>>> elapsed time: 132.3554882509925 seconds
>>> pipeline score on test data: 1.0

Pipeline optimization iteration: 2
>>> elapsed time: 133.29390010499628 seconds
>>> pipeline score on test data: 1.0

All best pipelines were the same:

Pipeline(memory=None,
         steps=[('stackingestimator',
                 StackingEstimator(estimator=KNeighborsClassifier(algorithm='auto',
                                                                  leaf_size=30,
                                                                  metric='minkowski',
                                                                  metric_params=None,
                                                                  n_jobs=None,
                                                                  n_neighbors=11,
                                                                  p=1,
                                                                  weights='uniform'))),
                ('extratreesclassifier',
                 ExtraTreesClassifier(bootstrap=True, ccp_alpha=0.0,
                                      class_weight=None, criterion='gini',
                                      max_depth=None,
                                      max_features=0.9000000000000001,
                                      max_leaf_nodes=None, max_samples=None,
                                      min_impurity_decrease=0.0,
                                      min_impurity_split=None,
                                      min_samples_leaf=18, min_samples_split=14,
                                      min_weight_fraction_leaf=0.0,
                                      n_estimators=100, n_jobs=None,
                                      oob_score=False, random_state=42,
                                      verbose=0, warm_start=False))],
         verbose=False)

The output provides some basic info on the pipeline iterations. If you can't tell from the combination of the script and its output, we have run the optimization process a total of 3 separate times; with each of these, we have used stratified 10-fold cross-validation; and the genetic optimization process has run for 5 generations on a population size of 50 for each of these iterations. Can you figure out how many pipelines were tested during the process? This is something we will have to give consideration to moving forward, not least for the practical reasons associated with computation time.
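
If you want to check your answer, the TPOT documentation notes that a single call to fit() evaluates population_size + generations × offspring_size pipelines, and that offspring_size defaults to the population size when not set. A quick back-of-the-envelope calculation under that assumption:

population_size = 50
generations = 5
offspring_size = population_size          # TPOT's default when not specified
search_iters = 3

pipelines_per_fit = population_size + generations * offspring_size   # 50 + 5 * 50 = 300
total_pipelines = pipelines_per_fit * search_iters                   # 300 * 3 = 900
print(total_pipelines)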

As you may recall, TPOT outputs the best pipeline — or pipelines, upon multiple iterations — to file, which can then be used to recreate the same experiment, or to use the same pipeline on new data. We will harness this as we move forward creating our fully-automated end-to-end prediction pipeline.
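
Beyond those exported scripts, the fitted pipeline object itself (tpot.fitted_pipeline_, as used in the script above) is an ordinary Scikit-learn Pipeline, so one option for reusing it on new data is to persist it with joblib. A minimal sketch, assuming a fitted tpot object and a hypothetical new feature matrix X_new:

from joblib import dump, load

# persist the best pipeline found by TPOT
dump(tpot.fitted_pipeline_, 'best_pipeline.joblib')

# later, possibly in a different script or process
best_pipeline = load('best_pipeline.joblib')
predictions = best_pipeline.predict(X_new)   # X_new must have the same feature layout as the training data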

In our case, our script noted that the resulting best pipelines were all identical, and so output only one of them. This is a reasonable result on such a small dataset, but due to the nature of genetic optimization, the best pipelines could differ between iterations on larger, more complex data.

Some things we tried with our script this time that we had not tried in the past:

  • Cross-validation for model evaluation
  • Iterating on the modeling more than once — likely not useful on such a small dataset, but possibly will be as we progress
  • Comparing resulting pipelines on these multiple iterations — are they all the same?
  • Did you know TPOT now uses PyTorch under the hood to build neural networks for prediction?

Maybe you already see some ways we can improve on the above. Some specific things we did not do this time, but which we will want to address in future implementations:

  • We would want to think about our dataset splitting proportions in order to have the ideal amount of training, validation, and testing data
  • As we are using cross-validation for training and validation (related to the above point), we would want to hang on to our testing data to use only on our best performing model, as opposed to on each one
  • While feature selection/engineering/construction is dealt with by TPOT, it expects numerical input, so we will want to automate the conversion of categorical variables to numerical form prior to feeding data in (see the sketch after this list)
  • We will want to be able to deal with a wider array of datasets :)
  • Much, much more!
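
As mentioned in the list above, TPOT expects numerical features, so categorical columns need to be encoded before the data reaches the optimizer. A minimal sketch of one way to automate that step, using pandas one-hot encoding (the column names here are illustrative assumptions, not part of the script above):

import pandas as pd

def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode object/category columns, leaving numeric columns untouched."""
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    return pd.get_dummies(df, columns=list(cat_cols))

# toy usage example
raw = pd.DataFrame({'petal_length': [1.4, 4.7, 5.1],
                    'region': ['north', 'south', 'north']})
print(encode_categoricals(raw))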

These points, while important for actual modeling, aren't really an issue right now, since our focus was only on putting the structure in place to iteratively build and evaluate machine learning pipelines. We can address these legitimate concerns as we move forward.

I encourage you to have a look at the TPOT documentation to see what it has in store for us as we leverage it to help build an end-to-end prediction pipeline.
