Feature stores – how to avoid feeling that every day is Groundhog Day

Authors
Monte Zweben

(This article originally appeared on KDNuggets.com. For more, visit https://www.kdnuggets.com/)

Feature stores stop the duplication of each task in the ML lifecycle. You can reuse features and pipelines for different models, monitor models consistently, and sidestep data leakage with this MLOps technology that everyone is talking about.

Work as a data scientist follows a cycle: log in, clean data, define features, test and build a model, and make sure the model is running smoothly. That sounds straightforward enough, except that not all parts of the cycle are created equal: data preparation is commonly estimated to take up to 80% of a data scientist’s time. No matter what project you’re working on, most days you’re cleaning data and converting raw data into features that machine learning models can understand. The monotony of data prep blends the hours together and makes each day of work feel identical to the one before it.

Why can't you do this tedious process more effectively?

You can, with a feature store. A feature store is a shareable repository of features built to automate the input, tracking, and governance of data into ML models. Feature stores compute and store features, enabling them to be registered, discovered, used, and shared across a company. A feature store makes sure features are always up to date for predictions and maintains the history of each feature’s values in a consistent manner so that models can be seamlessly trained and re-trained.
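To make the two responsibilities above concrete (fresh values for predictions, full history for training), here is a minimal in-memory sketch. The class name TinyFeatureStore, its methods, and the example entity and feature names are all hypothetical and exist only for illustration; a production feature store would back these with real databases.

```python
from collections import defaultdict
from datetime import datetime


class TinyFeatureStore:
    """Hypothetical toy store: latest values for serving, full history for training."""

    def __init__(self):
        self._latest = {}                   # (entity, feature) -> (observed_at, value)
        self._history = defaultdict(list)   # (entity, feature) -> [(observed_at, value)]

    def write(self, entity_id, feature, value, observed_at):
        key = (entity_id, feature)
        # Every write is appended to the history, so nothing is overwritten.
        self._history[key].append((observed_at, value))
        # The serving view keeps only the most recently observed value.
        current = self._latest.get(key)
        if current is None or observed_at >= current[0]:
            self._latest[key] = (observed_at, value)

    def get_latest(self, entity_id, feature):
        return self._latest[(entity_id, feature)][1]

    def get_history(self, entity_id, feature):
        return sorted(self._history[(entity_id, feature)])


store = TinyFeatureStore()
store.write("customer_42", "total_spend", 120.0, datetime(2021, 1, 1))
store.write("customer_42", "total_spend", 180.0, datetime(2021, 2, 1))

latest = store.get_latest("customer_42", "total_spend")
history = store.get_history("customer_42", "total_spend")
```

Here `latest` is what a deployed model would read at prediction time, while `history` is what a training job would read to rebuild past feature values.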

So, how will this save time for you as a data scientist? Glad you asked.

Reuse features and ETL pipelines

First and foremost, a feature store allows you to reuse features and data pipelines for each model you create. No more waiting around for bespoke ETL pipelines from data engineers, and no more having to copy, paste, and tweak feature definitions from previous models—just search, find, and use the feature you want.
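As a rough illustration of "search, find, and use," the sketch below registers each feature definition once and lets two different models reuse it. The Feature and FeatureRegistry names, the feature names, and the raw record fields are assumptions made up for this example, not the API of any particular feature store.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    name: str
    description: str
    compute: Callable[[dict], float]  # turns a raw record into a feature value


class FeatureRegistry:
    """Hypothetical registry: define a feature once, reuse it in any model."""

    def __init__(self):
        self._features: Dict[str, Feature] = {}

    def register(self, feature: Feature) -> None:
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def search(self, keyword: str) -> List[str]:
        # Match the keyword against feature names and descriptions.
        kw = keyword.lower()
        return sorted(
            name
            for name, f in self._features.items()
            if kw in name.lower() or kw in f.description.lower()
        )

    def build_row(self, names: List[str], raw: dict) -> Dict[str, float]:
        # Every model gets its values from the same registered definitions.
        return {n: self._features[n].compute(raw) for n in names}


registry = FeatureRegistry()
registry.register(Feature(
    "avg_order_value",
    "Average order value in dollars",
    lambda r: r["total_spend"] / r["order_count"],
))
registry.register(Feature(
    "days_since_signup",
    "Days since the customer signed up",
    lambda r: float(r["days_since_signup"]),
))

raw = {"total_spend": 300.0, "order_count": 4, "days_since_signup": 30}
found = registry.search("order")
churn_row = registry.build_row(["avg_order_value", "days_since_signup"], raw)
upsell_row = registry.build_row(["avg_order_value"], raw)
```

The churn model and the upsell model both pull `avg_order_value` from the same definition, so there is nothing to copy, paste, or tweak.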

Consistent model monitoring

In addition, a feature store lets you ensure that the features your model was trained on are generated by the same code as the features your deployed models see. Because feature stores use the same pipelines to serve features and to build training sets, the two are automatically consistent. This makes it easy to evaluate and train models, especially at a scale where you start to lose track of where the different training sets are stored and when they were updated.

For example, instead of having weekly aggregations begin on Sunday in training and on Monday when deployed, they would begin on the same day in both cases. This consistency makes it much easier to identify model or feature drift if it does occur. In addition, it’s far easier to retrain your model if something does go wrong, since there’s a single source of truth to refer to in a single location.
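The weekly-aggregation example can be sketched as one shared function that both the training path and the serving path call, so the week boundary cannot differ between them. The names WEEK_START, week_start, and weekly_totals, and the sample events, are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Single definition of when a week begins (0 = Monday, as in date.weekday()).
WEEK_START = 0


def week_start(day: date) -> date:
    """Return the first day of the week that contains `day`."""
    offset = (day.weekday() - WEEK_START) % 7
    return day - timedelta(days=offset)


def weekly_totals(events):
    """Sum (day, amount) events into weekly buckets keyed by week start."""
    totals = defaultdict(float)
    for day, amount in events:
        totals[week_start(day)] += amount
    return dict(totals)


events = [
    (date(2021, 3, 7), 10.0),  # a Sunday
    (date(2021, 3, 8), 5.0),   # a Monday
    (date(2021, 3, 9), 2.5),   # a Tuesday
]

# Training and serving call the same function, so their buckets always agree.
training_view = weekly_totals(events)
serving_view = weekly_totals(events)
```

With a Monday week start, the Sunday event lands in the week beginning March 1 and the other two in the week beginning March 8, in both views.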

Evade data leakage

Creating a training set from historical values is complicated on its own, but when you’re building a model that needs to be regularly re-trained, retrieving different time-series feature sets is a real headache. Writing complex SQL temporal joins is a lot of work and is outside the skill set of many data scientists. It’s all too easy to make mistakes that could compromise your model.

Feature stores log the time a feature was observed, as well as when it was added to the feature store, so you can identify the value of features at any point in time. By logging these timestamps, feature stores automatically enable point-in-time correctness when building training sets. This makes it incredibly easy to train a model on accurate data and automatically retrain it without worrying about data leakage.
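A minimal sketch of a point-in-time lookup follows: for each labeled event, it picks the most recent feature value observed at or before the event's timestamp, so no future value leaks into the training row. The function value_as_of and the sample data are hypothetical, and for brevity the sketch uses only the observation timestamp, not the ingestion timestamp.

```python
from bisect import bisect_right
from datetime import datetime

# (observed_at, value) history for one feature of one entity, sorted by time.
feature_history = [
    (datetime(2021, 1, 1), 100.0),
    (datetime(2021, 2, 1), 150.0),
    (datetime(2021, 3, 1), 90.0),
]


def value_as_of(history, as_of):
    """Return the latest value observed at or before `as_of`, or None if none exists."""
    times = [t for t, _ in history]
    idx = bisect_right(times, as_of)  # number of observations with time <= as_of
    return history[idx - 1][1] if idx > 0 else None


# (event_time, label) pairs for which training rows are needed.
label_events = [
    (datetime(2021, 1, 15), 0),
    (datetime(2021, 2, 20), 1),
    (datetime(2020, 12, 1), 0),  # earlier than any observation
]

training_set = [
    (value_as_of(feature_history, ts), label) for ts, label in label_events
]
```

The February 20 event sees the value observed on February 1, never the later March 1 value, and the event that predates all observations gets no value at all rather than a leaked one.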

Many people assume that data scientists spend most of their time testing and building models, but so much of getting a model up and running is wrangling the data that makes the model possible. Feature stores keep track of every data pipeline, feature, and all associated metadata so that the pieces you use to build a model are all reusable, easily shareable, and completely transparent to review. By making sure you only have to do each task once, you free up time and energy to focus on other projects. Feature stores are an efficiency no-brainer; if you want to learn more about their impact on the scale of an entire business, you can read my blog post, “Do you need a feature store?”
