Do you need a feature store?

Feature Stores make it easier and cheaper to produce more accurate ML models.

Monte Zweben
Towards Data Science

By: Monte Zweben, Morgan Sweeney

Source: alexdndz/Adobe Stock

A machine learning model is only going to be as good as the data it’s been fed. To be more precise, a model is only as good as the features it’s been given.

A feature is a useful metric or attribute taken from either a raw data point or an aggregation of several raw data points. The specific features used in a model will depend on the prediction the model is trying to make. If a model tries to predict fraudulent transactions, for example, relevant features may include whether or not the transaction was in a foreign country, if the purchase was larger than usual, or if the purchase doesn’t align with typical spending for the given customer. These features might be calculated from data points such as the location of the purchase, the value of the purchase, the value of an average purchase, and the aggregated spending patterns of the particular user making the purchase.
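To make the fraud example concrete, here is a minimal sketch of how such features might be derived from raw transaction data. The `Transaction` fields, the `fraud_features` function, and the "three times the average" threshold are illustrative assumptions, not part of any particular feature store.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Transaction:
    customer_id: str
    amount: float
    country: str


def fraud_features(txn, history, home_country):
    """Turn one raw transaction plus the customer's past transactions into features."""
    past_amounts = [t.amount for t in history]
    avg_amount = mean(past_amounts) if past_amounts else 0.0
    return {
        # Was the purchase made outside the customer's home country?
        "is_foreign": txn.country != home_country,
        # Ratio of this purchase to the customer's average purchase (0.0 with no history).
        "amount_vs_avg": txn.amount / avg_amount if avg_amount else 0.0,
        # Assumed rule of thumb: "larger than usual" means more than 3x the average.
        "is_larger_than_usual": bool(past_amounts) and txn.amount > 3 * avg_amount,
    }


history = [Transaction("c1", 20.0, "US"), Transaction("c1", 40.0, "US")]
new_txn = Transaction("c1", 300.0, "FR")
features = fraud_features(new_txn, history, home_country="US")
print(features)
```

The raw data points (location, amount, purchase history) never reach the model directly; the model only sees the derived values.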

While the data an ML model is trained on is of the utmost importance, preparing good data is one of the most challenging tasks for data scientists. By one widely cited survey estimate, about 80% of the average data scientist’s time is spent on data preparation. This includes collecting data, cleaning and organizing that data, and engineering it into features. This work is manual, monotonous, and tedious: 76% of data scientists rated data prep as the least enjoyable part of their work. Perhaps most importantly, much of this work is unnecessary: data scientists throughout a company often end up slogging through the data to calculate the same features that a colleague has already built. Additionally, data scientists spend considerable effort replicating the same feature engineering pipelines each time they want to deploy a model.

If this seems inefficient, that’s because it is. Small businesses and leading AI companies are turning to feature stores to solve this problem.

What Is a Feature Store?

Source: Author

A feature store is a system built specifically to automate how data is fed into machine learning models, and how that data is tracked and governed. Feature stores compute and store features, enabling them to be registered, discovered, used, and shared across a company. A feature store makes sure features are always up to date for predictions and maintains the history of each feature’s values in a consistent manner, so that models can be trained and re-trained. Specifically, a feature store includes:

  1. Automated Data Transformation
  2. Consistent Feature Registry
  3. Model Training and Retraining
  4. Real-Time Feature Serving
  5. Model Monitoring

Automated Data Transformation

Feature Stores manage data pipelines that transform raw data into feature values. These can be scheduled pipelines that aggregate petabytes of data at a time (like calculating the average 30-, 60-, and 90-day spending amounts of each customer of a large retailer), or real-time pipelines that are triggered by events and update feature values instantly (like updating the sum total of today’s spending for a particular customer every time they swipe their credit card).
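The two kinds of pipelines can be sketched in a few lines of Python. This is a toy illustration, not how a production feature store is built: `batch_avg_spend` and `TodaySpend` are hypothetical names, the data lives in memory, and "average spending" is assumed here to mean the average purchase amount within each trailing window.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def batch_avg_spend(purchases, as_of, windows=(30, 60, 90)):
    """Scheduled job: average purchase amount per customer over trailing windows.

    purchases: iterable of (customer_id, purchase_date, amount).
    """
    features = defaultdict(dict)
    for days in windows:
        start = as_of - timedelta(days=days)
        totals = defaultdict(lambda: [0.0, 0])  # customer_id -> [sum, count]
        for customer_id, day, amount in purchases:
            if start < day <= as_of:
                totals[customer_id][0] += amount
                totals[customer_id][1] += 1
        for customer_id, (total, count) in totals.items():
            features[customer_id][f"avg_spend_{days}d"] = total / count
    return dict(features)


class TodaySpend:
    """Event-triggered pipeline: running total of each customer's spending per day."""

    def __init__(self):
        self._totals = defaultdict(float)  # (customer_id, date) -> total

    def on_swipe(self, customer_id, when, amount):
        key = (customer_id, when.date())
        self._totals[key] += amount
        return self._totals[key]


purchases = [
    ("c1", date(2024, 3, 1), 100.0),
    ("c1", date(2024, 2, 1), 50.0),
    ("c1", date(2024, 1, 1), 30.0),
]
batch = batch_avg_spend(purchases, as_of=date(2024, 3, 10))
print(batch["c1"])

live = TodaySpend()
live.on_swipe("c1", datetime(2024, 3, 10, 9, 0), 12.5)
latest_total = live.on_swipe("c1", datetime(2024, 3, 10, 18, 30), 7.5)
print(latest_total)
```

The batch job recomputes everything on a schedule, while the event handler touches only the one value affected by the new swipe, which is what keeps it fast enough to run on every transaction.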

Consistent Feature Registry

A feature registry is a central interface for cataloguing feature definitions within an organization. A feature registry contains standardized feature definitions and associated metadata to act as a single source of information for an organization.

The feature store makes searching through available features and feature definitions simple and straightforward. It exposes APIs and UIs that let data scientists see the currently available features, pipelines, and training datasets, whether they are used in production models or still under development. Data scientists can then pick the features needed for their use case and incorporate them into models without any extra code.
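As a rough sketch of what a registry holds, the snippet below stores feature definitions with metadata and supports keyword search. `FeatureDefinition`, `FeatureRegistry`, and the metadata fields are assumptions made for illustration; real feature stores expose their own APIs and richer metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    description: str
    owner: str
    dtype: str
    tags: tuple = ()


class FeatureRegistry:
    """In-memory catalogue: one definition per feature name."""

    def __init__(self):
        self._features = {}

    def register(self, definition):
        # Refuse a second definition under an existing name.
        if definition.name in self._features:
            raise ValueError(f"feature {definition.name!r} is already registered")
        self._features[definition.name] = definition

    def get(self, name):
        return self._features[name]

    def search(self, keyword):
        """Names of features whose name, description, or tags match the keyword."""
        keyword = keyword.lower()
        return sorted(
            d.name
            for d in self._features.values()
            if keyword in d.name.lower()
            or keyword in d.description.lower()
            or any(keyword == t.lower() for t in d.tags)
        )


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    "avg_spend_30d", "Average purchase amount, trailing 30 days", "risk-team", "float", ("spending",)))
registry.register(FeatureDefinition(
    "is_foreign", "Transaction outside the customer's home country", "fraud-team", "bool", ("location",)))

print(registry.search("spend"))
print(registry.get("is_foreign").owner)
```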

Model Training and Retraining

A feature store keeps the history of each feature’s values in a time-series database, so that when a model is trained, every training example is paired with the feature values as they stood at the time of that example. Because all historical feature values are stored along with the most recent ones, the feature store can generate entire training datasets from a chosen set of features and align them properly with labels. As those features are updated, the feature store can generate updated training datasets for model retraining in exactly the same way.
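One way to picture this is an "as of" lookup over each feature's history. The sketch below is an assumption-laden toy: `FeatureHistory` and `build_training_set` are hypothetical names, timestamps are plain integers, and everything is held in memory rather than in a time-series database.

```python
from bisect import bisect_right


class FeatureHistory:
    """Timestamped values of one feature, kept per entity in time order."""

    def __init__(self):
        self._times = {}   # entity_id -> sorted list of timestamps
        self._values = {}  # entity_id -> values, parallel to _times

    def write(self, entity_id, ts, value):
        times = self._times.setdefault(entity_id, [])
        values = self._values.setdefault(entity_id, [])
        i = bisect_right(times, ts)
        times.insert(i, ts)
        values.insert(i, value)

    def as_of(self, entity_id, ts):
        """Latest value written at or before ts, or None if there is none."""
        times = self._times.get(entity_id, [])
        i = bisect_right(times, ts)
        return self._values[entity_id][i - 1] if i else None


def build_training_set(label_events, histories):
    """label_events: list of (entity_id, event_ts, label); histories: name -> FeatureHistory."""
    rows = []
    for entity_id, event_ts, label in label_events:
        row = {name: h.as_of(entity_id, event_ts) for name, h in histories.items()}
        row["label"] = label
        rows.append(row)
    return rows


avg_spend = FeatureHistory()
avg_spend.write("c1", 1, 20.0)
avg_spend.write("c1", 5, 35.0)
avg_spend.write("c1", 9, 80.0)

events = [("c1", 6, 0), ("c1", 10, 1)]
rows = build_training_set(events, {"avg_spend": avg_spend})
print(rows)
```

Retraining later is the same call with a longer list of labelled events; the history guarantees that older examples still receive the values they had at the time.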

Real-Time Feature Serving

Feature stores serve a single vector of features made up of the freshest feature values to machine learning models. For example, if an application wants to recommend a particular product to a user, the model may need to know the average amount the user has spent in a particular spending category as well as the total length of time spent shopping in the last 48 hours. The Feature Store will have the most up-to-date values for those metrics immediately available for the model, instead of having to run the data pipeline to calculate them.
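At its simplest, the serving side is a key-value lookup of the latest value per feature. The `OnlineStore` class and the feature names below are hypothetical; a real deployment would sit on a low-latency database rather than a Python dictionary.

```python
class OnlineStore:
    """Keeps only the latest value of each feature per entity."""

    def __init__(self):
        self._latest = {}  # (entity_id, feature_name) -> value

    def put(self, entity_id, feature_name, value):
        # Pipelines call this whenever they compute a fresher value.
        self._latest[(entity_id, feature_name)] = value

    def get_vector(self, entity_id, feature_names, default=None):
        """Feature values in the order the model expects them."""
        return [self._latest.get((entity_id, n), default) for n in feature_names]


store = OnlineStore()
store.put("user42", "avg_spend_electronics", 120.0)
store.put("user42", "minutes_shopping_48h", 42)
store.put("user42", "minutes_shopping_48h", 57)  # a newer value overwrites the old one

vector = store.get_vector("user42", ["avg_spend_electronics", "minutes_shopping_48h"])
print(vector)  # [120.0, 57]
```

The model never triggers the aggregation itself; it reads whatever the pipelines last wrote.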

Model Monitoring

Model monitoring assumes that every prediction a model makes is stored along with the inputs the model received at that time. Comparing those inputs (gathered from the feature store) and the true labels (once they become available) against the model’s predictions then becomes a simple API call. This lets users monitor the model’s performance and track feature drift, prediction drift, and, once labels arrive, model accuracy. Because the feature store keeps current feature values up to date and stores historical values in a time-consistent manner, it provides the data that monitoring needs.
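Two of these checks can be sketched with the standard library. The functions below are illustrative assumptions: `accuracy` scores only the predictions whose labels have arrived, and `mean_shift` is a deliberately simple drift heuristic (the change in a feature's mean, measured in training standard deviations), not a standard drift statistic.

```python
from statistics import mean, pstdev


def accuracy(prediction_log, labels):
    """Share of correct predictions among those whose label has arrived.

    prediction_log: {prediction_id: predicted}; labels: {prediction_id: actual}.
    """
    scored = [pid for pid in prediction_log if pid in labels]
    if not scored:
        return None
    return sum(prediction_log[p] == labels[p] for p in scored) / len(scored)


def mean_shift(training_values, serving_values):
    """Distance between serving and training means, in training standard deviations."""
    spread = pstdev(training_values)
    gap = abs(mean(serving_values) - mean(training_values))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


predictions = {"p1": 1, "p2": 0, "p3": 1}
labels = {"p1": 1, "p2": 1}  # the label for p3 has not arrived yet
print(accuracy(predictions, labels))  # 0.5

training_spend = [10.0, 20.0, 30.0]
serving_spend = [40.0, 50.0, 60.0]
drift = mean_shift(training_spend, serving_spend)
print(round(drift, 2))
```

A team would typically alert when a score like this crosses a threshold it has chosen for that feature.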

How Do Feature Stores Improve Productivity and Performance?

Source: alexdndz/Adobe Stock

Feature stores increase the productivity of data scientists and improve the performance of ML models in an enterprise by enabling:

1. Feature Reuse

In typical data science workflows, a new project requires gathering data, transforming it into usable features, training a model, and then deploying it. Because features cannot be easily shared, multiple teams, each in its own silo, often repeat the same feature engineering work.

With a feature store, a data scientist can immediately start on a new problem by exploring the features already available. In many cases, features used for past models, or built by other data scientists, can be reused for the next machine learning project.

If the desired features aren’t there yet, the data scientist can add them, strengthening the feature store for themselves and others in the future. With each iteration the feature store becomes more valuable, accelerating data science work and easing model deployment.

2. Feature Consistency

Without a consistent way to calculate features, models can vary wildly between data silos. For example, in a retail company, one team may calculate “total customer revenue” by subtracting returns from sales, while another team calculates it using sales alone. Both are valid metrics, but if both are called “total customer revenue”, the same name yields different values in different data pipelines.

The feature store’s single feature registry provides a central location for features, where each feature is calculated in exactly one way, so there’s no more confusion.
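The revenue example is easy to reproduce. In the sketch below, two teams compute "total customer revenue" differently from the same data; the shared `FEATURES` table and the `gross_`/`net_` names are one hypothetical way to give each calculation a single, unambiguous definition.

```python
sales = [120.0, 80.0]
returns = [50.0]

# Two silos, one name, two answers.
team_a_total_customer_revenue = sum(sales) - sum(returns)  # returns subtracted
team_b_total_customer_revenue = sum(sales)                 # sales only
print(team_a_total_customer_revenue, team_b_total_customer_revenue)  # 150.0 200.0

# A shared table of definitions: each name maps to exactly one calculation.
FEATURES = {
    "gross_customer_revenue": lambda sales, returns: sum(sales),
    "net_customer_revenue": lambda sales, returns: sum(sales) - sum(returns),
}


def compute(name, sales, returns):
    """Every pipeline computes a feature through the shared definition."""
    return FEATURES[name](sales, returns)


print(compute("net_customer_revenue", sales, returns))    # 150.0
print(compute("gross_customer_revenue", sales, returns))  # 200.0
```

Both metrics survive, but a model that asks for `net_customer_revenue` gets the same number no matter which pipeline produced it.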

3. Point-in-time Correctness

The feature values used for training must be the values that were known at the time of the events the model is trained on. Otherwise, values computed after an event can leak into its training example, and the model learns from information it will not have at prediction time. Point-in-time correctness ensures that, when the model is used for prediction, its input feature values are consistent with its training. A feature store solves this problem by producing training datasets with time-consistent feature values, taken from each feature set’s history at the point in time of the events being modeled.
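A simple sanity check makes the requirement tangible: no training row should carry a feature value computed after the event it describes. The `find_leaks` helper, the row layout, and the integer timestamps are assumptions made for this sketch.

```python
def find_leaks(rows):
    """Indices of rows whose feature value was computed after the event.

    Each row needs 'event_ts' (when the event happened) and
    'feature_ts' (when the feature value was computed).
    """
    return [i for i, r in enumerate(rows) if r["feature_ts"] > r["event_ts"]]


# Naive join: every example gets the latest feature value (computed at time 9).
naive_rows = [
    {"event_ts": 6, "feature_ts": 9, "avg_spend": 80.0, "label": 0},
    {"event_ts": 10, "feature_ts": 9, "avg_spend": 80.0, "label": 1},
]

# Point-in-time join: every example gets the value known at its event time.
point_in_time_rows = [
    {"event_ts": 6, "feature_ts": 5, "avg_spend": 35.0, "label": 0},
    {"event_ts": 10, "feature_ts": 9, "avg_spend": 80.0, "label": 1},
]

print(find_leaks(naive_rows))          # [0]
print(find_leaks(point_in_time_rows))  # []
```

The first naive row trains the model on an average that did not exist yet when the event occurred, which is exactly the inconsistency a point-in-time join avoids.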

4. Model Explainability and Governance

With a feature store, you can easily identify what data a model was trained on and compare that to what data the deployed model has actually been fed. This makes iterating, training, and debugging a model much easier, because you’re able to see exactly what data you used and when. Moreover, end-to-end lineage ensures you can answer questions about why your model made certain predictions at any point in the past.
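As an illustration of that comparison, the sketch below diffs the features a model was trained on against the features it is actually served. The `lineage_diff` function and the idea of tagging each feature with a version string are assumptions for this example, not a description of any specific product.

```python
def lineage_diff(trained_on, served_with):
    """Compare {feature_name: version} recorded at training time and at serving time."""
    trained, served = set(trained_on), set(served_with)
    return {
        "missing_at_serving": sorted(trained - served),
        "unexpected_at_serving": sorted(served - trained),
        "version_mismatch": sorted(
            name for name in trained & served
            if trained_on[name] != served_with[name]
        ),
    }


trained_on = {"avg_spend_30d": "v2", "is_foreign": "v1", "txn_count_7d": "v1"}
served_with = {"avg_spend_30d": "v3", "is_foreign": "v1"}

report = lineage_diff(trained_on, served_with)
print(report)
```

An empty report means the deployed model is seeing the same feature definitions it was trained on; anything else is a concrete starting point for debugging.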

The Benefits of a Feature Store

Source: alexdndz/Adobe Stock

Data scientists are few and far between, and they don’t come cheap. Improving data science productivity by eliminating repetitive and unnecessary work means that you can produce more models in less time with your current staff.

Feature stores enable more accurate models by taking data freshness to a whole new level. By separating the data pipeline from the ML model, large aggregation-based features that may take hours to compute can be retrieved immediately when needed. This gives real-time models access to feature values they wouldn’t have otherwise. By having access to real-time data, models can predict more accurately based on what’s happening in the real world, instead of being stuck on yesterday’s data.

Feature stores are allowing enterprise AI to scale machine learning like never before. Not only do feature stores make your models as accurate as they can possibly be, they also offer your ML team an organizational structure that makes their job far easier and more enjoyable. Bring your company ahead of the competition with a feature store.

If you want to see a Feature Store in action, check out this 5-minute video.

CEO and co-founder of Splice Machine. Carnegie Mellon CS Advisory Board, NASA AI Deputy Chief, CEO Blue Martini Software, Red Pepper Software, Rocket Fuel