
Data Drift: An In-Depth Understanding

When a machine learning model is deployed in production, a main concern for data scientists is whether the model stays relevant over time. Does the model still capture the patterns in new incoming data, and does it still perform as well as it did during its design phase?

Data drift refers to unexpected and undocumented changes to data structure, semantics, and infrastructure that result from modern data architectures.

Drift is a key issue because machine learning often relies on a key assumption: the past == the future. In the real world, this is very rarely the case. As a result, it’s critical to understand how changes in the data will affect the model’s behavior both before a model is deployed and on an ongoing basis during deployment.

When we see cases like the one in the figure above, where class predictions are inaccurate, the model outputs have taken a significant turn in a particular direction. This often leads to several questions:

What in the data is causing this shift?

How does this affect the quality of my model?

Do I need to change my model in any way?

This problem is aggravated in situations where ground truth data isn’t instantaneously available, making it hard to measure model accuracy before it’s too late. This is especially true in areas such as fraud, credit risk, and marketing, where the outcomes of a prediction can only be measured after a certain lag.

Why does drift happen?

The world and the context in which the model is applied keep changing. Some examples of how the data a model is applied to can differ significantly from the data it was trained on are:

  • The external world has changed: These can be external events such as the pandemic and interest rate shifts, or more internal events such as a breakdown in model pipelines leading to data quality issues.
  • The model is applied to a new context: For example, a language model trained on a sample of articles is then applied to news articles.
  • The training data was drawn from a different set in the first place: This can be due to sample selection biases. Sometimes these biases are unavoidable. For example, for loan applications, ground truth is only available for individuals who were actually granted loans.

What are the different kinds of drift?

When talking about drift, several different closely related terms may come up: covariate shift, concept drift, data drift, model drift, model decay, dataset shift, distribution shift. These terms refer to different assumptions about what’s changing.

Covariate Shift (Shift in the independent variables):

Covariate shift is a change in the distribution of one or more of the independent variables (input features). This means that, due to some environmental change, the distribution of feature X has changed even though the relationship between feature X and target Y remains unchanged. The graph below may help understand this better.

The example above: Due to the pandemic, many businesses closed or saw their revenues decrease and had to reduce staff, but they decided to keep paying their loans because they were afraid that the bank might seize their assets. Performance degradation will be more apparent when this sort of shift happens in one or more of the top contributing variables of a model.

Covariate shift is the situation where

Ptrain(Y|X) = Ptest(Y|X) but Ptrain(X) ≠ Ptest(X)

where Ptest is the distribution of the data the model sees after it has been deployed.
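A minimal sketch of how a shift in the distribution of a single feature could be quantified. It uses the two-sample Kolmogorov-Smirnov statistic, one common choice that is not prescribed by this article, implemented here with the standard library only. The feature name and all the numbers are made up for illustration.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in a + b:
        # Fraction of each sample that is <= value.
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical feature: monthly revenue in arbitrary units.
train_revenue = [rng.gauss(100, 15) for _ in range(2000)]
same_revenue = [rng.gauss(100, 15) for _ in range(2000)]    # P(X) unchanged
shifted_revenue = [rng.gauss(70, 15) for _ in range(2000)]  # P(X) has moved

ks_no_shift = ks_statistic(train_revenue, same_revenue)
ks_shift = ks_statistic(train_revenue, shifted_revenue)
print(f"KS without shift: {ks_no_shift:.3f}")
print(f"KS with shift:    {ks_shift:.3f}")
```

A value close to 0 suggests the feature is distributed as it was at training time, while a large value points to covariate shift in that feature. What counts as "large" depends on the sample size and the use case.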

Prior Probability Shift (Shift in the target variable):

With prior probability shift, the distribution of the input variables remains the same but the distribution of the target variable changes. For example, that could look something like this:

Companies that were not really affected by the lockdown and did not suffer any revenue losses deliberately chose not to repay their loan installments, to take advantage of government subsidies and perhaps save that money in case the situation worsens for them in the future (same X distribution but different Y).

Prior probability shift is the situation where

Ptrain(X|Y) = Ptest(X|Y) but Ptrain(Y) ≠ Ptest(Y)

where Ptest is the distribution of the data the model sees after it has been deployed.
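Because P(X|Y) is assumed unchanged in this setting, Bayes' rule implies that a classifier's predicted probabilities can be re-weighted by the ratio of new to old class priors. The sketch below illustrates that re-weighting. It assumes the new priors are known or have been estimated separately, and the function name, labels, and numbers are all hypothetical.

```python
def adjust_for_new_priors(posteriors, train_priors, new_priors):
    """Re-weight class probabilities predicted under the training priors so
    that they reflect new class priors, assuming P(X|Y) has not changed."""
    weighted = {
        label: posteriors[label] * new_priors[label] / train_priors[label]
        for label in posteriors
    }
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}


# Hypothetical numbers: defaults made up 10% of the training data but are
# believed to make up 30% of current loans.
train_priors = {"repaid": 0.9, "default": 0.1}
new_priors = {"repaid": 0.7, "default": 0.3}

# Model output for one applicant, produced under the training priors.
model_output = {"repaid": 0.8, "default": 0.2}

adjusted = adjust_for_new_priors(model_output, train_priors, new_priors)
print(adjusted)
```

With these numbers the adjusted probability of default rises well above the model's original 0.2, reflecting that defaults have become more common overall.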

Concept Shift

With concept drift the relationships between the input and output variables change. This means that the distributions of input variables (such as user demographics, frequency of words, etc.) might even remain the same and we must instead focus on the changes in the relationship between X and Y.

More formally, concept shift is the situation where

Ptrain(Y|X) ≠ Ptest(Y|X) and Ptrain(X) = Ptest(X) in X → Y problems

Ptrain(X|Y) ≠ Ptest(X|Y) and Ptrain(Y) = Ptest(Y) in Y → X problems

where Ptest is the distribution of the data the model sees after it has been deployed.
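The toy sketch below illustrates the first case: the inputs are identical before and after the shift, so P(X) is unchanged, but the rule linking X to Y moves. The thresholds and the perfect "model" are assumptions chosen only to make the effect easy to see.

```python
def accuracy(predict, inputs, labels):
    """Share of inputs for which the prediction matches the label."""
    hits = sum(predict(x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)


# The same inputs are used before and after the shift, so P(X) is unchanged.
inputs = [i / 1000 for i in range(1000)]

# Hypothetical concepts: the X -> Y rule moves from "x > 0.5" to "x > 0.7".
labels_before = [int(x > 0.5) for x in inputs]
labels_after = [int(x > 0.7) for x in inputs]


def model(x):
    """A model that learned the original concept perfectly."""
    return int(x > 0.5)


acc_before = accuracy(model, inputs, labels_before)
acc_after = accuracy(model, inputs, labels_after)
print(acc_before, acc_after)
```

The model is perfect on the original concept, yet it misclassifies every input between the old and the new threshold once the concept has moved. Monitoring the input distribution alone would not reveal this, because the inputs have not changed.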

Concept drift is more likely to appear in domains that are dependent on time, such as time series forecasting and data with seasonality. A model trained on the data of one month may not generalize to another month.

There are a few different ways in which concept drift might show up:

Gradual Concept Drift

Gradual or incremental drift is concept drift that we can observe over time and can therefore expect. As the world changes, our model gradually becomes outdated, resulting in a gradual decline in its performance.

Some examples of gradual concept drift are:

  • Launch of alternative products: Products that weren’t available during the training period (for example, if the product was the only one of its kind in the market) can cause unforeseen effects on the model since it has not seen similar trends before.
  • Economic changes: Changes in interest rates, and their effect on how many borrowers default on their loans, can change the relationship the model has learned.

The effect of situations like these can add up over time to cause a more dramatic drift effect.

Sudden Concept Drift

As the name suggests, these concept shifts happen suddenly and by surprise. Some of the most apparent examples came when COVID-19 first struck on a global scale. Demand forecasting models were heavily affected, and supply chains couldn’t keep up. But such changes can also happen during the regular operation of a company when there is no pandemic:

  • Major change to a road network: The sudden opening of new roads and closing of others or the addition of a new public railway system may cause trouble for a traffic prediction model until it has collected some data to work with as it has never seen a similar configuration before.
  • New equipment added to a production line: New equipment introduces new problems and reduces old ones, so the model will be unsure of how to make good predictions.

In general, any major change in the environment that throws the model into unfamiliar territory will cause performance degradation.
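When ground truth labels are available, one simple way to notice such degradation is to track the error rate over a sliding window of recent predictions. The sketch below is only an illustration: the window size, the threshold, and the synthetic error streams are assumptions, and the stepwise stream is a crude stand-in for gradual drift.

```python
from collections import deque


def first_alarm(errors, window=50, threshold=0.3):
    """Return the index at which the error rate over the last `window`
    predictions first exceeds `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for index, error in enumerate(errors):
        recent.append(error)
        if len(recent) == window and sum(recent) / window > threshold:
            return index
    return None


# Hypothetical error streams (1 = wrong prediction, 0 = correct).
stable = [0] * 300
sudden = [0] * 200 + [1] * 100  # every prediction is wrong from index 200

quarter_wrong = [1 if i % 4 == 0 else 0 for i in range(100)]  # 25% errors
half_wrong = [1 if i % 2 == 0 else 0 for i in range(100)]     # 50% errors
stepwise = [0] * 100 + quarter_wrong + half_wrong

print(first_alarm(stable))
print(first_alarm(sudden))
print(first_alarm(stepwise))
```

A larger window gives a steadier estimate of the error rate but reacts more slowly to a sudden change, so the choice is a trade-off between false alarms and detection delay.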

Recurring Concept Drift

Recurrent concept drift is pretty much “seasonality”. But seasonality is common in machine learning with time series data and is something we are aware of. So if we expect this sort of drift, for example, a different pattern on weekends or certain holidays of the year, we just need to make sure that we train the model with data representing this seasonality. This sort of data drift usually becomes a problem in production only if a new pattern develops that the model is unfamiliar with.
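One way to keep an expected seasonal pattern from being mistaken for drift is to compare new data with a baseline built from the matching season. The sketch below does this for a weekend versus weekday split. The demand figures, the tolerance, and the function names are hypothetical, and a real setup would likely need finer-grained seasons.

```python
from datetime import date, timedelta
from statistics import mean


def is_weekend(day):
    """Saturday and Sunday have weekday() values 5 and 6."""
    return day.weekday() >= 5


def seasonal_baseline(history):
    """Average value per season bucket (weekend vs. weekday)."""
    buckets = {True: [], False: []}
    for day, value in history:
        buckets[is_weekend(day)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def is_unusual(day, value, baseline, tolerance=0.2):
    """Flag a value that deviates from its own bucket's average by more
    than `tolerance`, expressed as a fraction of that average."""
    expected = baseline[is_weekend(day)]
    return abs(value - expected) > tolerance * expected


# Hypothetical history: four weeks of daily demand, higher on weekends.
start = date(2021, 6, 7)  # a Monday
history = []
for offset in range(28):
    day = start + timedelta(days=offset)
    history.append((day, 150 if is_weekend(day) else 100))

baseline = seasonal_baseline(history)
saturday = date(2021, 7, 10)
print(is_unusual(saturday, 150, baseline))  # usual weekend level
print(is_unusual(saturday, 100, baseline))  # weekday level on a weekend
```

The usual weekend level is not flagged even though it differs from the weekday average, while a weekend value that looks like a weekday is flagged as a pattern the baseline has not seen.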

Conclusion

In practice, identifying the exact type of data drift is less important. Often, the drift is subtle and a combination of these types. What matters is identifying the impact on model performance and catching the drift in time so that actions such as retraining the model can be taken early on.

Once you suspect that data drift is going on, how can you verify this and identify where in the data the drift is occurring? My colleague, Sarjhana Ragunathan Brindha, will cover this topic and more in a soon-to-follow article.
