Validation And Evaluation Of Model Drift

Model Drift is the worsening of a model’s performance caused by changes in the data and in the relationships between input and output variables. Businesses should track their model performance over time in order to detect and minimise drift, which otherwise degrades the quality of the model’s predictions.

Model drift can be classified into two broad categories:

  1. Concept drift – The properties of the dependent (target) variable change, which changes the relationship between the independent and dependent variables.
  2. Data drift – The statistical properties of the independent variable(s) change(s).

More precisely, Machine Learning models deployed in industry are often trained on one dataset, referred to as the ‘Development/Training Dataset’, and then used to make predictions on another dataset, called the ‘Production/Review Dataset’. Model drift indicates that the model does not perform robustly on the Review dataset.

Difference between Data Drift and Concept Drift in practice

Data drift concerns changes in the distribution of the datasets, particularly of the independent variables (features); it does not take the model or the model’s performance into consideration. Concept drift concerns the distribution of the target variable and/or the predicted probabilities of the model.

Methods to verify and measure Data and Concept Drift

There are three main techniques to measure drift:

  1. Statistical: This approach can be used to detect/verify both Data and Concept Drift. It uses various statistical metrics on the datasets to conclude whether the distribution of the training data is different from the production data distribution.
  2. Model Based: This approach involves training a classification model to determine whether certain data points are similar to another set of data points and hence can be used for verifying Data Drift. If the model has a hard time differentiating between the data, then there isn’t any significant data drift. If the model is able to correctly separate the data groups, then there likely is some data drift.
  3. Adaptive Windowing (ADWIN): This approach detects Concept Drift in streaming data, where a large and potentially unbounded volume of data flows in and storing all of it is infeasible.

Statistical Approach

Many statistical approaches compare distributions, and a number of statistical distance measures can quantify the change between them; different measures suit different situations. A comparison requires a reference distribution: the fixed data distribution against which the production data distribution is compared. For example, this could be the first month of the training data, or the entire training dataset, depending on the context and the timeframe over which you are trying to detect drift. In either case, the reference distribution should contain enough samples to represent the training dataset.

One of the first and simplest measures we can look at is the average of each feature. If the mean gradually shifts in a particular direction over the months, then data drift is probably happening. That said, the mean is not the most reliable check for drift, because a distribution can change shape without its mean moving, but it is a good starting point.
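A minimal sketch of this check, using only the Python standard library. The function name `standardized_mean_shift`, the feature values, and the monthly batches are all hypothetical; expressing the shift in units of the reference standard deviation is one convenient choice, not something the approach requires.

```python
from statistics import mean, stdev

def standardized_mean_shift(reference, batch):
    """Batch mean minus reference mean, in units of the reference standard deviation."""
    return (mean(batch) - mean(reference)) / stdev(reference)

# Hypothetical values of one feature in the reference data and in three monthly batches.
reference = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0]
monthly_batches = {
    "month_1": [10.5, 11.5, 11.0, 10.0, 12.0],
    "month_2": [11.5, 12.5, 12.0, 11.0, 13.0],
    "month_3": [12.5, 13.5, 13.0, 12.0, 14.0],
}

shifts = {
    name: standardized_mean_shift(reference, batch)
    for name, batch in monthly_batches.items()
}
for name, shift in shifts.items():
    print(f"{name}: mean shifted by {shift:+.2f} reference standard deviations")
```

A shift that keeps growing in the same direction from month to month is the pattern described above.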

Population Stability Index (PSI)

PSI is often used in the finance industry and is a metric to measure how much a variable has shifted in distribution between two samples or over time. As the name suggests it helps measure the population stability between two population samples. It is calculated by bucketing the two distributions and comparing the percentages of items in each of the buckets, resulting in a single number you can use to understand how different the populations are.

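PSI sums, over all buckets, the quantity (Actual% - Expected%) × ln(Actual%/Expected%), where Expected% is the share of the reference sample that falls in a bucket and Actual% is the share of the compared sample. Because of the ln(Actual%/Expected%) term, a change of a given size in a bin that holds a small percentage of the distribution has a larger impact on PSI than the same change in a bin that holds a large percentage.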

It should be noted that the population stability index simply indicates changes in the feature. However, this may or may not result in deterioration in performance. So if you do notice a performance degradation, you could use PSI to confirm that the population distributions have indeed changed.
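The bucketing and comparison described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the names `psi`, `n_bins` and `eps` are hypothetical, the buckets are equal-width bins over the range of the reference sample, and empty buckets are floored at a small `eps` so that the logarithm stays defined.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a compared sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all reference values are equal

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # out-of-range values go to the edge bins
            counts[idx] += 1
        # Assumption: floor empty buckets at eps to avoid log(0) and division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected_pct = bin_fractions(expected)
    actual_pct = bin_fractions(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_pct, actual_pct)
    )

# Hypothetical feature values: a reference sample and a shifted production sample.
training = [i / 100 for i in range(100)]
production = [v + 0.3 for v in training]
print(psi(training, training))    # identical samples
print(psi(training, production))  # shifted sample gives a larger value
```

Because the bucket edges come from the reference sample, the choice of reference sample and of `n_bins` affects the score, so the same settings should be used whenever scores are compared over time.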

Kullback-Leibler (KL) divergence

KL divergence measures the difference between two probability distributions and is also known as the relative entropy. KL divergence is useful when one distribution has a high variance relative to the other.

KL divergence is also not symmetric. This means that, unlike PSI, the result changes if the reference and production (compared) distributions are swapped, so KL(P || Q) != KL(Q || P). This makes it useful for applications involving Bayes’ theorem, or when you have a large number of training (reference) samples but only a small set of samples (resulting in more variance) in the comparison distribution.

A KL score can range from 0 to infinity, where a score of 0 means that the two distributions are identical. If the logarithm in the KL formula is taken to base 2, the result is in bits; if the natural logarithm (base e) is used, the result is in “nats”.
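For two discrete distributions over the same bins, KL(P || Q) is the sum over bins of P(x) * log(P(x) / Q(x)). The sketch below computes it for two small, made-up distributions to show the asymmetry and the effect of the logarithm base. The function name `kl_divergence` and the example probabilities are assumptions for illustration.

```python
import math

def kl_divergence(p, q, base=2):
    """KL(P || Q) for discrete distributions given as probabilities over the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a bin with P(x) = 0 contributes nothing
        if qi == 0:
            return math.inf  # P has mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total

# Hypothetical binned distributions.
reference = [0.5, 0.5]
production = [0.9, 0.1]

print(kl_divergence(reference, production))               # in bits
print(kl_divergence(production, reference))               # different value: not symmetric
print(kl_divergence(reference, production, base=math.e))  # same comparison, in nats
```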

Jensen-Shannon (JS) Divergence

The JS divergence is another way to quantify the difference between two probability distributions. It uses the KL divergence that we saw above to calculate a normalized score that is symmetrical. This makes JS divergence score more useful and easier to interpret as it provides scores between 0 (identical distributions) and 1 (maximally different distributions) when using log base 2.

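With JS there are no divide-by-zero issues. Such issues arise when one distribution has values in regions where the other has none. JS avoids them because it compares each distribution with the average of the two, which is non-zero wherever either distribution is non-zero.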
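A small sketch of the JS divergence for discrete distributions, built from KL divergence against the average (mixture) of the two distributions. The function names and the example probabilities are hypothetical. Log base 2 is used so that the score stays between 0 and 1.

```python
import math

def _kl_bits(p, q):
    """KL(P || Q) in bits; bins where P is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # Wherever p or q is positive the mixture is positive, so no division by zero occurs.
    return 0.5 * _kl_bits(p, mixture) + 0.5 * _kl_bits(q, mixture)

# Hypothetical binned distributions.
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(js_divergence([0.5, 0.5], [0.9, 0.1]))  # partial overlap
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # no overlap at all
```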

Wasserstein distance metric

The Wasserstein distance, also known as the Earth Mover’s distance, is a measure of the distance between two probability distributions over a given region. It is useful when the distribution of a numerical variable moves so far that the two distributions no longer overlap, and in higher-dimensional spaces, for example images.
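For one-dimensional samples of equal size, the Wasserstein distance reduces to the average absolute difference between the sorted values. The sketch below relies on that special case; the function name `wasserstein_1d` and the data are hypothetical, and samples of unequal size or higher dimension need a more general method.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein (Earth Mover's) distance between two equal-size 1-D samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this simplified sketch assumes samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)

# Hypothetical feature values: neither production sample overlaps the reference sample.
reference = [0.0, 1.0, 2.0]
print(wasserstein_1d(reference, [5.0, 6.0, 7.0]))
print(wasserstein_1d(reference, [10.0, 11.0, 12.0]))
```

Both production samples are disjoint from the reference sample, yet the second one scores higher because it has moved further away. This is the behaviour that makes the metric informative for non-overlapping distributions.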

Kolmogorov–Smirnov test (K–S test or KS test)

The KS test is a nonparametric test of the equality of continuous (or discontinuous), one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
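The two-sample KS statistic is the largest vertical gap between the empirical cumulative distribution functions (CDFs) of the two samples. A dependency-free sketch of that statistic is shown below; the function name `ks_statistic` and the data are hypothetical, and the sketch does not compute a p-value, for which a statistics library such as SciPy would normally be used.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical feature values.
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # same sample
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # partial overlap
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # no overlap
```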

Model Based Approach

An alternative to the statistical methods explored above is to build a classifier model to try and distinguish between the reference and compared distributions. We can do this using the following steps:

  1. Tag the data from the batch used to build the current production model as 0.
  2. Tag the batch of data that we have received since then as 1.
  3. Develop a model to discriminate between these two labels.
  4. Evaluate the results and adjust the model if necessary.

If the developed model can easily discriminate between the two sets of data, then a covariate shift has occurred and the model will need to be recalibrated. On the other hand, if the model struggles to discriminate between the two sets of data (its accuracy is around 0.5, which is as good as a random guess), then a significant data shift has not occurred and we can continue to use the model.
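The steps above can be sketched with a very simple classifier. To stay dependency-free, this example uses a nearest-centroid classifier, which is an assumption made for illustration; any classifier could take its place. The function name `domain_classifier_accuracy`, the 50/50 train/hold-out split, and the synthetic data are also hypothetical.

```python
import random

def domain_classifier_accuracy(reference_rows, recent_rows, seed=0):
    """Label reference rows 0 and recent rows 1, fit a nearest-centroid
    classifier on half of the rows, and return its accuracy on the other half."""
    rows = [(x, 0) for x in reference_rows] + [(x, 1) for x in recent_rows]
    random.Random(seed).shuffle(rows)
    split = len(rows) // 2
    train, held_out = rows[:split], rows[split:]

    def centroid(label):
        points = [x for x, y in train if y == label]
        return [sum(column) / len(points) for column in zip(*points)]

    centroids = {label: centroid(label) for label in (0, 1)}

    def predict(x):
        def squared_distance(c):
            return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        return min(centroids, key=lambda label: squared_distance(centroids[label]))

    correct = sum(predict(x) == y for x, y in held_out)
    return correct / len(held_out)

# Synthetic two-feature data, for illustration only.
rng = random.Random(42)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
similar = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(300)]

print(domain_classifier_accuracy(reference, similar))  # close to 0.5: no clear drift
print(domain_classifier_accuracy(reference, shifted))  # close to 1.0: drift
```

Accuracy is measured on held-out rows so that a classifier which merely memorises its training rows does not look like evidence of drift.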

ADWIN (Adaptive Sliding Window) Approach

As mentioned earlier, the techniques above are not well suited to a stream of input data. In some scenarios, the data may drift so quickly that by the time we have collected data and trained a model, the trends have changed and the model is out of date. Choosing the best time frame to train on is not straightforward either. This is where the Adaptive Sliding Window technique helps, because it adapts to the changing data: while a change is taking place, the window shrinks automatically, and while the data is stationary, the window grows to improve accuracy.
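The shrinking and growing behaviour can be illustrated with a deliberately simplified version of ADWIN. This sketch is not the full algorithm: it stores every value in the window and checks every split point, whereas ADWIN proper compresses the window into buckets to save memory. The class name `SimpleAdwin`, the default `delta`, and the assumption that values lie in [0, 1] (for example a per-prediction error of 0 or 1) are illustrative choices.

```python
import math
from collections import deque

class SimpleAdwin:
    """Simplified adaptive window over a stream of values in [0, 1]."""

    def __init__(self, delta=0.002):
        self.delta = delta     # confidence parameter: smaller means fewer false alarms
        self.window = deque()

    def update(self, value):
        """Add one value; return True if old data was dropped because drift was detected."""
        self.window.append(value)
        drift = False
        while self._should_cut():
            self.window.popleft()  # shrink: discard the oldest value
            drift = True
        return drift

    def _should_cut(self):
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)
        head_sum = 0.0
        for i, value in enumerate(self.window, start=1):
            if i == n:
                break
            head_sum += value
            n0, n1 = i, n - i
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-style size of the two halves
            threshold = math.sqrt(log_term / (2.0 * m))
            # Cut if the older and newer parts have clearly different means.
            if abs(head_sum / n0 - (total - head_sum) / n1) > threshold:
                return True
        return False

# Hypothetical stream: the monitored value jumps from 0.2 to 0.8 halfway through.
detector = SimpleAdwin()
stream = [0.2] * 200 + [0.8] * 200
flags = [detector.update(v) for v in stream]
print("first drift flagged at index", flags.index(True))
print("window length at the end", len(detector.window))
```

While the stream is stationary the window keeps growing. After the jump, the older values are dropped so that the window mostly reflects the new regime.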

Each of these techniques can be used to detect drift where it is suitable. But we do not want to wait until we notice significant performance degradation in our model before we start investigating data drift. That’s where having a model monitoring plan becomes very useful.

At Predactica, we offer a platform to monitor and manage your Machine Learning/AI models. We detect and verify Data Drift and Concept Drift in your model using metrics such as PSI, Kullback-Leibler Divergence and Jensen-Shannon Divergence to help you ensure that the model runs smoothly and efficiently in production.