Monitoring Forecasting Models at Scale

Predicto
Feb 22, 2021

Machine Learning and Deep Learning model predictions are used in applications running in your laptop, your desktop, your phone and even in your car or in your home. The list keeps growing as more smart devices enter our homes and lives, like smart vacuums, fridges and anything else you can imagine.

Performance monitoring is critical, as it allows us to pinpoint issues early, debug, and update models as needed.


At Predicto, we faced a unique challenge as we started scaling. We track a large number of stocks, and we experiment with different Deep Learning models. Those models can have different architectures, different features, and even different ensemble combinations. In the future we plan to track ETFs and Cryptocurrencies as well.

As you can imagine, this can quickly get out of hand.

Because of the nature of the stock market and changing market conditions, forecasting accuracy on past data is not exactly what we are after. Being accurate on past data doesn’t guarantee future success.

We are after patterns.

Monitoring all our models daily allows us to identify the ones that have been performing well lately. Then we can zoom in and understand why, using our explainability platform.

We know we can’t predict the future. But we are convinced we can identify patterns early.

So how can you tell if a model has been performing well lately?

Monitoring Metrics

First, let’s define what we mean by “well” and by “lately”.

The “lately” factor is adjustable. Currently, we aggregate all our metrics over each model’s performance in the last 15 days.
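To make this concrete, here is a minimal sketch of what such a rolling aggregation could look like, assuming daily per-model errors are logged in a pandas DataFrame (the column names and values are illustrative, not our actual schema):

```python
import pandas as pd

# Illustrative daily error log: one row per model per day.
# Column names and values are assumptions, not our actual schema.
daily = pd.DataFrame({
    "model_id": ["vrsk_a", "vrsk_a", "aapl_b", "aapl_b"],
    "date": pd.to_datetime(["2021-02-19", "2021-02-20",
                            "2021-02-19", "2021-02-20"]),
    "mape": [0.012, 0.015, 0.031, 0.027],
})

WINDOW_DAYS = 15  # the adjustable "lately" factor

cutoff = daily["date"].max() - pd.Timedelta(days=WINDOW_DAYS)
recent = daily[daily["date"] > cutoff]

# One aggregated score per model over the window, best first.
summary = recent.groupby("model_id")["mape"].mean().sort_values()
print(summary)
```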

To tell whether a model has been performing well, we experiment and don’t rely on a single metric. After all, why use one metric when you can use several that complement each other and serve different purposes?

Here is our metric list: MAPE / R² / DTW / STAR. Let’s go through them one by one.

Mean absolute percentage error (MAPE) is a common metric that does exactly what its name suggests: it averages the percentage errors between a forecast and the actual values. It helps us identify forecasts that are spot on and that capture actual price fluctuations.

Forecast with small MAPE. Model forecast is very close to actual price movement.
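For reference, MAPE is simple to compute by hand; here is a small NumPy sketch (scikit-learn also ships a mean_absolute_percentage_error helper as of v0.24). The prices are made up for illustration:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error between actual and forecast values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)))

# A 5-day forecast (made-up prices) that tracks the actual series closely.
actual = [100.0, 101.5, 99.8, 102.2, 103.0]
forecast = [100.4, 101.0, 100.5, 101.8, 103.6]
print(f"MAPE: {mape(actual, forecast):.2%}")  # ~0.51%
```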

Coefficient of determination (R²) is another important metric. The Wikipedia explanation states that it “is the proportion of the variance in the dependent variable that is predictable from the independent variable(s)”. The Investopedia explanation states that it “assesses how strong the linear relationship is between two variables”. To simplify, we use it to understand how close our forecast is to the fitted regression line of the actual price. It helps identify models that predict the trend of a price movement and it complements MAPE in our case. Trend prediction is very valuable to us.

Forecast with high R² and high MAPE. Model predicted the trend but not the price fluctuations.
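And here is a quick sketch of R² using scikit-learn’s r2_score, alongside the textbook definition 1 - SS_res / SS_tot for comparison (numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up example: the forecast catches the upward trend
# but misses the day-to-day wiggles.
actual = np.array([100.0, 102.0, 101.0, 104.0, 106.0])
forecast = np.array([100.5, 101.5, 102.5, 103.5, 104.5])  # smooth trend

print(f"R² (sklearn): {r2_score(actual, forecast):.3f}")

# The same value from the textbook definition: 1 - SS_res / SS_tot.
ss_res = np.sum((actual - forecast) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
print(f"R² (by hand): {1 - ss_res / ss_tot:.3f}")
```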

Dynamic Time Warping (DTW) measures the similarity between two time series that may vary in speed. The comparison works by choosing the shortest-distance pair within a moving window. By using this metric we choose to relax our evaluation a bit: if a forecast predicted a +2% movement for today but the +2% movement actually happens tomorrow, we still consider it a success. In short, we allow small lags in our forecasts, and that allowed lag can be adjusted.

Example of two time series matched using the DTW algorithm (Source: Wikipedia)
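For the curious, below is a compact plain-NumPy sketch of DTW, with a band constraint playing the role of the moving window mentioned above (libraries such as dtaidistance offer optimized implementations). The example mirrors the +2% lag scenario:

```python
import numpy as np

def dtw_distance(a, b, window=2):
    """DTW distance with a band of `window` steps, limiting how far the
    alignment may lag or lead (the "small lags allowed" idea above)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

actual = [0.0, 0.0, 2.0, 2.0, 2.0]    # +2% move happens on day 3
forecast = [0.0, 2.0, 2.0, 2.0, 2.0]  # model called it for day 2
print(dtw_distance(actual, forecast))  # 0.0: the one-day lag is absorbed
```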

Simulated Trading Average Return (STAR) is our own metric, derived from our simulated trading environment. What it does is pretty simple: given a model’s price forecast, it generates a trade, tracks it, and exits the position once the price hits the predicted target or a stop-loss value, recording the return. The trade is generated by fitting a regression line to our forecast and using a percentage change from this trend line as the target price. This lets us track the average historical return of all our models.
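Until the dedicated STAR article is out, here is only a rough sketch of the idea; the long-only entry rule and the 3% stop loss below are illustrative assumptions, not necessarily our production logic:

```python
import numpy as np

def star_trade_return(forecast, actual, stop_loss=-0.03):
    """One simulated long trade in the spirit of STAR.

    Entry, sizing and exit details are illustrative assumptions.
    """
    days = np.arange(len(forecast))
    # Fit a regression line to the forecast; its end-to-end percentage
    # change becomes the target move.
    slope, intercept = np.polyfit(days, forecast, deg=1)
    trend = slope * days + intercept
    target_pct = (trend[-1] - trend[0]) / trend[0]

    entry = actual[0]
    for price in actual[1:]:
        ret = (price - entry) / entry
        if ret >= target_pct:  # predicted target hit: take profit
            return ret
        if ret <= stop_loss:   # stop loss hit: cut the position
            return ret
    return (actual[-1] - entry) / entry  # window ended: exit at close

# Averaging such returns across models and dates yields the STAR metric.
forecast = np.array([100.0, 101.0, 102.0, 103.0, 104.0])
actual = np.array([100.0, 102.0, 104.5, 103.0, 105.0])
print(f"trade return: {star_trade_return(forecast, actual):.2%}")  # 4.50%
```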

OK. Now that we’ve introduced our metrics, let’s see how they all come into play and work together.

In action

We update our metrics daily for more than 500 Deep Learning models.

Models with good recent performance on the above metrics stay on our watch list. Here is one of our monitoring dashboards:

An overview of aggregated performance statistics of a subset of models

In the list above, for example, we can see that a model for the VRSK stock performs well on its first 5 days of predictions when we aggregate its recent daily forecasting errors. In general, models reporting small errors on stocks with higher price volatility always make an interesting case. Let’s zoom in and have a closer look.

Zooming in on VRSK model recent performance

Indeed, that model caught the upward price movement. We can zoom in even further and understand why the model predicted this movement. We won’t go into more detail here, as we have already written a blog post on Explaining Financial Deep Learning Forecasts. We encourage curious readers to have a look.

We are also able to identify individual recent forecasts based on the MAPE and R² metrics mentioned above.

Identifying individual forecasts based on a specific metric
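Conceptually, this lookup is just a filter over a per-forecast metric table. A pandas sketch with illustrative column names and thresholds:

```python
import pandas as pd

# Illustrative per-forecast metric table; names and thresholds are assumptions.
metrics = pd.DataFrame({
    "ticker": ["VRSK", "AAPL", "MSFT", "TSLA"],
    "mape": [0.008, 0.031, 0.015, 0.052],
    "r2": [0.91, 0.40, 0.85, 0.10],
})

# Low MAPE and high R²: forecasts that matched both level and trend.
picks = metrics[(metrics["mape"] < 0.02) & (metrics["r2"] > 0.8)]
print(picks.sort_values("mape"))
```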

What’s next

There is a lot more to cover but we will stop here for now. We plan to write a dedicated article for our Simulated Trading Average Return (STAR) metric and reveal more details in the near future.

If you enjoyed this article and want to learn more about our work, feel free to check out our site, our Medium page, and our blog.

Stay safe — Stay tuned!

