MAPE Madness


Spoiler: RTFM

Problem setup: You want to use the Mean Absolute Percentage Error (MAPE) as your loss function for training Linear Regression on some forecast data. MAPE (see Springer: Mean Absolute Percentage Error (MAPE)) has found success in forecasting because it has desirable properties:

  • robust to outliers

  • scale-invariant (it returns a percentage), which makes it intuitive to compare across datasets.

0) Setup

You have forecasting data where a significant difference may exist between contiguous samples.

\[T_1 = 5, T_2 = 5000\]

For example, you want to predict the price of Bitcoin, or ensure that your power plants can supply sufficient power when England brews up its World Cup tea-time surge.

We reproduce the equation below:

\[\text{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left|\frac{y_t - \hat{y}_t}{y_t}\right|\]
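As a quick illustration, the formula can be sketched in NumPy. The function name `mape` and the forecast values are my own assumptions for the example; the two targets reuse \(T_1 = 5\) and \(T_2 = 5000\) from above.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.15 means 15%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([5.0, 5000.0])
y_pred = np.array([4.0, 5500.0])  # hypothetical forecasts

# Per-sample relative errors are 0.2 and 0.1, so the mean is 0.15.
score = mape(y_true, y_pred)

# Rescaling targets and forecasts by the same factor leaves the score unchanged.
scaled_score = mape(1000 * y_true, 1000 * y_pred)
```

Note that an absolute miss of 1 on the small target costs more than a miss of 500 on the large one, which is exactly the behaviour you want when contiguous samples differ by orders of magnitude.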

1) Failed Attempts

Here’s hoping you learn from my mistakes and can avoid the time I wasted trying to solve this problem.

1.1) Sklearn

A quick look at the Sklearn Linear Model - Linear Regression page tells you that it only supports OLS. This is unfortunate because sklearn is, in general, heavily optimized and well tested.

1.2) Autograd

Having worked through the examples, it was still not clear to me how to handle enormous datasets, which is what I was modeling at the time. What I needed was a way to generate indices to pass in for minibatch training. After much searching, I eventually found what I was looking for in the Convnet Example, which shows you how to pass minibatches in.
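The minibatch idea can be sketched roughly as follows. This is my own minimal version and not code copied from the autograd example; the helper name `batch_slice` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def batch_slice(step, n_samples, batch_size):
    """Return the slice of rows to use at optimisation step `step`."""
    n_batches = int(np.ceil(n_samples / batch_size))
    idx = step % n_batches  # wrap around once every batch has been seen
    start = idx * batch_size
    stop = min(start + batch_size, n_samples)
    return slice(start, stop)

X = np.arange(20.0).reshape(10, 2)  # 10 toy samples with 2 features
y = np.arange(1.0, 11.0)

# Slices for the first four optimisation steps with a batch size of 4.
batches = [batch_slice(step, len(y), batch_size=4) for step in range(4)]
X_batch, y_batch = X[batches[2]], y[batches[2]]  # the last, shorter batch
```

Inside the optimisation loop, the objective would then be evaluated on `X[batch_slice(step, ...)]` and the matching labels, so only one batch is touched per step.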

Note: make sure that none of your y_true values are 0, as this leads to division-by-zero errors in the optimization. I suggest doing:

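import autograd.numpy as np  # autograd's NumPy wrapper keeps the loss differentiable

def objective(params, X, y):
    # Assumption: a plain linear model, matching the Linear Regression setup
    pred = np.dot(X, params)
    non_zero_mask = y != 0  # skip samples that would divide by zero
    return np.mean(np.abs((y[non_zero_mask] - pred[non_zero_mask]) / y[non_zero_mask]))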

Another option would be to add weights to the objective function, since you could be extremely unlucky and draw a batch in which all your labels, `y`, are 0, leaving the objective with no valid samples. Additionally, you may want to weight different samples more or less heavily.
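A weighted variant might look something like the sketch below. The function name `weighted_mape`, the choice to raise an error when no usable sample remains, and the example numbers are all my own assumptions, not what I ran at the time.

```python
import numpy as np

def weighted_mape(y_true, y_pred, weights=None):
    """Weighted MAPE over the samples whose label is non-zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if weights is None:
        weights = np.ones_like(y_true)
    weights = np.asarray(weights, dtype=float)

    mask = y_true != 0
    if not mask.any() or weights[mask].sum() == 0:
        # The unlucky case: nothing left to average over.
        raise ValueError("no usable samples: all labels are 0 or carry zero weight")

    rel_err = np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])
    return np.sum(weights[mask] * rel_err) / np.sum(weights[mask])

# The first sample is dropped (label 0); the last one counts three times as much.
score = weighted_mape([0.0, 10.0, 100.0], [1.0, 9.0, 120.0], weights=[1.0, 1.0, 3.0])
```

Here the surviving relative errors are 0.1 and 0.2, so the weighted result is (1 × 0.1 + 3 × 0.2) / 4 = 0.175. Raising on an empty batch is just one design choice; skipping the batch would work too.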

Unfortunately, although I managed to get it to work, this solution was unbearably slow. Furthermore, for maintainability reasons, it would just be easier if you could use the sklearn API (not to say that you couldn’t wrap your autograd training into the sklearn format).

It was time to head back to the drawing board.

2) Solution

I got lucky, and things lined up perfectly.

2.1) Getting lucky with sklearn

While researching ways to solve my issue with sklearn packages, I came across sklearn.linear_model.SGDRegressor, but that only allows the following loss functions:

  • squared_error: OLS

  • huber: wherein errors below some \(\epsilon\) use the squared loss, while errors above that \(\epsilon\) are treated as a linear loss.

  • epsilon_insensitive: ignores errors less than \(\epsilon\) and is linear when greater than that

  • squared_epsilon_insensitive: is epsilon_insensitive but quadratic instead of linear.
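To make the four options concrete, here is a NumPy sketch of the per-sample losses as I understand them from the sklearn documentation. The function names and the test values are mine, and this is not sklearn's internal implementation.

```python
import numpy as np

def squared_error(y, p):
    return 0.5 * (y - p) ** 2

def huber(y, p, eps):
    r = np.abs(y - p)
    # Quadratic for small errors, linear once the error exceeds eps.
    return np.where(r <= eps, 0.5 * r ** 2, eps * r - 0.5 * eps ** 2)

def epsilon_insensitive(y, p, eps):
    return np.maximum(0.0, np.abs(y - p) - eps)

def squared_epsilon_insensitive(y, p, eps):
    return np.maximum(0.0, np.abs(y - p) - eps) ** 2

y = np.array([0.5, 1.0, 3.0])  # hypothetical targets
p = np.zeros(3)                # hypothetical predictions, so the errors are 0.5, 1.0, 3.0
eps = 1.0

losses = {
    "squared_error": squared_error(y, p),
    "huber": huber(y, p, eps),
    "epsilon_insensitive": epsilon_insensitive(y, p, eps),
    "squared_epsilon_insensitive": squared_epsilon_insensitive(y, p, eps),
}
```

With these inputs, only the error of 3.0 lies outside the \(\epsilon\) band, so the two epsilon-insensitive losses are zero for the first two samples.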

2.2) Getting lucky with the equations

Looking at the Wikipedia page for MAPE, one might notice that it resembles the formula for MAE:

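\[\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|\] \[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|\]

Algebraic Manipulation

\[\begin{align*} \text{MAPE} &= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| & \text{In my problem, } Y_i \text{ is always positive} \\ &= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{Y_i}\left|Y_i - \hat{Y}_i\right| & \text{Looks like a weighted MAE}\\ &= \text{MAE with per-sample weights } w_i = \frac{1}{Y_i}\\ \end{align*}\]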

So this means that I just need to find an MAE implementation and scale each sample by \(1/Y\).
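The weighted-MAE view is easy to check numerically. The data below is random and purely illustrative; the only assumption carried over from my problem is that every label is positive.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.uniform(1.0, 5000.0, size=100)        # strictly positive labels
Y_hat = Y * rng.uniform(0.5, 1.5, size=100)   # hypothetical predictions

mape = np.mean(np.abs((Y - Y_hat) / Y))

# The same quantity written as an MAE whose samples are weighted by 1 / Y.
weights = 1.0 / Y
weighted_mae = np.mean(weights * np.abs(Y - Y_hat))

same = np.isclose(mape, weighted_mae)
```

The equality relies on the labels being positive; with negative labels the weights would need to be \(1/|Y|\) instead.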

2.3) Lady Luck is Smiling

By pure chance, I found Sklearn-mathematical formulation of SGD losses, and I decided to read it.

“epsilon_insensitive loss ignores errors less than \(\epsilon\) and is linear when greater than that”

was the description for one of the losses. However, it wasn’t apparent to me that it would also take the absolute error. Only after reading the page linked above did I realize what it meant:

\[L(Y, \hat{Y}) = \max(0, |Y - \hat{Y}| - \epsilon)\]

This means that if we set \(\epsilon\) to 0, we get the form we want!
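A small numerical check of this claim, under my own assumptions: a linear model without an intercept, made-up coefficients, and random positive data. With \(\epsilon = 0\) the loss is the plain absolute error, and applying it to rows divided by their label gives MAPE.

```python
import numpy as np

def epsilon_insensitive(y, y_hat, epsilon):
    return np.maximum(0.0, np.abs(y - y_hat) - epsilon)

rng = np.random.default_rng(1)
X = rng.uniform(1.0, 10.0, size=(50, 3))
Y = X @ np.array([2.1, -0.4, 3.8])   # positive for every row in this range
w = np.array([2.0, -0.5, 4.0])       # hypothetical candidate coefficients

# With epsilon = 0 the loss reduces to the absolute error.
abs_err = epsilon_insensitive(Y, X @ w, epsilon=0.0)

# Divide each row (features and label) by its label, then apply the same loss.
X_scaled = X / Y[:, None]
Y_scaled = Y / Y                      # all ones
scaled_loss = epsilon_insensitive(Y_scaled, X_scaled @ w, epsilon=0.0).mean()

mape = np.mean(np.abs((Y - X @ w) / Y))
```

Here `scaled_loss` and `mape` agree, which is the weighted-MAE identity from section 2.2 expressed through the epsilon_insensitive loss.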

2.4) For completeness

For completeness, I list the code as I used it.

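import numpy as np
from sklearn.linear_model import SGDRegressor

Y = ...  # Our labels, shape (n_samples,), assumed strictly positive
X = ...  # My forecast data, shape (n_samples, n_features)
denominator = 1 / Y  # we can do this because Y is never 0

# Scaling: dividing each sample by its label turns the absolute error into a percentage error
scaled_Y = Y * denominator           # all ones
scaled_X = X * denominator[:, None]  # broadcast over the feature columns (assumes NumPy arrays)

model = SGDRegressor(loss="epsilon_insensitive", epsilon=0).fit(scaled_X, scaled_Y)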
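An alternative I did not use at the time, sketched here under assumptions: `SGDRegressor.fit` accepts a `sample_weight` argument, so the \(1/Y\) weights can be passed directly instead of rescaling the data. The synthetic data, `alpha=0.0`, and `random_state=0` are illustrative choices, and I have not benchmarked this against the scaling approach.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))                               # synthetic features
Y = 10.0 + X @ np.array([3.0, 2.0]) + rng.normal(0.0, 0.1, size=500)   # positive labels

# epsilon=0 makes the loss the absolute error; weighting by 1 / Y makes it percentage-like.
model = SGDRegressor(loss="epsilon_insensitive", epsilon=0.0, alpha=0.0, random_state=0)
model.fit(X, Y, sample_weight=1.0 / Y)

# Evaluate in the original units; no rescaling of X is needed at prediction time.
mape = np.mean(np.abs((Y - model.predict(X)) / Y))
```

One reason to consider this variant is that the intercept is fit in the original units, whereas after rescaling the rows a default intercept no longer corresponds to a constant offset on the unscaled data.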

Closing words

Although I managed to make both autograd and sklearn work for my problem, the results were still not good. I suppose that the takeaway from this is that you can do everything “right” and still not have things turn out your way.

In hindsight, this was a simple problem, but it was a good reminder of what it takes to be a good machine learning engineer: good software and math skills. I needed to set up minor infrastructure, massage data via a pipeline, and work out the autograd package, so being able to code was imperative. In addition, I needed to understand the math to come to the solution I did.

Please know that I am not blowing my own horn; in fact, I’m embarrassed about how long I took to find the solution. And even then, I stumbled backward into the solution.

Thank you for taking the time to read this, and happy holidays!