MAPE Madness
Spoiler: RTFM
Problem setup: You want to use the Mean Absolute Percentage Error (MAPE) as your loss function for training Linear Regression on some forecast data. MAPE (see Springer: Mean Absolute Percentage Error (MAPE)) has found success in forecasting because it has desirable properties:
robust to outliers
scale invariance (it returns a percentage), which makes it intuitive to compare across datasets.
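To make the scale-invariance point concrete, here is a quick sanity check of my own (toy numbers, not from any real dataset): two series at very different scales, each predicted with the same relative error, produce the same MAPE, using the formula reproduced in the next section.

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, as reproduced in the next section
    return np.mean(np.abs((y_true - y_pred) / y_true))

small = np.array([5.0, 10.0, 20.0])   # small-scale series
large = small * 1000                  # same series, scaled up 1000x

# both predictions are off by 10% on every sample
print(mape(small, small * 1.1))       # 0.1
print(mape(large, large * 1.1))       # 0.1
```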
0) Setup
You have forecasting data where a significant difference may exist between contiguous samples, for example:
\[T_1 = 5, \quad T_2 = 5000\]
Think of predicting the price of Bitcoin, or making sure your power plants can supply the tea-time surge when England brews up at half-time during the World Cup.
We reproduce the equation below:
\[\text{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left|\frac{y_t - \hat{y}_t}{y_t}\right|\]
1) Failed Attempts
Here’s hoping you can learn from my mistakes and avoid the time I wasted trying to solve this problem.
1.1) Sklearn
A quick look at the Sklearn Linear Model - Linear Regression page tells you that it only supports OLS. This is unfortunate because sklearn is, in general, heavily optimized and well tested.
1.2) Autograd
Having worked through the examples, it was not clear to me how to handle the enormous datasets I was modeling at the time. What I was after was a way to generate indices to pass in for minibatch training. After much searching, I eventually found what I was looking for in the Convnet Example, which shows how to pass minibatches in.
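For illustration, here is a minimal sketch of that minibatch-index trick applied to a MAPE-style objective; the toy data, batch size, step size, and iteration count are my own placeholders, not values from the original problem.

```python
import autograd.numpy as np
import autograd.numpy.random as npr
from autograd import grad
from autograd.misc.optimizers import adam

# Toy, strictly positive data (placeholders for the real forecast data)
rng = npr.RandomState(0)
X = rng.rand(25600, 3) + 1.0
y = np.abs(np.dot(X, np.array([2.0, -1.0, 0.5]))) + 1.0

batch_size = 256
num_batches = X.shape[0] // batch_size

def batch_indices(it):
    # Map the optimizer's iteration counter onto a slice of the training data
    idx = it % num_batches
    return slice(idx * batch_size, (idx + 1) * batch_size)

def objective(params, it):
    idx = batch_indices(it)
    pred = np.dot(X[idx], params)
    return np.mean(np.abs((y[idx] - pred) / y[idx]))  # MAPE on this minibatch

# adam calls grad(objective)(params, iteration) at every step
trained_params = adam(grad(objective), np.zeros(3), step_size=0.01, num_iters=2000)
```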
Note: you want to be sure that none of your y_true values are 0, as this can lead to division-by-zero errors during optimization. I suggest doing something like
import autograd.numpy as np

def objective(params, X, y):
    pred = np.dot(X, params)
    non_zero_mask = y > 0  # drop samples whose label is 0 to avoid dividing by zero
    return np.mean(np.abs((y[non_zero_mask] - pred[non_zero_mask]) / y[non_zero_mask]))
Another option would be to add weights to the objective function. It is possible that you are extremely unlucky and all of your labels `y` are 0, in which case the masked objective above has nothing left to average; you may also want to weigh different samples more or less heavily. A small sketch of a weighted version follows.
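Here is a minimal sketch of what that could look like; the weight vector `w` is hypothetical and would come from whatever weighting scheme your problem calls for.

```python
import autograd.numpy as np

def weighted_objective(params, X, y, w):
    # w: hypothetical per-sample weights, same length as y
    pred = np.dot(X, params)
    mask = y > 0  # still guard against zero labels
    rel_err = np.abs((y[mask] - pred[mask]) / y[mask])
    return np.sum(w[mask] * rel_err) / np.sum(w[mask])  # weighted MAPE
```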
Unfortunately, although I managed to get it to work, this solution was unbearably slow. Furthermore, for maintainability reasons, it would just be easier if you could use the sklearn API (not to say that you couldn’t wrap your autograd training into the sklearn format).
It was time to head back to the drawing board.
2) Solution
I got lucky, and things lined up perfectly.
2.1) Getting lucky with sklearn
While researching ways to use sklearn packages to solve my issue, I also came across sklearn.SGDRegressor, but that only allows the following loss functions:
squared_error: ordinary least squares (OLS).
huber: errors below some \(\epsilon\) use the squared loss, while errors above that \(\epsilon\) are treated as a linear loss.
epsilon_insensitive: ignores errors less than \(\epsilon\) and is linear when greater than that.
squared_epsilon_insensitive: is epsilon_insensitive but quadratic instead of linear.
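For reference, the usual forms of these losses (written here from memory, up to the constant factors sklearn may use internally) are:

\[\begin{align*}
L_{\text{squared\_error}}(y, \hat{y}) &= (y - \hat{y})^2 \\
L_{\text{huber}}(y, \hat{y}) &= \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \epsilon \\ \epsilon\,|y - \hat{y}| - \tfrac{1}{2}\epsilon^2 & \text{otherwise} \end{cases} \\
L_{\text{epsilon\_insensitive}}(y, \hat{y}) &= \max(0, |y - \hat{y}| - \epsilon) \\
L_{\text{squared\_epsilon\_insensitive}}(y, \hat{y}) &= \max(0, |y - \hat{y}| - \epsilon)^2
\end{align*}\]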
2.2) Getting lucky with the equations
Looking at the Wikipedia page for MAPE, one might notice that it resembles the formula for MAE:
\[\text{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|\] \[\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|\]
Algebraic Manipulation
\[\begin{align*} \text{MAPE} &= \frac{1}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right| & \text{in my problem, } y_t \text{ is always positive} \\ &= \frac{1}{n}\sum_{t=1}^{n}\frac{1}{y_t}\left|y_t - \hat{y}_t\right| & \text{looks like a weighted MAE} \\ &= \frac{1}{n}\sum_{t=1}^{n}\left|\frac{y_t}{y_t} - \frac{\hat{y}_t}{y_t}\right| & \text{an ordinary MAE on targets and predictions scaled by } 1/y_t \end{align*}\]
so this means that I just need to find an MAE implementation and feed it rescaled data.
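As a quick sanity check of that identity (toy numbers of my own), MAPE on the original data matches a plain MAE on data where both targets and predictions are divided by the targets:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(1, 100, size=1000)              # strictly positive targets
y_pred = y_true * rng.uniform(0.8, 1.2, size=1000)   # predictions within +-20%

mape = np.mean(np.abs((y_true - y_pred) / y_true))
mae_on_scaled = np.mean(np.abs(y_true / y_true - y_pred / y_true))

print(np.isclose(mape, mae_on_scaled))  # True
```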
2.3) Lady Luck is Smiling
By pure chance, I found the Sklearn mathematical formulation of SGD losses, and I decided to read it.
"epsilon_insensitive loss ignores errors less than \(\epsilon\) and is linear when greater than that"
was the description for one of the losses. However, from that description alone it wasn’t apparent to me that the error is taken in absolute value. Only after reading the contents of the link above did I realize what it meant:
\[L(y_t, \hat{y}_t) = \max(0, |y_t - \hat{y}_t| - \epsilon)\]
This means that if we set \(\epsilon\) to 0, we get \(|y_t - \hat{y}_t|\), exactly the form we want!
2.4) For completeness
For completeness, I list the code as I used it.
from sklearn.linear_model import SGDRegressor

Y = ...  # our labels (always positive)
X = ...  # my forecast data, shape (n_samples, n_features)

denominator = 1 / Y  # we can do this because Y is always positive

# Scaling
scaled_Y = Y * denominator           # all ones
scaled_X = X * denominator[:, None]  # scale each row of X by 1 / y_t

# epsilon_insensitive with epsilon=0 is the MAE loss, so this minimizes MAPE
model = SGDRegressor(loss="epsilon_insensitive", epsilon=0)
model.fit(scaled_X, scaled_Y)
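And here is a self-contained toy sketch of my own of the whole trick end to end, with an OLS baseline for comparison; the synthetic data, fit_intercept=False, and the hyperparameters are my assumptions, not part of the original setup. I disable the intercept because, with the scaled target being identically 1, a free intercept could otherwise drive the scaled loss to zero trivially.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(1000, 2))                      # toy forecast features
y = X @ np.array([3.0, 0.5]) + rng.normal(0, 0.1, 1000)     # strictly positive targets

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true))

# MAPE via MAE on rescaled data (epsilon_insensitive with epsilon=0)
denominator = 1 / y
mape_model = SGDRegressor(loss="epsilon_insensitive", epsilon=0,
                          fit_intercept=False, max_iter=5000, tol=None)
mape_model.fit(X * denominator[:, None], y * denominator)

# Plain OLS baseline
ols = LinearRegression(fit_intercept=False).fit(X, y)

print("MAPE, epsilon-insensitive SGD:", mape(y, X @ mape_model.coef_))
print("MAPE, OLS baseline:           ", mape(y, ols.predict(X)))
```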
Closing words
Although I managed to make autograd and sklearn work for my problem, the results were still not good. I suppose that the takeaway from this is that you can do everything “right” and still not have things turn out your way.
In hindsight, this was a simple problem, but it was a good reminder of what it takes to be a good machine learning engineer: good software and math skills. I needed to set up minor infrastructure, massage data via a pipeline, and work out the autograd package, so being able to code was imperative. In addition, I needed to understand the math to come to the solution I did.
Please know that I am not blowing my own horn; in fact, I’m embarrassed about how long I took to find the solution. And even then, I stumbled backward into the solution.
Thank you for taking the time to read this, and happy holidays!