Understanding forecast accuracy: MAPE, WMAPE, WAPE? - forecasting

I am new to forecasting and am trying to understand the different forecast accuracy measures. I am referring to the link below:
https://www.otexts.org/fpp/2/5
Can anyone please help me understand the following:
1. MAPE: I am trying to understand the disadvantage of MAPE described as "They also have the disadvantage that they put a heavier penalty on negative errors than on positive errors." Can anyone please provide an example that explains this in detail?
2. Also, I was assuming that WMAPE and WAPE are the same, but I saw this post on Stack Overflow which formulates them differently:
What's the gaps for the forecast error metrics: MAPE and WMAPE?
Also, can you please help me understand how the weights are calculated? My understanding is that the higher the value, the more important it is, but I am not sure how the value is calculated.
Thanks in advance!

MAPE = 100 * mean(|(Actual - Forecast) / Actual|)
If you check the website https://robjhyndman.com/hyndsight/smape/ and the example given there, you will notice that the denominator used is the forecast, which is incorrect (it should be the actual value). With the formula above you can see that MAPE does not put a heavier penalty on negative errors than on positive errors.
WMAPE applies weights, which may in fact be biased towards the error, which would make the metric worse. As far as I know, the weighting for WMAPE depends on the use case. For example, if you are trying to predict loss, the percentage loss may need to be weighted by sales volume, because a loss on a huge sale needs a better prediction.
When the values to be predicted are very low, MAD/Mean (a.k.a. WAPE) should be used. For example, if sales are 3 units in one particular week (maybe a holiday) and the predicted value is 9, then the MAPE for that week would be 200%. This would inflate the overall MAPE when you look at multiple weeks of data.
The link below has details of some other statistics used for error measurement:
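To make the low-volume effect concrete, here is a minimal Python sketch. The weekly sales numbers are made up for illustration (only the 3-versus-9 week comes from the example above), and the function names `mape` and `wape` are just labels chosen here.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent, with the actual in the denominator."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def wape(actual, forecast):
    """MAD/Mean in percent: total absolute error divided by total actual volume."""
    return 100 * sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual)


# Hypothetical weekly sales: three ordinary weeks plus one low-volume week.
actual = [100, 110, 90, 3]
forecast = [105, 100, 95, 9]

low_week_ape = 100 * abs(3 - 9) / 3        # percentage error of the low-volume week alone
overall_mape = mape(actual, forecast)      # dominated by the low-volume week
overall_wape = wape(actual, forecast)      # the low-volume week counts only by its volume

print(f"low week APE: {low_week_ape:.0f}%")
print(f"MAPE: {overall_mape:.1f}%  WAPE: {overall_wape:.1f}%")
```

With these made-up numbers the single low-volume week pushes MAPE above 50%, while MAD/Mean stays below 10%.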
http://www.forecastpro.com/Trends/forecasting101August2011.html

I'm not very sure about the rest, but I came across an answer for the first question recently.
Check out this website - http://robjhyndman.com/hyndsight/smape/
The example given there is presented below -
"Armstrong and Collopy (1992) argued that the MAPE "puts a heavier penalty on forecasts that exceed the actual than those that are less than the actual". Makridakis (1993) took up the argument saying that "equal errors above the actual value result in a greater APE than those below the actual value". He provided an example where yt=150 and y^t=100, so that the relative error is 50/150=0.33, in contrast to the situation where yt=100 and y^t=150, when the relative error would be 50/100=0.50."
y^t == estimated value of y
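The quoted example can be checked in a few lines of Python. This is only a sketch of the arithmetic, assuming the error is defined as actual minus forecast (so a "negative error" means the forecast exceeded the actual); the helper name `ape` is chosen here for illustration.

```python
def ape(actual, forecast):
    """Absolute percentage error as a fraction, with the actual in the denominator."""
    return abs(actual - forecast) / abs(actual)


# Same absolute miss of 50 units in both cases.
under_forecast = ape(actual=150, forecast=100)   # 50 / 150
over_forecast = ape(actual=100, forecast=150)    # 50 / 100

# For non-negative forecasts, an under-forecast can never exceed 100%,
# while an over-forecast has no upper limit.
worst_under = ape(actual=100, forecast=0)
large_over = ape(actual=100, forecast=400)

print(under_forecast, over_forecast, worst_under, large_over)
```

The asymmetry comes from swapping which value plays the role of the actual, not from the sign of the error for a fixed actual.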

WMAPE and MAPE are different measures.
MAPE is Mean Absolute Percent Error: it simply averages the percent errors.
WMAPE is Weighted Mean Absolute Percent Error: it weights the errors by volume, so it is more rigorous and reliable.
Negative errors do not influence the calculation, as this is all absolute error. The heavier penalty mentioned in the question could result from the denominator used, which is a separate debate.
You can download a detailed presentation from our website at https://valuechainplanning.com/download/24. The PDF can be downloaded at https://valuechainplanning.com/upload/details/Forecast_Accuracy_Presentation.pdf.
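As a rough illustration of weighting by volume, the sketch below assumes the weights are simply the actual values. That is one common choice rather than the only one. Under that assumption the weighted MAPE reduces algebraically to MAD/Mean (WAPE), while equal weights give back plain MAPE. The data are made up.

```python
def wmape(actual, forecast, weights):
    """Weighted mean of absolute percentage errors, in percent."""
    ape = [abs(a - f) / a for a, f in zip(actual, forecast)]
    return 100 * sum(w * e for w, e in zip(weights, ape)) / sum(weights)


# Hypothetical data: the last period has very low volume.
actual = [100, 110, 90, 3]
forecast = [105, 100, 95, 9]

equal_weighted = wmape(actual, forecast, [1, 1, 1, 1])   # identical to plain MAPE
volume_weighted = wmape(actual, forecast, actual)        # weights = actual volume

# MAD/Mean computed directly, for comparison.
mad_over_mean = 100 * sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual)

print(f"{equal_weighted:.1f}% {volume_weighted:.1f}% {mad_over_mean:.1f}%")
```

If a different weight is used (for example revenue or margin instead of volume), WMAPE and WAPE no longer coincide.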

Related

Compare MAPE for markets with different volatilities

I am trying to compare the forecast accuracy of a number of methods using MAPE across different commodity markets, such as corn, wheat, soybeans, coffee and cotton. Obviously the relative MAPEs are impacted by the relative volatilities of each commodity: a high MAPE for wheat may simply reflect a volatile market, not necessarily a poor forecast.
I am wondering how to correct for this: some kind of vol-adjusted MAPE, I suppose, but I cannot find any literature on this. Alternatively, I was thinking of comparing the MAPE of a certain forecast method with the MAPE of a naïve forecast; this should also correct for the vol difference somewhat, I suppose.
Any further suggestions/comments are greatly appreciated.
I'm not aware of any measure that directly incorporates volatility in order to enable comparison across series. I would also question the relevance of directly comparing accuracy measures across series like that, as the accuracy depends, as you also point out, on the volatility/signal-to-noise ratio of the time series.
I approach a problem like this the way you also suggest: create a naïve forecast and use it as the lowest acceptable accuracy for that series, and also as an initial measure of the forecastability of the series.
Note: I define a naïve forecast as a very simple forecast model (naive1, naive2, a moving average, or a combination of those) where no further work needs to be done on parameters.
Have a look at the work of Michael Gilliland on FVA for inspiration.
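One way to sketch the comparison against a naïve forecast is to divide the model's MAPE by the MAPE of a "previous value" forecast over the same periods. The price series, the model forecasts, and the name `mape_vs_naive` are all invented for illustration; this is not presented as the FVA definition.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mape_vs_naive(actual, forecast):
    """Model MAPE divided by the MAPE of a 'previous value' naive forecast.

    The first period is skipped because the naive forecast needs one prior value.
    Values below 1 mean the model beats the naive forecast.
    """
    target = actual[1:]
    naive = actual[:-1]
    return mape(target, forecast[1:]) / mape(target, naive)


# Hypothetical prices and model forecasts for a calm and a volatile market.
calm_actual = [100, 101, 102, 101, 103]
calm_forecast = [100, 100.5, 101.5, 101.5, 102.5]
volatile_actual = [100, 120, 90, 115, 95]
volatile_forecast = [100, 110, 100, 105, 100]

calm_mape = mape(calm_actual[1:], calm_forecast[1:])
volatile_mape = mape(volatile_actual[1:], volatile_forecast[1:])
calm_ratio = mape_vs_naive(calm_actual, calm_forecast)
volatile_ratio = mape_vs_naive(volatile_actual, volatile_forecast)

print(f"raw MAPE   calm {calm_mape:.2f}%  volatile {volatile_mape:.2f}%")
print(f"vs naive   calm {calm_ratio:.2f}   volatile {volatile_ratio:.2f}")
```

In this made-up case the volatile market has a much higher raw MAPE, yet its forecast improves on the naïve baseline slightly more than the calm market's forecast does.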

Sample size calculation for experimental design

I have three treatments (wild type, Mutant1 and Mutant2). I would like input on how to decide the sample size needed to detect a statistically significant effect (alpha < 0.05) with high statistical power (1-beta = 0.8).
Questions
I understand that we need information about the effect size. If we don't know the expected effect size beforehand, we can run a trial experiment to estimate it. In that case, what sample size should the trial experiment start with: high (n=10) or as low as n=3? Can n=3 per treatment provide a good estimate of the effect size, or is n=10 better for this estimate? To be specific: suppose we have resources for at most n=10 and must choose between n=3 and n=10 for this trial.
This question is better asked in https://stats.stackexchange.com.
I would discourage you from trying to estimate effect sizes from pilot experiments with low n. Your estimates will be quite noisy, and this is rarely done (at least in my field of neuroscience). Instead, I would suggest you estimate your effect size from the literature. Have other people measured something similar to what you are planning to do? What sample sizes do they use? What kind of effect sizes do they report?
If you were going to go ahead with the plan to run a pilot study, I would recommend pre-registering your experimental design (https://www.cos.io/initiatives/prereg). Something like:
"We will test the effects of mutation 1 and mutation 2 on XXXX (compared to wild type) in a cohort of 30 mice (10 in each group). Based on the results of this study, we will then conduct a power analysis and reproduce the experiments with the sample size required to have a power of 0.8 at alpha=0.05.
Our criteria for excluding animals from the power analysis will be .....
The statistical test for estimating effect size will be......"
etc.
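To give a feel for how noisy a pilot estimate can be, here is a small simulation sketch. It assumes normally distributed measurements with unit variance and a true standardized effect of 1.0 between wild type and one mutant; those numbers, the seed, and the function names are arbitrary choices for illustration, not values from the question.

```python
import random
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * statistics.variance(group_a) + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    return (statistics.mean(group_b) - statistics.mean(group_a)) / pooled_var ** 0.5


def pilot_estimates(n, true_d, trials, rng):
    """Simulate many pilot studies with n animals per group and return the estimated effects."""
    estimates = []
    for _ in range(trials):
        wild_type = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mutant = [rng.gauss(true_d, 1.0) for _ in range(n)]
        estimates.append(cohens_d(wild_type, mutant))
    return estimates


rng = random.Random(0)  # fixed seed so the sketch is repeatable
spread_n3 = statistics.stdev(pilot_estimates(3, 1.0, 2000, rng))
spread_n10 = statistics.stdev(pilot_estimates(10, 1.0, 2000, rng))

print(f"spread of estimated effect size: n=3 -> {spread_n3:.2f}, n=10 -> {spread_n10:.2f}")
```

Under these assumptions the estimates from n=3 pilots scatter far more widely than those from n=10 pilots, so a power analysis based on a single n=3 pilot could land on a very different sample size from run to run.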

The order of features affects the results of a neural network

Well,
I am confused really.
I have a simple ordered set of features, i.e. all the letters and a few symbols, counting how many times each occurs in a string.
My resulting feature list is as follows:
numberOf_a
numberOf_b
...
numberOf_Z
numberOf_.
numberOf_,
I have a test sample of 65 values, and the MLP can get 46 correct.
Now if I change the order of the features to a random order, train with the same data, and evaluate the same values, I get a different number of correct predictions, e.g. 49.
Results are consistent (the same order will yield the same accuracy), but the accuracy changes between random orders.
The question is, is this supposed to happen? I cannot see how this is backed up by the theory. Am I missing something big here?
PS. I am using WEKA's implementation of the MLP.
I'm not familiar with the WEKA implementation of the MLP but that doesn't seem like something that should be happening with a neural network algorithm.
It almost seems like it's getting stuck in some sort of local minimum. The algorithm may be initializing the weights of the individual neurons the same way every time. Changing the parameter order might then cause the algorithm to arrive at the same answer for a certain parameter order each time, dependent on the initial parameter order. The "local minimum" might be determined by the algorithm only going through a certain number of iterations each time.
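The speculation above can be illustrated without any neural network library. The sketch assumes the initial weights are drawn from the same seeded random generator on every run (an assumption about the implementation, not something confirmed for WEKA). The feature counts, the column order, and the function names are made up.

```python
import random


def initial_weights(n_features, seed):
    """Draw initial input weights for one hidden unit from a seeded generator."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n_features)]


def pre_activation(features, weights):
    """Weighted sum fed into a hidden unit before the activation function."""
    return sum(x * w for x, w in zip(features, weights))


features = [3.0, 0.0, 5.0, 1.0]   # hypothetical counts: numberOf_a, numberOf_b, ...
order = [2, 0, 3, 1]              # one fixed reordering of the columns
reordered = [features[i] for i in order]

weights = initial_weights(len(features), seed=0)   # same seed for both column orders

start_original = pre_activation(features, weights)
start_reordered = pre_activation(reordered, weights)

# Only if the weights were permuted together with the columns would the start match.
matched_weights = [weights[i] for i in order]
start_matched = pre_activation(reordered, matched_weights)

print(start_original, start_reordered, start_matched)
```

With a fixed seed, reordering the columns pairs each feature with a different initial weight, so training starts from a different point and may settle in a different solution, while repeating the same order reproduces the same result.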

Help--100% accuracy with LibSVM?

Nominally a good problem to have, but I'm pretty sure it is because something funny is going on...
As context, I'm working on a problem in the facial expression/recognition space, so getting 100% accuracy seems incredibly implausible (not that it would be plausible in most applications...). I'm guessing there is either some consistent bias in the data set that is making it overly easy for an SVM to pull out the answer, or, more likely, I've done something wrong on the SVM side.
I'm looking for suggestions to help understand what is going on--is it me (=my usage of LibSVM)? Or is it the data?
The details:
About 2500 labeled data vectors/instances (transformed video frames of individuals--<20 individual persons total), binary classification problem. ~900 features/instance. Unbalanced data set at about a 1:4 ratio.
Ran subset.py to separate the data into test (500 instances) and train (remaining).
Ran "svm-train -t 0 ". (Note: apparently no need for '-w1 1 -w-1 4'...)
Ran svm-predict on the test file. Accuracy=100%!
Things tried:
Checked about 10 times over that I'm not training & testing on the same data files, through some inadvertent command-line argument error
re-ran subset.py (even with -s 1) multiple times and did train/test only multiple different data sets (in case I randomly upon the most magical train/test pa
ran a simple diff-like check to confirm that the test file is not a subset of the training data
svm-scale on the data has no effect on accuracy (accuracy=100%). (Although the number of support vectors does drop from nSV=127, nBSV=64 to nSV=72, nBSV=0.)
((weird)) using the default RBF kernel (instead of linear, i.e., removing '-t 0') results in accuracy going to garbage(?!)
(sanity check) running svm-predict using a model trained on a scaled data set against an unscaled data set results in accuracy = 80% (i.e., it always guesses the dominant class). This is strictly a sanity check to make sure that somehow svm-predict is nominally acting right on my machine.
Tentative conclusion?:
Something with the data is wacked--somehow, within the data set, there is a subtle, experimenter-driven effect that the SVM is picking up on.
(This doesn't, on first pass, explain why the RBF kernel gives garbage results, however.)
Would greatly appreciate any suggestions on a) how to fix my usage of LibSVM (if that is actually the problem) or b) determine what subtle experimenter-bias in the data LibSVM is picking up on.
Two other ideas:
Make sure you're not training and testing on the same data. This sounds kind of dumb, but in computer vision applications you should take care that you're not repeating data (say, two frames of the same video falling in different folds), that you're not training and testing on the same individual, etc. It is more subtle than it sounds.
Make sure you search for gamma and C parameters for the RBF kernel. There are good theoretical (asymptotic) results that justify that a linear classifier is just a degenerate RBF classifier. So you should just look for a good (C, gamma) pair.
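The first idea above (keeping each individual entirely in either the training set or the test set) can be sketched as follows. It assumes a person identifier is available for every frame. The identifiers, the split fraction, and the function name `split_by_person` are hypothetical and not part of LibSVM's tools.

```python
import random


def split_by_person(person_ids, test_fraction, seed):
    """Return (train_indices, test_indices) such that no person appears in both."""
    people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, round(len(people) * test_fraction))
    test_people = set(people[:n_test])
    train_idx = [i for i, p in enumerate(person_ids) if p not in test_people]
    test_idx = [i for i, p in enumerate(person_ids) if p in test_people]
    return train_idx, test_idx


# Hypothetical data set: 12 frames from 4 people, 3 frames each.
person_ids = ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3 + ["p4"] * 3

train_idx, test_idx = split_by_person(person_ids, test_fraction=0.25, seed=0)
train_people = {person_ids[i] for i in train_idx}
test_people = {person_ids[i] for i in test_idx}

print("train:", sorted(train_people), "test:", sorted(test_people))
```

A random per-frame split gives no such guarantee, so near-duplicate frames of the same person can end up on both sides and inflate the test accuracy.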
Notwithstanding that the devil is in the details, here are three simple tests you could try:
Quickie (~2 minutes): Run the data through a decision tree algorithm. This is available in Matlab via classregtree, or you can load into R and use rpart. This could tell you if one or just a few features happen to give a perfect separation.
Not-so-quickie (~10-60 minutes, depending on your infrastructure): Iteratively split the features (i.e. from 900 to 2 sets of 450), train, and test. If one of the subsets gives you perfect classification, split it again. It would take fewer than 10 such splits to find out where the problem variables are. If it happens to "break" with many variables remaining (or even in the first split), select a different random subset of features, shave off fewer variables at a time, etc. It can't possibly need all 900 to split the data.
Deeper analysis (minutes to several hours): try permutations of labels. If you can permute all of them and still get perfect separation, you have some problem in your train/test setup. If you select increasingly larger subsets to permute (or, if going in the other direction, to leave static), you can see where you begin to lose separability. Alternatively, consider decreasing your training set size and if you get separability even with a very small training set, then something is weird.
Method #1 is fast & should be insightful. There are some other methods I could recommend, but #1 and #2 are easy and it would be odd if they don't give any insights.
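A cruder stand-in for method #1, if no decision tree package is at hand, is to check each feature for a single threshold that separates the two classes. This is only a sketch on toy data with labels +1/-1; the function name `separating_features` is made up, and a real tree would also find splits that need more than one feature.

```python
def separating_features(rows, labels):
    """Return indices of features where one threshold separates the two classes perfectly."""
    found = []
    for j in range(len(rows[0])):
        positives = [row[j] for row, y in zip(rows, labels) if y == 1]
        negatives = [row[j] for row, y in zip(rows, labels) if y == -1]
        # The classes are separable on feature j if their value ranges do not overlap.
        if max(positives) < min(negatives) or max(negatives) < min(positives):
            found.append(j)
    return found


# Toy data: the third feature leaks the label, the first two do not.
rows = [
    [0.2, 5.0, 1.0],
    [0.9, 4.0, 1.0],
    [0.4, 6.0, 0.0],
    [0.7, 3.0, 0.0],
]
labels = [1, 1, -1, -1]

leaky = separating_features(rows, labels)
print("features that separate the classes on their own:", leaky)
```

Any feature index reported here would be a natural first suspect for an experimenter-driven artifact in the data.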

PID controller effect on a differential driving robot when the parameters (Kp, Ki, and Kd) are increased individually. [full Q written below]

Question: A PID controller has three parameters Kp, Ki and Kd which could affect the output performance. A differential driving robot is controlled by a PID controller. The heading information is sensed by a compass sensor. The moving forward speed is kept constant. The PID controller is able to control the heading information to follow a given direction. Explain the outcome on the differential driving robot performance when the three parameters are increased individually.
This question came up in a past paper. It most likely won't show up this year, but it still worries me. It's the only question that has had me thinking for quite some time. I'd love an answer in simple terms. Most of what I've read on the internet doesn't make much sense to me, as it goes heavily into detail and is off topic for my case.
My take on this:
I know that the proportional term, Kp, is entirely based on the error and that, let's say, doubling the error would mean doubling Kp (applying proportional force). This implies that Kp increases because the robot is heading in the wrong direction, so Kp is increased to make the robot go in the right direction, or at least to reduce the error as time passes. An increase in Kp would therefore adjust the heading of the robot so that it stays on the right path.
The derivative term, Kd, is based on the rate of change of the error so an increase in Kd implies that the rate of change of error has increased over time so double the error would result in double the force. An increase by double the change in the robot's heading would take place if the robot's heading is doubled in error from the previous feedback result. Kd causes the robot to react faster as the error increases.
An increase in the integral term, Ki, means that the error is increased over time. The integral accounts for the sum of error over time. Even a small increase in the error would increase the integral so the robot would have to head in the right direction for an equal amount of time for the integral to balance to zero.
I would appreciate a much better answer and it would be great to be confident for a similar upcoming question in the finals.
Side note: I posted this question in the Robotics section earlier, but seeing that questions there are hardly ever noticed, I've also posted it here.
I would highly recommend reading the article "PID Without a PhD"; it gives a great explanation along with some implementation details. The best part is the numerous graphs. They show you what changing the P, I, or D term does while holding the others constant.
And if you want a real-world application, Atmel provides example code on their site (for 8-bit MCUs) that perfectly mirrors the PID Without a PhD article. It follows the code from Atmel's website exactly (they make the ATmega328P microcontroller chip on the Arduino UNO boards): PDF explanation and Atmel code in C.
But here is a general explanation the way I understand it.
Proportional: This is a proportional relationship between the error and the output, something like Kp * (target - actual). It's simply a scaling factor on the error. It decides how quickly the system should react to an error (if it is of any help, you can think of it like amplifier slew rate). A large value will quickly try to fix errors, and a small value will take longer. With higher values, though, we get into an overshoot condition, and that's where the next terms come into play.
Integral: This is meant to account for errors in the past. In fact, it is the sum of all past errors. This is often useful for things like a small DC/constant offset that a proportional controller can't fix on its own. Imagine you give a step input of 1, and after a while the output settles at 0.9 and it's clear it's not going anywhere. The integral portion will see that the output is always ~0.1 too small, so it will add it back in, to hopefully bring the output closer to 1. This term usually helps to stabilize the response curve. Since it is taken over a long period of time, it should reduce noise and any fast changes (like those found in overshoot/ringing conditions). Because it's an aggregate, it is a very sensitive measurement and is usually very small compared to the other terms. A lower value will make changes happen very slowly and create a very smooth response (this term can also cause "wind-up"; see the article).
Derivative: This is supposed to account for the "future". It uses the slope of the most recent samples. Remember, this is the slope: it has nothing to do with the position error (current - goal); it is the previous measured position minus the current measured position. This term is the most sensitive to noise, and when it is too high it often causes ringing. A higher value encourages change, since we are "amplifying" the slope.
I hope that helps. Maybe someone else can offer another viewpoint, but that's typically how I think about it.
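For anyone who wants to experiment, here is a minimal simulation sketch of a heading controller. Everything in it is an assumption made for illustration: the robot is reduced to a toy model whose turn rate equals the PID output plus a constant bias (standing in for something like a wheel mismatch), and the gains, time step, and target heading are arbitrary rather than tuned for any real robot.

```python
def simulate_heading(kp, ki, kd, target=90.0, steps=400, dt=0.05, bias=-5.0):
    """Return the heading history of a toy robot steered by a PID controller.

    The turn rate is the PID output plus a constant bias (a hypothetical disturbance).
    """
    heading = 0.0
    integral = 0.0
    previous_error = target - heading
    history = []
    for _ in range(steps):
        error = target - heading
        integral += error * dt                       # sum of past errors
        derivative = (error - previous_error) / dt   # slope of the error
        turn_rate = kp * error + ki * integral + kd * derivative
        heading += (turn_rate + bias) * dt
        previous_error = error
        history.append(heading)
    return history


p_only = simulate_heading(kp=1.0, ki=0.0, kd=0.0)
pi_mild = simulate_heading(kp=1.0, ki=0.5, kd=0.0)
pi_aggressive = simulate_heading(kp=1.0, ki=5.0, kd=0.0)

print(f"P only:  final error {90.0 - p_only[-1]:.2f}")
print(f"PI mild: final error {90.0 - pi_mild[-1]:.4f}, peak heading {max(pi_mild):.1f}")
print(f"PI high: peak heading {max(pi_aggressive):.1f}")
```

In this toy model the P-only controller settles with a constant heading offset, adding an integral term removes that offset, and a larger Ki makes the heading swing further past the target before settling. Passing a non-zero `kd` lets you try the derivative term in the same way.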
