Forecasting comparison by two different models

I am doing forecasting on 8 datasets, using ARIMA, ANN, CNN, and LSTM models. When I applied the Diebold-Mariano test, I observed that for one of the datasets there is a significant difference between the ARIMA and ANN forecasts, whereas for another dataset it says there is no significant difference between the ARIMA and ANN forecasts.
Is that possible? Does it depend on the dataset?
Kindly let me know.
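
(For reference: the DM statistic is computed from the loss-differential series of the two forecasts on a given test set, so its value and significance can legitimately differ from dataset to dataset. Below is a minimal sketch in Python, assuming e1 and e2 are hypothetical arrays holding the two models' forecast errors on one dataset, using squared-error loss as in Diebold & Mariano (1995).)

import numpy as np
from scipy import stats

def dm_test(e1, e2, h=1):
    """Diebold-Mariano test with squared-error loss.
    e1, e2: forecast errors of the two models on the same test set.
    h: forecast horizon; autocovariances up to lag h-1 enter the variance."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2   # loss differential
    T = len(d)
    dbar = d.mean()
    # Long-run variance of d: gamma_0 + 2 * (gamma_1 + ... + gamma_{h-1}).
    gamma = [((d[k:] - dbar) * (d[:T - k] - dbar)).sum() / T for k in range(h)]
    var_dbar = (gamma[0] + 2 * sum(gamma[1:])) / T
    stat = dbar / np.sqrt(var_dbar)
    return stat, 2 * stats.norm.sf(abs(stat))       # statistic, two-sided p-value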

Compare MAPE for markets with different volatilities

I am trying to compare the forecast accuracy of a number of methods using MAPE across different commodity markets, such as corn, wheat, soybeans, coffee, and cotton. Obviously the relative MAPEs are impacted by the relative volatilities of each commodity: a high MAPE for wheat may simply reflect a volatile market, not necessarily a poor forecast.
I am wondering how to correct for this: some kind of volatility-adjusted MAPE, I suppose, but I cannot find any literature on this. Alternatively, I was thinking of comparing the MAPE of a given forecast method with the MAPE of a naïve forecast… this should also correct for the volatility difference somewhat, I suppose.
Any further suggestions/comments are greatly appreciated.
I'm not aware of any measure that directly incorporates volatility in order to enable comparison across series. I would also question the relevance of directly comparing accuracy measures across series like that, as the accuracy depends, as you also point out, on the volatility/signal-to-noise ratio of each time series.
I approach a problem like this the way you suggest: create a naïve forecast and treat its accuracy as the lowest acceptable accuracy for that series, and also as an initial measure of the forecastability of the series.
Note: I follow the definition of a naïve forecast as a very simple forecast model, e.g. naive1, naive2, a moving average, or a combination of those, where no further work needs to be done on parameters.
Have a look at the work of Michael Gilliland on forecast value added (FVA) for inspiration.
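
To make the naïve-benchmark idea concrete, here is a minimal sketch (Python, with hypothetical array inputs): each method's MAPE is divided by the MAPE of a naive1 forecast on the same series, so the resulting ratio is roughly comparable across commodities with different volatilities.

import numpy as np

def mape(actual, forecast):
    # Assumes a strictly positive series (e.g. commodity prices).
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def relative_mape(actual, forecast):
    """MAPE of the method divided by MAPE of naive1 (previous value).
    Values below 1 mean the method beats the naive benchmark."""
    actual = np.asarray(actual, float)
    naive = actual[:-1]                       # naive1: forecast = last observation
    return (mape(actual[1:], np.asarray(forecast, float)[1:])
            / mape(actual[1:], naive))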

How to calculate over a lot of records in a DB in reasonable time

If I have a vector (for example (5, 4, 6, 8)) in my application and I want to find similar vectors in my DB, let's say for simplicity that I'm calculating the distance between two vectors with the Manhattan distance.
What I need is a way to run the algorithm (Manhattan distance in my example) between my vector and all the vectors that are stored in my DB. Can I do 10 million vectors in under a couple of seconds?
If you really deal with a lot of data, what you really need is an approximate nearest neighbor (ANN) implementation - http://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor. Take a look at the Annoy project page - https://pypi.python.org/pypi/annoy/1.8.0. It includes a benchmark against other ANN projects which you may find interesting. Maybe there is an implementation as a plugin for a DB, but I am not aware of one. However, ANN can also be used to pre-compute the top-n nearest neighbours and store them in the DB as a list per user/item.
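
As a minimal sketch of the Annoy route (Python; this assumes a recent Annoy version that supports the 'manhattan' metric, and that the vectors have been pulled out of the DB into the hypothetical iterable vectors_from_db):

from annoy import AnnoyIndex

dim = 4                                    # e.g. vectors like (5, 4, 6, 8)
index = AnnoyIndex(dim, 'manhattan')
for i, vec in enumerate(vectors_from_db):  # hypothetical: all vectors from the DB
    index.add_item(i, vec)
index.build(10)                            # 10 trees; more trees = better recall
index.save('vectors.ann')                  # build once, then memory-map for queries

neighbour_ids = index.get_nns_by_vector([5, 4, 6, 8], 20)  # top-20 approximate NN

Once the index is built, individual queries take milliseconds rather than a full scan, which is what makes the 10-million-vector case feasible.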

Data mining and Weka

Hi, I've been asked to find at least 20 different datasets (with a maximum of 40 datasets). I need to apply the following classification techniques using the WEKA software on the chosen datasets:
(1) Decision tree (SimpleCart),
(2) Naïve Bayes, and
(3) k-NN (IBk) (with k taking values from 1 up to the number of class labels in the dataset)
Once you have applied WEKA to all the datasets, you are required to accomplish the following tasks:
Compare the performance of the applied techniques as achieved through WEKA.
Analyse the results with regard to the dataset properties.
I've never used Weka before, and I'm unsure how to apply the classification techniques and what I'm actually comparing, but I'm quick at learning. I'm not really sure what I'm required to do... I just need some direction or an example, please.
To find datasets, you can use
https://archive.ics.uci.edu/ml/datasets.html
To compare the performance of classifiers, there are many measures such as AUC (area under the curve), the ROC curve, accuracy, precision, and recall. Weka can generate these measures. I recommend using AUC and accuracy.
To learn how to use Weka, there are many online tutorials like http://www.ibm.com/developerworks/library/os-weka2/
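
Weka reports these measures in its classifier-evaluation output. Purely to make them concrete outside Weka, here is a toy sketch using scikit-learn (my choice for illustration, not part of the assignment):

from sklearn.metrics import accuracy_score, roc_auc_score

# Toy example: 6 test instances of a binary problem.
y_true  = [0, 0, 1, 1, 1, 0]               # actual class labels
y_pred  = [0, 1, 1, 1, 0, 0]               # hard predictions -> accuracy
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]   # P(class=1) -> AUC needs scores

print("Accuracy:", accuracy_score(y_true, y_pred))   # fraction correct, here 4/6
print("AUC:", roc_auc_score(y_true, y_score))        # rank-based, threshold-free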

Data Mining KNN Classifier

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting whether a customer will buy a mobile-home insurance policy. S/he tried a kNN classifier with different numbers of neighbours (k = 1, 2, 3, 4, 5) and got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that, the analyst decided to deploy kNN with k = 1. Was it a good choice? How would you select an optimal number of neighbours in this case?
It is not a good idea to select a parameter of a prediction algorithm by its performance on the whole training set, as the result will be biased towards this particular training set and carries no information about generalization performance (i.e. performance on unseen cases). In particular, with k = 1 each training point is its own nearest neighbour, so a perfect training F-score is guaranteed and tells you nothing. You should apply a cross-validation technique, e.g. 10-fold cross-validation, to select the best k (i.e. the k with the largest F-value) within a range.
This involves splitting your training data into 10 equal parts, retaining 9 parts for training and 1 for validation, and iterating so that each part is left out for validation once. If you use enough folds, this also gives you statistics on the F-value, and you can then test whether the values for different k are statistically significantly different.
See e.g. also:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm
The subtlety here, however, is that there is likely a dependency between the number of data points available for training and the best k-value: if you apply 10-fold cross-validation, you only use 9/10 of the training set for training... I'm not sure whether any research has been done on this or how to correct for it in the final training set. In any case, most software packages just use the above technique, e.g. see SPSS in the link.
One option is to use leave-one-out cross-validation (each data sample is left out once for testing); in that case each training fold has N-1 samples (the original training set has N).
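
A minimal sketch of the suggested procedure (Python with scikit-learn, on synthetic stand-in data since the original insurance data isn't available): pick k by the mean cross-validated F-score rather than the training-set F-score.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the insurance data (binary target).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for k in range(1, 6):
    # 10-fold cross-validated F-score; the training-set F-score would be
    # trivially 1.0 at k=1 and is not used here.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=10, scoring='f1')
    print("k=%d: mean F = %.3f (sd %.3f)" % (k, scores.mean(), scores.std()))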

SPSS creating a loop for a multiple regression over several variables

For my master's thesis I have to use SPSS to analyse my data. I originally thought I wouldn't have to deal with very difficult statistical issues, which is still true regarding the concepts of my analysis. BUT the problem is that in order to create my dependent variable I need to use the syntax editor / programming in general, and I have no experience in this area at all. I hope you can help me in the process of creating my syntax.
I have in total approximately 900 companies with 6 yearly observations each. For all of these companies I need the predicted values of the following company-specific regression:
Y = β1*X1 + β2*X2 + β3*X3 + error
(I know the βs will very likely not be significant, but this is nothing to worry about in my thesis; it will be mentioned in the limitations though.)
So far my data are ordered in the following way
COMPANY YEAR X1 X2 X3
1 2002
2 2002
1 2003
2 2003
But I could easily change the order, e.g. grouping rows by company:
1
1
2
2
etc.
OK, let's say I have rearranged the data: what I need now is for SPSS to compute the company-specific βs and return the output in one column (the predicted values, i.e. those βs multiplied by the specific X values in each row). So I guess what I need is a loop that runs a multiple linear regression over the 6 rows of each of the 939 companies, am I right?
As I said I have no experience at all, so every hint is valuable for me.
Thank you in advance,
Janina.
Bear in mind that with only six observations per company and three coefficients to estimate (or four, if you also include a constant term), the coefficient estimates are likely to be very imprecise. You might want to consider whether companies can be pooled, at least in part.
You can use SPLIT FILE to estimate the regressions separately for each company; an example is below. Note that one would likely want to consider other panel-data models and assess whether there is autocorrelation in the residuals. (This is IMO a useful approach, though, for exploratory analysis of multi-level models.)
The example declares a new dataset to pipe the regression estimates to (see the OUTFILE subcommand on REGRESSION) and suppresses the other tables (with 900+ tables, much of the time is spent rendering the output). If you need other statistics, either omit the OMS command that suppresses the tables, or tweak it to show only the tables you want. (You can use OMS to pipe other results to other datasets as well.)
************************************************************.
*Making fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #Comp = 1 TO 1000.
  COMPUTE #R1 = RV.NORMAL(10,2).
  COMPUTE #R2 = RV.NORMAL(-3,1).
  COMPUTE #R3 = RV.NORMAL(0,5).
  LOOP Year = 2003 TO 2008.
    COMPUTE Company = #Comp.
    COMPUTE Rand1 = #R1.
    COMPUTE Rand2 = #R2.
    COMPUTE Rand3 = #R3.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Companies.
COMPUTE x1 = RV.NORMAL(0,1).
COMPUTE x2 = RV.NORMAL(0,1).
COMPUTE x3 = RV.NORMAL(0,1).
COMPUTE y = Rand1*x1 + Rand2*x2 + Rand3*x3 + RV.NORMAL(0,1).
FORMATS Company Year (F4.0).
*Now sorting cases by Company and Year, then using SPLIT FILE to estimate
*the regression separately for each company.
SORT CASES BY Company Year.
*Declare a new dataset for the coefficients and have OMS suppress the other results.
DATASET DECLARE CoeffTable.
OMS
  /SELECT TABLES
  /IF COMMANDS = 'Regression'
  /DESTINATION VIEWER = NO.
*Now split file to get the coefficients.
SPLIT FILE BY Company.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2 x3
  /SAVE PRED (CompSpePred)
  /OUTFILE = COVB ('CoeffTable').
SPLIT FILE OFF.
OMSEND.
************************************************************.
