Predictions with QlikView - analytics

I would like to know whether QlikView offers any prediction functions.
My requirements:
I have sales data from 2013 and 2014 and I want to predict the sales for 2015. Which functions can I use to predict this in QlikView?
I also have similar data for production and training, per location and machine, so if this works for sales I can apply the predictions to other departments too.
As there are a lot of techniques and methods related to prediction, I want to know which technique I need to apply in QlikView, and how.
Thank you

As you said, there are a lot of techniques and methods, and you would have to combine them in QlikView as there's no one function that can do it for you. I would look into time series modelling (https://en.wikipedia.org/wiki/Time_series).
There's a good three-part video tutorial on YouTube about time series modelling (https://www.youtube.com/watch?v=gHdYEZA50KE&feature=youtu.be). Although it is done in Excel, you can apply the same techniques in QlikView.
You would probably have to use linear regression. QlikView provides some analytical functions which you can use to calculate the slope and the y-intercept of a linear regression (linest_m and linest_b).
All in all I have found QlikView not to be very good at calculating such things. For example, if you find that instead of linear regression, polynomial regression fits your data better then you would have to implement a lot of it by yourself. Maybe it would be wise to use some statistical programming language (e.g. R, Octave) and present the results in QlikView.
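As a rough illustration of the linear regression approach, here is a minimal Python sketch of the least-squares slope and intercept, the two quantities that linest_m and linest_b are described above as providing. The monthly sales figures are made up, and the names linear_fit and forecast are hypothetical helpers, not part of QlikView.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, future_xs):
    """Fit a line to (xs, ys) and extrapolate it to future_xs."""
    slope, intercept = linear_fit(xs, ys)
    return [slope * x + intercept for x in future_xs]


# Made-up data: months 1..24 stand for Jan 2013 .. Dec 2014.
months = list(range(1, 25))
sales = [100 + 5 * m for m in months]

# Months 25..36 stand for Jan 2015 .. Dec 2015.
predicted_2015 = forecast(months, sales, range(25, 37))
print(predicted_2015)
```

A straight line ignores seasonality, so for real sales data the time series techniques from the tutorial would likely be needed on top of this.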

Related

Data Warehousing - OLAP operations

I would like to know how to find the standard deviation of final scores from a data warehouse (represented by a schema) for a university's gradebook, using OLAP operations (slicing, drilling). I cannot post the image of the schema because I don't have enough reputation points.
The schema has the following dimensions:
course
student
semester
instructor
department
gradebook
Could you please help with this?
I think you need to be more specific in your question. Are you asking about a specific OLAP product, such as Analysis Services or Oracle OLAP? For example, if you are using Analysis Services, the MDX language has functions (StdDev and StdDevP) that calculate the sample standard deviation and the population standard deviation of the set you pass in. Oracle doesn't use MDX but likewise has an appropriate function (STDDEV).
If, more generally, you want to understand how to calculate the standard deviation, then you probably already know that it is a mathematical formula that has nothing to do with OLAP. In that case I would recommend Brandon Foltz's excellent, easy-to-consume series of videos on a broad range of statistical topics. There are 3 on standard deviation on his blog here; you will find them in the middle of the select row (Statistics 101: Standard Deviation and NFL Field Goals - Part 1/3, 2/3, 3/3).
Either way, the second will help you understand what to do with the first, once you consult appropriate product documentation.
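To make the formula concrete, here is a small Python sketch, assuming the final scores have already been pulled out of the warehouse as flat rows. The rows, the column layout and the helper names (slice_scores, stddev) are all hypothetical; slice_scores only mimics an OLAP slice by filtering on dimension values.

```python
import math

# Hypothetical fact rows: (course, semester, final_score).
ROWS = [
    ("Math", "2014S", 2), ("Math", "2014S", 4),
    ("Math", "2014S", 4), ("Math", "2014S", 4),
    ("Math", "2014F", 5), ("Math", "2014F", 5),
    ("Math", "2014F", 7), ("Math", "2014F", 9),
    ("Physics", "2014S", 6), ("Physics", "2014S", 8),
]


def slice_scores(rows, course=None, semester=None):
    """Keep the scores whose dimension values match (None means 'all')."""
    return [score for c, s, score in rows
            if (course is None or c == course)
            and (semester is None or s == semester)]


def stddev(values, population=False):
    """Sample standard deviation (divide by n - 1) or population (divide by n)."""
    n = len(values)
    if n == 0 or (n < 2 and not population):
        raise ValueError("not enough values")
    mean = sum(values) / n
    squared_diffs = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_diffs / (n if population else n - 1))


math_scores = slice_scores(ROWS, course="Math")
print(stddev(math_scores, population=True))   # population formula
print(stddev(math_scores))                    # sample formula
```

The only difference between the two variants is the divisor, which is the same distinction the product functions mentioned above make.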

Use case for incremental supervised learning using apache mahout

Business case:
Forecasting fuel consumption at site.
Say fuel consumption C depends on various factors x1,x2,...xn. So mathematically speaking, C = F{x1,x2,...xn}. I do not have an equation for F.
I do have a historical dataset from which I can get the correlation of C to x1, x2, etc. C, x1, x2, ... are all quantitative. Working out the correlation for an n-variable equation seems tough for a person like me with limited statistical knowledge.
So, I was thinking of employing some supervised machine learning techniques for the same. I will train a classifier with the historic data to get a prediction for the next consumption.
Question: Am I thinking in the right way?
Question: If this is correct, my system should be an evolving one. The more real data I feed to the system, the more my model should evolve to make a better prediction the next time. Is this a correct understanding?
If the above statements are true, will the AdaptiveLogisticRegression algorithm, as present in Mahout, be of help to me?
Requesting advice from the experts here!
Thanks in advance.
Ok, correlation is not a forecasting model. Correlation simply ascribes some relationship between the datasets based on covariance.
In order to develop a forecasting model, what you need to perform is regression.
The simplest form of regression is linear univariate, where C = F (x1). This can easily be done in Excel. However, you state that C is a function of several variables. For this, you can employ linear multivariate regression. There are standard packages that can perform this (within Excel for example), or you can use Matlab, etc.
Now, we are assuming that there is a "linear" relationship between C and the components of X (the input vector). If the relationship were not linear, then you would need more sophisticated methods (nonlinear regression), which may very well employ machine learning methods.
Finally, some series exhibit auto-correlation. If this is the case, then it may be possible for you to ignore the C = F(x1, x2, x3...xn) relationships, and instead directly model the C function itself using time-series techniques such as ARMA and more complex variants.
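For the linear multivariate case, a minimal Python sketch could look like the following. It fits C = b0 + b1*x1 + ... + bn*xn by solving the normal equations with plain Gaussian elimination, which is fine for a handful of variables but is only an illustration, not what Excel or Matlab do internally. The data and the names solve, fit_linear and predict are assumptions made up for the example.

```python
def solve(a, b):
    """Solve the square system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (inputs may be collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear(inputs, targets):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations."""
    rows = [[1.0] + list(r) for r in inputs]  # leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, x):
    """Evaluate the fitted linear model at input vector x."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Made-up history: each row is (x1, x2); consumption follows 3 + 2*x1 - x2.
history = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
consumption = [3 + 2 * x1 - x2 for x1, x2 in history]

coefs = fit_linear(history, consumption)
print(coefs)
print(predict(coefs, (6, 2)))
```

Refitting on the full history whenever new observations arrive is the simplest way to get the "evolving" behaviour asked about in the question.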
I hope this helps,
Srikant Krishna

Machine learning, best technique

I am new to machine learning. I am familiar with SVMs, neural networks and GAs. I'd like to know the best technique to learn for classifying pictures and audio. SVM does a decent job but takes a lot of time. Does anyone know a faster and better one? I'd also like to know the fastest library for SVM.
Your question is a good one and touches on the state of the art in classification algorithms. As you say, the choice of classifier depends on your data. For images, I can tell you that there is a method called AdaBoost; read this and this to know more about it. On the other hand, plenty of people are doing research on this. For example, in Gender Classification of Faces Using Adaboost [Rodrigo Verschae, Javier Ruiz-del-Solar and Mauricio Correa] the authors say:
"Adaboost-mLBP outperforms all other Adaboost-based methods, as well as baseline methods (SVM, PCA and PCA+SVM)"
Take a look at it.
If your main concern is speed, you should probably take a look at VW (Vowpal Wabbit) and, more generally, at stochastic gradient descent based algorithms for training SVMs.
If the number of features is large in comparison to the number of training examples,
then you should go for logistic regression or an SVM without a kernel.
If the number of features is small and the number of training examples is intermediate,
then you should use an SVM with a Gaussian kernel.
If the number of features is small and the number of training examples is large,
use logistic regression or an SVM without a kernel.
That's according to the Stanford ML class.
For such a task you may need to extract features first. Only after that is classification feasible.
I think feature extraction and selection are important.
For image classification, there are a lot of possible features, such as raw pixels, SIFT features, color, texture, etc. It would be better to choose the ones suitable for your task.
I'm not familiar with audio classification, but there are spectral features, such as the Fourier transform of the signal, or MFCCs.
The method used to classify is also important. Besides the methods in the question, KNN is a reasonable choice, too.
In practice, which features and method to use is closely tied to the task.
The method mostly depends on the problem at hand. There is no method that is always the fastest for any problem. Having said that, you should also keep in mind that once you choose an algorithm for speed, you will start compromising on accuracy.
For example, since you're trying to classify images, there might be a lot of features compared to the number of training samples at hand. In such cases, if you go for an SVM with kernels, you could end up overfitting, with the variance being too high.
So you would want to choose a method that has a high bias and low variance. Using logistic regression or a linear SVM are some ways to do it.
You could also use different types of regularization, or techniques such as SVD, to remove the features that do not contribute much to your output prediction and keep only the most important ones. In other words, choose the features that have little or no correlation between them. Once you do this, you should be able to speed up your SVM algorithms without sacrificing accuracy.
Hope it helps.
There are some good techniques in machine learning, such as boosting and AdaBoost.
One method of classification is boosting. This method iteratively reweights the data, which is then classified by a particular base classifier on each iteration; together these base classifiers build the classification model. Boosting assigns a weight to each data point in each iteration, and the weight changes according to how difficult that data point is to classify.
AdaBoost is one such ensemble technique. It uses an exponential loss function to improve the accuracy of the predictions made.
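The reweighting idea can be sketched in a few lines of Python. This is a minimal AdaBoost on one-dimensional toy data with threshold "stumps" as the base classifier; the data set, the helper names and the choice of stumps are assumptions for illustration, and a real image or audio task would use a library implementation on proper feature vectors.

```python
import math


def stump_predict(threshold, polarity, x):
    """Base classifier: predict `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Train `rounds` stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this stump
        ensemble.append((alpha, threshold, polarity))
        # Exponential-loss update: misclassified points get heavier.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost(XS, YS, rounds=3)
print([predict(model, x) for x in XS])
```

On this toy set the best single stump still misclassifies three of the ten points, while the weighted vote of three stumps reproduces all the training labels.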
I think your question is very open ended, and the "best classifier for images" will largely depend on the type of image you want to classify. In general, though, I suggest you study convolutional neural networks (CNNs) and transfer learning; these are currently the state-of-the-art techniques for the problem.
Check out the pre-trained CNN models available for PyTorch or TensorFlow.
For images, I suggest you also study image pre-processing. Pre-processing techniques are very important for highlighting particular features of the image and improving the generalization of the classifier.

How to rank genes using information gain?

How is gene ranking done for microarray data using information gain and chi-square statistics? Please illustrate with a simple example.
You could use the open source machine learning software Weka. Load your dataset and go to the "Select attributes" tab. Use the following attribute evaluators:
ChiSquaredAttributeEval : Evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class.
InfoGainAttributeEval : Evaluates the worth of an attribute by measuring the information gain with respect to the class.
Use either of them with Ranker as the "Search Method". That way the attributes are ranked by their individual evaluations.
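Since the question asks for a simple example, here is a small Python sketch of what such a ranking computes. It assumes the expression levels have already been discretized into "high" and "low" (real microarray values are continuous and need a discretization step first). The three genes, the sample labels and the function names are made up, and this is not Weka's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the class minus the entropy left after splitting on the feature."""
    n = len(labels)
    remaining = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remaining += len(subset) / n * entropy(subset)
    return entropy(labels) - remaining


def chi_square(feature, labels):
    """Chi-squared statistic of the feature-vs-class contingency table."""
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f, fc in feature_counts.items():
        for l, lc in label_counts.items():
            expected = fc * lc / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


def rank(genes, labels, score):
    """Gene names sorted from most to least informative under `score`."""
    return sorted(genes, key=lambda g: score(genes[g], labels), reverse=True)


# Made-up data: 4 tumor and 4 normal samples, 3 discretized genes.
LABELS = ["tumor"] * 4 + ["normal"] * 4
GENES = {
    "geneA": ["high", "high", "high", "high", "low", "low", "low", "low"],
    "geneB": ["high", "low", "high", "low", "high", "low", "high", "low"],
    "geneC": ["high", "high", "high", "low", "high", "low", "low", "low"],
}

print(rank(GENES, LABELS, information_gain))
print(rank(GENES, LABELS, chi_square))
```

geneA separates the two classes perfectly, geneB carries no class information, and geneC sits in between, so both scores rank them geneA, geneC, geneB.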
I don't exactly understand your question, but a very successful package for analyzing microarray data can be found here:
BioConductor
This is a software project that has a variety of different modules for reading data from microarrays and performing statistical analysis. This is very useful, because the file formats for microarray data are constantly changing as the technology develops, and the algorithms for analyzing microarray data have advanced significantly as well.
You can use InfoGainAttributeEval for calculating information gain.
For more information, check this answer.

Measuring the performance of classification algorithm

I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes or Markovian, probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and use these very same samples to measure fitness, it might get stuck in an overfitting problem. The classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples, and I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of classification algorithms.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
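The procedure can be sketched in Python as follows. The nearest-class-mean classifier and the one-dimensional data are placeholders made up for the example; any train/predict pair can be passed in, and the function names are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    if n < k:
        raise ValueError("need at least k samples")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_x = [s for i, s in enumerate(samples) if i not in held_out]
        train_y = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_nearest_mean(xs, ys):
    """Placeholder classifier: remember the mean value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_nearest_mean(model, x):
    """Pick the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


# Made-up, well-separated 1-D data: 50 "neg" and 50 "pos" samples.
SAMPLES = list(range(50)) + list(range(100, 150))
LABELS = ["neg"] * 50 + ["pos"] * 50

accuracy = cross_validate(SAMPLES, LABELS, train_nearest_mean, predict_nearest_mean)
print(accuracy)
```

Because the model is always scored on samples it was not trained on, the averaged accuracy is not inflated by memorizing the training instances.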
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and for this I used Weka. It has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.
