Data Warehousing - OLAP operations - database

I would like to know how to find the standard deviation of final scores from a data warehouse (represented by a schema) for a university's gradebook, using OLAP operations (slicing, drilling). I cannot post the image of the schema because I don't have enough reputation points.
The schema has the following dimensions:
course
student
semester
instructor
department
gradebook
Could you please help with this?

I think you need to be more specific in your question. Are you talking about a specific product vendor in relation to OLAP databases: Analysis Services, Oracle OLAP, etc.? For example, if you are using Analysis Services, the MDX language has functions (StdDev and StdDevP) that calculate the sample and population standard deviation of an appropriate set you pass in. Oracle likewise doesn't use MDX but has an equivalent function (STDDEV).
If more generally you want to understand how to calculate the standard deviation, then it is simply a mathematical formula that has nothing to do with OLAP, and I would recommend Brandon Foltz's excellent, easy-to-consume series of videos on a broad range of statistical topics if you are interested. There are three on standard deviation on his blog; you will find them in the middle of the selected row (Statistics 101: Standard Deviation and NFL Field Goals - Parts 1/3, 2/3, 3/3).
Either way, the second resource will help you understand what to do with the first, once you consult the appropriate product documentation.
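For reference, the difference between the two MDX functions is the usual sample-versus-population distinction: the sample formula divides by n - 1, the population formula by n. Here is a minimal Python sketch of the two calculations (the final scores are made-up values) that any of these products performs under the hood:

```python
# Sample vs. population standard deviation -- the distinction behind
# MDX's StdDev/StdDevP. The scores below are invented for illustration.
import statistics

final_scores = [72.0, 85.5, 91.0, 64.5, 78.0]

sample_sd = statistics.stdev(final_scores)       # divides by n - 1 (StdDev)
population_sd = statistics.pstdev(final_scores)  # divides by n (StdDevP)

print(f"sample sd: {sample_sd:.3f}, population sd: {population_sd:.3f}")
```

In an OLAP setting you would first slice or drill down to the cells of interest (say, one course in one semester) and then hand that set of final scores to the product's standard deviation function.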

Related

Difference between homonyms and synonyms in data science with examples

Please share the difference between homonyms and synonyms in data science with examples.
Synonyms for concepts:
When you determine that two concepts are synonyms (say, sofa and couch), you use the class expression owl:equivalentClass. The entailment here is that any instance that was a member of class sofa is now also a member of class couch and vice versa. One of the nice things about this approach is that "context" of this equivalence is automatically scoped to the ontology in which you make the equivalence statement. If you had a very small mapping ontology between a furniture ontology and an interior decorating ontology, you could say in the map that these two are equivalent. In another situation if you needed to retain the (subtle) difference between a couch and a sofa, you do that by merely not including the mapping ontology that declared them equivalent.
Homonyms for concepts:
As Led Zeppelin says, "and you know sometimes words have two meaningsā€¦" What happens when a "word" has two meanings is that we have what WordNet would call "word senses." In a particular language, a set of characters may represent more than one concept. One example is the English word "mole," for which WordNet has 6 word senses. The Semantic Web approach is to give each its own namespace; for instance, I might refer to the counterspy mole as cia:mole and the burrowing rodent as the mammal:mole. (These are shortened qnames for what would be full namespace names.) The nice thing about this is, if the CIA ever needed to refer to the rodent they could unambiguously refer to mammal:mole.
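To make both patterns concrete, here is a rough sketch using Python's rdflib; every namespace and class name is a hypothetical stand-in for the ontologies described above:

```python
# Synonyms via owl:equivalentClass, homonyms via separate namespaces.
# All namespaces and names here are hypothetical illustrations.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

FURN = Namespace("http://example.org/furniture#")
DECOR = Namespace("http://example.org/decorating#")
CIA = Namespace("http://example.org/cia#")
MAMMAL = Namespace("http://example.org/mammal#")

g = Graph()

# Synonyms: a small mapping ontology declaring sofa and couch equivalent.
# Leave this triple out when the subtle distinction must be retained.
g.add((FURN.Sofa, OWL.equivalentClass, DECOR.Couch))

# Homonyms: the two senses of "mole" live in separate namespaces,
# so cia:mole and mammal:mole can never be confused.
assert CIA.mole != MAMMAL.mole
```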
1. Homonyms: words that have the same sound but differ in meaning.
2. Synonyms: words that have the same or almost the same meaning.
Homonyms
Machine learning algorithms are now the subject of ethical debate, where bias, in layman's terms, means a pre-formed view held before the facts are known. In machine learning and data mining, however, bias refers to an estimating procedure's tendency to produce estimates or predictions that are, on average, off target.
A rule's strength can be measured in a variety of ways, including its confidence. "Decision trees" are diagrams that show which decisions are being made and what consequences are available. To normalise a statistic is to rescale it so that it matches the scale of the other variables in the model.
Confidence is also a statistician's metric for how reliable a sample is (we are 95 percent confident that the average blood sugar in the group lies between X and Y, based on a sample of N patients). Decision tree algorithms, meanwhile, are methods that split data into pieces that become more and more homogeneous in the outcome measure as they advance.
A graph is a graphical representation of data that statisticians call plots and charts. To computer programmers, a graph is a data structure that captures the ties and links among items. And the act of organising relational database tables and their columns so that table relationships are consistent is known as normalisation.
Synonyms
Statisticians use the terms record, instance, sample, or example to describe a row of their data. What a statistician calls a variable may, in computer science and machine learning, be called an attribute, input variable, or feature. The term "estimation" is also used for prediction, though its use is generally limited to numeric outcomes.
In statistics, estimation more often refers to the use of a sample statistic to measure something. Predictive modelling involves developing aggregations of low-level predictors into more informative "features"; modelling in machine learning and artificial intelligence often begins with such very low-level prediction data.
The spreadsheet format, in which each column is a variable and each row is a record, is perhaps the most common non-time-series data format.

Predictions with QlikView

I am here to ask for some information about QlikView functions, and whether QlikView has any option or function for prediction.
My Requirements:
I have sales data from 2013 and 2014 and I want to predict the sales for 2015. What functions can I use in QlikView to predict from this data?
Not only sales: I have similar data for production and training for specific locations and machines, so if this works successfully for sales I can implement predictions for the other departments too.
As there are a lot of techniques and methods related to prediction, I want to know which technique I need to apply in QlikView, and how.
Thank you
As you said, there are a lot of techniques and methods and you would have to combine them in QlikView as there's no one function that can do it for you. I would look into time series modelling (https://en.wikipedia.org/wiki/Time_series)
There's a good three-part video tutorial on YouTube about time series modelling (https://www.youtube.com/watch?v=gHdYEZA50KE&feature=youtu.be). Although it is done in Excel, you can apply the same techniques in QlikView.
You would probably have to use linear regression. QlikView provides some analytical functions which you can use to calculate the slope and the y-intercept of a linear regression (linest_m and linest_b).
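To see what those two functions compute, the same slope and y-intercept can be reproduced outside QlikView with an ordinary least-squares fit. A minimal Python sketch, with invented monthly figures standing in for the 2013-2014 sales history:

```python
# Least-squares slope/intercept, analogous to QlikView's linest_m/linest_b.
# The monthly sales below are invented for illustration.
import numpy as np

months = np.arange(24)                                   # Jan 2013 .. Dec 2014
sales = 100 + 2.5 * months + np.random.normal(0, 5, 24)  # fake history

slope, intercept = np.polyfit(months, sales, 1)          # degree-1 fit

# Extrapolating the trend line over months 24..35 gives the 2015 forecast.
forecast_2015 = slope * np.arange(24, 36) + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(forecast_2015.round(1))
```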
All in all I have found QlikView not to be very good at calculating such things. For example, if you find that instead of linear regression, polynomial regression fits your data better then you would have to implement a lot of it by yourself. Maybe it would be wise to use some statistical programming language (e.g. R, Octave) and present the results in QlikView.

Simple database setup for MALDI peaks

I have a very simple problem to which I could come up with a crude solution, but it seems to me that there is probably some off-the-shelf answer.
Problem: I have a list of discrete values (these are mass units) that I want to find within a database of discrete values (known mass units) and their identities, allowing for some inexact matching. Example: if I am looking for 500.23 in the database, then anything +/- 0.025 would be considered a match (50 ppm, or 0.005%). This tolerance should be adjustable. So in this example, 500.23 might return the database entry 500.25, which is Compound A.
I could also make this tool myself if someone would like to suggest the most straightforward approach. I am competent in Matlab, somewhat in R, good in Excel, poor in Access, and don't know anything about SQL. The best case would be for this tool to be usable by non-coders.
Background: The real background of this problem is that I have MALDI TOF data where I have identified peaks of interest from an experiment (masses; m/z). These masses correspond to molecules that were released after enzymatic digestion. This class of molecule has reported masses with known identities, but unlike peptide mass fingerprinting, or metabolomic databases, these known masses are mostly unpublished and/or uncollated, so I would like to cross-reference them with a database of my own making. Each mass corresponds to one identity. The masses will not match exactly, and being able to search with a specified mass tolerance is key.
There are plenty of mass spectrometer data solutions you may want to look at. For example: http://www.ionsource.com/links/programs.htm
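If you do end up building the tool yourself, the matching logic is only a few lines in any of the languages you list. A minimal Python sketch, with an invented compound table standing in for your database and the ppm tolerance kept adjustable:

```python
# Tolerance-based mass lookup as described in the question.
# The reference table is a made-up stand-in for the real database.
reference = {
    500.25: "Compound A",
    657.41: "Compound B",
    812.33: "Compound C",
}

def match_mass(query, table, ppm=50.0):
    """Return (known_mass, identity) pairs within +/- ppm of the query."""
    tol = query * ppm * 1e-6  # e.g. 500.23 * 50e-6 ~= 0.025
    return [(m, name) for m, name in table.items() if abs(m - query) <= tol]

print(match_mass(500.23, reference))  # -> [(500.25, 'Compound A')]
```

For a large table, sort the known masses once and use binary search (Python's bisect module) instead of the linear scan; non-coders could then drive this through a tiny GUI or a one-line command.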

Use case for incremental supervised learning using apache mahout

Business case:
Forecasting fuel consumption at a site.
Say fuel consumption C depends on various factors x1, x2, ..., xn. So, mathematically speaking, C = F(x1, x2, ..., xn). I do not have any explicit equation for F.
I do have a historical dataset from which I can get the correlation of C to x1, x2, etc. C, x1, x2, ... are all quantitative. Finding the correlation for an n-variable equation seems tough for a person like me with limited statistical knowledge.
So, I was thinking of employing some supervised machine learning techniques for the same. I will train a classifier with the historic data to get a prediction for the next consumption.
Question: Am I thinking in the right way?
Question: If this is correct, my system should be an evolving one: the more real data I feed to the system, the more the model evolves to make better predictions the next time. Is this a correct understanding?
If the above statements are true, will the AdaptiveLogisticRegression algorithm, as present in Mahout, be of help to me?
Requesting advice from the experts here!
Thanks in advance.
OK: correlation is not a forecasting model. Correlation simply ascribes some relationship between the datasets based on covariance.
In order to develop a forecasting model, what you need to perform is regression.
The simplest form of regression is linear univariate, where C = F(x1). This can easily be done in Excel. However, you state that C is a function of several variables. For this, you can employ linear multivariate regression. There are standard packages that can perform this (within Excel, for example), or you can use Matlab, etc.
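As a concrete illustration of the multivariate case, here is a rough Python sketch; the dataset is invented, and the intercept is handled by appending a column of ones before the least-squares solve:

```python
# Linear multivariate regression for C = F(x1, ..., xn) via least squares.
# X and C below are invented placeholders for the historical dataset.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 observations, 3 factors
C = 4.0 + X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 0.1, size=100)

A = np.column_stack([np.ones(len(X)), X])      # intercept column + factors
coef, *_ = np.linalg.lstsq(A, C, rcond=None)   # [intercept, b1, b2, b3]

x_new = np.array([0.2, -1.0, 0.5])             # next period's factor values
prediction = coef[0] + x_new @ coef[1:]
print(coef.round(2), prediction.round(2))
```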
Now, we are assuming that there is a "linear" relationship between C and the components of X (the input vector). If the relationship were not linear, then you would need more sophisticated methods (nonlinear regression), which may very well employ machine learning methods.
Finally, some series exhibit auto-correlation. If this is the case, then it may be possible for you to ignore the C = F(x1, x2, x3...xn) relationships, and instead directly model the C function itself using time-series techniques such as ARMA and more complex variants.
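For the auto-correlated case, here is a minimal sketch of modelling C directly as a time series (statsmodels' ARIMA is used here; the series and the model order are invented placeholders):

```python
# Modelling consumption directly as a time series when it is auto-correlated.
# The series is invented; order=(1, 1, 1) is just a placeholder choice.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

consumption = 50 + np.cumsum(np.random.normal(0, 1, 120))  # fake monthly C

model = ARIMA(consumption, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))  # forecast the next six periods
```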
I hope this helps,
Srikant Krishna

"Parametrized" database model & backend storage system as well as data mining manipulation

I have implicitly made this a community wiki seeing that the answers can be quite broad.
I'm working with a start-up company to accomplish the following goal.
In medical research, a patient's medical record can hold an effectively infinite amount of data regarding that patient for a specific diagnosis; e.g. a smoker has a higher chance of developing lung cancer, but that doesn't necessarily mean a non-smoker cannot develop it. My goal is to create/use a database model that can deal with such parameters.
Now, I also have to come up with ways to data mine these parametrized data to produce statistics, e.g. to see the trends among all 40-year-old females who suffered from lung cancer. The report can be generic (graph, tabular, etc.), so that doctors can see trends or analyse possible solutions that might work.
My questions are:
1) Which database systems allow for parametrized backend storage (e.g. Cassandra), can easily be used from Java, and are very efficient in data retrieval, linkage, etc.? We are dealing with a large number of patient records per state.
2) What algorithms or AI techniques can I use for data mining? Are there any mining techniques out there that can help me do this?
PS: How does Google Analytics deal with parametrized data?
PPS: Parametrized data is data consisting of a key and a value, where the value can be a scalar, another key-value pair, a list of values, or a set of parametrized data (organized or unorganized).
I'm looking forward to your suggestions! :-D
I'll try to answer your first question only.
Cassandra is a key-value datastore (in your case, parametrized). If you use Cassandra, you need more computation time to derive complex reports, the reason being that it stores data in raw format. NoSQL databases like Cassandra are good if you want to scale very, very big; they are eventually consistent and make compromises around data replication and latency.
In your case, as a patient can have data in practically any form, try to fit the model of a triple store (Semantic Web frameworks like Jena, Sesame, etc.). They allow you to have loose data structures that can be molded at runtime. Also, their query engines (SPARQL, SeRQL) give you more power than NoSQL stores (like Cassandra), although these querying capabilities are obviously weaker than those of an RDBMS.
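As a toy illustration of the triple-store approach (using Python's rdflib rather than Jena or Sesame; every name is hypothetical), here is how the "40-year-old females with lung cancer" trend query from the question might look in SPARQL:

```python
# Open-ended patient attributes as triples, queried with SPARQL.
# rdflib stands in for Jena/Sesame; all names are hypothetical.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/medical#")
g = Graph()
g.add((EX.patient1, EX.age, Literal(40)))
g.add((EX.patient1, EX.sex, Literal("female")))
g.add((EX.patient1, EX.diagnosis, Literal("lung cancer")))
g.add((EX.patient1, EX.smoker, Literal(False)))  # any extra attribute just works

results = g.query("""
    PREFIX ex: <http://example.org/medical#>
    SELECT ?p WHERE {
        ?p ex:age 40 ;
           ex:sex "female" ;
           ex:diagnosis "lung cancer" .
    }""")
for row in results:
    print(row.p)
```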
For this question, here is how we implemented it.
We created a keyspace called medical and a supercolumn family called patient.
Under the supercolumn family, we have a general supercolumn, which basically stores the patient details, and another supercolumn called operation to keep records of the patient's operations.
Don't forget that the general supercolumn keeps a record of the patient each time he/she comes to the doctor. That way, we know exactly the patient's condition before, during and after an operation.
I know some data can be duplicated, but no two supercolumns can be identical, as there is no way you can have two different patients with exactly identical attributes and sickness.
So basically, Cassandra allows three layers of abstraction: keyspace, column/supercolumn family, column/supercolumn.
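For completeness: supercolumns are legacy Cassandra these days, so if you were rebuilding this layout now, the same three levels would typically be expressed with a clustering column instead. A rough sketch with the DataStax Python driver; all names are hypothetical:

```python
# Rough modern-CQL equivalent of the medical/patient layout above.
# A clustering column (record_type) plays the supercolumn role.
# All names are hypothetical; requires the DataStax cassandra-driver.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS medical
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS medical.patient (
        patient_id  uuid,
        record_type text,        -- 'general' or 'operation', like the supercolumns
        recorded_at timestamp,
        attributes  map<text, text>,  -- open-ended key-value patient details
        PRIMARY KEY (patient_id, record_type, recorded_at)
    )
""")
```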
Hope this can help somebody.
