How are OLAP, OLTP, data warehouses, analytics, analysis and data mining related? - database

I'm trying to understand what OLAP, OLTP, data mining, analytics etc. are about, and I feel like my understanding of some of these concepts is still a bit vague. Information about these subjects tends to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research I understand the following about these concepts, is it correct?
Analysis is decomposing something complex to understand its inner workings better.
Analytics is predictive analysis of information, which requires a lot of math and statistics.
There are many types of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams and are therefore easier to update, because they are in normalized form.
In contrast, OLAP uses denormalized star schemas and is therefore easier to query.
OLAP is used for predictive analysis, and OLTP is usually used in more practical situations since there's no redundancy.
A data warehouse is a type of OLAP database, and usually consists of multiple other databases.
Data mining is a tool used in analytics, where you use computer software to find relationships between data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?

I will try to explain it from the top of the pyramid:
Business Intelligence (which you didn't mention) is a term in IT for a complex system that produces useful information about a company from its data.
So a BI system's target is clean, accurate and meaningful information.
Clean means there are no technical problems (missing keys, incomplete data, etc.). Accurate means exactly that: BI systems are also used as fault checkers for the production database (logical faults, e.g. an invoice amount is too high, or an inactive partner is used, etc.). This is accomplished with rules. Meaningful is harder to explain, but in simple English it means all your data (even the Excel table from the last meeting), presented the way you want it.
So a BI system's back end is the data warehouse.
A DWH is nothing more than a database (an instance, not software). It can be stored in an RDBMS, an analytical database (columnar or document-store types), or a NoSQL database.
Data warehouse is the term usually used for the whole database described above. It may contain a number of data marts (the Kimball model), which is the more common approach, or be a relational system in 3rd normal form (the Inmon model) called an enterprise data warehouse.
Data marts are related tables inside the DWH (star schema, snowflake schema): a fact table (a business process in denormalized form) and dimension tables.
Each data mart represents one business process. Example: a DWH has 3 data marts. One is retail sales, the second is export, and the third is import. In retail you can see total sales, quantity sold, import price, profit (measures) by SKU, date, store, city, etc. (dimensions).
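To make the retail example concrete, a minimal sketch of such a star schema and a typical query against it might look like this (all table and column names are invented):

    -- Dimension tables hold the descriptive context (who, what, where, when)
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, sku VARCHAR(20), category VARCHAR(50));
    CREATE TABLE dim_store   (store_key   INT PRIMARY KEY, store_name VARCHAR(50), city VARCHAR(50));
    CREATE TABLE dim_date    (date_key    INT PRIMARY KEY, full_date DATE, month INT, year INT);

    -- The fact table holds one row per sale (the business process) with the measures
    CREATE TABLE fact_retail_sales (
        product_key  INT REFERENCES dim_product(product_key),
        store_key    INT REFERENCES dim_store(store_key),
        date_key     INT REFERENCES dim_date(date_key),
        qty_sold     INT,
        sales_amount DECIMAL(12,2),
        profit       DECIMAL(12,2)
    );

    -- "Total sales and profit by city and month" is then a simple join and group-by
    SELECT s.city, d.year, d.month,
           SUM(f.sales_amount) AS total_sales,
           SUM(f.profit)       AS total_profit
    FROM fact_retail_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    JOIN dim_date  d ON d.date_key  = f.date_key
    GROUP BY s.city, d.year, d.month;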
Loading data into the DWH is called ETL (extract, transform, load):
Extract data from multiple sources (ERP db, CRM db, Excel files, web services...)
Transform data (clean data, connect data from different sources, match keys, mine data)
Load data (load the transformed data into the specific data marts)
Edit (because of a comment): the ETL process is usually created with an ETL tool, or manually with a programming language (Python, C#, etc.) and APIs.
An ETL process is a group of SQL statements, procedures, scripts and rules, related and separated into the 3 parts above, controlled by metadata.
It is either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
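As a rough illustration of the load step, a nightly batch can be little more than an INSERT ... SELECT that swaps natural keys for dimension keys (names are invented; date arithmetic varies by engine):

    -- Load the newly staged sales rows into the fact table, translating
    -- natural keys (SKU, store name, date) into surrogate dimension keys
    INSERT INTO fact_retail_sales (product_key, store_key, date_key, qty_sold, sales_amount, profit)
    SELECT p.product_key, s.store_key, d.date_key,
           st.qty, st.amount, st.amount - st.cost
    FROM staging_sales st
    JOIN dim_product p ON p.sku        = st.sku
    JOIN dim_store   s ON s.store_name = st.store_name
    JOIN dim_date    d ON d.full_date  = st.sale_date
    WHERE st.load_date = CURRENT_DATE - 1;  -- only yesterday's batch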
OLTP and OLAP are types of data processing. OLTP is used for transactional purposes, between a database and software (usually only one way of getting data in and out).
OLAP is for analytical purposes, which means multiple sources, historical data, high SELECT query performance, and mined data.
Edit (because of a comment): data processing is the way data is stored in and accessed from a database. So, based on your needs, the database is set up in a different way.
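A contrived pair of queries shows the difference in access pattern (table and column names are invented):

    -- OLTP: touch one row by key, inside a transaction
    SELECT balance FROM account WHERE account_id = 42;
    UPDATE account SET balance = balance - 100 WHERE account_id = 42;

    -- OLAP: scan and aggregate millions of historical rows across dimensions
    SELECT c.region, d.year, SUM(f.amount) AS total_paid
    FROM fact_payments f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date     d ON d.date_key     = f.date_key
    GROUP BY c.region, d.year;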
(Image from http://datawarehouse4u.info/ not included.)
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight into a business process, or even a forecast.
Analysis describes the act, which in the BI world means the ease of getting the information you ask for out of the data. Multidimensional analysis describes how the system slices your data (with dimensions inside a cube). Wikipedia says that analysis of data is a process of inspecting data with the goal of discovering useful information.
Analytics is a noun: it represents the result of the analysis process.
Don't make too much fuss about those two words.

I can tell you about data mining, as I had a project on it. Data mining is not a tool; it's a method of mining data, and the different tools used for it are WEKA, RapidMiner, etc. Data mining follows many algorithms which are built into tools like WEKA and RapidMiner, such as clustering algorithms, association algorithms, etc.
A simple example of data mining: a teacher is teaching a science subject in a class using different teaching methods, such as chalkboard, presentation and practical work. Our aim is to find which method suits the students best. We run a survey and take the students' opinions: 40 students like the chalkboard, 30 like the presentation, and 20 like the practical method. With the help of this data we can derive rules, for example: the science subject should be taught by the chalkboard method.
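The counting part of that survey could be expressed as a simple aggregation (tools like WEKA automate the actual rule discovery; table and column names here are invented):

    -- Count how many students prefer each teaching method
    SELECT teaching_method, COUNT(*) AS votes
    FROM student_survey
    WHERE subject = 'Science'
    GROUP BY teaching_method
    ORDER BY votes DESC;  -- top row (chalkboard, 40 votes) suggests the rule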
To learn about the different algorithms you can use Google :D.

Related

When a data model becomes a data nightmare

I am revising a data model for an educational company. So not your straightforward retail or finance model.
The problem is the company wants to use the data warehouse to produce listing reports, the kind of reports that should normally be produced by the Education Management System (EMS).
Reports like class lists, detailed learner information reports, guardian contact, work and financial information, and academic reports.
My argument thus far has been that a data warehouse houses an analytical data model used for data analytics, not a reporting model for an education management system with a far more complex relational database.
The current model has (forgive the crude diagram) snowflaked completely out of control, in such a way that reporting and analytical tools are struggling to interpret it. The warehouse is starting to resemble the RDBMS model more and more. There are so many relationships between dimensions in order to keep data together.
Some of the tables contain so many unnecessary attributes that have no analytical value and purely exist for a listing report.
I need some opinions/criticism (possibly as good reference) regarding this approach so I can better explain the problem to a business who is oblivious to the concept of data modelling. I need to make them understand that butchering the DW to handle detailed reporting is going to end badly for them.

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I clean up a database with no relationships, no foreign keys, and not a whole lot of structure?
I'm an amateur at SQL, and I've inherited a database that is a complete mess. We have no sort of referential integrity, and there's not a whole lot of logic to how the tables are working.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adopt.
My question is: how do I take my mess of a database and make something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up of "isolated" tables, I would advise you to start by aggregating all of that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or ad-hoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions, and the transactions (the metrics) will be the facts. For example, without knowing your model, I'm guessing that the immediate entities would include Customer, Site and Part, and transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc...
More information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
When your model is designed (and implemented in a database like SQL Server) you'll then load data into the model, by extracting it from its different source systems/databases and transforming it from the current structure into the structure defined by the model, namely by using an ETL tool like MS Integration Services. For example, your Customer data may be scattered across the "sales", "customer" and "site" tables, so you want to aggregate all that data and load it into a single Customer dimension table. It's when doing this ETL that you should check your data for the problems you already mentioned, loading correct rows into your data model and discarding incorrect rows into a file/log where they can later be checked and corrected (there are multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and context of the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model closely matches business users' terminology and mental models.
use an ETL tool to extract the data from its current sources, process it (e.g. check for data quality problems, join data from different sources) and load it into the dimensional model (a rough sketch follows below). This will get you close to having an automated data integration job/pipeline with the quality checks you deem fit for the data.
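As a very rough illustration of that ETL step, the statement that builds a single Customer dimension from the scattered source tables might look something like this (table and column names are assumptions; de-duplication and quality checks are omitted for brevity):

    -- Merge customer attributes from the "customer", "site" and "sales" source tables
    INSERT INTO dim_customer (customer_id, customer_name, primary_site, last_sale_date)
    SELECT c.customer_id,
           c.name,
           s.site_code,
           MAX(sa.sale_date)
    FROM src_customer c
    LEFT JOIN src_site  s  ON s.customer_id  = c.customer_id
    LEFT JOIN src_sales sa ON sa.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name, s.site_code;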

DataWarehouse - What is a good definition?

Could someone give me a good, practical definition of what a data warehouse is?
I'm surprised no one has posted Inmon's definition:
A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
From the same page you can pick up Kimball's definition:
A copy of transaction data specifically structured for query and analysis.
I think that, unfortunately, data warehousing is a wide-ranging field. There is a lot of variety with very few standard paradigms; specifically I'm thinking of Kimball's dimensional modelling. Inmon does not have as specific a methodology as Kimball's, and thus some 3NF models may or may not conform to his principles.
Because Inmon has broadened his scope for what warehousing is meant to accomplish, it can encompass unstructured data. However, analysis of unstructured data is very different from traditional analysis.
As applied to SQL Server, typically the largest Data Warehouses on SQL Server are modelled dimensionally, because this lends itself well to the non-distributed, non-massively parallel model. Massively parallel systems like Teradata generally perform a lot better with 3NF models. These are still table-based systems with the various tables connected with foreign key constraints (perhaps not enforced, but at least logical).
Of course, we are also seeing NoSQL data processing systems like Map/Reduce which are not really databases at all in the sense of normalized, denormalized or non/poorly-normalized relational databases which we have had for 40 years now.
I just started with data warehousing and business intelligence, and looking around the web you can find some interesting links:
Get Start With Datawarehousing
I think these two links could help you to understand the concepts of data warehousing.
Sorry, I'm new, I can post only one link ^^
A database optimized for retrieval: in general denormalized data, usually a star schema (but it could be a snowflake), using dimensional modelling (fact and dimension tables).
While this is not an academic definition, it might serve as a practical one. A data warehouse is a collection of datamarts and will combine datasets across the breadth of an organization.
A datamart will contain datasets specific to certain portions of the business. In the datamart you will find fact tables, measurable pieces of information, along with dimensions, attributes of your measurable pieces.
A true data warehouse will have conformed dimension tables that can be shared across datamarts.
An example...
Your company may build a datamart around sales. And another datamart around human resources. If the customer dimension table is shared across both these datamarts, it would be considered a conformed dimension. All three of these entities together would make up a data warehouse.
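In schema terms, "conformed" just means both data marts reference the same (or an identically defined) customer dimension table. A minimal sketch, with invented table and column names:

    -- One shared (conformed) customer dimension...
    CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100), segment VARCHAR(50));

    -- ...referenced by the sales data mart
    CREATE TABLE fact_sales (
        customer_key INT REFERENCES dim_customer(customer_key),
        date_key     INT,
        sales_amount DECIMAL(12,2)
    );

    -- ...and by the human resources data mart
    CREATE TABLE fact_hr_activity (
        customer_key INT REFERENCES dim_customer(customer_key),
        date_key     INT,
        hours_worked INT
    );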
As someone else stated you can find more detailed information by searching for Ralph Kimball's Data Strategies.
Definition: a data warehouse is a database used for analysis purposes rather than for transaction processing.
Check the link below for more information on data warehouses:
http://www.idatastage.com/datawarehouse/

Architecture for database analytics

We have an architecture where we provide each customer Business Intelligence-like services for their website (internet merchant). Now, I need to analyze those data internally (for algorithmic improvement, performance tracking, etc...) and they are potentially quite heavy: we have up to millions of rows / customer / day, and I may want to know how many queries we had in the last month, compared week by week, etc... that is of the order of billions of entries if not more.
The way it is currently done is quite standard: daily scripts which scan the databases and generate big CSV files. I don't like this solution for several reasons:
as typical with those kinds of scripts, they fall into the write-once and never-touched-again category
tracking things in "real-time" is necessary (we have separate toolset to query the last few hours ATM).
this is slow and non-"agile"
Although I have some experience in dealing with huge datasets for scientific usage, I am a complete beginner as far as traditional RDBMSs go. It seems that using a column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issue.
You will want to google Star Schema. The basic idea is to model a special data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provide the type of aggregations you describe. This instance will be composed of facts and dimensions.
For example, sales 'facts' can be modeled to provide analytics based on customer, store, product, time and other 'dimensions'.
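A minimal sketch of what that could look like, and the kind of week-by-week aggregate the question asks about (all names are invented):

    -- One row per sale event, keyed to its dimensions
    CREATE TABLE fact_sales (
        customer_key INT, store_key INT, product_key INT, date_key INT,
        quantity INT, amount DECIMAL(12,2)
    );

    -- "How did last month go, week by week?"
    SELECT d.week_of_year, COUNT(*) AS sales, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE d.year = 2015 AND d.month = 5     -- assumed dim_date columns
    GROUP BY d.week_of_year
    ORDER BY d.week_of_year;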
You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.
There are special DBs for analytics like Greenplum, Aster Data, Vertica, Netezza, Infobright and others. You can read about those DBs on this site: http://www.dbms2.com/
The canonical handbook on star-schema style data warehouses is Ralph Kimball's "The Data Warehouse Toolkit" (there's also "Clickstream Data Warehousing" in the same series, but that is from 2002, I think, and somewhat dated; if there's a newer edition of the Kimball book it might serve you better). If you google for "web analytics data warehouse" there are a bunch of sample schemas available to download and study.
On the other hand, a lot of the NoSQL work that happens in real life is based around mining clickstream data, so it might be worth seeing what the Hadoop/Cassandra/[latest-cool-thing] community has in the way of case studies, to see if your use case matches well with what they can do.

Star-Schema Design [closed]

Is a Star-Schema design essential to a data warehouse? Or can you do data warehousing with another design pattern?
Using star schemas for a data warehouse system gets you several benefits and in most cases it is appropriate to use them for the top layer. You may also have an operational data store (ODS) - a normalised structure that holds 'current state' and facilitates operations such as data conformation. However there are reasonable situations where this is not desirable. I've had occasion to build systems with and without ODS layers, and had specific reasons for the choice of architecture in each case.
Without going into the subtleties of data warehouse architecture or starting a Kimball vs. Inmon flame war, the main benefits of a star schema are:
Most database management systems have facilities in the query optimiser to do 'Star Transformations' that use Bitmap Index structures or Index Intersection for fast predicate resolution. This means that selection from a star schema can be done without hitting the fact table (which is usually much bigger than the indexes) until the selection is resolved (see the sketch after this list).
Partitioning a star schema is relatively straightforward, as only the fact table needs to be partitioned (unless you have some biblically large dimensions). Partition elimination means that the query optimiser can ignore partitions that could not possibly participate in the query results, which saves on I/O.
Slowly changing dimensions are much easier to implement on a star schema than a snowflake.
The schema is easier to understand and tends to involve fewer joins than a snowflake or E-R schema. Your reporting team will love you for this.
Star schemas are much easier to use and (more importantly) to make perform well with ad-hoc query tools such as Business Objects or Report Builder. As a developer you have very little control over the SQL generated by these tools, so you need to give the query optimiser as much help as possible. Star schemas give the query optimiser relatively little opportunity to get it wrong.
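To illustrate the first two points, a rough sketch (partitioning syntax varies by engine, and all names are invented): the fact table is partitioned on the date key, and the query predicates sit on the small dimension tables, so the optimiser can resolve them and prune fact partitions before touching the fact rows.

    -- Fact table partitioned on the date key (PostgreSQL-style syntax shown)
    CREATE TABLE fact_sales (
        date_key INT, store_key INT, product_key INT, amount DECIMAL(12,2)
    ) PARTITION BY RANGE (date_key);

    -- Predicates sit on the small dimension tables; the optimiser can resolve
    -- them first (star transformation) and prune fact partitions by date_key
    SELECT p.category, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_date    d ON d.date_key    = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    WHERE d.year = 2023 AND p.category = 'Bikes'
    GROUP BY p.category;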
Typically your reporting layer would use star schemas unless you have a specific reason not to. If you have multiple source systems you may want to implement an Operational Data Store with a normalised or snowflake schema to accumulate the data. This is easier because an ODS typically does not do history. Historical state is tracked in star schemas where this is much easier to do than with normalised structures. A normalised or snowflaked Operational Data Store reflects 'current' state and does not hold a historical view over and above any that is inherent in the data.
ODS load processes are concerned with data scrubbing and conforming, which is easier to do with a normalised structure. Once you have clean data in an ODS, dimension and fact loads can track history (changes over time) with generic or relatively simple mechanisms; this is much easier to do with a star schema, and many ETL tools provide built-in facilities for slowly changing dimensions, so implementing a generic mechanism is relatively straightforward.
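For illustration, a Type 2 slowly changing dimension update can be as simple as expiring the current row and inserting a new one (a minimal sketch with invented names):

    -- Customer 42 changed city: expire the current dimension row...
    UPDATE dim_customer
    SET valid_to = CURRENT_DATE, is_current = 0
    WHERE customer_id = 42 AND is_current = 1;

    -- ...and insert a new current row carrying the changed attribute
    INSERT INTO dim_customer (customer_id, customer_name, city, valid_from, valid_to, is_current)
    VALUES (42, 'Acme Ltd', 'Berlin', CURRENT_DATE, NULL, 1);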
Layering the system in this way provides a separation of responsibilities - business and data cleansing logic is dealt with in the ODS, and the star schema loads deal with historical state.
There is an ongoing debate in the data warehousing literature about where in the data warehouse architecture the star schema design should be applied.
In short, Kimball advocates using only the star schema design in the data warehouse, while Inmon first wants to build an enterprise data warehouse using a normalized 3NF design and only later apply the star schema design in the data marts.
In addition here to you could also say that Snowflake schema design is another approach.
A fourth design could be the Data Vault Modeling approach.
Star schemas are used to enable high-speed access to large volumes of data. The high performance is enabled by reducing the number of joins needed to satisfy any query that may be made against the subject area. This is done by allowing data redundancy in dimension tables.
You have to remember that the star schema is a pattern for the top layer of the warehouse. All models also involve staging schemas at the bottom of the warehouse stack, and some also include a persistent transformed merged staging area where all source systems are merged into a 3NF-modelled schema. The various subject areas sit above this.
Alternatives to star schemas at the top level include a variation, the snowflake schema. A newer method that may bear some investigation as well is Data Vault Modelling, proposed by Dan Linstedt.
The thing about star schemas is they are a natural model for the kinds of things most people want to do with a data warehouse. For instance it is easy to produce reports with different levels of granularity (month or day or year for example). It is also efficient to insert typical business data into a star schema, again a common and important feature of a data warehouse.
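For example, rolling the same fact table up to different granularities is just a matter of which date-dimension columns you group by (names are invented):

    -- Daily
    SELECT d.full_date, SUM(f.amount) AS revenue
    FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.full_date;

    -- Monthly: same fact table, just group by coarser date attributes
    SELECT d.year, d.month, SUM(f.amount) AS revenue
    FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.month;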
You certainly can use any kind of database you want but unless you know your business domain very well it is likely that your reports will not run as efficiently as they could if you had used a star schema.
Star schemas are a natural fit for the last layer of a data warehouse. How you get there is another question. As far as I know, there are two big camps, those of Bill Inmon and Ralph Kimball. You might want to look at the theories of these two guys if/when you decide to go with a star.
Also, some reporting tools really like the star schema setup. If you are locked into a specific reporting tool, that might drive what the reporting mart looks like in your warehouse.
Star schema is a logical data model for relational databases that fits regular data warehousing needs; if the relational environment is a given, a star or a snowflake schema will be a good design pattern, hard-wired into lots of DW design methodologies.
There are, however, non-relational database engines too, and they can be used for efficient data warehousing. Multidimensional storage engines might be very fast for OLAP tasks (e.g. TM1); we cannot apply star schema design in that case. Other examples requiring special logical models include XML databases and column-oriented databases (e.g. the experimental C-Store).
It's possible to do without. However, you will make life hard for yourself -- your organization will want to use standard tools that live on top of DWs, and those tools will expect a star schema -- a lot of effort will be spent fitting a square peg in a round hole.
A lot of database-level optimizations assume that you have a star schema; you will spend a lot of time optimizing and restructuring to get the DB to do "the right thing" with your not-quite-star layout.
Make sure that the pros outweigh the cons.
(Does it sound like I've been there before?)
-D
There are three problems we need to solve.
1) How to get the data out of the operational source systems without putting undue pressure on them by joining tables within and between them, cleaning data as we extract, creating derivations, etc.
2) How to merge data from disparate sources - some legacy, some file-based, from different departments - into an integral, accurate, efficiently stored whole that models the business and does not reflect the structures of the source systems. Remember, systems change / are replaced relatively quickly, but the basic model of the business changes slowly.
3) How to structure the data to meet specific analytical and reporting requirements for particular people/departments in the business as quickly and accurately as possible.
The solutions to these three very different problems require different architectural layers.
Staging Layer
We replicate the structures of the sources, but only changed data from the sources is loaded each night. Once the data is taken from the staging layer into the next layer, it is dropped. Queries are single-table queries with a simple date/time filter. Very little effect on the source.
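A minimal sketch of such a staging extract, assuming the source rows carry a modification timestamp (names and date arithmetic are illustrative and vary by engine):

    -- Pull only rows changed since the last nightly run into the staging copy
    INSERT INTO staging_orders
    SELECT *
    FROM source_orders
    WHERE last_modified >= CURRENT_DATE - 1;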
Enterprise Layer
This is a business oriented 3rd normal form database. Data is extracted (and afterward dropped) from the staging layer into the enterprise layer, where it is cleaned, integrated and normalised.
Presentation (Star Schema) Layer
Here, we model dimensionally to meet specific requirements. Data is deliberately de-normalised to reduce the number of joins. Hierarchies that may occupy several tables in the Enterprise Layer are collapsed into a single dimension table, and multiple transactional tables may be merged into single fact tables.
You always face these three problems. If you choose to do away with the enterprise layer, you still have to solve the second problem, but you have to do it in the star schema layer, and in my view, this is the wrong place to do it.
