What is the difference between data modelling and dimensional modelling?

I've been learning about data warehousing concepts and I found these two topics a little confusing. I've read multiple blog posts and I understood that data modelling consists of three steps:
Conceptual Data Model
Logical Data Model
Physical Data Model
and in data warehousing we need to perform certain steps:
Step 1: Identify the dimensions
Step 2: Identify the measures
Step 3: Identify the attributes or properties of dimensions
Step 4: Identify the granularity of the measures
Are these modelling techniques related to each other? If yes, how are they related?
If someone asks how to design a data warehouse, what would be the correct answer? Where do these modelling techniques come in when designing a data warehouse?
It would be really helpful if someone could point me to a link/resource about data modelling and dimensional modelling scenarios.

As the name suggests, a conceptual model is very high level and does not correspond directly to what actually gets built. Logical/physical models do correspond to what you are actually going to build - the difference between the two is that a logical model is system-independent, while a physical model is tied to the platform/DB where it is going to be deployed. Apart from that they are fundamentally equivalent, to the point that most modelling tools can automatically generate a physical model from a logical one (and vice versa).
A dimensional model is a type of logical/physical model, in the same way that OLTP, Inmon, Data Vault, etc. are types of logical/physical model. There are normally best practices defined for the steps required to design each of these model types - and you have listed the steps specific to designing a Dimensional model.
So for a given data domain (e.g. a Sales organisation), you would normally have a single Conceptual model and then multiple logical/physical models. Usually these would be one transactional model and one analytical model; the transactional model could be OLTP or NoSQL (or whatever suits your requirements/technology the best); the analytical model could be Dimensional, Inmon, Graph, etc. - again whatever suits your data/analytical requirements the best.
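To make this concrete, here is a minimal sketch (using Python's built-in sqlite3, with hypothetical table and column names) of the physical model the four dimensional-design steps from the question might produce for a Sales domain - the grain is one row per order line, the measures are quantity and amount, and the dimensions are date and product:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Steps 1 and 3: identify the dimensions and their attributes
    conn.execute("""
        CREATE TABLE dim_date (
            date_key  INTEGER PRIMARY KEY,  -- e.g. 20240131
            full_date TEXT,
            year      INTEGER,
            month     INTEGER
        )""")
    conn.execute("""
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            sku         TEXT,
            name        TEXT,
            category    TEXT
        )""")

    # Steps 2 and 4: identify the measures and their granularity
    # (here: one fact row per order line)
    conn.execute("""
        CREATE TABLE fact_sales (
            date_key     INTEGER REFERENCES dim_date (date_key),
            product_key  INTEGER REFERENCES dim_product (product_key),
            quantity     INTEGER,  -- measure
            sales_amount REAL      -- measure
        )""")

The same conceptual model (a Sales organisation) could equally be realised as a normalised OLTP schema; the dimensional layout above is just one of the possible logical/physical models.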

Related

How to design large Database, Entity–relationship model

I normally design a database's entity-relationship model with Visio and then the relational model. But now I have to design a DB with a lot of entities and relationships,
which ends up looking very confusing in Visio. So my question is: are there any programs which can display large DB models easily, or are there any design approaches for large DBs?
Check this post: http://www.datasciencecentral.com/profiles/blogs/top-6-data-modeling-tools
Usually, you don't have to visualize all the entities and relationships in a single diagram. You partition your entities into smaller concepts/modules, etc. The repository of the tool you use (which I believe includes Visio) holds all the entities you have defined, and a selected subset of them is visualized in each diagram.

Why is it said that dimensional models (DM/DW) are denormalized when most of them are in 1NF?

Currently I am working with the Dimensional modeling / Data Warehouse / Data Mart.
"Dimensional modeling" is the data model of the data warehouse. There are two basic models: "star schema" and "snowflake schema"
Dimensional modelling is used for OLAP (Online Analytical Processing).
I have been reading about dimensional modeling and OLAP, and this kind of database is described as "denormalized."
But since I work with them, the data structures I see are always at least in 1NF. I have never worked with a completely denormalized database structure.
So here is the question, does 1NF mean the same thing as "denormalized?" If not, then why do people say it?
Because it is denormalised in comparison to more commonly used relational models, which are very often 3NF+. The assumption is that your source systems are using 3NF+ databases, and when you drop down to 2NF or 1NF, you are denormalising.
This is a big assumption, and not always correct. Plenty of systems are built on relational databases which don't really follow a 3NF model. And more recently, some systems are not using a relational model at all! (Think about all the NoSQL data stores now in use.)
Further to this, one fairly common data warehouse architecture involves creating a 3NF+ data warehouse which is loaded from the source, and then denormalising the data to create dimensional data marts which are loaded from the more normalised model. In this case, saying you are "denormalising" makes sense.
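As a tiny illustration of that distinction (plain Python, hypothetical data), here is a 3NF source structure flattened into a dimension table that is still in 1NF but no longer in 3NF:

    # 3NF source: city attributes live in their own table, referenced by key.
    cities = {1: {"city": "Oslo", "country": "Norway"}}
    customers_3nf = [{"customer_id": 10, "name": "Acme", "city_id": 1}]

    # Denormalised dimension: the same data flattened into one table.
    # Still 1NF (atomic values, no repeating groups), but not 3NF, because
    # city -> country is now a transitive dependency on the key.
    dim_customer = [
        {
            "customer_key": c["customer_id"],
            "name": c["name"],
            "city": cities[c["city_id"]]["city"],
            "country": cities[c["city_id"]]["country"],
        }
        for c in customers_3nf
    ]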

How are OLAP, OLTP, data warehouses, analytics, analysis and data mining related?

I'm trying to understand what OLAP, OLTP, data mining, analytics etc. are about, and I feel like my understanding of some of these concepts is still a bit vague. Information about these subjects tends to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research I understand the following about these concepts, is it correct?
Analysis is decomposing something complex, to understand the inner workings better.
Analytics is predictive analysis of information that requires a lot of math and statistics.
There are many types of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams, and are therefore easier to update because they are in normalized form.
In contrast, OLAP uses denormalized star schemas and is therefore easier to query.
OLAP is used for predictive analysis, and OLTP is usually used in more practical situations since there's no redundancy.
A data warehouse is a type of OLAP database, and usually consists of multiple other databases.
Data mining is a tool used in analytics, where you use computer software to find relationships between data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?
I will try to explain it from the top of the pyramid:
Business Intelligence (which you didn't mention) is a term in IT which stands for a complex system that derives useful information about a company from its data.
So, a BI system has a target: clean, accurate and meaningful information.
Clean means there are no technical problems (missing keys, incomplete data, etc.). Accurate means just that - BI systems are also used as a fault checker for the production database (logical faults, e.g. an invoice amount that is too high, or an inactive partner being used, etc.). This is accomplished with rules. Meaningful is hard to explain, but in plain English it means all your data (even the Excel table from the last meeting), presented the way you want it.
So, a BI system has a back-end: the data warehouse.
A DWH is nothing more than a database (an instance, not software). It can be stored in an RDBMS, an analytical DB (columnar or document store types), or a NoSQL database.
Data warehouse is the term usually used for that whole database. It most often contains a number of data marts (if the Kimball model is used), or a relational system in third normal form (the Inmon model) called an enterprise data warehouse.
Data marts are related tables inside the DWH (star schema, snowflake schema): a fact table (a business process in denormalized form) and dimension tables.
Each data mart represents one business process. Example: a DWH has three data marts. One is retail sales, the second is export, and the third is import. In retail you can see total sales, quantity sold, import price and profit (measures) by SKU, date, store, city, etc. (dimensions).
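As a sketch of what "measures by dimensions" means in practice (plain Python, hypothetical data), slicing the retail sales mart by the store dimension looks like this:

    from collections import defaultdict

    # Hypothetical fact rows, already joined to their dimension attributes.
    fact_retail_sales = [
        {"store": "Downtown", "sku": "A1", "qty": 2, "amount": 20.0},
        {"store": "Downtown", "sku": "B2", "qty": 1, "amount": 15.0},
        {"store": "Airport",  "sku": "A1", "qty": 5, "amount": 50.0},
    ]

    # "Total sales by store": aggregate a measure over one dimension.
    totals = defaultdict(float)
    for row in fact_retail_sales:
        totals[row["store"]] += row["amount"]

    print(dict(totals))  # {'Downtown': 35.0, 'Airport': 50.0}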
Loading data into the DWH is called ETL (extract, transform, load):
Extract data from multiple sources (ERP DB, CRM DB, Excel files, web services...)
Transform data (clean data, connect data from different sources, match keys, mine data)
Load data (load the transformed data into the specific data marts)
Edit, because of a comment: the ETL process is usually built with an ETL tool, or manually with some programming language (Python, C#, etc.) and APIs.
The ETL process is a group of related SQL statements, procedures, scripts and rules, separated into the three parts above and controlled by metadata.
It's either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
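A minimal sketch of those three parts in Python (the sources, rules and target here are all hypothetical):

    # Extract: pull raw rows from the sources (ERP DB, CRM DB, files, APIs...).
    def extract():
        return [{"id": "001", "amount": " 12.50 "}, {"id": "002", "amount": "7"}]

    # Transform: clean and conform the data - fix types, match keys, apply rules.
    def transform(rows):
        return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in rows]

    # Load: write the transformed rows into the target data mart.
    def load(rows, target):
        target.extend(rows)

    data_mart = []
    load(transform(extract()), data_mart)
    print(data_mart)  # [{'id': 1, 'amount': 12.5}, {'id': 2, 'amount': 7.0}]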
OLTP and OLAP are types of data processing. OLTP is used for transactional purposes, between a database and software (usually just one way of inputting/outputting data).
OLAP is for analytical purposes, which means multiple sources, historical data, high SELECT query performance, and mined data.
Edit, because of a comment: data processing is the way data is stored in and accessed from the database. So, based on your needs, the database is set up in a different way.
(Image from http://datawarehouse4u.info/ not reproduced here.)
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight into a business process, or even a forecast.
Analysis, in the BI world, means the ease of getting the information you asked for out of the data. Multidimensional analysis describes how the system slices your data (with dimensions inside a cube). Wikipedia says that analysis of data is a process of inspecting data with the goal of discovering useful information.
Analytics represents the result of the analysis process.
Don't fuss too much over those two words.
I can tell you about data mining, as I had a project on it. Data mining is not a tool; it's a method of mining data, and different tools used for data mining are WEKA, RapidMiner, etc. Data mining follows many algorithms which are built into tools like WEKA and RapidMiner - algorithms like clustering algorithms, association algorithms, etc.
A simple example I can give you of data mining: a teacher teaches a science subject in a class using different teaching methods, like chalkboard, presentation and practicals. Our aim is to find out which method suits the students. We run a survey and take the students' opinions: 40 students like the chalkboard, 30 like the presentation and 20 like the practical method. With the help of this data we can make rules, for example: the science subject should be taught by the chalkboard method.
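That rule derivation is simple enough to sketch in a few lines of Python (the numbers are the ones from the example above):

    from collections import Counter

    # The survey results from the example.
    votes = Counter({"chalkboard": 40, "presentation": 30, "practical": 20})

    # The "mined" rule: teach with the most popular method.
    method, count = votes.most_common(1)[0]
    print(f"Rule: teach science by the {method} method ({count} votes)")

Real data mining tools apply far more sophisticated algorithms (clustering, association rules, etc.), but the idea of deriving a rule from observed data is the same.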
To learn the different algorithms you can use Google :D.

how to tell if specifications are modelled using database oriented approach or class design oriented approach

Given a problem specification, how to tell if it is a database design problem or class design(object oriented design) problem?
What comes to mind is that in OOP, classes (objects) contain methods, whereas a database is just a collection of relationships and values.
Therefore:
If you can say a problem is about how "things" in the specification relate to each other you have a database design problem.
If it is about what the "things" in the specification can do, you're going to be modeling more along object oriented programming.
If you're using a database and creating domain objects, it's both. Database design and class design are two different things, and both are necessary if you're using a database and classes. It's not like you choose one or the other.
This is where an ORM comes into play. When your data layer retrieves information from the database, a typical approach is to transform the relational data into your domain object(s) and pass that to the business logic layer so the rest of your application can deal with domain objects instead of a relational model.
Then your ORM does the opposite when persisting data: it takes a domain entity and turns it back into a relational structure that can be saved to the database.
Note: I'm assuming a relational database here. If not, substitute relational for whatever type of persistence layer you're using.
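Here is a hand-rolled sketch of the mapping an ORM automates (plain Python; the Customer class and field names are hypothetical):

    # Domain object: carries business logic, not just data.
    class Customer:
        def __init__(self, customer_id, name, credit_limit):
            self.customer_id = customer_id
            self.name = name
            self.credit_limit = credit_limit

        def can_order(self, amount):
            # Business rule lives on the domain object.
            return amount <= self.credit_limit

    # Data layer: relational row -> domain object ("hydration").
    def row_to_domain(row):
        return Customer(row["id"], row["name"], row["credit_limit"])

    # Data layer: domain object -> relational row (persistence).
    def domain_to_row(customer):
        return {"id": customer.customer_id,
                "name": customer.name,
                "credit_limit": customer.credit_limit}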
I believe that the only specifications which should be addressed as database-oriented problems are those which are focused on the manipulation of structured data types. If your specification is all about "store a customer record", "delete an order record", "change the value of price from 12 to 33 for the record matching a specification", you've got a database project.
I haven't seen that kind of problem specification since the Cobol team I worked in employed a systems ~~anarchist~~ analyst. Almost every project I've worked on since has had requirements that were not about how data was stored, but what the data meant.
If you get a requirement that says "Users may create Customers. Customers can place orders. Orders contain products. Orders can have delivery methods, payment methods, and status. Status follows a business process", you have an OO problem. You probably need a storage mechanism - and a database would be an excellent choice - but you have business logic that cannot be exclusively implemented by creating structured data types and relationships.

DataWarehouse - What is a good definition?

Could someone give me a good, practical definition of what a data warehouse is?
I'm surprised no one has posted Inmon's definition:
A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
From the same page you can pick up Kimball's definition:
A copy of transaction data specifically structured for query and analysis.
I think that, unfortunately, data warehousing is a wide-ranging field. There is a lot of variety and very few standard paradigms - specifically, I'm thinking of Kimball's dimensional modelling. Inmon does not have as specific a methodology as Kimball's, and thus some 3NF models may or may not conform to his principles.
Because Inmon has broadened his scope for what warehousing is meant to accomplish, it can encompass unstructured data. However, analysis of unstructured data is very different from traditional analysis.
As applied to SQL Server, typically the largest Data Warehouses on SQL Server are modelled dimensionally, because this lends itself well to the non-distributed, non-massively parallel model. Massively parallel systems like Teradata generally perform a lot better with 3NF models. These are still table-based systems with the various tables connected with foreign key constraints (perhaps not enforced, but at least logical).
Of course, we are also seeing NoSQL data processing systems like Map/Reduce which are not really databases at all in the sense of normalized, denormalized or non/poorly-normalized relational databases which we have had for 40 years now.
I just started with data warehousing and Business Intelligence, and looking around the web you can find some interesting links:
Get Start With Datawarehousing
I think these two links could help you understand the concepts of data warehousing.
Sorry, I'm new, I can post only one link ^^
A database optimized for retrieval: in general denormalized data, usually a star schema (but it could be a snowflake), using dimensional modeling (fact and dimension tables).
While this is not an academic definition, it might serve as a practical one. A data warehouse is a collection of datamarts and will combine datasets across the breadth of an organization.
A datamart will contain datasets specific to certain portions of the business. In the datamart you will find fact tables, measurable pieces of information, along with dimensions, attributes of your measurable pieces.
A true data warehouse will have conformed dimension tables that can be shared across datamarts.
An example...
Your company may build a datamart around sales, and another datamart around human resources. If the customer dimension table is shared across both of these datamarts, it would be considered a conformed dimension. All three of these entities together would make up a data warehouse.
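A toy sketch of that idea (plain Python, hypothetical data): two fact tables in different datamarts sharing one conformed customer dimension, so results can be combined across them:

    # Conformed dimension, shared by both datamarts.
    dim_customer = {1: {"name": "Acme"}}

    fact_sales = [{"customer_key": 1, "amount": 100.0}]     # sales datamart
    fact_hr_contacts = [{"customer_key": 1, "visits": 3}]   # HR datamart

    # Because both facts reference the same customer_key, measures from
    # different datamarts can be reported against the same customer.
    for fact in (fact_sales, fact_hr_contacts):
        for row in fact:
            print(dim_customer[row["customer_key"]]["name"], row)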
As someone else stated you can find more detailed information by searching for Ralph Kimball's Data Strategies.
Definition: a data warehouse is a database used for analysis purposes rather than for transaction processing.
Check the link below for more information on data warehouses:
http://www.idatastage.com/datawarehouse/
