Data Vault Modelling - data-modeling

Assuming the following data architecture:
Source Systems -> Data Warehouse (using the data vault model) -> Data Virtualization -> Consumption Layer (e.g., BI Tools & reporting)
I read that one of the key principles of data vault is to load raw data and keep records from all sources, with no de-duplication or transformations, for traceability/auditing purposes. If this is true, where would the transformations happen?

Yes, it is true: the "raw" data vault keeps records exactly as they were on the source system when they were loaded.
But there is another concept, the "business" data vault. This is where all the logic and transformations happen. The business data vault is not a full copy of the raw data vault; instead you create hub/link/satellite/PIT/bridge tables that implement the logic to suit your needs.
That separation helps you in the long run. If, for example, you need to change a business rule next year, you still have the original data for a particular source system at a particular time in the past. If your logic has a bug, you still have the original data.
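To make the raw-vault side of this concrete, here is a minimal sketch (Python with SQLite purely for illustration) of a hub and a satellite where every load of the same customer is kept as its own row; the table and column names (hub_customer, sat_customer, customer_bk, ...) are assumptions made for the example, not something your model prescribes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (
    customer_hk  TEXT PRIMARY KEY,      -- hash of the business key
    customer_bk  TEXT NOT NULL,         -- business key from the source
    load_dts     TEXT NOT NULL,
    record_src   TEXT NOT NULL
);
CREATE TABLE sat_customer (
    customer_hk  TEXT NOT NULL,
    load_dts     TEXT NOT NULL,
    record_src   TEXT NOT NULL,
    name         TEXT,
    email        TEXT,
    PRIMARY KEY (customer_hk, load_dts) -- every load keeps its own version
);
""")

# Two loads of the same customer: both versions are kept, nothing is overwritten.
con.execute("INSERT INTO hub_customer VALUES ('h1', 'CUST-42', '2023-01-01', 'crm')")
con.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?)",
    [("h1", "2023-01-01", "crm", "Ada Lovelace", "ada@example.com"),
     ("h1", "2023-06-01", "crm", "Ada Lovelace", "ada.lovelace@example.com")],
)
print(con.execute("SELECT * FROM sat_customer ORDER BY load_dts").fetchall())
```

Because both rows survive, a business rule applied later (in the business vault) can always be re-run against the original history.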

From my experience usually you have this architecture:
Raw Source (Copy from your OLTP data sources)
Staging (nowadays often a Persistent Staging Area in a data lake, because it is cheaper than a relational DB)
Raw Vault (applying so-called Hard Rules, like data type changes)
Business Vault (applying so-called Soft Rules: all your business logic, aggregations, concatenations, ...; see the sketch below)
Information Mart (Data Mart, sometimes virtualized but not always; usually a star/snowflake schema)
Cube/Tabular Model
BI Tool
You can find more information about the difference between Raw Vault and Business Vault here: Datavault - hard rules (rawvault) vs soft rules (businessvault)
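As a rough illustration of the hard-rule/soft-rule split, here is a small Python sketch; the column names and the 10% rebate rule are invented for the example.

```python
from datetime import datetime

raw_row = {"order_id": "O-1", "amount": " 120.50 ", "order_date": "01/06/2023"}

# Hard rule (raw vault): type/format normalization only, no business meaning changed.
hard = {
    "order_id": raw_row["order_id"].strip(),
    "amount": float(raw_row["amount"]),                               # text -> numeric
    "order_date": datetime.strptime(raw_row["order_date"], "%d/%m/%Y").date(),
}

# Soft rule (business vault): business logic, e.g. a hypothetical 10% rebate rule.
soft = dict(hard, net_amount=round(hard["amount"] * 0.9, 2))

print(hard)
print(soft)
```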

Related

Can we store multiple types of data in a data warehouse?

Can we store various types of data in a Hadoop data warehouse? Data like RDBMS tables, JSON documents, Cassandra keyspaces, txt, CSV, etc.? Is it all stored in HDFS?
A classic DWH is a repository for structured, filtered data that has already been processed for a specific purpose, and all the data is stored in the same format, except for the landing zone (LZ or RAW), where data can be kept in the same format in which it is loaded from the source systems. The DWH building process is based on Kimball or Inmon theory.
What you are asking about is a Data Lake, a more modern concept: a vast pool of raw data whose purpose may not be completely defined yet. In a DL you can store structured along with semi-structured data, and data analysts can access both the raw semi-structured data and structured data in 3NF or dimensional form.
An RDBMS normally adds an abstraction layer between the internal storage representation and the means by which the data is accessed; that said, many RDBMSs can store data in external files in HDFS, which is used for convenient integration with a Data Lake.
Yes, you can store everything in the same DL: semi-structured data, data in different storage formats like AVRO, CSV, Parquet, ORC, etc., build Hive tables on top of it, as well as tables from different RDBMSs; all of it can be stored in the same HDFS/S3/Azure/GCS/etc.
Layers can also be created in a DL, like RAW/LZ/DM, or based on a domain-event/business-event model. This means a DL is not an absence of architecture constraints: normally you have an architecture design and constraints to follow in a DL, just as in a classic DWH.
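As a sketch of that layering, assuming a local Spark installation with Hive support is available (the paths, schema and table name are made up for the example):

```python
import pathlib
from pyspark.sql import SparkSession

# Fake a "raw/landing" drop of semi-structured data so the sketch is self-contained.
raw_dir = pathlib.Path("/tmp/lake/raw/orders")
raw_dir.mkdir(parents=True, exist_ok=True)
(raw_dir / "orders.json").write_text(
    '{"order_id": 1, "order_date": "2023-06-01", "amount": 120.5}\n'
    '{"order_id": 2, "order_date": "2023-06-02", "amount": 75.0}\n'
)

spark = (SparkSession.builder
         .appName("lake-layers")
         .enableHiveSupport()
         .getOrCreate())

# RAW/LZ layer: data kept as it arrived.
raw = spark.read.json(str(raw_dir))

# Curated layer: columnar format (Parquet) for analytics.
raw.write.mode("overwrite").parquet("/tmp/lake/curated/orders")

# Expose the curated files to SQL users as a table on top of the lake storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated_orders
    USING PARQUET
    LOCATION '/tmp/lake/curated/orders'
""")
spark.sql("SELECT order_date, SUM(amount) FROM curated_orders GROUP BY order_date").show()
```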

What is the difference between a unified data warehouse and a unified data model?

I ask as the words are used pretty much interchangeably in some documentation I have had to review.
In the real world what are the differences?
A "Data Warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
That being said, I think it's a bit redundant to say "unified data warehouse"; a data warehouse is a "unified" source of data by definition.
This definition implies that the data model in a data warehouse must/should be a unified, canonical model of all relevant data. You can also look at a Data Warehouse as a collection of data marts, which in turn are smaller unified/canonical models focused on specific business/functional areas; so the "unified data model" can be thought of as the sum of the various smaller/specific models (the data marts).
A Data Warehouse, as an information system, is usually surrounded by a lot of technology tools (databases, ETL software, analytics and reporting tools, etc); but regardless of how you handle, model and explore data, the primary purpose of a DW is to serve as a curated, single source of truth for (business) questions that (should) rely on data.

Can a Data Warehouse include a Data Lake?

I want to understand data warehouses and data lakes in more detail.
It seems to me there is conflicting information on this topic. Inmon defines a data warehouse as
a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process
Now, I understand that this is just a form of architecture and does not imply any particular technology, which means the underlying storage can have any structure; it could even be S3 object storage. Moreover, Waas et al., in On-Demand ELT Architecture for Right-Time BI: Extending the Vision, proposed a data warehouse with an ELT process for integrating data.
When it comes to data lakes I found the following definition
scalable storage repository that holds a vast amount of raw data in its native format ("as is") until it is needed plus processing systems (engine) that can ingest data without compromising the data structure
taken from Data lake governance.
Now, can a data warehouse be a stricter kind of data lake? There has been an argument that a data warehouse must use ETL, but according to Inmon the definition does not include any restriction on data transformation. If data integration can be done with ELT, and the transformation is agile, e.g. it can be easily extended, then a data warehouse looks very much like a data lake.
Are my assumptions correct, or am I looking at this from a skewed angle?
A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Consider this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this), which becomes the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.
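As a toy version of that flow in Python (the file name, log layout and table name are invented for the example): the raw gzipped log stays as-is, while a small modelled copy is produced for querying.

```python
import gzip, sqlite3

# Create a tiny fake gzipped access log so the sketch is self-contained.
with gzip.open("access_log.gz", "wt") as fh:
    fh.write("2023-06-01 /home 200\n2023-06-01 /cart 200\n2023-06-02 /home 404\n")

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_page_view (
                   visit_date TEXT, page TEXT, status INTEGER, views INTEGER)""")

# The gzipped log stays untouched in the "lake"; we only derive a modelled copy.
with gzip.open("access_log.gz", "rt") as fh:
    for line in fh:
        visit_date, page, status = line.split()[:3]   # assumed log layout
        con.execute("INSERT INTO fact_page_view VALUES (?, ?, ?, 1)",
                    (visit_date, page, int(status)))

# Business users query the modelled copy, not the raw logs.
for row in con.execute(
        "SELECT page, COUNT(*) FROM fact_page_view GROUP BY page ORDER BY 2 DESC"):
    print(row)
```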
The Data Warehouse was created to solve the need for analytical data processing on enterprise-level, structured data. That means:
data comes from throughout the organization and is usually brought into the warehouse using ETL processes from various sources
data in the warehouse is structured and managed in a format optimized for intensive analytical queries; most warehouses structure data as a columnar store and provide a SQL-type interface to work with the data.
The Data Lake, on the other hand, was created to be a one-stop zone for all of your organization's data. Data is in raw, unprocessed format, straight from the applications. You can process data in the lake either by moving it to the warehouse or by using it directly in distributed big data processing systems.
So from this we can see that a data warehouse is not a data lake, since it:
does not hold unstructured data
is intended only for compute-intensive OLAP applications

How to Organize a Messy Database

I know there is no easy answer to this question, but how do I clean up a database with no relationships, no foreign keys, and not a whole lot of structure?
I'm an amateur at SQL, and I've inherited a database that is a complete mess. We have no sort of referential integrity, and there's not a whole lot of logic to how the tables work.
My database is all data that comes from a warehouse that builds servers.
To give you an idea of the type of data I'm working with:
EDI from customers
Raw output from server projects
Sales information
Site information
Parts lists
I have been prioritizing Raw output and EDI information, and generating reports with that information using SSRS. I have learned a lot about SQL Server and the BI Microsoft tools (SSIS and SSRS) in my short time doing this. However, I'm still an amateur and I want to build a solid database that flows well and can stand on its own.
It seems like a data warehouse model is the type of structure I should adopt.
My question is: how do I take my mess of a database and turn it into something more organized before I drown in data?
Since your end goal appears to be business reporting, and you're dealing with data from multiple sources made up of "isolated" tables, I would advise you to start by aggregating all of that into a data model.
Personally, I would design a dimensional model to structure and store all that data, with the goal of being easy to understand (for reporting or ad hoc querying). The model should be focused on business entities and their transactions. In a dimensional model, the business entities will (almost always) be the dimensions, and the transactions (the metrics) will be the facts. For example, without knowing your data I'm guessing that the immediate entities would include Customer, Site and Part, and the transactions would include ServerSale, SiteVisit, PartPurchase, PartRepair, PartOrder, etc. (a rough sketch follows below).
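As a rough sketch of what such a model could look like (SQLite used only for illustration; every table and column name here is a guess about your data, not a prescription):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimensions: the business entities.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, edi_id TEXT);
CREATE TABLE dim_site     (site_key     INTEGER PRIMARY KEY, site_name TEXT, city TEXT);
CREATE TABLE dim_part     (part_key     INTEGER PRIMARY KEY, part_number TEXT, description TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);

-- One fact table per business process, e.g. server sales.
CREATE TABLE fact_server_sale (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    site_key     INTEGER REFERENCES dim_site(site_key),
    part_key     INTEGER REFERENCES dim_part(part_key),
    quantity     INTEGER,
    sale_amount  REAL
);
""")
print([r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```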
You can find more information about dimensional modelling here and here, but I suggest going straight to the source: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
When your model is designed (and implemented in a database like SQL Server), you'll then load data into it by extracting it from the different source systems/databases and transforming it from its current structure into the structure defined by the model, typically using an ETL tool like MS Integration Services. For example, your customer data may be scattered across the "sales", "customer" and "site" tables, so you would aggregate all that data and load it into a single Customer dimension table. It's when doing this ETL that you should check your data for the problems you already mentioned, loading correct rows into your data model and diverting incorrect rows to a file/log where they can later be checked and corrected (there are multiple ways to address this).
A straightforward tutorial to get started on doing ETL using SSIS can be found at https://technet.microsoft.com/en-us/library/jj720568(v=sql.110).aspx
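If it helps to see the idea outside of SSIS, here is a minimal Python sketch of the same extract / quality-check / load pattern; the source tables, columns and the quality rule are invented for the example.

```python
import sqlite3

src = sqlite3.connect(":memory:")                 # stands in for the messy source database
src.executescript("""
CREATE TABLE sales (cust_id TEXT, cust_name TEXT);
CREATE TABLE site  (cust_id TEXT, cust_email TEXT);
INSERT INTO sales VALUES ('C1', 'Acme'), ('C2', NULL);
INSERT INTO site  VALUES ('C1', 'it@acme.example'), ('C2', 'ops@foo.example');
""")

dwh = sqlite3.connect(":memory:")                 # stands in for the data mart
dwh.execute("CREATE TABLE dim_customer (cust_id TEXT PRIMARY KEY, name TEXT, email TEXT)")

rejects = []
rows = src.execute("""
    SELECT s.cust_id, s.cust_name, t.cust_email
    FROM sales s LEFT JOIN site t ON t.cust_id = s.cust_id
""").fetchall()

for cust_id, name, email in rows:
    if not name:                                  # simple data-quality rule
        rejects.append((cust_id, "missing customer name"))
        continue
    dwh.execute("INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", (cust_id, name, email))

print(dwh.execute("SELECT * FROM dim_customer").fetchall())
print("rejected:", rejects)
```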
So, to sum up, you should build a data mart:
design a dimensional model that represents the business facts and their context in the data you have. This will strongly facilitate both data understanding and reporting, because a dimensional model closely matches business users' terminology and mental models.
use an ETL tool to extract the data from its current sources, process it (e.g. check for data-quality problems, join data from different sources) and load it into the dimensional model. This will get you close to having an automated data-integration job/pipeline with the quality checks you deem fit for the data.

How are OLAP, OLTP, data warehouses, analytics, analysis and data mining related?

I'm trying to understand what OLAP, OLTP, data mining, analytics etc. are about, and I feel like my understanding of some of these concepts is still a bit vague. Information about these subjects tends to be explained in a very complex manner on the internet.
I feel like a question like this is likely to be closed since it's a very broad one, so I'll try to narrow it down into two questions:
Question 1:
After doing research, I understand the following about these concepts; is it correct?
Analysis is decomposing something complex, to understand the inner workings better.
Analytics is predictive analysis of information that requires a lot of math and statistics.
There's many type of databases, but they are either OLTP (transactional) or OLAP (analytical).
OLTP databases use ER diagrams, and are therefore easier to update because they are in normalized form.
In contrast, OLAP uses denormalized star schemas and is therefore easier to query.
OLAP is used for predictive analysis, and OLTP is usually used in more practical situations since there's no redundancy.
Data warehouses are a type of OLAP database, and usually consist of multiple other databases.
Data mining is a tool used in analytics, where you use computer software to find relationships between data so you can predict things (e.g. customer behavior).
Question 2:
I'm especially confused about the difference between analytics and analysis. They say analytics is multidimensional analysis, but what is that supposed to mean?
I will try to explain it from the top of the pyramid:
Business Intelligence (which you didn't mention) is a term in IT for a complex system that produces useful information about a company from its data.
So, a BI system has a target: clean, accurate and meaningful information.
Clean means there are no technical problems (missing keys, incomplete data, etc.). Accurate means just that: BI systems are also used as a fault checker for the production database (logical faults, e.g. an invoice amount that is too high, or an inactive partner being used). This is accomplished with rules. Meaningful is harder to explain, but in simple English it means all your data (even the Excel table from the last meeting), presented the way you want it.
So, a BI system has a back-end: the data warehouse.
A DWH is nothing more than a database (an instance, not a piece of software). It can be stored in an RDBMS, an analytical DB (columnar or document-store types), or a NoSQL database.
Data warehouse is the term usually used for the whole database explained above. It may consist of a number of data marts (if the Kimball model is used, which is more common), or of a relational system in third normal form (the Inmon model), called an enterprise data warehouse.
Data marts are related tables inside the DWH (star schema, snowflake schema): a fact table (the business process in denormalized form) and dimension tables.
Each data mart represents one business process. Example: a DWH has 3 data marts; one is retail sales, the second is export, and the third is import. In retail sales you can see total sales, quantity sold, import price and profit (measures) by SKU, date, store, city, etc. (dimensions).
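To make "measures by dimensions" concrete, here is a tiny sketch (SQLite, with made-up store names and sales figures) of slicing a retail-sales mart by the store dimension:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
CREATE TABLE fact_retail_sales (store_key INTEGER, sku TEXT, sale_date TEXT,
                                qty_sold INTEGER, sales_amount REAL, profit REAL);
INSERT INTO dim_store VALUES (1, 'Downtown', 'Zagreb'), (2, 'Airport', 'Split');
INSERT INTO fact_retail_sales VALUES
    (1, 'SKU-1', '2023-01-05', 3, 30.0,  9.0),
    (1, 'SKU-2', '2023-01-07', 1, 15.0,  4.0),
    (2, 'SKU-1', '2023-01-09', 5, 50.0, 14.0);
""")

# Measures (total sales, quantity, profit) sliced by the store dimension.
query = """
SELECT d.city, SUM(f.sales_amount) AS total_sales,
       SUM(f.qty_sold) AS qty, SUM(f.profit) AS profit
FROM fact_retail_sales f JOIN dim_store d ON d.store_key = f.store_key
GROUP BY d.city
"""
for row in con.execute(query):
    print(row)
```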
Loading data into the DWH is called ETL (extract, transform, load):
Extract data from multiple sources (ERP DB, CRM DB, Excel files, web services...)
Transform data (clean data, combine data from different sources, match keys, mine data)
Load data (load the transformed data into the specific data marts)
edit because of comment: an ETL process is usually created with an ETL tool, or manually with some programming language (Python, C#, etc.) and APIs.
An ETL process is a group of SQL statements, procedures, scripts and rules, related to each other and separated into the 3 parts above, controlled by metadata.
It is either scheduled (every night, every few hours) or live (change data capture, triggers, transactions).
OLTP and OLAP are types of data processing. OLTP is used for transactional purposes, between a database and application software (usually only one narrow path for getting data in and out).
OLAP is for analytical purposes, which means multiple sources, historical data, high SELECT query performance and mined data.
edit because of comment: data processing is the way data is stored in and accessed from a database, so depending on your needs the database is set up in a different way.
(Image from http://datawarehouse4u.info/ comparing OLTP and OLAP; not reproduced here.)
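A tiny sketch of the two access patterns, with made-up data: OLTP touches one row per transaction, while OLAP scans and aggregates many rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 'A', 10.0), (2, 'B', 25.0), (3, 'A', 40.0)])

# OLTP style: short transaction that reads/updates one record by key.
con.execute("UPDATE orders SET amount = 12.0 WHERE order_id = 1")

# OLAP style: read-heavy aggregation across the whole (historical) data set.
print(con.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer").fetchall())
```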
Data mining is the computational process of discovering patterns in large data sets. Mined data can give you more insight into a business process, or even a forecast.
Analysis refers to the activity: in the BI world it means the ease of getting the requested information out of the data. Multidimensional analysis describes how the system slices your data (with dimensions inside a cube). Wikipedia says that analysis of data is a process of inspecting data with the goal of discovering useful information.
Analytics refers to the result of the analysis process.
Don't make too much fuss about those two words.
I can tell you about data mining, as I had a project on it. Data mining is not a tool; it's a method of mining data, and different tools used for data mining are WEKA, RapidMiner, etc. Data mining uses many algorithms which are built into tools like WEKA and RapidMiner, such as clustering algorithms, association algorithms, etc.
Here is a simple example of data mining. A teacher is teaching a science subject in a class using different teaching methods: chalkboard, presentation and practical work. Our aim is to find out which method suits the students best, so we do a survey and take the students' opinions: 40 students like the chalkboard, 30 like the presentation and 20 like the practical method. With the help of this data we can make rules, for example: the science subject should be taught by the chalkboard method.
To learn about the different algorithms, you can use Google :D.
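If you want to try one of those algorithms without WEKA or RapidMiner, here is a minimal clustering sketch in Python with scikit-learn (the student data is invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: [lectures attended, practicals attended] for one made-up student.
X = np.array([[9, 1], [8, 2], [9, 0],     # students who favor lecture-style teaching
              [1, 9], [2, 8], [0, 9]])    # students who favor practical sessions

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # which cluster each student falls into
print(model.cluster_centers_)  # the "typical" student in each cluster
```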
