Can a Data Warehouse include a Data Lake? - data-modeling

I want to understand data warehouses and data lakes in more detail.
It seems to me there is conflicting information on the topic. Inmon defines a data warehouse as
a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process
Now I understand that this is just a form of architecture and does not imply any technology, which means the underlying data can have any structure and could even sit in S3 object storage. Moreover, Waas et al., in On-Demand ELT Architecture for Right-Time BI: Extending the Vision, proposed a data warehouse with an ELT process for integrating data.
When it comes to data lakes I found the following definition
scalable storage repository that holds a vast amount of raw data in its native format ("as is") until it is needed plus processing systems (engine) that can ingest data without compromising the data structure
taken from Data lake governance.
Now, can a data warehouse be a more strict form of data lake? There has been an argument that a data warehouse must use ETL, but according to Inmon the definition does not include any restriction on how the data is transformed. If data integration can be ELT and the transformation is agile, i.e. easy to extend, then a data warehouse looks very much like a data lake.
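To make the ELT idea concrete, here is a minimal sketch, assuming sqlite3 as a stand-in for the warehouse engine and a hypothetical events_raw.csv source file: the raw rows are landed untouched, and the transformation is added afterwards as a view, so it can be extended without reloading anything.

```python
# Minimal ELT sketch: land the source rows untouched first, express the
# transformations later inside the database. sqlite3 stands in for the
# warehouse engine; "events_raw.csv" is a hypothetical source file.
import csv
import sqlite3

con = sqlite3.connect("warehouse.db")

# Extract + Load: copy the raw lines as-is, no transformation yet
con.execute("CREATE TABLE IF NOT EXISTS raw_events (line TEXT)")
with open("events_raw.csv", newline="") as f:
    con.executemany(
        "INSERT INTO raw_events (line) VALUES (?)",
        ((",".join(row),) for row in csv.reader(f)),
    )

# Transform: defined afterwards as a view, so it can be changed or
# extended without reloading the raw data
con.execute("""
    CREATE VIEW IF NOT EXISTS daily_event_counts AS
    SELECT substr(line, 1, 10) AS event_date, COUNT(*) AS events
    FROM raw_events
    GROUP BY substr(line, 1, 10)
""")
con.commit()
```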
Are my assumptions correct, or am I looking at this from a skewed angle?

A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Imagine this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process them into a dimensional model, which becomes the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.
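As a rough illustration of that last step, here is a minimal, hypothetical sketch in Python; the file path and log layout (timestamp, URL, status) are assumptions, not a prescribed format.

```python
# Hypothetical sketch: promote raw gzipped web-server logs (the lake copy)
# into a small fact table for reporting. The file path and the assumed
# log layout (timestamp, url, status) are illustrative only.
import gzip
import sqlite3

con = sqlite3.connect("store_dw.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS fact_page_visit (
        visit_date TEXT, url TEXT, http_status INTEGER
    )
""")

with gzip.open("lake/raw/access-2024-01-01.log.gz", "rt") as log:
    for line in log:
        ts, url, status = line.split()[:3]   # assumed log layout
        con.execute(
            "INSERT INTO fact_page_visit VALUES (?, ?, ?)",
            (ts[:10], url, int(status)),
        )
con.commit()
```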

A Data Warehouse was created to address the need for analytical data processing on enterprise-level, structured data. This means:
Data comes from throughout the organization and is usually brought into the warehouse from various sources using ETL processes.
Data in the warehouse is structured and managed in a format optimized for intensive analytical workloads. Most warehouses store data in a columnar format and provide a SQL-style interface for working with it.
A Data Lake, on the other hand, was created to be a one-stop zone for all of your organization's data. Data arrives in a raw, unprocessed format straight from applications. You can process data in the lake either by moving it to a warehouse or by using it directly in distributed big data processing systems (a small sketch of both access patterns follows below).
So from this we see that a data warehouse is not a data lake, since it:
does not hold unstructured data
is used primarily for compute-intensive OLAP workloads
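As a rough sketch of the two access patterns (not a definitive setup; it assumes a local PySpark installation, and the paths and column names are made up):

```python
# Sketch of both access patterns (assumes a local PySpark installation;
# the table and directory paths are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dw-vs-lake").getOrCreate()

# Warehouse-style: a curated, columnar (Parquet) table queried through SQL
sales = spark.read.parquet("warehouse/fact_sales/")
sales.createOrReplaceTempView("fact_sales")
spark.sql(
    "SELECT product_id, SUM(amount) AS revenue FROM fact_sales GROUP BY product_id"
).show()

# Lake-style: raw application dumps read as-is and processed directly
raw_events = spark.read.json("lake/raw/app_events/")
raw_events.printSchema()
```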

Related

Can we store multiple types of data in a data warehouse?

I want to ask whether we can store various types of data in a Hadoop data warehouse, such as RDBMS tables, JSON documents, Cassandra keyspaces, txt, CSV, etc. Are they all stored in HDFS?
A classic DWH is a repository for structured, filtered data that has already been processed for a specific purpose, and all of the data is stored in the same format, except in the landing zone (LZ or RAW), where data can be kept in the same format in which it was loaded from the source systems. The DWH building process is based on Kimball or Inmon theory.
What you are asking about is a Data Lake, a more modern concept: a vast pool of raw data whose purpose may not be completely defined yet. In a DL you can store structured along with semi-structured data, and data analysts can access both raw semi-structured data and structured data in 3NF or dimensional form.
An RDBMS normally adds an abstraction layer between the internal storage representation and the means by which it can be accessed, though storing data in external files in HDFS is possible for many RDBMSs; this is used for convenient integration with a Data Lake.
Yes, you can store everything in the same DL: semi-structured data and data in different storage formats like Avro, CSV, Parquet, ORC, etc. You can build Hive tables on it as well as different RDBMS tables, and all of it can be stored in the same HDFS/S3/Azure/GCS/etc.
Layers can also be created in a DL, like RAW/LZ/DM, or based on a domain-event/business-event model. This means a DL is not an absence of architectural constraints: normally you have some architectural design and constraints to follow in a DL, just as in a classic DWH.
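A small, hypothetical PySpark sketch of what that can look like in practice; the bucket paths, column names, and the availability of the spark-avro package are assumptions:

```python
# Hypothetical sketch: one lake, several formats and layers. Paths, column
# names, and the Avro reader package are assumptions (spark-avro must be
# on the classpath for the Avro read).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-layers").getOrCreate()

# Different formats can sit side by side in the RAW layer
raw_csv = spark.read.option("header", True).csv("s3a://lake/raw/orders_csv/")
raw_avro = spark.read.format("avro").load("s3a://lake/raw/orders_avro/")
raw_avro.printSchema()

# Promote cleaned data from RAW to a curated (DM) layer stored as Parquet
curated = (
    raw_csv
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .dropDuplicates(["order_id"])
)
curated.write.mode("overwrite").parquet("s3a://lake/dm/orders/")
```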

What is the difference between a unified data warehouse and a unified data model?

I ask as the words are used pretty much interchangeably in some documentation I have had to review.
In the real world what are the differences?
A "Data Warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
That being said, I think it's a bit redundant to say "unified data warehouse"; a data warehouse is a "unified" source of data by definition.
This definition implies that the data model in a data warehouse must/should be a unified, canonical model of all relevant data. You can also look at a Data Warehouse as a collection of data marts, which in turn are smaller unified/canonical models focused on specific business/functional areas; so the "unified data model" can be thought of as the sum of the various smaller/specific models (the data marts).
A Data Warehouse, as an information system, is usually surrounded by a lot of technology tools (databases, ETL software, analytics and reporting tools, etc); but regardless of how you handle, model and explore data, the primary purpose of a DW is to serve as a curated, single source of truth for (business) questions that (should) rely on data.

Where does Big Data go and how is it stored?

I'm trying to get to grips with Big Data, and mainly with how Big Data is managed.
I'm familiar with the traditional form of data management and data life cycle; e.g.:
Structured data collected (e.g. web form)
Data stored in tables in an RDBMS on a database server
Data cleaned and then ETL'd into a Data Warehouse
Data is analysed using OLAP cubes and various other BI tools/techniques
However, in the case of Big Data, I'm confused about the equivalent version of points 2 and 3, mainly because I'm unsure about whether or not every Big Data "solution" always involves the use of a NoSQL database to handle and store unstructured data, and also what the Big Data equivalent is of a Data Warehouse.
From what I've seen, in some cases NoSQL isn't always used and can be totally omitted - is this true?
To me, the Big Data life cycle goes something on the lines of this:
Data collected (structured/unstructured/semi)
Data stored in NoSQL database on a Big Data platform; e.g. HBase on MapR Hadoop distribution of servers.
Big Data analytic/data mining tools clean and analyse data
But I have a feeling that this isn't always the case, and point 3 may be totally wrong altogether. Can anyone shed some light on this?
When we talk about Big Data, we are in most cases talking about huge amounts of data that are, in many cases, constantly being written. The data can have a lot of variety as well. Think of a typical data source for Big Data as a machine on a production line that continuously produces sensor data on temperature, humidity, etc. That is not the typical kind of data you would find in your DWH.
What would happen if you transformed all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting from the source, transforming the data to fit a schema, and then storing it takes time and becomes a bottleneck. Creating a schema up front is too slow. This solution is also usually too costly, as you need expensive appliances to run your DWH; you would not want to fill them with sensor data.
You need fast writes on cheap hardware. With Big Data you store the data schemaless at first (often referred to as unstructured data) on a distributed file system. This file system splits the huge data set into blocks (typically around 128 MB) and distributes them across the cluster nodes. Because the blocks are replicated, nodes can also go down without data loss.
If you are coming from the traditional DWH world, you are used to technologies that work well with data that is well prepared and structured. Hadoop and co. are good at looking for insights, like searching for the needle in the haystack. You gain the power to generate insights by parallelising data processing over huge amounts of data.
Imagine you have collected terabytes of data and you want to run some analysis on it (e.g. a clustering). If you had to run it on a single machine, it would take hours. The key idea of big data systems is to parallelise execution in a shared-nothing architecture: if you want to increase performance, you add hardware and scale out horizontally, which speeds up processing over huge amounts of data.
Looking at a modern Big Data stack, you have data storage. This can be Hadoop with a distributed file system such as HDFS or a similar file system. On top of it sits a resource manager that manages access to the file system. On top of that, you have a data processing engine such as Apache Spark that orchestrates execution over the storage layer.
On top of the core data processing engine, you have applications and frameworks such as machine learning APIs that allow you to find patterns within your data. You can run unsupervised learning algorithms to detect structure (such as a clustering algorithm) or supervised machine learning algorithms to give meaning to patterns in the data and predict outcomes (e.g. linear regression or random forests).
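As a hedged illustration of that last layer, here is a minimal Spark MLlib sketch that clusters sensor-style readings; the input path and column names are assumptions:

```python
# Hedged sketch: unsupervised clustering of sensor-style readings with
# Spark MLlib. The input path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("sensor-clustering").getOrCreate()

sensors = spark.read.parquet("lake/raw/sensor_readings/")

# Assemble the numeric columns into the feature vector MLlib expects
assembler = VectorAssembler(
    inputCols=["temperature", "humidity"], outputCol="features")
features = assembler.transform(sensors)

# Fit a 3-cluster model in parallel across the cluster, then inspect cluster sizes
model = KMeans(k=3, featuresCol="features", seed=42).fit(features)
model.transform(features).groupBy("prediction").count().show()
```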
This is my Big Data in a nutshell for people who are experienced with traditional database systems.
Big data, simply put, is an umbrella term used to describe large quantities of structured and unstructured data that are collected by large organizations. Typically, the amounts of data are too large to be processed through traditional means, so state-of-the-art solutions utilizing embedded AI, machine learning, or real-time analytics engines must be deployed to handle it. Sometimes, the phrase "big data" is also used to describe tech fields that deal with data that has a large volume or velocity.
Big data can go into all sorts of systems and be stored in numerous ways, but it's often stored without structure first and then turned into structured data sets during the extract, transform, load (ETL) stage. This is the process of copying data from multiple sources into a single source, or into a different context than it was stored in at the original source. Most organizations that need to store and use big data sets will have an advanced data analytics solution. These platforms give you the ability to combine data from otherwise disparate systems into a single source of truth, where you can use all of your data to make the most informed decisions possible. Advanced solutions can even provide data visualizations for at-a-glance understanding of the information that was pulled, without the need to worry about the underlying data architecture.
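A minimal sketch, assuming pandas and hypothetical file names, of the "copy data from multiple sources into a single source" step described above:

```python
# Minimal sketch of the "copy data from multiple sources into a single
# source" step described above. File names and columns are assumptions.
import sqlite3

import pandas as pd

crm_customers = pd.read_csv("exports/crm_customers.csv")
web_signups = pd.read_json("exports/web_signups.json")

# Transform: align both feeds on a shared schema and de-duplicate
combined = pd.concat(
    [crm_customers[["email", "created_at"]],
     web_signups[["email", "created_at"]]],
    ignore_index=True,
).drop_duplicates(subset="email")

# Load everything into one queryable store
with sqlite3.connect("analytics.db") as con:
    combined.to_sql("customers", con, if_exists="replace", index=False)
```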

I would like to know the difference between Database, Data warehouse, Data mining and Big Data?

Since I am new to the word database, I would like to know the differences. Please explain with examples: what are Database, Data Mining, Data Warehouse, and Big Data?
I highly recommend using http://bigdatauniversity.com/
It has free, relevant, and up-to-date course material on the topics you ask about. Topics such as Hadoop and Data Mining are covered, and it gives you access to tools to practise with.
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.
A database is an organized collection of data. The data is typically organized to model aspects of reality in a way that supports processes requiring information.
Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one that has the most direct business applications.
StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.
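As a hedged illustration of the explore-then-validate loop in the data mining definition above, here is a small scikit-learn sketch on synthetic data:

```python
# Hedged sketch of the explore-then-validate loop described in the data
# mining definition above: fit a model on one subset, then check the
# detected patterns against a held-out subset. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Validating on previously unseen rows is what separates a real pattern from noise
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```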

How would you implement OLAP cubes if the raw data is inconsistent?

I have been reading up about OLAP cubes, and I've spoken to our data architect about OLAP implementation within our warehouses. However, I learned that in order for OLAP to be implemented, the data that we receive has to be consistent. The data we've always been receiving isn't. For instance, we receive data in various forms: flat files, CSV, AS2, and 852s. Additionally, there are a lot of custom metrics and cleansing we do internally. What would be an alternative way around OLAP?
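For what it's worth, the cleansing step described here is often handled by standardizing every feed into one staging schema before any cube is built; a hypothetical pandas sketch (file names, field widths, and columns are assumptions):

```python
# Hypothetical sketch of a standardization step before any OLAP layer is
# built: two differently shaped feeds (a fixed-width flat file and a
# partner CSV) are mapped onto one consistent staging schema. File names,
# field widths, and column names are assumptions; Parquet output needs pyarrow.
import pandas as pd

flat = pd.read_fwf(
    "inbound/orders_fixed_width.txt",
    widths=[10, 8, 12],
    names=["order_id", "order_date", "amount"],
)
partner = pd.read_csv(
    "inbound/orders_partner.csv", usecols=["OrderId", "Date", "Total"]
).rename(columns={"OrderId": "order_id", "Date": "order_date", "Total": "amount"})

staging = pd.concat([flat, partner], ignore_index=True)
staging["order_date"] = pd.to_datetime(staging["order_date"], errors="coerce")
staging = staging.dropna(subset=["order_id", "order_date"])
staging.to_parquet("staging/orders.parquet", index=False)
```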
