Does YAFFS (Yet Another Flash File System) have data deduplication, and does it provide data consistency? Or is there an additional tool to ensure no data redundancy?
I keep hitting an issue where BigQuery's API tells me a table I just created doesn't exist. According to its documentation on errors for streaming inserts:
"Because BigQuery's streaming API is designed for high insertion rates, modifications to the underlying table metadata are eventually consistent when interacting with the streaming system."
https://cloud.google.com/bigquery/docs/error-messages#metadata-errors-for-streaming-inserts
However, Google also says: "All table modifications in BigQuery, including DML operations, queries with destination tables, and load jobs are ACID-compliant" (emphasis mine).
https://cloud.google.com/architecture/bigquery-data-warehouse#handling_change
So, if creation of the table is eventually consistent, is it BASE? Or is it ACID and I am not thinking about it correctly, because the eventual consistency lies with the streaming API's view of the metadata, not within the DB itself?
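For reference, this is roughly the pattern that triggers it; a minimal sketch using the google-cloud-bigquery client, where the table ID and the backoff values are placeholders:

    import time

    from google.api_core.exceptions import NotFound
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default project and credentials
    table_id = "my-project.my_dataset.my_table"  # hypothetical table

    rows = [{"id": 1, "name": "example"}]

    # A freshly created table can briefly look nonexistent to the streaming
    # system while its metadata propagates, so retry NotFound with backoff
    # instead of treating it as fatal.
    for attempt in range(5):
        try:
            errors = client.insert_rows_json(table_id, rows)
            if errors:
                print(f"row-level errors: {errors}")
            break
        except NotFound:
            time.sleep(2 ** attempt)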
I want to ask: can we store various types of data in a Hadoop data warehouse? Data like RDBMS tables, JSON documents, Cassandra keyspaces, txt, CSV, etc.? Is it all stored in HDFS?
A classic DWH is a repository for structured, filtered data that has already been processed for a specific purpose, and all the data is stored in the same format, except in the landing zone (LZ or RAW), where data can be stored in the same format as it is loaded from the source systems. The DWH building process is based on Kimball or Inmon theory.
What you are asking about is a Data Lake, a more modern concept: a vast pool of raw data, the purpose of which may not be completely defined yet. In a DL you can store structured along with semi-structured data, and data analysts can access both the raw semi-structured data and structured data in 3NF or dimensional form.
An RDBMS normally adds an abstraction layer between the internal storage representation and the means by which it can be accessed; still, storing data in external files in HDFS is possible for many RDBMSs, and this is used for convenient integration with a Data Lake.
Yes, you can store everything in the same DL: semi-structured data and data in different storage formats like Avro, CSV, Parquet, ORC, etc. You can build Hive tables on it, as well as different RDBMS external tables, and all of it can be stored in the same HDFS/S3/Azure/GCS/etc.
Some layers can also be created in a DL, like RAW/LZ/DM, or layers based on a domain-event/business-event model. This means a DL is not an absence of architecture constraints: normally you have some architecture design, and architecture constraints to follow, in a DL just as in a classic DWH.
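As a minimal illustration (PySpark; the paths, columns, and table name are made up), landing raw CSV in a RAW/LZ zone and rewriting it as Parquet in a curated zone of the same HDFS, with a Hive table on top:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dl-zones").enableHiveSupport().getOrCreate()

    # RAW/LZ zone: keep the data exactly as delivered by the source (CSV here).
    raw = spark.read.option("header", "true").csv("hdfs:///datalake/raw/orders/")

    # Curated zone: the same data, cleaned and rewritten in a columnar format.
    curated = raw.dropDuplicates(["order_id"])
    curated.write.mode("overwrite").parquet("hdfs:///datalake/curated/orders/")

    # A Hive external table over the curated files gives SQL access to the DL.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS curated_orders (
            order_id STRING, order_ts STRING, amount DOUBLE
        )
        STORED AS PARQUET
        LOCATION 'hdfs:///datalake/curated/orders/'
    """)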
Assuming the following data architecture:
Source Systems -> Data Warehouse (using the data vault model) -> Data Virtualization -> Consumption Layer (e.g., BI Tools & reporting)
I read that for data vault, one of the key principles is to load raw data and keep records from all sources, with no de-duplication or transformations, for traceability/auditing purposes. If this is true, where would the transformations happen?
Yes, it is true: the "raw" data vault keeps records as they were on the source system when they were loaded.
But there's another concept, the "business" data vault. This is where all the logic and transformation happen. The business data vault is not a full copy of the raw data vault; instead, you create hub/link/satellite/PIT/bridge structures that implement the logic to suit your needs.
That way, it helps you in the long run. If, for example, you need to change a business rule next year, you still have the original data for a particular source system at a particular time in the past. If your logic has a bug, you still have the original data.
From my experience, you usually have this architecture:
Raw Source (Copy from your OLTP data sources)
Staging (nowadays a Persistent Staging Area in a data lake, because it is cheaper than a relational DB)
Raw Vault (applying so-called Hard Rules, like data type changes)
Business Vault (applying so-called Soft Rules: all your business logic, aggregations, concatenations, ...)
Information Mart (Data Mart sometimes virtualized, but not always ... usually Star/Snowflake Schema)
Cube/Tabular Model
BI Tool
You can find more information about the difference between the Raw Vault and the Business Vault here: Datavault - hard rules (rawvault) vs soft rules (businessvault)
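To make the hard/soft distinction concrete, here is a rough PySpark-style sketch (the column names and the rules themselves are invented for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dv-rules").getOrCreate()
    staged = spark.read.parquet("hdfs:///psa/customers/")  # persistent staging

    # Raw Vault, hard rules only: technical changes such as type casts.
    # No business logic is applied, so the data stays auditable.
    raw_sat = staged.select(
        F.col("customer_id").cast("string"),
        F.col("birth_date").cast("date"),
        F.col("load_ts").cast("timestamp"),
    )

    # Business Vault, soft rules: derived business logic lives here and can
    # be rebuilt from the Raw Vault whenever a rule changes.
    age_years = F.floor(F.months_between(F.current_date(), F.col("birth_date")) / 12)
    biz_sat = raw_sat.withColumn(
        "age_band",
        F.when(age_years < 30, "under_30").otherwise("30_plus"),
    )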
I ask as the words are used pretty much interchangeably in some documentation I have had to review.
In the real world what are the differences?
A "Data Warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
That being said, I think it's a bit redundant to say "unified data warehouse"; a data warehouse is a "unified" source of data by definition.
This definition implies that the data model in a data warehouse must/should be a unified, canonical model of all relevant data. You can also look at a Data Warehouse as a collection of data marts, which in turn are smaller unified/canonical models focused on specific business/functional areas; so the "unified data model" can be thought of as the sum of the various smaller/specific models (the data marts).
A Data Warehouse, as an information system, is usually surrounded by a lot of technology tools (databases, ETL software, analytics and reporting tools, etc.); but regardless of how you handle, model, and explore data, the primary purpose of a DW is to serve as a curated, single source of truth for (business) questions that (should) rely on data.
I want to understand data warehouse and data lake more in detail.
It seems to me there is differing information on the topic. Inmon defines a data warehouse as
a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process
Now I understand that this is just a form of architecture and does not imply any technology, which means the underlying storage can be anything, for example S3 object storage. Moreover, Waas et al., in On-Demand ELT Architecture for Right-Time BI: Extending the Vision, proposed a data warehouse with an ELT process for integrating data.
When it comes to data lakes I found the following definition
scalable storage repository that holds a vast amount of raw data in its native format ("as is") until it is needed plus processing systems (engine) that can ingest data without compromising the data structure
taken from Data lake governance.
Now, can a data warehouse be a stricter data lake? There has been an argument that a data warehouse must use ETL, but according to Inmon the definition does not include any restriction on data transformation. If data integration can be ELT, and the transformation is agile (e.g., it can be easily extended), a data warehouse looks very much like a data lake.
Are my assumptions correct, or am I looking at this from a skewed angle?
A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).
You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.
A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.
Imagine this case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this), which would be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.
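As a toy sketch of that processing step (plain Python; the log format and the output columns are assumed, not prescriptive):

    import csv
    import gzip
    import re

    # Assumes a common-log-style line; real web server formats vary.
    LINE_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\d+)'
    )

    # One output row per request: the grain of a simple page-views fact table.
    with gzip.open("access.log.gz", "rt") as logs, \
            open("fact_page_views.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "timestamp", "request", "status", "bytes"])
        for line in logs:
            m = LINE_RE.match(line)
            if m:
                writer.writerow([m["ip"], m["ts"], m["req"], m["status"], m["size"]])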
A Data Warehouse was created to address the need for analytical data processing over enterprise-level, structured data. This means:
Data comes from throughout the organization and is usually brought into the warehouse from various sources using ETL processes.
Data in the warehouse is structured and managed in a format optimized for intensive analytical transformations. Most warehouses structure data as a columnar store and provide a SQL-type interface to work with the data (a toy illustration follows this list).
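Here is that columnar-store-plus-SQL idea in miniature, using DuckDB purely as a stand-in engine (the file paths are made up):

    import duckdb  # columnar engine with a SQL interface, used for illustration

    con = duckdb.connect()

    # Load columnar files and run a typical OLAP-style aggregation over them.
    con.execute("CREATE TABLE sales AS SELECT * FROM read_parquet('sales/*.parquet')")
    rows = con.execute("""
        SELECT region, SUM(amount) AS revenue
        FROM sales
        GROUP BY region
        ORDER BY revenue DESC
    """).fetchall()
    print(rows)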
A Data Lake, on the other hand, was created to be a one-stop zone for all of your organization's data. Data is in a raw, unprocessed format, straight from the applications. You can also process data in the lake, either by moving it into the warehouse or by using it directly in distributed big data processing systems.
So from this we see that a data warehouse is not a data lake, since it:
does not hold unstructured data
can only be used for compute-intensive OLAP applications