I've been reviewing a client's architecture, particularly their OLAP system, which is just a regular old snowflake schema on SQL Server. The facts and dimensions are ETL'd in from other transactional systems such as ERP.
One thing that jumped out at me was several additional tables, in the same database, for multiple additional OLTP applications. These tables have FK relationships to dimension tables in the snowflake schema.
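Roughly, the shape is something like this (table and column names invented for illustration; the real schema is larger):

    -- a dimension table that is part of the warehouse's snowflake schema
    CREATE TABLE dbo.DimCustomer (
        CustomerKey  INT IDENTITY PRIMARY KEY,
        CustomerName NVARCHAR(100),
        Region       NVARCHAR(50)
    );

    -- a table belonging to one of the OLTP applications, in the same database,
    -- with a foreign key pointing straight at the dimension
    CREATE TABLE dbo.AppOrder (
        OrderId     INT IDENTITY PRIMARY KEY,
        CustomerKey INT NOT NULL,
        OrderDate   DATETIME2,
        CONSTRAINT FK_AppOrder_DimCustomer
            FOREIGN KEY (CustomerKey) REFERENCES dbo.DimCustomer (CustomerKey)
    );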
There are a lot of joins into the dimension data from the OLTP system, so performance is not the best.
I am not an OLAP expert at all, but this just feels wrong. I've done some searching but can't find much about this on the internet, either pro or con. What are the benefits of doing this? Are there any? What about potential problems?
I would try to avoid any explicit foreign keys between OLTP and OLAP data. Having foreign keys from OLTP to OLAP prevents you from adding new entities in the business and may require you to define entities in OLAP first, while the standard is to run the ETL processes in one direction only - always from OLTP to OLAP. And having foreign keys from OLAP to OLTP prevents you from keeping historical data in the data warehouse that is no longer relevant for the current business but may still be interesting for analysis.
Of course, there are always situations where you break rules for a reason. Maybe there is one. Does someone at the client's side have an explanation why this was implemented the way you describe?
It is not common to share a dimension table between OLTP and OLAP, for at least two reasons: (1) the attributes of interest in OLTP and in OLAP may be quite different; (2) contention and the consequent performance problems.
On the other hand, it is not uncommon (though somewhat advanced) for OLTP and an ODS to share exactly the same copy of a dimension. This is often called a "golden copy". I often call an ODS designed like this an active ODS; when there are multiple copies of the dimension, I call it a passive ODS. It may be that the OLAP you refer to is not true OLAP but just some form of tactical reporting, in which case sharing the same dimension table is not uncommon.
The chunk partitioning for hypertables is a key feature of TimescaleDB.
You can also create relational tables instead of hypertables, but without the chunk partitioning.
So if you have a database with around 5 relational tables and 1 hypertable, does it lose the performance and scalability advantage of chunk partitioning?
One of the key advantages of TimescaleDB in comparison to other time-series products is that time-series data and relational data can be stored in the same database and then queried and joined together. So, "by design", it is expected that a database with several normal tables and a hypertable will perform well. The usual PostgreSQL considerations about tables and other database objects (e.g., how shared memory is affected) apply here.
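As a rough illustration (table and column names invented), a plain relational table and a hypertable can live side by side and be joined directly; the hypertable still benefits from chunk exclusion on the time predicate:

    -- an ordinary relational table
    CREATE TABLE device (
        device_id INT PRIMARY KEY,
        location  TEXT
    );

    -- a time-series table, converted to a hypertable (chunk-partitioned on "time")
    CREATE TABLE reading (
        time      TIMESTAMPTZ NOT NULL,
        device_id INT REFERENCES device (device_id),
        value     DOUBLE PRECISION
    );
    SELECT create_hypertable('reading', 'time');

    -- joins between the hypertable and the relational table work as usual
    SELECT d.location, avg(r.value) AS avg_value
    FROM reading r
    JOIN device d ON d.device_id = r.device_id
    WHERE r.time > now() - INTERVAL '1 day'
    GROUP BY d.location;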
Currently, I generate data on a different datastore and replicate it to Snowflake staging; that data then moves to the data warehouse DB through ELT ingestion for analytics purposes. However, this approach can be seen as creating data silos in itself, since we already have three copies of the same data:
Transactional data-store DB
Replicated Snowflake staging
Snowflake Data Warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake as a direct datastore for transactional application? (application that does many CRUD operations). That may help in avoiding the cost of replication and ingestion.
The main problem I see with this approach is that Snowflake does not enforce any referential integrity (primary keys, foreign keys), so within the CRUD app I have to either always use a MERGE statement or somehow make sure I don't create duplicate records.
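For illustration (table and column names are hypothetical), the kind of MERGE I mean would be something like:

    MERGE INTO customer tgt
    USING (SELECT 42 AS customer_id, 'Acme Ltd' AS customer_name) src
        ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN
        UPDATE SET tgt.customer_name = src.customer_name
    WHEN NOT MATCHED THEN
        INSERT (customer_id, customer_name)
        VALUES (src.customer_id, src.customer_name);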
The other problem is that, being in the cloud, the distance (i.e. the network) between the app and Snowflake determines the performance of the transactions, and I want good, consistent performance for my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake, as of today, does not perform well with singleton updates and inserts, which is mostly what we see with transactional databases. I have seen performance degradation when singleton inserts are submitted against Snowflake.
On the other hand, it is very well optimized for bulk ingestion of structured and semi-structured data and is designed for OLAP warehouses. You can still use it, but you may see the same performance degradation. Also, primary keys can be defined, but they are not enforced.
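A quick sketch of that last point (hypothetical table) - the key can be declared, but duplicates are still accepted:

    CREATE TABLE orders (
        order_id INT PRIMARY KEY,      -- declared, but not enforced by Snowflake
        amount   NUMBER(10,2)
    );

    INSERT INTO orders VALUES (1, 100.00);
    INSERT INTO orders VALUES (1, 100.00);  -- also succeeds: the duplicate key is not rejected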
In my opinion, if you are faced with that challenge, you have the option of using a PostgreSQL database (open source) in the cloud as your transactional database; it acts as a good complement to Snowflake as the OLAP database.
No. Snowflake isn't good as a transactional / OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how they structure the data (optimised for OLAP workloads).
Just want to point out that there are benefits to keeping separate databases. For one, you want to isolate your transactional database from your analytics database, otherwise you could significantly affect the performance of the application. Secondly, the data in the transactional database could change, and if you had to reprocess the data for whatever reason, you might not be able to do so. There are many more, but I will stop here :-)
We know that Snowflake is a compressed, columnar-storage database, tuned to run queries with MPP and auto-scaling. We also know that Kimball-style dimensional modelling (star schemas) has been the standard practice for building data marts and DWs for decades now. It was a success largely because of the massive row-store DBs we used to have for our DWs.
So the question is: when creating data marts and a DW in Snowflake, do we have to follow Kimball? Does it add any value to performance? In fact, I have read that it adds overhead for an engine that is already tuned to work on columnar, compressed data. Do we still need to use surrogate keys and force ourselves to create facts, dimensions and a star schema, when we could simply join flat, denormalized tables and get similar or better performance?
What do powerful DBs like Snowflake recommend as a best practice for modelling? Is Kimball a must-have, or is it redundant because it defeats the purpose of the columnar-storage benefits?
Looking at SAP HANA / Redshift / BigQuery or even Azure SQL Data Warehouse, none of them recommend this, and I could not find a single line anywhere recommending the use of Kimball or a star schema. A few do mention that "it also works for star schemas", which does not mean that a star schema has to be used.
One thing to keep in mind: Snowflake is a row-oriented, columnar store. That's an important distinction. This means that Snowflake takes advantage of all of the significant compression gains associated with columnar storage but still maintains the row-oriented approach to storing data.
Why does this matter?
With the micro-partition approach, we can still eliminate large numbers of rows using query predicates and then interrogate only the column stores within those row groupings that meet the query's criteria. So you really get the best of both worlds.
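As a hedged sketch of what that looks like in practice (table and column names are made up): Snowflake keeps min/max metadata per column for every micro-partition, so a selective predicate lets it skip partitions whose ranges cannot match, and only the referenced columns inside the surviving partitions are read.

    SELECT region, SUM(amount) AS total_amount
    FROM sales
    WHERE sale_date BETWEEN '2020-01-01' AND '2020-01-31'
    GROUP BY region;

The query profile then shows how many micro-partitions were scanned versus the total, which is where the pruning becomes visible.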
In my opinion, Snowflake can support just about any data model (or partial/hybrid implementation).
Also - "redundant" values in a row-oriented, column store tend to lead to very, VERY good compression.
We have a requirement where we want our business users to manipulate data in Snowflake through some UI (which might require creating additional reference tables, etc.).
Is this good practice, given that Snowflake is meant for DW purposes and not for transactional data, and are there any performance issues in doing so?
Typical actions would be filtering data, searching for particular IDs, updating/deleting certain rows, etc. For these activities, I wanted to know whether it is cost-effective.
Yes, your point is correct: Snowflake will suit a DW workload better than a transactional one.
Having said that, we must also note that Snowflake supports a lot of features provided by traditional OLTP databases: ACID properties and transactional consistency, object recovery from accidental drops, read-only database copies for reporting purposes, data encryption, role-based access control, secure views, materialized views, support for semi-structured data, and a wide variety of functions (scalar, aggregate, window, table and system functions), as well as external functions and customized user-defined functions (UDFs).
It also provides connectivity interfaces (drivers/connectors) to a variety of database systems, big-data ecosystems and analytical tools.
The Snowflake engine also offers an amazing level of query performance and it has the ability to add dynamic compute power for higher concurrency and greater performance.
So if you are looking for a database with good query performance for operations such as querying, filtering and routine updates and deletes, then it will serve the purpose.
But if you are planning to create constraints on tables, PK/FK relationships between tables, indexes, a lot of single-row inserts, any isolation level apart from READ COMMITTED, transaction management through stored procedures, etc., then it may not be a natural choice.
There is a concept of table clustering in place of indexes. Single-row inserts should be converted into COPY INTO <table> commands to reduce throttling and get better performance. Primary keys / foreign keys can be created, but they are not enforced.
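To make that concrete (stage name, table and clustering column are placeholders):

    -- bulk load staged files instead of issuing many single-row INSERTs
    COPY INTO sales
    FROM @my_stage/sales/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- a clustering key instead of a traditional index
    ALTER TABLE sales CLUSTER BY (sale_date);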
Can anyone tell me the difference between a simple database and a data warehouse in terms of implementation?
I know that a data warehouse is used for analysis rather than record keeping, but I don't understand how they are structurally different.
In a simple database we have tables, and so does a data warehouse. How can we make a data warehouse out of a simple database?
In both cases we run queries, so how do the queries differ between the two?
The differences are in the implementation, that is, in the representation (structure) of the data in tables.
A simple database is typically structured in normalized tables in order to minimize redundancy and optimize write operations. This is achieved by dividing large tables into smaller, less redundant tables, so that data of the same kind is kept in one place and additions, deletions and modifications of a field can be made in just one table. The smaller tables are then connected via defined relationships (foreign keys), which results in many joins between tables when retrieving the data.
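A tiny, hypothetical example of that normalized shape:

    -- each piece of data lives in exactly one place; relationships are foreign keys
    CREATE TABLE customer (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100)
    );

    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL REFERENCES customer (customer_id),
        order_date  DATE
    );

    CREATE TABLE order_line (
        order_id   INT NOT NULL REFERENCES orders (order_id),
        product_id INT NOT NULL,
        quantity   INT,
        PRIMARY KEY (order_id, product_id)
    );

Retrieving full order details then requires joining all three tables.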
A data warehouse, on the other hand, is structured for read operations, which is why it accepts some level of redundancy in the data - that redundancy makes reading faster. In a data warehouse, data is typically structured in what is called a star schema, through the use of dimensional modeling. That means you have one big table (the fact table) with all the relevant records and measures (e.g. sales amount in $), and then many smaller tables (called dimension tables) that describe the values in the fact table.
Dimension tables could be things like Date, SalesCountry, SalesPerson, Product, etc., all of which describe the sales amount in the fact table. The dimension tables are then related to the fact table with foreign keys, creating the star-like figure with the fact table in the middle and all the dimension tables around it in a circle, linking to it.
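A minimal sketch of such a star schema (table and column names are just illustrative):

    -- dimension tables describe the context of each sale
    CREATE TABLE dim_date     (date_key INT PRIMARY KEY, full_date DATE, calendar_year INT, calendar_month INT);
    CREATE TABLE dim_product  (product_key INT PRIMARY KEY, product_name VARCHAR(100), category VARCHAR(50));
    CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100), country VARCHAR(50));

    -- the fact table in the middle holds the measures plus foreign keys to the dimensions
    CREATE TABLE fact_sales (
        date_key     INT REFERENCES dim_date (date_key),
        product_key  INT REFERENCES dim_product (product_key),
        customer_key INT REFERENCES dim_customer (customer_key),
        sales_amount DECIMAL(12,2)
    );

    -- a typical analytical query: join out to the needed dimensions, aggregate the measure
    SELECT d.calendar_year, c.country, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d     ON d.date_key = f.date_key
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY d.calendar_year, c.country;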
NB: This is a very simple introduction, and you should of course refer to some data warehouse literature for more details. Look for books by Ralph Kimball and Bill Inmon; they are the gurus of the data warehouse field.
Assuming you already know something about OLTP databases, the IBM Redbooks have a couple of downloadable titles about data warehouses that are worth looking at.
Data Modeling Techniques for Data Warehousing
Dimensional Modeling: In a Business Intelligence Environment
In essence, the way that data and tables are organized -- and more...
Read
Bill Inmon "Building the Data Warehouse"
Ralph Kimball "The Data Warehouse Toolkit"
OLTP stands for Online Transaction Processing. These are the systems used in, for example, any booking system; in technical terms, "OLTP refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing".
Now the next question arises: what is the difference between OLTP and a data warehouse?
There are many differences between the two, so we will list some of the important ones:
The most important difference of all: OLTP is normally in 3NF (Third Normal Form), whereas a data warehouse is not. From this we can also infer that OLTP has hardly any data redundancy.
A data warehouse stores months and years of data to support historical analysis, whereas an OLTP system stores data for a few weeks or months. The database sizes therefore differ vastly as well: OLTP typically ranges from 100 MB to 100 GB, whereas a data warehouse ranges from 100 GB to a few terabytes.
The highly normalized structure of OLTP helps it optimize operations such as UPDATE/INSERT/DELETE, whereas a data warehouse has a very de-normalized structure (star schema) to optimize query performance.
Data in a data warehouse is pushed in on a regular basis by an ETL process, and end users do not update the data warehouse directly, whereas in OLTP systems end users routinely issue individual data modification statements against the database, so the OLTP system is always up to date.
These are a few of the important differences between OLTP and a data warehouse.