Snowflake Database with Dimensional Modelling (Star Schema) - snowflake-cloud-data-platform

We know that Snowflake is a compressed, columnar-storage database tuned to run queries with MPP and auto-scaling. We also know that Kimball-style dimensional modelling (star schema) has been the established practice for building data marts and data warehouses for decades. Much of that success was due to the massive row-store databases we used to run our DWs on.
So the question is: when building data marts and a DW in Snowflake, do we have to follow Kimball? Does it add any value to performance? In fact, I have read that it adds overhead for an engine that is already tuned to work on compressed columnar data. Do we still need surrogate keys, and do we have to force the data into facts, dimensions, and a star schema, when we could simply join flat denormalized tables and get similar or better performance?
What do powerful databases like Snowflake recommend as a modelling best practice? Is Kimball a must-have, or is it redundant because it defeats the purpose of columnar storage?
As far as I can tell, SAP HANA, Redshift, BigQuery, and even Azure SQL Data Warehouse do not recommend it, and I could not find a single line anywhere recommending Kimball or a star schema. A few do mention that "it also works for star schemas", which does not mean that a star schema has to be used.

One thing to keep in mind: Snowflake is a hybrid of row-oriented and columnar storage, and that's an important distinction. It gets all of the significant compression gains associated with columnar storage, but it still groups rows together physically: each micro-partition holds a contiguous set of rows, stored and compressed column by column.
Why does this matter?
With the micro-partition approach, query predicates can eliminate large numbers of micro-partitions (and therefore rows) up front, and the engine then reads only the column segments it needs within the micro-partitions that met the query's criteria. So you really get the best of both worlds.
In my opinion, Snowflake can support just about any data model (or partial/hybrid implementation).
Also - "redundant" values in a columnar store tend to lead to very, VERY good compression.
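To make this concrete, here is a minimal sketch (hypothetical table and column names) showing that Snowflake handles either modelling style; both queries benefit from micro-partition pruning on the filter predicates, and the repeated values in the flat table compress well:

    -- Star schema: fact table joined to a dimension, filtered on a dimension attribute.
    SELECT d.sales_region, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_store  d ON d.store_key = f.store_key
    WHERE  d.sales_region = 'EMEA'
      AND  f.sale_date >= '2020-01-01'
    GROUP BY d.sales_region;

    -- Flat, denormalized alternative: the same question against one wide table.
    -- The repeated region values compress very well in columnar storage,
    -- and the date predicate still prunes micro-partitions.
    SELECT sales_region, SUM(sales_amount) AS total_sales
    FROM   sales_flat
    WHERE  sales_region = 'EMEA'
      AND  sale_date >= '2020-01-01'
    GROUP BY sales_region;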

Related

Does having relational tables affect the performance/scalability when using TimescaleDB?

The chunk partitioning for hypertables is a key feature of TimescaleDB.
You can also create relational tables instead of hypertables, but without the chunk partitioning.
So if you have a database with around 5 relational tables and 1 hypertable, does the database lose the performance and scalability advantages of chunk partitioning?
One of the key advantages of TimescaleDB over other time-series products is that time-series data and relational data can be stored in the same database and then queried and joined together. So, "by design", a database with several normal tables and a hypertable is expected to perform well. The usual PostgreSQL considerations about tables and other database objects, e.g., how shared memory is affected, apply here.
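As a minimal sketch (hypothetical table and column names), this is what mixing the two looks like: only the time-series table is converted into a hypertable, the lookup table stays a plain relational table, and they join as usual:

    -- Plain relational table: no chunk partitioning, behaves like any PostgreSQL table.
    CREATE TABLE devices (
        device_id INTEGER PRIMARY KEY,
        location  TEXT
    );

    -- Time-series table, converted into a hypertable partitioned on the time column.
    CREATE TABLE readings (
        time        TIMESTAMPTZ NOT NULL,
        device_id   INTEGER REFERENCES devices (device_id),
        temperature DOUBLE PRECISION
    );
    SELECT create_hypertable('readings', 'time');

    -- Joins work exactly as in vanilla PostgreSQL; the time predicate still lets
    -- TimescaleDB exclude chunks of the hypertable, so the relational tables do
    -- not cost you the chunk-partitioning advantage.
    SELECT d.location, avg(r.temperature) AS avg_temp
    FROM   readings r
    JOIN   devices  d USING (device_id)
    WHERE  r.time > now() - INTERVAL '1 day'
    GROUP BY d.location;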

Analytics along with OLTP Database

I have a primary use case where I want to have a transactional relational database for which I am using Postgres.
I also need to run frequent aggregate queries (count, sum, average) on the data. These statistics cannot be precomputed, as there are multiple search filters that we have to support.
I was initially thinking of using Redshift as a secondary storage, which can serve these queries, but then I would also need to build a system to keep the data in sync between the two storages.
Is there a better way to achieve this?
Take a look at AWS DMS; you can set it up to keep a near-real-time replica of your Postgres data in Redshift.
It is reliable and requires minimal maintenance (e.g. if you add new columns to your source data).
Read both of these carefully, especially limitations and requirements.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
and
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html
Unless you need them, I recommend excluding text (and other large-object) columns from the sync. This can be done easily by setting a flag, or it can be tailored column by column.
The source Postgres database does not have to be held on AWS.

is it needed to normalize your database when you are using mongodb?

I am making a web application and using MongoDB for my database, but I am new to MongoDB and I just want to know whether I still need to normalize my database the way people do with an RDBMS, where the schema is normalized before the tables or database are created.
Normalizing your data like you would with a relational database is usually not a good idea in MongoDB.
Normalization in relational databases is only feasible under the premise that JOINs between tables are relatively cheap. The $lookup aggregation operator provides some limited JOIN functionality, but it does not work with sharded collections. So joins often need to be emulated by the application through multiple subsequent database queries, which is very slow (see the question MongoDB and JOINs for more information).
For that reason, you should design your database so that the most common queries can be satisfied by querying a single collection, even when this means you will have some redundancy in your data. Arrays of sub-documents are often a good tool for modelling 1:n relations. However, try to avoid documents that grow indefinitely over time, because such documents get moved around in the data files a lot, which hurts write performance.

data warehouse and database difference in implementation

Can anyone tell me the difference between a simple database and a data warehouse in terms of implementation?
I know that a data warehouse is used for analysis rather than for keeping operational records, but I don't understand how they are structurally different.
In a simple database we have tables, and so does a data warehouse. How can we make a data warehouse out of a simple database?
In both cases we write queries, so how do the queries differ between the two?
The differences are in the implementation, that is the representation (structure) of the data in tables.
A simple database is typically structured as normalized tables in order to minimize redundancy and optimize write operations. This is achieved by dividing large tables into smaller, less redundant tables, so that data of the same kind is isolated in one place and additions, deletions, and modifications of a field can be made in just one table. The smaller tables are then connected via defined relationships (foreign keys), which results in many joins between tables when retrieving the data.
A data warehouse, on the other hand, is structured for read operations, which is why it accepts some level of redundancy in the data: redundancy makes reading faster. In a data warehouse, data is typically structured in what is called a star schema, using dimensional modelling. That means you have one big table (the fact table) with all the relevant records and measures (e.g., sales amount in $), and then many smaller tables (called dimension tables) that describe the values in the fact table.
Dimension tables could be things like Date, SalesCountry, SalesPerson, Product, etc., all of which describe the sales amount in the fact table. The dimension tables are related to the fact table with foreign keys, creating the star-like figure with the fact table in the middle and all the dimension tables around it in a circle, linking to it.
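As a minimal sketch of that structure (hypothetical table and column names, generic SQL), the fact table carries the measures and the foreign keys, and each dimension table describes one axis of analysis:

    -- Dimension tables: descriptive attributes, one row per member.
    CREATE TABLE dim_date (
        date_key      INT PRIMARY KEY,   -- e.g. 20240131
        calendar_date DATE,
        year          INT,
        month         INT
    );

    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(100),
        category     VARCHAR(50)
    );

    -- Fact table: measures plus a foreign key to every dimension.
    CREATE TABLE fact_sales (
        date_key     INT REFERENCES dim_date (date_key),
        product_key  INT REFERENCES dim_product (product_key),
        quantity     INT,
        sales_amount DECIMAL(12, 2)
    );

    -- Typical warehouse query: join the fact to the dimensions that filter and group it.
    SELECT d.year, p.category, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales  f
    JOIN   dim_date    d ON d.date_key    = f.date_key
    JOIN   dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category;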
NB: This is a very simple introduction, and you should of course refer to some data warehouse literature for more details. Look for books by Ralph Kimball and Bill Inmon; they are the gurus of the data warehouse field.
Assuming you already know something about OLTP databases, the IBM Redbooks have a couple of downloadable titles about data warehouses that are worth looking at.
Data Modeling Techniques for Data Warehousing
Dimensional Modeling: In a Business Intelligence Environment
In essence, the way that data and tables are organized -- and more...
Read
Bill Inmon, "Building the Data Warehouse"
Ralph Kimball, "The Data Warehouse Toolkit"
OLTP stands for Online Transaction Processing. These are the systems behind, for example, any booking system; in technical terms, OLTP refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
The next question that arises is: what is the difference between an OLTP system and a data warehouse?
There are many differences between the two, so here are some of the important ones:
The most important difference of all: an OLTP schema is normally in 3NF (third normal form), whereas a data warehouse is not. We can therefore also infer that an OLTP database has little to no data redundancy.
A data warehouse stores months and years of data to support historical analysis, whereas an OLTP system stores data for a few weeks or months. The database sizes therefore differ vastly as well: OLTP databases are typically in the range of 100 MB to 100 GB, while a data warehouse ranges from 100 GB to several terabytes.
The highly normalized structure of an OLTP database helps it optimize operations such as UPDATE/INSERT/DELETE, whereas a data warehouse has a heavily denormalized structure (star schema) to optimize query performance.
Data is pushed into a data warehouse on a regular basis by an ETL process, and end users do not update the warehouse directly; in OLTP systems, end users routinely issue individual data-modification statements against the database, so the OLTP system is always up to date.
These are a few of the important differences between OLTP systems and data warehouses.
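To make the first difference concrete, here is a minimal 3NF sketch of the operational (OLTP) side, with hypothetical table names; the warehouse counterpart is the denormalized fact/dimension layout sketched earlier on this page:

    -- 3NF OLTP layout: every piece of information lives in exactly one place.
    CREATE TABLE customers (
        customer_id   INT PRIMARY KEY,
        customer_name VARCHAR(100),
        city          VARCHAR(50)
    );

    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT REFERENCES customers (customer_id),
        order_date  DATE
    );

    CREATE TABLE order_lines (
        order_id   INT REFERENCES orders (order_id),
        line_no    INT,
        product_id INT,
        amount     DECIMAL(12, 2),
        PRIMARY KEY (order_id, line_no)
    );

    -- Changing a customer's city is a single-row UPDATE here, which is exactly what
    -- the normalized structure is optimized for. Answering "sales by city and month
    -- over five years" requires joining all three tables, which is where the
    -- denormalized warehouse layout earns its keep.
    UPDATE customers SET city = 'Berlin' WHERE customer_id = 42;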

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried with the fact that one table in particular will be getting very big. Each customer has approximately 20,000,000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be run against this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advice below, I created a handful of huge tables. Since then, I have been experimenting with indexes and settled on a clustered index on the first two columns (Hospital code and Department code), on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently; as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indexes on the temp table, and run the query there. This is not how it should be. Therefore, we are considering buying Enterprise Edition so that we can use partitioned tables. If the purchase cannot go through, I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have a clear demarcation such as, in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries that cut across the grain of the partitioning key, so it is not a silver bullet.
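For reference, a minimal SQL Server sketch of what that could look like (hypothetical table, boundary values, and a single filegroup just to keep the example short):

    -- Partition function: splits rows into ranges of customer IDs.
    CREATE PARTITION FUNCTION pf_customer (int)
        AS RANGE RIGHT FOR VALUES (10, 20, 30);

    -- Partition scheme: maps every partition to a filegroup (all to PRIMARY here).
    CREATE PARTITION SCHEME ps_customer
        AS PARTITION pf_customer ALL TO ([PRIMARY]);

    -- The big table is created on the partition scheme, keyed by CustomerId.
    CREATE TABLE dbo.CustomerData (
        CustomerId   int            NOT NULL,
        MeasureDate  date           NOT NULL,
        MeasureValue decimal(18, 4) NULL
    ) ON ps_customer (CustomerId);

    -- Queries filtered on the partitioning key touch only the relevant partition;
    -- queries that cut across the grain of the key still scan many partitions.
    SELECT MeasureDate, SUM(MeasureValue) AS TotalValue
    FROM   dbo.CustomerData
    WHERE  CustomerId = 12
    GROUP BY MeasureDate;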
Splitting a table for performance reasons is usually called sharding (or horizontal partitioning). Also, a database schema can be more or less normalized: a normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well, I assume you know something about the subject. Depending on your goals you could go for a star schema (a multidimensional model with a fact table and dimension tables). Store all fast-changing data in one fact table (per subject) and the slow-changing data in dimension/'snowflake' tables.
Another option is the Data Vault method by Dan Linstedt, which is a bit more complex but gives you full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge number of records, and SQL Server should handle it with ease.
A single partitioned table is usually the best way to go. Trying to maintain separate individual customer tables is very costly in terms of time and effort, and far more prone to errors.
Also examine your current queries if you are experiencing performance issues. If you don't have proper indexing (did you, for instance, index the foreign key fields?), queries will be slow; if you don't have sargable queries, they will be slow; if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is strictly needed? If you have SELECT * anywhere in your production code, get rid of it and return only the fields you need. If you have views that call views that call views, or if you used an EAV table, you will have performance issues at that level. If you let a framework autogenerate SQL code, you may well have badly performing queries. Remember, Profiler is your friend. Of course you could also have a hardware issue; you need a pretty good-sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you hire a professional DBA with performance-tuning experience; this is quite complex stuff. Databases designed by application programmers are often poor performers once they get a realistic number of users and records. A database MUST be designed with data integrity, performance, and security in mind; if you didn't do that, the chances of having them are slim indeed.
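As a concrete illustration of the indexing point above (hypothetical table and column names), foreign key columns in SQL Server are not indexed automatically, so a filter-and-join pattern on the big table usually wants an index like this:

    -- Large detail table, frequently filtered by customer and date range.
    CREATE TABLE dbo.Measurements (
        MeasurementId bigint IDENTITY PRIMARY KEY,
        CustomerId    int            NOT NULL,   -- foreign key to the customers table
        MeasureDate   date           NOT NULL,
        MeasureValue  decimal(18, 4) NULL
    );

    -- Index that matches the common predicate/join pattern and covers the queried
    -- column, so those queries do not have to touch the base table at all.
    CREATE NONCLUSTERED INDEX IX_Measurements_CustomerId_MeasureDate
        ON dbo.Measurements (CustomerId, MeasureDate)
        INCLUDE (MeasureValue);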
Partitioning is definitely something to look into. I had a database with two sharded tables, each containing around 30-35 million records. I have since merged them into one large table and assigned some good indexes. So far I've not had to partition this table, as it's working a treat, but I'm keeping partitioning in mind. One thing I have noticed compared to when the data was sharded is the data import: it is now slower, but I can live with that as the import tool can be rewritten ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries over very large data sets to perform (although partitioning is usually less about performance than about maintenance, i.e. being able to add and drop partitions quickly), but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, customers aren't exactly the kind of table that you can easily 'archive off', and the aggravation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at B-tree searching than your own invention is).
You will need to look into the performance and locking issues, however, as these will prevent your db from scaling.
You can also create supplemental tables that hold pre-calculated summaries of the historical information, if there are common queries.
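As a minimal sketch of that idea (hypothetical names, reusing the partitioned table from the earlier example), a summary table can be rebuilt periodically by a scheduled job and then serve the common historical queries directly:

    -- Pre-aggregated monthly summary, one row per customer per month.
    CREATE TABLE dbo.MonthlyCustomerSummary (
        CustomerId   int            NOT NULL,
        SummaryYear  int            NOT NULL,
        SummaryMonth int            NOT NULL,
        RowCnt       int            NOT NULL,
        TotalValue   decimal(18, 4) NOT NULL,
        PRIMARY KEY (CustomerId, SummaryYear, SummaryMonth)
    );

    -- Rebuilt nightly (or incrementally) from the large detail table.
    INSERT INTO dbo.MonthlyCustomerSummary
        (CustomerId, SummaryYear, SummaryMonth, RowCnt, TotalValue)
    SELECT CustomerId,
           YEAR(MeasureDate),
           MONTH(MeasureDate),
           COUNT(*),
           ISNULL(SUM(MeasureValue), 0)
    FROM   dbo.CustomerData
    GROUP BY CustomerId, YEAR(MeasureDate), MONTH(MeasureDate);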
