I'm going to do a POC on Snowflake and just wanted to check what the best practice is around loading data into Snowflake:
1. Should I load the data in normalized form (group and store related information in multiple tables), or go with a denormalized form? What is recommended here?
2. Or should I dump the data into one table and create multiple views from that one table? But consider that the big table has 150 million records and it has a column called Australia State, and we know that we have only 6 states in Australia. If I create a view to extract the Australia State information from the main table, I feel it will be more costly than storing the Australia State information in a separate table, and that is what I am talking about with normalization.
3. What is the way to load SCD-2 dimensions in Snowflake? I'm interested to know an efficient way to do this.
Your questions 1. and 2. seem to be more about partitioning (or "clustering" in Snowflake lingo) than normalization. It is also about performance vs. maintainability.
The best of both worlds would be to have a single table where Australia State is a clustering key. A correct setup will allow for efficient query pruning. Read more in Clustering Keys & Clustered Tables.
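For illustration, a minimal sketch of that setup (the table and column names here are assumptions, not anything from your POC):

CREATE TABLE poc_big_table (
    record_id        NUMBER,
    australia_state  VARCHAR(3),     -- only 6 distinct values, so a natural pruning column
    event_ts         TIMESTAMP_NTZ,
    payload          VARIANT
)
CLUSTER BY (australia_state);

-- Or add the clustering key to a table that is already loaded:
ALTER TABLE poc_big_table CLUSTER BY (australia_state);

-- Queries that filter on the clustering key can then prune micro-partitions:
SELECT COUNT(*) FROM poc_big_table WHERE australia_state = 'NSW';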
Re. question 3: look into MERGE. You may also get some hints from reading Working with SCD-Type-II in Snowflake.
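The usual MERGE-based pattern looks roughly like the sketch below; the dim_customer/stg_customer tables, the tracked columns and the valid_from/valid_to/is_current bookkeeping are all assumptions for illustration, not the exact approach from that article.

-- Step 1: expire the current rows whose tracked attributes changed.
MERGE INTO dim_customer d
USING stg_customer s
  ON d.customer_id = s.customer_id
 AND d.is_current = TRUE
WHEN MATCHED AND (d.customer_name <> s.customer_name OR d.customer_city <> s.customer_city) THEN
  UPDATE SET is_current = FALSE, valid_to = CURRENT_TIMESTAMP();

-- Step 2: insert new current versions for changed keys and brand-new keys.
INSERT INTO dim_customer (customer_id, customer_name, customer_city, valid_from, valid_to, is_current)
SELECT s.customer_id, s.customer_name, s.customer_city, CURRENT_TIMESTAMP(), NULL, TRUE
FROM stg_customer s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current = TRUE
WHERE d.customer_id IS NULL;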
I would load the data in the way that makes the most sense for how it will be updated and used.
Which is to say: we have data (in many forms, actually) that we sync/stream from PostgreSQL DBs, and some of it we dimension (SCD1/SCD2/SCD6) as we load it. For this data we have the update timestamp when we load the record, so we work out the changes and build the dimension data.
If you already have dimension data and it's a single data move, dump the tables you have and just load them. It's really cheap to make a new table in Snowflake, so we just tried things and worked out what fitted our data-ingress patterns and how we were reading the data, to improve clustering or avoid churn that drives up the cost of the auto-clustering operations.
I am working on a PowerBI report that consists of multiple dashboards. The data needed comes from a single table with 100K rows in the DWH. The table stores all the variables and values for different stores, as shown in the picture below.
Currently, we create a new table in the data mart for each separate dashboard, such as total profit in each country, total number of staff in each country, etc. However, I realize I can do the same using Power Query without adding new tables to my data mart. So I am curious which approach is better?
And this leads to another question I always have: when we need a transformed table for a dashboard, should we create new tables in the data mart, or should we do it in the BI tool such as PBI or Tableau? I think performance is a factor to be considered, but I'm not sure about the other factors.
Appreciate if anyone can share your opinion.
Given the amount of transformation that needs to occur, it would be worth doing this in the DWH. Power BI does well with a star schema, so it would be good to break out dimensions like country, store and date into their own tables.
You might also work the measures into a single fact table - or maybe two, if some of the facts are transactional and others are semi-additive snapshot facts (e.g., profit vs. number of staff). Designed right, the model could support all of the dashboards, so you would not need a report table for each.
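As a sketch of what that could look like (all names and data types below are assumptions, and the two measures are just placeholders):

CREATE TABLE dim_country (country_key INT, country_name VARCHAR(100));
CREATE TABLE dim_store   (store_key   INT, store_name   VARCHAR(100), country_key INT);
CREATE TABLE dim_date    (date_key    INT, calendar_date DATE);

-- Transactional measures in one fact table...
CREATE TABLE fact_sales (
    date_key INT, store_key INT, country_key INT,
    profit   DECIMAL(18,2)
);

-- ...and semi-additive, point-in-time measures in a periodic snapshot fact.
CREATE TABLE fact_staffing_snapshot (
    date_key INT, store_key INT, country_key INT,
    staff_count INT
);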
I'm designing a website where users answer surveys. I need to design a data warehouse to aggregate their responses. So far in my model I have:
A dim table for Users.
A dim table for Questions.
A fact table for UserResponses. <= This is where I'm having the problem.
So the problem I have is that additional comments can be added to their responses. For example, somebody may come in and make 2 comments against a single response. How should I model this in the database?
I was thinking of creating another fact table for "Comments", and linking it to a record in UserResponses. Is this the right thing to do? This additional table would have something like the below columns:
CommentText
Foreign key relationship to fact.UserResponses.
Yes, your idea to create another table is correct. I would typically call it a "child" table rather than calling it another fact table.
The key thing that you didn't mention is that the comments table still needs an ID field. A table without an ID would be bad design (although it is indeed possible to create the table with no ID), since you would have no simple way to refer to individual comments.
In a dimensional model, fact tables are never linked to each other, as the grain of the data would be compromised.
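A possible shape for that child table, sketched in SQL Server-style DDL (the UserResponseID key name and the data types are assumptions):

CREATE TABLE fact.UserResponseComments (
    CommentID      INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate ID so individual comments can be referenced
    UserResponseID INT NOT NULL
        REFERENCES fact.UserResponses (UserResponseID),
    CommentText    NVARCHAR(MAX)
);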
The back-end database of a client application is not usually a data warehouse schema, but more of an online transactional processing (OLTP) schema. This is because transactional systems work better with third normal form. Analytical systems work better with dimensional models because the data can be aggregated (i.e., "sliced and diced") more easily.
I would recommend switching back to an OLTP database. It can still be aggregated when needed, but maintains third normal form for easier transactional processing.
Here is a good comparison between a dimensional model (OLAP) and a transactional system (OLTP):
https://www.guru99.com/oltp-vs-olap.html
I'm trying to find the best data model to adapt a very big MySQL table to Cassandra.
This table is structured like this:
CREATE TABLE big_table (
social_id,
remote_id,
timestamp,
visibility,
type,
title,
description,
other_field,
other_field,
...
)
A page (whose id is not in this table) can contain many socials, each of which can contain many remote_ids.
social_id is the partition key; remote_id and timestamp are the clustering keys: remote_id gives uniqueness, and timestamp is used to order the results. So far so good.
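In CQL the key layout looks roughly like this (types simplified for the example):

CREATE TABLE big_table (
    social_id   text,
    remote_id   text,
    timestamp   timestamp,
    visibility  int,
    type        text,
    title       text,
    description text,
    -- social_id = partition key; remote_id + timestamp = clustering columns
    PRIMARY KEY ((social_id), remote_id, timestamp)
);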
The problem is that users can also search on their page contents, filtering by one or more socials, one or more types, visibility (could be 0,1,2), a range of dates or even nothing at all.
Plus, based on the filters, users should be able to set visibility.
I tried to handle this case, but I really can't find a sustainable solution.
The best I've got is to create another table, which I need to keep in sync with the original one.
This table will have:
page_id: partition key
timestamp, social_id, type, remote_id: clustering key
Plus, create a Materialized View for each combination of filters, which is madness.
Can I avoid creating the second table? What would be the best Cassandra model in this case? Should I consider switching to other technologies?
I'll start from the last questions.
> What would be the best Cassandra model in this case?
As stated in Cassandra: The Definitive Guide, 2nd edition (which I highly recommend reading before choosing or using Cassandra),
In Cassandra you don’t start with the data model; you start with the query model.
You may want to read the available chapter about data design at Safaribooksonline.com. Basically, Cassandra wants you to think only about queries and not care about normalization.
So the answer to
> Can I avoid creating the second table?
is: you shouldn't avoid it.
> Should I consider switching to other technologies?
That depends on what you need in terms of replication and partitioning. You may end up creating master-master synchronization based on RDBMS or something else. In Cassandra, you'll end up with duplicated data between tables and that's perfectly normal for it. You trade disk space in exchange for reading/writing speed.
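For example, the page-keyed table from the question would just duplicate the same content under a different key (a sketch, with the same simplified types):

CREATE TABLE contents_by_page (
    page_id     text,
    timestamp   timestamp,
    social_id   text,
    type        text,
    remote_id   text,
    visibility  int,
    title       text,
    description text,
    PRIMARY KEY ((page_id), timestamp, social_id, type, remote_id)
) WITH CLUSTERING ORDER BY (timestamp DESC, social_id ASC, type ASC, remote_id ASC);
-- Every write to big_table is also written here; that duplication is the expected trade-off.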
> how to filter and update a big table dynamically?
If after all of the above you still want to use a normalized data model in Cassandra, I suggest you look at secondary indexes first and then move on to custom indexes like a Lucene index.
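A built-in secondary index is a one-liner; it tends to work best on low-cardinality columns like visibility, and depending on the other predicates you may still need ALLOW FILTERING:

CREATE INDEX ON big_table (visibility);

-- The index can then help with queries such as:
SELECT * FROM big_table WHERE social_id = 'some_social' AND visibility = 1;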
I'm looking for the best practice approach here. I have a web page that has several drop-down options. The drop-downs are not related; they are for misc. values (location, building codes, etc.). The database right now has a table for each set of options (e.g. a table for building codes, a table for locations, etc.). I'm wondering if I could just combine them all into one table (called listOptions) and then just query that one table.
Location Table
LocationID (int)
LocatValue (nvarchar(25))
LocatDescription (nvarchar(25))
BuildingCode Table
BCID (int)
BCValue (nvarchar(25))
BCDescription (nvarchar(25))
Instead of the above, is there any reason why I can't do this?
ListOptions Table
ID (int)
listValue (nvarchar(25))
listDescription (nvarchar(25))
groupID (int) //where groupid corresponds to Location, Building Code, etc
Now, when I query the table, I can pass the groupID to the query to pull back the values I need.
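So every dropdown would come from one parameterised query, something like this (shown with SQL Server-style parameter syntax):

SELECT ID, listValue, listDescription
FROM ListOptions
WHERE groupID = @groupID          -- e.g. 1 = Location, 2 = Building Code
ORDER BY listValue;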
Putting them all in one table is an antipattern. These are different lookups, and you cannot enforce referential integrity in the database (which is the correct place to enforce it, as applications are often not the only way data gets changed) unless they are in separate tables. Data integrity is FAR more important than saving a few minutes of development time if you need an additional lookup.
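To make that concrete, here is a hypothetical referencing table, assuming LocationID and BCID are the primary keys of their lookup tables:

CREATE TABLE Building (
    BuildingID INT PRIMARY KEY,
    LocationID INT NOT NULL REFERENCES Location (LocationID),       -- only valid locations can be stored
    BCID       INT NOT NULL REFERENCES BuildingCode (BCID)          -- only valid building codes can be stored
);
-- With a single ListOptions table, a foreign key to ListOptions.ID could not stop the
-- location column from accidentally pointing at a Building Code row (or any other group).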
If you plan to use the values later in some referencing foreign keys, it is better to use separate tables.
But why do you need an "all in one" table? Which problem does it solve?
You could do this.
I believe this is your master data, and it would not have a huge number of rows, so it should not create any performance problems.
Secondly, why would you want to do this once your app is up and running? It should have been thought about earlier. The tables might be used in a lot of places, and changing them could mean a lot of coding and, most importantly, testing.
Can you throw further light on your requirements?
You can keep them in separate tables and have your stored procedure return one set of data with a "datatype" key that signifies which set of values goes with which option.
However, I would urge you to consider a much different approach. This suggestion is based on years of building data-driven websites. If these drop-down options don't change very often, then why not build server-side include files instead of querying the database? We did this with most of our websites. Think about it: each time the page is presented you query the database for the same list of values, and that data hardly ever changes.
In cases where that data did have a tendency to change, we simply added a routine to the back-end admin that rebuilt the server-side include file whenever an add, change or delete was done to one of the lookup values. This reduced database I/Os and sped up the load time of all our websites.
We had approximately 600 websites on the same server, all using the same instance of SQL Server (separate databases), and our total server database I/Os were drastically reduced.
Edit:
We simply built an SSI file that looked like this...
<option value="1'>Blue</option>
<option value="2'>Red</option>
<option value="3'>Green</option>
With a single table it would be easy to add new groups instead of creating new tables, but for best-practice reasons you should also have a group table, so you can name those groups in the DB for future maintenance.
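Roughly like this (names assumed):

CREATE TABLE ListGroups (
    groupID   INT PRIMARY KEY,
    groupName NVARCHAR(50)            -- 'Location', 'Building Code', ...
);

CREATE TABLE ListOptions (
    ID              INT PRIMARY KEY,
    listValue       NVARCHAR(25),
    listDescription NVARCHAR(25),
    groupID         INT NOT NULL REFERENCES ListGroups (groupID)
);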
The best practice depends on your requirements.
Do the values of location and building vary frequently? Where do the values come from? Are they imported from external data? Do other tables refer to the single table (so that you would need a two-field key to properly join the tables)?
For example, I use a single table with heterogeneous data for constants or configuration values.
But if the data vary often or are imported from an external source, I prefer to use separate tables.
I'm new to databases. I have 4 tables in total: 3 tables are populated automatically when the user logs on to Facebook. I want the values of the primary keys from those tables to be populated into the 4th table. How do I do this? I need help soon!
This is how the tables look:
table:attributes
fb_user : fb_uid, birthday, gender, email.
company_master : com_id, com_name.
position_master : pos_id, pos_name.
And the 4th table goes like this:
[table]:[attributes]
work_history : work_id, fb_uid, com_id, pos_id.
fb_uid, pos_id and com_id are primary keys.
How can I perform this using fewer database operations? Is there any way to use triggers to optimize this?
Firstly, what type of database are you using? Secondly, this seems to be a database design issue. You really should use a single unique primary key across all tables instead of using different primaries and mapping them. Since you're using Facebook, it would make sense to use the Facebook id as the primary key for all tables and then store the other ids as unique fields. This would also allow you to easily use useful features like joins to retrieve data from multiple tables at once.

If this isn't practical (because, for example, you're using multiple logins for the same user: Facebook, Google, etc.), you would then want a lookup table like you suggest as the driving table and use it to help populate the others. Ideally you want to minimize redundant data as much as possible to reduce the risk of data inconsistencies.

If you are new to databases, you should do some reading on database design and database normalization. A good design will help with scalability and prevent a lot of headaches and frustration.
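For example, with fb_uid carried through as the shared key, one join pulls a user's work history in a single query (a sketch based on the tables listed in the question):

SELECT u.fb_uid, u.email, c.com_name, p.pos_name
FROM fb_user u
JOIN work_history    w ON w.fb_uid = u.fb_uid
JOIN company_master  c ON c.com_id = w.com_id
JOIN position_master p ON p.pos_id = w.pos_id;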