Optimisation of Fact Read & Write Snowflake - snowflake-cloud-data-platform

All,
I am looking for some tips on optimizing an ELT process on a Snowflake fact table with approx. 15 billion rows.
We get approximately 30,000 rows every 35 minutes, like the ones below, and we will always get 3 Account dimension key values, i.e. Sales, COGS & MKT.
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    Y
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT  IsCurrent
001          2019-01-01  001                   0012          100    300   200  Y
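For reference, this kind of pivot can be expressed with conditional aggregation in Snowflake; below is a minimal sketch, where FACT_FINANCE (source) and FACT_FINANCE_PIVOT (reporting table) are hypothetical names:
INSERT INTO fact_finance_pivot
SELECT
    finance_key,
    date_key,
    department_group_key,
    scenario_key,
    MAX(CASE WHEN account_key LIKE 'SALES%' THEN value END) AS sales,
    MAX(CASE WHEN account_key LIKE 'COGS%'  THEN value END) AS cogs,
    MAX(CASE WHEN account_key LIKE 'MKT%'   THEN value END) AS mkt,
    'Y' AS iscurrent
FROM fact_finance
WHERE iscurrent = 'Y'
GROUP BY finance_key, date_key, department_group_key, scenario_key;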
At times an adjustment is made for a single Account key.
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value
001          2019-01-01  001                   0012          SALES_001    50
Hence we have to do this
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    X
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
001          2019-01-01  001                   0012          SALES_001    50     Y
And the resulting value should be
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT
001          2019-01-01  001                   0012          50     300   200
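A minimal sketch of how this upsert into the pivoted table could be expressed as a single MERGE, reusing the hypothetical names above and assuming each 35-minute batch lands in a hypothetical staging table called ADJUSTMENTS:
MERGE INTO fact_finance_pivot t
USING (
    -- Pivot the incoming batch the same way as the initial load.
    SELECT
        finance_key,
        date_key,
        department_group_key,
        scenario_key,
        MAX(CASE WHEN account_key LIKE 'SALES%' THEN value END) AS sales,
        MAX(CASE WHEN account_key LIKE 'COGS%'  THEN value END) AS cogs,
        MAX(CASE WHEN account_key LIKE 'MKT%'   THEN value END) AS mkt
    FROM adjustments
    GROUP BY finance_key, date_key, department_group_key, scenario_key
) s
ON  t.finance_key = s.finance_key
AND t.date_key = s.date_key
AND t.department_group_key = s.department_group_key
AND t.scenario_key = s.scenario_key
WHEN MATCHED THEN UPDATE SET
    sales = COALESCE(s.sales, t.sales),
    cogs  = COALESCE(s.cogs,  t.cogs),
    mkt   = COALESCE(s.mkt,   t.mkt)
WHEN NOT MATCHED THEN INSERT
    (finance_key, date_key, department_group_key, scenario_key, sales, cogs, mkt, iscurrent)
VALUES
    (s.finance_key, s.date_key, s.department_group_key, s.scenario_key, s.sales, s.cogs, s.mkt, 'Y');
The COALESCE calls keep the existing values for accounts that are not present in the batch, so a SALES-only adjustment overwrites only the SALES column.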
However, my question is: how do I go about optimizing the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake?
This is more about optimizing the read and write.
Any pointers?
Thanks

So a longer form answer:
I am going to assume that the Date_Key values being far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply that span the whole date range of your data, then there is very little you can do to optimize it. Snowflake's Query Acceleration Service might help with the IO.
But the primary way to improve total processing time is to process less.
For example, at my old job we had IoT data where messages could arrive up to two weeks late, and we de-duplicated all messages on load (which is effectively a similar process) as part of our pipeline error handling. We found that handling batches with a minimum date of two weeks back against the full message tables (which also had billions of rows) spent most of the time reading and writing those tables. By altering the ingestion to sideline messages older than a couple of hours and deferring their processing until "midnight", we could get all the timely points in a batch processed in under 5 minutes (we did a large amount of other processing in that interval), turn the warehouse cluster size down outside core data hours, and use the saved credits to run a bigger instance for the midnight "catch-up" that brings the whole day's worth of sidelined data on board. This eventually-consistent approach worked very well for our application.
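As an illustration of that sidelining pattern, here is a minimal sketch in SQL, where STG_BATCH (the incoming batch, with an EVENT_TS timestamp column), TIMELY_ROWS and SIDELINED_ROWS are all hypothetical names:
-- Rows within the recent window go straight through the normal pipeline every batch.
INSERT INTO timely_rows
SELECT *
FROM stg_batch
WHERE event_ts >= DATEADD(hour, -2, CURRENT_TIMESTAMP());

-- Late arrivals are parked and processed in one larger "midnight" catch-up run.
INSERT INTO sidelined_rows
SELECT *
FROM stg_batch
WHERE event_ts < DATEADD(hour, -2, CURRENT_TIMESTAMP());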
So if your destination table is well clustered, so that the reads/writes touch only a fraction of the data, this should be less of a problem for you. But given that you are asking the question, I assume that is not the case.
So if the table's natural clustering is unaligned with the data being loaded, is that because the destination table needs to be a different shape for the critical reads it serves? At that point you have to decide which is more cost/time sensitive. Another option is to have the destination table clustered by the date (again assuming the fresh data covers a small window of time) and have a materialized view/table on top of that table with a different sort order, so that the rewrite is done for you by Snowflake. This is not a free lunch, but it might allow both faster upsert times and faster read performance, assuming both are time sensitive.
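A minimal sketch of that second option, reusing the hypothetical FACT_FINANCE_PIVOT name and assuming an edition that supports materialized views:
-- Cluster the destination so each 35-minute upsert touches only recent micro-partitions.
ALTER TABLE fact_finance_pivot CLUSTER BY (date_key);

-- Let Snowflake maintain a differently ordered copy for the reporting access pattern.
CREATE MATERIALIZED VIEW fact_finance_pivot_by_finance
    CLUSTER BY (finance_key)
AS
SELECT finance_key, date_key, department_group_key, scenario_key,
       sales, cogs, mkt, iscurrent
FROM fact_finance_pivot;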

Related

Create a table in Data Studio that shows current/latest value and a previous value and compares them

I have two sheets in a Google Sheet that contain historical data for some internal programs:
programs: high-level data about each program; each program has data for all of the different report dates
metrics: specific metrics about each program for each report date
Using my example data, there are 4 programs: a, b, c, and d and I have reports for 1/1/2020, 1/15/2020, 2/3/2020, and 6/20/2020.
I want to create a Data Studio report that will:
combine the data on report date and program
show a filter where the user can select which previous report date they want to compare against:
    the filter should show all of the previous report dates and default-select the most recent one
    for example, the filter would show: 1/1/2020, 1/15/2020, 2/3/2020 (default selected)
    the filter should only allow selecting one
show a table with values for the current report date and values for the report date selected in the above filter
Here is an example table using the source data in my above linked Google Sheet when the report is initially loaded and the filter has 2/3/2020 default selected:
report date  program  id  l1 manager  status  current value  previous report date value  direction
6/20/2020    a        1   Carlos      bad     202            244                          up
6/20/2020    b        2   Jack        bad     202            328                          up
6/20/2020    c        3   Max         bad     363            249                          down
6/20/2020    d        4   Henry       good    267            284                          up
If the user selects 1/1/2020 in the filter, then the table would show:
report date  program  id  l1 manager  status  current value  previous report date value  direction
6/20/2020    a        1   Carlos      bad     202            220                          up
6/20/2020    b        2   Jack        bad     202            348                          up
6/20/2020    c        3   Max         bad     363            266                          down
6/20/2020    d        4   Henry       good    267            225                          down

How can I aggregate on 2 dimensions in Google Data Studio?

I have data in 2 dimensions (let's say, time and region) counting the number of visitors on a website for a given day and region, as per the following:
time        region   visitors
2021-01-01  Europe   653
2021-01-01  America  849
2021-01-01  Asia     736
2021-01-02  Europe   645
2021-01-02  America  592
2021-01-02  Asia     376
...         ...      ...
2021-02-01  Asia     645
...         ...      ...
I would like to create a table showing the average daily worldwide visitors for each month, that is:
time     visitors
2021-01  25238
2021-02  16413
This means I need to aggregate the data this way:
first, sum over regions for distinct dates
then, calculate the average over dates
I was thinking of doing a global average of all lines of data for each month and then multiplying the value by the number of days in the month, but since that number is variable I can't do it.
Is there any way to do this?
Create 2 calculated fields:
Month(time)
SUM(visitors)/COUNT_DISTINCT(time)
In case it might help someone... so far (January 2021) it seems there is no way to do that in Data Studio: calculated fields and data blending do not have a GROUP BY-like function.
So I found 2 alternative solutions:
create an additional table in my data with the first aggregation (sum over regions). This gives a table with the number of visitors for each date. Then I import it into Data Studio and do the second aggregation in the table.
since my data is stored in BigQuery, a custom SQL query can be used to create another data source from the same dataset. This way, a GROUP BY statement can be used to sum over regions before the average is calculated (see the sketch below).
These solutions have a big drawback: I cannot add controls to filter by region (since data from all regions is aggregated before entering Data Studio).
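A minimal sketch of such a custom query in BigQuery standard SQL, where mydataset.visits is a hypothetical table holding the time, region and visitors columns shown above:
SELECT
  FORMAT_DATE('%Y-%m', time) AS month,  -- assumes time is a DATE; use FORMAT_TIMESTAMP for timestamps
  AVG(daily_visitors) AS avg_daily_worldwide_visitors
FROM (
  -- First aggregation: sum over regions for each distinct date.
  SELECT time, SUM(visitors) AS daily_visitors
  FROM `mydataset.visits`
  GROUP BY time
)
GROUP BY month
ORDER BY month;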

Is normalization always necessary and more efficient?

I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
ID   DATE        CAR       LOCATION   FUEL  FUEL_USAGE  MILES   BATTERY
123  01.01.2021  Toyota    New York   40.3  3.6         79321   78
520  01.01.2021  BMW       Frankfurt  34.2  4.3         123232  30
934  01.01.2021  Mercedes  London     12.7  4.7         4321    89
123  05.01.2021  Toyota    New York   34.5  3.3         79515   77
520  05.01.2021  BMW       Frankfurt  20.1  4.6         123489  29
934  05.01.2021  Mercedes  London     43.7  5.0         4400    89
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change; all the other data can have different values daily. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID   CAR       LOCATION
123  Toyota    New York
520  BMW       Frankfurt
934  Mercedes  London
TABLE: CAR_MEASUREMENT
GUID  ID   DATE        FUEL  FUEL_USAGE  MILES   BATTERY
1     123  01.01.2021  40.3  3.6         79321   78
2     520  01.01.2021  34.2  4.3         123232  30
3     934  01.01.2021  12.7  4.7         4321    89
4     123  05.01.2021  34.5  3.3         79515   77
5     520  05.01.2021  20.1  4.6         123489  29
6     934  05.01.2021  43.7  5.0         4400    89
I have two questions:
Does it make sense to create an extra table for DATE?
It is possible that new cars will appear in the collected data. For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT, and insert it if it doesn't exist. But that means I would have to check CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient if I just inserted the whole data as one row into CAR_ALL? Then I wouldn't have to check CAR_CONSTANT every time.
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns.
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
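That per-row overhead can be reduced by doing the check once per batch instead of once per row. A minimal sketch, assuming the daily data first lands in a hypothetical staging table CAR_STAGING with the same columns as CAR_ALL:
-- Register any cars not yet known (new IDs only).
INSERT INTO car_constant (id, car, location)
SELECT DISTINCT s.id, s.car, s.location
FROM car_staging s
WHERE NOT EXISTS (SELECT 1 FROM car_constant c WHERE c.id = s.id);

-- Then load the day's measurements (GUID assumed to be auto-generated).
INSERT INTO car_measurement (id, date, fuel, fuel_usage, miles, battery)
SELECT s.id, s.date, s.fuel, s.fuel_usage, s.miles, s.battery
FROM car_staging s;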
To answer your questions directly:
I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of to this is if you will eventually collect measurements that do not contain any car data. In that case, then it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables with MEASUREMENT containing the date, and CAR_DATA containing just the car-specific data.
See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
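And a minimal sketch of the per-car query mentioned above, joining on the integer ID (the ID value is just an example):
SELECT c.car, c.location, m.date, m.fuel, m.fuel_usage, m.miles, m.battery
FROM car_constant c
LEFT JOIN car_measurement m ON m.id = c.id
WHERE c.id = 123;   -- all measurements for the Toyota in New York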

What is the most efficient way to store 2-D timeseries in a database (sqlite3)

I am performing large-scale wind simulations to produce hourly wind patterns over a city. The result is a time series of 2-dimensional contours. Currently I am storing the results in SQLite3 database tables with the following structure:
Table: CFD
id, timestamp, velocity, cell_id
1 , 2010-01-01 08:00:00, 3.345, 1
2 , 2010-01-01 08:00:00, 2.355, 2
3 , 2010-01-01 08:00:00, 2.111, 3
4 , 2010-01-01 08:00:00, 6.432, 4
.., ..................., ....., .
1000 , 2010-01-01 09:00:00, 3.345, 1
1001 , 2010-01-01 10:00:00, 2.355, 2
1002 , 2010-01-01 11:00:00, 2.111, 3
1003 , 2010-01-01 12:00:00, 6.432, 4
.., ..................., ....., .
Actual create statement:
CREATE TABLE cfd(id INTEGER PRIMARY KEY, time DATETIME, u, cell_id integer)
CREATE INDEX idx_cell_id_cfd on cfd(cell_id)
CREATE INDEX idx_time_cfd on cfd(time)
(There are three of these tables, each for a different result variable)
where cell_id is a reference to the cell in the domain representing a location in the city. See this picture to have an idea of what it looks like at a specific timestep.
The typical query performs some kind of aggregation on the time dimension and group by on cell_id. For example, if I want to know the average local wind speed in each cell during a specific time interval, I would execute
select sum(time in ('2010-01-01 08:00:00','2010-01-01 13:00:00','2010-01-01 14:00:00', ...................., ,'2010-12-30 18:00:00','2010-12-30 19:00:00','2010-12-30 20:00:00','2010-12-30 21:00:00') and u > 5.0) from cfd group by cell_id
The number of timestamps can vary from 100 to 8,000.
This is fine for small databases, but it gets much slower for larger ones. For example, my last database was 60GB, 3 tables and each table had 222,000,000 rows.
Is there a better way to store the data? For example:
would it make sense to create a different table for each day?
would be better to use a separate table for the timesteps and then use a join?
is there a better way of indexing?
I have already adopted all the recommendations in this question to maximise the performance.
This particular query is hard to optimize because the sum() must be computed over all table rows. It is a better idea to filter rows with WHERE:
SELECT count(*)
FROM cfd
WHERE time IN (...)
AND u > 5
GROUP BY cell_id;
If possible, use a simpler expression to filter times, such as time BETWEEN a AND b.
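For example, if the timestamps of interest happen to form a contiguous range, the long IN list can be replaced by a range predicate (the dates below are illustrative):
SELECT cell_id, count(*)
FROM cfd
WHERE time BETWEEN '2010-01-01 08:00:00' AND '2010-12-30 21:00:00'
  AND u > 5.0
GROUP BY cell_id;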
It might be worthwhile to use a covering index, or in this case, when all queries filter on the time, a clustered index (without additional indexes):
CREATE TABLE cfd (
cell_id INTEGER,
time DATETIME,
u,
PRIMARY KEY (cell_id, time)
) WITHOUT ROWID;
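With this layout the rows of each cell are stored contiguously and ordered by time, so a per-cell aggregation such as the one below (illustrative date range) can be answered from the primary-key order without a separate sort step:
SELECT cell_id, avg(u) AS avg_wind_speed
FROM cfd
WHERE time BETWEEN '2010-01-01 08:00:00' AND '2010-01-31 21:00:00'
GROUP BY cell_id;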

Database schema design for financial forecasting

I need to develop a web app that allows companies to forecast financials.
The app has different screens: one for defining employee salaries, another for sales projections, etc.
Basically, it turns an Excel financial forecast model into an app.
The question is: what would be the best way to design the database so that financial reports (e.g. a profit and loss statement or a balance sheet) can be quickly generated?
Assuming the forecast period is 5 years, would you have a table with 5 years * 12 months = 60 fields per row? Is that performant enough?
Would you use DB triggers to recalculate salary expenses whenever a single employee's data is changed?
I'd think it would be better to store each month's forecast in its own row in a table that looks like this
month forecast
----- --------
1 30000
2 31000
3 28000
... ...
60 52000
Then you can use aggregate functions to calculate forecast reports, discounted cash flow, etc. (e.g. if you want the un-discounted total for just 4 years):
SELECT SUM(forecast) FROM FORECASTS WHERE month >= 1 AND month <= 48
For salary expenses, I would think that having a view that does the calculations on the fly (or, if your DB engine supports them, a "materialized view") should have sufficient performance unless we're talking about a giant number of employees or a really slow DB.
Maybe have a salary history table that a trigger populates when employee data changes / payroll runs:
employeeId month Salary
---------- ----- ------
1 1 4000
2 1 3000
3 1 5000
1 2 4100
2 2 3100
3 2 4800
... ... ...
Then again, you can use SUM or another aggregate function to get to the reported data.
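For example, a minimal sketch of the on-the-fly view idea, assuming the salary history table above is called salary_history with columns (employeeId, month, salary):
-- One row of total salary expense per forecast month.
CREATE VIEW salary_expense_by_month AS
SELECT month, SUM(salary) AS total_salary
FROM salary_history
GROUP BY month;

-- Un-discounted salary expense for the first four years of the forecast.
SELECT SUM(total_salary)
FROM salary_expense_by_month
WHERE month >= 1 AND month <= 48;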
