Snowflake Partitioning Vs Manual Clustering - snowflake-cloud-data-platform

I have 2 large tables in Snowflake (~1 and ~15 TB resp.) that store click events. They live in two different schemas but have the same columns and structure; just different sources.
The data is dumped/appended into these tables on a monthly basis, and both tables have a time_id integer field which represents the number of days after 1999-12-31 on which the click event took place.
Question is -- Should I leave it up to Snowflake to optimize the partitioning --OR-- Is this a good candidate for manually assigning a clustering key? And say, I do decide to add a clustering key to it, would re-clustering after next insert be just for the incremental data? --OR-- Would it be just as expensive as the initial clustering?
In case it helps, here is some clustering info on the larger of the 2 tables
select system$clustering_information( 'table_name', '(time_id)')
{
"cluster_by_keys" : "LINEAR(time_id)",
"total_partition_count" : 1151026,
"total_constant_partition_count" : 130556,
"average_overlaps" : 4850.673,
"average_depth" : 3003.3745,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 127148,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"01024" : 984,
"02048" : 234531,
"04096" : 422451,
"08192" : 365912
}
}
A typical query I would run against these tables
select col1, col_2, col3, col4, time_id
from big_table
where time_id between 6000 and 7600;

Should I leave it up to Snowflake to optimize the partitioning? Is
this a good candidate for manually assigning a clustering key?
Yes, it seems it's a good candidate to assign a clustering key (size + update intervals + query filters)
And say, I do decide to add a clustering key to it, would
re-clustering after next insert be just for the incremental data?
After the initial reclustering, as long as you do not insert data belonging to earlier days, the existing partitions will remain in a "constant" state, so reclustering will only process the new data/micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-auto-reclustering.html#optimal-efficiency
Would it be just as expensive as the initial clustering?
Under normal conditions, it should not be.
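For reference, a minimal sketch of what that could look like, using the big_table name from the example query (substitute your real table name):
-- Define the clustering key; Snowflake's automatic clustering service then
-- maintains it in the background.
alter table big_table cluster by (time_id);
-- Re-check clustering quality on time_id afterwards.
select system$clustering_information('big_table', '(time_id)');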

Mostly a long-winded comment on the question asked under Gokhan's answer:
This is helpful! Just so I have a sense of cost and time, how long do you think it'll take to run the clustering?
I would suggest you do a one-off rebuild of the table with an ORDER BY on time, rather than leaving auto-clustering to incrementally sort a table this large (see the sketch at the end of this answer).
I say this because we had a collection of tables with about 3B rows each (there were roughly 30 of these tables), and we would do a GDPR-related PII clean-up every month that deleted one month's data via an UPDATE command. Since the UPDATE has no ORDER BY, the ordering was destroyed for about a third of each table, which auto-clustering would then "fix" over the following day.
Our auto-clustering bill was normally ~100 credits a day, but on those days we were using ~300 credits, which implies ~6 credits per table, whereas a full table re-create with an ORDER BY on a medium warehouse would take maybe 15 minutes, so ~1 credit.
This is not to deride auto-clustering, but when a table gets randomly scrambled, its "a little at a time" approach is too passive/costly, imho.
On the other hand, if you cannot block the insert process for N minutes while you recreate the table, auto-clustering might be your only option. The flip side is that if you are always writing to the table, auto-clustering will back off a lot due to failed writes, but that is more a "general case detail to watch out for", given that you state you do monthly loads.
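As a sketch of that one-off rebuild, assuming you can pause loads briefly (big_table and time_id are the names from the question):
-- Build a fully sorted copy, keeping the clustering key definition.
create or replace table big_table_sorted cluster by (time_id) as
select * from big_table order by time_id;
-- Swap the sorted copy in place of the original, then drop the leftover copy
-- once you are happy with the result.
alter table big_table_sorted swap with big_table;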

Related

Time complexity of Cursor Pagination

I have read different articles saying that a cursor pagination query has time complexity O(1) or O(limit), where limit is the item limit in the SQL. Some example article sources:
https://uxdesign.cc/why-facebook-says-cursor-pagination-is-the-greatest-d6b98d86b6c0 and
https://dev.to/jackmarchant/offset-and-cursor-pagination-explained-b89
But I cannot find references explaining why the time complexity is O(limit). Say I have a table consisting of 3 columns,
id, name, created_at, where id is primary key,
if I use created_at as the cursor (which is unique and sequential), can someone explain why the time complexity is O(limit)?
Is it related to data structure used to store created_at?
After some reading, I guess the time complexity being talked about is the cost of producing the final required records once the starting point has been located.
In the offset case, the matching records are read in order, the database discards the first x records (where x is the offset), and finally returns y records (where y = limit), so the time complexity is O(offset + limit).
In the cursor case, only the records matching the cursor WHERE condition are considered; assuming an index on the cursor column, the database seeks straight to the starting point and then returns y records (where y = limit), so the time complexity is O(limit).
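A minimal sketch of the two styles, assuming a table named items with an index on created_at (names and the LIMIT/OFFSET syntax are illustrative and vary by database):
-- Offset pagination: the database still reads and discards the first 10000 rows.
select id, name, created_at
from items
order by created_at
limit 20 offset 10000;
-- Cursor pagination: an index seek jumps straight past the cursor value,
-- so only the next 20 rows are read.
select id, name, created_at
from items
where created_at > '2023-01-01 00:00:00'
order by created_at
limit 20;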

How to calculate the tax free amount of sales, based on date fields?

I need your help with a task I have undertaken, as I am facing difficulties.
So, I have to calculate the NET amount of sales for some products, which were sold in different cities in different years, and for this reason different tax rates apply.
Specifically, I have a dimension table (Dim_Cities) which consists of the cities in which the products can be sold,
i.e
Dim_Cities:
CityID, CityName, Area, District.
Dim_Cities:
1, "Athens", "Attiki", "Central Greece".
Also, I have a file/table which consists of the following information:
i.e
[SalesArea]
,[EffectiveFrom_2019]
,[EffectiveTo_2019]
,[VAT_2019]
,[EffectiveFrom_2018]
,[EffectiveTo_2018]
,[VAT_2018]
,[EffectiveFrom_2017]
,[EffectiveTo_2017]
,[VAT_2017]
,[EffectiveFrom_2016_Semester1]
,[EffectiveTo_2016_Semester1]
,[VAT_2016_Semester1]
,[EffectiveFrom_2016_Semester2]
,[EffectiveTo_2016_Semester2]
,[VAT_2016_Semester2]
i.e
"Athens", "2019-01-01", "2019-12-31", 0.24,
"2018-01-01", "2018-12-31", 0.24,
"2017-01-01", "2017-12-31", 0.17,
"2016-01-01", "2016-05-31", 0.16,
"2016-01-06", "2016-12-31", 0.24
And of course there is a fact table that holds all the information,
i.e
FactSales_ID, CityID, SaleAmount (with VAT), SaleDate_ID.
The question is how to compute, for every city, the "tax-free sales amount" that corresponds to each particular sale date. In other words, I think I have to create a function that computes the NET amount each time, subtracting in each case the corresponding tax rate based on the date and city that it finds. Can anyone help me or guide me to achieve this, please?
I'm not sure if you are asking how to query your data to produce this result or how to design your data warehouse to make this data available - but I'm hoping you are asking about how to design your data warehouse, as this information should definitely be pre-calculated and held in your DW rather than being calculated every time anyone wants to report on the data.
One of the key points of building a DW is that all the complex business logic should be handled in the ETL (as much as possible) so that the actual reporting is simple; the only calculations in a reporting process should be those that can't be pre-calculated.
If your City dim is SCD2 (or could be made to be SCD2) then I would add the VAT rate as an attribute on that dim - otherwise you could hold the VAT rate in a "worker" table.
When your ETL loads your fact table, you would use the VAT rate on the City dim (or in the worker table) to calculate the net and gross amounts and hold both as measures in your fact table.
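As a rough sketch of that ETL step, assuming the VAT file has been unpivoted into one row per sales area and effective date range, and that SaleDate_ID can be resolved to a calendar date (the VAT_Rates table and its columns are illustrative names, not from the post):
-- Derive the net (tax-free) amount from the gross amount by looking up the
-- VAT rate that was effective for the sale's city and date.
select f.FactSales_ID,
       f.CityID,
       f.SaleAmount                    as GrossAmount,
       f.SaleAmount / (1 + v.VAT_Rate) as NetAmount
from FactSales f
join Dim_Cities c on c.CityID = f.CityID
join VAT_Rates  v on v.SalesArea = c.CityName
                 and f.SaleDate between v.EffectiveFrom and v.EffectiveTo;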

Deciding the clustering key in Snowflake

As per the documentation, we can select a column for clustering based on its cardinality (distinct values of the column) and on whether the column is used in join conditions. Here is the output of the clustering information for one of the tables in a select query; scanning this table alone takes more than 80% of the total execution time. FYI, I collected the output below for the column used in the join condition.
Relating the output to my understanding, the points below make me feel that clustering the table on this column will help increase performance:
ratio of total_partition_count 20955 to average_overlaps : 17151.4681
ratio of total_partition_count 20955 to average_depth : 16142.2524
1. Correct me if my understanding is wrong (based on the facts below, is this table a good candidate for clustering or not)?
Please also help with the points below as well:
2. If I opt for clustering the table, will it need any downtime, or does clustering add to my bill?
3. Does this clustering impact future DML operations?
4. I see the select query returns 23 rows after scanning 37 GB of data; what would be the best way to improve the performance of the query other than clustering?
Let me know if any further details are required.
select SYSTEM$CLUSTERING_INFORMATION('tablename','(columnname)');
{
"cluster_by_keys" : "LINEAR(column used in join condition)",
"total_partition_count" : 20955,
"total_constant_partition_count" : 2702,
"average_overlaps" : 17151.4681,
"average_depth" : 16142.2524,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 1933,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0,
"08192" : 2,
"16384" : 3,
"32768" : 19017
}
}
A table is a good candidate for clustering if the operation you care about "most" uses only a small number of rows compared to the total partitions read, many of which are dropped by the filter you would cluster by. Aka, you are wasting $ reading data you don't want.
But you can only have one clustering key on a table, so if you make one query better you might make others worse.
Also, auto-clustering does its work in the background; think of defragmenting your hard drive. It's the same, and thus it is "work", so yes, you pay for it.
Future DML is not directly impacted by clustering, if what you mean is "will a future insert be N times slower because clustering is on". But given you are altering data, there are two ways DML might impact your clustering: if the data you insert is randomly sorted (with respect to your clustering key), that data will need to be sorted; and if you do high-frequency inserts, this can interfere with the background clustering operations. Also, the very fact that you insert new data into new partitions means the larger set needs reclustering.
You can rewrite the table with an ORDER BY and do the clustering yourself.
Analysis of SYSTEM$CLUSTERING_INFORMATION data.
"average_overlaps" : 17151.4681 >
Average number of overlapping micro-partitions for each micro-partition in the table. A high number indicates the table is not well-clustered.
"average_depth" : 16142.2524 >
Average overlap depth of each micro-partition in the table. A high number indicates the table is not well-clustered.
The buckets 00000 through 32768 describe how many micro-partitions (similar in concept to files) fall into each overlap-depth range for your cluster key.
"00000" : 0 >
Zero (0) micro-partitions out of the 20955 total have an overlap depth of 0.
"32768" : 19017 >
19017 micro-partitions have an overlap depth of up to 32768, i.e. the cluster key values in each of those micro-partitions also appear in thousands of other micro-partitions. That is bad, because finding one of those cluster key values may require scanning almost all of these micro-partitions.
https://docs.snowflake.com/en/sql-reference/functions/system_clustering_information.html
As mentioned earlier, a table is a good candidate for clustering if the clustering keys are used in your queries, e.g. in selective filters or join predicates.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html#strategies-for-selecting-clustering-keys
2. Automatic clustering works in the background and no downtime is required. As far as billing is concerned, there will be some additional charges.
3. Clustering does not impact future DML operations as such, but if there is constant DML activity on the table, then as soon as the number of new micro-partitions reaches a certain threshold, Automatic Clustering will kick in to keep the table well clustered.
As mentioned by Simeon, you can opt for rewriting the table with an ORDER BY and do the clustering yourself.
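On the billing point, one way to see what automatic clustering actually costs for a table is the information-schema table function shown below (a sketch; replace 'tablename' with your table and adjust the date range as needed):
-- Credits consumed by automatic clustering on this table over the last 7 days.
select *
from table(information_schema.automatic_clustering_history(
    date_range_start => dateadd('day', -7, current_timestamp()),
    table_name       => 'tablename'));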

Best database structure for building levels and upgrades

I'm building a game where the player has to build different types of buildings and can upgrade them. Some buildings may be upgradable to level 30, whereas some others to level 5 only.
I wonder what is the best database layout for that. I am using sqlite3 if that makes any difference, but the questions applies to other engines as well.
I have thought of two options for my buildings table:
Option one: Make a building_group column to identify which buildings are similar:
id (Integer, Auto increment), building_name, building_group, level, points, cost
1, path, 1, 1, 100, 1000
2, road, 1, 2, 200, 2000
3, highway, 1, 3, 300, 3000
4, village, 2, 1, 1000, 10000
5, town, 2, 2, 2000, 20000
6, city, 2, 3, 3000, 30000
Option two: Have one entry per building and have all level information in the same row. This doesn't seem the best approach to me but I thought I would mention it anyways.
id (Integer, Auto increment), building_name_1, points_1, cost_1, building_name_2, points_2, cost_2, building_name_3, points_3, cost_3,...
1, path, 100, 1000,road, 200, 2000, highway, 300, 3000
2, village, 1000, 10000, town, 2000, 20000, city, 3000, 30000
I'm sure there are better ways to handle that and I would like to hear your suggestions.
As you noted, the second approach makes very little sense. In order to effectively query or update it, you'll have to construct your column names dynamically, which is a great source for hard to track bugs (not to mention potential vulnerability to SQL injections if you don't do it properly). Moreover, sooner or later, you're going to think of some super-duper special upgrade which has a level higher than the columns you've planned originally, and you'll have to change the definition of the table just to accommodate it, which makes no sense at all.
The first design seems like the textbook way of doing things, and will easily allow you to create complex queries on the buildings and on what a player has or hasn't built. If you ever need to attach some common data to an entire building group, you could create another building_groups table with that information, and make the building_group column in buildings a foreign key to its primary key.
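A minimal sketch of that layout in SQLite (column names follow the question; the building_groups table is the optional extra mentioned above):
-- Groups of similar buildings (roads, settlements, ...).
create table building_groups (
    id   integer primary key autoincrement,
    name text not null
);
-- One row per building level; each row belongs to a group.
create table buildings (
    id             integer primary key autoincrement,
    building_name  text    not null,
    building_group integer not null references building_groups(id),
    level          integer not null,
    points         integer not null,
    cost           integer not null,
    unique (building_group, level)   -- one definition per level within a group
);
-- Example: list the upgrade path for group 1, lowest level first.
select building_name, level, points, cost
from buildings
where building_group = 1
order by level;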

SSAS- MDX Assign fact row to dimension member base on calculation

I am looking to calculate something in the calc script so that I can allocate a row from a fact table to a dimension member.
The business scenario is the following: I have a fact table that records customer credits and debits (a customer can take out a lot of little loans) and a Customer dimension. I want to classify my customers based on their history of credits and debits over a given period. The classification of a customer changes over time.
Example
The rule is: if a customer's balance (for a given period) is over 50,000, the classification is "Large"; if he has more than one record and has made a payment in the last 3 months, he is "P&P"; if he doesn't owe any money and has made a payment in the last 3 months, it's "Regular".
My question is more about direction than specific code: which way is the best to implement this kind of rule?
Best Regards
Vincent Diallo-Nort
I'd create a fact table with a balance status that is auto-updated every day:
check the rolling balance yesterday plus today's records.
when the balance = 0, then remove a record.
Plus add a flow fact table with payments only.
Add measures:
LastChild aggregation for the first fact table.
Sum aggregation for the second fact table.
When it's done, you may apply an MDX calculation:
case
    when [Measures].[Balance] > 50000
        then "Large"
    when [Measures].[Payments]
         + ([Date].[Calendar].CurrentMember.Lag(1), [Measures].[Payments])
         + ([Date].[Calendar].CurrentMember.Lag(2), [Measures].[Payments]) > 0
        then "P&P"
    else "Regular"
end
In order to give you a detailed answer, you would have to provide more information about your data structure.
