Need information on the Snowflake optimizer

What kind of optimizer does Snowflake use: rule-based or cost-based? I could not find any documentation; I need an explanation of how it works so I can write better queries.

I find "knowing the 'rules'" less helpful, than understanding what the system is doing as more helpful.
I have found describing it to new team members has massive table scans, that do map/reduce/merge joins.
You can make the tables scans faster by selecting the smallest set of columns needed to get the answer you need.
There is partition pruning so if you have data in a 'inserted/sorted' order of x 1-2,3-4,5-6 and your query has x = 5, then the first two partitions will not be read.
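A minimal sketch of that (the table and column names are invented): because each micro-partition records min/max values per column, the filter below lets Snowflake skip every partition whose x range does not cover 5.
-- events is a hypothetical table whose partitions hold x in the ranges 1-2, 3-4 and 5-6.
SELECT COUNT(*)
FROM events
WHERE x = 5;   -- only the partition covering 5-6 is scanned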
Next, because it's all merge joins, equi-joins are the fastest thing to do. [Edit:] What this is trying to say is that at the scale of millions of rows and up, joining 1M rows to 1M rows on complex join logic like a.v1 > b.v2 or a.v2 < b.v3 etc. means you more or less have to materialise the trillion-plus row cross product and test every pair. Whereas if you can join on exact values, a.v1 = b.v1 and a.v2 = b.v2, the data can be sorted with respect to those keys and a merge join can be done, and your performance is very good (see sort-merge join on Wikipedia).
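As a hedged illustration of the difference (tables a and b are hypothetical, roughly 1M rows each):
-- Inequality/range join: effectively a filtered cross product, painful at this scale.
SELECT COUNT(*)
FROM a
JOIN b
  ON a.v1 > b.v2 OR a.v2 < b.v3;

-- Equi-join: both inputs can be sorted/hashed on the keys and merged, which is fast.
SELECT COUNT(*)
FROM a
JOIN b
  ON a.v1 = b.v1
 AND a.v2 = b.v2;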
This means that sometimes reading from the same set of source tables many times in different CTEs and joining those can be the fastest way to process large volumes of data.
[Edit:] In the context of the above statement: in small-database SQL, people often write correlated sub-queries, because a) you can, so why not, and b) they can be fast on indexed databases. But in Snowflake there are no indexes, and besides, the optimizer doesn't support most correlated sub-queries, so you should generally avoid them: read the data twice in two CTEs and join/left-join those via an equi-join to answer the question. The two CTEs' tasks are independent, thus parallelisable, and the merge join is near-optimal. The waste of calculating (let's pretend) sub-totals for data that is not in the main join body is less than the gain from parallelism. (This holds best for queries in the 30-seconds-or-longer range, as opposed to speeding up sub-5-second queries.) But as with everything, have a base model, try/experiment, and poke at the slow stuff until you cannot restructure your data or query to make it any faster.
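A sketch of the rewrite (orders is an invented table): instead of a correlated sub-query computing a per-customer total, aggregate once in a CTE and equi-join it back.
-- Correlated sub-query version: often unsupported by Snowflake, and slow where it is.
SELECT o.customer_id,
       o.amount,
       (SELECT SUM(o2.amount)
          FROM orders o2
         WHERE o2.customer_id = o.customer_id) AS customer_total
FROM orders o;

-- CTE + equi-join version: the aggregate runs once, in parallel, then merge-joins back.
WITH totals AS (
    SELECT customer_id, SUM(amount) AS customer_total
    FROM orders
    GROUP BY customer_id
)
SELECT o.customer_id,
       o.amount,
       t.customer_total
FROM orders o
JOIN totals t
  ON o.customer_id = t.customer_id;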
As always, look at the Query Profile of the executed query, look for areas where many rows are dropped, and think about how you can restructure the logic to push those restrictions earlier in the pipeline.
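If you want to see the plan (and the partition pruning estimate) without running the query, Snowflake's EXPLAIN can also help; the query below is just a placeholder.
-- Shows the compiled plan, including how many partitions would be scanned vs. total.
EXPLAIN
SELECT customer_id, SUM(amount)
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY customer_id;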

A brief description can be found in the following document:
The Snowflake Elastic Data Warehouse by Snowflake Computing
3.3.1 Query Management and Optimization
(...)
Snowflake’s query optimizer follows a typical Cascades-style approach [28], with top-down cost-based optimization. All statistics used for optimization are automatically maintained on data load and updates. Since Snowflake does not use indices (cf. Section 3.3.3), the plan search space is smaller than in some other systems. The plan space is further reduced by postponing many decisions until execution time, for example the type of data distribution for joins. This design reduces the number of bad decisions made by the optimizer, increasing robustness at the cost of a small loss in peak performance. It also makes the system easier to use (performance becomes more predictable), which is in line with Snowflake’s overall focus on service experience.
Once the optimizer completes, the resulting execution plan is distributed to all the worker nodes that are part of the query. As the query executes, Cloud Services continuously tracks the state of the query to collect performance counters and detect node failures. All query information and statistics are stored for audits and performance analysis. (...)

Query Optimization:
Snowflake supports query vectorization and does some cost-based optimization, but the first run of a query typically takes seconds to minutes. Snowflake has added local disk "caching" and also a result cache to speed up subsequent queries for repetitive workloads like reporting and dashboards.
Optimized Storage:
Snowflake has a micro-partition file system that is more optimized than S3 and supports partitioning and sorting with cluster keys.
Split the data into multiple small files to support optimal data loading in Snowflake.
IMPROVING QUERY PERFORMANCE
Consider implementing clustering keys for large tables (see the sketch after this list).
Try to execute relatively homogeneous queries (size, complexity, data sets, etc.) on the same warehouse.
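As a hedged sketch (table and column names are invented), defining a clustering key on a commonly filtered column and checking how well the table is clustered might look like this:
-- Cluster a large table on the column most queries filter by, so pruning stays effective.
ALTER TABLE sales CLUSTER BY (sale_date);

-- Inspect clustering quality for that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sale_date)');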
IMPROVING LOAD PERFORMANCE
Use bulk loading to get the data into tables in Snowflake. Consider splitting large data files so the load can be efficiently distributed across servers in a cluster (see the COPY sketch after this list).
Delete files that are no longer needed from internal stages. You may notice a performance improvement in addition to saving on costs.
Isolate load and transform jobs from queries to prevent resource contention. Dedicate separate warehouses for loading and querying operations to optimize performance for each.
Leverage the scalable compute layer to do the bulk of the data processing.
Consider using Snowpipe in micro-batching scenarios. Your query may benefit from cached results from a previous execution.
Use separate warehouses for your queries and load tasks. This will facilitate targeted provisioning of warehouses and avoid any resource contention between dissimilar operations.
Use a separate warehouse for large files.
The number and capacity of the servers in the warehouse determine how many data files can be loaded in parallel.
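A minimal bulk-load sketch (the stage, table, and file layout are invented); splitting the source into many moderately sized compressed files lets the warehouse load them in parallel:
-- Load all split files for one dataset from an internal stage in parallel.
COPY INTO sales
FROM @my_stage/sales/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Clean up staged files that are no longer needed.
REMOVE @my_stage/sales/;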
Segment Data
Snowflake caches data in the virtual data warehouse, but it's still essential to segment data. Consider these best practices for data query performance:
Group users with common queries in the same virtual data warehouse to optimize data retrieval and use.
The Snowflake Query Profile supports query analysis to help identify and address performance concerns.
Snowflake draws from the same virtual data warehouse to support complex data science operations, business intelligence queries, and ELT data integration.
Scale-Up
Snowflake allows you to scale up the virtual warehouse to better handle large workloads. When using scale-up to improve performance, note the following:
Snowflake supports fast and easy adjustments to the warehouse size to handle the workload.
A warehouse can also be automatically suspended and resumed, with complete transparency for the user.
Snowflake's scale-up functionality supports continually changing processing requirements.
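A hedged sketch of what that looks like in SQL (the warehouse name is made up):
-- Resize an existing warehouse for a heavier workload; new queries pick up the new size.
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Suspend automatically after 5 minutes idle and resume on the next query.
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;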
Scale-Out
Snowflake supports the deployment of additional same-size clusters to support concurrency. Keep these points in mind for how scale-out can help performance optimization:
As users execute queries, the virtual warehouse automatically adds clusters up to a fixed limit.
This scales out in a more controlled way than deploying one or more clusters of larger machines, as legacy data platforms do.
Snowflake automatically adjusts based on user queries, adding and removing clusters during peak and off-peak hours as needed.
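A hedged multi-cluster sketch (the warehouse name is invented; multi-cluster warehouses require Enterprise edition or higher):
-- Snowflake adds clusters up to MAX_CLUSTER_COUNT under concurrency pressure
-- and removes them again when queues drain.
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';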

Related

What are the performance advantages of using database aggregate functions?

What are the performance advantages of using database aggregate functions as opposed to getting all the data and doing the computation on the server or storing the same data in multiple documents or rows?
What are the measurable pros and cons?
It depends on your use case, but in general performing heavy computation directly in the database is better (unless you have some very complicated operation where doing it in the database doesn't seem beneficial in terms of performance, development time, or ease of understanding).
From a performance point of view, here are some pointers:
A server is meant to handle multiple requests, listen to events, make API calls and a lot more, and it needs to stay responsive at the same time. Clearly, a server's memory is a crucial resource, and so is its processing time. If you bring a large chunk of data into the server, it will reside in memory, and if the computation takes time, it will hamper the response time (synchronous/asynchronous computation). This will also affect the server's garbage collection.
Assuming you choose to perform the computation on the server, a server is not optimized for such operations. For example, let's say you want to find the maximum of a million records: you receive a request, bring the data into memory, and perform the max operation. As the number of requests increases, you will keep bringing a million records into memory (see where this goes?). Once you have the data, chances are it needs to be scanned linearly to find the maximum (not bad, but not good either). When the same thing is performed in the database, it won't create multiple copies of the data. A database stores data in an optimized way (indexes) and may also keep statistical data which can make aggregation operations cheap.
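As a hedged illustration (the table name is invented): the aggregate keeps the scan in the database and sends one row back, instead of shipping a million rows to the application.
-- Let the database compute the aggregate; only one row crosses the wire.
SELECT MAX(price) AS max_price
FROM trades;

-- Anti-pattern: pull every row into the application and scan it there.
-- SELECT price FROM trades;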
A database can be scaled to a far larger magnitude (horizontally/vertically) than servers. Computation on a server would hit a bottleneck much sooner than it would on a database. In many cases, the database is a separate machine from the server, so you are utilizing resources much better by offloading computation to the database. You are also avoiding a single point of failure by doing so.
TLDR
A database is meant and designed to handle heavy computation whereas a server is meant and designed to handle requests.

Parallel inserts/delete on single snowflake table

I have a scenario where I have to run parallel inserts/deletes on a Snowflake table.
For example: the table contains data related to different countries, and each insert pipe or thread will contain data for only a specific country.
Similarly, when I am running parallel deletes, each delete thread will be deleting data for only a specific country.
I was looking to partition the data in the Snowflake table by country, which might have helped in avoiding any locks; however, it seems that option is not there in Snowflake.
Can you suggest how I can achieve parallel inserts/deletes and avoid any contention or locks?
Note: I am using Matillion to run different ELT jobs in parallel to do the inserts.
In Snowflake, there is no explicit partitioning option. It's a true SaaS product with almost zero administration.
Coming to your question about parallel inserts and deletes: in case there is a huge delay, you can either scale up or enable auto-scaling. The Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that delivers all the advantages of static partitioning without the known limitations, as well as providing additional significant benefits.
You can also go for table clustering (a sketch follows below).
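A hedged sketch of that (table and column names are invented): clustering by country should co-locate each country's rows in fewer micro-partitions, so country-scoped DML touches and rewrites fewer partitions.
-- Cluster the table by the column each parallel job filters on.
ALTER TABLE sales CLUSTER BY (country);

-- Each Matillion job/thread then works on a single country.
DELETE FROM sales WHERE country = 'DE';
INSERT INTO sales SELECT * FROM staging_sales WHERE country = 'DE';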
It is highly unlikely (almost impossible) that you will get locking on your data tables. If you are getting locking on the metadata tables, then follow the solution suggested in #sergiu's link above.
If you have performance issues (but not table/record locking) then possible solutions include:
Larger warehouse (unlikely to improve performance much but worth trying if all else fails)
Auto-scale your warehouse (as suggested in the previous answer) so that more warehouses are running in parallel when there is a high workload
Run multiple warehouses: 1 per country or 1 per group of countries. No real benefit over auto-scaling unless there is significantly more data for some countries compared to others and there is a benefit to sizing the warehouses to match the data size

How to optimize query analysis and storage costs in Bigquery export data streaming inserts that occur at varied intervals during the day?

I am exploring options to optimize query analysis and the cost of storing data in a BigQuery table: specifically, whether we can reuse/extract data from the last queried result instead of re-running the query over the larger data set, to save the cost of running the entire query again.
Limitations
Cannot use cached results, since the data comes in via streaming inserts and every write invalidates the cached results.
Even if a programmatic solution can be built, I need to validate whether data inconsistencies happen, and how to manage them whenever the data is out of sync.
Thanks in advance!
To analyze BigQuery SQL cost usage, you can list all BigQuery jobs (BigQuery API) and analyze bytes/slot usage and execution time. Besides caching, you can analyze queries to see if any are candidates for partitioning and clustering, which could significantly reduce cost and execution time. Reading other BigQuery SO posts, I am under the impression that materialized views are around the corner, which would be another great performance and cost optimization.
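A hedged BigQuery DDL sketch (dataset, table, and column names are invented): partitioning by date and clustering by a frequently filtered column means queries scan, and are billed for, far fewer bytes.
-- Rebuild the raw table as a date-partitioned, clustered table.
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_ts)
CLUSTER BY user_id
AS SELECT * FROM mydataset.events_raw;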
To optimize cost itself, you can compare the on-demand and slot reservation pricing models.
To optimize streaming insert cost, as long as you can accept a ~2-minute delay (as opposed to a seconds-level delay with streaming), you can consider event-driven serverless data ingestion like BqTail.
When it comes to caching, you may also explore eager caching options, which create a cache for the most commonly used SQL every time the underlying data changes, but in that case you have to control all data ingestion to recreate the cache. (*possible with a BqTail API post-load task)

SQL Server database splitting by purpose

Databases are usually the storage layer for most applications. Our company also does a lot of calculations and data manipulation with that data on a daily basis.
As we get more and more data, data generation has become an issue because it takes too long. I think it can make sense to separate the database into at least two:
one for storing data, with a focus on read/write performance;
one for calculations, with a focus on data aggregation performance.
Does anybody have similar experience and can tell whether this idea is good, and what the design differences would be for the two points mentioned?
Maybe it is worth looking at a NoSQL solution for the calculation side, e.g. in-memory databases?
it can make sense to separate the database into at least two
If the databases are on different disks (with different spindles), it may help; otherwise you get no gain, because disk I/O is shared between the databases.
For best practices, read Storage Top 10 Best Practices.
Maybe it is worth looking at a NoSQL solution for the calculation side, e.g. in-memory databases?
There is no need to go to a NoSQL solution; you can use in-memory tables.
In-Memory OLTP can significantly improve the performance of transaction processing, data load and transient data scenarios.
For more details, In-Memory OLTP (In-Memory Optimization)
Other Strategies
1) Tune tempdb
Tempdb is shared by all databases and is heavily used in calculations.
A pragmatic approach is to have a 1:1 mapping between tempdb data files and logical CPUs (cores), up to eight.
For more details: SQL Server TempDB Usage, Performance, and Tuning Tips
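A T-SQL sketch of that mapping (file names, paths, and sizes are placeholders; size all tempdb data files identically):
-- Add tempdb data files until their count matches the logical CPU count (up to eight).
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 8GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev3, FILENAME = 'T:\tempdb\tempdev3.ndf', SIZE = 8GB, FILEGROWTH = 512MB);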
2) Evaluate the Page Life Expectancy (PLE) counter and take action to improve it
To evaluate data cache, run the following query
SELECT [object_name],
       [counter_name],
       [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Manager%'
  AND [counter_name] = 'Page life expectancy'
The recommended value of the PLE counter (in seconds) is greater than:
(memory dedicated to SQL Server, in GB / 4) * 300
Page Life Expectancy is the number of seconds a page will stay in the buffer pool without being referenced. In simple words, if your pages stay longer in the buffer pool (the memory cache area), your PLE is higher, leading to higher performance, because every time a request comes in there is a chance it finds its data in the cache instead of going to the hard drive to read it.
If PLE isn't high enough, increase memory and tune indexes and statistics.
3) Use SSD disks
With the cost of solid state disks (SSDs) going down, use the SSDs as a second tier of cache.
4) Use RAID 5 for the databases, and RAID 10 for the transaction logs and tempdb.
In general, the SQL optimization game is about moving data from disk (slow) to cache (memory, fast).
Increase memory and improve disk I/O speed, and you gain performance.

Database and large Timeseries - Downsampling - OpenTSDB InfluxDB Google DataFlow

I have a project where we sample a "large" amount of data on a per-second basis. Some operations are performed, such as filtering and so on, and the data then needs to be accessible at second, minute, hour, or day intervals.
We currently do this with an SQL-based system and software that updates different tables (daily averages, hourly averages, etc.).
We are currently looking at whether another solution could fit our needs, and I came across several options, such as OpenTSDB, Google Cloud Dataflow, and InfluxDB.
All seem to address time-series needs, but it is difficult to get information about the internals. OpenTSDB does offer downsampling, but it is not clearly specified how it works.
The concern is that since we can query a vast amount of data, for instance a year, if the DB downsamples at query time rather than using pre-computed aggregates, it may take a very long time.
As well, downsampling needs to be "updated" whenever "delayed" data points are added.
On top of that, upon data arrival we perform some processing (outlier filtering, calibration), and those operations should not be written to disk. Several solutions could be used, like a RAM-based DB, but perhaps a more elegant solution exists that works together with the previous requirements.
I believe this application is not something "extravagant" and that there must be some tools to perform this; I'm thinking of stock tickers, monitoring, and so forth.
Perhaps you have some good suggestions as to which technologies/DBs I should look into.
Thanks.
You can accomplish such use cases pretty easily with Google Cloud Dataflow. Data preprocessing and optimizing queries is one of the major scenarios for Cloud Dataflow.
We don't provide a built-in "downsample" primitive, but you can write such a data transformation easily. If you are simply looking at dropping unnecessary data, you can just use a ParDo. For really simple cases, the Filter.byPredicate primitive can be even simpler.
Alternatively, if you are looking at merging many data points into one, a common pattern is to window your PCollection to subdivide it according to the timestamps. Then, you can use a Combine to merge elements per window.
Additional processing that you mention can easily be tacked along to the same data processing pipeline.
In terms of comparison, Cloud Dataflow is not really comparable to databases. Databases are primarily storage solutions with processing capabilities. Cloud Dataflow is primarily a data processing solution, which connects to other products for its storage needs. You should expect your Cloud Dataflow-based solution to be much more scalable and flexible, but that also comes with higher overall cost.
Dataflow is for inline processing as the data comes in. If you are only interested in summaries and calculations, Dataflow is your best bet.
If you want to later take that data and access it by time (time-series) for things such as graphs, then InfluxDB is a good solution, though it has a limitation on how much data it can contain.
If you're OK with a 2-25 second delay on large data sets, then you can just use BigQuery along with Dataflow. Dataflow will receive, summarize, and process your numbers; then you submit the result into BigQuery. HINT: divide your tables by DAYS to reduce costs and make re-calculations much easier.
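As a hedged BigQuery sketch (table and column names are invented): downsample per-second samples into one-minute averages over a date-partitioned table, where the date filter keeps the scanned bytes small.
-- Query-time downsampling from seconds to minutes for one month of data.
SELECT
  TIMESTAMP_TRUNC(sample_ts, MINUTE) AS minute_ts,
  AVG(metric_value) AS avg_value
FROM mydataset.samples
WHERE DATE(sample_ts) BETWEEN '2020-01-01' AND '2020-01-31'
GROUP BY minute_ts
ORDER BY minute_ts;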
We process 187 GB of data each night. That equals 478,439,634 individual data points (each with about 15 metrics and an average of 43,000 rows per device) for about 11,512 devices.
Secrets to BigQuery:
LIMIT your column selection. Don't ever do a select * if you can help it.
;)
