What is the purpose of resizing a warehouse from X-Small to Medium in Snowflake? Kindly clarify.
Is it to accommodate more queries, to accommodate more users, or to optimize a (complex) workload?
There is a detailed explanation at the following link.
https://docs.snowflake.com/en/user-guide/warehouses-tasks.html#resizing-a-warehouse
If you want to increase the performance of a single query or complex workload, you would change the size of the warehouse.
If you want to increase the number of concurrent users Snowflake can handle, you would increase the number of clusters.
E.g.
alter warehouse my_wh set warehouse_size = XLARGE;  -- improves performance of a single query/complex workload
alter warehouse my_wh set MAX_CLUSTER_COUNT = 16
                          MIN_CLUSTER_COUNT = 2;    -- improves concurrency: more users and the performance of their queries
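To confirm the change took effect, you can inspect the warehouse afterwards (same my_wh name as above):
show warehouses like 'my_wh';  -- shows size, min/max cluster count and current state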
While doing some query optimization work, one of our queries has a very high "synchronization" time in a table scan step (up to 98% of the total query execution time). The query joins 3 tables: one fact table with 100B+ rows and two small dimension tables. The virtual warehouse is size S.
The Snowflake docs are very limited in explaining what "synchronization" is; they only say "various synchronization activities between participating processes" (Analyzing the Query History — Snowflake Documentation).
Does anyone know what this synchronization component is and how to improve it?
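As a starting point for digging in, one hedged sketch (assuming the GET_QUERY_OPERATOR_STATS table function is available on your account and you have the query ID at hand) is to pull the operator-level time breakdown and see which operator reports the synchronization time:
select operator_id,
       operator_type,
       execution_time_breakdown  -- variant with the fraction of time spent on synchronization, processing, etc.
from table(get_query_operator_stats('<query_id>'));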
I am very new to Snowflake, and while working with it I ran into a conflict between the 2 options below.
Single warehouse with size X-Large (16 credits / hour)
Multi-cluster (with max clusters=2 & min clusters=2) with size Large (8 credits / hour)
Considering the above 2 options
Is there any advantage I can get by choosing 2nd option in terms of performance?
Note: I know the advantages of multi-cluster over a single warehouse. Please share your answer for this specific scenario (when min = max).
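For concreteness, the two configurations being compared would look roughly like this (hypothetical warehouse names; multi-cluster warehouses require Enterprise edition or higher):
-- Option 1: single X-Large warehouse (16 credits / hour)
create warehouse wh_xl with warehouse_size = 'XLARGE';
-- Option 2: multi-cluster Large warehouse with min = max = 2 (2 x 8 = 16 credits / hour)
create warehouse wh_l_mc with warehouse_size = 'LARGE'
  min_cluster_count = 2
  max_cluster_count = 2;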
So these are the things that happen when running a query.
Below I am going to use 'single' to mean the single X-Large instance and 'multi' to mean the multi-cluster of Large instances; note that any given query only ever runs on one instance.
Reading/writing IO from your storage layer:
Here the single has twice the IO of each multi cluster, thus if your query is IO-saturated the single is the better choice.
Parallel steps:
So if you have a GROUP BY over high-cardinality columns, the single and the multi should be equally good. If you have low cardinality but billions of rows, the smaller instances might give better results, as those complex steps cannot be broken up across the larger cluster size of the single instance anyway. But this is most likely lost in the wash if you have many concurrent queries.
Many queries / Noisy neighbour:
If you have hundreds of queries hitting in waves, the larger single instance is worse at starting those queries, as it simply has fewer concurrent task slots at once, and a single very large query can flush caches or just dominate the cluster, meaning you stop handling normal/small queries. Whereas with the multi-cluster, if only one "super heavy" query comes in, you only stall half your normal queries.
Other thoughts
It also really depends on your load patterns. At my last job, we had an auto-scaling cluster of SMALL instances used to answer our read queries for dashboards and reports, and we allowed it to run a little over-provisioned, so things were snappy.
Whereas our data loading ran on a second auto-scaling cluster of MEDIUM instances, which we overloaded on purpose, as we were trying to load data the fastest/cheapest. In non-peak times we programmatically reduced the auto-scaling MAX to almost starve the loading, but would do some expensive reprocessing on a LARGE instance via those saved credits in "the middle of the night". Our loading tasks also had the ability to spin up exclusive LARGE+ size warehouses to do one-off rebuilds; as this was all IO-bound work, the smaller the window of "outage" the better the system was, and the IO scales linearly, so the total cost was the same.
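As a rough sketch of what that kind of programmatic throttling can look like (hypothetical warehouse names and sizes, not our actual setup):
-- off-peak: squeeze the loading warehouse down to a single cluster
alter warehouse load_wh set max_cluster_count = 1;
-- one-off rebuild: a temporary larger warehouse that suspends itself when idle
create warehouse rebuild_wh with
  warehouse_size = 'LARGE'
  auto_suspend = 60
  initially_suspended = true;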
Which is all to say, "what is best" really depends on what you are doing, your budget, and the trade-offs you are prepared for. The golden thing about Snowflake is that it is not like a classic DB where you have to pick the size and get it right: pick one, watch it, and if it's struggling change it on the fly. We did this a number of times when a release of our code or of Snowflake changed the performance of some critical SQL; we would jump in and double or triple the instance count, or size, to get past the situation while trying to fix or work around the issue, or awaiting Snowflake to roll a release back. For a couple of hours, spending more credits is generally not budget breaking. This flexibility also means you can just experiment: "what happens if we try a 4x smaller instance..." "oh, nothing... look, we just saved heaps of money".
If you have min=max=2 then you permanently have 2 clusters running (as long as the warehouse is not suspended). If you configure your multi-cluster warehouse like this then you lose a lot of the advantages, but for your specific use case it might make sense, I suppose.
Based on your comment, here is my answer:
In both scenarios, you will have the same resources to process your queries. The important difference is about running single heavy queries. As you may know, a single query cannot span multiple clusters (yet), so when you run a query on your multi-cluster warehouse, it will be processed on one of the Large clusters (and use at most 8 nodes).
If you run the same query on your single XL warehouse, it can be executed by (at most) 16 nodes. So if you will run heavy queries which require more memory and CPU, using a single XL warehouse would be better for you.
About concurrency, there is a parameter named "MAX_CONCURRENCY_LEVEL". Its default value is 8, and it limits the maximum number of concurrent executions per warehouse cluster. If you do not change it, your single XL warehouse will execute a maximum of 8 queries concurrently, while your multi-cluster warehouse can execute 16 queries concurrently (8 per cluster).
https://docs.snowflake.com/en/sql-reference/parameters.html#max-concurrency-level
You may increase this parameter and achieve the same concurrency on both the single XL and the multi-cluster L warehouse. But in that case, you should be careful when you run heavy and light queries together, because one query may use most of the resources of the warehouse, leaving your light queries with fewer resources so they take longer. So I would recommend using a multi-cluster warehouse if you will have "relatively" light/concurrent queries.
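If you do go that way, raising the limit is a single parameter change (the warehouse name here is illustrative):
alter warehouse wh_xl set max_concurrency_level = 16;  -- default is 8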
I have been playing around with the Snowflake Query Profile interface but am missing information about the parallelism in query execution. Using a Large or X-Large warehouse, it is still only using two servers to execute the query. With an X-Large warehouse, a big sort could be divided into 16 parallel execution threads to fully exploit my warehouse and credits. Or?
Example: Having a Medium Warehouse as:
Medium Warehouse => 4 servers
Executing the following query:
select
    sum(o_totalprice) "order total",
    count(*) "number of orders",
    c.c_name "customer"
from
    orders o
    inner join customer c on c.c_custkey = o.o_custkey
where
    c.c_nationkey in (2, 7, 22)
group by
    c.c_name
Gives the following Query Plan:
[query plan screenshot]
In the execution details I cannot see anything about the participating servers:
[execution details screenshot]
Best Regards
Jan Isaksson
In an ideal situation Snowflake will try to split your query and let every core of the warehouse process a piece of it. For example, if you have a 2XL warehouse, you have 32 x 8 = 256 cores (each node in a warehouse has 8 cores). So, if a query is submitted, in an ideal situation Snowflake will try to divide it into 256 parts and have each core process one part.
In reality, it may not be possible to parallelize to that extent, either because the query itself cannot be broken down like that (for example, if you are trying to calculate, say, a median), or because the data itself prevents it (for example, if you are running a window function over a skewed column).
Hence, it is not always true that if you move to a bigger warehouse your query performance will improve linearly.
I tested your query, starting with the smallest compute size and working up. The linear scaling (more compute resources resulting in better performance) stops around Medium size, at which point there is no added performance benefit. This indicates your query is not big enough to take advantage of more compute resources, and size Small is good enough, especially considering cost optimization.
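If you want to repeat that kind of test, a minimal sketch (assuming a warehouse called my_wh, and disabling the result cache so reruns are comparable):
alter session set use_cached_result = false;          -- do not serve reruns from the result cache
alter warehouse my_wh set warehouse_size = 'MEDIUM';   -- resize, rerun the query, repeat per size
select query_text, warehouse_size, total_elapsed_time
from table(information_schema.query_history())
order by start_time desc
limit 10;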
Currently, one of our tables is 500 million rows (with 35 columns), and we are trying to determine how big the table can get before it impacts the performance of queries run against it.
Performance cannot be measured like rows*columns.
It depends on the data types, joins, aggregations, etc. Your query performance can be drastically improved, for example, by creating int keys (adding columns) and joining on them instead of on char/varchar keys.
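As a hedged illustration of that last point (hypothetical table and column names), a narrow integer surrogate key can stand in for a wide varchar key in joins:
-- map the varchar business key to a small int surrogate key
create table customer_keys as
select customer_code,                                              -- original varchar key
       row_number() over (order by customer_code) as customer_id   -- new int join key
from (select distinct customer_code from orders) as codes;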
An important addition to #vtuhtan's answer: enable compression. Create tables with compression enabled, using encodings suited to the various data types (lzo, runlength, etc.). The proper compression type is also suggested by Redshift for existing tables via the ANALYZE COMPRESSION SQL command. Compression reduces the amount of data read from disk and drastically increases your query performance. It will also make the table consume less storage space.
Doc on analyzing compression enabled tables
Loading tables with compression.
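A minimal sketch of what that looks like in Redshift (hypothetical table; encodings chosen purely for illustration):
-- explicit column encodings at table creation time
create table events (
    event_id   bigint      encode delta,
    event_type varchar(32) encode lzo,
    status     char(1)     encode runlength
);
-- ask Redshift to suggest encodings for an existing table
analyze compression events;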
I have a growing table storing time series data, 500M entries now, and 200K new records every day. The total size is around 15GB for now.
My clients are querying the table via a PHP script mostly, and the size of the result set is around 10K records (not very large).
select * from T where timestamp > X and timestamp < Y and additionFilters
And I want this operation cheap.
Currently my table is hosted in Postgres 7, on a single box with 16 GB of memory, and I would love to see some good suggestions for hosting this at low cost while still allowing me to scale up for performance if needed.
The table serves:
1. Query: 90%
2. Insert: 9.9%
3. Update: 0.1% <-- very rare.
PostgreSQL 9.2 supports partitioning and partial indexes. If there are a few hot partitions, and you can put those partitions or their indexes on a solid state disk, you should be able to run rings around your current configuration.
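For example, a minimal sketch of the partial-index idea (hypothetical index name and cutoff date, using the T and timestamp column from the query above); an index restricted to the hot, recent range stays small enough to keep on fast storage:
-- small partial index covering only recent data
create index t_recent_ts_idx on T ("timestamp")
where "timestamp" >= '2013-01-01';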
There may or may not be a low cost, scalable option. It depends on what low cost and scalable mean to you.