While doing some query optimization work, we found that one of our queries has a very high "synchronization" time in a table scan step (up to 98% of the total query execution time). The query joins three tables: one fact table with 100B+ rows and two small dimension tables. The virtual warehouse is size S.
The Snowflake docs say very little about what "synchronization" is: "various synchronization activities between participating processes." (Analyzing the Query History, Snowflake Documentation)
Does anyone know what this synchronization component is and how to improve it?
Other than coder preference, is there any reason to use a series of WITH(SELECT . . . ) statements/clauses in Snowflake rather than creating several temporary tables? Are there performance or other issues that would lead one to prefer one method over the other?
I have some real-world experience with this. We followed a hybrid approach in our ETLs.
CTEs keep results in memory and so save I/O cost when processing huge volumes of data. In our experience with an ETL involving 10 billion rows, the CTE approach took 2 hours while the table approach took 4.5 hours. But we should choose our weapon carefully: CTEs will not perform well in all scenarios. In the scenarios below, you must do some testing before using CTEs:
When your ETL query has more than 7-8 steps
Query is a combination of large and small tables
Volume of data is low and there is an opportunity to reuse the transformation logic
Please also remember that CTEs cannot take advantage of metadata, so row count estimates for joins and scans will be inaccurate and the optimizer will not be able to optimize the query path.
I have been playing around with the Snowflake Query Profile interface, but I am missing information about parallelism in query execution. Even with a Large or XLarge warehouse, it still uses only two servers to execute the query. With an XLarge warehouse, a big sort could be divided into 16 parallel execution threads to fully exploit my warehouse and credits. Or am I missing something?
Example: given a Medium warehouse:
Medium Warehouse => 4 servers
Executing the following query:
select
    sum(o_totalprice) "order total",
    count(*) "number of orders",
    c.c_name "customer"
from
    orders o inner join customer c on c.c_custkey = o.o_custkey
where
    c.c_nationkey in (2,7,22)
group by
    c.c_name
Gives the following Query Plan:
[Image: query plan]
In the execution details I cannot see anything about the participating servers:
[Image: execution details]
Best Regards
Jan Isaksson
In an ideal situation, Snowflake will try to split your query and let every core of the warehouse process a piece of it. For example, a 2XL warehouse has 32 x 8 = 256 cores (each node in a warehouse has 8 cores). So when a query is submitted, Snowflake will ideally try to divide it into 256 parts and have each core process one part.
In reality, it may not be possible to parallelize to that extent, either because the query itself cannot be broken down like that (for example, if you are calculating a median) or because the data prevents it (for example, if you are running a window function on a skewed column).
Hence, it is not always true that if you move to a bigger warehouse your query performance will improve linearly.
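A rough way to see why scaling flattens out is sketched below. It uses the node counts quoted in this thread (Medium = 4, XLarge = 16, 2XL = 32) and the 8-cores-per-node figure from the answer above. It then applies Amdahl's law as a stand-in model. That model is my assumption for illustration, not something Snowflake documents, and the 90% parallel fraction is a made-up input.

```python
# Node counts per warehouse size, following the doubling pattern quoted in
# this thread (Medium = 4 servers, XLarge = 16, 2XL = 32).
NODES = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16, "2XL": 32}
CORES_PER_NODE = 8  # figure taken from the answer above; treat as an assumption


def total_cores(size: str) -> int:
    """Cores available in a warehouse of the given size."""
    return NODES[size] * CORES_PER_NODE


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: the serial part of the work caps the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical query in which 90% of the work can run in parallel.
    for size in NODES:
        cores = total_cores(size)
        print(f"{size:>3}: {cores:>3} cores, speedup ~{amdahl_speedup(0.90, cores):.2f}x")
```

With any serial portion at all, the speedup approaches a ceiling long before the core count stops doubling, which is consistent with the diminishing returns described above.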
I tested your query starting with the smallest compute size and going up. The linear scaling (more compute resources resulting in better performance) stops around size Medium, beyond which there is no further performance benefit. This indicates your query is not big enough to take advantage of more compute resources, and size S is good enough, especially considering cost.
I am modelling data for CrateDB.
I have about 400 customers on average, and they produce different amounts of time-series data every day (between 5K and 500K; avg. ~15K).
Later I should be able to query per customer_year_month and per customer_year_calendar_week.
That means that I will only query for the intervals:
week
and month
Now I am asking myself how to partition this table.
I would partition per customer and year.
Does this make sense?
Or would it be better to partition by customer, year and month?
The question of how to partition a table is quite complex and depends on a lot of things. Among others:
What queries should be run?
The way the data is inserted
Available hardware resources
Cluster size
Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which, if chosen improperly, can hinder performance a lot.
So in your case, 15K inserts a day is not too much. However, the distribution of inserts might cause problems: a customer's partition that grows by 500K inserts a day will run into performance problems earlier than that of a 5K customer. As a consequence, I would use weekly partitioning only.
create table "customer-logging" (
    customer_id long,
    log string,
    ts timestamp,
    week as date_trunc('week', ts)  -- generated column used as the partition key
) clustered into 8 shards partitioned by (week);
Please only use 8 shards if you have an appropriate amount of CPU cores ;)
Docs: date_trunc(), partitioned tables
Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)
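To put rough numbers on the shard overhead, here is a minimal sketch that multiplies partitions by shards per partition for the schemes discussed in this thread. It assumes 400 customers and 8 shards per partition as above, ignores replicas, and approximates a year as 52 weeks.

```python
CUSTOMERS = 400            # average customer count from the question
SHARDS_PER_PARTITION = 8   # value used in the CREATE TABLE above


def shards_per_year(partitions_per_year: int,
                    shards_per_partition: int = SHARDS_PER_PARTITION) -> int:
    """Primary shards created in one year (replicas not counted)."""
    return partitions_per_year * shards_per_partition


# Partitions created per year under each candidate scheme.
SCHEMES = {
    "week": 52,                              # approximation: 52 weeks per year
    "customer, year": CUSTOMERS,
    "customer, year, month": CUSTOMERS * 12,
}

if __name__ == "__main__":
    for name, partitions in SCHEMES.items():
        print(f"{name:<24} {partitions:>5} partitions -> "
              f"{shards_per_year(partitions):>6} shards per year")
```

The per-customer schemes create far more shards than weekly partitioning, which is the overhead the answer warns about.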
Cheers, Claus
Setup
Cost of Threshold for Parallelism : 5
Max Degree of Parallelism : 4
Number of Processors : 8
SQL Server 2008 10.0.2.2757
I have a query with many joins, many records.
The design is a star. ( Central table with fks to the reference tables )
The central table is partitioned on the relevant date column.
The partition schema is split by days
The data is very well split across the partition schema - as judged by comparing the sizes of the files in the filegroups assigned to the partition schema
Queries involved have the predicate set over the partitioned column, such as ( cs.dte >= @min_date and cs.dte < @max_date )
The values of the date parameters are a day apart, at midnight, e.g. 2010-02-01, 2010-02-02
The estimated query plan shows no parallelism
a) This question concerns the SQL Server 2008 Database Engine. When a query in the OLTP engine is running, I would like the sort of insight one gets when profiling an SSAS query using the Progress End event, where one sees something like "Done reading PartitionXYZ".
b) If the estimated or actual query plan shows no parallel processing, does that mean that all partitions will be / were checked / read? * What I was trying to say here was: just because I don't see parallelism in a query plan, that doesn't guarantee the query isn't hitting multiple partitions, right? Or is there a solid relationship between parallelism and the number of partitions accessed?
c) suggestions? Is there more information that I need to provide?
d) How can I tell if a query is processing in parallel without looking at the actual query plan? * I'm really only interested in this if it is helpful in pinning down which partitions are being used.
Added Nov 10
Try this:
Create queries that should hit 1, 3, and all of your partitions
Open an SSMS query window, and run SET SHOWPLAN_XML ON
Run each query one by one in that window
Each run will kick out a chunk of XML
Compare these XML results (I use a text diff tool, “CompareIt”, but any similar tool would do)
You should see that the execution plans are significantly different. In my "3" and "All" queries, there's a chunk of text tagged as "ConstantScan" that has an entry for (respectively) 3 and all partitions in the table, and that section is not present for the "1 partition" query. I use this to infer that yes indeed, SQL is doing what it says it will do, to wit: only reading as much of the table as it believes it needs to in order to resolve the query.
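If you don't have a diff tool handy, the comparison step can be sketched with Python's standard difflib. The two plan strings below are hand-written toy fragments, not real SHOWPLAN_XML output; only the ConstantScan tag name comes from the description above. In practice you would paste in the XML saved from each run.

```python
import difflib

# Toy stand-ins for two SHOWPLAN_XML results (hypothetical, heavily shortened).
PLAN_ONE_PARTITION = """\
<QueryPlan>
  <RelOp PhysicalOp="Clustered Index Seek" />
</QueryPlan>"""

PLAN_THREE_PARTITIONS = """\
<QueryPlan>
  <RelOp PhysicalOp="Constant Scan">
    <ConstantScan />
  </RelOp>
  <RelOp PhysicalOp="Clustered Index Seek" />
</QueryPlan>"""


def diff_plans(left: str, right: str) -> list[str]:
    """Return unified-diff lines between two plan texts."""
    return list(
        difflib.unified_diff(
            left.splitlines(),
            right.splitlines(),
            fromfile="one_partition",
            tofile="three_partitions",
            lineterm="",
        )
    )


def added_lines(diff: list[str]) -> list[str]:
    """Keep only the lines present in the right-hand plan but not the left."""
    return [line[1:].strip() for line in diff
            if line.startswith("+") and not line.startswith("+++")]


if __name__ == "__main__":
    for line in diff_plans(PLAN_ONE_PARTITION, PLAN_THREE_PARTITIONS):
        print(line)
```

Filtering the diff down to added lines makes it easy to spot whether a section such as ConstantScan appears only in the multi-partition plans.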
Got a pretty good answer here: http://www.sqlservercentral.com/Forums/Topic1064946-391-1.aspx#bm1065048
a) I am not aware of any way to determine how a query has progressed while the query is still running. Maybe something finicky with the latching and locking system views, but I doubt it. (I am, alas, not familiar enough with SSAS to draw parallels between the two.)
b) SQL will probably use parallelism when working with multiple partitions within a single table, in which case you will see parallel processing "tokens" in your query plan. However, if for whatever reason parallelism is not invoked yet multiple partitions must be read, they will be read without the use of parallelism.
d) Another thing that perhaps cannot be done. Under very controlled circumstances, you could use System Monitor (Perfmon) to track CPU usage or perhaps disk reads during the execution of the query. This won't help if the server is performing other work, or if the data is resident in memory (the buffer cache), and so may be of limited use.
c) What is it you are actually trying to figure out? Which partitions (if any) are being accessed by users over a period of time? Is SQL generating a "smart" query plan? Without details of the data, structure, and query, it's hard to come up with advice.
I'm trying to squeeze some extra performance out of searching through a table with many rows.
My current reasoning is that if I can throw away some of the seldom-used members from the searched table, thereby reducing row size, the number of page splits and hence I/O should drop, giving a benefit when data starts to spill from memory.
Any good resource detailing such effects?
Any experiences?
Thanks.
Tuning the size of a row is only a major issue if the RDBMS is performing a full table scan. If your query can select the rows using only indexes, then the row size is less important (unless you are returning a very large number of rows, where the I/O of returning the actual result is significant).
If you are doing a full table scan, or partial scans of large numbers of rows, because you have predicates that are not using indexes, then row size can be a major factor. One example I remember: on a table of the order of 100,000,000 rows, splitting the largish 'data' columns into a different table from the columns used for querying resulted in an order-of-magnitude performance improvement on some queries.
I would only expect this to be a major factor in a relatively small number of situations.
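A back-of-the-envelope sketch of why row size matters for scans is below. It assumes SQL Server's 8 KB page layout, with roughly 8,096 usable bytes per page and a 2-byte slot entry per row, which is the approximation used in Microsoft's heap size estimation guidance. The two row sizes are hypothetical; only the 100,000,000 row count comes from the example above.

```python
import math

USABLE_BYTES_PER_PAGE = 8096  # 8 KB page minus the 96-byte header (SQL Server)
SLOT_ENTRY_BYTES = 2          # per-row entry in the page's slot array


def rows_per_page(row_size_bytes: int) -> int:
    """How many fixed-size rows fit on one data page."""
    return USABLE_BYTES_PER_PAGE // (row_size_bytes + SLOT_ENTRY_BYTES)


def pages_for_scan(row_count: int, row_size_bytes: int) -> int:
    """Pages a full scan has to read for the given table."""
    return math.ceil(row_count / rows_per_page(row_size_bytes))


if __name__ == "__main__":
    rows = 100_000_000
    wide = pages_for_scan(rows, 400)   # hypothetical row with the 'data' columns
    narrow = pages_for_scan(rows, 40)  # hypothetical row with only search columns
    print(f"wide rows:   {wide:,} pages")
    print(f"narrow rows: {narrow:,} pages")
    print(f"scan reads ~{wide / narrow:.1f}x fewer pages after the split")
```

Shrinking the scanned row by 10x cuts the pages read by roughly the same factor, which fits the order-of-magnitude improvement reported for scan-heavy queries.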
I don't know what else you have tried to increase performance, but this seems like grasping at straws to me. That doesn't mean that it isn't a valid approach. In my experience the benefit can be significant. It's just that it's usually dwarfed by other kinds of optimization.
However, what you are looking for are I/O statistics. There are several methods to gather them. A quite good introduction can be found here.
The SQL Server query optimizer is a very complex algorithm, and the decision about which index to use or what type of scan to perform depends on many factors, such as query output columns, available indexes, available statistics, the statistical distribution of the data values in the columns, row count, and row size.
So the only valid answer to your question is: It depends :)
Give some more information, such as what kind of optimization you have already done, what the query plan looks like, etc.
Of course, when SQL Server decides to do a table scan (a clustered index scan if available), you can reduce I/O by reducing the row size. But in that case you would improve performance dramatically by creating an adequate index (which is de facto a separate table with a smaller row size).
If the application is transactional then look at the indexes in use on the table. Table partitioning is unlikely to be much help in this situation.
If you have something like a data warehouse and are doing aggregate queries over a lot of data then you might get some mileage from partitioning.
If you are doing a join between two large tables that are not in a 1:M relationship the query optimiser may have to resolve the predicates on each table separately and then combine relatively large intermediate result sets or run a slow operator like nested loops matching one side of the join. In this case you may get a benefit from a trigger-maintained denormalised table to do the searches. I've seen good results obtained from denormalised search tables for complex screens on a couple of large applications.
If you're interested in minimizing I/O when reading data, you need to check whether your indexes cover the query. To minimize I/O, select only columns that are included in the index, or create indexes that cover all columns used in the query; this way the optimizer will read data from the indexes and never from the actual table rows.
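The covering check itself is just a subset test, sketched below. The index and column names are hypothetical, and a real check would also have to account for included columns and the clustered key, which this sketch ignores.

```python
def is_covering(index_columns: set[str], query_columns: set[str]) -> bool:
    """True when every column the query touches is available in the index."""
    return query_columns <= index_columns


def missing_columns(index_columns: set[str], query_columns: set[str]) -> set[str]:
    """Columns that would force a lookup into the base table rows."""
    return query_columns - index_columns


if __name__ == "__main__":
    # Hypothetical index on (customer_id, order_date).
    index = {"customer_id", "order_date"}
    print(is_covering(index, {"customer_id", "order_date"}))       # covered
    print(missing_columns(index, {"customer_id", "total_price"}))  # needs base table
```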
If you're looking at this level of detail, maybe you should consider upgrading hardware, changing controllers, or adding more disks so that more spindles are available to the query processor, allowing SQL Server to read more data at the same time.
SQL Server disk I/O is frequently the cause of bottlenecks in most systems. The I/O subsystem includes disks, disk controller cards, and the system bus. If disk I/O is consistently high, consider:
Move some database files to an additional disk or server.
Use a faster disk drive or a redundant array of inexpensive disks (RAID) device.
Add additional disks to a RAID array, if one already is being used.
Tune your application or database to reduce disk access operations.
Consider index coverage, better indexes, and/or normalization.
Microsoft SQL Server uses Microsoft Windows I/O calls to perform disk reads and writes. SQL Server manages when and how disk I/O is performed, but the Windows operating system performs the underlying I/O operations. Applications and systems that are I/O-bound may keep the disk constantly active.
Different disk controllers and drivers use different amounts of CPU time to perform disk I/O. Efficient controllers and drivers use less time, leaving more processing time available for user applications and increasing overall throughput.
The first thing I would do is ensure that your indexes have been rebuilt. If you are dealing with a huge amount of data and an index rebuild is not possible (from SQL Server 2005 onwards you can perform online rebuilds without locking everyone out), then ensure that your statistics are up to date (more on this later).
If your database contains representative data, then you can perform a simple measurement of the number of reads (logical and physical) that your query is using by doing the following:
SET STATISTICS IO ON
GO
-- Execute your query here
SET STATISTICS IO OFF
GO
On a well set up database server, there should be few or no physical reads (high physical reads often indicate that your server needs more RAM). How many logical reads are you doing? If this number is high, then you will need to look at creating indexes. The next step is to display the estimated execution plan for the query, then rerun it (clearing the cache first) while displaying the actual execution plan. If the two differ, then your statistics are out of date.
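If you collect the messages that SET STATISTICS IO ON prints for several runs, a small parser makes the read counts easier to compare. This is a minimal sketch: the sample text, table names, and numbers are made up, and it assumes the usual "Table '...'. Scan count N, logical reads N, physical reads N, ..." message layout.

```python
import re

# Matches the leading fields of one STATISTICS IO message line.
IO_LINE = re.compile(
    r"Table '(?P<table>[^']+)'\. "
    r"Scan count (?P<scan_count>\d+), "
    r"logical reads (?P<logical_reads>\d+), "
    r"physical reads (?P<physical_reads>\d+)"
)


def parse_statistics_io(messages: str) -> list[dict]:
    """Extract per-table scan and read counts from STATISTICS IO output."""
    results = []
    for match in IO_LINE.finditer(messages):
        row = {"table": match.group("table")}
        for key in ("scan_count", "logical_reads", "physical_reads"):
            row[key] = int(match.group(key))
        results.append(row)
    return results


# Hypothetical output for a two-table query.
SAMPLE = (
    "Table 'FactSales'. Scan count 1, logical reads 48213, physical reads 12, "
    "read-ahead reads 0, lob logical reads 0, lob physical reads 0, "
    "lob read-ahead reads 0.\n"
    "Table 'DimDate'. Scan count 1, logical reads 7, physical reads 0, "
    "read-ahead reads 0, lob logical reads 0, lob physical reads 0, "
    "lob read-ahead reads 0.\n"
)

if __name__ == "__main__":
    for row in parse_statistics_io(SAMPLE):
        print(row)
```

Sorting the parsed rows by logical reads is a quick way to see which table dominates the I/O of a query.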
I think you're going to be farther ahead using standard optimization techniques first -- check your execution plan, profiler trace, etc. and see whether you need to adjust your indexes, create statistics etc. -- before looking at the physical structure of your table.