I am looking at pattern matching on streaming stock data using Snowflake's MATCH_RECOGNIZE. I need to limit the window of data at any point to 30 days' worth. In Microsoft Azure Stream Analytics this is achieved with the LIMIT DURATION clause of MATCH_RECOGNIZE. How do I do that in Snowflake?
I am thinking of using a CTE to first select 30 days' worth of data and then running MATCH_RECOGNIZE over the result.
But I am not sure whether a CTE will be as performant as LIMIT DURATION. Is there a more optimized way of achieving this?
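For reference, the CTE approach described above might look roughly like the following. This is only a sketch: the table stock_prices, its columns (symbol, trade_date, price), and the V-shaped pattern are hypothetical stand-ins. It also differs semantically from LIMIT DURATION: the CTE fixes one 30-day window relative to the current date, whereas LIMIT DURATION bounds the time span of each individual match.

```sql
-- Hypothetical table and column names; the pattern is only an illustration.
WITH recent AS (
    SELECT symbol, trade_date, price
    FROM stock_prices
    WHERE trade_date >= DATEADD(day, -30, CURRENT_DATE())  -- keep the last 30 days only
)
SELECT *
FROM recent
MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY trade_date
    MEASURES
        FIRST(trade_date) AS start_date,
        LAST(trade_date)  AS end_date,
        COUNT(*)          AS rows_in_match
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (strt down+ up+)
    DEFINE
        down AS price < LAG(price),
        up   AS price > LAG(price)
);
```

Because the date filter is a plain range predicate on trade_date, it gives Snowflake a chance to prune micropartitions before the pattern matching runs, provided the table is reasonably well ordered on that column.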
My BI developer wrote a query that took 14 hours to run, and I'm trying to help him out. At a high level, it's a query that explores the financial transactions of the past 15 years and breaks them down by quarter.
I'm sharing the advice I have already given him here, but I would like to know if you have any suggestions on where we can explore and research further to improve performance, for example: "perhaps you may want to look at snapshot..."
His query consists of:
Multiple nested views, meaning it selects from one view to produce another view, and so on.
Some views that join three tables, each with around 100 - 200 million rows.
Certain views that use subselect queries.
Here are my recommendations so far:
Do not use nested views to produce the query. Instead, create new tables for each of them, because the data (financial transaction data) is not dynamic and won't change. In my experience, nested views aren't good for performance.
Do not use subqueries; use JOIN whenever possible.
I make sure he creates nonclustered indexes wherever appropriate.
Do not use temp tables when there is this much data.
Try to use WITH (NOLOCK) on all tables used in JOINs.
Find a common query and convert it into a stored procedure.
When joining those three large tables (100 - 200 million rows), try to limit the amount of data at the JOIN instead of using WHERE. For example, instead of SELECT * FROM tableA JOIN tableB WHERE ..., use SELECT * FROM tableA JOIN tableB ON ... AND tableA.date BETWEEN range. This will give less data when joining with other tables later in the query.
The problem is that the data he has to work with is huge. I wonder whether query tuning can only do so much, because at the end of the day you still have to process all that data in your query. Perhaps the next step is to think about how to prepare the data and store it in smaller tables first, such as CostQ1_2010, CostQ2_2020, etc., and then write the query against those tables.
You have given us very little information to go on. Tolstoy wrote, "All happy families are alike; each unhappy family is unhappy in its own way." That's also true of SQL queries, especially big BI queries.
I'll risk some general answers.
With tables of the size you mention, your query surely contains date-range WHERE filters like transaction_date >= something AND transaction_date < anotherthing. In general, a useful report covers a year out of a decade of transactions. So make sure you have the right indexes to do index range scans where possible. SSMS, if you choose the Include Actual Execution Plan feature, sometimes suggests indexes.
Learn to read execution plans.
Read about covering indexes. They can sometimes make a big difference.
Use the statement SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before starting this kind of long-running historical BI query. You'll get less interference between the BI query and other activity on the database.
It may make sense to preload some denormalized tables from the views used in the BI query.
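Several of these suggestions can be sketched together in T-SQL. Everything named here (dbo.Transactions, its columns, the index name, and dbo.QuarterlyTotals) is hypothetical, since the question does not show the schema; treat it as an illustration of the shape, not a drop-in fix.

```sql
-- Hypothetical schema: dbo.Transactions(transaction_date, account_id, amount).

-- A covering index: the key supports a range scan on the date,
-- and the INCLUDE columns let the query below be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Transactions_Date_Covering
    ON dbo.Transactions (transaction_date)
    INCLUDE (account_id, amount);

-- Reduce lock interference for a long-running historical read.
-- Note that this allows dirty reads.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Preload one year of quarterly totals into a small denormalized table,
-- using a half-open date range so the index range scan applies.
SELECT
    DATEPART(year, t.transaction_date)    AS txn_year,
    DATEPART(quarter, t.transaction_date) AS txn_quarter,
    t.account_id,
    SUM(t.amount)                         AS total_amount,
    COUNT(*)                              AS txn_count
INTO dbo.QuarterlyTotals
FROM dbo.Transactions AS t
WHERE t.transaction_date >= '20100101'
  AND t.transaction_date <  '20110101'
GROUP BY
    DATEPART(year, t.transaction_date),
    DATEPART(quarter, t.transaction_date),
    t.account_id;
```

The BI query can then read from the small summary table instead of re-aggregating hundreds of millions of rows on every run.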
Can someone explain when to use search optimization and when to use a cluster key on a table, or whether we should use both?
It looks like we are losing credits if we enable both of them?
Thanks,
Sye
Search optimization is used when you need to access a small number of rows (point lookup queries), as you would when accessing an OLTP database.
A cluster key is for partitioning your data. It's generally good for any kind of workload unless you need to read the whole table.
If you don't need to access a specific row in your large table, you don't need Search optimization service.
If your table is not large, or if you ingest "ordered" data to your table, you don't need auto-clustering (cluster keys).
When you load a table into Snowflake, it creates 'micropartitions' based on the order of the rows at load time. When a SQL statement is run, the WHERE clause is used to prune the set of micropartitions that need to be scanned.
A cluster key in Snowflake simply reorders the data by the cluster key, so that related rows are co-located within the same micropartitions. This can result in massive performance improvements if your queries frequently use the cluster key in the WHERE clause to filter the results.
Search optimization is for finding 1 or a small number of records based on using '=' in the where clause.
So if you have a table with Product_ID, Transaction_Date, Amount.
Queries using 'Where Year(Transaction_Date) >= 2017' would benefit from a cluster key on Transaction_Date.
Queries using 'Where Product_ID = 111222333' would benefit from search optimization.
In either case, these are only needed if your table is large (think billions of rows). Otherwise, the native Snowflake micropartition approach will do a good job at optimization.
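As a rough sketch of how the two features would be enabled for the example table above: the table name sales is hypothetical, and the columns are the ones from the example. Both features consume credits for background maintenance, so it is worth enabling each only for the access pattern it actually serves.

```sql
-- Hypothetical table: sales(product_id, transaction_date, amount).

-- Cluster key: co-locates rows by date so range filters can prune micropartitions.
ALTER TABLE sales CLUSTER BY (transaction_date);

-- Search optimization: targets equality lookups on a single column.
ALTER TABLE sales ADD SEARCH OPTIMIZATION ON EQUALITY(product_id);

-- Range query that can benefit from the cluster key.
SELECT SUM(amount)
FROM sales
WHERE transaction_date >= '2017-01-01';

-- Point lookup that can benefit from search optimization.
SELECT *
FROM sales
WHERE product_id = 111222333;

-- Check how well the table is clustered on the chosen key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(transaction_date)');
```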
Please don't call Cluster Key "partitioning". Although the effect is similar, they are two distinct operations with different meanings. I will be publishing an article on partitioning and pruning shortly.
I have been playing around with the Snowflake Query Profile interface, but I am missing information about the parallelism of query execution. Even with a Large or XLarge warehouse, it still seems to use only two servers to execute the query. With an XLarge warehouse, a big sort could be divided into 16 parallel execution threads to fully exploit my warehouse and credits. Or is that not the case?
Example: Having a Medium Warehouse as:
Medium Warehouse => 4 servers
Executing the following query:
select
    sum(o_totalprice) "order total",
    count(*) "number of orders",
    c.c_name "customer"
from
    orders o inner join customer c on c.c_custkey = o.o_custkey
where
    c.c_nationkey in (2,7,22)
group by
    c.c_name
Gives the following Query Plan:
[Image: query plan]
In the execution details I cannot see anything about the participating servers:
[Image: execution details]
Best Regards
Jan Isaksson
In an ideal situation, Snowflake will try to split your query and let every core of the warehouse process a piece of it. For example, a 2XL warehouse has 32 x 8 = 256 cores (each node in a warehouse has 8 cores). So, when a query is submitted, in an ideal situation Snowflake will try to divide it into 256 parts and have each core process one part.
In reality, it may not be possible to parallelize to that extent, either because the query itself cannot be broken down like that (for example, if you are calculating a median) or because the data itself prevents it (for example, if you are running a window function over a column that is skewed).
Hence, it is not always true that if you move to a bigger warehouse your query performance will improve linearly.
I tested your query starting with the smallest compute size and working up. The roughly linear scaling (more compute resources yielding better performance) stops around Medium, beyond which there is no further performance benefit. This indicates your query is not big enough to take advantage of more compute resources, and size S is good enough, especially considering cost optimization.
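A test like this can be repeated along the following lines. This is a sketch, with my_wh as a hypothetical warehouse name; the result cache is disabled so that repeated runs actually execute.

```sql
-- Hypothetical warehouse name; change the size between runs.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;   -- stop result-cache hits from skewing timings
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- ... run the query under test here, then repeat with another size ...

-- Compare elapsed times (in milliseconds) across the runs.
SELECT query_id, warehouse_size, total_elapsed_time, execution_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 20))
WHERE query_text ILIKE '%o_totalprice%'
ORDER BY start_time DESC;
```

If total_elapsed_time stops dropping as the size goes up, the query is no longer benefiting from the extra nodes.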
I am using Salesforce's parameterized search API - https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_search_parameterized.htm - to search my SF instance. However, it sometimes runs quite slowly, and I want just the counts to begin with. I see there's a record count API - https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_record_count.htm - but it doesn't accept search terms.
Is there a way to combine the two? Should I just use the query resource with a SOSL query that returns only the counts? Any pointers on what that SOSL query would look like?
SOSL is not a good choice for counting records matching specific criteria. A SOSL result set maxes out at 2,000 records. Additionally, SOSL results have a latency of up to approximately 15 minutes while indexes are updated, and hence may not be fully up to date at any given time.
Instead, use the Query REST API resource to execute a SOQL query using the filters you're interested in on a single object at a time, using the COUNT() aggregate function in your SELECT clause.
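As an illustration, a counting query might look like the one below. The object (Account) and the filter on Name are hypothetical placeholders for whatever criteria you would otherwise have searched on.

```sql
-- SOQL, sent URL-encoded as the q parameter of the Query resource,
-- e.g. GET /services/data/vXX.X/query/?q=... (the API version is a placeholder).
SELECT COUNT()
FROM Account
WHERE Name LIKE 'Acme%'
```

With COUNT() and no field name, the response carries the count in its totalSize field and returns no records, so the payload stays small regardless of how many rows match.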
Bear in mind that complex criteria and large data volumes, especially in combination, may cause even a COUNT() query to time out or execute slowly. The fix is situation-specific but likely to involve careful work tuning your query to use indexed fields and efficient comparisons.
Setup
Cost Threshold for Parallelism : 5
Max Degree of Parallelism : 4
Number of Processors : 8
SQL Server 2008 10.0.2.2757
I have a query with many joins, many records.
The design is a star. ( Central table with fks to the reference tables )
The central table is partitioned on the relevant date column.
The partition schema is split by days
The data is very well split across the partition schema - as judged by comparing the sizes of the files in the filegroups assigned to the partition schema
The queries involved have the predicate set over the partitioned column, such as ( cs.dte >= @min_date and cs.dte < @max_date )
The values of the date parameters are a day apart at midnight, e.g. 2010-02-01 and 2010-02-02
The estimated query plan shows no parallelism
a) This question is about the SQL Server 2008 Database Engine. When a query in the OLTP engine is running, I would like to have the sort of insight one gets when profiling an SSAS query using the Progress End event, where one sees something like "Done reading PartitionXYZ".
b) If the estimated or actual query plan shows no parallel processing, does that mean that all partitions will be / were checked / read? * What I was trying to say here was: just because I don't see parallelism in a query plan, that doesn't guarantee the query isn't hitting multiple partitions, right? Or is there a solid relationship between parallelism and the number of partitions accessed?
c) suggestions? Is there more information that I need to provide?
d) How can I tell if a query is processing in parallel without looking at the actual query plan? * I'm really only interested in this if it is helpful in pinning down which partitions are being used.
Added Nov 10
Try this:
Create queries that should hit 1, 3, and all of your partitions
Open an SSMS query window, and run SET SHOWPLAN_XML ON
Run each query one by one in that window
Each run will kick out a chunk of XML
Compare these XML results (I use a text diff tool, “CompareIt”, but any similar tool would do)
You should see that the execution plans are significantly different. In my "3" and "All" queries, there's a chunk of text tagged as "ConstantScan" that has an entry for (respectively) 3 and all partitions in the table, and that section is not present for the "1 partition" query. I use this to infer that yes indeed, SQL Server is doing what it says it will do, to wit: only read as much of the table as it believes it needs to in order to resolve the query.
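The steps above can be sketched as follows. The table name dbo.central_table and the partition function name pf_daily are hypothetical; only the column cs.dte comes from the question. SET SHOWPLAN_XML must be the only statement in its batch, hence the GO separators.

```sql
-- Hypothetical names: dbo.central_table (alias cs), partition function pf_daily.
SET SHOWPLAN_XML ON;
GO

-- Query expected to touch a single daily partition.
-- While SHOWPLAN_XML is on, this is not executed; the plan XML is returned instead.
SELECT COUNT(*)
FROM dbo.central_table AS cs
WHERE cs.dte >= '20100201' AND cs.dte < '20100202';
GO

SET SHOWPLAN_XML OFF;
GO

-- Cross-check: which partition numbers do the rows in that date range live in?
SELECT $PARTITION.pf_daily(cs.dte) AS partition_number,
       COUNT(*)                    AS row_count
FROM dbo.central_table AS cs
WHERE cs.dte >= '20100201' AND cs.dte < '20100202'
GROUP BY $PARTITION.pf_daily(cs.dte)
ORDER BY partition_number;
```

Repeating the first query with a 3-day range and with no date filter gives the other two plans to diff against.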
Got a pretty good answer here: http://www.sqlservercentral.com/Forums/Topic1064946-391-1.aspx#bm1065048
a) I am not aware of any way to determine how a query has progressed while the query is still running. Maybe something finicky with the latching and locking system views, but I doubt it. (I am, alas, not familiar enough with SSAS to draw parallels between the two.)
b) SQL will probably use parallelism when working with multiple partitions within a single table, in which case you will see parallel processing "tokens" in your query plan. However, if for whatever reason parallelism is not invoked yet multiple partitions must be read, they will be read without the use of parallelism.
d) Another thing that perhaps cannot be done. Under very controlled circumstances, you could use System Monitor (Perfmon) to track CPU usage or perhaps disk reads during the execution of the query. This won't help if the server is performing other work, or if the data is resident in memory (the buffer cache), and so may be of limited use.
c) What is it you are actually trying to figure out? Which partitions (if any) are being accessed by users over a period of time? Is SQL generating a "smart" query plan? Without details of the data, structure, and query, it's hard to come up with advice.