Snowflake Performance issues when querying large tables - snowflake-cloud-data-platform

I am trying to query a table which has 1 TB of data clustered by Date and Company. A simple query is taking a long time.
Posting the query and the query profile:
SELECT
    sl.customer_code,
    qt_product_category_l3_sid,
    qt_product_brand_sid,
    sl.partner_code,
    sl.transaction_id,
    dollars_spent,
    units,
    user_pii_sid,
    promo_flag,
    media_flag
FROM
    cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
WHERE
    transaction_date_id >= (to_char(current_date - (52*7), 'yyyymmdd'))
    AND sl.partner_code IN ('All Retailers')
    AND qt_product_category_l3_sid IN (SELECT DISTINCT qt_product_category_l3_sid
                                       FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
                                       WHERE qt_product_category_l1_sid IN (246))
    AND qt_product_brand_sid IN (SELECT qt_product_brand_sid
                                 FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
                                 WHERE qt_product_major_brand_sid IN (246903, 430138))
[query profile screenshot]

"simple query" I am not sure there is such a thing. A naive query, sure.
select * from really_large_table where column1 = value;
will perform really badly if you only care for 1 or 2 of the columns. As snowflake has to load all the data. You will get a column data to row data ratio improvement by using
select column1, column2 from really_large_table where column1 = value;
now only two columns of data need to be read form the data store.
Maybe you are looking for data where the value is > 100 because you think that should not happen. Then
select column1, column2 from really_large_table where column1 > 100 limit 1;
will perform much better than
select column1, column2 from really_large_table order by column1 desc limit 50;
But if you are already doing the minimum work needed to get a correct answer, your next option is to increase the warehouse size, which for IO-bound work gives a proportional improvement, although some aggregation steps don't scale as linearly.
Another thing to look for is that sometimes your calculation can produce too much intermediate state and it spills to local or remote storage (the query profile reports this as bytes spilled), which is much like running out of RAM and swapping to disk.
We have also seen memory pressure when doing too much work in a JavaScript UDF, which slowed things down.
But most of these problems can be spotted by looking at the query profile and finding the hotspots.
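For example, spilling is easy to confirm after the fact; a minimal sketch, assuming your role has access to the SNOWFLAKE.ACCOUNT_USAGE share (which lags real time by up to ~45 minutes):
-- sketch: requires the ACCOUNT_USAGE share to be visible to your role
SELECT query_id,
       total_elapsed_time,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%qi_sg_promo_media_sales_lines_fact%'
ORDER BY start_time DESC
LIMIT 20;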

99% of the time was spent scanning the table. The filters in the query do not match your clustering keys, so the clustering won't help much. Depending on how much historical data you have in this table, and whether you will keep reading a year's worth of data, you might be better off clustering by qt_product_brand_sid or qt_product_category_l3_sid (or creating a materialized view), depending on which one filters the data down faster.
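A sketch of both options, purely to illustrate the syntax; the clustering columns and the materialized view name are assumptions to validate against your real query patterns, and materialized views require Enterprise Edition or above:
-- re-cluster the existing table (assumed column choice)
ALTER TABLE cdw_dwh.public.qi_sg_promo_media_sales_lines_fact
  CLUSTER BY (qt_product_category_l3_sid);

-- or leave the base table alone and cluster a materialized view (hypothetical name)
CREATE MATERIALIZED VIEW cdw_dwh.public.qi_sg_promo_media_sales_mv
  CLUSTER BY (qt_product_brand_sid)
AS
SELECT customer_code, partner_code, transaction_id, transaction_date_id,
       qt_product_category_l3_sid, qt_product_brand_sid,
       dollars_spent, units, user_pii_sid, promo_flag, media_flag
FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact;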

A bigger change would be switching the transaction date column to a true DATE field instead of a VARCHAR.
Second, you have an IN clause with a single value; use = instead.
For the other IN clauses, I would suggest rewriting the query to separate those sub-queries out as CTEs and then just joining to those CTEs, as sketched below.
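A sketch of that rewrite (untested, and assuming the unqualified columns in the original query all live on the fact table):
WITH cat AS (
    SELECT DISTINCT qt_product_category_l3_sid
    FROM cdw_dwh.public.qi_sg_prompt_category_major_brand
    WHERE qt_product_category_l1_sid = 246
),
brand AS (
    SELECT DISTINCT qt_product_brand_sid
    FROM cdw_dwh.public.qi_sg_prompt_category_major_brand
    WHERE qt_product_major_brand_sid IN (246903, 430138)
)
SELECT sl.customer_code,
       sl.qt_product_category_l3_sid,
       sl.qt_product_brand_sid,
       sl.partner_code,
       sl.transaction_id,
       sl.dollars_spent,
       sl.units,
       sl.user_pii_sid,
       sl.promo_flag,
       sl.media_flag
FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
JOIN cat   ON cat.qt_product_category_l3_sid = sl.qt_product_category_l3_sid
JOIN brand ON brand.qt_product_brand_sid = sl.qt_product_brand_sid
WHERE sl.transaction_date_id >= to_char(current_date - (52*7), 'yyyymmdd')
  AND sl.partner_code = 'All Retailers';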

Use this query:
SELECT
    sl.customer_code,
    sl.qt_product_category_l3_sid,
    sl.qt_product_brand_sid,
    sl.partner_code,
    sl.transaction_id,
    sl.dollars_spent,
    sl.units,
    sl.user_pii_sid,
    sl.promo_flag,
    sl.media_flag
FROM
    cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl,
    cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_cat,
    cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_brand
WHERE
    sl.transaction_date_id >= (to_char(current_date - (52*7), 'yyyymmdd'))
    AND sl.partner_code IN ('All Retailers')
    AND sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
    AND prod_cat.qt_product_category_l1_sid = 246
    AND sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
    AND prod_brand.qt_product_major_brand_sid IN (246903, 430138)

Apparently performance is an area of focus for Snowflake R&D. After struggling to make complex queries perform on big data, we got 100x improvements with Exasol, with no tuning whatsoever.

Related

flask-sqlalchemy slow paginate count

I have a Postgres 10 database in my Flask app. I'm trying to paginate filtered results on a table with millions of rows. The problem is that the paginate method counts the total number of query results in a totally inefficient way.
Here's an example with a dummy filter:
paginate = Buildings.query.filter(Buildings.height > 10).paginate(1, 10)
Under the hood it performs 2 queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
SELECT * FROM buildings where height > 10
)
The count comes back as about 200,000 rows.
The problem is that the count on the raw select without the subquery is quite fast (~30 ms), but the paginate method wraps it in a subquery, which takes ~30 s.
The query plan on a cold database:
Is there a way to use the default paginate method from flask-sqlalchemy in a performant way?
EDIT:
To give a better understanding of my problem, here are the real filter operations used in my case, but with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1,10)
So the SQL the ORM produces is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
That query is already backed by the following indexes:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count query, which is very slow.
So it looks like a disk-reading problem. The solutions would be to get faster disks, get more RAM if it can all be cached, or, if you have enough RAM, to use pg_prewarm to get all the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one IO request outstanding at a time.
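A minimal sketch of those two suggestions; the value 32 is just a starting point to tune for your storage:
-- warm the table and its indexes into the Postgres buffer cache
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('public.buildings');
SELECT pg_prewarm('public.ix_trgm_buildings_address');
SELECT pg_prewarm('public.ix_buildings_owner_id');

-- let the bitmap heap scan keep more IO requests in flight (tune per storage)
SET effective_io_concurrency = 32;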
Your actual query seems to be more complex than the one you show, based on the Filter: entry and the Rows Removed by Index Recheck: entry, in combination with the lack of lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary btree index on "height").
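For reference, the Rows Removed by Index Recheck and lossy-block details come from an EXPLAIN with ANALYZE and BUFFERS, so that output is the most useful thing to share; a sketch against the count query shown earlier:
-- run on the real query; BUFFERS shows how much of the work is disk reads
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM buildings
WHERE owner_id IS NULL AND address LIKE '%A%';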

Optimizing queries vs. adding indexes

I have this very old and SLOW query that I am trying to optimize, but I am not sure I can do anything to it other than add more indexes on the columns involved in the WHERE, JOIN and ORDER BY clauses.
Query:
SELECT TOP 400 jobticket.jobnumber, jobticket.typeform, jobticket.filename, jobticket.req_number, jobticket.reqd_del_date, jobticket.point_of_contact, jobticket.status, jobticket.DapsDate, jobticket.elpod, job_info.IDOrderMaskedStatus, job_info.job_status, job_info.SalesID, job_info.location, job_info.TOMetadataID
FROM jobticket WITH (NOLOCK)
INNER JOIN job_info WITH (NOLOCK) ON job_info.jobnumber = jobticket.jobnumber
WHERE
(
NOT(
(jobticket.status = 'Complete' OR jobticket.status = 'Completed')
and (job_info.job_status = 'Actualized' OR job_info.job_status = ''
OR job_info.job_status = 'Actualized Credit Billed'
OR job_info.job_status = 'DWAS Actualized' OR job_info.job_status = 'DWAS Actualized Credit Billed'
)
)
or
((SELECT COUNT(job_status) AS Expr1 FROM tblConsolidatedBilling AS tblConsolidatedBilling_1 WITH (NOLOCK)
WHERE (job_status <> 'Actualized'
AND job_status <> 'Actualized Credit Billed')
AND (master_jobnumber = jobticket.jobnumber)) > 0)
)
and (jobticket.status != 'Waiting Approval' or (jobticket.status = 'Waiting Approval' and jobticket.DPGType is null))
and jobticket.typeform <> 'todpg'
and ((job_info.isHidden <> 1 or job_info.isHidden is null) and job_info.isInConcurrentRelease is null)
and job_info.deleted != '1'
and jobticket.status != 'New Job'
and jobticket.status != 'PRFYCLSFD'
ORDER BY
job_info.expediencyLevel DESC,
jobticket.jobnumber DESC
Execution Plan:
In all honesty I don't know what to do with this query.
Should I add individual nonclustered indexes on all columns involved in the WHERE, JOIN and ORDER BY clauses?
There are many indexes on these tables, but I am not sure whether they are helpful in this query:
Looking at this SQL, I don't really see any clear criteria being used to fetch the rows. It looks like it is just excluding a lot of rows with different criteria. My guess is that most of the tickets end up in states that are excluded from the results?
The problem is that the query doesn't really have any clear criteria for that, and it has a lot of different rules, so that's why it ends up doing a clustered index scan plus key lookups for all the rows. The scan starts from job_info, but I'm not sure it would make any difference if it started from jobticket.
Removing most of the indexes is probably a good place to start, but it won't speed up the select at all.
The query looks quite complex, so my guess is that you can't create an indexed view that would contain this data. That might help, assuming this query is executed often and the data doesn't change that much (and the overhead of maintaining a huge number of indexes would be removed), but it might not be possible.
Another idea would be to investigate the rules for when rows can be excluded, and whether there could be clearer rules for that so it could be indexed, maybe by adding a persisted computed column to the table, as sketched below.
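A rough sketch of that idea; the column name, the status list and the included columns are assumptions, the point is only to turn the "is this ticket still open?" test into something indexable:
-- hypothetical persisted flag; adjust the status list to match the real business rules
ALTER TABLE jobticket
  ADD is_open AS (CASE WHEN status IN ('Complete', 'Completed', 'New Job', 'PRFYCLSFD')
                       THEN 0 ELSE 1 END) PERSISTED;

CREATE NONCLUSTERED INDEX IX_jobticket_is_open
  ON jobticket (is_open, jobnumber) INCLUDE (typeform, status, DPGType);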
You haven't mentioned how long this actually takes or how many rows there are in the tables, so everything is basically just a guess. Including more data and the STATISTICS IO output in the question might help.
P.S. I don't personally recommend using NOLOCK except in really special cases, because it can cause problems that are really hard to solve, like reading the same data more than once or skipping rows entirely.
A simple fix would be to make the indexes on job_info and tblConsolidatedBilling covering, because a ton of time is spent in key lookups there. That should give an integer-factor speedup. If that's not enough, we need to investigate further.
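A sketch of what "covering" means here; the index names are made up and the INCLUDE lists are read straight off the query above, so trim them to what the query actually needs:
CREATE NONCLUSTERED INDEX IX_tblConsolidatedBilling_master_jobnumber
  ON tblConsolidatedBilling (master_jobnumber)
  INCLUDE (job_status);

CREATE NONCLUSTERED INDEX IX_job_info_jobnumber
  ON job_info (jobnumber)
  INCLUDE (IDOrderMaskedStatus, job_status, SalesID, location, TOMetadataID,
           isHidden, isInConcurrentRelease, deleted, expediencyLevel);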

Query Performance in Access 2007 - drawing on a SQL Server Express backend

I've been banging my head against this issue for a little while now and decided I should ask for help. I have a table which holds temperature/humidity chart recorder data (currently over 775,000 records), against which I am trying to run a statistical query. The problem is that this often takes up to two minutes, and sometimes doesn't come back at all, forcing me to close the program (Ctrl-Alt-Delete). At first I didn't have as much of a problem - it was only after I hit the magical 500k-record mark that I started getting serious slowdowns, getting progressively worse as more data was compiled and imported into the table.
Here is the query (pass-through):
SELECT dbo.tblRecorderLogs.strAreaAssigned, Min(dbo.tblRecorderLogs.datDateRecorded) AS FirstRecorderDate, Max(dbo.tblRecorderLogs.datDateRecorded) AS LastRecordedDate,
Round(Avg(dbo.tblRecorderLogs.intTempCelsius),2) AS AverageTempC,
Round(Avg(dbo.tblRecorderLogs.intRHRecorded),2) AS AverageRH,
Count(dbo.tblRecorderLogs.strAreaAssigned) AS Records
FROM dbo.tblRecorderLogs
GROUP BY dbo.tblRecorderLogs.strAreaAssigned
ORDER BY dbo.tblRecorderLogs.strAreaAssigned;
Here is the table structure in which the chart data is stored:
idRecorderDataID Number Primary Key
datDateEntered Date/Time (indexed, duplicates OK)
datTimeEntered Date/Time
intTempCelcius Number
intDewPointCelcius Number
intWetBulbCelcius Number
intMixingGPP Number
intRHRecorded Number
strAssetRecorder Text (indexed, duplicates OK)
strAreaAssigned Text (indexed, duplicates OK)
I am trying to write a program which will allow people to pull data from this table based on Area Assigned, as well as start and end dates. With the dataset size I currently have, this kind of report is simply too much for it to handle (it seems) and the machine doesn't ever return an answer. I've had to extend the ODBC timeout to almost 180 seconds in any queries dealing with this table, simply because of the size. I could use some serious help, if people have some. Thank you in advance!
-- Edited 08/13/2012 # 1050 hours --
I have not been able to test the query on the SQL Server due to the fact that the IT department has taken control of the machine in question, and has someone logged into it full-time using the remote management console. I have tried an interim step to lessen the impact of the performance issue, but I am still looking for a permanent solution to this issue.
Interim step:
I created a local table mirroring the structure of the dbo.tblRecorderLogs SQL Server table, to which I do an INSERT INTO using the former SELECT statement as its subquery. Any subsequent statistical analysis is then drawn from this 'temporary' local table. After the process is complete, the local table is truncated.
-- Edited 08/13/2012 # 1217 hours --
Ran the shown query on the SQL Server Management Console, took 1 minute 38 seconds to complete according to the query timer provided by the console.
-- Edit 08/15/2012 # 1531 hours --
Tried to run the query as a VBA DoCmd.RunSQL statement to populate a temporary table, using the following code:
INSERT INTO tblTempRecorderDataStatsByArea ( strAreaAssigned, datFirstRecord,
    datLastRecord, intAveTempC, intAveRH, intRecordCount )
SELECT dbo_tblRecorderLogs.strAreaAssigned,
    Min(dbo_tblRecorderLogs.datDateRecorded) AS MinOfdatDateRecorded,
    Max(dbo_tblRecorderLogs.datDateRecorded) AS MaxOfdatDateRecorded,
    Round(Avg(dbo_tblRecorderLogs.intTempCelsius),2) AS AveTempC,
    Round(Avg(dbo_tblRecorderLogs.intRHRecorded),2) AS AveRHRecorded,
    Count(dbo_tblRecorderLogs.strAreaAssigned) AS CountOfstrAreaAssigned
FROM dbo_tblRecorderLogs
GROUP BY dbo_tblRecorderLogs.strAreaAssigned
ORDER BY dbo_tblRecorderLogs.strAreaAssigned
The problem arises when the code is executed: the query takes so long that it hits the timeout before it finishes. Still hoping for a 'magic bullet' to fix this...
-- Edited 08/20/2012 # 1241 hours --
The only quasi-solution I've found is running the failed query repeatedly (sort of priming the pump, as it were), so that when the query is called again by my program it has a reasonable chance of actually completing before the ODBC SQL Server driver times out. Basically a filthy, filthy hack - but I don't have a better one to combat this issue.
I've tried creating a view, which works on the server side - but doesn't speed things up.
The proper fields being aggregated are indexed properly, so I can't make any changes there.
I am only pulling information from the database that is immediately useful to user - no 'SELECT * madness' going on here.
I think I am, officially, out of things to try - aside from throwing raw computing horsepower at the problem, which isn't a solution right now as the item isn't live, and I have no budget to procure better hardware. I will post this as an 'answer' and leave it up until Sept 3rd; if I do not have better answers by then, I will accept my own answer and accept defeat.
When I've had to run Min/Max functions on several fields from the same table, I've often found it quicker to do each column separately as a subquery in the FROM clause of the main/outer query.
So your query would look like this:
SELECT rLogs1.strAreaAssigned, rLogs1.FirstRecorderDate, rLogs2.LastRecordedDate, rLogs3.AverageTempC, rLogs4.AverageRH, rLogs5.Records
FROM (((
(SELECT strAreaAssigned, Min(datDateRecorded) AS FirstRecorderDate FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs1
inner join
(SELECT strAreaAssigned, Max(datDateRecorded) AS LastRecordedDate FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs2
on rLogs1.strAreaAssigned = rLogs2.strAreaAssigned)
inner join
(SELECT strAreaAssigned, Round(Avg(intTempCelsius),2) AS AverageTempC FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs3
on rLogs1.strAreaAssigned = rLogs3.strAreaAssigned)
inner join
(SELECT strAreaAssigned, Round(Avg(intRHRecorded),2) AS AverageRH FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs4
on rLogs1.strAreaAssigned = rLogs4.strAreaAssigned)
inner join
(SELECT strAreaAssigned, Count(strAreaAssigned) AS Records FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs5
on rLogs1.strAreaAssigned = rLogs5.strAreaAssigned
ORDER BY rLogs1.strAreaAssigned;
If you take your query and the one above, copy them into the same query window in SQL Server and run the estimated execution plan, you should be able to compare them and see which one works better.

Two radically different queries against 4 mil records execute in the same time - one uses brute force

I'm using SQL Server 2008. I have a table with over 3 million records, which is related to another table with a million records.
I have spent a few days experimenting with different ways of querying these tables. I have it down to two radically different queries, both of which take 6s to execute on my laptop.
The first query uses a brute force method of evaluating possibly likely matches, and removes incorrect matches via aggregate summation calculations.
The second gets all possibly likely matches, then removes incorrect matches via an EXCEPT query that uses two dedicated indexes to find the low and high mismatches.
Logically, one would expect the brute force to be slow and the indexes one to be fast. Not so. And I have experimented heavily with indexes until I got the best speed.
Further, the brute force query doesn't require as many indexes, which means that technically it would yield better overall system performance.
Below are the two execution plans. If you can't see them, please let me know and I'll re-post them in landscape orientation / mail them to you.
Brute-force query:
SELECT ProductID, [Rank]
FROM (
SELECT p.ProductID, ptr.[Rank], SUM(CASE
WHEN p.ParamLo < si.LowMin OR
p.ParamHi > si.HiMax THEN 1
ELSE 0
END) AS Fail
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
GROUP BY p.ProductID, ptr.[Rank]
) AS t
WHERE t.Fail = 0
Index-based exception query:
with si AS (
SELECT DISTINCT pd.ProductDefID, si.LowMin, si.HiMax
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
WHERE si.Mode IN (1, 2)
)
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
EXCEPT
SELECT p.ProductID
FROM dbo.Params AS p
JOIN si
ON si.ProductDefID = p.ProductDefID
WHERE p.ParamLo < si.LowMin OR p.ParamHi > si.HiMax
My question is: based on the execution plans, which one looks more efficient? I realize that things may change as my data grows.
EDIT:
I have updated the indexes, and now have the following execution plan for the second query:
Trust the optimizer.
Write the query that most simply expresses what you're trying to achieve. If you're having performance problems with that query, then you should look at whether there are any missing indexes. But you still shouldn't have to explicitly work with those indexes.
Don't concern yourself with considerations of how you might implement such a search.
In very rare circumstances you may need to force the query to use particular indexes (via hints), but this is probably < 0.1% of queries.
In your posted plans, your "optimized" version is causing scans against two indexes of your (I presume) Params table (PK_Params_1, IX_Params_1). Without seeing the queries, it's difficult to know why this is happening, but if you're comparing a single scan against a table ("brute force") with two scans, it's easy to see why the second isn't more efficient.
I think I'd try:
SELECT p.ProductID, ptr.[Rank]
FROM dbo.SearchItemsGet(@SearchID, NULL) AS si
JOIN dbo.ProductDefs AS pd
ON pd.ParamTypeID = si.ParamTypeID
JOIN dbo.Params AS p
ON p.ProductDefID = pd.ProductDefID
JOIN dbo.ProductTypesResultsGet(@SearchID) AS ptr
ON ptr.ProductTypeID = pd.ProductTypeID
LEFT JOIN Params p_anti
on p_anti.ProductDefId = pd.ProductDefID and
(p_anti.ParamLo < si.LowMin or p_anti.ParamHi > si.HiMax)
WHERE si.Mode IN (1, 2)
AND p_anti.ProductID is null
GROUP BY p.ProductID, ptr.[Rank]
I.e. introduce an anti-join that eliminates the results you don't want.
In SQL Server Management Studio, put both queries in the same query window and get the query plan for both at once. It should determine the query plans for both and give you a 'percent of total batch' for each one. The query with the lower percent of the total batch will be the better performing one.
Does 6 seconds on a laptop = 0.006 seconds on production hardware? The part of your queries that worries me is the clustered index scans shown in the query plan. In my experience, any time a query plan includes a CI scan it means the query will only get slower as data is added to the table.
What do the two functions yield, as it appears they are the cause of the table scans? Is it possible to persist that data in the database and update the LowMin and HiMax as rows are added?
Looking at the two execution plans, neither is very good. Look how far to the left the wide lines are. The wide lines mean there are many rows. We need to reduce the number of rows earlier in the process so we do not work with such large hash tables, large sorts and nested loops.
BTW how many rows does your source have and how many rows are included in the result set?
Thank you all for your input and help.
From reading what you wrote, experimenting, and digging into the execution plan, I discovered that the answer is the tipping point.
There were too many records being returned to warrant use of the index.
See Kimberly Tripp's writing on the tipping point.

Why Is My Inline Table UDF so much slower when I use variable parameters rather than constant parameters?

I have a table-valued, inline UDF. I want to filter the results of that UDF to get one particular value. When I specify the filter using a constant parameter, everything is great and performance is almost instantaneous. When I specify the filter using a variable parameter, it takes a significantly larger chunk of time, on the order of 500x more logical reads and 20x greater duration.
The execution plan shows that in the variable parameter case the filter is not applied until very late in the process, causing multiple index scans rather than the seeks that are performed in the constant case.
I guess my questions are: Why, since I'm specifying a single filter parameter that is going to be highly selective against an indexed field, does my performance go into the weeds when that parameter is in a variable? Is there anything I can do about this?
Does it have something to do with the analytic function in the query?
Here are my queries:
CREATE FUNCTION fn_test()
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT DISTINCT GCN_SEQNO, Drug_package_version_ID
FROM
(
SELECT COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
dpv.Drug_package_version_ID, ROW_NUMBER() OVER (PARTITION BY dpv.Drug_package_version_id ORDER BY
ndctbla.GCN_SEQNO DESC) AS Predicate
FROM dbo.Drug_Package_Version dpv
LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
) iq
WHERE Predicate = 1
GO
GRANT SELECT ON fn_test TO public
GO
-- very fast
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = 10000
GO
-- comparatively slow
DECLARE #dpvid int
SET #dpvid = 10000
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = #dpvid
Once you create a new projection through a UDF, you can't expect the indexes on the original table to still apply to the columns included in the projection. When you filter on the projection (and not inside the UDF against the original table with the indexes), the indexes no longer apply.
What you want to do is parameterize the function so it takes the filter value as a parameter, as sketched below.
If you find that you have too many fields you want to set parameters on, then you might want to take a look at indexed views, since you can create your projection, index it as well, and then run queries against that.
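A sketch of that parameterization; fn_test_by_dpv is a hypothetical name, and the body is just the original function with the filter pushed inside:
-- hypothetical parameterized variant of fn_test
CREATE FUNCTION dbo.fn_test_by_dpv (@dpvid int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT DISTINCT GCN_SEQNO, Drug_package_version_ID
    FROM
    (
        SELECT COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
               dpv.Drug_package_version_ID,
               ROW_NUMBER() OVER (PARTITION BY dpv.Drug_package_version_id
                                  ORDER BY ndctbla.GCN_SEQNO DESC) AS Predicate
        FROM dbo.Drug_Package_Version dpv
        LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
        LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
        WHERE dpv.Drug_package_version_ID = @dpvid
    ) iq
    WHERE Predicate = 1
GO

-- the variable case now filters inside the function rather than over the whole projection
DECLARE @dpvid int
SET @dpvid = 10000
SELECT GCN_SEQNO FROM dbo.fn_test_by_dpv(@dpvid)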
Simply, the constant is easy to evaluate in the plan. The local variable is not, especially with the ranking function and the filter Predicate = 1.
Paraphrasing casparOne: you need to push the filter as far inwards as possible, so that you filter on dpv.Drug_package_version_id inside the iq derived table.
If you do that, then you also have no need for the PARTITION BY, because you have only a single dpv.Drug_package_version_id. You can then do a cleaner ...TOP 1 ... ORDER BY ndctbla.GCN_SEQNO DESC, along the lines of the sketch below.
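Something along these lines (a sketch, reusing the variable from the question):
SELECT TOP 1 COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO
FROM dbo.Drug_Package_Version dpv
LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
WHERE dpv.Drug_package_version_ID = @dpvid
ORDER BY ndctbla.GCN_SEQNO DESC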
The responses I got were good, and I learned from them, but I think I've found an answer that satisfies me.
I do think it's the use of the PARTITION BY clause that is causing the problem here. I reformulated the UDF using a variant of the self-join idiom:
SELECT t1.A, t1.B, t1.C
FROM T t1
INNER JOIN
(
SELECT A, MAX(C) AS C
FROM T
GROUP BY A
) t2 ON t1.A = t2.A AND t1.C = t2.C
Ironically, this is more performant than using the SQL 2008-specific query, and the optimizer also doesn't have a problem joining this version of the query using variables rather than constants. At this point, I'm concluding that the optimizer just doesn't handle the more recent SQL extensions as well as the older constructs. As a bonus, I can make use of the UDF now on my pre-upgrade SQL 2000 platforms.
Thanks for your help, everyone!
