Snowflake SQL UDTF join performance

Coming from SQL Server land, I have really come to enjoy querying table-valued functions with the CROSS APPLY / OUTER APPLY pattern. It appears Snowflake SQL does not have that kind of command, but I was able to replicate the functionality. Can you tell me if I am going to have performance issues with how this is wired up?
TVF
create or replace function udf_user_friends(user_id varchar)
returns table(user_id varchar, friend_id varchar)
as
$$
    select
        userjson:user_id::string as user_id
        , f.value::string as friend_id
    from yelp.user u,
        lateral flatten (input=>split(u.userjson:friends, ',')) f
    where userjson:user_id = user_id
$$;
Query that uses the udf
SELECT
u.userjson:user_id::string as user_id
, t.friend_id
FROM yelp.user u
LEFT JOIN TABLE(udf_user_friends(u.userjson:user_id::string)) t;

Fundamentally this seems rather wrong to me. Your UDTF is joined back to the very table you already have the data from, so it appears to be in the class of functions that "hide the complexity", which is also the class of problem called "hiding the performance".
I started rewriting your outer select with the UDTF removed, only to discover the result was exactly the content of the UDTF, which rather proves the UDTF adds zero value:
SELECT
u.userjson:user_id::string as user_id
,f.value::string as friend_id
FROM yelp.user u,
lateral flatten (input=>split(u.userjson:friends, ',')) f
One thing we do use UDTFs for is pushing filtering down when doing permissioning, where the optimizer sometimes does not see that filters can be pushed down.
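For illustration, a minimal sketch of that permissioning pattern, with made-up table and column names; the filter lives inside the UDTF body, so it is always applied at the source no matter how the outer query is written:
-- hypothetical secured table and filter; names are illustrative only
create or replace function udtf_orders_for_region(p_region varchar)
returns table(order_id varchar, amount number)
as
$$
    select o.order_id, o.amount
    from orders o
    where o.region = p_region
$$;

-- callers query the function rather than the base table
select t.order_id, t.amount
from table(udtf_orders_for_region('EMEA')) t;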

Related

Force joined view not to be optimized

I have a somewhat complex view which includes a join to another view. For some reason the generated query plan is highly inefficient. The query runs for many hours. However, if I select the sub-view into a temporary table first and then join with that, the same query finishes in a few minutes.
My question is: Is there some kind of query hint or other trick which will force the optimizer to execute the joined sub-view in isolation before performing the join, just as when using a temp table? Clearly the default strategy chosen by the optimizer is not optimal.
I cannot use the temporary-table trick since views do not allow temporary tables. I understand I could probably rewrite everything as a stored procedure, but that would break the composability of views, and it also seems bad for maintenance to rewrite everything just to trick the optimizer into not using a bad optimization.
Adam Machanic explained one such way at a SQL Saturday I recently attended. The presentation was called Clash of the Row Goals. The method involves putting a TOP X at the beginning of the sub-select. He explained that when you use a TOP X, the query optimizer plans as if it only needs to produce those TOP X rows, which changes how it treats the sub-select. As long as you set X to a sufficiently large number (the limit of INT or BIGINT?), the query will always return the correct results.
So one example that Adam provided:
SELECT
    x.EmployeeId,
    y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
    SELECT
        y0.ManagerId,
        COUNT(*) AS totalWorkers
    FROM HumanResources.Employee AS y0
    GROUP BY
        y0.ManagerId
) AS y ON
    y.ManagerId = x.ManagerId
becomes:
SELECT
    x.EmployeeId,
    y.totalWorkers
FROM HumanResources.Employee AS x
INNER JOIN
(
    SELECT TOP(2147483647)
        y0.ManagerId,
        COUNT(*) AS totalWorkers
    FROM HumanResources.Employee AS y0
    GROUP BY
        y0.ManagerId
) AS y ON
    y.ManagerId = x.ManagerId
It is a super cool trick and very useful.
When things get messy the query optimizer often resorts to loop joins.
If materializing to a temp table fixed it, then most likely that is the problem.
The optimizer often does not deal with views very well.
I would rewrite your view to not use nested views.
Join Hints (Transact-SQL)
You may be able to use these hints on views
Try merge and hash (a syntax sketch is at the end of this answer).
Try changing the order of joins.
Move conditions into the join whenever possible:
select *
from table1
join table2
on table1.FK = table2.Key
where table2.desc = 'cat1'
should be
select *
from table1
join table2
on table1.FK = table2.Key
and table2.desc = 'cat1'
Now the query optimizer will get that one correct, but as the query gets more complex the optimizer goes into what I call stupid mode and uses loop joins. That is also done to protect the server and keep as little in memory as possible.
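For the merge/hash hint suggestion above, a rough sketch of the syntax (illustrative only, reusing the table1/table2 example; join hints are a last resort):
-- hint written inline on the join
SELECT *
FROM table1
INNER HASH JOIN table2
    ON table1.FK = table2.[Key]
    AND table2.[desc] = 'cat1';

-- or as a query-level hint, leaving the join syntax unchanged
SELECT *
FROM table1
JOIN table2
    ON table1.FK = table2.[Key]
    AND table2.[desc] = 'cat1'
OPTION (MERGE JOIN);
Note that specifying an inline join hint also forces the join order for the whole query, which is part of why these should be used sparingly.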

Why does the stored procedure need a letter as a suffix?

I have a stored Procedure as:
CREATE PROCEDURE [dbo].[spGetEmployeesNotInSkill]
AS
BEGIN
    SELECT COUNT (*) as Total FROM
    (
        SELECT tblUser.EmployeeID FROM tblUser where tblUser.FirstName <> 'guest'
        EXCEPT
        SELECT tblSkillMetrics.EmployeeID FROM tblSkillMetrics
    ) r -- why is 'r' used here?
END
What I want to know is: why are we using this r? If we change the r to any other letter "a/b/c/...x/y/z" it gives the correct output, but if we remove it, it shows an error.
Can anyone please explain this to me?
Whenever you introduce a subquery, CTE, or anything else that will be providing rows as part of a query, you need to provide a name by which that particular set of rows can be referred to elsewhere in the query.
In the case of a table or view, the introduction of an alias is optional, and if omitted, the name of the table or view is used. But for anything else, the name must be explicitly provided.
E.g. you could have had:
SELECT COUNT (*) as Total FROM (
    SELECT tblUser.EmployeeID FROM tblUser u where u.FirstName <> 'guest'
    EXCEPT
    SELECT tblSkillMetrics.EmployeeID FROM tblSkillMetrics
) r
Here I've introduced an alias (u) for tblUser in the inner query. So in your query, r is the name/alias being used for the subquery as a whole.
The r is being used as a table alias. The query is saying "select the number of records in this table that I'm going to call r". So you can call it anything you like.
When you select from a virtual table like this you have to give it an alias - it might be more helpful to call it something like SkilledUsers but as it's only used in one place it's quite common only to use a single character.
It's a quirk of the SQL Server implementation of SQL. The "r" is an alias for the subquery: SQL Server requires that it must be named, even if it is not otherwise referenced. You could just as easily have named it FOO or SUBQUERY or EMPLOYEES.
Other implementations of SQL don't impose this syntactic restriction.
We are using r as a table alias. When you use select * from (select cn from table_name) r you are assigning the name r to the derived table (select cn from table_name). It is useful when you are using joins.
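To illustrate that last point with the tables from the question (a purely hypothetical query, not part of the original procedure), the alias r is what lets you refer to the derived table's columns elsewhere:
SELECT r.EmployeeID, u.FirstName
FROM (
    SELECT tblUser.EmployeeID FROM tblUser WHERE tblUser.FirstName <> 'guest'
    EXCEPT
    SELECT tblSkillMetrics.EmployeeID FROM tblSkillMetrics
) AS r
INNER JOIN tblUser AS u ON u.EmployeeID = r.EmployeeID;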

CROSS APPLY with table valued function restriction performance

I have a problem with CROSS APPLY and a parametrised table-valued function.
Here is a simplified pseudo-code example:
SELECT *
FROM (
    SELECT lor.*
    FROM LOT_OF_ROWS_TABLE lor
    WHERE ...
) AS lor
CROSS APPLY dbo.HeavyTableValuedFunction(lor.ID) AS htvf
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
WHERE ...
The inner select on table LOT_OF_ROWS_TABLE returns many rows.
Joining tables LOT_OF_ROWS_TABLE and ANOTHER_TABLE returns only one or a few rows.
The table-valued function is very time consuming, and when it is called for a lot of rows the select takes a very long time.
My problem:
The function is called for all rows returned from LOT_OF_ROWS_TABLE, regardless of the fact that the result will be limited once ANOTHER_TABLE is joined.
The select has to stay in the shown format - it is generated, and in reality it is much more complicated.
If I rewrite it by hand it can be very fast, but for the reasons above it cannot be rewritten like this:
SELECT *
FROM (
    SELECT lor.*
    FROM LOT_OF_ROWS_TABLE lor
    WHERE ...
) AS lor
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
CROSS APPLY dbo.HeavyTableValuedFunction(at.ID) AS htvf
WHERE ...
I'd like to know:
Is there any setting or hint or something that forces the select to call the function only for the finally restricted rows?
Thank you.
EDIT:
The table valued function is very complex: http://pastebin.com/w6azRvxR.
The select we are talking about is "user configured" and generated: http://pastebin.com/bFbanY2n.
You can divide this query into two parts, using either a table variable or a temp table:
-- note: if lor and at expose duplicate column names (e.g. ID), alias one of them, or SELECT ... INTO will fail
SELECT lor.*, at.* INTO #tempresult
FROM (
    SELECT lor.*
    FROM LOT_OF_ROWS_TABLE lor
    WHERE ...
) lor
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
WHERE ...
Now do the time-consuming part, which is the table-valued function:
SELECT * FROM #tempresult
CROSS APPLY dbo.HeavyTableValuedFunction(#tempresult.ID) AS htvf
I believe this is what you are looking for.
Plan Forcing Scenario: Create a Plan Guide to Force a Plan Obtained from a Rewritten Query
Basically it describes re-writing the query to get a generated plan using the correct order of joins. Then saving off that plan and forcing your existing query (that does not get changed) to use the plan you saved off.
The BOL link I put in even gives a specific example of rewriting the query, putting the joins in a different order and using a FORCE ORDER hint, then using sp_create_plan_guide to take the plan from the rewritten query and apply it to the original query.
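A minimal sketch of what that rewritten-query half could look like here (hypothetical; your real, generated query stays untouched and only borrows the plan this version produces):
SELECT *
FROM LOT_OF_ROWS_TABLE AS lor
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
CROSS APPLY dbo.HeavyTableValuedFunction(at.ID) AS htvf
OPTION (FORCE ORDER);
The plan captured from this version is then attached to the original statement with sp_create_plan_guide, as the linked article walks through.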
YES and NO... it's hard to interpret what you're trying to achieve without sample data in and results out to compare outcomes.
I'd like to know:
Is there any setting or hint or something that forces select to call
function only for finally restricted rows?
So I'll answer your question above (3 years later!!) directly, with a direct statement:
You need to learn about CTEs and the difference between CROSS APPLY and INNER JOIN, and why using CROSS APPLY in your case is necessary. You "could" take the code in your function and fold it into a single SQL statement using a CTE.
ie:
Read this and this.
Essentially, something like this...
WITH t2o AS
(
    SELECT t2.*, ROW_NUMBER() OVER (PARTITION BY t1_id ORDER BY rank) AS rn
    FROM t2
)
SELECT t1.*, t2o.*
FROM t1
INNER JOIN t2o
    ON t2o.t1_id = t1.id
    AND t2o.rn <= 3
Run the query that extracts the data you want ONCE, as a CTE, and then apply your second query to it using CROSS APPLY.
You have no choice; you cannot do what you're trying to do in ONE SQL statement otherwise.

Dynamic inner query

Is there a way to code a dynamic inner query? Basically, I find myself typing something like the following query over and over:
;with tempData as (
--this inner query is the part that changes, but there's always a timeGMT column.
select timeGMT, dataCol2, dataCol3
from tbl1 t1
join tbl2 t2 on t1.ID=t2.ID
)
select dateadd(ss,d.gmtOffset,t.timeGMT) timeLocal,
t.*
from tempData t
join dst d on t.timeGMT between d.sTimeGMT and d.eTimeGMT
where d.zone = 'US-Eastern'
The only thing I can think of is a stored proc with the inner query text as the input to some dynamic SQL... However, my understanding of the optimizer (which is, admittedly, limited) says this isn't really a good idea.
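For reference, a hedged sketch of what that dynamic-SQL idea could look like (the procedure and parameter names here are made up; concatenating query text like this carries the usual SQL-injection and permissions caveats):
CREATE PROCEDURE usp_LocalizeTimes
    @innerQuery nvarchar(max)
AS
BEGIN
    DECLARE @sql nvarchar(max);

    -- wrap the caller-supplied inner query in the static outer query
    SET @sql = N';with tempData as (' + @innerQuery + N')
        select dateadd(ss, d.gmtOffset, t.timeGMT) timeLocal,
               t.*
        from tempData t
        join dst d on t.timeGMT between d.sTimeGMT and d.eTimeGMT
        where d.zone = ''US-Eastern'';';

    EXEC sp_executesql @sql;
END

-- usage:
-- EXEC usp_LocalizeTimes N'select timeGMT, dataCol2, dataCol3 from tbl1 t1 join tbl2 t2 on t1.ID = t2.ID';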
From a performance perspective, what you have there is the version on which I would expect the optimizer to do the best job.
If the "outer" part of your example is static and code maintenance overrides performance, I'd look to encapsulating the dateadd result in a table-valued function (TVF). Since the time conversion is very much the common thread in these queries, I would definitely focus on that part of the workload.
For example, your query that can vary would look like this:
select timeGMT, dataCol2, dataCol3, lt.timeLocal
from tbl1 t1
join tbl2 t2 on t1.ID = t2.ID
cross apply dbo.LocalTimeGet(timeGMT, 'US-Eastern') AS lt
Where the TVF dbo.LocalTimeGet contains the logic for dateadd(ss,d.gmtOffset,t.timeGMT) and the lookup of the time zone offset value based on the time zone name. The implementation of that function would look something like:
CREATE FUNCTION dbo.LocalTimeGet (
    @TimeGMT datetime,
    @TimeZone varchar(20)
)
RETURNS TABLE
AS
RETURN (
    SELECT DATEADD(ss, d.gmtOffset, @TimeGMT) AS timeLocal
    FROM dst AS d
    WHERE d.zone = @TimeZone
);
GO
The upside of this approach is that when you upgrade to 2008 or later there are system functions you could use to make this conversion a lot easier to code, and you'll only have to alter the TVF. If your result sets are small, I'd consider a scalar function (SQL 2008) over a TVF, even if it just wraps those same system functions. Based on your comment, it sounds like the system functions won't do what you need, but you could still stick with your implementation of a dst table, which is encapsulated in the TVF above.
Multi-statement TVFs can be a performance problem because the optimizer assumes they only return 1 row; an inline TVF like the one above is expanded into the calling query instead.
If you need to combine encapsulation and performance, then I'd do the time zone calc in the application code instead. Even though you'd have to apply it to each project that uses it, you would only have to implement it once in each project (in the data access layer) and could treat it as a common utility library if you'll be using it across projects.
To answer the OP's follow-on question, a SQL Server 2008 solution would look like this:
First, create permanent definitions:
CREATE TYPE dbo.tempDataType AS TABLE (
timeGMT DATETIME,
dataCol2 int,
dataCol3 int)
GO
CREATE PROCEDURE ComputeDateWithDST
    @tempData tempDataType READONLY
AS
SELECT dateadd(ss,d.gmtOffset,t.timeGMT) timeLocal, t.*
FROM @tempData t
JOIN dst d ON t.timeGMT BETWEEN d.sTimeGMT AND d.eTimeGMT
WHERE d.zone = 'US-Eastern'
GO
Thereafter, whenever you want to plug a subquery (which has now become a separate query, no longer a CTE) into the stored procedure:
DECLARE @tempData tempDataType
INSERT @tempData
-- sample subquery:
SELECT timeGMT, dataCol2, dataCol3
FROM tbl1 t1
JOIN tbl2 t2 ON t1.ID=t2.ID
EXEC ComputeDateWithDST @tempData;
GO
Performance could be an issue because you'd be running separately what used to be a CTE instead of letting SQL Server combine it with the main query to optimize the execution plan.

SQL Server 2005 Performance: Distinct or full table in WHERE IN statement

We have two Tables:
Document: id, title, document_type_id, showon_id
DocumentType: id, name
Relationship: DocumentType hasMany Documents. (Document.document_type_id = DocumentType.id)
We wish to retrieve a list of all document types for one given ShowOn_Id.
We see two possibilities:
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42
);
SELECT DocumentType.*
FROM DocumentType
WHERE DocumentType.id IN (
SELECT Document.document_type_id FROM Document WHERE showon_id = 42
);
Our question is: when, and if, is it better to use the DISTINCT to get the smaller record set, versus retrieving the whole table and having the IN statement walk it to the first match? (We guess that's what it does ;-))
Is this different for different databases, is there a common answer?
Or is there a better way of doing it? (We are in .NET land)
You can use a join:
SELECT DISTINCT DocumentType.*
FROM DocumentType
INNER JOIN Document
ON DocumentType.id=Document.document_type_id
WHERE Document.showon_id = 42
I think it's the best way to do it.
For the best performance you should use:
SELECT DISTINCT dt.*
FROM
DocumentType dt
INNER JOIN Document d ON dt.id=d.document_type_id and d.showon_id = 42
Joins are very efficient at bridging multiple tables, whereas the nested query in the WHERE clause needs to perform a separate result selection that then filters down the FROM clause results. The join statement is also much more readable.
I would also put an index on showon_id, in addition to the primary keys and foreign key relationship.
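A minimal sketch of that index suggestion (the index name is illustrative; INCLUDE simply covers the joined column so the seek does not need extra lookups):
CREATE NONCLUSTERED INDEX IX_Document_showon_id
    ON Document (showon_id)
    INCLUDE (document_type_id);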
My answer differs from wmasm's answer only by moving the showon_id filter up into the inner join. For MS SQL 2k5, I think the interpreter is smart enough to do this automatically, but you always want to work with the smallest result set possible. Bringing your filters up into the inner join statements can limit the number of rows the query has to work with when joining many tables together. If you do this, though, you should understand that it happens for every row comparison, so complex filters (such as x LIKE '%a' or function calls) are better left in the WHERE clause so that the inner joins can filter out unnecessary comparisons first.
Use an EXISTS. It is sometimes faster and, in my opinion, more readable than a DISTINCT and JOIN. Just for kicks, please reply with the query plan for this query and the JOIN above, and see if anything is different (they may be optimized down to the same plan). If they are the same, I'd recommend the EXISTS as it is closer to a "plain language" description than a JOIN (because you don't want any of the data from Document, etc.):
SELECT whatever
FROM DocumentType dt
WHERE EXISTS (SELECT *
              FROM Document
              WHERE dt.id = document_type_id
                AND showon_id = 42)
To get the query plan (ref: http://msdn.microsoft.com/en-us/library/ms180765(SQL.90).aspx), do:
SET SHOWPLAN_TEXT ON
GO
SELECT ...
GO
From my point of view it should not make any difference inside SQL Server (but who knows how this is implemented).
Think of it this way: to return the result set the server needs to go into the Document table and retrieve all document_type_id values WHERE showon_id = 42. In the process of retrieving the document_type_ids (e.g. by index seeking) it puts them into a hash table. When this process has finished, the hash table contains distinct values anyway. After that, query execution goes into the DocumentType table, scans the primary key and probes into the hash table. Note that this depends, e.g. maybe it's more efficient to not use a hash table when the expected row count from the Document table is low compared to DocumentType, but in general you get the same query plan as for the query wmasm just suggested.
Follow up on Matt's answer:
I've enabled the query plan and tested the following four different queries that have come up so far:
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT DISTINCT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DocumentType.* FROM DocumentType WHERE DocumentType.id IN (SELECT Document.document_type_id FROM Document WHERE showon_id = 42);
SELECT DISTINCT DocumentType.* FROM DocumentType INNER JOIN Document ON DocumentType.id=Document.document_type_id WHERE Document.showon_id = 42;
SELECT DocumentType.* FROM DocumentType WHERE EXISTS ( SELECT * FROM Document WHERE DocumentType.id=Document.document_type_id AND showon_id = 42);
The query plan for all four queries turned out to be the same:
|--Hash Match(Right Semi Join, HASH:([Document].[document_type_id])=([DocumentType].[Id]))
|--Hash Match(Inner Join, HASH:([Document].[Title], [Uniq1005])=([Document].[Title], [Uniq1005]), RESIDUAL:([Document].[Title] as [Document].[Title] = [Document].[Title] as [Document].[Title] AND [Uniq1005] = [Uniq1005]))
| |--Index Seek(OBJECT:([Document].[IX_Document_3] AS [Document]), SEEK:([Document].[showon_id]=(1)) ORDERED FORWARD)
| |--Index Scan(OBJECT:([Document].[IX_Document_1] AS [Document]))
|--Table Scan(OBJECT:([DocumentType] AS [DocumentType]))
I am not sure what every line and element means, but it seems that from the performance perspective it does not matter how you construct the query for this kind of problem...
