Snowflake Profile Does Not Match Explain - snowflake-cloud-data-platform

I have found a scenario where the EXPLAIN output for a query does not match what is shown under the "Profile" tab of the completed query. There may be a simpler query that reproduces this, but I haven't found one. What I have found is that if you join multiple WITH clauses (CTEs), and one of them contains a UNION, EXPLAIN correctly displays the pruned columns on the Table Scan step, but the Profile does not.
Here are some examples of both working and bugged queries:
-- Join On Union. No With (Good)
SELECT COUNT(*) as "count", RANDOM() cb
FROM (
SELECT *
FROM tpch_sf100.customer t1
JOIN (
SELECT * FROM tpch_sf100.customer
UNION ALL
SELECT * FROM tpch_sf100.customer
) t2 ON t1.c_custkey = t2.c_custkey
)
-- Join On Union. Using With (Good)
SELECT
COUNT(*) AS "count", RANDOM() cb
FROM (
WITH t1 AS (
SELECT * FROM tpch_sf100.customer
UNION ALL
SELECT * FROM tpch_sf100.customer
)
SELECT * FROM tpch_sf100.customer t2
JOIN t1 on t1.c_custkey = t2.c_custkey
)
-- Join On Union. Multiple Withs (Bugged)
EXPLAIN SELECT
COUNT(*) AS "count", RANDOM() cb
FROM (
WITH t1 AS (
SELECT * FROM tpch_sf100.customer
UNION ALL
SELECT * FROM tpch_sf100.customer
), t2 AS (
SELECT * FROM tpch_sf100.customer
)
SELECT *
FROM t1
JOIN t2 on t1.c_custkey = t2.c_custkey
)
For the "Good" queries, the explain and profile both correctly show that only the c_custkey field is fetched in the table scans.
Explain:
Profile:
With the last "Bugged" query, however, the explain correctly shows that only c_custkey was used, while the profile does not.
Explain:
Profile:
I think this is a visual bug in the profile, but would like someone to confirm or explain why these two are different. And if it's true that the query did actually pull back those additional unused fields, why does that step show it scanned the same amount of data as other queries that didn't pull back extra fields?

Related

RIGHT\LEFT Join does not provide null values without condition

I have two tables: a lookup table and a data table. The lookup table has columns named cycleid and cycle. The data table has SID, cycleid and cycle. Below is the structure of the tables.
If you check the data table, a SID may or may not have all the cycles. I want to output, for each SID, the completed cycles as well as the missed ones.
I right joined the lookup table to retrieve the missing as well as the completed cycles. Below is the query I used.
SELECT TOP 1000 [SID]
,s4.[CYCLE]
,s4.[CYCLEID]
FROM [dbo].[data] s3 RIGHT JOIN
[dbo].[lookup_data] s4 ON s3.CYCLEID = s4.CYCLEID
The query does not show the missed values when I query for all the SIDs. When I query for one specific SID with the query below, I get the correct result, including the missed ones.
SELECT TOP 1000 [SID]
,s4.[CYCLE]
,s4.[CYCLEID]
FROM [dbo].[data] s3 RIGHT JOIN [dbo].[lookup_data] s4
ON s3.CYCLEID = s4.CYCLEID
AND s3.SID = 101002
ORDER BY [SID], s4.[CYCLEID]
As I am supplying this query to Tableau, I cannot provide the SID value in the query. I want to return all the SIDs, and I will do the rest in Tableau.
The expected output that I need is as shown below.
I wrote a cross join query like the one below to achieve my expected output:
SELECT DISTINCT
tab.CYCLEID
,tab.SID
,d.CYCLE
FROM ( SELECT d.SID
,d.[CYCLE]
,e.CYCLEID
FROM ( SELECT e.sid
,e.CYCLE
FROM [db_temp].[dbo].[Sheet3$] e
) d
CROSS JOIN [db_temp].[dbo].[Sheet4$] e
) tab
LEFT OUTER JOIN [db_temp].[dbo].[Sheet3$] d
ON d.CYCLEID = tab.CYCLEID
AND d.SID = tab.SID
ORDER BY tab.SID
,tab.CYCLEID;
However, I am not able to use this query for more scenarios, as my data set has nearly 20 to 40 columns and I run into issues when I use the query above.
Is there any way to do this more simply, with only a left or right join? I want the query to return the missing and the completed values for all the SIDs, instead of supplying a single SID in the query.
You can create a master table first (combining every SID with every CYCLEID), then left join it with the data table:
;with ctxMaster as (
select distinct d.SID, l.CYCLE, l.CYCLEID
from lookup_data l
cross join data d
)
select d.SID, m.CYCLE, m.CYCLEID
from ctxMaster m
left join data d on m.SID = d.SID and m.CYCLEID = d.CYCLEID
order by m.SID, m.CYCLEID
Or if you don't want to use common table expression, subquery version:
select d.SID, m.CYCLE, m.CYCLEID
from (select distinct d.SID, l.CYCLE, l.CYCLEID
from lookup_data l
cross join data d) m
left join data d on m.SID = d.SID and m.CYCLEID = d.CYCLEID
order by m.SID, m.CYCLEID
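The difference between the plain outer join and the master-list approach can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the sample rows are hypothetical (the original tables were only shown as images). Selecting m.SID and adding a status column is an illustrative choice, so that missed cycles still show which SID they belong to.

```python
import sqlite3

# Hypothetical sample data: SID 101002 completed cycles 1 and 2,
# SID 101003 completed only cycle 1, and nobody completed cycle 3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup_data (CYCLEID INTEGER, CYCLE TEXT);
CREATE TABLE data (SID INTEGER, CYCLEID INTEGER, CYCLE TEXT);
INSERT INTO lookup_data VALUES (1, 'C1'), (2, 'C2'), (3, 'C3');
INSERT INTO data VALUES (101002, 1, 'C1'), (101002, 2, 'C2'), (101003, 1, 'C1');
""")

# Plain outer join on CYCLEID only: a NULL row appears only for a cycle
# that no SID at all has completed, not per SID.
plain_rows = conn.execute("""
SELECT d.SID, l.CYCLEID
FROM lookup_data l
LEFT JOIN data d ON d.CYCLEID = l.CYCLEID
ORDER BY l.CYCLEID, d.SID
""").fetchall()

# Master list of every (SID, cycle) pair, then left join the data back
# on both keys so each SID gets one row per cycle.
master_rows = conn.execute("""
SELECT m.SID, m.CYCLEID,
       CASE WHEN d.SID IS NULL THEN 'missed' ELSE 'completed' END AS status
FROM (SELECT DISTINCT d.SID, l.CYCLEID
      FROM lookup_data l CROSS JOIN data d) m
LEFT JOIN data d ON m.SID = d.SID AND m.CYCLEID = d.CYCLEID
ORDER BY m.SID, m.CYCLEID
""").fetchall()

for row in master_rows:
    print(row)
```

With this data the plain join never reports that 101003 missed cycle 2, because another SID completed that cycle. The master list yields six rows, one per SID and cycle.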

TSQL - Get sum or subtract from two tables

I want to sum/subtract 'salevalue' from the two tables in my procedure. #possale1 holds the receipts and #possale2 holds the returns, but I am out of ideas.
SELECT *
FROM #possale1
SELECT *
FROM #possale2
SELECT sum(salevalue) AS S1
FROM #possale1
SELECT sum(salevalue)*-1 AS S2
FROM #possale2
select sum(sum(a.salevalue)-sum(b.salevalue))
from #possale1 a
inner join #possale2 b on a.receiptdate=b.receiptdate
Without nesting aggregates, the following should do:
select ((SELECT sum(salevalue) FROM #possale1) - (SELECT sum(salevalue) FROM #possale2)) as balance
Is this what you are trying to do?
SELECT SUM(ISNULL(a.salevalue,0) - ISNULL(b.salevalue,0))
FROM #possale1 a FULL OUTER JOIN #possale2 b on a.receiptdate=b.receiptdate
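As a rough check of the two approaches, the sketch below uses Python's sqlite3 module with made-up rows. The table names drop the # prefix because SQLite has no T-SQL temp tables, and the COALESCE wrappers are an added assumption so that an empty table counts as 0. It compares subtracting two separately aggregated totals with aggregating after a join on receiptdate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE possale1 (receiptdate TEXT, salevalue REAL);  -- receipts (hypothetical rows)
CREATE TABLE possale2 (receiptdate TEXT, salevalue REAL);  -- returns (hypothetical rows)
INSERT INTO possale1 VALUES ('2020-01-01', 100), ('2020-01-01', 50), ('2020-01-02', 30);
INSERT INTO possale2 VALUES ('2020-01-01', 20), ('2020-01-03', 10);
""")

# Aggregate each table on its own, then subtract the two scalar results.
balance = conn.execute("""
SELECT COALESCE((SELECT SUM(salevalue) FROM possale1), 0)
     - COALESCE((SELECT SUM(salevalue) FROM possale2), 0)
""").fetchone()[0]

# Joining on receiptdate before aggregating pairs the single 2020-01-01
# return with both receipts of that day, so it is subtracted twice, and
# the inner join drops the dates that exist in only one table.
joined = conn.execute("""
SELECT SUM(a.salevalue - b.salevalue)
FROM possale1 a
JOIN possale2 b ON a.receiptdate = b.receiptdate
""").fetchone()[0]

print(balance, joined)
```

A full outer join keeps the unmatched dates, but any date with several rows on either side can still be counted more than once, which is why the scalar-subquery form is the safer default here.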

Insert Statement Error Message

Good Morning All
My boss helped me design a query that generates 1.37 million rows of random data. He has now asked me to insert the results into a blank table, but for some reason I cannot get it to work.
The three columns are ArrivalDate, PitchType_Skey and Site_Skey. When I run my query (see below), I get an error message and I don't know why. Can you help?
Msg 121, Level 15, State 1, Line 2
The select list for the INSERT statement contains more items than the insert list. The number of SELECT values must match the number of INSERT columns.
Query:
USE Occupancy
INSERT INTO Bookings (ArrivalDate, Site_Skey, PitchType_Skey)
SELECT
Time.Date, Site.Site_Skey, Site.SiteWeighting, PitchType.PitchType_Skey,
PitchType.PitchTypeWeighting,
RAND(checksum(NEWID())) * Site.SiteWeighting * PitchType.PitchTypeWeighting AS Expr1
FROM
Capacity
INNER JOIN
Site ON Capacity.Site_Skey = Site.Site_Skey
INNER JOIN
PitchType ON Capacity.PitchType_Skey = PitchType.PitchType_Skey
INNER JOIN
Time
INNER JOIN
AGKey ON Time.ArrivalDayWeighting = AGKey.[Key] ON Capacity.StartDate <= Time.Date AND Capacity.EndDate >= Time.Date
CROSS JOIN
(SELECT 0 AS col1
UNION ALL
SELECT 1 AS col1) AS aaav
WHERE
(Time.CalendarYear = 2010)
AND (RAND(checksum(NEWID())) * Site.SiteWeighting * PitchType.PitchTypeWeighting >= 1.22)
Thanks
Wayne
The error message gives you the answer. You have more items in your SELECT list (6)
Time.Date
Site.Site_Skey
Site.SiteWeighting
PitchType.PitchType_Skey
PitchType.PitchTypeWeighting
RAND(checksum(NEWID())) * Site.SiteWeighting * PitchType.PitchTypeWeighting AS Expr1
Than you do in your INSERT list (3)
ArrivalDate
Site_Skey
PitchType_Skey
Either remove some columns from your SELECT list or add some to your INSERT list.
As you haven't given the complete structure of your Bookings table I can only guess that you will need to do this
USE Occupancy
INSERT INTO Bookings
(
ArrivalDate,
Site_Skey,
PitchType_Skey
)
SELECT
Time.Date,
Site.Site_Skey,
PitchType.PitchType_Skey
FROM
Capacity
INNER JOIN Site ON Capacity.Site_Skey = Site.Site_Skey
INNER JOIN PitchType ON Capacity.PitchType_Skey = PitchType.PitchType_Skey
INNER JOIN Time
INNER JOIN AGKey ON Time.ArrivalDayWeighting = AGKey.[Key] ON Capacity.StartDate <= Time.Date AND Capacity.EndDate >= Time.Date
CROSS JOIN
(
SELECT 0 AS col1
UNION ALL
SELECT 1 AS col1
) AS aaav
WHERE
Time.CalendarYear = 2010
AND (RAND(checksum(NEWID())) * Site.SiteWeighting * PitchType.PitchTypeWeighting >= 1.22)
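The rule behind the error can be reproduced on a small scale. The sketch below uses Python's sqlite3 module rather than SQL Server, so the error wording differs from Msg 121. The src table and its single row are hypothetical stand-ins for the joined Capacity, Site, PitchType and Time tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bookings (ArrivalDate TEXT, Site_Skey INTEGER, PitchType_Skey INTEGER);
-- Hypothetical source table standing in for the joined tables.
CREATE TABLE src (Date TEXT, Site_Skey INTEGER, SiteWeighting REAL, PitchType_Skey INTEGER);
INSERT INTO src VALUES ('2010-01-01', 1, 0.8, 7);
""")

# Four SELECT items for three INSERT columns: the statement is rejected.
try:
    conn.execute("""
        INSERT INTO Bookings (ArrivalDate, Site_Skey, PitchType_Skey)
        SELECT Date, Site_Skey, SiteWeighting, PitchType_Skey FROM src
    """)
    mismatch_error = None
except sqlite3.OperationalError as exc:
    mismatch_error = str(exc)

# Three SELECT items for three INSERT columns: the weighting column can
# still be used for filtering, it just must not appear in the SELECT list.
conn.execute("""
    INSERT INTO Bookings (ArrivalDate, Site_Skey, PitchType_Skey)
    SELECT Date, Site_Skey, PitchType_Skey FROM src
    WHERE SiteWeighting > 0.5
""")
inserted = conn.execute("SELECT * FROM Bookings").fetchall()

print(mismatch_error)
print(inserted)
```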
I have found the solution, and I can't believe how easy it was: I just un-ticked the boxes I didn't want in the Query Designer.

Join the table valued function in the query

I have one table, vwuser. I want to join this table with the table-valued function fnuserrank(userID), so I need to cross apply the function:
SELECT *
FROM vwuser AS a
CROSS APPLY fnuserrank(a.userid)
For each userID it generates multiple records. I only want the last record for each empid that does not have a Rank of Term(inated). How can I do this?
Data:
HistoryID empid Rank MonitorDate
1 A1 E1 2012-8-9
2 A1 E2 2012-9-12
3 A1 Term 2012-10-13
4 A2 E3 2011-10-09
5 A2 TERM 2012-11-9
From this data, the 2nd and 4th records must be selected.
In SQL Server 2005+ you can use this Common Table Expression (CTE) to determine the latest record by MonitorDate that doesn't have a Rank of 'Term':
WITH EmployeeData AS
(
SELECT *
, ROW_NUMBER() OVER (PARTITION BY empId ORDER BY MonitorDate DESC) AS RowNumber
FROM vwuser AS a
CROSS APPLY fnuserrank(a.userid)
WHERE Rank != 'Term'
)
SELECT *
FROM EmployeeData AS ed
WHERE ed.RowNumber = 1;
Note: The statement before this CTE will need to end in a semi-colon. Because of this, I have seen many people write them like ;WITH EmployeeData AS...
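To see the ROW_NUMBER pattern run against the sample data from the question, here is a minimal sketch using Python's sqlite3 module. It needs SQLite 3.25 or later for window functions. The history table is a hypothetical stand-in for the output of vwuser cross applied with fnuserrank, the dates are rewritten in ISO format so that text ordering matches date ordering, and UPPER() is added because SQLite compares strings case-sensitively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table standing in for vwuser CROSS APPLY fnuserrank(userid).
CREATE TABLE history (HistoryID INTEGER, empid TEXT, [Rank] TEXT, MonitorDate TEXT);
INSERT INTO history VALUES
  (1, 'A1', 'E1',   '2012-08-09'),
  (2, 'A1', 'E2',   '2012-09-12'),
  (3, 'A1', 'Term', '2012-10-13'),
  (4, 'A2', 'E3',   '2011-10-09'),
  (5, 'A2', 'TERM', '2012-11-09');
""")

# Filter out terminated rows first, number the remaining rows per employee
# from newest to oldest, and keep the newest one.
latest = conn.execute("""
WITH EmployeeData AS (
    SELECT HistoryID, empid, [Rank], MonitorDate,
           ROW_NUMBER() OVER (PARTITION BY empid ORDER BY MonitorDate DESC) AS RowNumber
    FROM history
    WHERE UPPER([Rank]) <> 'TERM'
)
SELECT HistoryID, empid, [Rank], MonitorDate
FROM EmployeeData
WHERE RowNumber = 1
ORDER BY HistoryID
""").fetchall()

for row in latest:
    print(row)
```

With this data the query keeps HistoryID 2 and 4, the records the question asks for.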
You'll have to play with this. Having trouble mocking your schema on sqlfiddle.
Select foo.*
from
(
SELECT *
FROM vwuser AS a
CROSS APPLY fnuserrank(a.userid)
where rank != 'TERM'
) foo
left join
(
SELECT *
FROM vwuser AS b
CROSS APPLY fnuserrank(b.userid)
where rank != 'TERM'
) bar
on foo.empId = bar.empId
and foo.MonitorDate < bar.MonitorDate
where bar.empid is null
I always need to test out left outer joins on dates being higher. The way it works is that you do a left outer join of the set against itself. Every row EXCEPT one per user has row(s) with a higher monitor date. That one row is the one you want. I usually use an example from my code, but I'm on the wrong laptop. To get it working, you can select foo.*, bar.*, look at the results, spot the row you want, and make the condition correct.
You could also do this, which is easier to remember
select foo.*
from
(
SELECT *
FROM vwuser AS a
CROSS APPLY fnuserrank(a.userid)
) foo
join
(
select empid, max(monitordate) maxdate
FROM vwuser AS b
CROSS APPLY fnuserrank(b.userid)
where rank != 'TERM'
group by empid
) bar
on foo.empid = bar.empid
and foo.monitordate = bar.maxdate
I usually prefer to use set based logic over aggregate functions, but whatever works. You can tweak it also by caching the results of your TVF join into a table variable.
EDIT:
http://www.sqlfiddle.com/#!3/613e4/17 - I mocked up your TVF here. Apparently sqlfiddle didn't like "go".
select foo.*, bar.*
from
(
SELECT f.*
FROM vwuser AS a
join fnuserrank f
on a.empid = f.empid
where rank != 'TERM'
) foo
left join
(
SELECT f1.empid [barempid], f1.monitordate [barmonitordate]
FROM vwuser AS b
join fnuserrank f1
on b.empid = f1.empid
where rank != 'TERM'
) bar
on foo.empId = bar.barempid
and foo.MonitorDate < bar.barmonitordate
where bar.barempid is null

Join subquery with min

I'm pulling my hair out over a subquery that I'm using to avoid about 100 duplicates (out of about 40k records). The records that are duplicated are showing up because they have 2 dates in h2.datecreated for a valid reason, so I can't just scrub the data.
I'm trying to get only the earliest date to return. The first subquery (the one that starts with "select distinct address_id", with the MIN) works fine on its own: no duplicates are returned. So it would seem that the left join (or just plain join; I've tried that too) couldn't possibly see the second h2.datecreated, since it doesn't even show up in the subquery. But when I run the whole query, it returns 2 values for some ipc.mfgid's, one with the h2.datecreated that I want, and one that I don't want.
I know it's got to be something really simple, or something that just isn't possible. It really seems like it should work! This is MSSQL. Thanks!
select distinct ipc.mfgid as IPC, h2.datecreated,
case when ad.Address is null
then ad.buildingname end as Address, cast(trace.name as varchar)
+ '-' + cast(trace.Number as varchar) as ONT,
c.ACCOUNT_Id,
case when h.datecreated is not null then h.datecreated
else h2.datecreated end as Install
from equipmentjoin as ipc
left join historyjoin as h on ipc.id = h.EQUIPMENT_Id
and h.type like 'add'
left join circuitjoin as c on ipc.ADDRESS_Id = c.ADDRESS_Id
and c.GRADE_Code like '%hpna%'
join (select distinct address_id, equipment_id,
min(datecreated) as datecreated, comment
from history where comment like 'MAC: 5%' group by equipment_id, address_id, comment)
as h2 on c.address_id = h2.address_id
left join (select car.id, infport.name, carport.number, car.PCIRCUITGROUP_Id
from circuit as car (NOLOCK)
join port as carport (NOLOCK) on car.id = carport.CIRCUIT_Id
and carport.name like 'lead%'
and car.GRADE_Id = 29
join circuit as inf (NOLOCK) on car.CCIRCUITGROUP_Id = inf.PCIRCUITGROUP_Id
join port as infport (NOLOCK) on inf.id = infport.CIRCUIT_Id
and infport.name like '%olt%' )
as trace on c.ccircuitgroup_id = trace.pcircuitgroup_id
join addressjoin as ad (NOLOCK) on ipc.address_id = ad.id
The typical approach to only getting the lowest row is one of the following. You didn't bother to specify what version of SQL Server you're using, what you want to do with ties, and I have little interest to try to work this into your complex query, so I'll show you an abstract simplification for different versions.
SQL Server 2000
SELECT x.grouping_column, x.min_column, x.other_columns ...
FROM dbo.foo AS x
INNER JOIN
(
SELECT grouping_column, min_column = MIN(min_column)
FROM dbo.foo GROUP BY grouping_column
) AS y
ON x.grouping_column = y.grouping_column
AND x.min_column = y.min_column;
SQL Server 2005+
;WITH x AS
(
SELECT grouping_column, min_column, other_columns,
rn = ROW_NUMBER() OVER (PARTITION BY grouping_column ORDER BY min_column)
FROM dbo.foo
)
SELECT grouping_column, min_column, other_columns
FROM x
WHERE rn = 1;
This subquery:
select distinct address_id, equipment_id,
min(datecreated) as datecreated, comment
from history where comment like 'MAC: 5%' group by equipment_id, address_id, comment
will probably return multiple rows per address, because the comment is not guaranteed to be the same on every history row.
Try this instead:
CROSS APPLY (
SELECT TOP 1 H2.DateCreated, H2.Comment -- H2.Equipment_id wasn't used
FROM History H2
WHERE
H2.Comment LIKE 'MAC: 5%'
AND C.Address_ID = H2.Address_ID
ORDER BY DateCreated
) H2
Switch that to OUTER APPLY in case you want rows that don't have a matching desired history entry.
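The point about comment in the GROUP BY can be illustrated with a small sketch in Python's sqlite3 module. The rows are hypothetical, and because SQLite has no CROSS APPLY the second query simply groups by address_id alone. That is a simpler stand-in for the TOP 1 ... ORDER BY DateCreated pattern above, not the same statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical rows: address 10 has two history entries whose comments differ.
CREATE TABLE history (address_id INTEGER, equipment_id INTEGER,
                      datecreated TEXT, comment TEXT);
INSERT INTO history VALUES
  (10, 1, '2012-01-05', 'MAC: 5A'),
  (10, 1, '2012-03-09', 'MAC: 5B'),
  (20, 2, '2012-02-01', 'MAC: 5C');
""")

# Grouping by comment produces one group per distinct comment, so
# address 10 comes back twice and MIN() is taken within each group.
with_comment = conn.execute("""
SELECT address_id, equipment_id, MIN(datecreated), comment
FROM history
WHERE comment LIKE 'MAC: 5%'
GROUP BY equipment_id, address_id, comment
ORDER BY address_id, MIN(datecreated)
""").fetchall()

# Grouping by the join key alone leaves exactly one earliest date per address.
earliest = conn.execute("""
SELECT address_id, MIN(datecreated)
FROM history
WHERE comment LIKE 'MAC: 5%'
GROUP BY address_id
ORDER BY address_id
""").fetchall()

print(with_comment)
print(earliest)
```

Joining the first result on address_id duplicates every row for address 10, which matches the symptom described in the question.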
