I am using the Snowflake Kafka Sink Connector to ingest data from Debezium into a Snowflake table. I have created a Stream and a Task on this table. As the data from Kafka lands into the source table, the stream gets populated and the task runs a MERGE command to write the data into a final table.
However, as the stream has grown moderately large with about 50 million rows, the task fails to run to completion and times out.
To address this, I have tried the following:
Increased the task timeout from 1 hour to 24 hours.
Increased the warehouse size to Medium.
The task still doesn't finish after 24 hours and times out.
Does ingesting 50M rows really require an even larger warehouse? How do I get the task to run to completion?
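For reference, the task changes were along these lines (a sketch only; the task and warehouse names here are placeholders, not the actual objects):
ALTER TASK raw.message_merge_task SUSPEND;
ALTER TASK raw.message_merge_task SET
  WAREHOUSE = merge_wh
  USER_TASK_TIMEOUT_MS = 86400000;  -- milliseconds: 24 hours, the maximum allowed
ALTER WAREHOUSE merge_wh SET WAREHOUSE_SIZE = 'MEDIUM';
ALTER TASK raw.message_merge_task RESUME;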
MERGE statement
MERGE INTO TARGET.MESSAGE AS P
USING (SELECT RECORD_CONTENT:payload:before.id::VARCHAR AS BEFORE_ID,
RECORD_CONTENT:payload:before.agency_id::VARCHAR AS BEFORE_AGENCY_ID,
RECORD_CONTENT:payload:after.id::VARCHAR AS AFTER_ID,
RECORD_CONTENT:payload:after.agency_id::VARCHAR AS AFTER_AGENCY_ID,
RECORD_CONTENT:payload:after::VARIANT AS PAYLOAD,
RECORD_CONTENT:payload:source.ts_ms::INT AS TS_MS,
RECORD_CONTENT:payload:op::VARCHAR AS OP
FROM RAW.MESSAGE_STREAM
QUALIFY ROW_NUMBER() OVER (
PARTITION BY COALESCE(AFTER_ID, BEFORE_ID), COALESCE(AFTER_AGENCY_ID, BEFORE_AGENCY_ID)
ORDER BY TS_MS DESC
) = 1) PS ON (P.ID = PS.AFTER_ID AND P.AGENCY_ID = PS.AFTER_AGENCY_ID) OR
(P.ID = PS.BEFORE_ID AND P.AGENCY_ID = PS.BEFORE_AGENCY_ID)
WHEN MATCHED AND PS.OP = 'd' THEN DELETE
WHEN MATCHED AND PS.OP IN ('u', 'r') THEN UPDATE SET P.PAYLOAD = PS.PAYLOAD, P.TS_MS = PS.TS_MS
WHEN NOT MATCHED AND PS.OP IN ('c', 'r', 'u') THEN INSERT (P.ID, P.AGENCY_ID, P.PAYLOAD, P.TS_MS) VALUES (PS.AFTER_ID, PS.AFTER_AGENCY_ID, PS.PAYLOAD, PS.TS_MS);
EXPLAIN Plan
GlobalStats:
partitionsTotal=742
partitionsAssigned=742
bytesAssigned=3596441600
Operations:
1:0 ->Result number of rows inserted, number of rows updated, number of rows deleted
1:1 ->WindowFunction ROW_NUMBER() OVER (PARTITION BY IFNULL(TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'after'), 'id')), TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'before'), 'id'))), IFNULL(TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'after'), 'agency_id')), TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'before'), 'agency_id'))) ORDER BY TO_NUMBER(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'source'), 'ts_ms')) DESC NULLS FIRST)
1:2 ->LeftOuterJoin joinFilter: ((P.ID = (TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'after'), 'id')))) AND (P.AGENCY_ID = (TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'after'), 'agency_id'))))) OR ((P.ID = (TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'before'), 'id')))) AND (P.AGENCY_ID = (TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'before'), 'agency_id')))))
1:3 ->Filter ROW_NUMBER() OVER (PARTITION BY IFNULL(TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'after'), 'id')), TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'before'), 'id'))), IFNULL(TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'after'), 'agency_id')), TO_CHAR(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'before'), 'agency_id'))) ORDER BY TO_NUMBER(GET(GET(GET(UNION_ALL(CHANGES."a_RECORD_CONTENT", CHANGES."d_RECORD_CONTENT"), 'payload'), 'source'), 'ts_ms')) DESC NULLS FIRST) = 1
1:4 ->UnionAll
1:5 ->Filter CHANGES.A_METADATA$ACTION IS NOT NULL
1:6 ->WithReference
1:7 ->WithClause CHANGES
1:8 ->Filter (A.METADATA$SHORTNAME IS NULL) OR (D.METADATA$SHORTNAME IS NULL) OR (NOT(EQUAL_NULL(SCAN_FDN_FILES.RECORD_METADATA, SCAN_FDN_FILES.RECORD_METADATA))) OR (NOT(EQUAL_NULL(SCAN_FDN_FILES.RECORD_CONTENT, SCAN_FDN_FILES.RECORD_CONTENT)))
1:9 ->FullOuterJoin joinKey: (D.METADATA$ROW_ID = A.METADATA$ROW_ID) AND (D.METADATA$SHORTNAME = A.METADATA$SHORTNAME)
1:10 ->TableScan DATABASE.RAW.MESSAGE as SCAN_FDN_FILES METADATA$PARTITION_ROW_NUMBER, METADATA$PARTITION_NAME, RECORD_METADATA, RECORD_CONTENT, METADATA$ORIGINAL_PARTITION_NAME, METADATA$ORIGINAL_PARTITION_ROW_NUMBER {partitionsTotal=17, partitionsAssigned=17, bytesAssigned=20623360}
1:11 ->TableScan DATABASE.RAW.MESSAGE as SCAN_FDN_FILES METADATA$PARTITION_ROW_NUMBER, METADATA$PARTITION_NAME, RECORD_METADATA, RECORD_CONTENT, METADATA$ORIGINAL_PARTITION_NAME, METADATA$ORIGINAL_PARTITION_ROW_NUMBER {partitionsTotal=507, partitionsAssigned=507, bytesAssigned=3519694336}
1:12 ->Filter CHANGES.D_METADATA$ACTION IS NOT NULL
1:13 ->WithReference
1:14 ->TableScan DATABASE.TARGET.MESSAGE as P ID, AGENCY_ID {partitionsTotal=218, partitionsAssigned=218, bytesAssigned=56123904}
Query Profile
I have re-jiggled your SQL just so it's more readable to me.
MERGE INTO target.message AS p
USING (
SELECT
record_content:payload:before.id::VARCHAR AS before_id,
record_content:payload:before.agency_id::VARCHAR AS before_agency_id,
record_content:payload:after.id::VARCHAR AS after_id,
record_content:payload:after.agency_id::VARCHAR AS after_agency_id,
record_content:payload:after::VARIANT AS payload,
record_content:payload:source.ts_ms::INT AS ts_ms,
record_content:payload:op::VARCHAR AS op,
COALESCE(after_id, before_id) AS id_a,
COALESCE(after_agency_id, before_agency_id) AS id_b
FROM raw.message_stream
QUALIFY ROW_NUMBER() OVER (PARTITION BY id_a, id_b ORDER BY ts_ms DESC ) = 1
) AS ps
ON (p.id = ps.after_id AND p.agency_id = ps.after_agency_id) OR
(p.id = ps.before_id AND p.agency_id = ps.before_agency_id)
WHEN MATCHED AND ps.op = 'd'
THEN DELETE
WHEN MATCHED AND ps.op IN ('u', 'r')
THEN UPDATE SET p.payload = ps.payload, p.ts_ms = ps.ts_ms
WHEN NOT MATCHED AND ps.op IN ('c', 'r', 'u')
THEN INSERT (p.id, p.agency_id, p.payload, p.ts_ms)
VALUES (ps.after_id, ps.after_agency_id, ps.payload, ps.ts_ms);
I don't see anything super horrible here. I pushed the COALESCE of the two values used in the QUALIFY up into the SELECT just so it reads more simply.
But looking at the ON logic, you are prepared to match on the before values if the after values don't match. Mixing that with the COALESCE logic raises a question: are the two after values always null at the same time, i.e. if after_id is null, is after_agency_id also null? And do you actually want to skip checking the "before" values when the "after" values are present but simply don't match? If both of those hold, then you could use:
ON p.id = ps.id_a AND p.agency_id = ps.id_b
albeit you might want to name them better then. That should improve it a smidge.
Back to the JOIN logic, another reason I think the above might apply: you are partitioning the ROW_NUMBER by the after values when present, which implies that if you had rows with the same after values but different before values, the latter might be getting thrown away by the current ROW_NUMBER.
But otherwise it doesn't look like it's doing anything truly bad, at which point you might want to run a 4-8 times bigger warehouse and let it run (roughly 24/8 hours on the 8x size) to see if it completes in proportionally less time. The cost of the bigger warehouse should be offset by the much smaller wall-clock time.
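The resize itself is a one-liner, e.g. (warehouse name is a placeholder):
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'X2LARGE';
and you can size it back down once the backlog has cleared, so you only pay the higher rate for the much shorter run.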
Silly Idea:
On the smaller data set you mention, try the SQL written out nice and simple:
MERGE INTO target.message AS p
USING (
SELECT
b.before_id,
b.before_agency_id,
b.after_id,
b.after_agency_id,
b.payload,
b.ts_ms,
b.op
FROM (
SELECT
A.*
,COALESCE(a.after_id, a.before_id) AS id_a
,COALESCE(a.after_agency_id, a.before_agency_id) AS id_b
,ROW_NUMBER() OVER (PARTITION BY id_a, id_b ORDER BY ts_ms DESC ) as rn
FROM (
SELECT
record_content:payload:before.id::VARCHAR AS before_id,
record_content:payload:before.agency_id::VARCHAR AS before_agency_id,
record_content:payload:after.id::VARCHAR AS after_id,
record_content:payload:after.agency_id::VARCHAR AS after_agency_id,
record_content:payload:after::VARIANT AS payload,
record_content:payload:source.ts_ms::INT AS ts_ms,
record_content:payload:op::VARCHAR AS op
FROM raw.message_stream
) as A
) AS B
WHERE b.rn = 1
) AS ps
ON (p.id = ps.after_id AND p.agency_id = ps.after_agency_id) OR
(p.id = ps.before_id AND p.agency_id = ps.before_agency_id)
WHEN MATCHED AND ps.op = 'd'
THEN DELETE
WHEN MATCHED AND ps.op IN ('u', 'r')
THEN UPDATE SET p.payload = ps.payload, p.ts_ms = ps.ts_ms
WHEN NOT MATCHED AND ps.op IN ('c', 'r', 'u')
THEN INSERT (p.id, p.agency_id, p.payload, p.ts_ms)
VALUES (ps.after_id, ps.after_agency_id, ps.payload, ps.ts_ms);
and with the join written the way I suspect would work for your data, on cloned tables, just to see what the performance impact is:
MERGE INTO target.message AS p
USING (
SELECT
--b.before_id,
--b.before_agency_id,
b.after_id,
b.after_agency_id,
b.payload,
b.ts_ms,
b.op,
b.id_a,
b.id_b
FROM (
SELECT
A.*
,COALESCE(a.after_id, a.before_id) AS id_a
,COALESCE(a.after_agency_id, a.before_agency_id) AS id_b
,ROW_NUMBER() OVER (PARTITION BY id_a, id_b ORDER BY ts_ms DESC ) as rn
FROM (
SELECT
record_content:payload:before.id::VARCHAR AS before_id,
record_content:payload:before.agency_id::VARCHAR AS before_agency_id,
record_content:payload:after.id::VARCHAR AS after_id,
record_content:payload:after.agency_id::VARCHAR AS after_agency_id,
record_content:payload:after::VARIANT AS payload,
record_content:payload:source.ts_ms::INT AS ts_ms,
record_content:payload:op::VARCHAR AS op
FROM raw.message_stream
) as A
) AS B
WHERE b.rn = 1
) AS ps
ON p.id = ps.id_a AND p.agency_id = ps.id_b
WHEN MATCHED AND ps.op = 'd'
THEN DELETE
WHEN MATCHED AND ps.op IN ('u', 'r')
THEN UPDATE SET p.payload = ps.payload, p.ts_ms = ps.ts_ms
WHEN NOT MATCHED AND ps.op IN ('c', 'r', 'u')
THEN INSERT (p.id, p.agency_id, p.payload, p.ts_ms)
VALUES (ps.after_id, ps.after_agency_id, ps.payload, ps.ts_ms);
Another thing to try to "get through the backlog":
Split the task into two steps. Just for now, make a temp table that is the first half:
CREATE TABLE perm_but_call_temp_table AS
SELECT
record_content:payload:before.id::VARCHAR AS before_id,
record_content:payload:before.agency_id::VARCHAR AS before_agency_id,
record_content:payload:after.id::VARCHAR AS after_id,
record_content:payload:after.agency_id::VARCHAR AS after_agency_id,
record_content:payload:after::VARIANT AS payload,
record_content:payload:source.ts_ms::INT AS ts_ms,
record_content:payload:op::VARCHAR AS op,
COALESCE(after_id, before_id) AS id_a,
COALESCE(after_agency_id, before_agency_id) AS id_b
FROM raw.message_stream
QUALIFY ROW_NUMBER() OVER (PARTITION BY id_a, id_b ORDER BY ts_ms DESC ) = 1
then merge that into your main table.
MERGE INTO target.message AS p
USING perm_but_call_temp_table AS ps
ON p.id = ps.id_a AND p.agency_id = ps.id_b
WHEN MATCHED AND ps.op = 'd'
THEN DELETE
WHEN MATCHED AND ps.op IN ('u', 'r')
THEN UPDATE SET p.payload = ps.payload, p.ts_ms = ps.ts_ms
WHEN NOT MATCHED AND ps.op IN ('c', 'r', 'u')
THEN INSERT (p.id, p.agency_id, p.payload, p.ts_ms)
VALUES (ps.after_id, ps.after_agency_id, ps.payload, ps.ts_ms);
which will give you an idea of where the problem is: the first or the second operation. It will also let you merge into clones, and test whether the equi-join version runs faster and produces the same results.
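The clone side of that is cheap to set up, something like this (clone name is a placeholder; cloning is zero-copy, so it costs no extra storage until the clone diverges):
CREATE OR REPLACE TABLE target.message_clone CLONE target.message;
-- point the candidate (equi-join) MERGE at target.message_clone instead of target.message
Afterwards you can diff the clone against the real target (row counts, or a MINUS on the key columns) to confirm the two versions agree.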
Related
Reproduction project available on GitHub, using the AdventureWorks2016 database. GitHub Reproduction
We have a custom filtering mechanism to accommodate our needs. It takes input data, builds an expression tree for the full query, and passes it to Entity Framework to execute. The query has two parts: getting the basic entity, and getting some extra data values represented as subqueries inside the final projection.
Problem:
When there are more than ~20 subqueries, SQL Server throws an error:
Some part of your SQL statement is nested too deeply. Rewrite the query or break it up into smaller queries.
Upon closer investigation it turns out that queries similar to this:
var products = db.Products.Where(p => productIds.Contains(p.ProductID))
.Select(p => new
{
Entity = p,
Extras = new
{
TotalTransactions = p.TransactionHistories.Count(),
TotalCostChanges = p.ProductCostHistories.Count(),
AverageTransactionCost = p.TransactionHistories.Average(t => t.Quantity * t.ActualCost),
MaxQuantity = (int?)p.TransactionHistories.Max(t => t.Quantity)
}
});
are resulting in generated SQL like this:
SELECT
[Project3].[ProductID] AS [ProductID],
[Project3].[Name] AS [Name],
[Project3].[C1] AS [C1],
[Project3].[C2] AS [C2],
[Project3].[C3] AS [C3],
(SELECT
MAX([Extent5].[Quantity]) AS [A1]
FROM [Production].[TransactionHistory] AS [Extent5]
WHERE [Project3].[ProductID] = [Extent5].[ProductID]) AS [C4]
FROM ( SELECT
[Project2].[ProductID] AS [ProductID],
[Project2].[Name] AS [Name],
[Project2].[C1] AS [C1],
[Project2].[C2] AS [C2],
(SELECT
AVG([Filter4].[A1]) AS [A1]
FROM ( SELECT
CAST( [Extent4].[Quantity] AS decimal(19,0)) * [Extent4].[ActualCost] AS [A1]
FROM [Production].[TransactionHistory] AS [Extent4]
WHERE [Project2].[ProductID] = [Extent4].[ProductID]
) AS [Filter4]) AS [C3]
FROM ( SELECT
[Project1].[ProductID] AS [ProductID],
[Project1].[Name] AS [Name],
[Project1].[C1] AS [C1],
(SELECT
COUNT(1) AS [A1]
FROM [Production].[ProductCostHistory] AS [Extent3]
WHERE [Project1].[ProductID] = [Extent3].[ProductID]) AS [C2]
FROM ( SELECT
[Extent1].[ProductID] AS [ProductID],
[Extent1].[Name] AS [Name],
(SELECT
COUNT(1) AS [A1]
FROM [Production].[TransactionHistory] AS [Extent2]
WHERE [Extent1].[ProductID] = [Extent2].[ProductID]) AS [C1]
FROM [Production].[Product] AS [Extent1]
WHERE [Extent1].[ProductID] IN (707, 708, 709, 711)
) AS [Project1]
) AS [Project2]
) AS [Project3]
With each property added to Extras, one more nested query is created.
Is there any way to make Entity Framework generate a better query (note: all those nested queries for the values C1, C2, ... can be represented as simple subqueries in the main select), or should this kind of query be written in some completely different way?
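For illustration, the SQL shape I am hoping for is roughly the hand-written version below (a sketch against AdventureWorks, not something EF actually emits), where each aggregate is a flat correlated subquery in the outer select instead of another level of nesting:
SELECT p.ProductID,
       p.Name,
       (SELECT COUNT(1) FROM Production.TransactionHistory t WHERE t.ProductID = p.ProductID) AS TotalTransactions,
       (SELECT COUNT(1) FROM Production.ProductCostHistory c WHERE c.ProductID = p.ProductID) AS TotalCostChanges,
       (SELECT AVG(CAST(t.Quantity AS decimal(19, 4)) * t.ActualCost)
        FROM Production.TransactionHistory t WHERE t.ProductID = p.ProductID) AS AverageTransactionCost,
       (SELECT MAX(t.Quantity) FROM Production.TransactionHistory t WHERE t.ProductID = p.ProductID) AS MaxQuantity
FROM Production.Product p
WHERE p.ProductID IN (707, 708, 709, 711);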
I have recently developed some code that identifies first call resolution for the insurance company I currently work for. The code creates two identical SELECT statements from a call data table, sorts and ranks the data by member_id, call_date and call_time, and then compares the two queries against each other where the member_id and call_type are equal but the "rank" (RowId) is not, so each RowId from one side of the inner join is compared against the other side but never against itself. This acts as a quasi "loop" to compare all the records to see where the call_dates are > 14 days apart, or where the call_dates are equal and the call_times are > 24 hours apart, for the same call_type and same member_id.
Here is a sample of the code:
DECLARE @Temp TABLE (
[MEM_ID] [varchar](10) NULL,
[CALL_DATE] [datetime2](3) NULL,
[CALL_TIME] [varchar](8) NULL,
[OPER_CODE] [varchar](8) NULL,
[TEXT_DEPT] [varchar](6) NULL,
[CAT_DESC] [varchar](30) NULL
)
INSERT INTO @Temp
VALUES ('00000-1400', '2018-07-16 00:00:00.000','10:12:23','YYZ2500','SERV06','PLAN BENEFITS')
,('00000-1400', '2018-06-10 00:00:00.000','10:12:23','YYZ2500','SERV06','PLAN BENEFITS')
,('00000-1400', '2018-07-01 00:00:00.000','18:12:23','YYZ2500','SERV06','CLAIMS')
,('00000-1400', '2018-07-02 00:00:00.000','05:12:23','YYZ2500','SERV06','OTHER')
,('00000-1400', '2018-07-14 00:00:00.000','02:12:23','YYZ2500','SERV06','CLAIMS')
,('00000-1400', '2018-07-27 00:00:00.000','11:12:23','YYZ2500','SERV06','PLAN BENEFITS')
,('00000-1400', '2018-06-30 00:00:00.000','08:12:23','YYZ2500','SERV06','PLAN BENEFITS')
,('00000-1400', '2018-06-29 00:00:00.000','07:12:23','YYZ2500','SERV06','AUTHORIZATIONS')
,('00000-1400', '2018-06-29 00:00:00.000','07:26:23','YYZ2500','SERV06','AUTHORIZATIONS')
,('00000-1400', '2018-06-25 00:00:00.000','09:38:23','YYZ2500','SERV06','OTHER')
Select Calc.*, CASE WHEN Calc.Disposition = 'No subsequent within 14' THEN CONVERT(tinyint, 1) ELSE 0 END as FCR_Ind
From (Select DISTINCT Final.RowID, Final.CallCount, Final.MEM_ID, Final.CALL_DATE,
Final.CALL_TIME, Final.CAT_DESC, Final.OPER_CODE, Final.TEXT_DEPT,
CASE WHEN Final.[Had a subsequent within 14 days] IS NULL
THEN 'No subsequent within 14' ELSE Final.[Had a subsequent within 14 days]
END as Disposition
From (Select *
From (Select DISTINCT T1.RowID, T1.CallCount, T1.MEM_ID, T1.CALL_DATE,
T1.CALL_TIME, T1.CAT_DESC, T1.OPER_CODE, T1.TEXT_DEPT,
CASE WHEN (T2.CALL_DATE BETWEEN T1.CALL_DATE AND DateAdd(DAY, 14, T1.CALL_DATE)
           AND T2.RowID <> T1.RowID AND T2.CALL_DATE <> T1.CALL_DATE)
       OR (T2.CALL_TIME BETWEEN T1.CALL_TIME AND DateAdd(HOUR, 24, T1.CALL_DATE)
           AND T2.RowID <> T1.RowID AND T2.CALL_DATE = T1.CALL_DATE)
     THEN 'Had a subsequent within 14 days' END as Disposition
From (Select ROW_NUMBER() OVER (PARTITION BY MEM_ID ORDER BY MEM_ID,
[CALL_DATE], CALL_TIME) AS RowID, Count(CALL_TIME) as CallCount,
MEM_ID, [CALL_DATE], CALL_TIME, CAT_DESC, OPER_CODE, TEXT_DEPT
From @Temp
Group by MEM_ID, [CALL_DATE], CALL_TIME, CAT_DESC, OPER_CODE,
TEXT_DEPT) as T1
Inner Join (Select ROW_NUMBER() OVER (PARTITION BY MEM_ID ORDER BY MEM_ID,
[CALL_DATE], CALL_TIME) AS RowID, Count(CALL_TIME) as
CallCount, MEM_ID, [CALL_DATE], CALL_TIME, CAT_DESC,
OPER_CODE, TEXT_DEPT
From @Temp
Group by MEM_ID, [CALL_DATE], CALL_TIME, CAT_DESC, OPER_CODE,
TEXT_DEPT) as T2
ON T1.MEM_ID = T2.MEM_ID AND T1.CAT_DESC = T2.CAT_DESC ) as Sub
PIVOT( MAX(Sub.Disposition)
FOR Sub.Disposition IN ([Had a subsequent within 14 days])) AS PivotTable) as Final) as Calc
The Output yields the following:
I know this is not the most efficient way to solve this problem, and I was wondering if there is a way to write the INNER JOIN portion so that only one of the sorted tables is used to iterate through the records of the second sorted table, instead of each table iterating against the other. I appreciate any insights on how to make this code use fewer system resources and be more efficient from an execution plan standpoint.
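To make that concrete, I imagine something in the spirit of the sketch below, using LEAD() so each row only looks at the next call for the same member and category (untested; the 14-day vs same-day/24-hour rule would still need to be matched to my logic above exactly):
SELECT MEM_ID, CALL_DATE, CALL_TIME, CAT_DESC, OPER_CODE, TEXT_DEPT,
       CASE WHEN next_call_date IS NOT NULL
             AND next_call_date <= DateAdd(DAY, 14, CALL_DATE)
            THEN 'Had a subsequent within 14 days'
            ELSE 'No subsequent within 14' END AS Disposition
FROM (SELECT MEM_ID, CALL_DATE, CALL_TIME, CAT_DESC, OPER_CODE, TEXT_DEPT,
             LEAD(CALL_DATE) OVER (PARTITION BY MEM_ID, CAT_DESC
                                   ORDER BY CALL_DATE, CALL_TIME) AS next_call_date
      FROM @Temp) AS NextCalls;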
Thanks!!
I know this question might sound like a duplicate, but I've been through every question I could find; though it's still possible it might be a duplicate of a question I might have missed.
I have what at face value appears to be a trivial requirement, but no matter how I script it out there's always some caveat that's just not working. I've tried GROUP, DISTINCT, JOIN, aggregate functions, etc.
Scenario:
PRIMARYTABLE contains a set of campaigns and SECONDARYTABLE contains the dates on which campaigns were run. There can be multiple runs per campaign and I've included a SUBKEY for each run.
Requirement:
I need to get the most recently run campaigns into a list, so the user can more easily select from the campaigns that get run the most frequently.
PRIMARYTABLE
KEYCOLUMN INFOCOLUMN
100000 Test 1
100001 Test Campaign
100002 Test Image 2
100003 Test Img
100004 Image Test
100005 Test
100006 Test Image 3
100007 Test Image 4
100008 Test Image 5
100009 Image Comparison Test 2
100010 Testing
100011 Test Fields
100012 Test 5
100013 test
SECONDARYTABLE
KEYCOLUMN SUBKEY DATECOLUMN
100000 100000 2017-06-02 04:09:57.593
100001 100001 2017-06-19 12:09:54.093
100001 100002 2017-06-27 10:51:14.140
100004 100003 2017-06-27 12:33:47.747
100006 100004 2017-06-28 10:29:53.387
100007 100005 2017-06-28 10:36:23.710
100008 100006 2017-06-29 22:31:03.790
100009 100007 2017-06-29 23:07:52.870
100009 100010 2017-10-04 16:05:40.583
100009 100011 2017-10-04 16:09:55.470
100011 100008 2017-09-08 14:02:28.017
100012 100009 2017-09-11 16:17:23.870
100013 100012 2017-11-07 16:55:55.403
100013 100013 2017-11-08 15:37:16.430
Below is more or less an idea of what I'm after.
SELECT DISTINCT( a.[INFOCOLUMN] )
FROM [PRIMARYTABLE] a
INNER JOIN [SECONDARYTABLE] b ON ( a.[KEYCOLUMN] = b.[KEYCOLUMN] )
ORDER BY a.[DATECOLUMN]
Here's hoping for a Homer Simpson "Doh!" moment once I see how it's supposed to be done.
Much appreciated.
the most recently run campaigns >> use row_number() over(.. order by ... DESC)
that get run the most frequent >> use count(*) over(partition by ..)
Using window functions row_number() over() and count() over() enables selection by row of data that is "most recent" and ordering by "most frequent". Note that the DESCending order of dates brings about "recent" = 1.
select
p.*, s.*
from PRIMARYTABLE p
inner join (
select KEYCOLUMN, SUBKEY, DATECOLUMN
, row_number() over(partition by KEYCOLUMN order by DATECOLUMN DESC) recent
, count(*) over(partition by KEYCOLUMN) frequency
from SECONDARYTABLE
) s on p.KEYCOLUMN = s.KEYCOLUMN and s.recent = 1
order by s.frequency DESC, p.INFOCOLUMN
You can try this:
DECLARE @PRIMARYTABLE TABLE
(
[KEYCOLUMN] INT
,[INFOCOLUMN] VARCHAR(24)
);
DECLARE @SECONDARYTABLE TABLE
(
[KEYCOLUMN] INT
,[SUBKEY] INT
,[DATECOLUMN] DATETIME2
);
INSERT INTO @PRIMARYTABLE ([KEYCOLUMN], [INFOCOLUMN])
VALUES (100000, 'Test 1')
,(100001, 'Test Campaign')
,(100002, 'Test Image 2')
,(100003, 'Test Img')
,(100004, 'Image Test')
,(100005, 'Test')
,(100006, 'Test Image 3')
,(100007, 'Test Image 4')
,(100008, 'Test Image 5')
,(100009, 'Image Comparison Test 2')
,(100010, 'Testing')
,(100011, 'Test Fields')
,(100012, 'Test 5')
,(100013, 'test');
INSERT INTO @SECONDARYTABLE ([KEYCOLUMN], [SUBKEY], [DATECOLUMN])
VALUES (100000, 100000, '2017-06-02 04:09:57.593')
,(100001, 100001, '2017-06-19 12:09:54.093')
,(100001, 100002, '2017-06-27 10:51:14.140')
,(100004, 100003, '2017-06-27 12:33:47.747')
,(100006, 100004, '2017-06-28 10:29:53.387')
,(100007, 100005, '2017-06-28 10:36:23.710')
,(100008, 100006, '2017-06-29 22:31:03.790')
,(100009, 100007, '2017-06-29 23:07:52.870')
,(100009, 100010, '2017-10-04 16:05:40.583')
,(100009, 100011, '2017-10-04 16:09:55.470')
,(100011, 100008, '2017-09-08 14:02:28.017')
,(100012, 100009, '2017-09-11 16:17:23.870')
,(100013, 100012, '2017-11-07 16:55:55.403')
,(100013, 100013, '2017-11-08 15:37:16.430');
SELECT a.[INFOCOLUMN]
,b.[DATECOLUMN]
FROM @PRIMARYTABLE A
CROSS APPLY
(
SELECT TOP 1 [DATECOLUMN]
FROM @SECONDARYTABLE B
WHERE A.[KEYCOLUMN] = B.[KEYCOLUMN]
ORDER BY [DATECOLUMN] DESC
) b;
It will give you the last execution of each campaign. You can then filter by date, or ORDER BY and take TOP N from the final query.
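For example, to keep just the ten most recently run campaigns you could wrap it like this (sketch):
SELECT TOP 10 a.[INFOCOLUMN], b.[DATECOLUMN]
FROM @PRIMARYTABLE a
CROSS APPLY
(
    SELECT TOP 1 [DATECOLUMN]
    FROM @SECONDARYTABLE s
    WHERE a.[KEYCOLUMN] = s.[KEYCOLUMN]
    ORDER BY [DATECOLUMN] DESC
) b
ORDER BY b.[DATECOLUMN] DESC;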
Or you can use ROW_NUMBER:
WITH DataSource AS
(
SELECT A.[INFOCOLUMN]
,B.[DATECOLUMN]
,ROW_NUMBER() OVER (PARTITION BY A.[KEYCOLUMN] ORDER BY B.[DATECOLUMN] DESC) AS [RowID]
FROM @PRIMARYTABLE A
INNER JOIN @SECONDARYTABLE B
ON A.[KEYCOLUMN] = B.[KEYCOLUMN]
)
SELECT [INFOCOLUMN]
,[DATECOLUMN]
FROM DataSource
WHERE [RowID] = 1;
Try this; it will return the list of campaigns ordered by most recent use. Note that campaigns that were never run won't appear in your list; in that case you will need to do a left join.
SELECT a.[INFOCOLUMN]
FROM [PRIMARYTABLE] a
/* left */ JOIN [SECONDARYTABLE] b ON a.[KEYCOLUMN] = b.[KEYCOLUMN]
group BY a.[infocolumn]
order by max(datecolumn) desc
Here is a stub I did to test it:
select 10000 id,'Campain A' cname into #a1 union all
select 10002,'Campain B' union all
select 10004,'Campain C' union all
select 10009,'Campain E'
select 10000 id,'20170101' thedate into #a2 union all
select 10000,'20170102' union all
select 10009,'20170103' union all
select 10002,'20170104' union all
select 10004,'20170105' union all
select 10000,'20170201' union all
select 10000,'20170302' union all
select 10009,'20170403' union all
select 10002,'20170104' union all
select 10004,'20170205' union all
select 10000,'20170101' union all
select 10004,'20170302' union all
select 10000,'20170103' union all
select 10002,'20170404' union all
select 10002,'20170105'
select #a1.cname
from #a1 join #a2 on #a1.id = #a2.id
group by #a1.cname
order by max(thedate) desc
By executing the SQL Server 2012 query below, I got the following output:
declare
@ticketstatus nvarchar(20) = 'To Be Allocated'
SELECT m1.ClaimSource, m1.Insurance, n1.[Claim Count], n1.[Claim Value],
ISNULL(m1.[0-30],0) [0-30],
ISNULL(m1.[31-60],0) [31-60],
ISNULL(m1.[61-90],0) [61-90],
ISNULL(m1.[91-120],0) [91-120],
ISNULL(m1.[121-210],0) [121-210],
ISNULL(m1.[210++],0) [210++]
FROM (
SELECT *
FROM (
SELECT ClaimSource, Insurance, CurrentBalance _Count, AgeBucket
FROM ClaimMaster
) m
PIVOT (
COUNT(_Count)
FOR AgeBucket IN ([0-30],[31-60],[61-90],[91-120],[121-210],[210++])
) n
) m1
join
(SELECT Insurance, COUNT(Insurance) [Claim Count], SUM(CurrentBalance) [Claim Value] FROM ClaimMaster
WHERE (TicketStatus = @ticketstatus OR @ticketstatus IS NULL)
GROUP BY Insurance) n1
ON m1.Insurance = n1.Insurance
ORDER BY n1.[Claim Count] DESC
How can I get the correct output for Claim Count and Claim Value on rows 4, 5 & 6? Instead of showing the full claim count, it should show the respective claim count filtered by Claim Source (e.g. Claim Count should be 2) and the corresponding Claim Value.
Can anyone help me with this?
Add claimsource and join on that as well?
declare
@ticketstatus nvarchar(20) = 'To Be Allocated'
SELECT m1.ClaimSource, m1.Insurance, n1.[Claim Count], n1.[Claim Value],
ISNULL(m1.[0-30],0) [0-30],
ISNULL(m1.[31-60],0) [31-60],
ISNULL(m1.[61-90],0) [61-90],
ISNULL(m1.[91-120],0) [91-120],
ISNULL(m1.[121-210],0) [121-210],
ISNULL(m1.[210++],0) [210++]
FROM (
SELECT *
FROM (
SELECT ClaimSource, Insurance, CurrentBalance _Count, AgeBucket
FROM ClaimMaster
) m
PIVOT (
COUNT(_Count)
FOR AgeBucket IN ([0-30],[31-60],[61-90],[91-120],[121-210],[210++])
) n
) m1
join
(SELECT ClaimSource, Insurance, COUNT(Insurance) [Claim Count], SUM(CurrentBalance) [Claim Value] FROM ClaimMaster
WHERE (TicketStatus = @ticketstatus OR @ticketstatus IS NULL)
GROUP BY ClaimSource, Insurance) n1
ON m1.Insurance = n1.Insurance and m1.ClaimSource = n1.ClaimSource
ORDER BY n1.[Claim Count] DESC
Here's my Count_query:
Declare @yes_count decimal;
Declare @no_count decimal;
set @yes_count=(Select count(*) from Master_Data where Received_Data='Yes');
set @no_count=(Select count(*) from Master_Data where Received_Data='No');
select @yes_count As Yes_Count,@no_count as No_Count,(@yes_count/(@yes_count+@no_count)) As Submission_Count
I am having trouble making joins on these two queries
This is the rest of the query:
Select Distinct D.Member_Id,d.Name,d.Region_Name, D.Domain,e.Goal_Abbreviation,
e.Received_Data, case when Received_Data = 'Service Not Provided' then null
when Received_Data = 'No' then null else e.Improvement end as
Percent_Improvement , case when Received_Data = 'Service Not Provided' then null
when Received_Data = 'No' then null else e.Met_40_20 end as Met_40_20
FROM (
select distinct member_Domains.*,
(case when NoData.Member_Id is null then 'Participating' else ' ' end) as Participating
from
(
select distinct members.Member_Id, members.Name, Members.Region_Name,
case when Domains.Goal_Abbreviation = 'EED Reduction' then 'EED'
When Domains.Goal_Abbreviation = 'Pressure Ulcers' then 'PRU'
when Domains.Goal_Abbreviation = 'Readmissions' then 'READ' else Domains.Goal_Abbreviation end as Domain from
(select g.* from Program_Structure as ps inner join Goal as g on ps.Goal_Id = g.Goal_Id
and ps.Parent_Goal_ID = 0) as Domains
cross join
(select distinct hc.Member_ID, hc.Name,hc.Region_Name from zsheet as z
inner join Hospital_Customers$ as hc on z.CCN = hc.Mcare_Id) as Members
) as member_Domains
left outer join Z_Values_Hospitals as NoData on member_Domains.member_ID = NoData.Member_Id
and Member_Domains.Domain = noData.ReportName) D
Left Outer JOIN
(SELECT B.Member_ID, B.Goal_Abbreviation, B.minRate, C.maxRate, B.BLine, C.Curr_Quarter, B.S_Domain,
(CASE WHEN B.Member_ID IN
(SELECT member_id
FROM Null_Report
WHERE ReportName = B.S_Domain) THEN 'Service Not Provided' WHEN Curr_Quarter = 240 THEN 'Yes' ELSE 'No' END) AS Received_Data,
ROUND((CASE WHEN minRate = 0 AND maxRate = 0 THEN 0 WHEN minRate = 0 AND maxRate > 0 THEN 1 ELSE (((maxRate - minRate) / minRate) * 100) END), .2) AS Improvement,
(CASE WHEN ((CASE WHEN minRate = 0 AND maxRate = 0 THEN 0 WHEN minRate = 0 AND maxRate > 0 THEN 1 ELSE (maxRate - minRate) / minRate END)) <= - 0.4 OR
maxRate = 0 THEN 'Yes' WHEN ((CASE WHEN minRate = 0 AND maxRate = 0 THEN 0 WHEN minRate = 0 AND maxRate > 0 THEN 1 ELSE (maxRate - minRate) / minRate END))
<= - 0.2 OR maxRate = 0 THEN 'Yes' ELSE 'No' END) AS Met_40_20
FROM (SELECT tab.Member_ID, tab.Measure_Value AS minRate, tab.Goal_Abbreviation, A.BLine, tab.S_Domain
FROM Measure_Table_Description AS tab INNER JOIN
(SELECT DISTINCT
Member_ID AS new_memid, Goal_Abbreviation AS new_measure, MIN(Reporting_Period_ID) AS BLine, MAX(Reporting_Period_ID)
AS Curr_Quarter
FROM Measure_Table_Description
WHERE (Member_ID > 1) AND (Measure_Value IS NOT NULL) AND (Measure_ID LIKE '%O%')
GROUP BY Goal_Abbreviation, Member_ID) AS A ON tab.Member_ID = A.new_memid AND tab.Reporting_Period_ID = A.BLine AND
tab.Goal_Abbreviation = A.new_measure) AS B FULL OUTER JOIN
(SELECT tab.Member_ID, tab.Measure_Value AS maxRate, tab.Goal_Abbreviation, A_1.Curr_Quarter
FROM Measure_Table_Description AS tab INNER JOIN
(SELECT DISTINCT
Member_ID AS new_memid, Goal_Abbreviation AS new_measure,
MIN(Reporting_Period_ID) AS BLine, MAX(Reporting_Period_ID)
AS Curr_Quarter
FROM Measure_Table_Description AS Measure_Table_Description_1
WHERE (Member_ID >1) AND (Measure_Value IS NOT NULL) AND (Measure_ID LIKE '%O%')
GROUP BY Goal_Abbreviation, Member_ID) AS A_1 ON tab.Member_ID = A_1.new_memid
AND tab.Reporting_Period_ID = A_1.Curr_Quarter AND
tab.Goal_Abbreviation = A_1.new_measure) AS C ON B.Member_ID = C.Member_ID
WHERE (B.Goal_Abbreviation = C.Goal_Abbreviation) ) E ON D.Member_Id = E.Member_ID AND d.Domain = E.S_Domain
ORDER BY D.Domain,D.Member_ID
How do I get count('Yes') / (count('Yes') + count('No')) for each member_ID as column 1, and also display the rank of each member_ID against all the member_IDs in the result as column 2? I have come up with a query that generates the count for the entire table, but how do I restrict it to each Member_ID?
Thanks for your help.
I haven't taken the time to digest your provided query, but if abstracted to the concept of having an aggregate over a range of data repeated on each row, you should look at using windowing functions. There are other methods, such as using a CTE to do your aggregation and then JOINing back to your detailed data. That might work better for more complex calculations, but the window functions are arguably the more elegant option.
DECLARE @MasterData AS TABLE
(
MemberID varchar(50),
MemberAnswer int
);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jim', 1);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jim', 0);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jim', 1);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jim', 1);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jane', 1);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jane', 0);
INSERT INTO @MasterData (MemberID, MemberAnswer) VALUES ('Jane', 1);
-- Method 1, using windowing functions (preferred for performance and syntactical compactness)
SELECT
MemberID,
MemberAnswer,
CONVERT(numeric(19,4),SUM(MemberAnswer) OVER (PARTITION BY MemberID)) / CONVERT(numeric(19,4),COUNT(MemberAnswer) OVER (PARTITION BY MemberID)) AS PercentYes
FROM @MasterData;
-- Method 2, using a CTE
WITH MemberSummary AS
(
SELECT
MemberID,
SUM(MemberAnswer) AS MemberYes,
COUNT(MemberAnswer) AS MemberTotal
FROM @MasterData
GROUP BY MemberID
)
SELECT
md.MemberID,
md.MemberAnswer,
CONVERT(numeric(19,4),MemberYes) / CONVERT(numeric(19,4),MemberTotal) AS PercentYes
FROM @MasterData md
JOIN MemberSummary ms
ON md.MemberID = ms.MemberID;
First thought is: your query is much, much too complicated. I have spent about 10 minutes now trying to make sense of it and haven't gotten anywhere, so it's obviously going to pose a long-term maintenance challenge to those within your organization going forward as well. I would really recommend you try to find some way of simplifying it.
That said, here is a simplified, general example of how to query on a calculated value and rank the results:
CREATE TABLE member (member_id INT PRIMARY KEY);
CREATE TABLE master_data (
transaction_id INT PRIMARY KEY,
member_id INT FOREIGN KEY REFERENCES member(member_id),
received_data BIT
);
-- INSERT data here
; WITH member_data_counts AS (
SELECT
m.member_id,
(SELECT COUNT(*) FROM master_data d WHERE d.member_id = m.member_id AND d.received_data = 1) num_yes,
(SELECT COUNT(*) FROM master_data d WHERE d.member_id = m.member_id AND d.received_data = 0) num_no
FROM member m
), member_data_calc AS (
SELECT
*,
CASE
WHEN (num_yes + num_no) = 0 THEN NULL -- avoid division-by-zero error
ELSE 1.0 * num_yes / (num_yes + num_no) -- multiply by 1.0 to avoid integer division
END pct_yes
FROM member_data_counts
), member_data_rank AS (
SELECT *, RANK() OVER (ORDER BY pct_yes DESC) AS RankValue
FROM member_data_calc
)
SELECT *
FROM member_data_rank
ORDER BY RankValue ASC;