I created a stream in Snowflake on a table and a task to move the stream's data into another table. Even after the task completes, the data in the stream is not being purged. Because of that the task never gets skipped and keeps reinserting rows from the stream, so the final table keeps growing. What could be the reason? It was working yesterday, but since today the stream is not purging even after a DML statement consumes it through the task.
create or replace stream test_stream on table test_table_raw APPEND_ONLY = TRUE;
create or replace task test_task_task warehouse = test_warehouse
schedule = '1 minute'
when system$stream_has_data('test_stream')
as insert into test_table
SELECT
level1.FILE_NAME,
level1.FILE_ROWNUMBER,
GET(lvl, '#id')::string as app_id
FROM (SELECT FILE_NAME,FILE_ROWNUMBER,src:"$" as lvl FROM test_table_raw) level1,
lateral FLATTEN(LVL:"$") level2
where level2.value like '%<test %';
alter task test_task resume;
select
(select count(*) from test_table) table_count,
(select count(*) from test_stream) stream_count;
TABLE_COUNT STREAM_COUNT
500 1
Is the transaction committing? That is, do you see the inserts (or whatever DML the task performs using that stream) actually happening?
Any chance you can post the SQL?
A stream's offset changes when a transaction in which the stream is used commits. There is really no "purge"; the stream offset just moves forward so you don't see the same rows again.
Dinesh Kulkarni
(PM, Snowflake)
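A minimal sketch (with throwaway names, not from the question) of what "the offset moves forward" means in practice; each statement runs with autocommit on:
create or replace table demo_raw (src variant);
create or replace table demo_target (src variant);
create or replace stream demo_stream on table demo_raw APPEND_ONLY = TRUE;
insert into demo_raw select parse_json('{"a": 1}');
select count(*) from demo_stream;                     -- 1: the stream shows the unconsumed row
insert into demo_target select src from demo_stream;  -- DML that reads the stream
select count(*) from demo_stream;                     -- 0: the offset advanced when that insert committed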
My bad! I was using the base table in the task instead of the stream.
create or replace task test_task_task warehouse = test_warehouse
schedule = '1 minute'
when system$stream_has_data('test_stream')
as insert into test_table
SELECT
level1.FILE_NAME,
level1.FILE_ROWNUMBER,
GET(lvl, '#id')::string as app_id
FROM (SELECT FILE_NAME,FILE_ROWNUMBER,src:"$" as lvl FROM test_stream) level1,  -- was test_table_raw
lateral FLATTEN(LVL:"$") level2
where level2.value like '%<test %';
I took over a project that uses SymmetricDS to sync tables between a source and a target. The tables from the source aren't syncing to the target at the moment.
I have searched online but found no help.
I have checked the sym_outgoing_batch and sym_incoming_batch tables but can't figure out how to use the information there.
I also queried the sync_trigger table; the result of that query is linked below (sync_trigger result).
If you have an idea of where I could look, please let me know. I can run queries and share the results.
Uncomment the "--where status != 'OK'" lines below to see whether anything is NOT in the OK state; if something is, that is what is causing the sync to stop.
-- SQL QUERY
-- Symmetric DS : MONITOR : HEARTBEAT / INCOMING / OUTGOING / MONITOREVENTS
SELECT node_id, host_name, getdate() as dtNOW ,heartbeat_time FROM [tablename].[dbo].[sd_node_host] with (NOLOCK) where heartbeat_time > '2022-01-01'
--SELECT * FROM [tablename].[dbo].[sd_context] with (NOLOCK)
SELECT * FROM [tablename].[dbo].[sd_outgoing_batch] with (NOLOCK)
--where status != 'OK'
order by create_time desc
SELECT * FROM [tablename].[dbo].[sd_incoming_batch] with (NOLOCK)
--where status != 'OK'
order by create_time desc
-- Symmetric DS : MONITOR
SELECT * FROM [tablename].[dbo].[sd_monitor_event] with (NOLOCK) WHERE is_resolved != 1
SELECT * FROM [tablename].[dbo].[sd_monitor_event] with (NOLOCK)
EDIT
Here is the link to the SymmetricDS user guide:
https://www.symmetricds.org/doc/3.13/html/user-guide.html#_outgoing_batch
Basically, NE means the batch is ready for replication. Did you ever have this set up and working, or is this a new setup that you are trying to get started?
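As a quick check (using the same [tablename].[dbo].[sd_...] naming as the queries above), you can count outgoing batches by status; a growing pile of NE batches on the source side means batches are being created but never sent:
SELECT status, channel_id, COUNT(*) AS batch_count
FROM [tablename].[dbo].[sd_outgoing_batch] with (NOLOCK)
GROUP BY status, channel_id
ORDER BY batch_count DESC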
EDIT 2 ENGINE CONFIGS
MAIN
engine.name=<SDS_MAIN>
db.driver=net.sourceforge.jtds.jdbc.Driver
db.url=jdbc:jtds:sqlserver://<IP>:1433/<DB>;useCursors=true;bufferMaxMemory=10240;lobBuffer=5242880
db.user=***********
db.password=***********
registration.url=
sync.url=ttp://<IP>:<PORT>/sync/<SDS_MAIN>
group.id=<GID>
external.id=000
auto.registration=true
initial.load.create.first=true
sync.table.prefix=sym
#start.initial.load.extract.job=false
compression.level=-1
compression.strategy=0
CHILD
engine.name=<SDS_CHILD>
db.driver=net.sourceforge.jtds.jdbc.Driver
db.url=jdbc:jtds:sqlserver://<IP>:1433/<DB>;useCursors=true;bufferMaxMemory=10240;lobBuffer=5242880
db.user=***********
db.password=***********
registration.url=http://<IP>:<PORT>/sync/<SDS_MAIN>
sync.url=http://<IP>:<PORT>/sync/<SDS_CHILD>
group.id=<GID>
external.id=100
auto.registration=true
initial.load.create.first=true
sync.table.prefix=sym
start.initial.load.extract.job=false
compression.level=-1
compression.strategy=0
We've set up a stream on a table that is continuously loaded via Snowpipe.
We're consuming this data with a task that runs every minute and merges it into another table. There is a possibility of duplicate keys, so we use a ROW_NUMBER() window function ordered by the file-created timestamp descending and keep only row_num = 1. This way we always get the latest insert.
Initially we used a standard task with the merge statement, but we noticed that in some instances, since Snowpipe does not guarantee loading files in the order they were staged, we were updating rows with older data. So in the WHEN MATCHED section we added a condition to update the row only when the incoming file-created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the MATCHED clause would interfere with the NOT MATCHED clause.
My theory was that the extra clause added a bit of time to the task run, so some runs were skipped or the next run happened almost immediately after the last one completed, and the missing rows arrived in the middle, with the offset moving past them before they could be consumed.
So we changed the task to call a stored procedure that uses an explicit transaction, because the docs seem to suggest that using a transaction will lock the stream. However, even with this, we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 out of 100,000s.
Any ideas what might be happening?
Example task code below (not the SP version; a sketch of that variant follows the task code).
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
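For reference, a minimal sketch of the stored-procedure variant mentioned above (procedure name assumed; column list and the my_view join abbreviated). The point is that the MERGE runs inside an explicit transaction, so the stream offset only advances when the whole block commits:
CREATE OR REPLACE PROCEDURE merge_my_stream()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    BEGIN TRANSACTION;
    MERGE INTO processed_data pd USING (
        SELECT ms.*
        FROM my_stream ms
        QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY file_created DESC) = 1
    ) ms ON ms.id = pd.id
    WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (ms.col1, ms.col2)
    WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2;
    COMMIT;
    RETURN 'done';
END;
$$;
-- The task then just calls the procedure:
-- CREATE OR REPLACE TASK my_task WAREHOUSE = TASK_WH SCHEDULE = '1 minute'
--     WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
--     AS CALL merge_my_stream();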
I am not fully sure what is going wrong here, but the recommendation about file-created time comes from Snowflake somewhere: the file-created timestamp is assigned by the cloud service and may be a bit different from what you think. There is another recommendation related to Snowpipe and data ingestion: the queue service takes about a minute to consume data from the pipe, and if you have a lot of data flowing in within that minute, you may end up with this issue. Look at your implementation and test whether pushing data at one-minute intervals solves the issue, and don't rely on the file-created time.
The condition "AND ms.file_created >= pd.file_created" seems to have been added as a mechanism to avoid updating the same row multiple times.
An alternative approach could be to use IS DISTINCT FROM to compare the source columns against the target columns (except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.
I'm not a database expert, but I need some help making sure that a trigger we are using to track an update on a table is the best way to handle our situation and is performing as it should. After deploying the trigger we noticed some slow performance in the actual business system (user side).
Background: we are trying to capture the date/time of a transaction so it can be referenced on a customer portal for our website.
The theory: the trigger watches for an UPDATE of a column to 'PI', and if that happens, it writes a row to a table with some basic information from two other tables that are related to the update.
Table 1 columns
RH.kbranch, RH.kordnum, RH.kcustnum, RH.custsnum, RH.[program]
Table 2 columns
RD.kbranch, RD.kordnum, RD.kpart
Table 3 columns (where trigger is attached)
EQ.kequipnum, EQ.eqpstatus
Trigger
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER TRIGGER [dbo].[PICKUPTrigger]
ON [TEST].[dbo].[equip]
FOR UPDATE
AS
IF (SELECT eqpstatus FROM inserted) = 'PI'
BEGIN
SET NOCOUNT ON
INSERT INTO [Workfiles].[dbo].[PickupAudit] ([HHBranch],[HHOrder],[HHCustomer], [HHShipTo], [EquipID], [EQStatus], [PickupNo], [StatusDate])
SELECT
RH.kbranch, RH.kordnum, RH.kcustnum, RH.custsnum,
RD.kpart, EQ.eqpstatus, RH.[program], GETDATE()
FROM
TEST.dbo.renthead RH
JOIN
TEST.dbo.rentdetl RD ON RH.kbranch = RD.kbranch
AND RH.kordnum = RD.kordnum
AND RH.program NOT LIKE 'OPSS%'
JOIN
TEST.dbo.equip EQ ON EQ.kequipnum = RD.kpart
WHERE
RD.kpart = (SELECT kequipnum FROM inserted);
END
The trigger works, but it appears to be causing problems and slowing down the actual user experience. Any help in tweaking what we have done is appreciated and if you have any questions, feel free to ask. Thanks.
You should join directly to the inserted pseudo-table (so the trigger also handles multi-row updates):
INSERT INTO [Workfiles].[dbo].[PickupAudit]
([HHBranch],[HHOrder],[HHCustomer],[HHShipTo],[EquipID],[EQStatus],[PickupNo],[StatusDate])
SELECT RH.kbranch, RH.kordnum, RH.kcustnum, RH.custsnum, RD.kpart, EQ.eqpstatus, RH.[program], GETDATE()
FROM TEST.dbo.renthead RH JOIN
TEST.dbo.rentdetl RD
ON RH.kbranch = RD.kbranch AND
RH.kordnum = RD.kordnum AND
RH.program NOT LIKE 'OPSS%' JOIN
TEST.dbo.equip EQ
ON EQ.kequipnum = RD.kpart JOIN
inserted i
ON RD.kpart = i.kequipnum;
For performance, you want indexes on the columns used in the JOINs, in this order:
TEST.dbo.rentdetl(kpart, kbranch, kordnum)
TEST.dbo.equip(kequipnum)
TEST.dbo.renthead(kbranch, kordnum, program)
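A rough sketch of those indexes (index names assumed; skip any that already exist):
CREATE INDEX IX_rentdetl_kpart ON TEST.dbo.rentdetl (kpart, kbranch, kordnum);
CREATE INDEX IX_equip_kequipnum ON TEST.dbo.equip (kequipnum);
CREATE INDEX IX_renthead_branch_ord ON TEST.dbo.renthead (kbranch, kordnum, [program]);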
The slowness is caused by the join statements; I think you are joining heavy tables, or tables that are under load, at the time the trigger fires.
The solution to get better performance is to create an "indexed view", not just a "view", and use it in the trigger; you will see a drastic effect.
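A rough sketch of what that could look like (view and index names assumed, and assuming kbranch/kordnum/kpart uniquely identify a detail row). Note that an indexed view must be schema-bound, must use two-part names, and cannot reference objects in another database, so it has to live alongside renthead and rentdetl:
CREATE VIEW dbo.vw_RentalEquipSource
WITH SCHEMABINDING
AS
SELECT RH.kbranch, RH.kordnum, RH.kcustnum, RH.custsnum, RH.[program], RD.kpart
FROM dbo.renthead RH
JOIN dbo.rentdetl RD
    ON RH.kbranch = RD.kbranch
   AND RH.kordnum = RD.kordnum;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_RentalEquipSource
    ON dbo.vw_RentalEquipSource (kbranch, kordnum, kpart);
GO
-- The trigger would then join inserted and equip to dbo.vw_RentalEquipSource
-- instead of joining renthead and rentdetl directly.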
My use case
Collect events for a particular duration and then group them based on the key.
Objective
After processing, the user can save the data of a particular duration based on the key.
How I am planning to do it
1) Receive events from Kafka
2) Create a data stream of events
3) Associate a table with it and collect data for a particular duration by running a SQL query
4) Associate a new table with the step 2 output and group the collected data according to the key
5) Save the data in the DB
Solution I tried
I am able to:
1) Receive events from Kafka,
2) Set up a data stream (let's say sensorDataStream):
DataStream<SensorEvent> sensorDataStream
    = source.flatMap(new FlatMapFunction<String, SensorEvent>() {
        @Override
        public void flatMap(String catalog, Collector<SensorEvent> out) {
            // create SensorEvent(id, sensor notification value, notification time)
        }
    });
3) Associate a table (let's say table1) with the data stream and run a SQL query like:
SELECT id, sensorNotif, notifTime FROM SENSORTABLE WHERE notifTime > t1_Timestamp AND notifTime < t2_Timestamp.
Here t1_Timestamp and t2_Timestamp are predefined epoch times and will change based on some predefined conditions.
4) I am able to print this SQL query result on the console by using:
tableEnv.toAppendStream(table1, Row.class).print();
5) Created a new table (let's say table2) from table1 using the following type of SQL query:
Table table2 = tableEnv.sqlQuery("SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount FROM table1 GROUP BY id");
6) Collecting and printing the data using:
tableEnv.toRetractStream(table2 , Row.class).print();
Problem
1) I am not able to see the output of step 6 on the console.
I did some experimenting and found that if I skip the table1 setup step (meaning no clubbing of sensor data over a duration) and directly associate my sensorDataStream with table2, then I can see the output of step 6. But as this is a retract stream, when a new event comes the retract stream invalidates the previous data and prints the newly calculated data.
Suggestions I would like
1) How can I merge step 5 and step 6 (meaning table1 and table2)? I have already chained these tables, but since the data is not visible on the console I have doubts. Am I doing something wrong? Or is the data merged but just not visible?
2) My plan is to:
2.a) Filter the data in two passes: in the first pass filter data for a particular interval, and in the second pass group this data.
2.b) Save the 2.a output in the DB.
Will this approach work? (I have doubts because I am using a data stream, and table1's output is an append stream while table2's output is a retract stream.)
I have created a SQL table to trace objects' operation history. It has two columns: the first is the object's own (self) tracing code, and the second is the incoming tracing code, i.e. the tracing code coming from the source object into the target. I created this to be able to look up the route of operations through the objects. You can see the tracing sample table below:
I need to write a SQL query that shows the whole route in one table. When I select a self code, it appears as the incoming code of previous rows. There may be more than one incoming code for a self code, and I want to be able to trace all of them, continuing until the search returns nothing.
I tried a SELECT query like the one below, but I am very new to SQL and need your help.
SELECT [TracingCode.Self],
[TracingCode.Incoming],
[EquipmentNo]
FROM [MKP_PROCESS_PRODUCT_REPORTS].[dbo].[ProductionTracing.Main]
WHERE [TracingCode.Self] = (SELECT [TracingCode.Incoming]
FROM [MKP_PROCESS_PRODUCT_REPORTS].[dbo].[ProductionTracing.Main]
WHERE [TracingCode.Self] = (SELECT [TracingCode.Incoming]
FROM [MKP_PROCESS_PRODUCT_REPORTS].[dbo].[ProductionTracing.Main]
WHERE [TracingCode.Self] = (SELECT [TracingCode.Incoming]
FROM [MKP_PROCESS_PRODUCT_REPORTS].[dbo].[ProductionTracing.Main]
WHERE [TracingCode.Self] = '028.001.19.2.3')));
To do this kind of parent/child traversal to any level, without explicitly coding each level, you need to use a recursive CTE.
More details here
https://www.red-gate.com/simple-talk/sql/t-sql-programming/sql-server-cte-basics/
Here is some test data and a solution I came up with. Note that three records actually match 028.001.19.2.3.
If this doesn't do what you need, please explain further with sample data.
DECLARE @Sample TABLE (
TC_Self CHAR(14) NOT NULL,
TC_In CHAR(14) NOT NULL,
EquipmentNo INT NOT NULL
);
INSERT INTO @Sample (TC_Self, TC_In, EquipmentNo)
VALUES
('028.001.19.2.3','026.003.19.2.2',96),
('028.001.19.2.3','026.001.19.2.2',96),
('028.001.19.2.3','026.002.19.2.2',96),
('028.001.19.2.2','026.002.19.2.1',96),
('028.001.19.2.2','026.002.19.2.1',96),
('028.001.19.2.1','026.002.19.1.1',96),
('026.003.19.2.2','024.501.19.2.5',117),
('024.501.19.2.5','024.501.19.2.6',999),
('024.501.19.2.6','024.501.19.2.7',998);
WITH CTE (RecordType, TC_Self, TC_In, EquipmentNo)
AS
(
-- This is the 'root'
SELECT 'Root' RecordType, TC_Self, TC_In, EquipmentNo FROM @Sample
WHERE TC_Self = '028.001.19.2.3'
UNION ALL
SELECT 'Leaf' RecordType, S.TC_Self, S.TC_In, S.EquipmentNo FROM @Sample S
INNER JOIN CTE
ON S.TC_Self = CTE.TC_In
)
SELECT * FROM CTE;
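For reference, against this sample data the query should return six rows along these lines (row order may differ):
RecordType  TC_Self          TC_In            EquipmentNo
Root        028.001.19.2.3   026.003.19.2.2   96
Root        028.001.19.2.3   026.001.19.2.2   96
Root        028.001.19.2.3   026.002.19.2.2   96
Leaf        026.003.19.2.2   024.501.19.2.5   117
Leaf        024.501.19.2.5   024.501.19.2.6   999
Leaf        024.501.19.2.6   024.501.19.2.7   998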
Also, please note that most of the time taken to produce this answer went into generating the sample data.
In future, when asking questions, people are far more likely to help if you post this sample data generation yourself.