Snowflake Task dependents

Let's say I have 2 tasks (file_feed_tsk and tbl_merge_tsk) with a dependency:
tbl_merge_tsk is scheduled to run after file_feed_tsk completes.
When I run this SQL:
select *
from table(information_schema.task_dependents(task_name => 'src_db.src_schema.file_feed_tsk', recursive => true)) ;
It displays both the root and child tasks:
name, predecessor
file_feed_tsk,null
tbl_merge_tsk,file_feed_tsk
I copied these 2 rows into a separate table. Now I need to add a successor column to the table, like:
name, predecessor, successor
file_feed_tsk, null, tbl_merge_tsk
tbl_merge_tsk, file_feed_tsk, null
How do we accomplish this?
Thanks
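One way to get there is a self-join that, for each task, looks up the row that names it as its predecessor. A minimal sketch, assuming the two rows were copied into a table called task_deps with columns name and predecessor (adjust the names to your copy):

-- task_deps / task_deps_with_successor are hypothetical names for illustration
create or replace table task_deps_with_successor as
select
    d.name,
    d.predecessor,
    s.name as successor   -- the task (if any) that lists d.name as its predecessor
from task_deps d
left join task_deps s
    on s.predecessor = d.name;

Note that a task with several successors would fan out into one row per successor here.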

Related

How to reset SQLite autoincrement while using Flyway in Unit test

Consider the code below in a unit test, where I add a new Tag object in a pre-populated SQLite database.
@Test // Line 1
public void add() {
    Tag tagToAdd = new Tag("Tall");
    Tag addedTag = this.tagDao.add(tagToAdd);
    assertNotNull(addedTag);
    assertEquals(3L, addedTag.getId()); // Line 6
    assertEquals(tagToAdd.getTag(), addedTag.getTag());
    List<Tag> tags = this.tagDao.get();
    assertEquals(3, tags.size());
}
On line 6, I expect the ID of the Tag to be 3, because the field is an AUTOINCREMENT and the test is initialized with a database already containing 2 Tags. This works fine every time I run the test and the ID is always 3.
Now I am integrating Flyway into the project. Every time I run the test, the AUTOINCREMENT starts from the value of the last run, so the Tag ID increments by 1 every run and the test fails.
Any idea how I can get Flyway to always reset the database to a brand new state, and reset the AUTOINCREMENT value? I could write a query to do it manually, but this is not maintainable.
What I have tried so far:
Integrated @FlywayTest, as this executes the Flyway clean task
Defined a FlywayMigrationStrategy bean, which contains flyway.clean()
Set spring.flyway.clean-on-validation-error to true in my application.properties (that said, there was no change in my SQL, so I'm not sure this changed anything)
-- Edit
My 1st migration script contains the below.
DROP TABLE IF EXISTS Tag;
CREATE TABLE Tag(
id INTEGER PRIMARY KEY AUTOINCREMENT,
tag VARCHAR(255) NOT NULL UNIQUE,
createdDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
modifiedDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
If I understood everything correctly: you have a database and a table that is created once, the same table is used for tests every time, you just delete rows from the table (without dropping it) when tests are completed (or before starting the next tests), and Flyway simply inserts two tags into this table every time you run the tests.
If that's right, you can just reset the sequence in SQLite so the ids start over. You can do it by running the following query:
UPDATE `sqlite_sequence` SET `seq` = 1 WHERE `name` = 'tags_table_name';
Alternatively, you can set seq to 0; SQLite will then use the next available value (1 if there are no rows in the table, otherwise one more than the largest existing id).
Yet another possibility is simply to drop your table after tests and recreate it before running the next tests. Since it is a database and table used only for tests, this should work fine, and the sequence counter goes back to 1 each time. I would actually go this way unless you have a really good reason not to drop the table.
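Applied to the Tag table from the migration script above, a between-tests cleanup could look like this (a minimal sketch, assuming you keep the table and only clear its rows):

DELETE FROM Tag;
UPDATE sqlite_sequence SET seq = 0 WHERE name = 'Tag';

With an empty table and seq reset to 0, the next insert gets id 1 again.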

Streams + tasks missing inserts?

We've setup a stream on a table that is continuously loaded via snowpipe.
We're consuming this data with a task that runs every minute and merges into another table. There is a possibility of duplicate keys, so we use a ROW_NUMBER() window function ordered by the file created timestamp descending, keeping row_num = 1. This way we always get the latest insert.
Initially we used a standard task with the merge statement, but we noticed that in some instances, since Snowpipe does not guarantee loading in the order the files were staged, we were updating rows with older data. As such, in the WHEN MATCHED section we added a condition so a row is only updated when the incoming file created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the matched clause would interfere with the not matched clause.
My theory was that the extra clause added a bit of time to the task run, so some runs were skipped or the next run happened almost immediately after the last one completed. The idea being that the missing rows were caught in the middle and the stream offset advanced before they could be consumed.
As such, we changed the task to call a stored procedure which uses an explicit transaction. We did this because the docs seem to suggest that using a transaction will lock the stream. However, even with this, we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 out of 100,000s.
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
I am not fully sure what is going wrong here, but Snowflake gives a recommendation about the file created time: the file created timestamp is calculated in the cloud services layer and may be slightly different from what you expect. There is another recommendation related to Snowpipe and data ingestion: the queue service takes about a minute to consume the data from the pipe, and if a lot of data is flowing in within that minute, you may end up with this issue. Look at your implementation and check whether pushing data at one-minute intervals solves the issue, and don't rely on the file create time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.
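Part of the appeal of IS DISTINCT FROM here is that it is NULL-safe: columns that are NULL on one or both sides still produce TRUE or FALSE instead of NULL. A quick illustration (an assumed example, not from the original answer):

select null is distinct from null as is_distinct_result,  -- FALSE
       null <> null               as inequality_result;   -- NULL, so a plain <> comparison would silently skip the update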

How to use copy Storage Integration in a Snowflake task statement?

I'm testing Snowflake. To do this I created a Snowflake instance on GCP.
One of the tests is to try the daily load of data from a STORAGE INTEGRATION.
To do that I had generated the STORAGE INTEGRATION and the stage.
I tested the copy:
copy into DEMO_DB.PUBLIC.DATA_BY_REGION from @sg_gcs_covid pattern='.*data_by_region.*'
and all goes fine.
Now it's time to test the daily scheduling with the task statement.
I created this task:
CREATE TASK schedule_regioni
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 42 18 9 9 * Europe/Rome'
COMMENT = 'Test Schedule'
AS
copy into DEMO_DB.PUBLIC.DATA_BY_REGION from @sg_gcs_covid pattern='.*data_by_region.*';
And I enabled it:
alter task schedule_regioni resume;
I got no errors, but the task doesn't load data.
To resolve the issue I had to put the copy in a stored procedure and insert a call to the stored procedure instead of the copy:
DROP TASK schedule_regioni;
CREATE TASK schedule_regioni
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON 42 18 9 9 * Europe/Rome'
COMMENT = 'Test Schedule'
AS
call sp_upload_c19_regioni();
The question is: is this desired behavior or an issue (as I suppose)?
Can someone give me some information about this?
I've just tried (but with a storage integration and stage on AWS S3) and it works fine using the copy command directly in the SQL part of the task, without calling a stored procedure.
In order to start investigating the issue, I would check the following (for debugging, I would create the task with a schedule of every few minutes):
check task_history and verify executions
select *
from table(information_schema.task_history(
scheduled_time_range_start=>dateadd('hour',-1,current_timestamp()),
result_limit => 100,
task_name=>'YOUR_TASK_NAME'));
If the previous step is successful, check copy_history and verify that the input file name, target table and number of records/errors are the expected ones
SELECT *
FROM TABLE (information_schema.copy_history(TABLE_NAME => 'YOUR_TABLE_NAME',
start_time=> dateadd(hours, -1, current_timestamp())))
ORDER BY 3 DESC;
Check if the results are the same as the ones you get when the task with the SP call is executed.
Please also confirm that you are loading new files not yet loaded into your table with the COPY command (otherwise you need to specify the FORCE = TRUE parameter in the copy command, or remove the load metadata by truncating your target table, in order to reload the same files).
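For reference, a sketch of that COPY with FORCE enabled, reusing the stage and table names from the question (only do this if you really want to reload files that were already loaded, since it will duplicate the rows):

copy into DEMO_DB.PUBLIC.DATA_BY_REGION
  from @sg_gcs_covid
  pattern = '.*data_by_region.*'
  force = TRUE;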

SSIS Package: how to successfully do a foreach or for loop to auto-increment a value for a field insert?

First of all, I have never attempted something like this in SSIS and I am very new to SSIS package development.
I need to build a component in my package that will run through a table of data (say 80 rows) and set a field titled DisplayOrder to an auto-incremented number. The catch is that one of the records HAS to be set to 0 and the rest of the records set to the auto-incremented number.
In regards to code, I am not even sure what code to attach to this question or even what screenshots.
I finally figured it out and there is no need for a loop.
Create a SQL Task to clear the linked Table.
Script Used
DELETE FROM [Currency].[ExchangeRates]
Create a SQL Task to clear the main table.
Script Used
DELETE FROM [Currency].[CurrencyList]
Load the values into the main table.
Actions Used
Load values from XML Source
Dump values to [ExchangeRates] Table
Create a SQL Task to load the Values from the main table to the linked table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT [er].[TargetCurrency] AS [CurrencyCode],
       [er].[TargetName] AS [CurrencyName],
       [er].[ID] AS [ExchangeRateID],
       ROW_NUMBER() OVER (ORDER BY [ER].[TargetName]) AS [DisplayOrder]
FROM [Currency].[ExchangeRates] AS [er]
ORDER BY [CurrencyName]
Create a SQL Task to load a new record to the main table for use as DisplayOrder 0.
Script Used
INSERT INTO [Currency].[ExchangeRates] ([Title], [Link], [Description], [PubDate], [BaseCurrency], [TargetCurrency], [TargetName], [ExchangeRate])
VALUES ('1 USD = 1 USD',
        'http://www.floatrates.com/usd/usd/',
        '1 U.S. Dollar = 1 U.S. Dollar',
        (SELECT TOP 1 [PubDate] FROM [Currency].[ExchangeRates]),
        'USD', 'USD', 'United States Dollar', '1')
Create a SQL Task to reference the newly created record from the main table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT [er].[TargetCurrency] AS [CurrencyCode],
       [er].[TargetName] AS [CurrencyName],
       [er].[ID] AS [ExchangeRateID],
       0 AS [DisplayOrder]
FROM [Currency].[ExchangeRates] AS [er]
WHERE [er].[TargetCurrency] = 'USD'
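As a hedged alternative (a sketch, assuming the USD row has already been inserted into [Currency].[ExchangeRates] by the previous step), the whole load of [CurrencyList] could be done in a single statement by forcing USD to sort first and shifting ROW_NUMBER() down by one:

INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder)
SELECT [er].[TargetCurrency] AS [CurrencyCode],
       [er].[TargetName] AS [CurrencyName],
       [er].[ID] AS [ExchangeRateID],
       -- USD sorts first and gets ROW_NUMBER() = 1, so subtracting 1 yields DisplayOrder 0
       ROW_NUMBER() OVER (ORDER BY CASE WHEN [er].[TargetCurrency] = 'USD' THEN 0 ELSE 1 END,
                                   [er].[TargetName]) - 1 AS [DisplayOrder]
FROM [Currency].[ExchangeRates] AS [er];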

How could we design a database for the tables Job, Subjob and Contract?

The relationships between the tables are as follows:
1 Job may contain 0-M Subjobs = 1 : M
A Subjob may have 0-M Contracts, and a Contract may belong to 0-M Subjobs = M : M
The tables I designed are:
Job :JobID
Subjob:SubjobID
Contract:ContractID
Subjob_Contract:SubjobID,ContractID
The problem I am facing is this:
when we want to view a Job together with its Contracts, in case the Job doesn't have a Subjob, how can a Contract be linked to the Job?
I would eliminate the distinction between a Job and a SubJob from the table structure. You could use the SubJob table as a link to other Jobs; then you only need a Job_Contract reference.
SubJob would then contain the link between all Jobs and their SubJobs.
Subjobs:
parent_job_id -- Reference to parent Job
job_id -- Previously, your SubJobId
Example:
Select * from subjobs where subjobs.parent_job_id = {jobid};
This returns the set of all subjobs of a job; their job_ids are themselves job ids.
This way you can reference a contract from any job.
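A minimal DDL sketch of this suggestion (table and column names are illustrative, not prescribed by the answer):

-- Every job, including subjobs, lives in one table
CREATE TABLE Job (
    JobID INT PRIMARY KEY
);

-- A subjob is just a Job linked to a parent Job
CREATE TABLE Subjob (
    parent_job_id INT NOT NULL REFERENCES Job(JobID),
    job_id        INT NOT NULL REFERENCES Job(JobID),
    PRIMARY KEY (parent_job_id, job_id)
);

CREATE TABLE Contract (
    ContractID INT PRIMARY KEY
);

-- Contracts always attach to a Job, whether or not that Job appears in Subjob
CREATE TABLE Job_Contract (
    JobID      INT NOT NULL REFERENCES Job(JobID),
    ContractID INT NOT NULL REFERENCES Contract(ContractID),
    PRIMARY KEY (JobID, ContractID)
);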
