I would like to include a condition in my Snowflake task so that it runs only if a specified table has data in it. This would be similar to the task condition:
WHEN SYSTEM$STREAM_HAS_DATA('my_schema.my_table')
Except I do not wish to use a stream. The problem with using a stream in some cases is that streams can go stale. I have tables in my ELT process that may not receive updates for weeks or months. Possibly even years.
One thought I had was to use a UDF in the task condition:
WHEN PUBLIC.TABLE_HAS_DATA('my_schema.my_table')
This would be great if I could put a SELECT CAST(COUNT(1) AS BOOLEAN) FROM "my_schema"."my_table" in there. But a SQL UDF cannot do anything with a table name that is passed in as a parameter, and a JavaScript UDF seems too limited when it comes to querying tables.
Admittedly, I am not a JavaScript programmer, nor am I too familiar with Snowflake's JavaScript UDF capabilities. I can perform the desired queries in a JavaScript stored procedure just fine, but those don't seem to translate over to UDFs.
Snowflake streams should only go stale if you don't do something with the change records within the retention period. As long as you have a task that processes the data in the stream (the change records) when they show up, you should be fine. So if you don't see a change show up in a stream for 6 months, that's fine, as long as you process that change record within your data retention period (14 days, as an example).
If your task has a SYSTEM$STREAM_HAS_DATA condition and the stream doesn't get data for 14 days, the stream can go stale, because a stream's offset only advances when the stream is consumed in a DML statement. You can work around this by removing the condition and letting the task run on its schedule regardless.
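A minimal sketch of that workaround (warehouse, schedule, and object names are illustrative): drop the WHEN clause and let the task run on its schedule; the INSERT is simply a no-op whenever the stream has nothing new, and the stream is consumed as soon as data does appear.

CREATE OR REPLACE TASK my_schema.load_my_target
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
AS
  INSERT INTO my_schema.my_target
  SELECT * FROM my_schema.my_stream;

ALTER TASK my_schema.load_my_target RESUME;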
SYSTEM$STREAM_HAS_DATA only applies to streams: https://docs.snowflake.com/en/sql-reference/functions/system_stream_has_data.html
Since streams can go stale, we can check the stale_after timestamp property returned by the SHOW STREAMS command (available since Snowflake 5.1.x, released in January 2021), so that we can promptly re-create streams that are about to go stale.
A solution to retrieve stale streams is provided here: Snowflake - How can I query the stream's metadata and save to table
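A minimal sketch of that check, assuming the streams live in my_schema and a seven-day warning window:

SHOW STREAMS IN SCHEMA my_schema;

-- streams whose stale_after falls within the next 7 days are candidates for re-creation
SELECT "name", "stale_after"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "stale_after" <= DATEADD(day, 7, CURRENT_TIMESTAMP());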
I need to do one-time historical data load, followed by incremental load every 10 minutes.
Is there a way to parameterize a Snowflake task to first run the historical load and then change the parameter to execute incremental loads? If not, can you suggest a better approach to handle historical (one-time) and incremental loads via tasks?
Note: The underlying table of the Snowflake stream contains historical records, and any new data arriving after the stream/tasks are implemented is considered incremental.
If you have a task call a stored procedure, you could have the stored procedure first check whether the target table is empty (or whatever check you want; as long as you can write it as code, it'll work. You could even have it insert a task run log into a separate table and check whether this is the first time it has run) and do the initial historical load in that case, and not otherwise.
Then the first time you run it, it will take one code path, and forever after it will take the other.
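A minimal sketch of that check using Snowflake Scripting (a JavaScript procedure works the same way); the table, stream, and procedure names are placeholders, and the incremental branch is assumed to read from a stream:

CREATE OR REPLACE PROCEDURE my_schema.load_my_target()
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
  row_count INTEGER;
BEGIN
  SELECT COUNT(*) INTO :row_count FROM my_schema.my_target;
  IF (row_count = 0) THEN
    -- first run: one-time historical load from the base table
    INSERT INTO my_schema.my_target SELECT * FROM my_schema.my_source;
    RETURN 'historical load';
  ELSE
    -- every later run: incremental load from the stream
    INSERT INTO my_schema.my_target SELECT * FROM my_schema.my_source_stream;
    RETURN 'incremental load';
  END IF;
END;
$$;

The task then just does CALL my_schema.load_my_target(); on its schedule.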
I'm new to Flink and I'm trying to use it to have a bunch of live views of my application. At least one of the dynamic views I'd like to build would show entries that have not met an SLA -- essentially expired entries -- and the condition for this would be a simple timestamp comparison. So I would basically want an entry to show up in my dynamic table if it has NOT been touched by an event recently. In playing around with Flink 1.6 (constrained to this due to AWS Kinesis) in a dev environment, I'm not seeing Flink re-evaluate a condition unless an event touches that entry.
I've got my dev environment plugged into a Kinesis stream that's sending in live access log events from a web server. This isn't my real use case but it was an easy one to begin testing with. I've written a simple table query that pulls in a request path, its last access time, and computes a boolean flag to indicate whether it hasn't been accessed in the last minute. I'm debugging this via a retract stream connected to PrintSinkFunction so all updates/deletes are printed to my console.
tEnv.registerDataStream("AccessLogs", accessLogs, "username, status, request, responseSize, referrer, userAgent, requestTime, ActionTime.rowtime");
Table paths = tEnv.sqlQuery("SELECT request AS path, MAX(requestTime) as lastTime, CASE WHEN MAX(requestTime) < CURRENT_TIMESTAMP - INTERVAL '1' MINUTE THEN 1 ELSE 0 END AS expired FROM AccessLogs GROUP BY request");
DataStream<Tuple2<Boolean, Row>> retractStream = tEnv.toRetractStream(paths, Row.class);
retractStream.addSink(new PrintSinkFunction<>());
I expect that when I access a page, an Add event is sent to this stream. Then if I wait 1 minute (do nothing), the CASE statement in my table will evaluate to 1, so I should see a Delete and then Add event with that flag set.
What I actually see is that nothing happens until I load that page again. The Delete event actually has the flag set, while the Add event that immediately follows it has the flag cleared again (as it should, since it's no longer "expired").
// add/delete, path, lastAccess, expired
(true,/mypage,2019-05-20 20:02:48.0,0) // first page load, add event
(false,/mypage,2019-05-20 20:02:48.0,1) // second load > 2 mins later, remove event for the entry with expired flag set
(true,/mypage,2019-05-20 20:05:01.0,0) // second load, add event
Edit: The most useful tip I've come across in my searching is to create a ProcessFunction. I think this is something I could make work with my dynamic tables (in some cases I'd end up with intermediate streams to look at computed dates), but hopefully it doesn't have to come to that.
I've gotten the ProcessFunction approach to work but it required a lot more tinkering than I initially thought it would:
I had to add a field to my POJO that changes in the onTimer() method (could be a date or a version that you simply bump each time)
I had to register this field as part of the dynamic table
I had to reference this field in my query in order for the query to be re-evaluated and the boolean flag updated (even though I don't actually use the new field for anything); I just added it to my SELECT clause.
Your approach looks promising but a comparison with a moving "now" timestamp is not supported by Flink's Table API / SQL (yet).
I would solve this in two steps.
Register the dynamic table in upsert mode, i.e., as a table that is upserted per key (request in your case) based on a version timestamp (requestTime in your case). The resulting dynamic table would hold the latest row for every request.
Have a query with a simple filter predicate like yours that compares the version timestamp of the rows of the dynamic (upsert) table and filters out all rows that have timestamps which are too close to now.
Unfortunately, neither of these features (upsert conversions and comparisons against a moving "now" timestamp) is available in Flink yet. There is some ongoing work on upsert table conversions, though.
I have data to load where I only need to pull records since the last time I pulled this data. There are no date fields to save this information in my destination table so I have to keep track of the maximum date that I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separately from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table, but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether or not the package failed/succeeded. You can then pull the max start time of the last successful execution.
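A minimal sketch of such an execution table and the lookup query (names and types are illustrative):

CREATE TABLE dbo.PackageExecutionLog (
    ExecutionId INT IDENTITY(1,1) PRIMARY KEY,
    PackageName NVARCHAR(200) NOT NULL,
    StartTime   DATETIME2     NOT NULL,
    EndTime     DATETIME2     NULL,
    Succeeded   BIT           NULL
);

-- at the start of a run, insert a row; at the end, update EndTime and Succeeded.
-- the next run then pulls its high-water mark like this:
SELECT MAX(StartTime)
FROM dbo.PackageExecutionLog
WHERE PackageName = 'MyLoadPackage'
  AND Succeeded = 1;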
You can't persist anything in a package between executions.
What you're talking about is a form of differential replication, and this has been done many, many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table, your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day: you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred, etc.
Is it viable to have a SQL table with only one row and one column?
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.
Using CDC on SQL Server 2012.
I have a table (MyTable) which is CDC enabled. I thought the following two queries would always return the same value:
SELECT MIN(__$start_lsn) FROM cdc.dbo_MyTable_CT;
SELECT sys.fn_cdc_get_min_lsn('dbo_MyTable');
But they don't seem to do so: in my case the first one returns 0x00001EC6000000DC0003 and the second one 0x00001E31000000750001, so the absolute minimum in the table is actually greater than the value returned by fn_cdc_get_min_lsn.
My questions:
Why are the results different?
Is there any problem with using the value from the first query as the first parameter on fn_cdc_get_all_changes_dbo_MyTable? (all examples I've seen use the value from the second query)
My understanding is that the first one returns the oldest LSN for the data that's currently in the change table, while the latter reflects when the table was added to CDC. I will note, though, that you'll only want to use the minimum (whichever method you go with) once, so you don't process duplicate records. Also, since the second method gets its result from cdc.change_tables (which very likely has far fewer rows than your change table does), it's going to be more efficient.
sys.fn_cdc_get_min_lsn returns the minimum available LSN for a change-captured table.
Like @Ben says, this can be different from (earlier than) the earliest change actually captured, for example when a table is first added to CDC and there haven't been any changes yet.
As per the MSDN doco, you should always use this to validate your query ranges prior to execution, because change data will eventually get cleaned up. So you will not use this only once - you will check it every time.
You should use this rather than getting the min LSN in other ways because:
it'll be faster (as Ben pointed out). Much faster potentially.
it's the documented API for doing so. The implementation of the backing tables might change in future versions etc...
The workflow is generally as follows (a code sketch comes after the list):
load your previous LSN from (your state)
query current LSN
query minimum available for the table
if prev > min available, load changes only
otherwise, load the whole table and handle it (somehow)
save current LSN to (your state)
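A minimal T-SQL sketch of that workflow, assuming a hypothetical dbo.EtlState table for persisting your state:

-- load your previous LSN from your own state table (dbo.EtlState is hypothetical)
DECLARE @prev_lsn BINARY(10) =
    (SELECT last_lsn FROM dbo.EtlState WHERE capture_instance = 'dbo_MyTable');
-- query the current and minimum available LSNs
DECLARE @max_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();
DECLARE @min_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_MyTable');

IF @prev_lsn IS NULL OR @prev_lsn < @min_lsn
    -- no usable state, or CDC has already cleaned up part of the range: reload the whole table
    SELECT * FROM dbo.MyTable;
ELSE IF @prev_lsn < @max_lsn
    -- the saved position is still inside the validity interval: load only the changes
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(sys.fn_cdc_increment_lsn(@prev_lsn), @max_lsn, N'all');
-- (when @prev_lsn = @max_lsn there is nothing new to load)

-- save the current LSN back to your state
UPDATE dbo.EtlState SET last_lsn = @max_lsn WHERE capture_instance = 'dbo_MyTable';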
Is it possible to effectively tail a database table such that when a new row is added an application is immediately notified with the new row? Any database can be used.
Use an ON INSERT trigger.
You will need to check the specifics of how to call external applications with the values contained in the inserted record, or you can write your 'application' as a SQL procedure and have it run inside the database.
It sounds like you will want to brush up on databases in general before you paint yourself into a corner with your command-line approaches.
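For example, a minimal sketch in SQL Server syntax (the table, key column, and queue table are illustrative): the trigger copies each new row's key into a queue table that the application watches.

CREATE TABLE dbo.NewRowQueue (
    id        INT       NOT NULL,
    queued_at DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
GO

CREATE TRIGGER trg_MyTable_AfterInsert
ON dbo.MyTable
AFTER INSERT
AS
BEGIN
    -- queue the key of every newly inserted row for the application to pick up
    INSERT INTO dbo.NewRowQueue (id)
    SELECT id FROM inserted;
END;

The application then polls dbo.NewRowQueue and deletes rows as it processes them.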
Yes, if the database is a flat text file and appends are done at the end.
Yes, if the database supports this feature in some other way; check the relevant manual.
Otherwise, no. Databases tend to be binary files.
I am not sure, but this might work for primitive/flat-file databases. As far as I understand (and I could be wrong), modern database files are stored in binary (and sometimes encrypted) formats, so reading a newly added row would not work with that command.
I would imagine most databases allow for write triggers, and you could have a script that triggers on write that tells you some of what happened. I don't know what information would be available, as it would depend on the individual database.
There are a few options here, some of which others have noted:
Periodically poll for new rows. With the way MVCC works, though, it's possible to miss a row if there were two INSERTs mid-transaction when you last queried.
Define a trigger function that will do some work for you on each insert. (In Postgres you can call a NOTIFY command that other processes can LISTEN to.) You could combine a trigger with writes to an unpublished_row_ids table to ensure that your tailing process doesn't miss anything; the tailing process would then delete IDs from the unpublished_row_ids table as it processed them. (A sketch of this option is shown below.)
Hook into the database's replication functionality, if it provides any. This should have a means of guaranteeing that rows aren't missed.
I've blogged in more detail about how to do all these options with Postgres at http://btubbs.com/streaming-updates-from-postgres.html.
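A minimal Postgres sketch of the trigger option (table, column, and channel names are illustrative):

-- holds IDs the tailing process has not published yet
CREATE TABLE unpublished_row_ids (id BIGINT PRIMARY KEY);

CREATE OR REPLACE FUNCTION notify_new_row() RETURNS trigger AS $$
BEGIN
    -- record the new row so the tailing process cannot miss it
    INSERT INTO unpublished_row_ids (id) VALUES (NEW.id);
    -- wake up any listening process immediately
    PERFORM pg_notify('new_rows', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_notify_new_row
AFTER INSERT ON my_table
FOR EACH ROW EXECUTE PROCEDURE notify_new_row();

The tailing process runs LISTEN new_rows; and deletes IDs from unpublished_row_ids as it handles them, so anything that arrived while it was down is still waiting in the table.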
tail on Linux appears to use inotify to tell when a file changes; it probably uses similar filesystem notification frameworks on other operating systems. So it does detect file modifications.
That said, tail performs an fstat() call after each detected change and will not output anything unless the size of the file increases. Modern DB systems use random file access and reuse DB pages, so it's very possible that an inserted row will not cause the backing file size to change.
You're better off using inotify (or similar) directly, and even better off if you use DB triggers or whatever mechanism your DBMS offers to watch for DB updates, since not all file updates are necessarily row insertions.
I was just in the middle of posting the same exact response as glowcoder, plus another idea:
The low-tech way to do it is to have a timestamp field and have a program run a query every n minutes looking for records whose timestamp is greater than that of the last run. The same concept works by storing the last key seen if you use a sequence, or even by adding a boolean "processed" field.
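A minimal sketch of that low-tech poll, assuming a last-modified timestamp column and a :last_run placeholder supplied by the polling program:

-- run every n minutes; :last_run is the highest timestamp seen by the previous run
SELECT *
FROM my_table
WHERE updated_at > :last_run
ORDER BY updated_at;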
With Oracle you can select a pseudo-column called rowid that gives a unique identifier for the row in the table, and rowids are ordinal: new rows get assigned rowids that are greater than any existing rowids.
So, first select max(rowid) from table_name
I assume that one cause for the raised question is that there are many, many rows in the table... so this first step will be taxing the db a little and take some time.
Then, select * from table_name where rowid > 'whatever_that_rowid_string_was'
You still have to run the query periodically, but it is now a quick and inexpensive one.