Will a Snowflake stream be flushed if we do select count(*) from the stream?

A stream has been created on a table, and it becomes stale after some time. The stream is queried with a WHERE condition that is false most of the time, and count(*) is used to check the number of rows in the stream. Will that flush out the stream?
select count(*) from MY_STREAM where false;
With the above SQL, the stream becomes stale after some time.

A stream advances its offset only when it is used in a DML transaction, so SELECT COUNT(*) does not consume the stream.
It most probably became stale because the MAX_DATA_EXTENSION_TIME_IN_DAYS limit was reached (default 14 days).
To view the current staleness status of a stream, execute the DESCRIBE STREAM or SHOW STREAMS command.
The STALE_AFTER column timestamp indicates when the stream is currently predicted to become stale (or when it became stale, if the timestamp is in the past). This is the extended data retention period for the source object.
This timestamp is calculated by adding the larger of the DATA_RETENTION_TIME_IN_DAYS or MAX_DATA_EXTENSION_TIME_IN_DAYS parameter setting for the source object to the current timestamp. Consuming the change data for a stream moves the STALE_AFTER timestamp forward. Note that reading from the stream could succeed for some time after the STALE_AFTER timestamp. However, the stream might become stale at any time during this period. The STALE column indicates whether the stream is currently expected to be stale, though the stream may not actually be stale yet.
For more information, have a look here.
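As a rough illustration of the STALE_AFTER rule quoted above, here is a minimal Python sketch. It only reproduces the arithmetic "larger of the two settings, added to a timestamp"; the function name and the sample values are hypothetical, and this is not how Snowflake computes the column internally.

```python
from datetime import datetime, timedelta


def predicted_stale_after(reference_time, data_retention_days, max_data_extension_days):
    """Add the larger of the two settings (in days) to the reference timestamp."""
    extension_days = max(data_retention_days, max_data_extension_days)
    return reference_time + timedelta(days=extension_days)


# Hypothetical values: 1-day retention, 14-day maximum extension.
reference = datetime(2021, 3, 1, 12, 0, 0)
print(predicted_stale_after(reference, 1, 14))  # 2021-03-15 12:00:00
```

With these assumed settings, the 14-day extension dominates, which matches the answer's guess that the stream went stale after roughly 14 days without being consumed.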

Related

How can I advance the offset of a Snowflake stream without a DML?

Snowflake streams are really cool, but I can't seem to figure out how to apply them to my use case. I have an external process that I want to use to detect changes to rows in my tables and notify other consumers that this happened. There are several ways to do this, but the nice thing about streams is that they can keep track of the last time they were asked, and that would provide a clean way to track the offset and prevent dupes or gaps. For example, an alternative that uses explicit time travel would need to externally keep track of the last time it ran the query, including accounting for clock skew between Snowflake and the process.
However, the offset only seems to move up if you push the data in it to another table. Meaning, this changes the offset:
insert into other_table select * from my_stream
but this does not:
select * from my_stream
In my case, I don't need the data in another table. I could insert into a temp table or something to create the side effect of advancing the offset, but that seems wasteful and messy. Is there some alternative I'm missing? Some way to "bump" the stream?
You can use a where clause that selects no rows. It will still consume the stream.
insert into other_table select * from my_stream where false;
If you're never going to use the rows in the stream and only use it to detect whether changes have occurred since the last consumption point, you may also consider replacing the stream.
create or replace stream my_stream on table my_table;
That will allow you to check for changes on the old stream and start with a new one when you do whatever needs to happen when the old one reports changed rows.
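The difference between reading a stream and consuming it can be pictured with a toy model. The Python sketch below is only an illustration of the behavior described in this thread (a plain SELECT leaves the offset alone, a DML statement advances it even when its WHERE clause matches nothing); the class and method names are hypothetical and nothing here talks to Snowflake.

```python
class ToyStream:
    """Toy model of a stream: a change log plus an offset. Not Snowflake's implementation."""

    def __init__(self):
        self.changes = []  # every change recorded on the source table
        self.offset = 0    # index of the first change not yet consumed

    def record_change(self, row):
        self.changes.append(row)

    def select(self):
        """Plain SELECT: returns pending changes and leaves the offset alone."""
        return self.changes[self.offset:]

    def consume_in_dml(self, predicate=lambda row: True):
        """DML reading the stream: the offset advances even if no row matches."""
        pending = self.changes[self.offset:]
        self.offset = len(self.changes)
        return [row for row in pending if predicate(row)]


stream = ToyStream()
stream.record_change({"id": 1})
stream.record_change({"id": 2})

print(len(stream.select()))  # 2
print(len(stream.select()))  # 2, selecting twice changes nothing

inserted = stream.consume_in_dml(lambda row: False)  # like "... where false"
print(len(inserted), len(stream.select()))  # 0 0
```

In this model, the "where false" insert writes no rows but still moves the offset, which is the side effect the question was looking for.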

Snowflake Task Condition: When Table Has Data

I would like to include a condition in my Snowflake task to run only if a specified table has data in it. This would be similar to task condition:
WHEN SYSTEM$STREAM_HAS_DATA('my_schema.my_table')
Except I do not wish to use a stream. The problem with using a stream in some cases, is that streams can go stale. I have tables in my ELT process that may not receive updates for weeks or months. Possibly even years.
One thought I had was to use a UDF in the task condition:
WHEN PUBLIC.TABLE_HAS_DATA('my_schema.my_table')
This would be great if I could throw a SELECT CAST(COUNT(1) AS BOOLEAN) FROM "my_schema"."my_table" in there. But a SQL UDF will not be able to do anything with a table name that is passed as a parameter. And a JavaScript UDF seems too limiting when it comes to querying tables.
Admittedly, I am not a JavaScript programmer. Nor am I too familiar with Snowflake's JavaScript UDF abilities. I can perform the desired queries in a JavaScript Stored Proc just fine. But those don't seem to translate over to UDFs.
Snowflake Streams should only go stale if you don't do something with the data within its set retention period. As long as you have a task to process data in the stream (change records) when they show up you should be fine. So if you don't see a change show up in a Stream for 6 months, that's fine as long as you process that change record within your data retention period (14 days as an example).
If your task has a STREAM_HAS_DATA condition and the stream doesn't get data for 14 days, the stream will go stale, because a stream's offset only advances when the stream is consumed and a skipped task never consumes it. You can work around this issue by removing the condition and letting the task run more often.
SYSTEM$STREAM_HAS_DATA applies only to streams: https://docs.snowflake.com/en/sql-reference/functions/system_stream_has_data.html.
As streams can go stale, we can check the stale_after timestamp property returned by the SHOW STREAMS command (available since Snowflake 5.1.x, which was released in January 2021) so that we can promptly re-create streams that are about to go stale.
A solution to retrieve stale streams is provided here: Snowflake - How can I query the stream's metadata and save to table
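To make the "re-create streams that are about to go stale" idea concrete, here is a small Python sketch. It assumes the output of SHOW STREAMS has already been fetched into a list of dictionaries with "name" and "stale_after" keys; that row shape, the function name, and the sample data are assumptions made for illustration, not a Snowflake client API.

```python
from datetime import datetime, timedelta


def streams_near_stale(stream_rows, now, warn_within):
    """Return names of streams whose stale_after is in the past or within warn_within of now."""
    names = []
    for row in stream_rows:
        if row["stale_after"] - now <= warn_within:
            names.append(row["name"])
    return names


# Hypothetical rows, shaped like a subset of SHOW STREAMS output.
rows = [
    {"name": "ORDERS_STREAM", "stale_after": datetime(2021, 3, 15, 12, 0)},
    {"name": "AUDIT_STREAM", "stale_after": datetime(2021, 3, 3, 9, 0)},
    {"name": "OLD_STREAM", "stale_after": datetime(2021, 2, 20, 0, 0)},
]
now = datetime(2021, 3, 1, 12, 0)
print(streams_near_stale(rows, now, timedelta(days=2)))  # ['AUDIT_STREAM', 'OLD_STREAM']
```

The returned names are the candidates to consume or re-create; what to do with them is left to the surrounding process.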

Is there a way to define a Dynamic Table comprised of entries that have NOT been touched by an event recently?

I'm new to Flink and I'm trying to use it to have a bunch of live views of my application. At least one of the dynamic views I'd like to build would be to show entries that have not met an SLA -- or essentially expired -- and the condition for this would be a simple timestamp comparison. So I would basically want an entry to show up in my dynamic table if it has NOT been touched by an event recently. In playing around with Flink 1.6 (constrained to this due to AWS Kinesis) in a dev environment, I'm not seeing that Flink is re-evaluating a condition unless an event touches that entry.
I've got my dev environment plugged into a Kinesis stream that's sending in live access log events from a web server. This isn't my real use case but it was an easy one to begin testing with. I've written a simple table query that pulls in a request path, its last access time, and computes a boolean flag to indicate whether it hasn't been accessed in the last minute. I'm debugging this via a retract stream connected to PrintSinkFunction so all updates/deletes are printed to my console.
tEnv.registerDataStream("AccessLogs", accessLogs, "username, status, request, responseSize, referrer, userAgent, requestTime, ActionTime.rowtime");
Table paths = tEnv.sqlQuery("SELECT request AS path, MAX(requestTime) as lastTime, CASE WHEN MAX(requestTime) < CURRENT_TIMESTAMP - INTERVAL '1' MINUTE THEN 1 ELSE 0 END AS expired FROM AccessLogs GROUP BY request");
DataStream<Tuple2<Boolean, Row>> retractStream = tEnv.toRetractStream(paths, Row.class);
retractStream.addSink(new PrintSinkFunction<>());
I expect that when I access a page, an Add event is sent to this stream. Then if I wait 1 minute (do nothing), the CASE statement in my table will evaluate to 1, so I should see a Delete and then Add event with that flag set.
What I actually see is that nothing happens until I load that page again. The Delete event actually has the flag set, while the Add event that immediately follows it has the flag cleared again (as it should, since it's no longer "expired").
// add/delete, path, lastAccess, expired
(true,/mypage,2019-05-20 20:02:48.0,0) // first page load, add event
(false,/mypage,2019-05-20 20:02:48.0,1) // second load > 2 mins later, remove event for the entry with expired flag set
(true,/mypage,2019-05-20 20:05:01.0,0) // second load, add event
Edit: The most useful tip I've come across in my searching is to create a ProcessFunction. I think this is something I could make work with my dynamic tables (in some cases I'd end up with intermediate streams to look at computed dates), but hopefully it doesn't have to come to that.
I've gotten the ProcessFunction approach to work but it required a lot more tinkering than I initially thought it would:
I had to add a field to my POJO that changes in the onTimer() method (could be a date or a version that you simply bump each time)
I had to register this field as part of the dynamic table
I had to use this field in my query in order for the query to get re-evaluated and change the boolean flag (even though I don't actually use the new field). I just added it as part of my SELECT clause.
Your approach looks promising but a comparison with a moving "now" timestamp is not supported by Flink's Table API / SQL (yet).
I would solve this in two steps.
Register the dynamic table in upsert mode, i.e., a table that is upserted per key (request in your case) based on a version timestamp (requestTime in your case). The resulting dynamic table would hold the latest row for every request.
Have a query with a simple filter predicate like yours that compares the version timestamp of the rows of the dynamic (upsert) table and filters out all rows that have timestamps which are too close to now.
Unfortunately, neither of these features (upsert conversions and comparisons against the moving "now" timestamp) is available in Flink yet. There is some ongoing work on upsert table conversions, though.
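The two steps above can be sketched outside of Flink to show the intended semantics. The Python below is a plain in-memory model, not Flink code: the function names are hypothetical, the "upsert table" is just a dictionary keyed by request, and "now" is passed in explicitly because something (a timer, for example) has to re-run the filter when no event arrives.

```python
from datetime import datetime, timedelta


def upsert(latest_by_request, request, request_time):
    """Step 1: keep only the newest timestamp per key (the upsert table)."""
    current = latest_by_request.get(request)
    if current is None or request_time > current:
        latest_by_request[request] = request_time


def expired_requests(latest_by_request, now, max_age):
    """Step 2: filter against a moving 'now'; must be re-run as time passes."""
    return sorted(r for r, t in latest_by_request.items() if now - t > max_age)


table = {}
one_minute = timedelta(minutes=1)
upsert(table, "/mypage", datetime(2019, 5, 20, 20, 2, 48))

print(expired_requests(table, datetime(2019, 5, 20, 20, 3, 0), one_minute))  # []
print(expired_requests(table, datetime(2019, 5, 20, 20, 4, 0), one_minute))  # ['/mypage']

upsert(table, "/mypage", datetime(2019, 5, 20, 20, 5, 1))  # second page load
print(expired_requests(table, datetime(2019, 5, 20, 20, 5, 30), one_minute))  # []
```

The second call shows the behavior the question expected: the entry becomes "expired" purely because time advanced, without any new event for that key.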

CommitLog Recovery with Cassandra

I have noted following statement in the Cassandra documentation on commit log archive configuration:
https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configLogArchive.html
"Restore stops when the first client-supplied timestamp is greater than the restore point timestamp. Because the order in which the database receives mutations does not strictly follow the timestamp order, this can leave some mutations unrecovered."
This statement made us concerned about using point in time recovery based on Cassandra commit logs, since this indicates a point in time recovery will not recover all mutations with timestamp lower than the indicated restore point timestamp if we have mutations out of timestamp order (which we will have).
I tried to verify this behavior via some experiments but have not been able to reproduce this behavior.
I did 2 experiments:
Simple row inserts
Set restore_point_in_time to 1 hour ahead in time.
insert 10 rows (using default current timestamp)
insert a row using timestamp <2 hours ahead in time>
insert 10 rows (using default current timestamp)
Now I killed my cassandra instance making sure it was terminated without having a chance to flush to SS tables.
During startup I could see from cassandra logs that it was doing CommitLog replay.
After replay I queried my table and could see that 20 rows had been recovered but the one with the timestamp ahead of time was not inserted. Based on the documentation, though, I would have expected that only the first 10 rows had been inserted. I verified in the Cassandra log that CommitLog replay had been done.
Larger CommitLog split experiment
I wanted to see if the documented feature then was working over a commitlog split/rollover.
So I set commitlog_segment_size_in_mb to 1 MB to cause the commitlog to rollover more frequently instead of the 32MB default.
I then ran a script to mass insert rows to force the commit log to split.
So the result here was that I inserted 12000 records, then inserted a record with a timestamp ahead of my restore_point_in_time, and then inserted 8000 records afterwards.
At about 13200 rows my commitlog rolled over to a new file.
I then again killed my Cassandra instance and restarted. Again I could see in the log that CommitLog replay was being done, and after replay I could see that all rows except the single row with a timestamp ahead of restore_point_in_time were recovered.
Notes
I did similar experiments using commitlog_sync batch option and also to make sure my rows had not been flushed to SSTables I tried restoring snapshot with empty tables before starting up cassandra to make it perform commitlog replay. In all cases I got the same results.
I guess my question is if the statement in the documentation is still valid? or maybe I'm missing something in my experiments?
Any help would be greatly appreciated. I need an answer to this to be able to decide on a backup/recovery mechanism we want to implement in a larger-scale Cassandra cluster setup.
All experiments were done using Cassandra 3.11 (single-node setup) in a Docker container (the official cassandra docker image). I ran the experiments on the image "from scratch", so no changes in configs were made other than what I included in the description here.
I think this will be relatively hard to reproduce, as you'll need to make sure that some of the mutations arrive later than others. This happens mostly when some clients do not have synchronized clocks, or when nodes are overloaded and hints are replayed some time later, etc.
But this parameter may not be required at all: if you look into CommitLogArchiver.java, you can see that if this parameter is not specified, it is set to Long.MAX_VALUE, meaning that there is no upper bound and all commit logs will be replayed. Cassandra then handles conflicts the standard way: "the latest timestamp wins".
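The gap between the documented behavior and the observed one can be illustrated with two toy replay loops. The Python below is only a sketch under simplifying assumptions (mutations are (key, value, timestamp) tuples in arrival order, and conflicts are resolved by "latest timestamp wins"); both functions are hypothetical and neither is taken from Cassandra's source.

```python
def _apply(table, key, value, ts):
    """Last-write-wins: keep the value with the newest timestamp per key."""
    if key not in table or ts >= table[key][1]:
        table[key] = (value, ts)


def replay_skip_late(mutations, restore_point):
    """Skip mutations past the restore point, keep replaying the rest (what the experiments observed)."""
    table = {}
    for key, value, ts in mutations:
        if ts > restore_point:
            continue
        _apply(table, key, value, ts)
    return table


def replay_stop_at_first_late(mutations, restore_point):
    """Stop at the first mutation past the restore point (what the documentation describes)."""
    table = {}
    for key, value, ts in mutations:
        if ts > restore_point:
            break
        _apply(table, key, value, ts)
    return table


# Arrival order: the third mutation carries a timestamp far in the future.
log = [("a", 1, 100), ("b", 2, 105), ("c", 3, 999), ("d", 4, 110)]
print(sorted(replay_skip_late(log, 500)))           # ['a', 'b', 'd']
print(sorted(replay_stop_at_first_late(log, 500)))  # ['a', 'b']
```

The experiments in the question correspond to the first function's result: everything except the single future-dated row came back.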

When a transaction is rolled back in timestamp ordering protocol why is it given a new timestamp?

When a transaction is rolled back in timestamp ordering protocol, why is it given a new timestamp?
Why don't we retain the old timestamp?
If you are talking of a scheduler whose operation is timestamp-based, and a rolled-back transaction were allowed to "re-enter the scheduling queue" with its 'old' timestamp, then the net effect might be that the scheduler immediately gives the highest priority to any request coming from that transaction, and the net effect of THAT might be that whatever problem caused that transaction to roll back, re-appears almost instantaneously, perhaps causing a new rollback, which causes a new "re-entering the schedule queue", etc. etc.
Or the net effect of that "immediately re-entering the queue" could be that all other transactions are stalled.
Think of a queue of people in the post office, where someone has a request that cannot be served, and that person is allowed to immediately re-enter the queue at the front (instead of at the back). How long would it then take before it is your turn?
Because there could be other transactions that have already committed with a newer timestamp.
Initial timestamp is at X
Transaction T1 starts
T1 allocates a timestamp, incrementing the value to X+1
Transaction T2 starts
T2 allocates a timestamp, incrementing the value to X+2
T2 commits
T1 rolls back
If T1 were to roll the timestamp back to X, then a third transaction would generate a conflict with T2's allocated value. The same goes for increments and sequences. If you need strictly consecutive sequence values (no gaps), then the transactions have to serialize, and this comes at the price of dismal performance.
In a timestamp ordering protocol, the timestamp assigned to the transaction when starting is used to identify potential conflicts with other transactions. These could be transactions that updated an object this transaction is trying to read or transactions that read the value this transaction is trying to overwrite. As a result, when a transaction is aborted and restarted (i.e. to maintain serializability), then all the operations of the transaction will be executed anew and this is the reason a new timestamp needs to be assigned.
From a theoretical perspective, rerunning the operations again while the transaction is still using the old timestamp would be incorrect & unsafe, since it would be reading/overwriting new values while thinking it's situated in an older moment in time. From a practical perspective, if the transaction keeps using the old timestamp, most likely it will keep aborting & restarting continuously, since it will keep conflicting with the same transactions again and again.
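The "keeps aborting with the old timestamp" argument can be seen in a minimal Python sketch of basic timestamp ordering. It tracks only a read timestamp and a write timestamp per item and does not model undoing a transaction's effects; all names are hypothetical and the retry helper exists purely for illustration.

```python
import itertools


class Aborted(Exception):
    """Raised when an operation arrives too late in timestamp order."""


class Item:
    def __init__(self, value):
        self.value = value
        self.read_ts = 0   # largest timestamp that has read this item
        self.write_ts = 0  # largest timestamp that has written this item


def read(item, ts):
    if ts < item.write_ts:  # a younger transaction already overwrote the value
        raise Aborted()
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item, ts, value):
    if ts < item.read_ts or ts < item.write_ts:
        raise Aborted()
    item.value = value
    item.write_ts = ts


clock = itertools.count(1)  # source of increasing timestamps


def attempts_until_success(item, ts, fresh_timestamp, max_attempts=5):
    """Retry a read; return the attempt number that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        try:
            read(item, ts)
            return attempt
        except Aborted:
            if fresh_timestamp:
                ts = next(clock)  # restart as a "new" transaction
    return None


x = Item(0)
t1 = next(clock)  # T1 starts first
t2 = next(clock)  # T2 starts second
write(x, t2, 42)  # T2 writes x before T1 reads it

print(attempts_until_success(x, t1, fresh_timestamp=False))  # None
print(attempts_until_success(x, t1, fresh_timestamp=True))   # 2
```

With the old timestamp, every retry hits the same conflict against T2's write; with a fresh timestamp the restarted transaction is ordered after T2 and the read succeeds.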
