Stream Snowflake table updates - snowflake-cloud-data-platform

There is a huge ETL process on the Snowflake side that updates a table. I need to stream the changes made to it to other consumers/processors outside of Snowflake, not by querying the table for updates but by streaming them (push pattern, not pull).
I see there is a Kafka connector and examples for streaming data into Snowflake. Is there a way to stream data out of it? Maybe using CDC + Streams + Tasks to push to some queue?
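For reference, here is a minimal sketch of what the Streams + Tasks route could look like: a stream captures the changes, a scheduled task drains it into a change-log table, and a child task unloads that log to an external stage whose bucket notifications (e.g. S3 -> SQS/SNS) push the files to consumers outside Snowflake. All object names, the warehouse, and the storage integration are illustrative assumptions.

    -- Capture row-level changes on the table written by the ETL process
    CREATE OR REPLACE STREAM etl_table_changes ON TABLE etl_table;

    -- External stage backed by a bucket your consumers (or a queue) watch
    CREATE OR REPLACE STAGE etl_changes_stage
      URL = 's3://my-bucket/etl-changes/'
      STORAGE_INTEGRATION = my_s3_integration;  -- assumed to exist already

    -- Landing table; inserting into it consumes (advances) the stream
    CREATE OR REPLACE TABLE etl_change_log (
      record        VARIANT,        -- full row image as JSON
      change_action STRING,         -- METADATA$ACTION from the stream
      change_ts     TIMESTAMP_NTZ
    );

    -- Parent task: drain the stream on a schedule, only when it has data
    CREATE OR REPLACE TASK drain_etl_changes
      WAREHOUSE = my_wh
      SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('ETL_TABLE_CHANGES')
    AS
      INSERT INTO etl_change_log
      SELECT OBJECT_CONSTRUCT(*), METADATA$ACTION, CURRENT_TIMESTAMP()
      FROM etl_table_changes;

    -- Child task: unload the drained changes as JSON files to the stage
    -- (a real pipeline would also flag or purge the exported rows)
    CREATE OR REPLACE TASK unload_etl_changes
      WAREHOUSE = my_wh
      AFTER drain_etl_changes
    AS
      COPY INTO @etl_changes_stage
      FROM (SELECT record FROM etl_change_log)
      FILE_FORMAT = (TYPE = JSON);

    ALTER TASK unload_etl_changes RESUME;
    ALTER TASK drain_etl_changes RESUME;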

Related

Flink - Postgres CDC connector - custom query

I am working on a Flink application with a Postgres DB as a source: it reads certain configuration data, converts it into a data stream, and then joins it with an incoming real-time data stream.
I have tried the Postgres CDC connector, and I am able to read a single table, deserialize it into a POJO, and use it further.
However, my requirement is to read from multiple tables using a join condition in the CDC source itself and then convert the result into a data stream. Can we write a custom query in the source? I have not found a way to do this yet; the only solution I can think of is to create multiple CDC sources separately and join them (sketched below) before finally joining with the incoming real-time data. Can someone help here?
Regards,
Swapnil
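To illustrate that workaround: the question uses the DataStream API, but the idea is easiest to sketch in Flink SQL, where one postgres-cdc table is declared per source table and the join happens inside Flink. All connection options and names below are illustrative assumptions.

    -- One CDC-backed table per Postgres source table
    CREATE TABLE cfg_a (
      id   INT,
      name STRING,
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector'     = 'postgres-cdc',
      'hostname'      = 'pg-host',
      'port'          = '5432',
      'username'      = 'flink',
      'password'      = 'secret',
      'database-name' = 'appdb',
      'schema-name'   = 'public',
      'table-name'    = 'cfg_a',
      'slot.name'     = 'flink_cfg_a'   -- each source needs its own replication slot
    );

    CREATE TABLE cfg_b (
      id        INT,
      threshold DOUBLE,
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector'     = 'postgres-cdc',
      'hostname'      = 'pg-host',
      'port'          = '5432',
      'username'      = 'flink',
      'password'      = 'secret',
      'database-name' = 'appdb',
      'schema-name'   = 'public',
      'table-name'    = 'cfg_b',
      'slot.name'     = 'flink_cfg_b'
    );

    -- The join happens in Flink, not in the CDC source
    CREATE VIEW config_joined AS
    SELECT a.id, a.name, b.threshold
    FROM cfg_a AS a
    JOIN cfg_b AS b ON a.id = b.id;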
Could you solve your problem the other way around, by reading your incoming real-time data stream and then performing a lookup against the Postgres DB via the JDBC connector? The CDC connector is meant for monitoring changes happening in tables and sending each change into Flink; I don't think it's possible to perform any joining inside the CDC connector upfront.
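A rough sketch of that lookup-join alternative in Flink SQL, assuming a Kafka-backed event stream with a processing-time attribute and a JDBC-backed dimension table (all names and options are illustrative assumptions):

    -- Incoming real-time stream, with a processing-time attribute for the lookup
    CREATE TABLE events (
      event_id  STRING,
      config_id INT,
      payload   STRING,
      proc_time AS PROCTIME()
    ) WITH (
      'connector' = 'kafka',
      'topic'     = 'events',
      'properties.bootstrap.servers' = 'broker:9092',
      'format'    = 'json',
      'scan.startup.mode' = 'latest-offset'
    );

    -- Postgres configuration table exposed via the JDBC connector for lookups
    CREATE TABLE config_dim (
      id   INT,
      name STRING,
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector'  = 'jdbc',
      'url'        = 'jdbc:postgresql://pg-host:5432/appdb',
      'table-name' = 'config',
      'username'   = 'flink',
      'password'   = 'secret',
      'lookup.cache.max-rows' = '10000',
      'lookup.cache.ttl'      = '10 min'
    );

    -- Lookup join: each event fetches the current config row at processing time
    SELECT e.event_id, e.payload, c.name
    FROM events AS e
    JOIN config_dim FOR SYSTEM_TIME AS OF e.proc_time AS c
      ON e.config_id = c.id;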

How to make a Flink job DAG run in serial sequence instead of in parallel

I am using the Table/SQL API.
Job operations:
Read from Kafka, look up against SQL to enrich, and then upsert into a SQL table.
After the upsert, flow the same events to a Kafka sink.
But the upsert into MySQL and the insert into Kafka happen in parallel, whereas we want the serial order below.
The flow we want:
LOOKUP KAFKA AND SQL -> UPSERT SQL -> INSERT KAFKA
The flow we are getting:
LOOKUP KAFKA AND SQL -> UPSERT SQL // INSERT KAFKA (parallel)
That can't be done in a straightforward way. Sinks are terminal nodes in the job graph.
You could, however, use the async I/O operator to do the upsert and arrange for it to only emit events downstream to the Kafka sink after the upsert has completed.
Or you could have a second job that ingests a CDC stream from SQL and inserts into Kafka.
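A sketch of that second-job option in Flink SQL, using the mysql-cdc source and an upsert-kafka sink so that the changelog produced by the CDC source can be written to Kafka (all names and options are illustrative assumptions):

    -- Changelog of the MySQL table that the first job upserts into
    CREATE TABLE enriched_cdc (
      id      BIGINT,
      payload STRING,
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector'     = 'mysql-cdc',
      'hostname'      = 'mysql-host',
      'port'          = '3306',
      'username'      = 'flink',
      'password'      = 'secret',
      'database-name' = 'appdb',
      'table-name'    = 'enriched'
    );

    -- Kafka sink that accepts the update/delete changelog, keyed by primary key
    CREATE TABLE enriched_topic (
      id      BIGINT,
      payload STRING,
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector' = 'upsert-kafka',
      'topic'     = 'enriched',
      'properties.bootstrap.servers' = 'broker:9092',
      'key.format'   = 'json',
      'value.format' = 'json'
    );

    -- Events reach Kafka only after they are visible in MySQL
    INSERT INTO enriched_topic SELECT id, payload FROM enriched_cdc;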

How to have Flink checkpoint/savepoint backups in multiple data centers

I have a Flink application that will run on a node in DC-1 (Data Center 1), and we are planning to back up savepoint and checkpoint state to HDFS or Amazon S3. However, the HDFS and S3 offerings supported in my org do not replicate data written in DC-1 to DC-2 (they are working on it, but the timeline is long). With this in mind, is there a way to have Flink itself write checkpoints/savepoints to both data centers somehow?
Thanks
As far as I know there is no such mechanism in Flink. Usually it's not the data processing pipeline's responsibility to ensure that data gets backed up. The easiest workaround would be to create a cron job that periodically copies checkpoints to DC-2.

Is it possible to BULK load data from a Kafka queue directly to SQL Server?

SQL Server offers BULK INSERT functionality. You can see that this reads from e.g. a CSV file and inserts into a table.
My understanding is that this has clear drawbacks when working with Kafka:
you would have to take the Kafka message and transform it to CSV
you would have to take the Kafka message and, after the transformation in the previous step, write it to disk so that BULK INSERT can access the file
My question is how to overcome the above drawbacks; something about this whole process looks wrong. What is most worrying to me is the second drawback, writing to disk. Would I be able to write a file to memory and then execute BULK INSERT over it?
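For context, this is roughly the pattern the question describes: the consumer would first have to materialize the Kafka messages as a CSV file the server can reach, and only then run something like the following (path, table, and options are illustrative assumptions):

    -- T-SQL: load a CSV file that has already been written to disk
    BULK INSERT dbo.events
    FROM 'D:\staging\events_batch_0001.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,   -- skip the header row
        TABLOCK
    );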
Sure, it's "possible", but ideally you wouldn't use this BULK INSERT method from a CSV.
Instead, you can use the Kafka Connect JDBC sink, which acts as a Kafka consumer, buffers records in memory (not as a file), and then issues regular INSERT INTO table VALUES queries.
If you only want to be able to query Kafka data with SQL functions, then you don't need to load the data into a relational database at all - you can use ksqlDB or PrestoDB, for example.
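For the last point, a small ksqlDB sketch of querying a Kafka topic with SQL, without any relational database involved (topic name and columns are illustrative assumptions):

    -- Expose the Kafka topic as a SQL-queryable stream
    CREATE STREAM orders_raw (
        order_id VARCHAR,
        status   VARCHAR,
        amount   DOUBLE
    ) WITH (
        KAFKA_TOPIC  = 'orders',
        VALUE_FORMAT = 'JSON'
    );

    -- Continuously maintained aggregate, queryable directly from ksqlDB
    CREATE TABLE orders_by_status AS
        SELECT status, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders_raw
        GROUP BY status;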

Best practices for keeping a custom table up-to-date from data derived from other Snowflake DBs in our warehouse

(Submitting on behalf of a Snowflake User)
We have a database that stores raw data from all our local sources. My team has its own environment in which we have full permissions to create standardized feeds and/or tables/views etc. that are ready to consume through Power BI. A few additional details:
The final 'feed' tables are derived through SQL statements, and most pull from more than one of our 'raw' tables.
Raw table data is updated daily.
My question is: what is the best operation to keep the tables fully updated, and what is the standard workflow for it? Our current understanding is that one of these approaches is best:
Using COPY INTO <stage> then COPY INTO <table>.
Using STREAMS to add incremental data.
Using PIPES (may be the same as STREAMS)
Or simplify our feeds to single-table sources and use materialized views.
We'd ideally like to avoid views to improve consumption speed at the Power BI level.
Tasks have been recommended and seem like a good fit, since the final tables only need to be updated once per day (see the sketch after this question).
(https://docs.snowflake.net/manuals/sql-reference/sql/create-task.html)
Any other recommendations??? THANKS!
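For what it's worth, the simplest task-based version of this is a once-daily full refresh of each feed table from its source SQL; the names and schedule below are illustrative assumptions.

    -- Rebuild the feed table once per day, after the raw loads have finished
    CREATE OR REPLACE TASK refresh_feed_orders
      WAREHOUSE = transform_wh
      SCHEDULE  = 'USING CRON 0 6 * * * UTC'
    AS
      INSERT OVERWRITE INTO feed.orders_summary
      SELECT o.order_id, o.order_date, c.customer_name, o.amount
      FROM raw.orders o
      JOIN raw.customers c ON c.customer_id = o.customer_id;

    ALTER TASK refresh_feed_orders RESUME;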
We have a similar scenario where our raw data lake tables are updated in real time from files in S3. Those raw tables are loaded via Snowpipe using the auto-ingest feature.
In turn, we have a data mart which contains facts about the raw data. To update the data mart, we have created streams on top of the raw tables to track changes. We then use tasks run at a given frequency (every five minutes in our case) to update the data mart from the changed data in the raw tables. Using streams allows us to limit processing to only changed data, without having to track last update dates, etc.
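A minimal sketch of that streams-plus-tasks pattern, assuming a raw table loaded by Snowpipe and a single fact table in the data mart (all names are illustrative assumptions, and hard deletes are ignored for brevity):

    -- Track changes on the Snowpipe-loaded raw table
    CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw.orders;

    -- Fold only the changed rows into the data mart every five minutes
    CREATE OR REPLACE TASK refresh_fact_orders
      WAREHOUSE = transform_wh
      SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
    AS
      MERGE INTO mart.fact_orders f
      USING (
        SELECT order_id, customer_id, amount
        FROM raw_orders_stream
        WHERE METADATA$ACTION = 'INSERT'   -- new rows plus the new image of updated rows
      ) s
      ON f.order_id = s.order_id
      WHEN MATCHED THEN UPDATE SET f.customer_id = s.customer_id, f.amount = s.amount
      WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount)
        VALUES (s.order_id, s.customer_id, s.amount);

    ALTER TASK refresh_fact_orders RESUME;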
