Can I perform transformations using Snowflake Streams? - snowflake-cloud-data-platform

Currently I have a Snowflake table being updated from a Kafka connector in near real time. I want to take these new data entries, also in near real time, through something such as Snowflake CDC / Snowflake Streams and append some additional fields. Some of these fields will track max values within a certain time period (probably via a window function), and others will pull values from static tables where static_table.id = realtime_table.id.
The final goal is to perform these transformations and transfer them to a new presentation level table, so I have both a source table and a presentation level table, with little latency between the two.
Is this possible with Snowflake Streams? Or is there a combination of tools snowflake offers that can be used to achieve this goal? Due to a number of outside constraints it is important that this can be done within the snowflake infrastructure.
Any help would be much appreciated :).
I have considered the use of a materialised view, but am concerned regarding costs / latency.

The goal of Streams - together with Tasks - is to get the transformations done that you are asking for.
This is a quickstart to start growing your Streams and Tasks abilities:
https://quickstarts.snowflake.com/guide/getting_started_with_streams_and_tasks/
On the 6th step you can see a task that would transform the data as it arrives:
create or replace task REFINE_TASK
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
  SCHEDULE = '4 minute'
  COMMENT = '2. ELT Process New Transactions in Landing/Staging Table into a more Normalized/Refined Table (flattens JSON payloads)'
when
  SYSTEM$STREAM_HAS_DATA('CC_TRANS_STAGING_VIEW_STREAM')
as
  insert into CC_TRANS_ALL (select
      card_id, merchant_id, transaction_id, amount, currency, approved, type, timestamp
    from CC_TRANS_STAGING_VIEW_STREAM);
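To make the two kinds of extra fields from the question concrete, here is a rough Python sketch of the logic the task body would need to express: a lookup against a static table by id, and a max over a trailing time window. It is not Snowflake code; the names `STATIC_TABLE`, `WINDOW`, and `enrich_batch`, the 10-minute window, and the row layout are all assumptions made for illustration.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical static lookup table keyed by id (stands in for static_table).
STATIC_TABLE = {1: "sensor-north", 2: "sensor-south"}

WINDOW = timedelta(minutes=10)  # assumed window length


def enrich_batch(new_rows, history):
    """Append a looked-up label and a rolling max to each new row.

    new_rows: list of dicts with "id", "ts" (datetime) and "value", ordered by ts.
    history:  dict mapping id -> deque of (ts, value) pairs seen so far;
              it is updated in place so the next batch can reuse it.
    """
    enriched = []
    for row in new_rows:
        recent = history.setdefault(row["id"], deque())
        recent.append((row["ts"], row["value"]))
        # Drop readings that have fallen out of the time window.
        while recent and recent[0][0] < row["ts"] - WINDOW:
            recent.popleft()
        enriched.append({
            **row,
            # Equivalent of joining on static_table.id = realtime_table.id.
            "label": STATIC_TABLE.get(row["id"]),
            # Max value for this id within the trailing window.
            "max_in_window": max(value for _, value in recent),
        })
    return enriched
```

In Snowflake itself this would more likely be a join plus a window function inside the task's INSERT or MERGE statement, reading from the stream.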

Related

PyFlink: how to set parallelism when using SQL and Table API?

I have a processing topology using PyFlink and SQL where there is data skew. I'm splitting a stream of heterogeneous data into separate streams based on the type of data in it, and some of these substreams have far more events than others. This is causing issues when checkpointing (checkpoints are timing out). I'd like to increase parallelism for these problematic streams, but I'm not sure how to do that and target just those elements. Do I need to use the DataStream API here? What does this look like, please?
I have a table defined and I duplicate a stream from that table, then filter so that my substream has only the events I'm interested in:
events_table = table_env.from_path(MY_SOURCE_TABLE)
filtered_table = events_table.filter(
col("event_type") == "event_of_interest"
)
table_env.create_temporary_view(MY_FILTERED_VIEW, filtered_table)
# now execute SQL on MY_FILTERED_VIEW
table_env.execute_sql(...)
The default parallelism of the overall table env is 1. Is there a way to increase the parallelism for just this stream?

Building CDC in Snowflake

My company is migrating to Snowflake from SQL Server 2017, and I am looking to build historical data tables that capture delta changes. In SQL Server, these would be maintained by stored procedures, where old records get expired (on a change to the data) and a new row is inserted with the updated data. This design allows dynamic retrieval of historical data at any point in time.
My question is, how would I migrate this design to Snowflake? From what I read about procedures, they're more like UDTs or scalar functions (SQL equivalent), but in JavaScript...
Below is a brief example of how we are doing CDC for tables in SQL.
Would a data pipeline cover this? If anyone knows a good tutorial site for Snowflake 101 (not the official Snowflake documentation, it's terrible), it would be appreciated.
Thanks
update h
set h.expiration_date = t.effective_date
from data_table_A_history h
join data_table_A as t
  on h.account_id = t.account_id
where h.expiration_date is null
  and (
    (isnull(t.person_name,'x') <> isnull(h.person_name,'x')) or
    (isnull(t.person_age,0) <> isnull(h.person_age,0))
  )
---------------------------------------------------------------------
insert into data_table_A_history (account_id,person_name,person_age)
select
  account_id,person_name,person_age
from
  data_table_A t
  left join data_table_A_history h
    on t.account_id = h.account_id
    and h.expiration_date is null
where
  h.account_id is null
Table streams are Snowflake's CDC solution.
You can set up multiple streams on a single table, and each one tracks changes to the table from a particular point in time. This point in time moves once you consume the data in the stream, with the new starting point being the time you consumed the data. Consumption here means using the stream in a DML statement, for example to upsert another table or to insert the data into a log table. Simple SELECT statements do not consume the data.
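The offset behaviour described above can be modelled with a small Python toy. This is only a mental model, not how Snowflake implements streams; the `Table` and `Stream` classes and their method names are made up for illustration.

```python
class Table:
    """Toy append-only table: just a list of rows."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)


class Stream:
    """Toy stream: remembers an offset into the table's row list."""

    def __init__(self, table):
        self.table = table
        # Track changes from the moment the stream is created.
        self.offset = len(table.rows)

    def select(self):
        # A plain read returns the pending changes and leaves the offset alone.
        return list(self.table.rows[self.offset:])

    def consume_into(self, target):
        # Using the stream in DML moves the offset forward to "now".
        changes = self.select()
        target.extend(changes)
        self.offset = len(self.table.rows)
        return len(changes)


staging = Table()
merge_stream = Stream(staging)  # feeds the target table
log_stream = Stream(staging)    # independent second stream on the same table

staging.insert({"id": 1})
staging.insert({"id": 2})

target = []
merge_stream.select()              # does not consume anything
merge_stream.consume_into(target)  # consumes both rows for this stream only
```

After this runs, `merge_stream` is empty while `log_stream` still sees both rows, which is why a second stream works for keeping a separate change log.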
A pipeline could be something like this: Snowpipe->staging table->stream on staging table->task with SP->merge/upsert target table
If you wanted to keep a log of the changes, you could set up a second stream on the staging table and consume it by inserting the data into another table.
Another trick, if you didn't want to use a second stream, is to amend your SP so that, before it consumes the data, it runs a select on the stream and then immediately runs
INSERT INTO my_table select * from table(result_scan(last_query_id()))
This does not consume the stream or change the offset, and it leaves the stream data available to be consumed by another DML operation.
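For the history-table design in the question, the merge step at the end of such a pipeline has to do the same two things as the original UPDATE and INSERT: expire the open row when a tracked column changed, and open a new row for new or changed accounts. Here is a minimal Python sketch of that logic. The function name `apply_changes` is hypothetical, and passing `effective_date` as a parameter is an assumption (the original SQL reads it from the source table).

```python
def apply_changes(history, current, effective_date):
    """Expire changed rows in `history` and append new versions.

    history: list of dicts with account_id, person_name, person_age,
             expiration_date (None means the row is the open version).
    current: list of dicts with account_id, person_name, person_age.
    """
    # Index the open (not yet expired) version of each account.
    open_rows = {h["account_id"]: h for h in history
                 if h["expiration_date"] is None}
    for row in current:
        old = open_rows.get(row["account_id"])
        changed = old is not None and (
            old["person_name"] != row["person_name"]
            or old["person_age"] != row["person_age"]
        )
        if changed:
            old["expiration_date"] = effective_date  # expire the old version
        if old is None or changed:
            # Open a new version for new or changed accounts.
            history.append({**row, "expiration_date": None})
    return history
```

Unlike the original pair of statements, this opens the new version of a changed account in the same pass as expiring the old one. In Snowflake the equivalent would typically be a MERGE (or an UPDATE plus INSERT) run by the task against the stream.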

how to Copy from big table to another table in snowflake?

I have a table of roughly 7 TB in Snowflake, and I want to copy half of that table to a new table, for example with a country filter. What technique would you recommend: insert into ... select * from TABLE where COUNTRY = 'A', or use Snowpipe to send Parquet files to S3 and then COPY INTO the Snowflake target table?
I tried the first option. After 5 hours the process was at 35%. I read a post where someone had to scale the cluster up to an XL instance, and another post that said Snowpipe is a good option. My cluster is only an XS :(
By the way, the table has a cluster key, and the goal is to segment the data by country because of company policy.
The original table holds events from devices that have the app installed: about 30 events per session minute, for an app like Uber or Lyft.
An MV will definitely be more performant than a standard view, but there is an extra cost associated with it, as Snowflake has to keep the MV in sync with the table. It sounds like the table will be changing rapidly, so this cost will be continuous.
Another option is to create a stream on the source table and use a task to merge the stream data into the target table. Tasks require a running warehouse, but I've found that an XS warehouse is very capable, so at minimum you're talking 24 credits per day. Tasks also have a minimum 1-minute interval, so if you need bleeding-edge latency, that might rule this option out.
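As a back-of-the-envelope check on the 24 credits figure, here is a small Python estimate. It assumes an X-Small warehouse at 1 credit per hour, a 60-second billing minimum each time the warehouse resumes, and (optimistically) that the warehouse suspends right after each run; real auto-suspend idle time would add to this, so treat the result as a lower bound. The function name and parameters are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MIN_BILLED_SECONDS = 60  # assumed billing minimum per warehouse resume


def estimate_daily_credits(interval_minutes, run_seconds, credits_per_hour=1):
    """Rough lower bound on daily credits for a scheduled task."""
    runs_per_day = (24 * 60) // interval_minutes
    # Each run is billed for at least the minimum, even if it finishes sooner.
    billed_per_run = max(run_seconds, MIN_BILLED_SECONDS)
    # A warehouse cannot be billed for more than the whole day.
    billed_seconds = min(runs_per_day * billed_per_run, SECONDS_PER_DAY)
    return billed_seconds / 3600 * credits_per_hour


every_minute = estimate_daily_credits(interval_minutes=1, run_seconds=10)
every_four_minutes = estimate_daily_credits(interval_minutes=4, run_seconds=10)
```

With a 1-minute schedule the warehouse effectively never suspends, which gives the full 24 credits per day; under these assumptions a 4-minute schedule with short runs comes out at about 6.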

Perform multiple inserts per POST request

We have a scenario where each insert happens per id_2 for a given id_1, for the schema below, in Cassandra:
CREATE TABLE IF NOT EXISTS my_table (
id_1 UUID,
id_2 UUID,
textDetails TEXT,
PRIMARY KEY (id_1, id_2)
);
A single POST request body has the details for multiple values of id_2. This triggers multiple inserts per single POST request on single table.
Each INSERT query is performed as shown below:
insertQueryString = "INSERT INTO my_table (id_1, id_2, textDetails) " + "VALUES (?, ?, ?) IF NOT EXISTS"
cassandra.Session.Query(insertQueryString,
id1,
id2,
myTextDetails).Exec();
1. Does Cassandra ensure data consistency across multiple inserts on a single table, per POST request? Each POST request is processed on a goroutine (thread). Subsequent GET requests should retrieve consistent data (inserted through POST).
Using BATCH statements causes "Batch too large" issues in staging and production. https://github.com/RBMHTechnology/eventuate/issues/166
2. We have two data centres (for Cassandra), with 3 replica nodes per data centre.
Which consistency levels need to be set for the write query (POST request) and the read query (GET request) to ensure full consistency?
There are multiple problems here:
Batching should be used very carefully in Cassandra: only if you're inserting data into the same partition. If you insert data into multiple partitions, then it's better to use separate queries executed in parallel (but you can collect multiple entries per partition key and batch them).
You're using IF NOT EXISTS, and it's done against the same partition. As a result, it leads to conflicts between multiple nodes (see the documentation on lightweight transactions), and it requires reading data from disk, so it heavily increases the load on the nodes. But do you really need to insert data only if the row doesn't exist? What is the problem if the row already exists? It's easier just to overwrite data in Cassandra when doing an INSERT, because that won't require reading data from disk.
Regarding consistency level: QUORUM (or SERIAL for LWTs) will give you strong consistency, but at the expense of increased latency (because you need to wait for an answer from the other DC) and reduced fault tolerance: if you lose the other DC, all your queries will fail. In most cases LOCAL_QUORUM is enough (LOCAL_SERIAL in the case of LWTs), and it provides fault tolerance. I recommend reading this whitepaper on best practices for building fault-tolerant applications on top of Cassandra.
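The fault-tolerance point can be checked with a little arithmetic. A quorum is a majority of the replicas counted (all replicas across DCs for QUORUM, only the local DC's replicas for LOCAL_QUORUM), and reads are guaranteed to see the latest write when read acks plus write acks exceed the number of replicas counted. The Python sketch below uses the 2 DCs with 3 replicas each from the question; the DC names and helper functions are hypothetical.

```python
def quorum(replicas):
    """Majority of the given number of replicas."""
    return replicas // 2 + 1


def overlaps(write_acks, read_acks, replicas):
    """True if every read set must intersect every write set."""
    return write_acks + read_acks > replicas


replicas_per_dc = {"dc1": 3, "dc2": 3}  # hypothetical DC names
total_replicas = sum(replicas_per_dc.values())

global_quorum = quorum(total_replicas)         # acks needed for QUORUM
local_quorum = quorum(replicas_per_dc["dc1"])  # acks needed for LOCAL_QUORUM

# If one DC is lost, only 3 replicas remain reachable.
quorum_survives_dc_loss = replicas_per_dc["dc1"] >= global_quorum
local_quorum_survives_dc_loss = replicas_per_dc["dc1"] >= local_quorum
```

Here QUORUM needs 4 of 6 replicas, so it cannot be satisfied from a single surviving DC, while LOCAL_QUORUM needs 2 of 3. Note that the overlap guarantee for LOCAL_QUORUM only holds when the reads and writes go through the same DC.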

Data import via Stored procedure or Triggers

We have a legacy system (MAS200, if you need to know), and there's an old VBS script which pulls data from MAS and populates two staging tables in our production SQL database. After some processing / cleanup, that data goes into the actual tables.
Data flow : MAS200 --> Staging tables --> Production table
To simplify consider there's an "Order" parent table and an "Items" child table. Order can have multiple items, each item record will have an FK OrderId. So, during import first we import the Order data and create an entry in the "Order" table and then fetch "Items" entries and import them.
Existing TRIGGER based approach -
At present we have two TRIGGERs, one on each staging table (Order & Items). Each new insert is tapped, and after processing the data a new entry is inserted into the actual production table. My only concern is that the trigger is executed for each Items entry instead of doing a BULK insert. It also seems less manageable.
SP based approach -
If I remove both TRIGGERs, I can import data into the staging tables and finally execute an SP which will import the Order data and then perform a BULK insert into the Items table. Could that be more efficient / faster?
It's not really a comparison, just a different design. I'd like to know which one seems better, or whether there's a better third approach to import from MAS to the production SQL db.
EDIT 1 : Thanks. As asked by many, the data volume is not big or too frequent. Let's say 10-12 Orders (with 20-30 Items) every hour. Also, with TRIGGERs, though we don't get a TRANSACTION, two simple TRIGGERs suffice. I believe more scripting is needed with an SP.
Goal : Need to keep it as simple, clean and efficient as possible.
Using Triggers:
Pros:
The data sync is real time. As you create data by data entry, the volume of data should not be big, so a bulk insert doesn't improve things much. Performance using triggers is good enough.
Cons:
Data sync is not real time and if the connection breaks between MAS200 and production, you'll have a big problem. Also (as you mentioned) you can not have transaction, which is a big issue.
I suggest you use an SP to transfer data on a time-interval basis (if you can tolerate the synchronization delay).
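A rough sketch of what that SP-style batch import does, using Python's built-in sqlite3 as a stand-in for SQL Server: parents first, then a set-based insert of the children, all in one transaction, then clear the staging tables. The table and column names (`staging_orders`, `staging_items`, `orders`, `items`, and so on) are assumptions for illustration, not the real MAS200 schema.

```python
import sqlite3


def import_staged(conn):
    """Move staged orders and items into production in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        # Parents first, so the child rows' foreign keys resolve.
        conn.execute(
            "INSERT INTO orders (order_id, customer) "
            "SELECT order_id, customer FROM staging_orders"
        )
        # One set-based insert for all items instead of one trigger call per row.
        conn.execute(
            "INSERT INTO items (order_id, sku, qty) "
            "SELECT order_id, sku, qty FROM staging_items"
        )
        conn.execute("DELETE FROM staging_items")
        conn.execute("DELETE FROM staging_orders")


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, customer TEXT);
    CREATE TABLE staging_items  (order_id INTEGER, sku TEXT, qty INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id INTEGER REFERENCES orders(order_id),
                         sku TEXT, qty INTEGER);
    INSERT INTO staging_orders VALUES (1, 'ACME');
    INSERT INTO staging_items  VALUES (1, 'A-100', 2), (1, 'B-200', 5);
""")

import_staged(conn)
```

In SQL Server the same shape would be a stored procedure wrapping the two INSERT ... SELECT statements in BEGIN TRANSACTION / COMMIT, run on a schedule. That gives you the transaction the trigger approach lacks.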
If you really want a fast approach, you need:
1) to disable the FK on the ITEMS table for the duration of the load
2) then load the ORDERS, and then re-enable the FK
All this should be done using an SP; the trigger approach is safe but very slow when it comes to large bulk loads.
I hope you will find it useful
Thanks
