I am using Snowflake.
All my raw data is already in a Snowflake raw table. I need to filter/deduplicate the data and insert it into another Bronze table. I am considering a dbt snapshot, a dbt incremental model, and Snowflake streams.
My raw layer is an S3 bucket fed by an API; this data is copied into Snowflake with Snowpipe, but it is the same duplicated raw data. I need to apply an upsert into a new Bronze table.
What approach do you recommend so that only the upserted rows are written to that final table?
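For context, the kind of statement I have in mind (all table, column, and key names below are placeholders) is a stream on the raw table feeding a deduplicating MERGE into the Bronze table:

    -- Capture new rows landing in the raw table (placeholder names throughout).
    CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_db.public.raw_events;

    -- Deduplicate the new rows on the business key and upsert them into Bronze.
    MERGE INTO bronze_db.public.events AS tgt
    USING (
        SELECT event_id, payload, loaded_at
        FROM raw_events_stream
        QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY loaded_at DESC) = 1
    ) AS src
    ON tgt.event_id = src.event_id
    WHEN MATCHED THEN UPDATE SET payload = src.payload, loaded_at = src.loaded_at
    WHEN NOT MATCHED THEN INSERT (event_id, payload, loaded_at)
                          VALUES (src.event_id, src.payload, src.loaded_at);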
Related
I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS S3 into Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot. What do you suggest for this?
Does the schema in AWS need to match the schema in Snowflake for the copying process?
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to put MATERIALIZED VIEWS on top of these so that queries perform better; see the sketch below.
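As a rough sketch of the external table option (stage, table, and column names are placeholders, and @my_s3_stage is assumed to already point at your S3 bucket):

    -- External table over JSON files in S3; @my_s3_stage is assumed to exist already.
    CREATE OR REPLACE EXTERNAL TABLE raw_events_ext (
        event_id VARCHAR AS (value:event_id::VARCHAR),
        event_ts TIMESTAMP_NTZ AS (value:event_ts::TIMESTAMP_NTZ)
    )
    WITH LOCATION = @my_s3_stage
    FILE_FORMAT = (TYPE = JSON)
    AUTO_REFRESH = TRUE;

    -- Optional materialized view to speed up queries against the external table.
    CREATE OR REPLACE MATERIALIZED VIEW raw_events_mv AS
        SELECT event_id, event_ts
        FROM raw_events_ext;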
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use two different roles: say data_owner (the role that creates the table and loads the data into it) and data_modelling (the role DataRobot uses).
Create masking policies as the data_owner role so that the DataRobot role cannot see the column data; a sketch follows below.
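A minimal sketch of what that could look like (policy, table, and column names are placeholders; DATA_OWNER and DATA_MODELLING are the two roles described above):

    -- Run as the data_owner role (which must hold the masking policy privileges).
    CREATE OR REPLACE MASKING POLICY pii_string_mask AS (val STRING) RETURNS STRING ->
        CASE
            WHEN CURRENT_ROLE() = 'DATA_OWNER' THEN val
            ELSE NULL   -- data_modelling (used by DataRobot) sees NULL instead of the PII value
        END;

    -- Attach the policy to each PII column.
    ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY pii_string_mask;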
About your question on copying the data: there is no requirement that the AWS S3 folder structure be in sync with Snowflake. You can create the external stage with any name and point it at any S3 folder.
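For example (integration, stage, and table names below are placeholders, and the storage integration is assumed to be configured already):

    -- The stage name does not have to match anything on the S3 side.
    CREATE OR REPLACE STAGE landing_stage
        URL = 's3://any-bucket/any/prefix/'
        STORAGE_INTEGRATION = my_s3_integration
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- Load whatever files in that prefix match the pattern into an existing table.
    COPY INTO analytics.staging.customers
        FROM @landing_stage
        PATTERN = '.*customers.*[.]csv';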
The Snowflake documentation has a good example that helps you get some hands-on experience:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
I want to automate the ingestion of data from a source into a Snowflake cloud database. There is no way to extract only unique rows from the source, so the entire dataset is extracted on every ingestion run. However, when adding the data to Snowflake I only want to add the unique rows. How can this be achieved most optimally?
Further Information: Source is a DataStax Cassandra Graph.
Assuming there is a key you can use to determine which records need to be loaded, the ideal scenario would be to load the data into a stage table in Snowflake and then run a MERGE statement to apply the new data to your target table.
https://docs.snowflake.com/en/sql-reference/sql/merge.html
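As a sketch, assuming the staged extract lands in a table called stage_vertices with a key column vertex_id (names are placeholders), an insert-only MERGE would add just the rows that are not already in the target:

    MERGE INTO target_vertices AS t
    USING stage_vertices AS s
        ON t.vertex_id = s.vertex_id
    -- No WHEN MATCHED clause: existing rows are left untouched, so only new keys are added.
    WHEN NOT MATCHED THEN
        INSERT (vertex_id, properties, extracted_at)
        VALUES (s.vertex_id, s.properties, s.extracted_at);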
If there is no key, you might want to consider running an INSERT OVERWRITE statement and just replacing the table with the new incoming data.
https://docs.snowflake.com/en/sql-reference/sql/insert.html#insert-using-overwrite
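A minimal example of that approach (table names are placeholders): OVERWRITE truncates the target before loading, and DISTINCT drops duplicates within the incoming batch.

    INSERT OVERWRITE INTO target_vertices
    SELECT DISTINCT *
    FROM stage_vertices;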
You will have to stage it to a table in Snowflake for ingestion and then move it to the destination table using SELECT DISTINCT.
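For example, assuming a key column you can compare on (all names below are placeholders), one way to move only the rows that are not already in the destination:

    INSERT INTO destination_table
    SELECT DISTINCT s.*
    FROM stage_table s
    WHERE NOT EXISTS (
        SELECT 1
        FROM destination_table d
        WHERE d.vertex_id = s.vertex_id   -- skip rows already present in the destination
    );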
I am trying to find a tool or methodology to record when an update is made against a specific table and column in AWS Redshift.
In PostgreSQL there is a way of doing this with triggers, but Redshift does not support triggers.
Can we monitor UPDATE statements and store the timestamp, the old value, the new value, and the affected table?
There is no in-built capability in Amazon Redshift to do change detection.
Amazon Redshift is intended as a Data Warehouse, which typically means that bulk information is loaded from external sources. It should be relatively rare for data to be updated within Amazon Redshift because it is not intended to be used as an OLTP database.
Thus, it would be better to put change detection in the source database or in the ETL pipeline, rather than Redshift.
Is there a way for StreamSets Data Collector to automatically create tables in the destination database based on the origin database in the case of CDC?
I am reading data from a source (MS SQL Server) and writing to a destination (PostgreSQL). If I am interested in 50 tables in the source, I do not want to create those tables manually in the destination database.
There is a (beta) Postgres Metadata processor for StreamSets Data Collector that will create and alter tables on the fly - more information at Drift Synchronization Solution for Postgres.
We are using Talend as our ETL tool to extract data from tables in one Hive database and load it into a table in a different Hive database. Can someone suggest the correct Talend components for this task?
Extract data from Hive Table A in Database D1 ---> Load data to Hive Table B in Database D2.
I used the tELTHive components, but there seem to be some restrictions with them. Also, is there a way to load data into Hive tables without first writing the extracted data to a file?