Add only unique rows in Snowflake Cloud Database

I want to automate the ingestion of data from a source into a Snowflake Cloud Database. There is no way to extract only unique rows from the source, so the entire data set will be extracted during every ingestion run. However, while adding to Snowflake I only want to add the unique rows. What is the most efficient way to achieve this?
Further Information: Source is a DataStax Cassandra Graph.

Assuming there is a key that you can use to determine which records need to be loaded, the ideal scenario would be to load the data into a stage table in Snowflake and then run a MERGE statement using the new data, applying the changes to your target table.
https://docs.snowflake.com/en/sql-reference/sql/merge.html
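For illustration, a minimal sketch of such a MERGE, assuming a stage table stg_graph_rows, a target table graph_rows, and a key column row_id (all hypothetical names):

-- upsert from the stage table into the target, keyed on row_id (hypothetical names)
MERGE INTO graph_rows t
USING stg_graph_rows s
  ON t.row_id = s.row_id
WHEN MATCHED THEN
  UPDATE SET t.payload = s.payload
WHEN NOT MATCHED THEN
  INSERT (row_id, payload) VALUES (s.row_id, s.payload);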
If there is no key, you might want to consider running an INSERT OVERWRITE statement and just replacing the table with the new incoming data.
https://docs.snowflake.com/en/sql-reference/sql/insert.html#insert-using-overwrite
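A hedged sketch of that approach, again with hypothetical table names; the whole target table is replaced by the freshly staged extract, deduplicated on the way in:

-- replace the contents of the target with the deduplicated staged extract
INSERT OVERWRITE INTO graph_rows
SELECT DISTINCT * FROM stg_graph_rows;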

You will have to stage the data in a table in Snowflake for ingestion and then move it to the destination table using SELECT DISTINCT.
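A minimal sketch of that pattern, with hypothetical table names and a key column row_id; the NOT EXISTS anti-join (an addition beyond the plain SELECT DISTINCT) also skips rows that already landed in the destination during earlier runs:

-- deduplicate the staged extract and skip rows already present in the destination
INSERT INTO graph_rows
SELECT DISTINCT s.*
FROM stg_graph_rows s
WHERE NOT EXISTS (
    SELECT 1 FROM graph_rows t WHERE t.row_id = s.row_id
);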

Related

Update Google Cloud Data Fusion replication job to reflect the SQL Server table schema

I've created a Data Fusion replication job to replicate some tables on a test database.
It works well at the beginning as long as I don't change the table schema. But I've added a new column, and that column is ignored by the replication job. I guess that if I create a new table, that table would be ignored as well.
Is there a way to include schema updates (new table, updated column, new column, etc.) in an already running Data Fusion replication job?
I guess a possible solution would be to stop the currently running job and create a new one including the new tables, new columns, etc., but I'd like to avoid having a new job replicate the whole database again.
Any possible solution?
Unfortunately, Data Fusion Replication for SQL Server currently does not support DDL propagation during runtime; you will need to delete and recreate the replicator pipeline in order to propagate any schema changes to the new BigQuery table.
One way to avoid re-replicating existing data after a DDL change is to manually modify the BigQuery table schema (keeping in mind that BigQuery itself has limited support for schema changes), then create a new replication job and disable replicating existing data (there is an option that lets you choose whether to replicate existing data; the default is true).
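If the change is a new column, one way to apply it manually is a BigQuery DDL statement before creating the new replication job; a minimal sketch, with hypothetical project, dataset, table, and column names:

-- add the new column to the existing BigQuery target table (names are hypothetical)
ALTER TABLE `my_project.my_dataset.my_table`
  ADD COLUMN new_column STRING;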

SQL Server table daily sync of records from table A to table B

I want to create a daily process where I reload all rows from table A into table B. Over time table A rows will change due to changes in source system and also because of aging/deletion of records in the origin table. Table A gets truncated/reloaded daily in step 1. Table B is the master table that just gets new/updated rows.
From a historical point of view, I want to keep track of ALL the rows in table B and be able to do a point in time comparison for analytics purposes.
So I need to do two things: daily, insert rows from table A into table B if they don't exist, and also create a new record in table B if the record already exists but ANY of the columns have changed. At one point I attempted to use temporal tables, but I had too many false positives on 'real' changes; certain columns were throwing things off because a date/time column being updated was the only real change in the row.
I'm using an Azure SQL Server Managed Instance database (Microsoft SQL Azure (RTM) - 12.0.2000.8).
At my disposal I have SSMS, SQL Server and also Azure Data Factory.
Any suggestions on the best way to do this or tools to help with this?
There are two approaches, and you can implement either one:
Temporal tables
Change Data Capture (CDC)
CDC is the more commonly used approach: you create an Azure Data Factory pipeline that loads delta data, based on the change data capture (CDC) information in the source Azure SQL Managed Instance database, into Azure Blob storage (a minimal sketch of enabling CDC follows the note below).
To implement CDC, you can follow this simple Microsoft tutorial: Incrementally load data from Azure SQL Managed Instance to Azure Storage using change data capture (CDC)
Note: You also need to create a storage account, which is required but not covered in the above tutorial.
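For reference, a minimal T-SQL sketch of enabling CDC on the source table before building the pipeline from the tutorial (schema and table names are hypothetical):

-- enable CDC at the database level (run once per database)
EXEC sys.sp_cdc_enable_db;

-- enable CDC on the source table (dbo.TableA is a hypothetical name);
-- @supports_net_changes = 1 requires a primary key or a unique index on the table
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'TableA',
    @role_name = NULL,
    @supports_net_changes = 1;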

Is there a way to preserve indexes and keys of SQL table when performing a copy activity using azure data factory

Trying to perform a copy activity from on-prem SQL Server to Azure SQL.
The source database table has a few indexes and keys, and when I perform a copy activity to Azure SQL and let it auto-generate a new table, the indexes and keys are missing on the destination table.
Based on the parameter descriptions in this official document, there is no guarantee that indexes and keys will be transferred by an ADF copy activity.
As you mentioned in your comment, you would have to create them yourself, for example in a stored procedure that is executed as part of the ADF copy activity (see the sketch after the links below).
For more clues, please refer to these threads:
1. https://www.sqlserverlogexplorer.com/copy-table-one-database-another-database/
2. How to copy indexes from one table to another in SQL Server
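For example, a hedged sketch of such a stored procedure (all object names are hypothetical) that recreates a primary key and an index on the Azure SQL destination table; it could be run after the copy, e.g. from a Stored Procedure activity:

-- recreate the keys and indexes that the auto-generated destination table is missing
CREATE PROCEDURE dbo.AddIndexesToDestTable
AS
BEGIN
    ALTER TABLE dbo.DestTable
        ADD CONSTRAINT PK_DestTable PRIMARY KEY (Id);

    CREATE NONCLUSTERED INDEX IX_DestTable_CustomerId
        ON dbo.DestTable (CustomerId);
END;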

What is the process to transfer staging table data to fact tables in Snowflake with custom validations?

Good day.
I need help. I want to transfer data in Snowflake from staging tables to fact tables automatically, whenever data is available in the staging tables. While moving data from the staging tables to the fact tables, I have a couple of custom validations on each column and row.
Any idea how to do this in Snowflake?
If anyone knows, could you please suggest an approach?
Thanks in advance!
There are many ways to do this, and how you go about it depends on what tools you have available. The simplest way to do this without using tools outside of the Snowflake ecosystem would be to:
Set up a stream on each of your staging tables (here is the Snowflake documentation on streams).
Create a task that runs on a schedule (here is the Snowflake doc on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake-specific one. Here is some more documentation on building SCD Type 2 dimensions, also written by someone at Snowflake.
Assuming "staging tables" refers to a Snowflake table and not a file in a Snowflake stage, I would recommend using a Stream and Task for this. A stream will identify the delta of data that needs to be loaded, and a Task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure that is executed in the Task to run your validations and Merge the outcome of those into your Fact.

Preserve the data while dropping a Hive internal table

I have loaded a huge table from SQL Server into Hive. The mistake I made is that I created the table as an internal table in Hive. Can anyone suggest any hack so that I can alter the table structure without dropping the data?
The data is huge and I can't afford to export it out of the source again.
The problem right now is that, since the column order doesn't match the SQL Server table, a lot of columns display NULL.
Any help will be highly appreciated.
I do not see any problem with using ALTER TABLE on an internal table. (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column)
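For instance, a hedged sketch of reordering one column so it lines up with the SQL Server layout (table and column names are hypothetical; for file-backed tables this only rewrites the metadata, so verify that the resulting name-to-position mapping is what you expect):

-- move customer_id so it sits after order_id, matching the source column order
ALTER TABLE my_table CHANGE COLUMN customer_id customer_id BIGINT AFTER order_id;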
Another - but not recommended - option would be to open your Hive metastore (HCatalog) and apply the changes there. Hive reads the schema information from a relational database (configured during the Hadoop setup; the default is MySQL). In that MySQL database you can try to change some settings. However, this is not recommended, because with one mistake you can break your whole Hive database.
The safest way is to create a new table, using the existing table as the source:
create table new_table
as
select
[...]
from existing_table
