Snowflake duplicates - snowflake-cloud-data-platform

I have a major issue with duplicates in Snowflake.
Data flow:
1. From SAP I load data into the PSA layer (no duplicates here).
2. Procedures load the data into the RAW layer (duplicates appear here).
3. Procedures load the data into the Business layer (duplicates here as well).
All of the procedures in steps 2 and 3 run in parallel.
Does anyone have an idea how to avoid duplicates in Snowflake, or how to delete them afterwards? For example, when I have two or more identical rows, I want to keep only one row in my tables.
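For the cleanup part, one common approach in Snowflake is to rewrite the affected table so it keeps only one copy of each row. A minimal sketch, assuming a hypothetical RAW table RAW_ORDERS with a business key ID and a load timestamp LOAD_TS (all names are placeholders):

-- Exact duplicate rows: rewrite the table with distinct rows only.
INSERT OVERWRITE INTO RAW_ORDERS
SELECT DISTINCT * FROM RAW_ORDERS;

-- Duplicates per key (e.g. the same row loaded twice with different timestamps):
-- keep only the newest row for each ID.
INSERT OVERWRITE INTO RAW_ORDERS
SELECT *
FROM RAW_ORDERS
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LOAD_TS DESC) = 1;

To avoid the duplicates in the first place, the RAW and Business load procedures could MERGE on the business key (inserting only when not matched) instead of doing a plain INSERT, so repeated or parallel runs do not add the same rows twice.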

Related

"master-slave" table replication in Oracle

Is there something similar to a master-slave database, but at the table level within the database?
For example, I have the following scenario:
I have a table with millions of records, because the system is more than 15 years old.
I only want to show the records of the last year (2019-2020).
I decided to create a view that only shows the records in that one-year range from the table that contains millions of records.
Thanks to the view, that system screen loads faster, because fewer records are involved.
The problem: what if the user adds a new record to the table that contains millions of records? How do I make my view update when the underlying table is modified?
I think I could use triggers to update the view, but is there functionality in Oracle that gives me something like what I just described (master-slave), where the "slave" table is updated as the "master" table changes?
First of all, you have misunderstood views. A view is not a physical table and does not store any data; if you insert into a view, you are actually inserting into the source table.
Since the view is not physical, you are just filtering the data, which by itself has no performance benefit.
For big tables you can use partitioning, which drastically improves performance, and if you still need archival you can archive the partitioned data.
Partitioning is generally the best method, because you can typically archive data simply by running an "exchange" command to move old data out.
Data doesn't "move" in that scenario; it simply gets "detached" from the table via data dictionary manipulation.
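For illustration, archiving by partition exchange might look like this (the table, partition, and archive-table names are made up; it assumes the big table is range-partitioned by year and that an empty archive table with a matching structure already exists):

-- Detach the 2018 data from the partitioned table into the archive table.
-- Only the data dictionary is updated; the rows themselves are not copied.
ALTER TABLE big_table
  EXCHANGE PARTITION p_2018 WITH TABLE big_table_archive_2018
  INCLUDING INDEXES
  WITHOUT VALIDATION;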
Would there be something similar as the master-slave database but at the table level in the database
If you are asking about master/slave replication at the table level then, I suppose, the table/materialized view relationship is the appropriate thing to call master-slave. Quoting the Oracle docs:
A materialized view is a database object that contains the results of a query. The FROM clause of the query can name tables, views, and other materialized views. Collectively these objects are called master tables (a replication term)...
When you need to "update", or more appropriately refresh, the mview, you have different options:
refresh it periodically, on a schedule
refresh it each time the data in the master table is changed and committed (ON COMMIT)
refresh it manually by calling DBMS_MVIEW.REFRESH or DBMS_SNAPSHOT.REFRESH
An mview can be faster than a view because each time you select from the mview you are selecting from a separate "table" that was replicated from the original one. Especially if you have complex logic in the SQL, you can put that logic into the mview definition.
The drawbacks are that the mview needs extra disk space, and there is a delay before the data is refreshed.
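A minimal sketch of that approach for the scenario in the question (all object and column names are invented; the materialized view log is only needed if you later want fast, incremental refresh):

-- Log on the master table, required for fast (incremental) refresh.
CREATE MATERIALIZED VIEW LOG ON big_table WITH PRIMARY KEY, ROWID;

-- "Slave" copy holding only the recent records.
CREATE MATERIALIZED VIEW recent_records_mv
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
AS
  SELECT *
  FROM   big_table
  WHERE  record_date >= DATE '2019-01-01';

-- Manual refresh, which could also be run from a scheduler job.
BEGIN
  DBMS_MVIEW.REFRESH('RECENT_RECORDS_MV');
END;
/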

Loading data from source table into multiple tables

I have a source table called table1, which is populated every night with new data. I have to load the data into 4 relational tables: tblCus, tblstore, tblEmp, tblExp. Should I use a stored procedure or a trigger to accomplish that?
Thanks
If it is a simple load, you can use a Data Flow Task that selects from table1, assuming table1 is the source for your 4 tables.
Then you can use a Conditional Split transformation, which acts like a WHERE clause: set up your conditions for tblCus, tblstore, tblEmp and tblExp, and then add 4 destinations for them.
(Screenshots of the example data flow and the Conditional Split configuration are not reproduced here.)
In SQL Server, there is always more than one way to skin a cat. From your question, I am assuming that you are denormalizing 4 tables from an OLTP-style database into a single dimension in a data-warehouse-style application.
If the databases are on separate instances, or if only simple transformations are required, you could use SSIS (SQL Server Integration Services).
If this is correct, and the databases reside on the same instance, then you could use a stored procedure.
If the transformation is part of a larger load, you could combine the two methods and use SSIS to orchestrate the transformations, but simply call off to stored procedures in the control flow.
The general rule I use to decide whether a specific transformation should be a data flow or a stored procedure is: a data flow is my preference, but if I would need any asynchronous transformations within the data flow, I revert to a stored procedure. This rule usually gives the best performance profile.
I would avoid triggers, especially if there are a large number of DML operations against the 4 tables, as the trigger will fire for each modification and potentially cause performance degradation.
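If the stored-procedure route fits, a minimal sketch could look like the following; the column names and the RecordType split condition are pure assumptions, since the question does not describe table1's structure:

CREATE PROCEDURE dbo.usp_LoadFromTable1
AS
BEGIN
    SET NOCOUNT ON;

    -- One INSERT ... SELECT per target table, each with its own filter,
    -- mirroring what a Conditional Split would do in SSIS.
    INSERT INTO dbo.tblCus (CustomerId, CustomerName)
    SELECT CustomerId, CustomerName
    FROM dbo.table1
    WHERE RecordType = 'CUS';

    INSERT INTO dbo.tblstore (StoreId, StoreName)
    SELECT StoreId, StoreName
    FROM dbo.table1
    WHERE RecordType = 'STORE';

    -- ...repeat for dbo.tblEmp and dbo.tblExp with their own conditions.
END;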

How to bulk insert and validate data against existing database data

Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to process the staging table using LINQ, marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done, I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest something like:
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but it is by no means a quick solution, so SSIS would be preferable.
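As a set-based alternative to the row-by-row IF EXISTS pattern, a single MERGE from the staging table can perform the same upsert; the table, key, and column names below are only placeholders:

MERGE dbo.Product AS target
USING dbo.StagingProduct AS source
    ON target.ProductCode = source.ProductCode
WHEN MATCHED THEN
    UPDATE SET target.Price = source.Price,
               target.IsActive = source.IsActive
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductCode, Price, IsActive)
    VALUES (source.ProductCode, source.Price, source.IsActive);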

Is it possible to use SSIS in order to fill multiple tables with inheritance at the same time?

I've got an MS SQL database with 3 tables that model inheritance: a general one called "Tools" and two more specific ones, "DieCastTool" and "Deburrer".
My task is to get data out of an old MS Access "database". My problem is that I have to do a lot of searching and filtering until I have the data I'd like to import into the new DB, so I don't want to repeat these steps several times; I want to populate the 3 tables at the same time.
Therefore I am using a data flow destination in which I am not selecting a single table, but using a SQL SELECT statement (with inner joins on the id columns) to retrieve all the fields of the 3 tables. Then I map the columns, and in my opinion it should work. It does, as long as I only select the columns of the "Tools" table to be filled. When I add the columns of the child tables, unfortunately it does not, and I get ErrorCode -1071607685 => "No status is available".
I can think of 2 reasons why my solution does not work:
SSIS simply can't handle inheritance in SQL tables and sees them as individual tables. (Maybe SSIS can't even handle filling multiple tables in one data flow destination?)
I am using SSIS in the wrong way.
It would be nice if someone could confirm or rule out reason 1, because I have not found anything on this topic.
Yes, SSIS is not aware of table relationships. Think of it as: SSIS is aware of physical objects only, not your logical model.
I don't understand how you got that error. Anyway, here are some solutions:
1. If there are no FK constraints between those tables, you can use one source component, one Multicast and 3 destinations in one data flow.
2. If there are FK constraints and you can disable them, use an Execute SQL Task to disable (or drop) the constraints, add the same data flow as in option 1, and add another Execute SQL Task to enable (or recreate) the constraints; see the sketch after this list.
3. If there are FK constraints and you can't disable them, you can use one data flow to read all the data and pass it to subsequent data flows to fill the tables one by one. See SSIS Pass Datasource Between Control Flow Tasks.
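For option 2, the Execute SQL Tasks would run statements along these lines (the constraint names are hypothetical):

-- Before the data flow: disable the FK constraints on the child tables.
ALTER TABLE dbo.DieCastTool NOCHECK CONSTRAINT FK_DieCastTool_Tools;
ALTER TABLE dbo.Deburrer NOCHECK CONSTRAINT FK_Deburrer_Tools;

-- After the data flow: re-enable them and re-validate the loaded rows.
ALTER TABLE dbo.DieCastTool WITH CHECK CHECK CONSTRAINT FK_DieCastTool_Tools;
ALTER TABLE dbo.Deburrer WITH CHECK CHECK CONSTRAINT FK_Deburrer_Tools;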

How do I sync the results of a Web service with a DB?

I am looking for a way to quickly compare the state of a database table with the results of a Web service call.
I need to make sure that all records returned by the Web service call exist in the database, and any records in the database that are no longer in the Web service response are removed from the table.
I have two problems to solve:
1. How do I quickly compare a data structure with the results of a database table?
2. When I find a difference, how do I quickly add what's new and remove what's gone?
For number 1, I was thinking of computing an MD5 of the data structure and storing it in the database. If the MD5 is different, then I'd move to step 2. Are there better ways of comparing response data with the state of a database?
I need more guidance on number 2. I can easily retrieve all records from a table (SELECT * FROM users WHERE user_id = 1) and then loop through an array, adding what's not in the DB and building another array of items to be removed in a subsequent call, but I'm hoping for a better (faster) way of doing this. What is the best way to compare and sync a data structure with a subset of a database table?
Thanks for any insight into these issues!
I've recently been caught up in a similar problem. Our very simple solution was to load the web service data into a table with the same structure as the DB table. The DB table keeps a hash of its most important columns, and the same hash function is applied to the corresponding columns in the web service table.
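For illustration, the hash could be something as simple as an MD5 over the significant columns (Postgres-style syntax; the column names are made up, and the same expression is applied to both tables):

-- Stamp each row with a hash of its significant columns.
-- Assumes text columns; cast non-text columns before concatenating.
UPDATE db_table SET hash = MD5(col1 || '|' || col2 || '|' || col3);
UPDATE ws_table SET hash = MD5(col1 || '|' || col2 || '|' || col3);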
The "sync" logic then goes like this:
Delete any rows from the web service table with hashes that do exist in the DB table. This is duplicate data that doesn't need synchronizing.
DELETE FROM ws_table WHERE hash IN (SELECT hash from db_table);
Delete any rows from the DB table with hashes not found in the web service table.
DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table);
Anything left over in the web service table is new data, and should now be inserted into the DB table.
INSERT INTO db_table SELECT ... FROM ws_table;
It's a pretty brute-force approach, and if done transactionally (even just steps 2 and 3) it locks up the DB table for the duration, but it's very simple.
One refinement would be to deal with changed records using UPDATE statements, but that adds a good deal of complexity, and may not be any faster than a DELETE followed by an INSERT.
Another possible optimization would be to set a flag instead of deleting rows. The rows could then be deleted later on. However, any logic using the DB table would have to ignore rows with a set flag.
Don't kill yourself doing premature optimization. Go with the simple approach of inserting each row one at a time. If you find you're having transactional issues, such as the table being locked for too long while looping, you could insert the rows into a temporary table first and then do a single insert into the real destination table.
If you were using SQL Server you could do bulk inserts or package the data into XML, but I'd still highly recommend implementing it the easy way first, then testing it, ideally with production data (or the same quantity of data), and looking to optimize only if you need to.
