I have a table in Snowflake, and multiple jobs need to merge records into that table. How can duplicates be avoided during this process? And how can I make sure that multiple merges happen one after another?
Use dependent task trees to guarantee the merges are executed in the order you want:
https://docs.snowflake.com/en/user-guide/tasks-intro.html#simple-tree-of-tasks
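A minimal sketch of what that could look like, assuming two hypothetical staging tables stg_a and stg_b feeding one target table tgt: each MERGE keys on a unique id (and de-duplicates its source) so reruns don't insert duplicates, and the AFTER clause serializes the second merge behind the first.

    CREATE OR REPLACE TASK merge_from_stg_a
      WAREHOUSE = etl_wh
      SCHEDULE = '60 MINUTE'                          -- only the root task carries a schedule
    AS
      MERGE INTO tgt t
      USING (SELECT DISTINCT id, col1 FROM stg_a) s   -- de-duplicate the source first
         ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
      WHEN NOT MATCHED THEN INSERT (id, col1) VALUES (s.id, s.col1);

    CREATE OR REPLACE TASK merge_from_stg_b
      WAREHOUSE = etl_wh
      AFTER merge_from_stg_a                          -- runs only after the first merge finishes
    AS
      MERGE INTO tgt t
      USING (SELECT DISTINCT id, col1 FROM stg_b) s
         ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
      WHEN NOT MATCHED THEN INSERT (id, col1) VALUES (s.id, s.col1);

    -- child tasks are resumed before the root task
    ALTER TASK merge_from_stg_b RESUME;
    ALTER TASK merge_from_stg_a RESUME;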
Environment: Oracle 12C
I have a table with about 10 columns, which include a few CLOB and date columns. This is a very busy table for an ETL process, as described below:
Flat files are loaded into the table first, then the rows are updated and processed. The inserts and updates happen in batches; millions of records are inserted and updated.
There is also a delete process that removes old data from the table based on a date field. The delete process runs as a PL/SQL procedure and deletes from the table in a loop, fetching only the first n records at a time based on the date field.
I do not want the delete process to interfere with the regular inserts/updates. What is the best practice for coding the delete so that it has minimal impact on the regular insert/update process?
I could also partition the table and delete in parallel, since each partition uses its own rollback segment, but I am looking for a simpler way to tune the delete process.
Any suggestions on using a special rollback segment, or other tuning tips?
The first thing to look at is decoupling the various ETL processes so that you do not have to run them all together or in a particular sequence, thereby removing the dependency between the INSERTs/UPDATEs and the DELETEs. While the insert/update can be handled in a single MERGE block in your ETL, the delete can be deferred by simply marking the rows to be deleted later, i.e. a soft delete. You can do this with a flag column on the table, and use that flag in your application and queries to filter those rows out.
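A rough sketch of the soft-delete idea, with made-up table and column names: the regular load stays a single MERGE, and the purge step only flips a flag.

    -- regular batch load: insert/update in one MERGE
    MERGE INTO etl_target t
    USING etl_stage s
       ON (t.rec_id = s.rec_id)
     WHEN MATCHED THEN
       UPDATE SET t.payload = s.payload, t.load_date = s.load_date
     WHEN NOT MATCHED THEN
       INSERT (rec_id, payload, load_date, delete_flag)
       VALUES (s.rec_id, s.payload, s.load_date, 'N');

    -- deferred "delete": just mark the old rows
    UPDATE etl_target
       SET delete_flag = 'Y'
     WHERE load_date < ADD_MONTHS(SYSDATE, -6);

    -- readers filter on the flag
    SELECT * FROM etl_target WHERE delete_flag = 'N';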
By deferring the delete, the critical path of your ETL should shrink. Partitioning the data by date range should definitely help you maintain the data and also make the transactions efficient if they are date-driven. Also, look for any row-by-row (and thus slow-by-slow) processing and convert it to bulk operations; avoid context switching between SQL and PL/SQL as much as possible.
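On the bulk point, when the delete does have to run, a batched delete that stays in plain SQL (no per-row fetch) could look like the following; the table, column, retention window and batch size are all illustrative.

    DECLARE
      l_deleted PLS_INTEGER;
    BEGIN
      LOOP
        DELETE FROM etl_target
         WHERE load_date < ADD_MONTHS(SYSDATE, -6)
           AND ROWNUM <= 50000;        -- one set-based statement per batch
        l_deleted := SQL%ROWCOUNT;
        COMMIT;                        -- keeps undo per batch manageable
        EXIT WHEN l_deleted = 0;
      END LOOP;
    END;
    /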
If you partition the table by date range, you could also look into DROP/TRUNCATE PARTITION, which discards the rows stored in a partition as a DDL statement. This cannot be rolled back. It executes quickly and uses few system resources (undo and redo). You can read more about it in the documentation.
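Assuming a range partition per month (partition names invented), the purge then becomes a quick DDL operation instead of a large DELETE:

    -- discard the whole partition...
    ALTER TABLE etl_target DROP PARTITION p_2016_01 UPDATE GLOBAL INDEXES;

    -- ...or keep the partition definition and just empty it
    ALTER TABLE etl_target TRUNCATE PARTITION p_2016_01 UPDATE INDEXES;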
I have a source table called table1, which is populated every night with new data. I have to load that data into 4 relational tables: tblCus, tblstore, tblEmp, tblExp. Should I use a stored procedure or a trigger to accomplish that?
Thanks
If it is a simple load, you can use a Data Flow Task to select from table1.
Assuming table1 is the source table for your 4 tables.
Then you can use a Conditional Split transformation, which acts like a WHERE clause: you set your conditions for tblCus, tblstore, tblEmp and tblExp, and then add 4 destinations for them.
Look at my example: [screenshot]
Conditional split: [screenshot]
In SQL Server, there is always more than one way to skin a cat. From your question, I am assuming that you are denormalizing 4 tables from an OLTP-style database into a single dimension in a data-warehouse-style application.
If this is correct, and the databases reside on the same instance, then you could use a stored procedure.
If the databases are on separate instances, or if only simple transformations are required, you could use SSIS (SQL Server Integration Services).
If the transformation is part of a larger load, then you could combine the two methods, and use SSIS to orchestrate the transformations but simply call stored procedures from the control flow.
The general rule I use to decide whether a specific transformation should be a data flow or a stored procedure is: a data flow is my preference, but if I would require any asynchronous transformations within the data flow, I revert to a stored procedure. This rule usually gives the best performance profile.
I would avoid triggers, especially if there are a large number of DML operations against the 4 tables, as the trigger will fire for each modification and potentially cause performance degradation.
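If the stored-procedure route fits, a minimal sketch could look like this; every object and column name below (including the RecordType discriminator) is an assumption about your schema, not something taken from the question.

    CREATE PROCEDURE dbo.usp_LoadFromTable1
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRY
            BEGIN TRANSACTION;

            -- each INSERT plays the role of one output of the conditional split
            INSERT INTO dbo.tblCus (CusId, CusName)
            SELECT CusId, CusName FROM dbo.table1 WHERE RecordType = 'CUS';

            INSERT INTO dbo.tblstore (StoreId, StoreName)
            SELECT StoreId, StoreName FROM dbo.table1 WHERE RecordType = 'STORE';

            INSERT INTO dbo.tblEmp (EmpId, EmpName)
            SELECT EmpId, EmpName FROM dbo.table1 WHERE RecordType = 'EMP';

            INSERT INTO dbo.tblExp (ExpId, Amount)
            SELECT ExpId, Amount FROM dbo.table1 WHERE RecordType = 'EXP';

            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
            THROW;   -- surface the error to the caller
        END CATCH
    END;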
I'm looking for an efficient way of detecting records deleted in production and updating the data warehouse to reflect those deletes, because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far: [screenshot]
Inside the Data Flow Task: [screenshot]
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
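For reference, a row hash of that kind is usually computed with something along these lines (the column names here are invented); the single hash is compared instead of every individual column.

    SELECT
        s.TransactionID,
        HASHBYTES('MD5',
            CONCAT(s.AccountCode, '|', s.Amount, '|',
                   CONVERT(varchar(30), s.PostingDate, 126))) AS RowHash
    FROM dbo.SourceTransactions AS s;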
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task: [screenshot]
Inside Lookup Transformation: [screenshot] (note the Redirect rows to no match output setting)
So, if the IDs don't match, those rows are redirected to the no-match output, which we set to go to our staging table. Then we join staging to the fact table and apply the deletions inside an Execute SQL Task, along the lines of the sketch below.
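A hedged sketch of the statement such an Execute SQL Task could run (the staging and fact table names here are made up):

    -- delete fact rows whose keys landed in the no-match staging table
    DELETE f
    FROM dbo.FactTransactions AS f
    INNER JOIN staging.MissingInProd AS m
            ON m.TransactionID = f.TransactionID;

    -- clear staging for the next run
    TRUNCATE TABLE staging.MissingInProd;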
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new, changed, and deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother; instead I'd simply stage all my prod data and then do a three-way SQL MERGE statement to handle inserts, updates and deletes in one pass (see the sketch below). You can keep your hash column as a joining key if you like.
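A sketch of that three-way MERGE, assuming production has been staged and both sides carry a hash column (all names invented):

    MERGE dbo.FactTransactions AS tgt
    USING staging.ProdSnapshot AS src
       ON tgt.TransactionID = src.TransactionID
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN   -- changed rows
        UPDATE SET tgt.Amount  = src.Amount,
                   tgt.RowHash = src.RowHash
    WHEN NOT MATCHED BY TARGET THEN                    -- new rows
        INSERT (TransactionID, Amount, RowHash)
        VALUES (src.TransactionID, src.Amount, src.RowHash)
    WHEN NOT MATCHED BY SOURCE THEN                    -- deleted in production
        DELETE;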
I've got an MS SQL database with 3 tables that form an inheritance hierarchy: a general one called "Tools" and two more specific ones, "DieCastTool" and "Deburrer".
My task is to get data out of an old MS Access "database". My problem is that I have to do a lot of searching and filtering until I've got the data that I'd like to import into the new DB. So in the end I don't want to repeat these steps several times, but rather populate the 3 tables at the same time.
Therefore I am using a data flow destination in which I am not selecting a particular table, but using a SQL SELECT statement (with inner joins on the ID columns) to retrieve all the fields of the 3 tables. Then I map the columns, and in my opinion it should work. And it does, as long as I only select the columns of the "Tools" table to be filled. When I add the columns of the child tables, unfortunately it does not, and returns error code -1071607685 => "No status is available".
I can think of 2 reasons why my solution does not work:
SSIS simply can't handle inheritance between SQL tables and sees them as individual tables. (Maybe SSIS can't even handle filling multiple tables from one data flow destination?)
I am using SSIS in the wrong way.
It would be nice if someone could confirm or rule out reason 1, because I have not found anything on this topic.
Yes, SSIS is not aware of table relationships. Think of it as: SSIS is aware of physical objects only, not your logical model.
I don't understand how you got that error. Anyway, here are some solutions:
1. If there are no FK constraints between those tables, you can use one source component, one Multicast, and 3 destinations in one data flow.
2. If there are FK constraints and you can disable them, then use an Execute SQL Task to disable (or drop) the constraints, add the same data flow as in 1, and add another Execute SQL Task to enable (or create) the constraints again (see the sketch after this list).
3. If there are FK constraints and you can't disable them, you can use one data flow to read all the data and pass it to subsequent data flows to fill the tables one by one. See SSIS Pass Datasource Between Control Flow Tasks.
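For option 2, the two Execute SQL Tasks could run statements along these lines (the constraint names are placeholders; only the table names come from the question):

    -- before the data flow: stop the FK checks
    ALTER TABLE dbo.DieCastTool NOCHECK CONSTRAINT FK_DieCastTool_Tools;
    ALTER TABLE dbo.Deburrer NOCHECK CONSTRAINT FK_Deburrer_Tools;

    -- after the data flow: re-enable and re-validate the constraints
    ALTER TABLE dbo.DieCastTool WITH CHECK CHECK CONSTRAINT FK_DieCastTool_Tools;
    ALTER TABLE dbo.Deburrer WITH CHECK CHECK CONSTRAINT FK_Deburrer_Tools;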
Is it possible to update two tables with a single query?
So that I do not have to execute two queries and track whether both were successful?
You can't do it in a single query, but you can do it in a transaction, where all queries within the transaction either succeed or fail together.
You can write a stored procedure that updates the two tables and returns whatever you need it to in order to determine success. This stored proc can then be called from a single command. However, it will still have to contain two queries.
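A minimal sketch of such a procedure, assuming SQL Server syntax and invented table and column names; both updates sit in one transaction, and the caller checks a single return value:

    CREATE PROCEDURE dbo.usp_UpdateBothTables
        @Id     int,
        @Value1 varchar(50),
        @Value2 varchar(50)
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRY
            BEGIN TRANSACTION;

            UPDATE dbo.TableA SET Col1 = @Value1 WHERE Id = @Id;
            UPDATE dbo.TableB SET Col2 = @Value2 WHERE Id = @Id;

            COMMIT TRANSACTION;
            RETURN 0;   -- success
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
            RETURN 1;   -- failure; caller checks the return value
        END CATCH
    END;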
No, that is not possible AFAIK.
EDIT: What is your reason for wanting to do this in a single query?
You could use transactions, however you are still required to update the tables separately and check the results before committing or rolling back.
Of course you can, by using triggers.