How to use the pre-copy script from the copy activity to remove records in the sink based on the change tracking table from the source? - sql-server

I am trying to use change tracking to copy data incrementally from a SQL Server database to an Azure SQL Database. I followed the tutorial in the Microsoft Azure documentation, but I ran into some problems when implementing this for a large number of tables.
In the source part of the copy activity I can use a query that gives me a change table of all the records that are updated, inserted or deleted since the last change tracking version. This table will look something like
PersonID  Age   Name   SYS_CHANGE_OPERATION
--------  ----  -----  --------------------
1         12    John   U
2         15    James  U
3         NULL  NULL   D
4         25    Jane   I
with PersonID being the primary key for this table.
The problem is that the copy activity can only append the data to the Azure SQL Database, so when a record gets updated it fails with a duplicate primary key error. I can deal with this problem by letting the copy activity use a stored procedure that merges the data into the table on the Azure SQL Database, but the problem is that I have a large number of tables.
I would like the pre-copy script to delete the deleted and updated records on the Azure SQL Database, but I can't figure out how to do this. Do I need to create separate stored procedures and corresponding table types for each table that I want to copy or is there a way for the pre-copy script to delete records based on the change tracking table?

You have to use a Lookup activity before the Copy activity. With that Lookup activity you can query the database so that you get the deleted and updated PersonIDs, preferably all in one field, separated by commas (so it's easier to use in the pre-copy script). More information here: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
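For example, the Lookup query could look something like this (a sketch only: the watermark table holding the last synced change tracking version and all object names are illustrative, and STRING_AGG requires SQL Server 2017+; older versions can use FOR XML PATH instead):

-- Collect the PersonIDs that were updated or deleted since the last synced
-- change tracking version, as one comma-separated string.
DECLARE @last_sync_version bigint =
    (SELECT SYS_CHANGE_VERSION FROM dbo.ChangeTrackingVersionStore WHERE TableName = 'Person');

SELECT STRING_AGG(CAST(CT.PersonID AS varchar(20)), ',') AS PersonIDs
FROM CHANGETABLE(CHANGES dbo.Person, @last_sync_version) AS CT
WHERE CT.SYS_CHANGE_OPERATION IN ('U', 'D');

The Lookup output then exposes the comma-separated list as firstRow.PersonIDs, which is exactly what the pre-copy script below consumes.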
Then you can do the following in your pre-copy script:
delete from TableName where PersonID in (@{activity('MyLookUp').output.firstRow.PersonIDs})
This way you will be deleting all the deleted or updated rows before inserting the new ones.
Hope this helped!

In the meantime, Azure Data Factory provides a metadata-driven copy task. After going through the dialog-driven setup, a metadata table is created which has one row for each dataset to be synchronized. I solved this upsert problem by adding a stored procedure as well as a table type for each dataset to be synchronized. Then I added the relevant information to the metadata table for each row like this:
{
    "preCopyScript": null,
    "tableOption": "autoCreate",
    "storedProcedure": "schemaname.UPSERT_SHOP_SP",
    "tableType": "schemaname.TABLE_TYPE_SHOP",
    "tableTypeParameterName": "shops"
}
After that you need to adapt the sink properties of the copy task like this (stored procedure, table type, table type parameter name):
@json(item().CopySinkSettings).storedProcedure
@json(item().CopySinkSettings).tableType
@json(item().CopySinkSettings).tableTypeParameterName
If the destination table does not exist yet, you need to run the whole task once before adding the above variables, because auto-create of tables only works as long as no stored procedure is given in the sink properties.
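For reference, a minimal sketch of what such a table type and upsert stored procedure could look like for the shop dataset in the metadata example above (the target table and its column list are purely hypothetical):

CREATE TYPE schemaname.TABLE_TYPE_SHOP AS TABLE
(
    ShopID   int           NOT NULL PRIMARY KEY,
    ShopName nvarchar(100) NULL,
    City     nvarchar(100) NULL
);
GO
CREATE PROCEDURE schemaname.UPSERT_SHOP_SP
    @shops schemaname.TABLE_TYPE_SHOP READONLY  -- parameter name must match tableTypeParameterName
AS
BEGIN
    MERGE schemaname.SHOP AS tgt
    USING @shops AS src
        ON tgt.ShopID = src.ShopID
    WHEN MATCHED THEN
        UPDATE SET tgt.ShopName = src.ShopName,
                   tgt.City     = src.City
    WHEN NOT MATCHED THEN
        INSERT (ShopID, ShopName, City)
        VALUES (src.ShopID, src.ShopName, src.City);
END

The copy task then hands each batch of rows to the procedure through the table-valued parameter instead of doing plain inserts, which is what makes the upsert work.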

Related

Azure Synapse pipeline: how to add a GUID to raw data

I am new to Azure Synapse and am technically a Data Scientist who's doing a Data Engineering task. Please help!
I have some xlsx files containing raw data that I need to import into a SQL database table. The issue is that the raw data does not have a uniqueidentifier column, and I need to add one before inserting the data into my SQL database.
I have been able to successfully add all the rows to the table by adding a new column in the Copy Data activity and setting it to @guid(). However, this sets the GUID of every row to the same value (not unique for each row).
(Screenshots omitted: the GUID column mapping and the resulting DB rows.)
If I do not add this mapping, the pipeline throws an error stating that it cannot import a NULL Id into the column Id, which makes sense as this column does not accept NULL values.
Is there a way to have Azure Synapse Analytics read in a raw xlsx file and then import it into my DB with a unique identifier for each row? If so, how can I accomplish this?
Many many thanks for any support.
Giving dynamic content to a column in this way generates the same value for the entire column.
Instead, you can generate a new GUID for each row using a ForEach activity.
You can retrieve the data from your source Excel file using a Lookup activity (my source only has a name column). Give the output array of the Lookup activity to the ForEach activity:
@activity('Lookup1').output.value
Inside the ForEach, since you already have a linked service, create a Script activity. In this Script activity, you can build a query with dynamic content to insert values into the destination table. The following is the query I built using dynamic content:
insert into demo values ('@{guid()}','@{item().name}')
This allows you to iterate through the source rows and insert each row individually while generating a new GUID every time.
You can follow the above procedure to build a query that inserts each row with a unique identifier value. In my test I used Copy Data to insert the first 2 rows (same as yours) and inserted the next 2 rows using the above procedure.
NOTE: I used an Azure SQL Database for this demo, but that does not affect the procedure.

SSIS Move Data Between Databases - Maintain Referential Integrity

I need to move data between two databases and wanted to see if SSIS would be a good tool. I've pieced together the following solution, but it is much more complex than I was hoping it would be - any insight on a better approach to tackling this problem would be greatly appreciated!
So what makes my situation unique: we have a large volume of data, so to keep the system performant we have split our customers across multiple database servers. These servers have databases with the same schema, but each is populated with unique data. Occasionally we have the need to move a customer's data from one server to another. Because of this, simply recreating the tables and moving the data in place won't work: in the database on server A there could be 20 records, but there could be 30 records in the same table in the database on server B. So when moving record 20 from A to B, it will need to be assigned ID 31. Getting past this wasn't difficult, but the trouble comes when needing to move the tables which have a foreign key reference to what is now record 31...
An example:
Here's a sample schema for a simple example: there is a table to track manufacturers, and a table to track products, each of which references a manufacturer. (Schema diagram and example source data screenshots omitted.)
To handle moving this data while maintaining relational integrity, I've taken the approach of gathering the manufacturer records, looping through them, and for each manufacturer moving the associated products. Here's a high level look at the Control Flow in SSDT:
The first Data Flow grabs the records from the source database and pulls them into a Recordset Destination:
The OLE DB Source pulls all columns from the source database's manufacturer table and places them into a recordset:
Back in the control flow, I then loop through the records in the Manufacturer recordset:
For each record in the manufacturer recordset I then execute a SQL task which determines the next available auto-incrementing ID in the destination database, inserts the record, and then returns the result of a SELECT MAX(ManufacturerID) in the Execute SQL Task result set, so that the newly created ManufacturerID can be used when inserting the related products into the destination database:
The above works; however, once you get more than a few layers deep in tables that reference one another, this is no longer very tenable. Is there a better way to do this?
You could always try this:
Populate your manufacturers table.
Get your products data (make sure you have a reference to the manufacturer, such as its name).
Use a lookup to get the new ID where the name (or whatever reference you choose) matches.
Insert into the destination database.
This will keep your FK constraints intact and not require you to do all that max-key selection.
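Expressed as set-based T-SQL, the same idea looks roughly like this (purely illustrative: it assumes the sample Manufacturer/Product schema, a linked server to the source, and that manufacturer names are unique):

-- 1. Copy the manufacturers and let the destination assign new identity IDs.
INSERT INTO DestDb.dbo.Manufacturer (ManufacturerName)
SELECT sm.ManufacturerName
FROM SourceServer.SourceDb.dbo.Manufacturer AS sm;

-- 2. Copy the products, looking up each product's new ManufacturerID
--    by manufacturer name instead of reusing the old foreign key value.
INSERT INTO DestDb.dbo.Product (ProductName, ManufacturerID)
SELECT sp.ProductName, dm.ManufacturerID
FROM SourceServer.SourceDb.dbo.Product AS sp
JOIN SourceServer.SourceDb.dbo.Manufacturer AS sm
    ON sm.ManufacturerID = sp.ManufacturerID
JOIN DestDb.dbo.Manufacturer AS dm
    ON dm.ManufacturerName = sm.ManufacturerName;

In SSIS, the second step maps to a Lookup transformation on ManufacturerName between the source and the OLE DB Destination, which removes the need for the row-by-row MAX(ManufacturerID) logic.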

Storing/creating local table from linked SQL table

I have a linked table in my Access database (dbo_Billing_denied (DSN=WTSTSQL05_BB;DATABASE=DEPTFINANCE), etc.) and I want to create a table that will store the data from this linked table in a local table, so I can use it to run other queries. Currently I can't use it because it tells me that it cannot make a connection (ODBC--connection to 'WTSTSQL05_BB' failed).
Do I have to create a table first and assign all the fields before I can do this (create a table with the same fields as the linked table and then create an append query to do this...)?
It sounds like you might have two problems. I will address the second one. You will need to reestablish connection to the linked table before this will work.
You can use a "make table query" in Access to make a local copy of the linked table. You can use the GUI for this, or you can structure the SQL something like this:
SELECT <list of various fields, or * for all fields>
INTO <name of new local table>
FROM <name of linked table(s) on the server>
WHERE <any other conditions you want to put on which records are included>;
I mentioned that there might be more than one table. You can also do this with joined tables or unions etc. The "where" clause is optional. Removing it will copy the entire data set.
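For example, a concrete version of that query, copying the entire linked table from the question into a new local table (the local table name here is just an example), would be:

SELECT *
INTO Billing_Denied_Local
FROM dbo_Billing_denied;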
You will get a warning when you try to execute this query in Access. It will tell you that you are about to write (or overwrite) a table. If you are trying to write a cleaner application with fewer nuisance messages for the end user, call this query from a macro. The macro would need to turn the warnings off, execute the query, then turn the warnings back on.
Microsoft Access does not require you to create this table before you write it; if the table does not exist Access will create this table for you, based on the field definitions in the source data. If a table of the same name does exist, Access will drop this table from your database and then create a new table of that name.
This also implies that the local table you are generating will need a unique name. If your query tries to overwrite the linked table by using the same name, the first thing Access will do is drop the linked table. It will then look for field definitions and input data in the linked table that it just dropped.
Since the new local table will have a new name, queries developed for the linked table will not work with the new local table. One possible work-around would be to rename the linked table in your local Access database. The table name in Access does not need to equal the name in the database it's linking to. The query could then write to a table with the correct name, and previous queries should work. Still, keep in mind that these queries would no longer be working on live data.

Change tracking -- simplest scenario

I am coding in ASP.NET C# 4. The database is SQL Server 2012.
I have a table that has 2000 rows and 10 columns. I want to load this table in memory and if the table is updated/inserted in any way, I want to refresh the in-memory copy from the DB.
I looked into SQL Server Change Tracking, and while it does what I need, it appears I have to write quite a bit of code to select from the change functions -- more coding than I want to do for a simple scenario that I have.
What is the best (simplest) solution for this problem? Do I go with CacheDependency?
I currently have a similar problem: I'm implementing a REST service that returns a table with 50+ columns and I want to cache the data on the client to reduce traffic.
I'm thinking about this implementation:
All my tables have the fields:
ID AutoIncrement (primary key)
Version RowVersion (a numeric value that will be incremented every time the record is updated)
To calculate a "fingerprint" of the table I use the select
select count(*), max(id), sum(version) from ...
Deleting records changes the first value, inserting changes the second value, and updating changes the third value.
So if one of the three values changes, I have to reload the table.
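Applied to a table like the one in the question, the fingerprint query might look like this (a sketch assuming a hypothetical dbo.Person table with an auto-increment ID and a numeric Version column that is bumped on every update):

SELECT
    COUNT(*)                     AS RowCnt,     -- changes when rows are deleted
    MAX(ID)                      AS MaxId,      -- changes when rows are inserted
    SUM(CAST(Version AS bigint)) AS VersionSum  -- changes when rows are updated
FROM dbo.Person;

The application keeps the last three values in memory, re-runs the query periodically, and reloads the cached table only when any of them differs.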

A generic SQL query to archive more than one table

How could we manage the archive process without writing separate stored procedures
in SQL Server 2000?
For example, there are two tables in the current db: student and employee.
The objective is to archive the data in these tables:
student table - data older than 1 year
employee table - data older than 2 years
The date field to be compared in the student table is CreatedDate, and in the employee table it is DOJ.
In addition, I have kept a configuration table with columns ConfigtableName, ConfigColumnName, ConfigCutoffdate.
a) How can I write a generic query so that it dynamically takes the table name as well as the column name from the configuration table and inserts the data into the archive db's tables?
Something like this....
INSERT INTO <ArchiveDb>.dbo.<Table name obtained from config table>
SELECT *
FROM <CurrentDb>.dbo.<Table name obtained from config table>
WHERE <ConfigColumnName obtained from config table> < <Cutoffdate obtained from config table>
b) How do I manage the identity field set option?
c) Is it possible that, if an error occurs in the nth iteration, the error details could be saved to a log?
The only way to construct such a dynamic query in a stored procedure is by using the sp_executesql stored procedure. Read the documentation I linked; it's pretty straightforward.
I am not sure I understand what you mean by "identity field set option", but if you are concerned about duplicate values in a column that should have unique values (PK), I'd recommend that you disable the unique indexes in the archive tables, since they are only for archiving. I don't expect a major issue with duplicate values in an id column; most importantly, that situation should never arise anyway if the archiving tables are identical copies of the source tables.
If you want to catch errors in the nth iteration, you will have to enclose every iteration in a begin tran/commit tran block and check for errors. If there is one, you can log it to any other table you choose; if not, you commit the transaction. Read this for an example (scroll all the way down to the Transactions section).
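Putting those pieces together, a rough sketch compatible with SQL Server 2000 could look like the following (the archive database, configuration table, and error-log table names are all illustrative; TRY/CATCH is not available in SQL Server 2000, so @@ERROR is checked after each insert instead):

DECLARE @tbl sysname, @col sysname, @cutoff datetime, @sql nvarchar(4000), @err int;

DECLARE cfg CURSOR FOR
    SELECT ConfigtableName, ConfigColumnName, ConfigCutoffdate FROM dbo.ArchiveConfig;
OPEN cfg;
FETCH NEXT FROM cfg INTO @tbl, @col, @cutoff;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Note: if the target table has an identity column you would need an explicit
    -- column list plus SET IDENTITY_INSERT ... ON/OFF around the insert.
    SET @sql = N'INSERT INTO ArchiveDb.dbo.' + QUOTENAME(@tbl) +
               N' SELECT * FROM CurrentDb.dbo.' + QUOTENAME(@tbl) +
               N' WHERE ' + QUOTENAME(@col) + N' < @cutoff';

    BEGIN TRAN;
    EXEC sp_executesql @sql, N'@cutoff datetime', @cutoff = @cutoff;
    SET @err = @@ERROR;

    IF @err <> 0
    BEGIN
        -- Roll back this iteration only and record which table failed.
        ROLLBACK TRAN;
        INSERT INTO dbo.ArchiveErrorLog (TableName, ErrorNumber, LoggedAt)
        VALUES (@tbl, @err, GETDATE());
    END
    ELSE
        COMMIT TRAN;

    FETCH NEXT FROM cfg INTO @tbl, @col, @cutoff;
END

CLOSE cfg;
DEALLOCATE cfg;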
