AzureSynapse pipeline how to add guid to raw data

AzureSynapse pipeline how to add guid to raw data - sql-server

I am new to AzureSynapse and am technically a Data Scientist whos doing a Data Engineering task. Please help!
I have some xlsx files containing raw data that I need to import into an SQL database table. The issue is that the raw data does not have a uniqueidentifer column and I need to add one before inserting the data into my SQL database.
I have been able to successfully add all the rows to the table by adding a new column on the Copy Data command and setting it to be #guid(). However, this sets the guid of every row to the same value (not unique for each row).
GUID mapping:
DB Result:
If I do not add this mapping, the pipeline throws an error stating that it cannot import a NULL Id into the column Id. Which makes sense as this column does not accept NULL values.
Is there a way to have AzureSynapse analystics read in a raw xlsx file and then import it into my DB with a unique identifier for each row? If so, how can I accomplish this?
Many many thanks for any support.

Giving dynamic content to a column in this way would generate the same value for entire column.
Instead, you can generate a new guid for each row using a for each activity.
You can retrieve the data from your source excel file using a lookup activity (my source only has name column). Give the output array of lookup activity to for each activity.
#activity('Lookup1').output.value
Inside for each, since you already have a linked service, create a script activity. In this script activity, you can create a query with dynamic content to insert values into the destination table. The following is the query I built using dynamic content.
insert into demo values ('#{guid()}','#{item().name}')
This allows you to iterate through source rows, insert each row individually while generating new guid every time
You can follow the above procedure to build a query to insert each row with unique identifier value. The following is an image where I used copy data to insert first 2 rows (same as yours) and inserted the next 2 rows using the above procedure.
NOTE: I have taken Azure SQL database for demo, but that does not affect the procedure.

Related

Datafactory - dynamically copy subsection of columns from one database table to another

I have a database on SQL Server on premises and need to regularly copy the data from 80 different tables to an Azure SQL Database. For each table the columns I need to select from and map are different - example, TableA - I need columns 1,2 and 5. For TableB I need just column 1. The tables are named the same in the source and target, but the column names are different.
I could create multiple Copy data pipelines and select the source and target data sets and map to the target table structures, but that seems like a lot of work for what is ultimately the same process repeated.
I've so far created a meta table, which lists all the tables and the column mapping information. This table holds the following data:
SourceSchema, SourceTableName, SourceColumnName, TargetSchema, TargetTableName, TargetColumnName.
For each table, data is held in this table to map the source tables to the target tables.
I have then created a lookup which selects each table from the mapping table. It then does a for each loop and does another lookup to get the source and target column data for the table in the foreach iteration.
From this information, I'm able to map the Source table and the Sink table in a Copy Data activity created within the foreach loop, but I'm not sure how I can dynamically map the columns, or dynamically select only the columns I require from each source table.
I have the "activity('LookupColumns').output" from the column lookup, but would be grateful if someone could suggest how I can use this to then map the source columns to the target columns for the copy activity. Thanks.

In your case, you can use the expression in the mapping setting.
It needs your provide an expression and it's data should like this:{"type":"TabularTranslator","mappings":[{"source":{"name":"Id"},"sink":{"name":"CustomerID"}},{"source":{"name":"Name"},"sink":{"name":"LastName"}},{"source":{"name":"LastModifiedDate"},"sink":{"name":"ModifiedDate"}}]}
So you need to add a column named as Translator in your meta table, and it's value should be like the above JSON data. Then use this expression to do mapping:#item().Translator
Reference: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#parameterize-mapping

SQL Server / SSIS: Create new ID for inserted row of merge

I am kind of stuck:
Scenario:
I have a SSIS-Package, which loads data incrementally. In the first step, I load rows from a source which have a) been inserted or b) updated into a staging table. I do this by using the last timestamp of the source table.
In the next step, I am trying to use a MERGE-Statement to update the data in another database (similar to a data warehouse). I have no control over this other database, otherwise my task would be quite easy.
Problem:
The data warehouse table includes an ID-Column ([cId], BIGINT), which it does not set by itself. I have tried to create a sequence, from which I pull a value whenever I insert a new row into the data warehouse (not when I update a row, since that row will already have an ID). However, as specified here, SQL Server will not let me use the next value from my sequence for the target of a MERGE-Statement. Since I have no control over the data warehouse, I cannot change this.
Another solution would be to get the next value from my sequence when I load the data into the staging table. This, however, will result in "holen" in my ID-sequence, because when I update a row in the data warehouse from my staging table, the [cId] column would not be updated, since that row already has an ID.
Does anyone have an idea how to solve this? I am basically trying to pull a new, unique BIGINT, whenever I do an insert inside my MERGE-Statement.
Thanks!

How to use the pre-copy script from the copy activity to remove records in the sink based on the change tracking table from the source?

I am trying to use change tracking to copy data incrementally from a SQL Server to an Azure SQL Database. I followed the tutorial on Microsoft Azure documentation but I ran into some problems when implementing this for a large number of tables.
In the source part of the copy activity I can use a query that gives me a change table of all the records that are updated, inserted or deleted since the last change tracking version. This table will look something like
PersonID Age Name SYS_CHANGE_OPERATION
---------------------------------------------
1 12 John U
2 15 James U
3 NULL NULL D
4 25 Jane I
with PersonID being the primary key for this table.
The problem is that the copy activity can only append the data to the Azure SQL Database so when a record gets updated it gives an error because of a duplicate primary key. I can deal with this problem by letting the copy activity use a stored procedure that merges the data into the table on the Azure SQL Database, but the problem is that I have a large number of tables.
I would like the pre-copy script to delete the deleted and updated records on the Azure SQL Database, but I can't figure out how to do this. Do I need to create separate stored procedures and corresponding table types for each table that I want to copy or is there a way for the pre-copy script to delete records based on the change tracking table?

You have to use a LookUp activity before the Copy Activity. With that LookUp activity you can query the database so that you get the deleted and updated PersonIDs, preferably all in one field, separated by comma (so its easier to use in the pre-copy script). More information here: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity
Then you can do the following in your pre-copy script:
delete from TableName where PersonID in (#{activity('MyLookUp').output.firstRow.PersonIDs})
This way you will be deleting all the deleted or updated rows before inserting the new ones.
Hope this helped!

In the meanwhile the Azure Data Factory provides the meta-data driven copy task. After going through the dialogue driven setup, a metadata table is created, which has one row for each dataset to be synchronized. I solved this UPSERT problem by adding a stored procedure as well as a table type for each dataset to be synchronized. Then I added the relevant information in the metadata table for each row like this
{
"preCopyScript": null,
"tableOption": "autoCreate",
"storedProcedure": "schemaname.UPSERT_SHOP_SP",
"tableType": "schemaname.TABLE_TYPE_SHOP",
"tableTypeParameterName": "shops"
}
After that you need to adapt the sink properties of the copy task like this (stored procedure, table type, table type parameter name):
#json(item().CopySinkSettings).storedProcedure
#json(item().CopySinkSettings).tableType
#json(item().CopySinkSettings).tableTypeParameterName
If the destination table does not exist, you need to run the whole task once before adding the above variables, because auto-create of tables works only as long as no stored procedure is given in the sink properties.

Need help in a MS SQL DB design

Hi I have 2 type of data entry which needs to be stored in db so it can be used for calculations later. Each entry has a unique id for it. The data entry are -
1.
2.
So I have to save this data in DB. With my understanding I thought of the following -
Create 3 tables - Common, Entry1 and Entry2(multiple tables with unique id as name)
The Common table will have a unique entry of each data and which table refer to for the value (Entry1/Entry2).
The Entry1 data is a single line so it can be inserted. But the Entry2 data will require a complete table because of its structure. So whenever we add a type 2 entry then a new table has to be created, which will create a lot of tables.
Or I could save the type2 values in another database and fetch the values from there. So please suggest me a way which is better than this.

I believe that you have 2 entry types with identical structure, but one containing a single row and one containing many.
In this case, I would suggest a single table containing the data for all entries, wtih a second table grouping them together. Even if your input contains a single row, it should still gain an EntryID. Perhaps something like the below:

generate script to create only a single row

I was referencing this question How to export all data from table to an insertable sql format?
while looking for a way to create an insert statement for a single row from a table without having to manually write it since the table has many columns. In my case I simply followed the steps listed then performed a ctrl-f search in the resulting script for the record I wanted then copied and pasted that single line to another query window but this would be terrible if I had hundreds of millions of rows. Is there a way to get the same functionality but tell the script generator I only want rows where id = value? Is there a better way to do this using only the out of the box Microsoft tools?

There is no way to do this, but you can do it by using a temp table
Create a new table by inset into and select those records which you want to insert.
Create the script and change the table name by using find and replace.
finally drop that temporary table.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight