I have created a pipeline in Azure Data Factory (ADF) that runs an ETL process to produce a 15 GB clean file, "Final_Data_2023.csv", in an Azure Storage container; this clean file is then copied into the SQL Server table dbo.Final_Table.
This process runs every month, and next month we will prepare a new Final_Data_2023.csv clean file. I then need to truncate dbo.Final_Table and push the new data into it. My concern is that the new data might contain completely wrong values, in which case, as a quick fix, I would need the old Final_Data_2023.csv back in dbo.Final_Table. Since I am truncating all data from the table, it would otherwise be impossible to get it back.
How should I design my architecture so that I can quickly access or maintain the previous month's data and revert to it if something goes wrong?
It doesn't have to be a small workaround.
An Azure Storage container is well suited to storing massive amounts of data, so you can design the pipeline to keep all the backup data in the container and copy only the latest file into the SQL Server table.
Approach:
Store the file with the date in its name when copying it to the Azure Storage container.
Copy that same file from the container to the database.
Below are the detailed steps.
Store the file name, prefixed with the current date, in a variable using a Set Variable activity:
@concat(substring(utcnow(),0,10),'_filename.csv')
Add a copy activity that copies the data from SQL Server to the storage container.
In this copy activity, use your source dataset as-is; in the sink dataset, create a parameter for the file name.
Reference the parameter as @dataset().fileName in the sink dataset's file path.
In the copy activity's sink settings, pass the variable's value to that dataset parameter.
Add another copy activity to copy from the container to the database. Reuse the parameterized container dataset from copy activity 1 as the source and pass the variable to its dataset parameter.
For the sink, use the SQL Server dataset.
This way, the backup data stays in the Azure container and only the new clean file is copied to the database. If you ever need to roll back to a previous version of the file, you can copy that data back from the container.
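To make the rollback concrete: reverting is simply re-running the container-to-database copy activity with the older file's dated name passed to the dataset parameter. On the SQL side, one way to handle the clearing step is a pre-copy script on the SQL sink; a minimal sketch, using the table name from the question:

-- Pre-copy script on the SQL sink of the container-to-DB copy activity:
-- clears the target before each monthly load and before a rollback reload
-- from an older, dated backup file in the container.
TRUNCATE TABLE dbo.Final_Table;

-- Optional sanity check once the copy activity completes.
SELECT COUNT(*) AS loaded_rows FROM dbo.Final_Table;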
I frequently need to validate CSVs submitted by clients to make sure that the headers and values in the file meet our specifications. Typically I do this by using the Import/Export Wizard and having the wizard create the table based on the CSV (the file name becomes the table name, and the headers become the column names). Then we run a set of stored procedures that check the information_schema for said table(s) and match that up with our specs, etc.
Most of the time this involves loading multiple files at a time for a client, which becomes very time consuming and laborious very quickly with the Import/Export Wizard. I tried using an xp_cmdshell SQL script to load everything from a path at once to get the same result, but xp_cmdshell is not supported in Azure SQL DB.
https://learn.microsoft.com/en-us/azure/azure-sql/load-from-csv-with-bcp
The above says that one can load using bcp, but it also requires the table to exist before the import... I need the table structure to mimic the CSV. Any ideas here?
Thanks
If you want to load the data into your target SQL DB, you can use Azure Data Factory (ADF) to upload your CSV files to Azure Blob Storage, and then use a Copy Data activity to load the data from those CSV files into Azure SQL DB tables, without creating the tables upfront.
ADF supports auto-creating the sink table: in the copy activity's Azure SQL sink, set the table option to 'Auto create table'.
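Once the sink tables have been auto-created, your existing INFORMATION_SCHEMA checks still apply to them. A minimal sketch, assuming a hypothetical file-derived table name of ClientFile1:

-- Inspect the columns of an auto-created table so they can be matched against the spec.
-- 'ClientFile1' is a placeholder for whatever name the file produced.
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'ClientFile1'
ORDER BY ORDINAL_POSITION;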
I would like to know the steps for restoring data dumped from an Oracle database into a SQL Server database.
Our purpose is to get data from an external Oracle database outside our organization. Due to security concerns, the team that manages the data source refused to let us transfer the data through an ODBC server link. Instead, they dumped the selected tables we need so we can restore the data inside our organization. Each table's data files include a .sql file that creates the table and constraints, a ".ctl" file, and one or more ".ldr" files.
An extra complication: one of the tables contains a BLOB column that stores a lot of binary files, such as PDFs. This column accounts for most of the size of the dumped files; otherwise I could ask them to send us the data in Excel directly.
Can someone give me a suggestion about what route we should take?
Either get them to export the data in an open format, or load it into an Oracle instance you have full control over and migrate from there. The .ctl and .ldr files look like they came from the old SQL*Loader utility, which is what you would use to load them into that instance.
I am moving data from a folder in Azure Data Lake to SQL Server using Azure Data Factory (ADF).
The folder contains hundreds of .csv files. However, an intermittent problem with these CSVs is that some (not all) have a final row containing a special character, which makes the load fail when the target SQL table uses data types other than NVARCHAR(MAX). To get around this, I first use ADF to load the data into staging tables where every column is NVARCHAR(MAX), and then insert only the rows that do not contain the special character into tables with the appropriate data types.
This is a weekly process involving over a terabyte of data, and moving it takes forever, so I am looking into ways to import directly into my final tables rather than having a staging component.
I notice there is a 'pre-copy script' field that can execute before the load to SQL Server. I want to add code there that will let me strip out special characters or null rows before loading to SQL Server.
I am unsure how to approach this, since the CSVs are not stored in a table, so SQL code wouldn't apply to them. Any guidance on how I can use the pre-copy script to clean my data before loading it into SQL Server?
The pre-copy script is a script that runs against the destination database before the new data is copied in; it is not meant to modify the data you are ingesting.
I already answered this on another question, providing a possible solution using an intermediate table: Pre-copy script in data factory or on the fly data processing
Hope this helped!
You could also consider invoking a stored procedure for the SQL sink: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-sql-database#invoking-stored-procedure-for-sql-sink
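With that option, the copy activity hands each batch of rows to a stored procedure through a table-valued parameter, so the procedure can drop the bad rows before inserting into the final table. A rough sketch, with hypothetical table, type, and column names:

-- Hypothetical final table: dbo.FinalTable (Id INT, Payload NVARCHAR(200)).
-- The table type mirrors the raw CSV columns as text, much like the staging tables did.
CREATE TYPE dbo.FinalTableType AS TABLE
(
    Id      NVARCHAR(MAX),
    Payload NVARCHAR(MAX)
);
GO

-- The copy activity sink is configured with this procedure name and table type;
-- ADF passes the copied rows in as @rows.
CREATE PROCEDURE dbo.usp_LoadFinalTable
    @rows dbo.FinalTableType READONLY
AS
BEGIN
    INSERT INTO dbo.FinalTable (Id, Payload)
    SELECT CAST(Id AS INT), Payload
    FROM @rows
    WHERE Id IS NOT NULL
      AND Payload NOT LIKE '%' + CHAR(26) + '%';  -- CHAR(26) stands in for the offending special character
END
GO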
First things first: I'm totally new to SSIS and trying to figure out its potential for ETL, with the aim of eventually moving on to SSAS. I have the following scenario:
I have an InterSystems database which I can connect to via ADO.NET.
I want to take data from this DB and insert it into MS SQL through incremental loads.
My proposed solution/target is:
Have a table in MS SQL that stores the last pointer read or a date/time snapshot (irrelevant at this stage; let's keep it simple and say we are going to use the record ID that exists in the InterSystems database) — a rough SQL sketch of this follows the list.
Get the pointer from this table and use it as a parameter, through ODBC, to read the source database and then insert the rows into the target MS SQL DB.
Update the pointer with the last record read so that next time we continue from there (I don't want to get into the complications of updates/deletes; let's keep it simple).
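For reference, the pointer table and the queries around it could look roughly like this (hypothetical names; the source query runs against the InterSystems database over the ADO.NET/ODBC connection):

-- Hypothetical pointer table on the MS SQL side.
CREATE TABLE dbo.LoadPointer
(
    SourceTable  VARCHAR(128) NOT NULL PRIMARY KEY,
    LastRecordId BIGINT       NOT NULL
);

-- Read the pointer; this value becomes the parameter for the source query.
SELECT LastRecordId FROM dbo.LoadPointer WHERE SourceTable = 'MySourceTable';

-- Source-side query (ODBC-style ? placeholder for the pointer value):
-- SELECT * FROM MySourceTable WHERE RecordId > ? ORDER BY RecordId;

-- After a successful load, advance the pointer to the last record read.
UPDATE dbo.LoadPointer
SET LastRecordId = ?          -- the max RecordId loaded in this run
WHERE SourceTable = 'MySourceTable';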
Progress so far:
I have succeeded in making a connection to MS SQL to read the pointer from there and place it in a variable.
I have managed to use the [Execute SQL Task] with parameters to read data from the InterSystems DB, placing the result into a variable as a Full Result Set.
I have managed to use the [ForEach Loop Container] using the [Foreach ADO Enumerator] to go through each record and each field (yeeeey!)
Now I could use a [Script Task] that makes inserts into the MS SQL database using VB.NET code (theoretically) and then updates the counter with the last record read from the source database. I have spent endless hours looking for solutions using ODBC parameters, and the above is the only way forward I could see working.
My question is this:
Is this the only way, and is it best practice? Isn't there some easier way to plug this result set into dataflow components that do the inserts and update the record pointer for me?
Please assume that I do not have write access to the InterSystems DB, so I cannot make any changes to its table structures; I can only read data from it to place into MS SQL.
Over to you guys (or gals?)
I would suggest using a dataflow to improve your design for both efficiency (bulk loading vs row by row in script) and ease of use (no need for scripting).
1. Use an Execute SQL Task to get your pointer and save it into a variable.
2. Build a SQL statement in a string variable, using an expression that splices the pointer variable into the query.
3. Make a connection to the source in the Connection Manager.
4. Add a Data Flow Task and go into it.
5. Add a source component and select your source connection.
6. Set the data access mode to 'SQL command from variable' and choose your variable.
At this point you have all the data you want, and you can continue to transform it or load it directly into your target.
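The 'SQL command from variable' part is just an SSIS string variable whose expression splices the pointer into the statement; what the source actually executes is plain SQL along these lines (hypothetical names):

-- The SSIS variable expression might be something like:
--   "SELECT * FROM MySourceTable WHERE RecordId > " + (DT_WSTR, 20) @[User::LastPointer]
-- At run time it evaluates to a query of this shape:
SELECT *
FROM MySourceTable
WHERE RecordId > 42        -- 42 = the pointer value read in step 1
ORDER BY RecordId;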
Edit: Record Pointer part
7. Add a Multicast (this makes as many copies of the stream as you want).
8. Add an Aggregate transformation and take the MAX of whatever your pointer column is.
9. Add an OLE DB Command (it allows live SQL and is used mainly for updates):
9a. UPDATE "YourPointerTable" SET "PointerField in DB" = ? (the ? is literally what you enter);
9b. map the ? parameter to the MAX column you created in step 8.
This will also allow you to handle inserts/updates:
10. From the Multicast, flow a second stream into a Lookup and map your key to the destination table's key.
11. Configure rows with no matching entries to redirect to the no-match output.
12. Your matched rows map to an UPDATE (see the OLE DB Command sketch below).
13. Your no-match rows map to an insert (an OLE DB Destination).
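For the match branch, the OLE DB Command carries a parameterized UPDATE much like the pointer update in step 9; a sketch with hypothetical table and column names (the ? placeholders are mapped from the dataflow columns):

-- Matched rows from the Lookup: update the existing destination row in place.
UPDATE dbo.DestinationTable
SET    Col1 = ?,            -- mapped from the source Col1 column
       Col2 = ?             -- mapped from the source Col2 column
WHERE  RecordId = ?;        -- mapped from the key used in the Lookup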
I am trying to copy data from views on a trusted SQL Server 2012 to tables on a local instance of SQL Server on a scheduled transfer. What would be the best practice for this situation?
Here are the options I have come up with so far:
Write an executable program in C# or VB to delete the existing local table, query the data from the remote database, and then write the results to tables in the local database. The executable would run as a scheduled task.
Use BCP to copy the data to a file and then load it into the local table.
Use SSIS
Note: The connection between local and remote SQL Server is very slow.
Since the transfers are scheduled, I suppose you want this data to be kept up to date.
My recommendation would be to use SSIS and schedule it using SQL Agent. If you wrote a C# program, I think the best outcome you would get is a program imitating SSIS. Moreover, with SSIS it will be very easy to amend the workflow at any time.
Either way, to keep such a program/package up to date, you will have to answer an important question: is the source table updatable, or is it append-only like a log (inserts only)?
This question is important because it determines how you will fetch new updates from the source table. For example, if the table represents a log, you will most probably use the primary key to detect new records; if not, you will want a column representing the update date/time. If you have the authority to alter the source table, you might add a timestamp column to represent the row version (timestamp is not the same as datetime).
The SSIS package will mainly contain the following components (a SQL sketch of these queries follows the list):
1. An Execute SQL Task to get the maximum value from the source table.
2. An Execute SQL Task to get the value the destination should start loading from. You can get this value either by selecting the maximum value from the destination table or, if the table is pretty large, by storing it in another table (a configuration table, for example).
3. A Data Flow that moves the data from the source table, starting after the value fetched in step 2 and up to the value fetched in step 1.
4. An Execute SQL Task that updates the new maximum value back to the configuration table, if you chose that technique.
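In SQL terms, those tasks and the Data Flow source boil down to queries like the following sketch (hypothetical table and column names; the Data Flow source query is parameterized with the two fetched values):

-- Step 1: the maximum value currently available at the source view.
SELECT MAX(UpdatedAt) AS MaxSourceValue FROM dbo.SourceView;

-- Step 2: the last value already loaded, read from a small configuration table.
SELECT LastLoadedValue FROM dbo.LoadConfig WHERE TableName = 'SourceView';

-- Step 3: Data Flow source query, parameterized with the two values above.
-- SELECT * FROM dbo.SourceView WHERE UpdatedAt > ? AND UpdatedAt <= ?;

-- Step 4: persist the new high-water mark after a successful load.
UPDATE dbo.LoadConfig SET LastLoadedValue = ? WHERE TableName = 'SourceView';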
BCP can be used to export the data, which can then be compressed, transferred over the network, and imported into the local SQL instance.
BCP exports can also be broken into smaller batches of data for easier management.
https://msdn.microsoft.com/en-us/library/ms191232.aspx
https://technet.microsoft.com/en-us/library/ms190923(v=sql.105).aspx
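On the import side, the same batching idea can also be expressed in T-SQL with BULK INSERT instead of the bcp command line; a sketch with hypothetical file and table names (BATCHSIZE makes each batch commit separately):

-- Load the transferred export file into the local table in 10,000-row batches,
-- so a failure only rolls back the current batch rather than the whole load.
BULK INSERT dbo.LocalTable
FROM 'C:\transfer\SourceView.dat'      -- hypothetical path to the transferred file
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 10000,
    TABLOCK
);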