Azure Data Factory multiple tables - database

I have a business scenario: we have to pull all the tables from one database, say AdventureWorks, and land each table's data as a separate CSV file in the data lake. For example, if the AdventureWorks DB has 20 tables, I need to pull all the tables in parallel, with one CSV per table, i.e. 20 tables produce 20 CSV files in Azure Data Lake. How can I do this with Azure Data Factory? Kindly don't suggest the ForEach activity; it processes the tables sequentially and is time-consuming.

In Data Factory, there are two ways to create 20 CSV files from 20 tables in one pipeline: the ForEach activity and Data Flow.
In a Data Flow, add 20 sources and 20 sinks, for example:
Either way, the copy activities run one after another and take some time.
What you should do is think about how to improve copy performance, as Thiago Gustodio said in the comments; it can help you save time.
For example, assign more DTUs to your database and use more DIUs for your copy activity.
Please refer to these Data Factory documents:
Mapping data flows performance and tuning guide
Copy activity performance and scalability guide
Both provide performance guidance.
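For illustration only, here is the same per-table fan-out expressed outside Data Factory as a minimal Python sketch (the connection string, output folder, and 20-worker limit are assumptions for the example): it lists the base tables, then exports each one to its own CSV in parallel.

```python
# Minimal sketch (not an ADF pipeline): export every table to its own CSV in parallel.
# CONN_STR and OUT_DIR are hypothetical placeholders.
import csv
import os
from concurrent.futures import ThreadPoolExecutor

import pyodbc

CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=AdventureWorks;..."
OUT_DIR = "./exports"


def list_tables():
    conn = pyodbc.connect(CONN_STR)
    rows = conn.cursor().execute(
        "SELECT TABLE_SCHEMA, TABLE_NAME "
        "FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
    ).fetchall()
    conn.close()
    return [(r.TABLE_SCHEMA, r.TABLE_NAME) for r in rows]


def export_table(schema, table):
    # Each worker opens its own connection and streams one table into one CSV file.
    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor().execute(f"SELECT * FROM [{schema}].[{table}]")
    path = os.path.join(OUT_DIR, f"{schema}.{table}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        for row in cursor:
            writer.writerow(list(row))
    conn.close()


if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    # Run up to 20 exports concurrently: one table, one CSV.
    with ThreadPoolExecutor(max_workers=20) as pool:
        for schema, table in list_tables():
            pool.submit(export_table, schema, table)
```

Within Data Factory itself, the performance knobs are the DIUs assigned to each copy activity and, if you do use ForEach, its batch count setting.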
Hope this helps.

Related

Looping Through Tables in a DB in Informatica

I am looking for a way in Informatica to pull data from a table in a database, load it into Snowflake, and then move on to the next table in that same DB, repeating that for the remaining tables in the database.
We currently have this set up in Matillion, where an orchestration grabs all of the table names in a database and then loops through each of those tables to send the data into Snowflake.
My team and I have tried to ask Informatica Global Support, but they have not been very helpful for us to figure out how to accomplish this. They have suggested things like Dynamic Mapping, which I do not think will work for our particular case since we are in essence trying to get data from one database to a Snowflake database and do not need to do any other transformations.
Please let me know if any additional clarification is needed.
Dynamic Mapping Task is your answer. You create one mapping, with or without transformations, as you need. Then you set up a Dynamic Mapping Task to execute that mapping across your whole set of 60+ different sources and targets.
Please note that this is available as part of the Cloud Data Integration module of IICS. It's not available in PowerCenter.

How to move data from S3 to Snowflake

I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS S3 into Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot; what do you suggest for this problem?
Also, does the schema in AWS need to match the schema in Snowflake for the copying process?
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot uses.
All of these features are covered in the Snowflake documentation:
https://docs.snowflake.com/
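As a minimal illustration of the first option (COPY INTO from an external S3 stage), here is a sketch using the snowflake-connector-python package; the stage, table, and file-format settings and all credentials are placeholders, not values from the question.

```python
# Minimal sketch: load CSV files from an S3 stage into a Snowflake table with COPY INTO.
# Stage/table/format names and credentials below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)
try:
    cur = conn.cursor()
    # External stage pointing at the S3 bucket/prefix that holds the files.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw_s3_stage
          URL = 's3://my-bucket/exports/'
          CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    """)
    # Bulk-load everything under the prefix into the target table.
    cur.execute("""
        COPY INTO customers
        FROM @raw_s3_stage/customers/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
        ON_ERROR = 'ABORT_STATEMENT'
    """)
finally:
    conn.close()
```

Snowpipe and external tables reuse the same stage and file format definitions; the difference is only in when and how the load is triggered.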
Regarding hiding your PII elements, you can use two different roles: one would be, say, data_owner (the role that creates the table and loads the data into it) and another, say, data_modelling (for use by DataRobot).
Create masking policies as the data owner so that the DataRobot role cannot see the column data.
About your question on copying the data: there is no requirement that the AWS S3 folder structure be in sync with Snowflake. You can create the external stage with any name and point it to any S3 folder.
The Snowflake documentation has a good example that helps you get some hands-on experience:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
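A minimal sketch of that two-role masking setup, again via the Python connector; the policy, table, column, and role names are made up for the example.

```python
# Minimal sketch: mask a PII column for every role except the data owner.
# Policy, table, column, and role names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

# Only DATA_OWNER sees the real value; other roles (e.g. DATA_MODELLING, used by DataRobot)
# see a fixed mask instead.
cur.execute("""
    CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
      CASE
        WHEN CURRENT_ROLE() = 'DATA_OWNER' THEN val
        ELSE '***MASKED***'
      END
""")

# Attach the policy to the PII column.
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_email")

conn.close()
```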

What is the process to transfer staging table data to fact tables in Snowflake with custom validations?

Good day.
I need help. I want to move data in Snowflake from staging tables to fact tables automatically, whenever data is available in a staging table. While moving data from the staging tables to the fact tables, I have a couple of custom validations on each column and row.
Any idea how to do this in Snowflake?
If anyone knows, please suggest an approach.
Thanks in advance!
There are many ways to do this, and how you go about it depends on what tools you have available. The simplest way to do it without tools outside the Snowflake ecosystem would be:
Set up a stream on each of your staging tables (here is the Snowflake documentation on streams).
Create a task that runs on a schedule (here is the Snowflake doc on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake one. Here is some more documentation on building SCD type 2 dimensions, also written by someone at Snowflake.
Assuming "staging tables" refers to a Snowflake table and not a file in a Snowflake stage, I would recommend using a Stream and Task for this. A stream will identify the delta of data that needs to be loaded, and a Task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure that is executed in the Task to run your validations and Merge the outcome of those into your Fact.

SSIS copy multiple tables

I have more than 200 MSSQL tables and want to transfer the data to Azure Data Lake Storage.
One approach I considered is to use SSIS with dynamic data flows, i.e. create a table-name variable, do a foreach loop over the table names, and run the data flow for each table. However, this approach seems wrong: though files are created in Data Lake Storage with the correct schemas, the data is not transferred due to wrong mappings.
Is there any generic way to create one dynamic data flow and transfer a huge number of tables' data?
The scenario you are describing can now be achieved in ADF V2, which added a set of rich control flow enhancements including the Lookup activity, parameter passing, and ForEach looping. You can see a tutorial on how to accomplish this here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy

One Database with 20 million records or 51 databases with 50,000-300,000 records in each database?

I've bought a CSV United States business database with ~20 million records, which is divided into 51 databases; each database represents a state.
I need to write an ASP.NET MVC web application that queries this database by state and other arguments. Should I create a SQL Server database and import the records from all 51 CSV files? Or should I query the CSV files directly? Which will be faster? Feel free to suggest other solutions.
Thanks.
Create a single database and put all those records in it, but do it in a structured fashion, of course.
For instance, you could create a table called 'State' and a table called 'Business', and create a relationship between those two tables.
Normalize your database further.
A performant database starts with a good, normalized DB schema.
Add the necessary indexes, and you should be fine.
A database is designed to handle a large number of records.
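As a rough sketch of that 'State' plus 'Business' layout with an index to support query-by-state (all columns beyond the two table names are assumptions), using Python with pyodbc against SQL Server:

```python
# Minimal sketch: one normalized database for all states, with an index that supports
# query-by-state. Column names and the connection string are hypothetical placeholders.
import pyodbc

CONN_STR = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=Businesses;..."

STATEMENTS = [
    # One row per state; the 51 CSV files map onto this lookup table.
    """CREATE TABLE State (
           StateId TINYINT      NOT NULL PRIMARY KEY,
           Code    CHAR(2)      NOT NULL UNIQUE,   -- e.g. 'CA'
           Name    NVARCHAR(50) NOT NULL
       )""",
    # All ~20 million business records in a single table, keyed to their state.
    """CREATE TABLE Business (
           BusinessId BIGINT IDENTITY(1,1) PRIMARY KEY,
           StateId    TINYINT       NOT NULL REFERENCES State(StateId),
           Name       NVARCHAR(200) NOT NULL,
           City       NVARCHAR(100) NULL
       )""",
    # The index that makes 'query by state, plus other arguments' fast.
    "CREATE INDEX IX_Business_StateId ON Business (StateId) INCLUDE (Name, City)",
]

conn = pyodbc.connect(CONN_STR, autocommit=True)
cur = conn.cursor()
for ddl in STATEMENTS:
    cur.execute(ddl)
conn.close()
```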
One table, with appropriate indexes. 20 million records is peanuts.
I would import the data into one big database. As long as the table is correctly indexed, it will offer better query performance: instead of having to scan each file, the engine can use the indexes to speed things up.
