Data pipeline for big data transfer - database

Context:
I need to extract data from a DB owned by another team to run some modeling. The extraction frequency is biweekly, and the data size is around 500k-1 million rows.
Question:
So far I have either connected directly (by asking the other team to create a DB role for my extraction) or gotten a dump file from them.
What are some other ways we could extract the data? Would web services be a good option?
Thank you in advance.
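For context, the direct-connection pull I have set up looks roughly like the sketch below; the connection string, table name, and date filter are placeholders, and it assumes a read-only role plus pandas/SQLAlchemy.

# Sketch of a biweekly pull over a read-only DB role (all names are placeholders).
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://readonly_user:***@source-db-host/source_db")

# Stream the 500k-1M rows in chunks rather than one large fetch.
query = text("SELECT * FROM source_table WHERE updated_at >= :since")
chunks = pd.read_sql_query(
    query,
    engine,
    params={"since": "2024-01-01"},  # e.g. the previous extraction date
    chunksize=50_000,
)

for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"extract/part_{i:04d}.parquet", index=False)  # requires pyarrow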

Related

I have functions inside a Databricks notebook that pull from Snowflake and S3; should the data be staged in Databricks or Snowflake?

I am creating a system which pulls data from S3 buckets and Snowflake tables (I also have access to the Snowflake portal). I will be running data quality/data validation checks against this incoming data inside a Databricks notebook. My question is: when I pull this data in, I'll have to stage it somewhere to run those DQ checks. Does it make more sense to stage this data in Databricks or in Snowflake?
Thanks
What I've researched: Databricks + Snowflake staging and architecture
In general, it's normally a good idea to hold data as close to where it is being processed as possible. If Databricks is going to be processing the data directly, then hold the data in Databricks; if Databricks is going to push processing down to Snowflake, then hold the data in Snowflake. A sketch of the first option follows.
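A minimal sketch of staging both inputs in Databricks and running the checks there; the Snowflake connection options, secret scope, and table names are all placeholder assumptions.

# Databricks notebook sketch: pull from Snowflake and S3, stage as Delta tables
# in Databricks, then run the DQ checks next to the staged data.
# All connection options, secrets, and names are placeholders.
sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",
    "sfUser": "dq_user",
    "sfPassword": dbutils.secrets.get("dq-scope", "sf-password"),
    "sfDatabase": "SALES_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "DQ_WH",
}

sf_df = (spark.read.format("snowflake")
         .options(**sf_options)
         .option("dbtable", "ORDERS")
         .load())

s3_df = spark.read.parquet("s3://my-bucket/landing/orders/")

# Stage both inputs as Delta tables so the checks run where the data lives.
spark.sql("CREATE DATABASE IF NOT EXISTS staging")
sf_df.write.format("delta").mode("overwrite").saveAsTable("staging.orders_sf")
s3_df.write.format("delta").mode("overwrite").saveAsTable("staging.orders_s3")

# Example DQ check: row counts from the two sources should match.
assert spark.table("staging.orders_sf").count() == spark.table("staging.orders_s3").count()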

Create a Data Warehouse with the database on SQL Developer

I have a database in SQL Developer which pulls data from an ERP tool, and I would like to create a data warehouse so that I can then connect it to Power BI.
This is my first time doing this whole process from the beginning, so I am not very experienced.
Where would you suggest creating the data warehouse (I was thinking of SSMS), and how can I connect it to Power BI?
My data warehouse will consist of some views over my tables and some joins to get the data into the structure I want, since it is not possible to change anything in the source DB.
Thanks in advance.
A "data warehouse" is just a database. The distinction is really more about the commonly used schema design, in the sense that a warehouse is often built along the lines of a star or snowflake design.
So if you already have a database that is extracting data from your ERP, there is nothing to stop you from pointing Power BI directly at it and performing some analytics etc. If your intention is to start with this database and then clone/extract/load the data into a new database with a star/snowflake schema, then that's a much bigger exercise.

Looping Through Tables in a DB in Informatica

I am looking for a way in Informatica to pull data from a table in a database, load it into Snowflake, and then move on to the next table in that same DB, repeating this for the remaining tables in the database.
We currently have this set up and running in Matillion, where an orchestration grabs all of the table names in a database and then loops through each of those tables to send the data into Snowflake.
My team and I have asked Informatica Global Support, but they have not been very helpful in figuring out how to accomplish this. They have suggested things like Dynamic Mapping, which I do not think will work for our particular case, since we are essentially just trying to move data from one database to a Snowflake database and do not need any other transformations.
Please let me know if any additional clarification is needed.
Dynamic Mapping Task is your answer. You create one mapping, with or without transformations, as you need. Then you set up a Dynamic Mapping Task to execute that mapping across the whole set of your 60+ different sources and targets.
Please note that this is available as part of the Cloud Data Integration module of IICS; it's not available in PowerCenter.
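For comparison, the loop itself is simple to express outside Informatica; the plain-Python sketch below (pandas plus the Snowflake Python connector, with all connection details and schema names as placeholders) is roughly the fan-out that the Dynamic Mapping Task automates for you.

# Illustrative only: the table-by-table copy loop in plain Python, not Informatica.
# Connection strings, schema names, and credentials are placeholders.
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
from sqlalchemy import create_engine, text

source = create_engine("mssql+pyodbc://loader:***@source-dsn")
sf_conn = snowflake.connector.connect(
    account="myaccount", user="loader", password="***",
    warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
)

# 1. Grab the table names from the source database.
tables = pd.read_sql_query(
    text("SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
         "WHERE TABLE_SCHEMA = 'dbo' AND TABLE_TYPE = 'BASE TABLE'"),
    source,
)["TABLE_NAME"]

# 2. Loop: pull each table and land it in Snowflake, no transformations.
for table in tables:
    df = pd.read_sql_query(text(f"SELECT * FROM dbo.{table}"), source)
    write_pandas(sf_conn, df, table_name=table.upper(),
                 auto_create_table=True)  # auto_create_table needs a recent connector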

Azure Data Factory multiple tables

I have a business scenario: we have to pull all the tables from one database, say AdventureWorks, and put each table's data into a separate CSV in the data lake. For example, if the AdventureWorks DB has 20 tables, I need to pull all the tables in parallel, with one CSV per table, i.e. 20 tables produce 20 CSVs in Azure Data Lake. How do I do this using Azure Data Factory? Please don't suggest the ForEach activity; it processes files sequentially and is time-consuming.
In Data Factory, there are two ways to create 20 CSV files from 20 tables in one pipeline: the ForEach activity and Data Flow.
In a Data Flow, add 20 sources and 20 sinks, one pair per table.
No matter which way you choose, the copy activities run sequentially and take some time.
What you should do is think about how to improve the copy performance, as Thiago Gustodio said in the comments; it can help you save time.
For example, assign more DTUs to your database and use more DIUs for your copy activity.
Please reference these Data Factory documents:
Mapping data flows performance and tuning guide
Copy activity performance and scalability guide
They both provide performance guidance for you.
Hope this helps.

Copying on-premises SQL Server database data to Azure in Parquet format

Architectural/perf question here.
I have an on-premises SQL Server database which has ~200 tables totaling ~10 TB.
I need to make this data available in Azure in Parquet format for Data Science analysis via HDInsight Spark.
What is the optimal way to copy/convert this data to Azure (Blob storage or Data Lake) in Parquet format?
Due to the manageability aspect of the task (~200 tables), my best shot so far is: extract the data locally to a file share via sqlcmd, compress it as csv.bz2, and use Data Factory to copy the file share (with 'PreserveHierarchy') to Azure. Finally, run PySpark to load the data and save it as .parquet, roughly along the lines of the sketch below.
Given the table schemas, I can auto-generate the SQL data extract and Python scripts from the SQL database via T-SQL.
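For that last step, the PySpark conversion I have in mind looks roughly like this; the storage account, containers, and table list are placeholders, and it assumes the copy preserved one folder per table.

# Minimal PySpark sketch: convert the staged csv.bz2 extracts to Parquet.
# Storage account, container, and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-bz2-to-parquet").getOrCreate()

tables = ["Customers", "Orders"]  # in practice, generated from the table schemas

for table in tables:
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")  # or supply an explicit schema per table
          .csv(f"wasbs://staging@myaccount.blob.core.windows.net/{table}/*.csv.bz2"))
    (df.write
       .mode("overwrite")
       .parquet(f"wasbs://curated@myaccount.blob.core.windows.net/parquet/{table}/"))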
Are there faster and/or more manageable ways to accomplish this?
ADF matches your requirement well, for both one-time and schedule-based data moves.
Try the Copy Wizard in ADF. With it, you can move on-prem SQL directly to Blob/ADLS in Parquet format with just a couple of clicks.
Copy Activity Overview
