Data pipeline - dumping large files from API responses into AWS, with the final destination being on-premises SQL Server - sql-server

I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud, then pull only what we need (summaries of this raw data) and store that in our on-premises SQL Server for reporting and analytics. We want to do this in the easiest, most logical and robust way. We have chosen AWS as our cloud provider, but since we're in the early phases we're not attached to any particular architecture or services. Because I'm not an expert in the cloud or AWS, I thought I'd post my thoughts on how we can accomplish our goal and see if anyone has any advice for us. Does this architecture for our data pipeline make sense? Are there any alternative services or data flows we should look into? Thanks in advance.
1) Gather data from multiple sources (using APIs)
2) Dump responses from APIs into S3 buckets
3) Use Glue Crawlers to create a Data Catalog of data in S3 buckets
4) Use Athena to query summaries of the data in S3
5) Store data summaries obtained from Athena queries in on-premises SQL Server
Note: We will program the entire data pipeline in Python (which seems like a good call no matter which AWS services we use, as boto3 is pretty awesome from what I've seen so far).
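To make steps 4 and 5 concrete, here is a rough sketch of what I picture in Python with boto3 and pyodbc; the database, table, bucket and connection names are just placeholders:

```python
import time

import boto3
import pyodbc

# Placeholder names -- adjust to your own Athena database, results bucket and SQL Server.
ATHENA_DATABASE = "raw_api_data"
ATHENA_OUTPUT = "s3://my-athena-results-bucket/summaries/"
SQL_SERVER_CONN = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=onprem-sql;DATABASE=Reporting;Trusted_Connection=yes"
)

athena = boto3.client("athena")

# Step 4: run a summary query against the raw data catalogued by the Glue crawler.
execution = athena.start_query_execution(
    QueryString=(
        "SELECT source, CAST(event_date AS DATE) AS event_date, COUNT(*) AS events "
        "FROM api_responses GROUP BY source, CAST(event_date AS DATE)"
    ),
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
if state != "SUCCEEDED":
    raise RuntimeError(f"Athena query ended in state {state}")

# Step 5: write the summary rows to the on-premises SQL Server.
# (get_query_results pages at 1000 rows; summaries should stay well under that.)
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]  # skip header
with pyodbc.connect(SQL_SERVER_CONN) as conn:
    cursor = conn.cursor()
    for row in rows:
        source, event_date, events = (col.get("VarCharValue") for col in row["Data"])
        cursor.execute(
            "INSERT INTO dbo.ApiSummary (Source, EventDate, Events) VALUES (?, ?, ?)",
            source, event_date, int(events),
        )
    conn.commit()
```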

You can use Glue jobs (PySpark) for #4 and #5, and you can automate the flow using Glue triggers.
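For example, a Glue (PySpark) job covering #4 and #5 might look roughly like the sketch below; the catalog database, table name and JDBC details are placeholders, and the on-premises SQL Server has to be reachable from the job (e.g. via a Glue connection into your VPC/VPN).

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw API data through the Data Catalog built by the crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_api_data", table_name="api_responses"
)

# Summarise with Spark SQL instead of Athena.
raw.toDF().createOrReplaceTempView("api_responses")
summary = glue_context.spark_session.sql(
    "SELECT source, to_date(event_date) AS event_date, count(*) AS events "
    "FROM api_responses GROUP BY source, to_date(event_date)"
)

# Write the summary to the on-premises SQL Server over JDBC.
summary.write.format("jdbc").options(
    url="jdbc:sqlserver://onprem-sql:1433;databaseName=Reporting",
    dbtable="dbo.ApiSummary",
    user="etl_user",
    password="...",
).mode("append").save()

job.commit()
```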

Related

Which database to choose in order to store data coming from flat files (CSV, HTML)

I need to design a scalable database architecture to store all the data coming from flat files (CSV, HTML, etc.). These files come from Elasticsearch, and most of the scripts are written in Python. This architecture should automate most of the daily manual processing currently done in Excel, CSV and HTML files, so that all data is retrieved from the database instead of from those files.
Database requirements:
The database must perform well for day-to-day data retrieval and will be queried by multiple teams.
An ER model/schema will be developed for the data with logical relationships.
The database can be hosted in the cloud.
The database must be highly available and able to retrieve data quickly.
This database will be utilized to create multiple dashboards.
The ETL jobs will be responsible for storing data in the database.
There will be many reads from the database and multiple writes each day, with lots of data coming from Elasticsearch and some cloud tools.
I am considering RDS, Azure SQL, DynamoDB, Postgres or Google Cloud. I would like to know which database engine would be the better solution given these requirements. I also want to know how the ETL process should be designed: lambda or kappa architecture.
To store relational data like CSV and Excel files, you can use a relational database. For flat files like HTML, which don't need to be queried, you can simply use a storage account with any cloud service provider, for example Azure.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most of the database management functions such as upgrading, patching, backups, and monitoring without user involvement. Azure SQL Database always runs on the latest stable version of the SQL Server database engine and a patched OS with 99.99% availability. You can restore the database to any point in time. This should be the best choice for storing relational data and running SQL queries.
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Your HTML files can be stored here.
The ETL jobs can be built with Azure Data Factory (ADF). It lets you connect to almost any data source (including sources outside Azure), transform the stored dataset, and load it into the desired destination. Data flow transformations in ADF can handle all the ETL-related tasks.
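If you end up scripting part of the load in Python (as most of your existing scripts already are) rather than using ADF data flows, a minimal sketch with pandas and SQLAlchemy could look like the following; the server, credentials, file and table names are placeholders:

```python
import pandas as pd
import sqlalchemy

# Placeholder Azure SQL Database connection (user, password, server and database).
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://etl_user:password@myserver.database.windows.net:1433/analytics"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Load one CSV extract and append it to a table the dashboards can query.
df = pd.read_csv("daily_extract.csv")
df.to_sql("daily_extract", engine, schema="dbo", if_exists="append", index=False)
```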

Copying tables from databases to a database in AWS in the simplest and most reliable way

I have tables in three databases whose data I want to copy to another database in an automated way, and the data is quite large. My servers are running on AWS. What is the simplest and most reliable way to do this?
Edit
I want them to stay in sync (an automated process, as a DevOps engineer).
The databases are all MySQL and the data is moved between AWS EC2 instances. The data is in the range of 100 GiB to 200 GiB.
Currently, we use Maxwell to capture the data from the tables, move it into Kafka, and then a script written in Java feeds the other database.
I believe you can use AWS Database Migration Service (DMS) to replicate tables from each source into a single target. You would have a single target endpoint and three source endpoints. You would have three replication tasks that would take data from each source and put it into your target. DMS can keep data in sync via ongoing replication. Be sure to read up on the documentation before proceeding as it isn't the most intuitive service to use, but it should be able to do what you are asking.
https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html
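As a rough sketch of wiring up one of the three replication tasks with boto3 (all identifiers, ARNs, hostnames and credentials below are placeholders, and the replication instance is assumed to exist already):

```python
import json

import boto3

dms = boto3.client("dms")

# Source endpoint for one of the three MySQL databases (repeat for the other two).
source = dms.create_endpoint(
    EndpointIdentifier="source-db-1",
    EndpointType="source",
    EngineName="mysql",
    ServerName="source-db-1.example.internal",
    Port=3306,
    Username="repl_user",
    Password="...",
)

# Single target endpoint shared by all three tasks.
target = dms.create_endpoint(
    EndpointIdentifier="target-db",
    EndpointType="target",
    EngineName="mysql",
    ServerName="target-db.example.internal",
    Port=3306,
    Username="repl_user",
    Password="...",
)

# Select which schema/tables to replicate.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-app-schema",
        "object-locator": {"schema-name": "app", "table-name": "%"},
        "rule-action": "include",
    }]
}

# Full load followed by ongoing replication (CDC) keeps the target in sync.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="source-db-1-to-target",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:EXAMPLE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```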

Is there an option to move data directly to Snowflake without going through S3 or other cloud storage

We chose Snowflake as our DWH and we would like to connect different data sources (Salesforce, HubSpot and Zendesk).
Is there a way to extract data from these sources and store them in Snowflake in a staging schema without having to store the data in cloud storage like S3 then reading the data into Snowflake?
Many thanks in advance.
You can use any of the connectors Snowflake provides (ODBC, JDBC, Python, etc.) and any tool that can use one of these connectors. However, they won't perform as well as the COPY INTO approach, which is optimised for bulk loading.
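As a rough illustration of the direct route with the Python connector (account, credentials, schema and table are placeholders; the rows would come from your Salesforce/HubSpot/Zendesk extraction code):

```python
import snowflake.connector

# Placeholder connection details for the staging schema.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="...",
    warehouse="LOAD_WH",
    database="STAGING",
    schema="HUBSPOT",
)

# Rows extracted from the source API (illustrative values only).
rows = [
    ("c-1001", "Alice", "alice@example.com"),
    ("c-1002", "Bob", "bob@example.com"),
]

try:
    cur = conn.cursor()
    # Plain INSERTs go straight into the staging table -- no S3/external stage involved,
    # but as noted above this won't scale like COPY INTO for large volumes.
    cur.executemany(
        "INSERT INTO contacts_stg (contact_id, name, email) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()
finally:
    conn.close()
```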
There are ETL tools, such as Matillion, that use the stage/copy into approach but do it in the background so that it appears that you are loading directly into Snowflake.

How do I replicate data from the Common Data Service to SQL Server on Azure?

I have data in Microsoft's Common Data Service (from Microsoft Dynamics for Talent). I can't use the Data Management Framework as the data in question is in entities that are not available through the DMF.
How do I replicate the data in the CDS back to a SQL database?
What I've tried so far is to create a Logic App (and a Flow; neither worked) that grabs data using the CDS connector and pushes it into a SQL database, but there are several problems with this:
It's a maintenance burden
It's extremely tedious to add new tables, etc. I have written a somewhat horrendous stored proc that tries to create a table based on the JSON-ified data handed to it by the Flow, but this is very error-prone.
It doesn't work at all, since the size of the data exceeds some kind of limitation in the SQL connector and I get spurious errors.
Rather than trying to push through with these issues, I'd rather ask whether there's a better way to achieve this. With the Data Management Framework in Dynamics it was simply a matter of scheduling these sync jobs, which worked pretty well. Is there something similar with CDS?
I've also tried looking at the Data Integration projects in Powerapps, but these only seem to allow me to get data into Powerapps/CDS, not back out...
Common Data Service for Apps provides access to the data using the user interfaces or API, there is no direct access to the underlying database. This architecture has certain limitations when it comes to processing large volumes of data, for example for the purposes of data warehousing, reporting, or using Azure machine learning and analytics tools. Replicating CDS data using Extract, Transform, Load (ETL) tools is possible but inherently complex to maintain.
Data Export Service is a service made available on Microsoft AppSource that adds the ability to replicate Dynamics 365 for Customer Engagement apps data to an Azure SQL Database store in a customer-owned Azure subscription.
Note: The Data Export Service requires a Dynamics 365 for Customer Engagement apps subscription; it is not available on Common Data Service for Apps plans.
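If the Data Export Service is not an option on your plan, the ETL route mentioned above can be sketched roughly as follows in Python; the organisation URL, entity set, attribute names, target table and token acquisition (e.g. via MSAL/Azure AD) are all placeholders or hypothetical:

```python
import pyodbc
import requests

ORG_URL = "https://yourorg.crm.dynamics.com"
ACCESS_TOKEN = "..."  # obtain via Azure AD / MSAL in a real job
SQL_CONN = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=cds_copy;UID=etl_user;PWD=..."
)

# Read one (hypothetical) entity set from the CDS Web API.
response = requests.get(
    f"{ORG_URL}/api/data/v9.1/cdm_workers",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}", "Accept": "application/json"},
)
response.raise_for_status()
records = response.json()["value"]

# Upsert the records into a SQL table (hypothetical column names).
with pyodbc.connect(SQL_CONN) as conn:
    cursor = conn.cursor()
    for record in records:
        cursor.execute(
            "MERGE dbo.Workers AS t "
            "USING (SELECT ? AS Id, ? AS FullName) AS s ON t.Id = s.Id "
            "WHEN MATCHED THEN UPDATE SET FullName = s.FullName "
            "WHEN NOT MATCHED THEN INSERT (Id, FullName) VALUES (s.Id, s.FullName);",
            record["cdm_workerid"], record.get("cdm_fullname"),
        )
    conn.commit()
```

That said, this is exactly the kind of pipeline described above as complex to maintain, so the Data Export Service is preferable where it is available.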

Presto integration with MSSQL

I'm looking for a tutorial or something that allows me to learn Presto step by step.
The idea is to start by integrating files and MSSQL, which is my area of expertise.
Unfortunately, since it is a relatively new area, I haven't found anything beyond the Facebook page or the Presto.io page, and that isn't enough for someone who wants to get to know the big data world from scratch.
I would appreciate your help and/or orientation in this area.
Presto has 2 primary use cases:
querying data stored in a cluster (on Hadoop's HDFS) or in a cloud (e.g. Amazon S3)
data federation, i.e. querying (and joining) data from multiple data sources (e.g. HDFS, S3, traditional RDBMS like PostgreSQL or SQL Server)
As far as SQL Server support is concerned -- Presto has supported connecting to SQL Server since https://github.com/prestosql/presto/commit/072440cbb2c8df2a689c4c903dd325013eae41a0.
When it comes to querying files -- Presto uses Hive's Metastore to keep track of metadata (everything besides actually reading the data). Thus the files must reside on HDFS or S3 to be accessible (other cloud data stores like Azure's Blob are, AFAIK, not supported yet).
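For a first experiment from Python, assuming a running coordinator with a SQL Server catalog configured (a file such as etc/catalog/sqlserver.properties containing connector.name=sqlserver plus the JDBC connection URL and credentials), a query might look roughly like this; host, catalog, schema and table names are placeholders:

```python
import prestodb  # presto-python-client

# Connect to the (placeholder) coordinator; the catalog name matches the
# sqlserver.properties file configured on the Presto cluster.
conn = prestodb.dbapi.connect(
    host="presto-coordinator",
    port=8080,
    user="analyst",
    catalog="sqlserver",
    schema="dbo",
)

cur = conn.cursor()
# Presto SQL uses LIMIT rather than SQL Server's TOP.
cur.execute("SELECT country, count(*) AS customers FROM customers GROUP BY country LIMIT 10")
for row in cur.fetchall():
    print(row)
```

The same cursor can also join this catalog against data in HDFS/S3 exposed through a hive catalog, which is the data-federation use case mentioned above.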
