Which database to choose to store data coming from flat files (CSV, HTML)?

I need to design a scalable database architecture to store all the data coming from flat files (CSV, HTML, etc.). These files come from Elasticsearch, and most of the scripts are written in Python. The architecture should automate most of the daily manual processing currently done in Excel, CSV, and HTML, and all data will be retrieved from this database instead of being populated into CSV and HTML files.
Database requirements:
The database must perform well for day-to-day data retrieval, and it will be queried by multiple teams.
An ER model and schema with logical relationships will be developed for the data.
The database can live in the cloud.
The database must be highly available and support fast retrieval.
The database will be used to build multiple dashboards.
ETL jobs will be responsible for loading data into the database.
There will be many reads from the database and multiple writes each day, with lots of data coming from Elasticsearch and some cloud tools.
I am considering RDS, Azure SQL, DynamoDB, Postgres, or Google Cloud. Which database engine would be the better fit for these requirements? I also want to know how the ETL process should be designed: Lambda or Kappa architecture?

To store relational data such as CSV and Excel files, you can use a relational database. Flat files like HTML, which don't need to be queried, can simply go into a storage account with any cloud service provider, for example Azure.
Azure SQL Database is a fully managed platform-as-a-service (PaaS) database engine that handles most database management functions, such as upgrading, patching, backups, and monitoring, without user involvement. Azure SQL Database always runs on the latest stable version of the SQL Server database engine and a patched OS, with 99.99% availability, and you can restore the database to any point in time. This makes it a good choice for storing relational data and running SQL queries.
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Your HTML files can be stored here.
The ETL jobs can be built with Azure Data Factory (ADF). It can connect to almost any data source (including sources outside Azure), transform the stored dataset, and write it to the desired destination. The Data Flow transformation in ADF can handle all the ETL-related tasks.

Related

How is data loaded or synced in Snowflake

We are considering using Snowflake. I tried looking through the documentation and Google, but without luck. How does Snowflake query/store data? For example, if I have a CSV file, a database, or a data lake, does it query the sources in real time, or does it replicate the data into Snowflake? If it replicates, how often does it update?
Maybe an introduction to the Snowflake architecture will help you here: https://docs.snowflake.com/en/user-guide/intro-key-concepts.html
Let's split your question into two parts:
How does Snowflake store data? Snowflake stores data in its own proprietary file format. The files, called micro-partitions, are in a hybrid columnar format and are stored in, for example, S3 if you run Snowflake on AWS.
How does Snowflake query data? For this, Snowflake uses compute clusters called Virtual Warehouses, which correspond to compute instances of your cloud provider underneath. These are what access and query the files.

Historical data migration from Teradata to Snowflake

What are the steps to be taken to migrate historical data load from Teradata to Snowflake?
Imagine there is 200TB+ of historical data combined from all tables.
I am thinking of two approaches, but I don't have enough expertise and experience to execute them, so I am looking for someone to fill in the gaps and offer some suggestions.
Approach 1- Using TPT/FEXP scripts
I know that TPT/FEXP scripts can be written to generate files for a table. How can I create a single script that generates files for all the tables in the database? (Creating 500-odd individual scripts, one per table, is impractical.)
Once the script is ready, how is it executed in practice? Do we create a shell script and schedule it through an enterprise scheduler like Autosys/Tidal?
Once these files are generated, how do you split them on a Linux machine if each file is huge (the recommended size for loading data into Snowflake is 100-250 MB per file)?
How to move these files to Azure Data Lake?
Use COPY INTO / Snowpipe to load into Snowflake Tables.
Approach 2
Using ADF copy activity to extract data from Teradata and create files in ADLS.
Use COPY INTO/ Snowpipe to load into Snowflake Tables.
Which of these two is the suggested best approach?
In general, what challenges are faced in each of these approaches?
Using ADF will be a much better solution. It also lets you design a data lake as part of your solution.
You can design a generic solution that imports all the tables provided in a configuration. For this, you can choose the recommended file format (Parquet), the size of the files, and parallel loading.
The main challenge you will probably encounter is the poorly working ADF connector for Snowflake. Here you will find my recommendations on how to work around the connector problem and how to use Data Lake Gen2:
Trouble loading data into Snowflake using Azure Data Factory
More recommendations on how to structure Azure Data Lake Storage Gen2 can be found here: Best practices for using Azure Data Lake Storage Gen2

Copying tables from databases to a database in AWS in simplest and most reliable way

I have some tables in three databases whose data I want to copy to another database in an automated way, and the data is quite large. My servers run on AWS. What is the simplest and most reliable way to do this?
Edit
I want them to stay in sync (an automated process; I am a DevOps engineer).
The databases are all MySQL and all run on AWS EC2. The data is in the range of 100 GiB to 200 GiB.
Currently, Maxwell captures the data from the tables and pushes it to Kafka, and then a script written in Java feeds the other database.
I believe you can use AWS Database Migration Service (DMS) to replicate tables from each source into a single target. You would have a single target endpoint and three source endpoints. You would have three replication tasks that would take data from each source and put it into your target. DMS can keep data in sync via ongoing replication. Be sure to read up on the documentation before proceeding as it isn't the most intuitive service to use, but it should be able to do what you are asking.
https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html
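The setup described above (one target endpoint, three source endpoints, three ongoing-replication tasks) can also be scripted with boto3. A hedged sketch, where the ARNs and schema name are placeholders and the DMS client is passed in rather than created here:

```python
import json


def table_mappings(schema):
    """Build the DMS table-mapping JSON that selects every table in the given schema."""
    return json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": "include",
        }]
    })


def create_sync_task(dms, task_id, source_arn, target_arn, instance_arn, schema):
    """One full-load + CDC task per source database; call this three times,
    once per source endpoint, all pointing at the same target endpoint."""
    return dms.create_replication_task(
        ReplicationTaskIdentifier=task_id,
        SourceEndpointArn=source_arn,
        TargetEndpointArn=target_arn,
        ReplicationInstanceArn=instance_arn,
        MigrationType="full-load-and-cdc",  # initial copy, then ongoing replication
        TableMappings=table_mappings(schema),
    )


# Hypothetical usage (requires boto3 and real ARNs):
# import boto3
# dms = boto3.client("dms")
# create_sync_task(dms, "sync-db1", "arn:aws:dms:...:endpoint/src1",
#                  "arn:aws:dms:...:endpoint/target",
#                  "arn:aws:dms:...:rep/instance1", "db1")
```

With `full-load-and-cdc`, DMS does the bulk copy first and then keeps the target in sync from the binlog, which could replace the Maxwell + Kafka + Java pipeline.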

How to regularly download Geo-Replicated Azure Database (PaaS) data to On-Premise database

We have a geo-replicated database in Azure SQL (Platform as a Service). This is a master/slave type arrangement, so the geo-replicated database is read-only.
We want to download data regularly from this Azure SQL database to a SQL Server database on-premise that has the same schema, without it impacting performance too much (the Azure Database is the main database used by the application).
We originally looked at Azure SQL Data Sync, hoping to read data from the geo-replicated copy and pull it down on-premise, but it needs to create triggers and tracking tables. I don't feel comfortable with this: it can't be run against the read-only slave database, so it would have to be set up on the transactional master (impacting application performance), which in turn would re-create these extra data-sync artifacts on the geo-replicated database. It seems messy and bloats the data (we have a large number of tables and data, and Azure PaaS databases are limited in size as it is). We also use Redgate database lifecycle management, which could blow these schema objects and tracking tables away every time we perform a release, since they're not created by us and are not in our source control.
What other viable options are there (other than moving away from PaaS and building a clustered IaaS VM environment across on-prem and cloud, with SQL Server installed, patched, etc.)? Please keep in mind that we are stretched in terms of staff resources, which is why PaaS was an ideal place for our database originally.
I should mention that we want the on-premise database to be 'relatively' in sync with the Azure database, but the on-premise data can be up to an hour old.
Off the top of my head, some options may be SSIS packages, or regularly downloading a BACPAC of the database and restoring it on-premise every 30 minutes (but it's a very large database).
Note, it only needs to be one-directional at this stage (Azure down to on-premise).
You can give Azure Data Factory a try, since it allows you to append data to a destination table, or invoke a stored procedure with custom logic during the copy, when SQL Server is used as a "sink". You can learn more here.
Azure Data Factory allows you to incrementally load data (the delta) after an initial full load by using a watermark column that holds the last-updated timestamp or an incrementing key. The delta-loading solution loads the data changed between an old watermark and a new watermark. You can learn more about how to do that with Azure Data Factory in this article.
Hope this helps.
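The watermark pattern ADF implements can be sketched in plain Python against any two DB-API connections. sqlite3 stands in here for the Azure SQL source and the on-premise SQL Server sink, and the table and column names are hypothetical:

```python
import sqlite3  # stand-in for the Azure SQL source and on-prem SQL Server sink


def incremental_copy(source, sink, table, watermark_col):
    """Copy only rows changed since the last run, then advance the stored watermark."""
    # Read the old watermark from the sink (epoch start on the very first run).
    row = sink.execute(
        "SELECT last_value FROM watermarks WHERE table_name = ?", (table,)
    ).fetchone()
    old_wm = row[0] if row else "1970-01-01 00:00:00"
    # The new watermark is the source's current maximum, captured before copying.
    new_wm = source.execute(f"SELECT MAX({watermark_col}) FROM {table}").fetchone()[0]
    # Copy the delta between the two watermarks.
    rows = source.execute(
        f"SELECT * FROM {table} WHERE {watermark_col} > ? AND {watermark_col} <= ?",
        (old_wm, new_wm),
    ).fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        sink.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    # INSERT OR REPLACE is SQLite syntax; on SQL Server you would use MERGE.
    sink.execute(
        "INSERT OR REPLACE INTO watermarks (table_name, last_value) VALUES (?, ?)",
        (table, new_wm),
    )
    sink.commit()
    return len(rows)
```

Running the copy against the read-only geo-replica keeps the load off the master, since the delta query only reads; the watermark bookkeeping lives entirely on the on-premise side.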

Can fuzzy lookup and fuzzy grouping operations be performed in Azure Data Factory?

I am new to Data Factory. For a while I worked on Azure SQL Database. Until now, all data transformation operations (including data movement, processing, modification of data, fuzzy grouping, and fuzzy lookup) have been performed manually on my system through SSIS. Now we want to automate all the packages, and for that we want to schedule them on Azure. I know that Azure SQL has no support for SSIS, and someone suggested Data Factory. Let me know if Data Factory can meet all the requirements mentioned above.
Thanks in advance...
Data Factory is not a traditional ETL tool but a tool to orchestrate, schedule, and monitor data pipelines that compose existing storage, movement, and processing services. When you transform data with ADF, the actual transformation is done by another service (a Hive/Pig script running on HDInsight, Azure Batch, U-SQL running on Azure Data Lake Analytics, a SQL Server stored procedure, etc.), while ADF manages and orchestrates the scheduling and cloud resources. ADF doesn't have traditional out-of-the-box ETL transforms (like fuzzy lookup). You can write your own scripts or custom .NET code for your business logic, or run stored procedures. You can compose all of this into recurring scheduled data pipelines and monitor everything in one place.
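Since ADF has no built-in fuzzy transform, one option is to run your own matching logic in a service ADF orchestrates (Azure Batch, a custom activity, etc.). A minimal sketch using only Python's standard library, with made-up reference data, not a replica of SSIS's algorithm:

```python
from difflib import SequenceMatcher


def fuzzy_lookup(value, candidates, threshold=0.8):
    """Return the candidate most similar to value, or None if nothing clears the threshold."""
    best, best_score = None, 0.0
    for c in candidates:
        score = SequenceMatcher(None, value.lower(), c.lower()).ratio()
        if score > best_score:
            best, best_score = c, score
    return best if best_score >= threshold else None


def fuzzy_group(values, threshold=0.8):
    """Group near-duplicate strings: each value joins the first group whose
    representative it matches, otherwise it starts a new group."""
    groups = []
    for v in values:
        rep = fuzzy_lookup(v, [g[0] for g in groups], threshold)
        if rep is None:
            groups.append([v])
        else:
            next(g for g in groups if g[0] == rep).append(v)
    return groups
```

For production-scale data you would likely reach for a dedicated library or service instead, but a script like this can run inside whatever compute ADF schedules for you.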