Can you suggest an approach to load data from one Snowflake (SF) database into another SF database within the same cluster?
I have to:
Do data transformation and incremental loads while loading into the destination SF table
Schedule the load like an ETL job
Thanks,
Nikhil
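One common approach here (a sketch only, using hypothetical database, table, and warehouse names) is to capture changes with a stream on the source table and schedule a MERGE into the destination table with a task, which covers both the incremental load and the ETL-style scheduling:

```python
# Sketch: incremental load between two Snowflake databases using a stream + scheduled task.
# All object names (SRC_DB, DST_DB, MY_WH, etc.) are hypothetical placeholders.
import snowflake.connector

ddl_statements = [
    # The stream tracks changes on the source table since the last time it was consumed.
    """CREATE OR REPLACE STREAM SRC_DB.PUBLIC.ORDERS_STREAM
       ON TABLE SRC_DB.PUBLIC.ORDERS""",
    # The task runs the transformation + MERGE on a schedule (here: hourly), like an ETL job.
    # New rows, and the new values of updated rows, appear in the stream as INSERT actions.
    """CREATE OR REPLACE TASK DST_DB.PUBLIC.LOAD_ORDERS_TASK
         WAREHOUSE = MY_WH
         SCHEDULE  = 'USING CRON 0 * * * * UTC'
       WHEN SYSTEM$STREAM_HAS_DATA('SRC_DB.PUBLIC.ORDERS_STREAM')
       AS
       MERGE INTO DST_DB.PUBLIC.ORDERS_FACT t
       USING (
           SELECT ORDER_ID, UPPER(STATUS) AS STATUS, AMOUNT   -- example transformation
           FROM SRC_DB.PUBLIC.ORDERS_STREAM
           WHERE METADATA$ACTION = 'INSERT'
       ) s
       ON t.ORDER_ID = s.ORDER_ID
       WHEN MATCHED THEN UPDATE SET t.STATUS = s.STATUS, t.AMOUNT = s.AMOUNT
       WHEN NOT MATCHED THEN INSERT (ORDER_ID, STATUS, AMOUNT)
                             VALUES (s.ORDER_ID, s.STATUS, s.AMOUNT)""",
    # Tasks are created suspended; resume to start the schedule.
    "ALTER TASK DST_DB.PUBLIC.LOAD_ORDERS_TASK RESUME",
]

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                   password="...", role="SYSADMIN")
try:
    cur = conn.cursor()
    for stmt in ddl_statements:
        cur.execute(stmt)
finally:
    conn.close()
```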
Related
I'm implementing an SSIS package where I want to incrementally load data from SQL Server to D365. From source to staging, we load data into tables at a 15-minute frequency. The incremental data is based on DATELASTMAINT (the last-maintenance date). We have created a few views on top of these tables and load data from those views into D365 entities.
But in this workflow, we only want to incrementally load data into D365, as the INSERTs and UPDATEs are taking a long time. We are using KingswaySoft for the D365 data connection.
I tried a couple of scenarios to incrementally get the data, but couldn't succeed. What is the best way to incrementally fetch data from views (which are based on multiple tables) and push that data into D365?
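One common pattern for the incremental fetch, regardless of the connector used (sketched below with hypothetical control-table, view, and entity names), is to persist a watermark of the last loaded DATELASTMAINT and only pull rows above it from the view:

```python
# Sketch: watermark-driven incremental fetch from a SQL Server view.
# The control table, view, and column names here are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;DATABASE=Staging;Trusted_Connection=yes"
)
cur = conn.cursor()

# 1. Read the high-water mark recorded by the previous run.
cur.execute("SELECT LastLoadedDate FROM etl.LoadWatermark WHERE EntityName = ?", "Customer")
last_loaded = cur.fetchone()[0]

# 2. Fetch only rows maintained since then from the view.
cur.execute(
    "SELECT * FROM dbo.vw_CustomerForD365 WHERE DATELASTMAINT > ? ORDER BY DATELASTMAINT",
    last_loaded,
)
rows = cur.fetchall()

# 3. Push 'rows' to D365 (in SSIS this would be the KingswaySoft destination),
#    then advance the watermark only after the load succeeds, so failed runs are retried.
new_mark = max(r.DATELASTMAINT for r in rows) if rows else last_loaded
cur.execute(
    "UPDATE etl.LoadWatermark SET LastLoadedDate = ? WHERE EntityName = ?",
    new_mark, "Customer",
)
conn.commit()
```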
Our team is trying to create an ETL into Redshift to be our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned our database into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into one Redshift DB.
Looking at AWS Glue, it doesn't seem possible to achieve this. Since Glue opens up the job script to be edited by developers, I was wondering if anyone else has had experience looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid creating a job for each database, unless we can programmatically loop through and create the jobs.
We've also taken a look at DMS, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would handle the multiple partitioned data sources either.
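On the idea of programmatically looping through and creating a job per database: the Glue API does let you script job creation, so something along these lines (a sketch; the role, bucket, and connection names are hypothetical) could stamp out one parameterised job per source database:

```python
# Sketch: create one AWS Glue job per source database via boto3.
# Role name, script location, and connection names are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

source_databases = [f"sales_shard_{i:02d}" for i in range(1, 41)]  # the 40+ partitions

for db_name in source_databases:
    glue.create_job(
        Name=f"load-{db_name}-to-redshift",
        Role="GlueServiceRole",                      # hypothetical IAM role
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-etl-bucket/scripts/sqlserver_to_redshift.py",
        },
        Connections={"Connections": [f"jdbc-{db_name}", "redshift-warehouse"]},
        # The shared script reads these arguments, so one script serves every database.
        DefaultArguments={
            "--source_database": db_name,
            "--target_schema": db_name,              # keep shards apart in Redshift
        },
    )
```

Alternatively, a single job could be reused by passing the database name as a run-time argument to start_job_run instead of creating 40+ separate jobs.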
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool - it will Extract data from your (numerous) SQL Server databases and Load it, via an efficient Redshift COPY, into staging tables (which can be stored inside Redshift in the usual way, or held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down across those servers - horizontally or vertically), you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
Pushing transformations down to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and user queries can run concurrently.
Also, you may have other sources of data you want to mash up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
You can use AWS DMS for this.
Steps:
Set up and configure a DMS replication instance.
Set up the target endpoint for Redshift.
Set up a source endpoint for each SQL Server instance; see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
Set up a task for each SQL Server source. You can specify the tables to copy/synchronise, and you can use a transformation to specify which schema name(s) on Redshift you want to write to (see the sketch below).
You will then have all of the data in identical schemas on Redshift.
If you want to query all of those together, you can either run some transformation code inside Redshift to combine them into new tables, or you may be able to use views.
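The table mappings for that schema transformation are plain JSON, and the task itself can be created through the API as well; a sketch, with hypothetical ARNs and schema names:

```python
# Sketch: create a DMS task per SQL Server source, renaming the schema on the Redshift side.
# All ARNs and schema names are hypothetical placeholders.
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

table_mappings = {
    "rules": [
        {   # copy every table in the dbo schema
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-dbo",
            "object-locator": {"schema-name": "dbo", "table-name": "%"},
            "rule-action": "include",
        },
        {   # write into a per-source schema on Redshift instead of dbo
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "rename-schema",
            "rule-target": "schema",
            "object-locator": {"schema-name": "dbo"},
            "rule-action": "rename",
            "value": "source_shard_01",
        },
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="shard-01-to-redshift",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE01",
    TargetEndpointArn="arn:aws:dms:...:endpoint:REDSHIFT",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # or "full-load" for a one-off copy
    TableMappings=json.dumps(table_mappings),
)
```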
I need help moving a database from one Redshift cluster to another Redshift cluster. Here, I'm not copying a table; I want to copy a whole database. Could someone help me with this?
Using S3 as temporary storage: if both clusters are in the same region, UNLOAD from cluster 1 to S3, then COPY from S3 into cluster 2.
Using a cluster snapshot: create a snapshot of the source cluster, then restore the snapshot as the destination cluster.
If I’ve made a bad assumption please comment and I’ll refocus my answer.
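For the S3 route, it boils down to two statements per table; a rough sketch with hypothetical bucket, IAM role, table, and cluster names:

```python
# Sketch: move one table between Redshift clusters through S3.
# Bucket, IAM role, table, and cluster names are hypothetical placeholders.
import psycopg2

unload_sql = """
    UNLOAD ('SELECT * FROM reporting.orders')
    TO 's3://my-transfer-bucket/reporting/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
    ESCAPE ALLOWOVERWRITE;
"""

copy_sql = """
    COPY reporting.orders
    FROM 's3://my-transfer-bucket/reporting/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
    ESCAPE;
"""

# Run the UNLOAD on the source cluster...
with psycopg2.connect(host="cluster1.xxxx.redshift.amazonaws.com", port=5439,
                      dbname="sourcedb", user="admin", password="...") as src:
    src.cursor().execute(unload_sql)

# ...and the COPY on the destination cluster (the table DDL must already exist there).
with psycopg2.connect(host="cluster2.xxxx.redshift.amazonaws.com", port=5439,
                      dbname="destdb", user="admin", password="...") as dst:
    dst.cursor().execute(copy_sql)
```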
You can take a snapshot of the current database and restore it once you create the new database.
You have two options:
Option 1: Take a snapshot, create a new cluster by restoring the snapshot, and drop the databases you do not require. The restore can be cross-region if required.
Option 2: UNLOAD the table data from the current database to S3 and COPY it into the new database after you create it. Here you need to manually create all the dependencies, such as users, groups, and access grants.
Architectural/perf question here.
I have an on-premises SQL Server database with ~200 tables totalling ~10 TB.
I need to make this data available in Azure in Parquet format for Data Science analysis via HDInsight Spark.
What is the optimal way to copy/convert this data to Azure (Blob storage or Data Lake) in Parquet format?
Due to the manageability aspect of the task (~200 tables), my best shot was to extract the data locally to a file share via sqlcmd, compress it as csv.bz2, and use Data Factory to copy the file share (with 'PreserveHierarchy') to Azure, then finally run pyspark to load the data and save it as .parquet.
Given the table schemas, I can auto-generate the SQL data-extract and Python scripts from the SQL database via T-SQL.
Are there faster and/or more manageable ways to accomplish this?
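For the final pyspark step of the approach described above, a minimal sketch (assuming a hypothetical storage account, container layout, and CSVs with headers) would be:

```python
# Sketch: convert the copied csv.bz2 extracts to Parquet with pyspark on HDInsight.
# The storage account, container, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

tables = ["dbo.Customer", "dbo.Orders"]          # generated from the table schemas

for table in tables:
    src = f"wasbs://extracts@mystorageacct.blob.core.windows.net/{table}/*.csv.bz2"
    dst = f"wasbs://parquet@mystorageacct.blob.core.windows.net/{table}/"

    (spark.read
          .option("header", "true")
          .option("inferSchema", "true")   # or supply an explicit schema per table
          .csv(src)                        # Spark decompresses .bz2 transparently
          .write
          .mode("overwrite")
          .parquet(dst))
```

An alternative that skips the CSV hop entirely is reading straight from SQL Server with spark.read.format("jdbc"), but that depends on network connectivity from the HDInsight cluster back to the on-premises server.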
ADF matches your requirement perfectly, with both one-time and schedule-based data movement.
Try the Copy Wizard of ADF. With it, you can move on-prem SQL Server data directly to Blob/ADLS in Parquet format with just a couple of clicks.
Copy Activity Overview
We have an Oracle DB that cannot take any additional insert/update load. Is it possible to extract such commands from the .arc files and apply them to another, non-Oracle DB so that I can run reports off the new DB? Once that is done, I can move the load of all queries and reports off the main DB!
I understand that these very .arc files are what is used for replicating to another Oracle DB, and that is what I want to do, except that the target DB is not Oracle.
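If reading the archive logs directly is the route, Oracle's own interface to them is LogMiner (not mentioned above, so treat this only as one possible direction); a rough sketch with a hypothetical archive-log path, assuming cx_Oracle and EXECUTE on DBMS_LOGMNR:

```python
# Sketch: read INSERT/UPDATE statements out of an archive log with LogMiner.
# The archive-log path and connection details are hypothetical placeholders;
# replaying SQL_REDO against a non-Oracle target still needs per-dialect translation.
import cx_Oracle

conn = cx_Oracle.connect("system", "...", "dbhost:1521/ORCLPDB")
cur = conn.cursor()

# Register the archive log and start LogMiner using the online catalog as the dictionary.
cur.execute("""
    BEGIN
        DBMS_LOGMNR.ADD_LOGFILE(
            LOGFILENAME => '/u01/arch/1_1234_987654321.arc',
            OPTIONS     => DBMS_LOGMNR.NEW);
        DBMS_LOGMNR.START_LOGMNR(
            OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
    END;
""")

# SQL_REDO holds the reconstructed DML; filter to the tables you report on.
cur.execute("""
    SELECT seg_owner, table_name, operation, sql_redo
    FROM   v$logmnr_contents
    WHERE  operation IN ('INSERT', 'UPDATE')
""")
for owner, table, op, redo_sql in cur:
    # Translate/replay redo_sql against the reporting database here.
    print(op, owner, table, redo_sql)

cur.execute("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")
```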