I have to do a proof of concept (POC) with Snowflake. I am new to Snowflake and looking for advice.
Use cases:
Load data for 7 tables (5 dimension tables and 2 fact tables) from Microsoft Dynamics AX (on-premise) to Snowflake
Two of the tables are large, with more than 150 million records
Once the data is loaded into Snowflake, create a star schema model around the 7 tables
Read data from Snowflake using SSRS, Power BI, or Excel.
Need to gauge:
Time taken to load the data from source to Snowflake (time, resources utilized, etc.)
Read performance once the data is in Snowflake
Row-level security: when an area manager browses a Power BI report, they should see only their own data, not other area managers' data
Can somebody please explain the steps involved to achieve the above? It would be great if you could provide some supporting links and scripts.
Do I need to do the following:
Load data from the AX tables to files (I think there is a limitation on file size: https://docs.snowflake.net/manuals/user-guide-getting-started.html)
Upload these files to either Amazon S3 or Azure Blob Storage, and from there load each file into Snowflake
How about this one from Snowflake:
How to Craft Your Data Warehouse POC
You must register to get this eBook from 2019...
Also, I would highly recommend doing both in your "Do I need to do?" section.
There are 4 data loading options available with Snowflake:
Small datasets:
1) Snowflake Web UI: for loading limited amounts of data (small datasets)
Bulk load:
2) SnowSQL (CLI client): use the SnowSQL command-line interface to bulk-load data from files in cloud storage into Snowflake. SnowSQL is the next-generation command-line client for connecting to Snowflake to execute SQL queries and perform all DDL and DML operations, including loading data into and unloading data out of database tables. You have to install and configure SnowSQL on the client machine (see the sketch after this list).
3) Snowpipe: Snowflake's continuous data ingestion service. Snowpipe loads data within minutes after files are added to a stage and submitted for ingestion.
4) 3rd-party ETL tools: e.g. Matillion (SaaS), SSIS (IaaS and on-premise), Talend (SaaS), etc. Create your own data integration packages to load data into Snowflake.
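A minimal sketch of what options 2) and 3) might look like in Snowflake SQL. The database, stage, table, and file-format names below are all hypothetical, and the files are assumed to be gzipped CSV extracts:

```sql
-- Hypothetical named file format and internal stage
CREATE OR REPLACE FILE FORMAT poc_db.public.csv_gz
  TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1 COMPRESSION = GZIP;

CREATE OR REPLACE STAGE poc_db.public.ax_stage
  FILE_FORMAT = (FORMAT_NAME = 'poc_db.public.csv_gz');

-- Option 2, run from SnowSQL on the client machine:
-- upload the extract files, then COPY them into the target table
PUT file://C:\extracts\dim_customer_*.csv.gz @poc_db.public.ax_stage;

COPY INTO poc_db.public.dim_customer
  FROM @poc_db.public.ax_stage
  PATTERN = '.*dim_customer.*';

-- Option 3: the same COPY wrapped in a pipe, so files added to the stage
-- can be submitted for continuous ingestion (via the Snowpipe REST API,
-- or AUTO_INGEST when using an external stage with event notifications)
CREATE OR REPLACE PIPE poc_db.public.dim_customer_pipe AS
  COPY INTO poc_db.public.dim_customer
  FROM @poc_db.public.ax_stage;
```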
Steps:
1. Load data from the source AX system to Snowflake
i) As we are a Microsoft shop, create an SSIS package to load data from AX to CSV files (the maximum size of each file should be 100 MB) and place the files in Azure Blob Storage or AWS S3
ii) Use SnowSQL to load the data from the files in Azure Blob Storage into Snowflake (see the sketch after these steps)
OR
iii) Use a 3rd-party ETL tool such as SSIS to load data directly from the source into Snowflake without any transformation; once the data has been dumped into Snowflake, you can do the transformations there.
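For step ii), a minimal sketch of loading those CSV files from Azure Blob Storage through an external stage; the storage account, container, SAS token, and object names below are placeholders:

```sql
-- Hypothetical external stage over the Blob container holding the SSIS-produced CSV files
CREATE OR REPLACE STAGE poc_db.public.ax_blob_stage
  URL = 'azure://myaccount.blob.core.windows.net/ax-extracts'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);

-- Load one of the 7 tables; repeat with a different PATTERN for each table
COPY INTO poc_db.public.fact_sales
  FROM @poc_db.public.ax_blob_stage
  PATTERN = '.*fact_sales.*\.csv'
  ON_ERROR = 'ABORT_STATEMENT';
```

COPY INTO reports rows loaded and errors per file, which is also useful for capturing the load-time metrics the POC needs.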
Related
Can you suggest an approach to load data from one Snowflake (SF) database into another SF database within the same cluster?
I have to:
Do data transformation and an incremental load while loading into the destination SF table
Schedule the load like an ETL job
Thanks,
Nikhil
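Not a definitive answer, but one common sketch for this: since both databases are in the same account, a scheduled Snowflake TASK can run a MERGE from the source database into the destination, which covers the transformation, incremental load, and scheduling requirements. The object names, warehouse, and the updated_at watermark column below are all assumptions:

```sql
-- Hypothetical scheduled task: incremental MERGE from src_db into dest_db every hour
CREATE OR REPLACE TASK dest_db.public.load_orders_task
  WAREHOUSE = etl_wh
  SCHEDULE = '60 MINUTE'
AS
MERGE INTO dest_db.public.orders AS tgt
USING (
    -- transformation plus incremental filter on the source table
    SELECT id, UPPER(region) AS region, amount, updated_at
    FROM src_db.public.orders
    WHERE updated_at > DATEADD(hour, -1, CURRENT_TIMESTAMP())
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET tgt.region = src.region, tgt.amount = src.amount, tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (id, region, amount, updated_at)
  VALUES (src.id, src.region, src.amount, src.updated_at);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK dest_db.public.load_orders_task RESUME;
```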
I am creating a system which pulls data from S3 buckets and Snowflake tables (I also have access to this SF portal). I will be running data quality/data validation checks against this incoming data inside a Databricks notebook. My question is: when I pull this data in, I'll have to stage it somehow to run those DQ checks. Does it make more sense to stage this data inside the Databricks portal or the Snowflake portal?
Thanks
What I've researched: databricks + snowflake stage and architecture
In general, it's normally a good idea to hold data as close to where it is being processed as possible. If Databricks is going to be directly processing the data, then hold the data in Databricks; if Databricks is going to push down processing to Snowflake, then hold the data in Snowflake.
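If the decision is to push the DQ checks down to Snowflake, a minimal sketch of staging the S3 files there might look like the following (the storage integration, stage, and table names are hypothetical):

```sql
-- Hypothetical external stage over the incoming S3 bucket
CREATE OR REPLACE STAGE dq_db.raw.s3_incoming
  URL = 's3://my-incoming-bucket/landing/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Land the files in a transient staging table to run the DQ checks against
CREATE OR REPLACE TRANSIENT TABLE dq_db.raw.orders_stage AS
SELECT $1::NUMBER  AS order_id,
       $2::VARCHAR AS customer,
       $3::DATE    AS order_date
FROM @dq_db.raw.s3_incoming;

-- Example DQ check: duplicate business keys
SELECT order_id, COUNT(*) AS cnt
FROM dq_db.raw.orders_stage
GROUP BY order_id
HAVING COUNT(*) > 1;
```

If Databricks is doing the processing directly, the equivalent would be to land the incoming files as tables in Databricks instead and run the same checks there.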
I'm moving an application from an Access database to a SQL Server database.
The current Access database contains 5 'linked' Excel files (reports that come from SAP), which are refreshed daily by overwriting each file with the new SAP report. In this way, through a set of transformations/queries, the data ends up in the appropriate table in the form we want to store it.
Is a setup similar to this possible using SSIS? I've watched tutorials about uploading Excel files into a table, but essentially I need to do this:
SAP Export > Save (overwrite) to Network File > MS Access link exists and new data is transformed through many 'stored procedures (action queries)' > Data moved to appropriate table.
I'd appreciate any YouTube/Google links or reading about doing this in SSIS. Regards!
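One common way to reproduce that flow in SSIS: an Execute SQL Task truncates a staging table, a Data Flow Task loads each SAP Excel report into that staging table, and a final Execute SQL Task calls a stored procedure that takes over what the Access action queries did. A minimal sketch of such a procedure, with hypothetical table and column names:

```sql
-- Hypothetical replacement for one of the Access action queries:
-- transform the freshly loaded staging rows into the destination table
CREATE OR ALTER PROCEDURE dbo.usp_LoadSapReport
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.SalesOrders (OrderNo, Material, Qty, LoadedAt)
    SELECT s.OrderNo,
           UPPER(LTRIM(RTRIM(s.Material))),
           TRY_CONVERT(DECIMAL(18, 2), s.Qty),
           SYSDATETIME()
    FROM dbo.stg_SapReport AS s
    WHERE s.OrderNo IS NOT NULL;
END;
```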
Our team is trying to create an ETL into Redshift to be our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned our database into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into one Redshift DB.
Looking at AWS Glue, it doesn't seem possible to achieve this. Since the job script is open to being edited by developers, I was wondering if anyone else has had experience with looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid having to create a job for each database, unless we can programmatically loop through and create multiple jobs for each database.
We've taken a look at DMS as well, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would work for the multiple partitioned data source issue either.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool: it will Extract data from your (numerous) SQL Server databases and Load it, via an efficient Redshift COPY, into staging tables (which can be stored inside Redshift in the usual way, or held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontal or vertical) you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
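For illustration, the underlying pattern (whether generated by a tool or hand-written) is a Redshift COPY into a staging table followed by a SQL transformation; the bucket, IAM role, and table names here are hypothetical:

```sql
-- Bulk load a staging table from S3 (one per source extract)
COPY staging.orders
FROM 's3://my-etl-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
IGNOREHEADER 1;

-- Transformation job: join/clean into a star-schema fact table for reporting
INSERT INTO reporting.fact_orders (order_id, customer_key, order_date, amount)
SELECT o.order_id, c.customer_key, o.order_date, o.amount
FROM staging.orders AS o
JOIN reporting.dim_customer AS c ON c.customer_id = o.customer_id;
```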
You can use AWS DMS for this.
Steps:
set up and configure a DMS instance
set up a target endpoint for Redshift
set up a source endpoint for each SQL Server instance; see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
set up a task for each SQL Server source; you can specify the tables to copy/synchronise, and you can use a transformation to specify which schema name(s) on Redshift you want to write to
You will then have all of the data in identical schemas on Redshift.
If you want to query all of those together, you can either run some transformation code inside Redshift to combine them into new tables, or you may be able to use views (see the sketch below).
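A sketch of the view option, with hypothetical per-source schema names matching what DMS writes to Redshift:

```sql
-- Combine the identical per-source tables into one queryable view
CREATE VIEW reporting.all_customers AS
SELECT 'source01' AS source_db, * FROM source01.customers
UNION ALL
SELECT 'source02' AS source_db, * FROM source02.customers
UNION ALL
SELECT 'source03' AS source_db, * FROM source03.customers;
-- ...one UNION ALL branch per source schema (40+ in this case)
```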
Architectural/perf question here.
I have an on-premise SQL Server database which has ~200 tables, ~10 TB in total.
I need to make this data available in Azure in Parquet format for Data Science analysis via HDInsight Spark.
What is the optimal way to copy/convert this data to Azure (Blob storage or Data Lake) in Parquet format?
Due to the manageability aspect of the task (there are ~200 tables), my best shot was: extract the data locally to a file share via sqlcmd, compress it as csv.bz2, and use Data Factory to copy the file share (with 'PreserveHierarchy') to Azure. Finally, run PySpark to load the data and then save it as .parquet.
Given the table schemas, I can auto-generate the SQL data extract and Python scripts from the SQL database via T-SQL (a sketch of that generator is below).
Are there faster and/or more manageable ways to accomplish this?
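A sketch of that kind of generator, using bcp for the per-table export instead of sqlcmd; the server name, file share path, and delimiter are assumptions:

```sql
-- Generate one bcp export command per user table; run the output as a batch script
SELECT 'bcp "' + DB_NAME() + '.' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + '"'
       + ' out "\\fileshare\extract\' + s.name + '.' + t.name + '.csv"'
       + ' -S MYSQLSERVER -T -c -t ","' AS bcp_command
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
ORDER BY s.name, t.name;
```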
ADF matches your requirement well, with both one-time and schedule-based data movement.
Try the Copy Wizard in ADF. With it, you can move on-prem SQL Server data directly to Blob/ADLS in Parquet format with just a couple of clicks.
Copy Activity Overview