Is there an option to move data directly to Snowflake without going through S3 and other cloud storage

We chose Snowflake as our DWH and we would like to connect different data sources such as Salesforce, HubSpot, and Zendesk.
Is there a way to extract data from these sources and store it in a staging schema in Snowflake without first having to store the data in cloud storage like S3 and then read it into Snowflake?
Many thanks in advance.

You can use any of the connectors Snowflake provides (ODBC, JDBC, Python, etc.) and any tool that can use one of those connectors. However, they won't perform well compared to the COPY INTO approach, which is optimised for bulk loading.
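For example, a minimal sketch with the Python connector that lands API records straight into a staging table; the connection details and table names are hypothetical, and batched INSERTs like this are fine for modest volumes but won't match COPY INTO for large loads:

```python
# Minimal sketch: push records pulled from a source API straight into a Snowflake
# staging table via snowflake-connector-python. All names/credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="my_user",
    password="...",
    warehouse="LOAD_WH",
    database="STAGING",
    schema="RAW",
)
rows = [("0011", "Acme"), ("0012", "Globex")]  # e.g. accounts fetched from the Salesforce API
cur = conn.cursor()
cur.executemany(
    "INSERT INTO salesforce_accounts (id, name) VALUES (%s, %s)",  # hypothetical table
    rows,
)
conn.close()
```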
There are ETL tools, such as Matillion, that use the stage/COPY INTO approach but do it in the background, so it appears that you are loading directly into Snowflake.

Related

Is there any way to load data directly from Snowflake to Oracle?

I am trying to load data from Snowflake into Oracle directly.
I tried features such as S3 integration, but it doesn't support loading data from .csv/.txt/.xls files into the target table. Let me know if there is a way to load the data directly without any third-party tools.
The source is Snowflake and the target is Oracle RDS.
Let me know if there is any possible way to achieve this connection.
Thanks

How is data loaded or synced in Snowflake

We are considering using Snowflake. I tried looking through the documentation and Google, but without luck. How does Snowflake query/store data? As an example, if I have a CSV file, a database, a data lake ... does it query the sources in real time, or does it replicate the data into Snowflake? If it replicates, how often does the data update?
Maybe an introduction to the Snowflake architecture will help you here: https://docs.snowflake.com/en/user-guide/intro-key-concepts.html
Let's split your question into two parts:
How does Snowflake store data? Basically, Snowflake stores data in its own proprietary file format. The files are called micro-partitions; they are in a hybrid columnar format and are stored, for example, in S3 if you are using Snowflake on AWS.
How does Snowflake query data? For this, Snowflake leverages compute instances called Virtual Warehouses, which correspond to compute instances of your cloud provider underneath. With them, the files are accessed and queried.
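As a rough illustration of that separation (all names below are hypothetical), a client session picks a virtual warehouse for compute and then queries tables whose micro-partitions live in the cloud storage layer:

```python
# Sketch only: compute (a virtual warehouse) is chosen per session/query, while the
# table data itself sits as micro-partitions in the storage layer.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()
cur.execute("USE WAREHOUSE ANALYTICS_WH")                 # hypothetical warehouse (the compute)
cur.execute("SELECT COUNT(*) FROM my_db.public.orders")   # scans micro-partitions in storage
print(cur.fetchone())
conn.close()
```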

Historical data migration from Teradata to Snowflake

What are the steps to be taken to migrate historical data load from Teradata to Snowflake?
Imagine there is 200TB+ of historical data combined from all tables.
I am thinking of two approaches, but I don't have enough expertise and experience in executing them, so I'm looking for someone to fill in the gaps and offer some suggestions.
Approach 1 - Using TPT/FEXP scripts
I know that TPT/FEXP scripts can be written to generate files for a table. How can I create a single script that can generate files for all the tables in the database? (Creating 500-odd scripts, one per table, is impractical.)
Once this script is ready, how is it executed in practice? Do we create a shell script and schedule it through an enterprise scheduler like Autosys/Tidal?
Once these files are generated, how do you split them on a Linux machine if each file is huge (the recommended size for loading into Snowflake is 100-250 MB per file)?
How to move these files to Azure Data Lake?
Use COPY INTO / Snowpipe to load into Snowflake Tables.
Approach 2
Using ADF copy activity to extract data from Teradata and create files in ADLS.
Use COPY INTO / Snowpipe to load into Snowflake tables.
Which of these two is the better approach?
In general, what challenges would we face with each of these approaches?
Using ADF will be a much better solution, and it also allows you to design the data lake as part of your solution.
You can design a generic solution that imports all the tables provided in a configuration. For this you can choose the recommended file format (Parquet), the target file size, and the degree of parallel loading.
The main challenge you will probably encounter is the poorly working ADF connector for Snowflake; here you will find my recommendations on how to work around the connector problem and how to use Data Lake Gen2:
Trouble loading data into Snowflake using Azure Data Factory
More recommendations on how to structure Azure Data Lake Storage Gen2 can be found here: Best practices for using Azure Data Lake Storage Gen2
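For the final load step in either approach, COPY INTO reads the files that ADF or the TPT/FEXP scripts landed in the data lake. A hedged sketch via the Python connector, with stage, table, and path names made up for illustration:

```python
# Sketch of the bulk load from files already landed in ADLS Gen2. The external stage
# is assumed to exist, e.g. created once with:
#   CREATE STAGE adls_stage
#     URL = 'azure://myaccount.blob.core.windows.net/teradata-extract'
#     CREDENTIALS = (AZURE_SAS_TOKEN = '...');
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                   warehouse="LOAD_WH", database="EDW", schema="STAGING")
conn.cursor().execute("""
    COPY INTO customer
    FROM @adls_stage/customer/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
conn.close()
```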

Which database to choose in order to store data coming from flat files (CSV, HTML)

I need to design a scalable database architecture to store all the data coming from flat files - CSV, HTML, etc. These files come from Elasticsearch; most of the scripts are written in Python. This architecture should automate most of the daily manual processing currently done in Excel, CSV, and HTML files, and all data will be retrieved from this database instead of being kept in those files.
Database requirements:
The database must perform well for day-to-day data retrieval, and it will be queried by multiple teams.
An ER model and schema will be developed for the data with logical relationships.
The database can be hosted in the cloud.
The database must be highly available and able to retrieve data quickly.
This database will be utilized to create multiple dashboards.
The ETL jobs will be responsible for storing data in the database.
There will be many reads from the database and multiple writes each day, with lots of data coming from Elasticsearch and some cloud tools.
I am considering RDS, Azure SQL, DynamoDB, Postgres, or Google Cloud. I would like to know which database engine would be the better solution given these requirements. I also want to know how the ETL process should be designed - Lambda or Kappa architecture.
To store relational data such as CSV and Excel files, you can use a relational database. For flat files like HTML, which don't need to be queried, you can simply use a storage account from any cloud service provider, for example Azure.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine that handles most database management functions such as upgrading, patching, backups, and monitoring without user involvement. Azure SQL Database always runs on the latest stable version of the SQL Server database engine and a patched OS, with 99.99% availability. You can restore the database to any point in time. This should be the best choice for storing relational data and running SQL queries.
Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Your HTML files can be stored here.
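For instance, landing an HTML file in a Blob Storage container takes only a few lines with the azure-storage-blob package (the connection string, container, and file names below are placeholders):

```python
# Sketch: upload a raw HTML file to an Azure Blob Storage container.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("raw-html")          # hypothetical container
with open("daily_report.html", "rb") as f:                     # hypothetical file
    container.upload_blob(name="2023/daily_report.html", data=f, overwrite=True)
```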
The ETL jobs can be performed with Azure Data Factory (ADF). It allows you to connect almost any data source (including sources outside Azure), transform the stored dataset, and store it in the desired destination. The Data Flow transformation in ADF can handle all the ETL-related tasks.

Data pipeline - dumping large files from API responses into AWS, with the final destination being an on-premises SQL Server

I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud, then pull only what we need (summaries of this raw data) and store that in our on-premises SQL Server for reporting and analytics. We want to do this in the easiest, most logical, and most robust way. We have chosen AWS as our cloud provider, but since we're in the beginning phases we are not attached to any particular architecture/services. Because I'm no expert with the cloud or AWS, I thought I'd post my thoughts on how we can accomplish our goal and see if anyone has any advice for us. Does this architecture for our data pipeline make sense? Are there any alternative services or data flows we should look into? Thanks in advance.
1) Gather data from multiple sources (using APIs)
2) Dump responses from APIs into S3 buckets
3) Use Glue Crawlers to create a Data Catalog of data in S3 buckets
4) Use Athena to query summaries of the data in S3
5) Store data summaries obtained from Athena queries in on-premises SQL Server
Note: We will program the entire data pipeline in Python (which seems like a good call and should be easy no matter which AWS services we use, as boto3 is pretty awesome from what I've seen so far).
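For illustration, steps 1 and 2 might look roughly like this with boto3 (the API endpoint and bucket name are placeholders):

```python
# Sketch: fetch a response from a source API and dump the raw JSON into S3.
import datetime
import json

import boto3
import requests

s3 = boto3.client("s3")
resp = requests.get("https://api.example.com/v1/orders", timeout=30)  # hypothetical API
key = f"raw/orders/{datetime.date.today():%Y/%m/%d}/orders.json"
s3.put_object(Bucket="my-raw-data-bucket", Key=key,
              Body=json.dumps(resp.json()).encode("utf-8"))
```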
You can use Glue jobs (PySpark) for #4 and #5, and you can automate the flow using Glue triggers.
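A hedged sketch of such a Glue (PySpark) job covering steps 4 and 5 - summarising the cataloged S3 data and pushing the result to the on-premises SQL Server over JDBC; the catalog, table, and connection details are all hypothetical:

```python
# Sketch of a Glue job: summarise raw API dumps from the Data Catalog and push the
# result to an on-premises SQL Server (assumed reachable over VPN/Direct Connect).
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())

# Step 4: read the crawled raw data and build the summary.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="api_raw", table_name="responses").toDF()      # hypothetical catalog names
summary = raw.groupBy("source", "event_date").agg(F.count("*").alias("events"))

# Step 5: write the summary to SQL Server via JDBC.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(summary, glue_context, "summary"),
    connection_type="sqlserver",
    connection_options={
        "url": "jdbc:sqlserver://onprem-host:1433;databaseName=reporting",  # hypothetical
        "dbtable": "dbo.api_summary",
        "user": "etl_user",
        "password": "...",
    },
)
```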
