How to move data from S3 to Snowflake

I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS/S3 into Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot; what suggestion do you have for this problem?
Does the schema in AWS need to match the schema in Snowflake for the copying process?
Thanks,
Mali

Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables (see the sketch after this list)
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
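As a rough illustration of the first two options, here is a minimal sketch assuming a hypothetical bucket path, stage (my_s3_stage), and target table (customers); the CSV file format and credentials are placeholders you would replace with your own (a storage integration is generally preferable to inline keys):

```sql
-- Minimal sketch (placeholder names: my_s3_stage, customers; adjust the URL,
-- credentials, and file format to your environment).
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')  -- or a STORAGE INTEGRATION
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Option 1: bulk load with COPY INTO
COPY INTO customers
  FROM @my_s3_stage
  PATTERN = '.*customers.*[.]csv'
  ON_ERROR = 'ABORT_STATEMENT';

-- Option 2: continuous loading with Snowpipe (auto-ingest via S3 event notifications)
CREATE OR REPLACE PIPE customers_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO customers FROM @my_s3_stage;
```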
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well documented in the Snowflake documentation:
https://docs.snowflake.com/

Regarding hiding your PII elements, you can use two different roles: one would be, say, data_owner (the role that will create the table and load the data into it) and another, say, data_modelling (for use by DataRobot).
Create masking policies as the data_owner role so that the DataRobot role cannot see the column data.
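For example, a minimal sketch of such a policy, assuming hypothetical names (a DATA_OWNER role, a customers table with an email column); adapt the masking expression and column types as needed:

```sql
-- Only DATA_OWNER sees real values; any other role (e.g. the one DataRobot
-- uses, such as a hypothetical DATA_MODELLING role) sees a masked value.
CREATE OR REPLACE MASKING POLICY mask_pii AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'DATA_OWNER' THEN val
    ELSE '***MASKED***'   -- or NULL to hide the value entirely
  END;

-- Attach the policy to the PII column
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_pii;
```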
About your question on copying the data: there is no requirement that the AWS S3 folder structure be in sync with Snowflake. You can create the external stage with any name and point it to any S3 folder.
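To illustrate, a small sketch with placeholder names (any_name_stage, orders): the stage name is arbitrary, and a COPY transformation can reorder or skip file columns, so the file layout does not have to match the table exactly:

```sql
-- Placeholder stage pointing at an arbitrary S3 prefix; add credentials
-- or a storage integration as appropriate.
CREATE OR REPLACE STAGE any_name_stage
  URL = 's3://some-bucket/some/other/prefix/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Select and reorder file columns ($1, $3, $2) while loading into the table.
COPY INTO orders (order_id, order_date, amount)
  FROM (SELECT $1, $3, $2 FROM @any_name_stage)
  PATTERN = '.*[.]csv';
```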

The Snowflake documentation has a good example that helps you get some hands-on experience:
https://docs.snowflake.com/en/user-guide/data-load-s3.html

Related

Looping Through Tables in a DB in Informatica

I am looking for a way in Informatica to pull data from a table in a database, load it into Snowflake, and then move on to the next table in that same DB, repeating that for the remaining tables in the database.
We currently have this set up and running in Matillion, where an orchestration grabs all of the table names in a database and then loops through each of those tables to send the data into Snowflake.
My team and I have tried to ask Informatica Global Support, but they have not been very helpful for us to figure out how to accomplish this. They have suggested things like Dynamic Mapping, which I do not think will work for our particular case since we are in essence trying to get data from one database to a Snowflake database and do not need to do any other transformations.
Please let me know if any additional clarification is needed.
A Dynamic Mapping Task is your answer. You create one mapping, with or without any transformations, as you need. Then you set up a Dynamic Mapping Task to execute the mapping across your whole set of 60+ different sources and targets.
Please note that this is available as part of Cloud Data Integration module of IICS. It's not available in PowerCenter.

Snowflake Sample data set

I am unable to create objects (views, file formats, stages, etc.) in a shared sample database (SNOWFLAKE_SAMPLE_DATA).
Kindly let me know, what is the possible way to access the data?
Regards,
DB
The SNOWFLAKE_SAMPLE_DATA database contains a schema for each data set, with the sample data stored in the tables in each schema. You can execute queries on the tables in this database just as you would on any other database in your account.
The database and schemas do not utilize any data storage, so they do not incur storage charges for your account.
However, just as with other databases, executing queries requires a running, current warehouse for your session, which consumes credits.
You can refer to the Snowflake documentation: Docs » Using Snowflake » Sample Datasets.
Hope this helps answer your question.
Shared databases are read-only. Users in a consumer account can view and query data, but cannot insert or update data, or create any objects in the database. This is why you cannot create any objects in the shared database (SNOWFLAKE_SAMPLE_DATA).
https://docs.snowflake.com/en/user-guide/data-share-consumers.html#general-limitations-for-shared-databases
You can query the data in shared database like any other database.
https://docs.snowflake.com/en/user-guide/data-share-consumers.html#querying-a-shared-database
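For example, assuming your account has the usual TPCH_SF1 schema in the share and a warehouse you can use (my_wh and my_db below are placeholders), you can query the shared data directly and create any derived objects in your own database instead:

```sql
USE WAREHOUSE my_wh;  -- a running warehouse is required (consumes credits)

SELECT c_custkey, c_name, c_mktsegment
FROM snowflake_sample_data.tpch_sf1.customer
LIMIT 10;

-- Objects cannot be created inside the read-only share; create them in your
-- own database instead, referencing the shared tables.
CREATE OR REPLACE VIEW my_db.public.sample_customers AS
SELECT * FROM snowflake_sample_data.tpch_sf1.customer;
```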

What is the process to transfer staging table data to fact tables in Snowflake with custom validations?

Good day.
I need help. I want to transfer data in Snowflake from staging tables to fact tables automatically, whenever data is available in a staging table. While moving data from the staging tables to the fact tables, I have a couple of custom validations on each column and row.
Any idea how to do this in Snowflake?
If anyone knows, could you please suggest an approach?
Thanks in advance!
There are many ways to do this, and how you go about it depends on what tools you have available. The simplest approach without using tools outside of the Snowflake ecosystem would be:
Set up a stream on each of your staging tables (here is the Snowflake documentation on streams)
Create a task that runs on a schedule (here is the Snowflake doc on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake one. Here is some more documentation on building SCD type 2 dimensions, also written by someone at Snowflake.
Assuming "staging tables" refers to a Snowflake table and not a file in a Snowflake stage, I would recommend using a Stream and Task for this. A stream will identify the delta of data that needs to be loaded, and a Task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure that is executed in the Task to run your validations and Merge the outcome of those into your Fact.

SSIS copy multiple tables

I have more than 200 MSSQL tables and want to transfer the data to Azure Data Lake Storage.
One approach I am considering is to use SSIS with dynamic data flows, i.e., create a table-name variable, do a foreach loop over the table names, and run the data flow for each table. However, this approach seems wrong: although files are created in Data Lake Storage with the correct schemas, the data is not transferred due to wrong mappings.
Is there any generic way to create one dynamic data flow and transfer data for a huge number of tables?
The scenario you are describing can now be achieved in ADF V2, which added a set of rich control flow enhancements including the Lookup activity, parameter passing, and ForEach looping. You can see a tutorial of how to accomplish this here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy

What steps are performed by a 'Rescan'?

To automatically warehouse documents from Cloudant to dashDB, there is a schema discovery process (SDP) that automates the data migration for you. When using the SDP to warehouse documents from Cloudant to dashDB, there is an option 'Rescan'.
I have used 'Rescan' a number of times, but am unclear on the steps it actually performs. What steps are performed by a 'Rescan'? E.g.
Drop tables in the dashDB target schema? Which tables?
Scan Cloudant source database?
Recreate the target schema?
...
...
The steps are pretty much as you suggested. Rescan will:
Inspect the previously discovered JSON schema and remove all tables from the dashDB instance created for that load (leaving any user defined tables untouched)
Re-discover the JSON schema again using the current settings (including sample size, type of discovery algorithm etc.)
Create the new tables into the same dashDB target
Populate the newly created tables with data from Cloudant
Subscribe to the _changes feed from Cloudant to continuously synchronize document changes with dashDB
All steps (except for the first) are identical for the initial load as well as the rescan function.
The main motivation for a rescan is to support schema evolution. Whenever the document structure in a Cloudant source database changes, a user can make a conscious decision to drop and re-create the dashDB tables using this rescan function. SDP won't automate that process to avoid potential conflicts with applications depending on the existing dashDB tables.
