What is the most efficient way to load data into a Snowflake database? - snowflake-cloud-data-platform

What is the most efficient way to load data into a Snowflake database: using external tables, or loading files directly from S3? If files, which format is suggested, Parquet or Avro?

Of course it depends, but this Snowflake post summarizes it well, I think:
Conclusion
Loading data into Snowflake is fast and flexible. You get the greatest speed when working with CSV files, but Snowflake's expressiveness in handling semi-structured data allows even complex partitioning schemes for existing ORC and Parquet data sets to be easily ingested into fully structured Snowflake tables.

Related

How to move data from S3 to Snowflake

I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS S3 into Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot. What do you suggest for this problem?
Does the schema in AWS need to match the schema in Snowflake for the copying process?
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables (a minimal sketch follows after this list)
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to put MATERIALIZED VIEWS on top of the external tables so that queries perform better.
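For the first option, a minimal COPY INTO sketch might look like the following; the stage name, bucket path, file format, and table name are hypothetical placeholders, and the stage name does not need to match the S3 folder it points to:

    -- Hypothetical names: my_s3_stage, my_table; substitute your own credentials and paths.
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-bucket/path/to/files/'
      CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- Load everything currently in the stage into the target table
    -- (the file format defined on the stage is used).
    COPY INTO my_table
      FROM @my_s3_stage;

Snowpipe (the second option) wraps essentially the same COPY INTO statement in a CREATE PIPE definition with AUTO_INGEST = TRUE, so that new files are picked up as they land.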
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use two different roles: say, data_owner (the role that will create the table and load the data into it) and data_modelling (the role used by DataRobot).
Create masking policies as the data_owner role so that DataRobot cannot see the column data; a minimal sketch follows.
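A minimal sketch of such a policy, assuming hypothetical names (a DATA_OWNER role, and a customers table with an email column to protect); adjust the roles and columns to your setup:

    -- Only the data_owner role sees the real value; every other role
    -- (including the one DataRobot uses) gets a masked value.
    CREATE OR REPLACE MASKING POLICY mask_pii AS (val STRING) RETURNS STRING ->
      CASE
        WHEN CURRENT_ROLE() = 'DATA_OWNER' THEN val
        ELSE '***MASKED***'
      END;

    ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_pii;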
About your question on copying the data: there is no requirement that the AWS S3 folder structure be in sync with Snowflake. You can create the external stage with any name and point it to any S3 folder.
The Snowflake documentation has a good example that helps you get some hands-on practice:
https://docs.snowflake.com/en/user-guide/data-load-s3.html

Is it possible to replicate data transforms into snowflake, in addition to the usual data loading?

For data replication we obviously look to replicate data from raw sources and land it in something like Snowflake. There are also data transformations that must be applied to that data, and these transformations can be handled within Snowflake. Is there a way to "replicate" the actual transformations? Maybe it's replication of SQL statements, or perhaps it's something else (some metadata representation of the transformations?).

What is the process to transfer staging table data to fact tables in Snowflake with custom validations?

Good day.
I need help. I want to transfer data in Snowflake from staging tables to fact tables automatically, whenever data is available in the staging table. While moving data from the staging tables to the fact tables, I have a couple of custom validations on each column and row.
Any idea how to do this in Snowflake?
If anyone knows, could you please make a suggestion?
Thanks in advance!
There are many ways to do this, and how you go about it depends on what tools you have available. The simplest way to do this without using tools outside of the Snowflake ecosystem would be:
Set up a stream on each of your staging tables (see the Snowflake documentation on streams).
Create a task that runs on a schedule (see the Snowflake documentation on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake one. There is also documentation on building SCD Type 2 dimensions, written by someone at Snowflake.
Assuming "staging tables" refers to Snowflake tables and not files in a Snowflake stage, I would recommend using a stream and a task for this. A stream will identify the delta of data that needs to be loaded, and a task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure, executed by the task, that runs your validations and merges the outcome into your fact table; a rough sketch of this pattern follows.

What is the fastest way to extract 1 terabyte of data from tables in SQL Server to Parquet files, without Hadoop?

I need to extract two tables from a SQL Server database to Apache Parquet files (I don't use Hadoop, only Parquet files). The options I know of are:
Load the data into a Pandas DataFrame and save it to a Parquet file. But this method doesn't stream the data from SQL Server to Parquet, and I only have 6 GB of RAM.
Use TurboODBC to query SQL Server, convert the data to Apache Arrow on the fly, and then convert it to Parquet. Same problem as above: TurboODBC doesn't currently stream.
Does a tool or library exist that can easily and "quickly" extract the 1 TB of data from tables in SQL Server to Parquet files?
The missing functionality you are looking for is retrieval of the result in batches with Apache Arrow in turbodbc instead of the whole table at once: https://github.com/blue-yonder/turbodbc/issues/133. In the meantime, you can either help with the implementation of that feature or use fetchnumpybatches to retrieve the result in a chunked fashion.
In general, I would recommend that you export the data not as one big Parquet file but as many smaller ones; this will make working with them much easier. Nearly every engine or tool that can consume Parquet can handle multiple files as one big dataset. You can then also split your query into multiple queries that write out the Parquet files in parallel. If you limit the export to chunks that are smaller than your total main memory, you should also be able to use fetchallarrow to write to Parquet in one go.
I think the odbc2parquet command line utility might be what you are looking for.
It uses ODBC bulk queries to retrieve data from SQL Server quickly (like turbodbc).
It keeps only one batch in memory at a time, so you can write Parquet files that are larger than your system memory.
It allows you to split the result into multiple files if desired.
Full disclosure, I am the author, so I might be biased towards the tool.

Is it possible to feed a database table from an XML file?

We have some (stable) data that is saved in a generic database (a database that contains a database structure and its data). To be used, this data must be re-written. Currently, we have an application that exports this data to XML files in a very specific location.
We need to add this data to some databases. I know it's possible to load XML into tables, but we'd like a direct link between the XML files and the database tables (reducing data duplication and the risk of people updating the generated tables instead of using the proper methods).
Is that possible?
Would it be very slow?
You can use SSIS to import XML files into database tables. This works well if the XML files conform to a schema.
https://www.mssqltips.com/sqlservertip/3141/importing-xml-documents-using-sql-server-integration-services/
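If SSIS is not available and the load is a one-off, a T-SQL sketch along these lines can also pull an XML file straight into a table; the file path, element names, and target table below are hypothetical, so adjust them to your XML schema:

    -- Read the file into an XML variable, then shred it into rows.
    DECLARE @xml XML;

    SELECT @xml = CAST(BulkColumn AS XML)
    FROM OPENROWSET(BULK 'C:\exports\products.xml', SINGLE_BLOB) AS src;

    INSERT INTO dbo.Products (ProductId, Name)
    SELECT
        x.value('(Id)[1]',   'INT'),
        x.value('(Name)[1]', 'NVARCHAR(100)')
    FROM @xml.nodes('/Products/Product') AS t(x);

For a recurring feed with a direct link between the files and the tables, a scheduled SSIS package (or an agent job that wraps something like the above) is the more maintainable option.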
