I perform data unloads from Snowflake to S3, or locally by using SnowSQL.
I'd like to know if there is any kind of data tracing (for data governance) that always records, or tags and saves somewhere in Snowflake, the fact that data was unloaded.
Thanks
COPY INTO <location>, used for data unloading, will leave a trace in ACCESS_HISTORY:
This Account Usage view can be used to query the access history of Snowflake objects (e.g. table, view, column) within the last 365 days (1 year).
This view supports write operations of the following type:
Data unloading statements:
COPY INTO internalStage FROM TABLE
COPY INTO externalStage FROM TABLE
COPY INTO externalLocation FROM TABLE
The unload statement will also appear in QUERY_HISTORY.
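For example, a query along these lines (a sketch against the SNOWFLAKE.ACCOUNT_USAGE views; the QUERY_TYPE filter is how unload statements are usually classified, so adjust it if your account shows them differently) can list who unloaded what:

-- Sketch: list recent unload statements and the tables they read.
SELECT
    qh.query_id,
    qh.user_name,
    qh.start_time,
    qh.query_text,
    ah.base_objects_accessed          -- which table(s) were read for the unload
FROM snowflake.account_usage.query_history qh
LEFT JOIN snowflake.account_usage.access_history ah
       ON ah.query_id = qh.query_id
WHERE qh.query_type = 'UNLOAD'        -- COPY INTO <location> statements
  AND qh.start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
ORDER BY qh.start_time DESC;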
From a data metering perspective, there is DATA_TRANSFER_HISTORY:
This Account Usage view can be used to query the history of data transferred from Snowflake tables into a different cloud storage provider’s network (i.e. from Snowflake on AWS, Google Cloud Platform, or Microsoft Azure into the other cloud provider’s network) and/or geographical region within the last 365 days (1 year).
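A sketch of a metering query against that view (column names as documented for ACCOUNT_USAGE; adjust the time window as needed):

-- Sketch: bytes transferred out of Snowflake in the last 30 days,
-- grouped by target cloud/region.
SELECT
    target_cloud,
    target_region,
    transfer_type,
    SUM(bytes_transferred) AS total_bytes
FROM snowflake.account_usage.data_transfer_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3
ORDER BY total_bytes DESC;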
Related
I have a client for whom I built a portal that manages documents used by their company. The documents are stored in SQL Server. This all works great.
Now, a few years later, there are 130,000 documents, most of which are no longer needed. The database is up to 200 GB, and Azure SQL Database gets costly above 250 GB.
However, they don't want to just delete the old documents, as on occasion, they are needed. So what are my choices? They are creating about 50,000 documents per year.
Just let the database grow larger and pay the price?
Somehow save them to a disk somewhere? Seems like 130,000 documents in storage is going to be a task to manage in itself.
Save the current database somewhere offline? But accessing the documents off the database would be difficult.
Rewrite the app to NOT store the files in SQL Server, and instead save/retrieve from a storage location.
Any ideas welcome.
Export a backup of the database to Azure Blob Storage and set it to the Archive access tier, which costs less and is easy to import back into a database when required. Delete the old records from the database afterwards.
Click on the Export option for your SQL database in Azure SQL Server.
Select the Storage account in Azure where you want to store the backup file.
Once the backup is available in the storage account, change the access tier to Archive by following this tutorial - Archive an existing blob.
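After the archived backup has been verified, the cleanup itself is an ordinary delete. A minimal sketch, assuming a hypothetical dbo.Documents table with a CreatedDate column (batched so the transaction log stays small):

-- Hypothetical cleanup: dbo.Documents and CreatedDate are placeholders for your schema.
-- Delete in batches so the transaction log and locks stay manageable.
WHILE 1 = 1
BEGIN
    DELETE TOP (5000)
    FROM dbo.Documents
    WHERE CreatedDate < DATEADD(YEAR, -3, SYSUTCDATETIME());

    IF @@ROWCOUNT = 0 BREAK;
END;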
I have created an external table in Snowflake using an S3 stage. I want to know what the data retention period is for this table. I have referred to the Snowflake documentation but wasn't able to find an answer.
Time Travel for external tables is not available, since the data resides in your own storage account/bucket and is not maintained by Snowflake.
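For context, a minimal sketch of such an external table (the stage, table, and column names are placeholders) shows that it only references files in your own bucket:

-- Sketch: the external table only points at files in your own S3 location
-- via a stage; nothing is copied into Snowflake-managed storage.
CREATE OR REPLACE EXTERNAL TABLE ext_orders (
    order_id   VARCHAR AS (value:c1::VARCHAR),
    order_date DATE    AS (value:c2::DATE)
)
WITH LOCATION = @my_s3_stage/orders/
FILE_FORMAT = (TYPE = CSV);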
I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS/S3 into Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot. What suggestions do you have for this problem?
Does the schema in AWS need to match the schema in Snowflake for the copying process?
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables (see the sketch after this list)
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
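A minimal sketch of the first option, assuming an external stage over the S3 bucket already exists (the stage, table, and file-format settings are placeholders):

-- Sketch: load CSV files from an existing S3 stage into a target table.
-- @my_s3_stage and my_table are placeholders for your own objects.
COPY INTO my_table
FROM @my_s3_stage/path/to/files/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';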
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use two different roles: one would be, say, data_owner (the role that will create the table and load the data into it) and another, say, data_modelling (for use by DataRobot).
Create masking policies as the data owner so that the DataRobot role cannot see the column data.
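A minimal sketch of such a policy, assuming those two roles and a hypothetical customers.email column:

-- Sketch: only the DATA_OWNER role sees the real value; any other role
-- (e.g. the DATA_MODELLING role used by DataRobot) sees NULL.
-- Role, table, and column names are placeholders.
CREATE OR REPLACE MASKING POLICY pii_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() = 'DATA_OWNER' THEN val
        ELSE NULL
    END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY pii_mask;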
About your question on copying the data: there is no requirement that the AWS S3 folder be in sync with Snowflake. You can create the external stage with any name and point it to any S3 folder.
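For example (a sketch only; the stage name, bucket URL, and storage integration are placeholders, and you could use credentials instead of an integration):

-- Sketch: the stage name and S3 path are arbitrary; s3_int is an assumed,
-- pre-created storage integration.
CREATE OR REPLACE STAGE my_s3_stage
    URL = 's3://my-bucket/any/folder/'
    STORAGE_INTEGRATION = s3_int
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);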
The Snowflake documentation has a good example that helps with getting some hands-on practice:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
We know that SHOW CREATE TABLE in Hive gives the storage path. I am checking how to find the storage path for a Snowflake table. I don't see SHOW CREATE TABLE or DESC TABLE giving a storage path for a table.
One of the main advantages of the Snowflake Data Platform is automatic storage handling:
Key Concepts & Architecture
Database Storage
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. Snowflake stores this optimized data in cloud storage.
Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.
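So there is no storage path to retrieve; the closest SQL-level view of physical storage is size metadata, for example (a sketch against the ACCOUNT_USAGE view, with a placeholder table name):

-- Sketch: no storage path exists to query, but per-table storage sizes are
-- exposed through ACCOUNT_USAGE. 'MY_TABLE' is a placeholder.
SELECT table_catalog, table_schema, table_name, active_bytes
FROM snowflake.account_usage.table_storage_metrics
WHERE table_name = 'MY_TABLE';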
I am currently working on a project which involves the following:
The application I am working on is connected to a SQL Server database.
SAP loads information into multiple tables (on both a daily and an hourly basis) in a MASTER database.
There are 5 other databases (hosted on the same server) that access this information via synonyms and stored procedure calls to the MASTER database.
The MASTER database is purely used for storing the data and routing it to the other databases.
Master Database -
Tables:
MASTER_TABLE1 <------- SAP inserts data into this table. Triggers are used to process the valid data and insert it into secondary staging tables, say MASTER_TABLE1_SEC
MASTER_TABLE1_SEC -- Holds processed data coming into MASTER_TABLE1
FIVE other databases (one for each manufacturing facility) are present on the same server. My application is connected to the facility databases (not the Master):
FACILITY1
FACILITY2
....
FACILITY5
Synonyms of MASTER_TABLE1_SEC are created in each of these 5 facility databases.
Stored procedures are then called from the facility databases in order to load data from MASTER_TABLE1_SEC into the respective tables (within EACH facility) based on the business logic.
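To make the described pattern concrete, here is a minimal sketch of what runs inside one facility database (all object names are placeholders for illustration):

-- Sketch of the current pattern: a synonym points at the master staging table...
CREATE SYNONYM dbo.MASTER_TABLE1_SEC FOR MASTER.dbo.MASTER_TABLE1_SEC;
GO
-- ...and a facility-side stored procedure pulls new rows from it on a schedule.
CREATE PROCEDURE dbo.usp_LoadFromMaster
AS
BEGIN
    INSERT INTO dbo.FACILITY_TABLE1 (Col1, Col2)
    SELECT s.Col1, s.Col2
    FROM dbo.MASTER_TABLE1_SEC AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.FACILITY_TABLE1 AS f WHERE f.Col1 = s.Col1);
END
GO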
Is there a better architecture to handle this kind of project? I am a beginner when it comes to advanced data management. Can anyone suggest a better architecture or tools to handle this?
There are a lot of patterns that would meet the needs described here. It sounds like you are working with a type of data warehouse. I use Data Vault for my Enterprise Data Warehouses. It is an Ensemble Modeling technique designed for integration and master data preparation. You can think of it as a way to house all data from all time. You would then generate Data Marts (Kimball Method) for each of the facilities, containing only their data or whatever is required for their needs.