Snowflake sharing to Data Consumer - snowflake-cloud-data-platform

In Snowflake data sharing, can Data Providers share data only with Data Consumers, and does Snowflake charge additional fees to Data Providers for each share they create?
Also, is it possible for a Data Consumer to extend the data shared with them to other Data Consumers?

Does Snowflake charge additional fees to Data Providers for each share they create?
No, there is no additional charge. When you share data, no actual data is copied, so no additional storage is required and therefore no additional storage cost. Even the process of creating the share and granting privileges on databases and other supported database objects (schemas, UDFs, tables, and views) to a share does not incur any direct cost, since these are metadata operations handled by the Cloud Services layer. Cloud services usage is billed (in Snowflake credits) only if the daily consumption of cloud services exceeds 10% of the daily usage of the compute resources.
If data is shared outside the region where the Snowflake account is hosted, there will be additional storage costs for replication.
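To make this concrete, here is a minimal sketch of creating a share and adding a consumer account from Python, assuming the snowflake-connector-python package; the database, schema, table, share and account names are placeholders. Every statement here is a metadata-only operation, so none of them copies data or adds storage cost.

```python
# Minimal sketch: creating a share and granting privileges from Python.
# Database/schema/table names, the share name, and the consumer account
# identifier are placeholders -- adjust to your environment.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<provider_account>",      # placeholder credentials
    user="<user>",
    password="<password>",
    role="ACCOUNTADMIN",               # creating shares requires appropriate privileges
)

statements = [
    "CREATE SHARE IF NOT EXISTS sales_share",
    "GRANT USAGE ON DATABASE sales_db TO SHARE sales_share",
    "GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share",
    "GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share",
    # Add the consumer account; no data is copied at any point.
    "ALTER SHARE sales_share ADD ACCOUNTS = consumer_account",
]

cur = conn.cursor()
try:
    for stmt in statements:
        cur.execute(stmt)
finally:
    cur.close()
    conn.close()
```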
Is it possible for a Data Consumer to extend the data shared with them to other Data Consumers?
It is not possible to share a consumed share.


Regarding the burden on Snowflake's database storage layer

Snowflake has an architecture consisting of the following three layers:
- Database storage
- Query processing
- Cloud services
I understand that in the query processing layer it is possible to create a warehouse for each workload, and to scale up and scale out per workload.
However, when the created warehouses run in parallel, I am worried about the load on the database storage.
Even though query processing can be load-balanced across warehouses, since there is only one database storage layer, wouldn't a lot of parallel processing hit that storage and cause errors in the database storage layer?
Sorry if I have misunderstood the architecture.
The storage is immutable, so the query read load is just IO against the cloud provider's object storage, which is for all practical purposes infinitely scalable.
When any node updates a table, the new set of file partitions becomes known, and any warehouse that does not yet hold the new partition parts does remote IO to read them.
The only downside to this pattern is that it does not scale well for transactional write patterns, which is why Snowflake is not targeted at those markets.
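As a rough illustration of why the shared storage layer is not a read bottleneck, here is a sketch that runs the same query from two separate warehouses at once, assuming the snowflake-connector-python package; the credentials, warehouse names, database and table are placeholders.

```python
# Minimal sketch: two warehouses (WH_A, WH_B are placeholders) reading the same
# table in parallel. Both scan the same immutable micro-partitions in object
# storage independently; neither blocks the other at the storage layer.
import threading
import snowflake.connector

def run_query(warehouse: str) -> None:
    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse=warehouse, database="SALES_DB", schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM orders")   # placeholder table
        print(warehouse, cur.fetchone()[0])
    finally:
        conn.close()

threads = [threading.Thread(target=run_query, args=(wh,)) for wh in ("WH_A", "WH_B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```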

Snowflake Virtual Warehouse replication options

Going through the documentation I did find cross-region replication for the storage layer, but nothing about the compute layer of Snowflake. I did not see any mention of availability options for virtual warehouses. If a whole AWS region goes down, the database will still be available for serving queries, but what about the virtual warehouse? Do I need to create a new one while the region is still down, or is there a way to have a "back-up" virtual warehouse in a different AWS region?
A virtual warehouse is essentially a set of compute servers (for example, AWS EC2 instances if hosted on AWS). Virtual warehouses are not persistent: when you suspend a warehouse, its servers are returned to the AWS/Azure/GCP pool, and when you resume it, servers are allocated from the pool.
When a cloud region goes down, virtual warehouses will be allocated and created from the AWS/Azure/GCP pool in the backup region.
The documentation here states clearly what can, and what can't, be replicated:
Snowflake Replication
For example, it states:
Currently, replication is supported for databases only. Other types of objects in an account cannot be replicated. This list includes:
Users
Roles
Warehouses
Resource monitors
Shares
When you set up an environment (into which you are going to replicate a database from another account) you also need to set up the roles, users, warehouses, etc.
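For illustration, here is a minimal sketch of what the database-side setup looks like, assuming the snowflake-connector-python package; the organization, account, database and warehouse names are placeholders. Note that the warehouse in the target account has to be created explicitly, since warehouses are not replicated.

```python
# Minimal sketch of database replication setup across two accounts; all
# identifiers are placeholders, and roles/users in the target account still
# have to be created separately.
import snowflake.connector

# 1. In the source account: allow the database to be replicated to the target account.
src = snowflake.connector.connect(account="<source_account>", user="<user>",
                                  password="<password>", role="ACCOUNTADMIN")
src.cursor().execute(
    "ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.target_account"
)
src.close()

# 2. In the target account: create the secondary database and refresh it.
tgt = snowflake.connector.connect(account="<target_account>", user="<user>",
                                  password="<password>", role="ACCOUNTADMIN")
cur = tgt.cursor()
cur.execute("CREATE DATABASE sales_db AS REPLICA OF myorg.source_account.sales_db")
cur.execute("ALTER DATABASE sales_db REFRESH")
# Warehouses are not replicated, so a warehouse for running queries in this
# region must be created here as well.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS reporting_wh WITH WAREHOUSE_SIZE = 'XSMALL'")
tgt.close()
```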
The 'The Snowflake Elastic Data Warehouse' (2016) paper, in section 4.2.1 'Fault Resilience', reports:
Virtual Warehouses (VWs) are not distributed across AZs. This choice is for performance reasons. High network throughput is critical for distributed query execution, and network throughput is significantly higher within the same AZ. If one of the worker nodes fails during query execution, the query fails but is transparently re-executed, either with the node immediately replaced, or with a temporarily reduced number of nodes. To accelerate node replacement, Snowflake maintains a small pool of standby nodes. (These nodes are also used for fast VW provisioning.)
If an entire AZ becomes unavailable though, all queries running on a given VW of that AZ will fail, and the user needs to actively re-provision the VW in a different AZ.
With full-AZ failures being truly catastrophic and exceedingly rare events, we today accept this one scenario of partial system unavailability, but hope to address it in the future.

Partition-level access on data stored in Snowflake

I am new to Snowflake and was exploring Snowflake on AWS. When data is stored in Snowflake, I understood that we can create and manage data in partitions similar to what we do in Hive. Hive doesn't allow me to have partition-level user access management. Can I do that with Snowflake? If yes, how do we do it, and how is it managed on the storage layer on AWS?
With Snowflake, you have no direct access to the underlying storage; you can only use the access mechanisms that Snowflake provides. Snowflake manages the provisioning, management and layout of your data on the underlying storage entirely transparently, so you can't "create and manage data in partitions similar to what we do in Hive".
If you want to understand more about how this storage works, you can read about micro-partitioning here.
In the vast majority of cases there is no need to interfere with how Snowflake lays out your data, but there is functionality available to force how the data is clustered, though Snowflake suggests that this is only ever useful on multi-terabyte tables. You can read about clustering tables here.
Snowflake also has the concept of "external tables": tables that appear in Snowflake databases as normal tables but whose data is actually held in S3 (or Azure Blob or GCP storage) that you own and manage rather than Snowflake. These tables can be convenient to create and use, but they perform significantly worse than tables held directly in Snowflake. When data is loaded into Snowflake it might still ultimately be stored on S3, but it is compressed, converted into columnar format and held in micro-partitions, so it is very different in structure from the files you can see in your own S3 buckets.
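To make the distinction concrete, here is a minimal sketch of the two features mentioned above (a clustering key and an external table), assuming the snowflake-connector-python package; the table, column and stage names are placeholders, and the external stage is assumed to already point at an S3 bucket you manage.

```python
# Minimal sketch: clustering a table and defining an external table.
import snowflake.connector

conn = snowflake.connector.connect(account="<account>", user="<user>",
                                   password="<password>", role="SYSADMIN",
                                   database="SALES_DB", schema="PUBLIC")
cur = conn.cursor()

# Clustering does not give partition-level access control; it only influences how
# Snowflake co-locates rows in micro-partitions (mainly useful on very large tables).
cur.execute("ALTER TABLE orders CLUSTER BY (order_date)")

# An external table keeps the files in your own S3 bucket (via a placeholder
# stage my_s3_stage) while still being queryable from Snowflake.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_ext
    WITH LOCATION = @my_s3_stage/orders/
    FILE_FORMAT = (TYPE = PARQUET)
""")
conn.close()
```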

If I want to use Amazon DynamoDB as the database for my mobile app, which EC2 instance should I opt for?

I want to use DynamoDB as the database for my mobile application. Which EC2 instance will perform well if my mobile application has the following:
100k daily active users
1 million daily active users
10 million active users
I am new to the AWS ecosystem and I am unable to figure out which instance to choose.
DynamoDB is a serverless service, which means that you do not need to provision any EC2 instances to host the database. It is all managed by AWS.
DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database so that you don't have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.
The only thing that you have to consider is setting up its read/write capacity. However, if you are not sure about this either, you can use on-demand capacity mode:
Amazon DynamoDB on-demand is a flexible billing option capable of serving thousands of requests per second without capacity planning. DynamoDB on-demand offers pay-per-request pricing for read and write requests so that you pay only for what you use.
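As a minimal sketch, this is roughly what creating an on-demand table looks like with boto3; the table name, key attribute and region are placeholders.

```python
# Minimal sketch, assuming boto3 and AWS credentials are configured.
# With on-demand (PAY_PER_REQUEST) billing there is no EC2 instance to choose
# and no read/write capacity to provision up front.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="app_users",                       # placeholder table name
    AttributeDefinitions=[{"AttributeName": "user_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "user_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",               # on-demand capacity mode
)
```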

Best storage in Azure for very large data sets with fast update options

We have around 300 million documents on a drive and we need to delete around 200 million of them. I am going to write the 200 million paths to some storage so I can keep track of deleted documents. My current thought is that an Azure SQL database is probably not well suited for this volume. Cosmos DB is too expensive. Storing CSV files is bad, because I would need to do updates every time I delete a file. Table storage seems to be a pretty good match, but it does not offer group-by operations that could come in handy when doing status reports. I don't know much about Data Lake, i.e. whether it supports fast updates or is more like an archive. All input is welcome for choosing the right storage for this kind of reporting.
Thanks in advance.
According to your needs, you can use Azure Cosmos DB or Azure Table Storage.
Azure Table Storage offers a NoSQL key-value store for semi-structured data. Unlike a traditional relational database, each entity (analogous to a row in relational database terminology) can have a different structure, allowing your application to evolve without downtime to migrate between schemas.
Azure Cosmos DB is a multi-model database service designed for global use in mission-critical systems. It exposes a Table API as well as a SQL API and APIs compatible with Apache Cassandra, MongoDB and Gremlin, which allow you to easily swap out existing databases with a Cosmos DB implementation.
Their differences are as follows:
Performance
Azure Table Storage has no upper bound on latency. Cosmos DB guarantees single-digit-millisecond latency for reads and writes, with operations at sub-15 milliseconds at the 99th percentile worldwide. Throughput on Table Storage is limited to 20,000 operations per second. On Cosmos DB, there is no upper limit on throughput, and more than 10 million operations per second are supported.
Global Distribution
Azure Table Storage supports a single region with an optional read-only secondary region for availability. Cosmos DB supports distribution across 1 to more than 30 regions with automatic failovers worldwide.
Billing
Azure Table Storage uses storage volume to determine billing. Pricing is tiered to get progressively cheaper per GB the more storage you use. Operations incur a charge measured per 10,000 transactions.
Cosmos DB has two billing models: provisioned throughput and consumed storage.
Provisioned throughput: provisioned throughput (also called reserved throughput) guarantees high performance at any scale. You specify the throughput (RU/s) that you need, and Azure Cosmos DB dedicates the resources required to guarantee the configured throughput. You are billed hourly for the maximum provisioned throughput for a given hour.
Consumed storage: you are billed a flat rate for the total amount of storage (GBs) consumed for data and indexes for a given hour.
For more details, please refer to the documentation.
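For illustration, here is a minimal sketch of how tracking deleted document paths could look with Azure Table Storage and the azure-data-tables package; the connection string, table name and partitioning scheme are assumptions for the sketch, not a prescription from the comparison above.

```python
# Minimal sketch: recording each deleted document path as one Table Storage
# entity that can be cheaply upserted as files are deleted.
import hashlib
from datetime import datetime, timezone
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage_connection_string>")
table = service.create_table_if_not_exists(table_name="DeletedDocuments")

def record_deletion(path: str) -> None:
    # Hash the path for the RowKey, since table keys cannot contain '/' characters.
    top_folder = path.strip("/").split("/")[0] or "root"   # assumed partitioning scheme
    table.upsert_entity({
        "PartitionKey": top_folder,
        "RowKey": hashlib.sha1(path.encode()).hexdigest(),
        "Path": path,
        "DeletedUtc": datetime.now(timezone.utc).isoformat(),
        "Status": "deleted",
    })

record_deletion("/archive/2017/invoice-000123.pdf")
```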
