Crowdsourcing - Snowflake differentiators from Azure's DW offering

I'm curious as to the differentiating factors between Snowflake and Azure SQL data warehouse (recently rebranded as Synapse).
What are the factors that Snowflake uses to emphasize its superiority over Azure's DW offering?
Why did or would you choose Snowflake over Azure?

What I've found so far:
SF = Snowflake
Sy = Azure Synapse
Only one setting per DB for compute power on Sy, while SF allows multiple virtual warehouses. Also, the scaling capabilities (up and out) of SF virtual warehouses allow for greater control and tuning vs the single concept of data warehouse units (DWU) in Sy.
Sy requires manual maintenance of statistics and doesn't seem to match SF's columnar storage advantages.
I'm not finding any functionality in Sy to compete with SF's semi-structured data functionality.
Sy is attempting to include a Hadoop-flavored architecture and functionality, but all of that is still listed as "Preview" at this point.
Sy seems about 3 years behind SF.

The big differences are in scalability. There are limits to the size and compute capacity available in Azure Synapse Analytics: it has a limit of 128 concurrent queries, and for rowstore tables it has a size limit of 60 TB compressed per table (there is no limit for columnstore tables). There's also a limit to the amount of compute resources you can apply to a particular database. The limitations are listed here:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits
For Snowflake there are no practical limits on database or compute size. You can allocate warehouses (compute) or clusters of warehouses to accommodate any workload.
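As a rough illustration of that flexibility, here is a minimal Python sketch using the snowflake-connector-python package: it creates separate virtual warehouses per workload and resizes one of them on demand. The account, credentials, warehouse names, and sizes are placeholders, and the multi-cluster settings assume a Snowflake edition that supports them, so verify the exact options against Snowflake's documentation.

    # Sketch: independent, per-workload virtual warehouses in Snowflake.
    # Placeholder credentials; the executing role needs CREATE WAREHOUSE rights.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",    # placeholder
        user="my_user",          # placeholder
        password="my_password",  # placeholder
    )
    cur = conn.cursor()

    # One warehouse sized for heavy ELT, another for BI dashboards;
    # each one is scaled and suspended independently of the other.
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS elt_wh
        WAREHOUSE_SIZE = 'LARGE'
        AUTO_SUSPEND = 300
        AUTO_RESUME = TRUE
    """)
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS bi_wh
        WAREHOUSE_SIZE = 'SMALL'
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 4    -- scale-out; multi-cluster is an Enterprise feature
        AUTO_SUSPEND = 60
        AUTO_RESUME = TRUE
    """)

    # Scale the ELT warehouse up for a batch window, then back down.
    cur.execute("ALTER WAREHOUSE elt_wh SET WAREHOUSE_SIZE = 'XLARGE'")
    cur.execute("ALTER WAREHOUSE elt_wh SET WAREHOUSE_SIZE = 'LARGE'")

    cur.close()
    conn.close()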

Related

Best storage in Azure for very large data sets with fast update options

We have around 300 million documents on a drive, and we need to delete around 200 million of them. I am going to write the 200 million paths to a storage service so I can keep track of the deleted documents. My current thought is that an Azure SQL database is probably not well suited to this volume. Cosmos DB is too expensive. Storing CSV files is bad, because I would need to do updates every time I delete a file. Table Storage seems to be a pretty good match, but it does not offer group-by operations that could come in handy for status reports. I don't know much about Data Lake, i.e. whether it supports fast updates or is more like an archive. All input is welcome for choosing the right storage for this kind of reporting.
Thanks in advance.
For your scenario, you can use Azure Cosmos DB or Azure Table Storage.
Azure Table Storage offers a NoSQL key-value store for semi-structured data. Unlike a traditional relational database, each entity (analogous to a row in relational-database terminology) can have a different structure, allowing your application to evolve without downtime to migrate between schemas.
Azure Cosmos DB is a multi-model database service designed for global use in mission-critical systems. It exposes not only a Table API but also a SQL API and APIs for Apache Cassandra, MongoDB, and Gremlin, which let you swap an existing database out for a Cosmos DB implementation with relatively little change.
The differences are as follows:
Performance
Azure Table Storage has no upper bound on latency. Cosmos DB defines latency of single-digit milliseconds for reads and writes, along with operations at sub-15 milliseconds at the 99th percentile worldwide. (That was a mouthful.) Throughput is limited on Table Storage to 20,000 operations per second. On Cosmos DB, there is no upper limit on throughput, and more than 10 million operations per second are supported.
Global Distribution
Azure Table Storage supports a single region with an optional read-only secondary region for availability. Cosmos DB supports distribution from 1 to more than 30 regions with automatic failovers worldwide.
Billing
Azure Table Storage uses storage volume to determine billing. Pricing is tiered to get progressively cheaper per GB the more storage you use. Operations incur a charge measured per 10,000 transactions.
Cosmos DB has two billing models: provisioned throughput and consumed storage.
Provisioned throughput: Provisioned throughput (also called reserved throughput) guarantees high performance at any scale. You specify the throughput (RU/s) that you need, and Azure Cosmos DB dedicates the resources required to guarantee the configured throughput. You are billed hourly for the maximum provisioned throughput for a given hour.
Consumed storage: You are billed a flat rate for the total amount of storage (GBs) consumed for data and the indexes for a given hour.
For more details, please refer to the document.
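Coming back to the deletion-tracking scenario in the question, if Table Storage is chosen, a minimal Python sketch with the azure-data-tables package could look like the following; the table name, the batch-per-partition layout, and the connection string are illustrative assumptions, not the only reasonable design.

    # Sketch: tracking deleted document paths in Azure Table Storage.
    # Assumes the azure-data-tables package and a real connection string.
    import hashlib
    from azure.data.tables import TableClient

    CONN_STR = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."  # placeholder
    table = TableClient.from_connection_string(CONN_STR, table_name="deleteddocs")
    try:
        table.create_table()
    except Exception:
        pass  # table already exists

    def record_deletion(path: str, batch_id: str) -> None:
        # RowKey may not contain '/', '\', '#' or '?', so hash the path
        # and keep the original path as an ordinary property.
        row_key = hashlib.sha256(path.encode("utf-8")).hexdigest()
        table.upsert_entity({
            "PartitionKey": batch_id,  # e.g. one partition per deletion batch or day
            "RowKey": row_key,
            "Path": path,
            "Status": "deleted",
        })

    def count_batch(batch_id: str) -> int:
        # There is no GROUP BY in Table Storage; status reports mean filtering a
        # partition and counting client-side (or maintaining counters elsewhere).
        entities = table.query_entities(query_filter=f"PartitionKey eq '{batch_id}'")
        return sum(1 for _ in entities)

    record_deletion("share01/folder/doc-000001.pdf", batch_id="2020-05-01")
    print(count_batch("2020-05-01"))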

Containers for database and scalability

Consider TiDB and the TiDB Operator as examples for this question.
TiDB
TiDB ("Ti" stands for Titanium) is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
TiDB Operator
The TiDB Operator automatically deploys, operates, and manages a TiDB cluster in any Kubernetes-enabled cloud environment.
Once the database is live, there are broadly two scenarios:
Very high rate of read only queries.
Very high rate of write queries.
In either of the scenarios, which component of the containerized database scales? Read replicas? Database 'engine' itself? Persistent volumes? All of the above?
Containerized infrastructure abstracts storage and computing resources (consider PVs and Pods in Kubernetes), and these resources scale as the database scales, so the form of scaling depends on the database itself.
For TiDB, while it offers a MySQL-compatible SQL interface, its architecture is very different from MySQL and other traditional relational databases:
The SQL layer (TiDB) serves SQL queries and interacts with the storage layer based on the calculated query plan. It is stateless and scales on demand for both read and write queries. Typically, you scale the SQL layer out/up to get more compute resources for query-plan calculation, joins, aggregation, and serving more connections.
The storage layer (TiKV) is responsible for storing data and serving KV APIs for the SQL layer. The most interesting part of TiKV is the Multi-Raft replication: the storage layer automatically splits data into pieces and distributes them evenly across containers. Each piece is a Raft group whose leader serves read and write queries. Upon scale in/out, the storage layer automatically migrates data pieces to balance the load, so scaling out the storage layer gives you better read/write throughput and larger data capacity.
Back to the question: all of the components mentioned in the question scale. The read/write replicas serving SQL queries can scale, the database "engine" (the storage layer) serving KV queries can scale, and the PVs are scaled out along with the scaling of the storage layer.
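To make that concrete, here is a rough Python sketch using the official kubernetes client to scale out the TiKV storage layer by patching the TidbCluster custom resource that the TiDB Operator watches. The namespace, cluster name, and replica count are placeholders, and the CRD group/version (pingcap.com/v1alpha1) should be checked against your operator release.

    # Sketch: scaling out TiKV by patching the TidbCluster custom resource.
    # Assumes the kubernetes package, a working kubeconfig, and a TiDB Operator
    # serving tidbclusters.pingcap.com/v1alpha1 (verify for your version).
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a Pod
    api = client.CustomObjectsApi()

    # Raise the TiKV replica count; the operator adds Pods and PVs, and TiKV
    # rebalances its Raft regions onto the new stores automatically.
    patch = {"spec": {"tikv": {"replicas": 5}}}

    api.patch_namespaced_custom_object(
        group="pingcap.com",
        version="v1alpha1",
        namespace="tidb-cluster",  # placeholder namespace
        plural="tidbclusters",
        name="basic",              # placeholder TidbCluster name
        body=patch,
    )

    # Scaling the stateless SQL layer works the same way via spec.tidb.replicas.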

Azure SQL Database Pricing

I am unable to locate the cost per transaction for an Azure SQL Database.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-single-databases-manage
I know the SQL Server database is about $5 per month, but how much for the transactions?
If I go to the Azure Pricing Calculator (https://azure.microsoft.com/en-us/pricing/calculator/), they do not seem to have the info. They list the price for a single database as $187.77, so that is not the same service as the one you create if you use the link above.
TL;DR:
Azure SQL pricing is "flat": first you choose a performance level for your database which has a fixed cost (e.g. S6 for $580/mo or S1 for $30/mo), and this is billed by the second. Azure does not bill your account for actual IO/CPU usage.
The rest:
There is no single "cost per transaction" because a "transaction" is not a single uniform amount of work for a database server (e.g. a single SELECT over a small table with indexes is significantly less IO and CPU intensive compared to a MERGE over millions of rows).
There are several deployment types for Azure SQL, each with its own formula for determining monthly cost:
Single database (DTU)
Single database (vCore)
Elastic pool
Managed Instance
I assume you're interested in the "single database" deployment types, as "Managed instance" is a rather niche application and "Elastic pool" is to save money if you have lots (think: hundreds or thousands) of smaller databases. If you have a small number (e.g. under 100) of larger databases (in terms of disk space) then "Single database" is probably right for you. I won't go into detail on the other deployment types.
If you go with DTU-based Single Database deployment (which most users do), then the pricing follows this general formula:
Monthly-price = ( Instances * Performance-level )
Where Performance-level is the selected SKU for the minimum level of performance you need. You can change this level up or down at will at any point in time, as you're billed by the second rather than per month (though per-second pricing is difficult to work into a monthly price estimate).
A "DTU" (Database Throughput Unit) is a unit of measure that represents the actual cost to Microsoft of running your database, which is then passed on to you somewhat transparently (disregarding whatever profit-margin Microsoft has per-DTU, of course).
When determining what performance level to get for your database, select the level that offers the minimum number of DTUs your application actually needs. You determine this through profiling and estimating, usually by starting off with a high-performance database for a few hours (which won't cost more than a few dollars) and running your application code. If the actual DTU usage numbers are low (e.g. you get an "S6" 400 DTU (~$580/mo) database and see that you only use 20 DTUs under normal load), then you can safely drop down to the "S1" 20 DTU (~$30/mo) performance level.
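To put rough numbers on the per-second billing point, here is a quick Python sketch of the arithmetic; the S6 and S1 prices are the approximate monthly figures mentioned above, not an official rate card.

    # Sketch: cost of profiling on a big tier briefly, then running a small one.
    # ~$580/mo (S6) and ~$30/mo (S1) are the approximate figures quoted above.
    HOURS_PER_MONTH = 730  # Azure's usual billing month

    def per_second(monthly_price: float) -> float:
        return monthly_price / (HOURS_PER_MONTH * 3600)

    profiling_hours = 4
    profiling_cost = per_second(580.0) * profiling_hours * 3600
    monthly_on_s1 = per_second(30.0) * HOURS_PER_MONTH * 3600

    print(f"~${profiling_cost:.2f} to profile on S6 for {profiling_hours} hours")
    print(f"~${monthly_on_s1:.2f}/month if S1 turns out to be enough")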
The question about what a DTU actually is has been asked before, so to avoid creating a duplicate answer please read that QA here: Azure SQL Database "DTU percentage" metric
It depends on your requirements. I am using a single-instance Azure SQL Database, where pricing is based on a bundled measure of CPU, transaction throughput, and storage called a 'DTU', so it all comes down to what your workload needs.
If it runs in a VM (virtual machine) instead, you pay the VM cost plus the SQL Server cost (if you do not already have a SQL Server licence).
Cost https://azure.microsoft.com/en-us/pricing/calculator/

Azure DTU Calculation

We have some DB instances on Azure, and I am trying to optimize DB performance on Azure. Can anyone explain what an Azure DTU is and how we can calculate it?
1. What is an Azure DTU:
Azure SQL Database provides two purchasing models: vCore-based purchasing model and DTU-based purchasing model.
The DTU-based purchasing model is based on a bundled measure of compute, storage, and IO resources. Compute sizes are expressed in terms of Database Transaction Units (DTUs) for single databases and elastic Database Transaction Units (eDTUs) for elastic pools.
The Database Transaction Unit (DTU) represents a blended measure of CPU, memory, reads, and writes. The DTU-based purchasing model offers a set of preconfigured bundles of compute resources and included storage to drive different levels of application performance: Basic, Standard and Premium.
For more details, please see: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
2. How to optimize DB performance:
Since your DB instances are already on Azure, you can monitor them and improve their performance in the Azure Portal by troubleshooting or by changing the service tier. Please see Monitoring and performance tuning: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-monitor-tune-overview#improving-database-performance-with-more-resources
3. How to calculate Azure DTUs:
Here is a link to the Azure SQL Database DTU Calculator. This calculator will help us determine the number of DTUs for our existing SQL Server database(s), as well as a recommendation of the minimum performance level and service tier that we need before we migrate to Azure SQL Database.
If you still keep the database backup file, I think you can try this calculator.
Please see: http://dtucalculator.azurewebsites.net/
The key consideration for Azure SQL Database is meeting the performance requirements of the deployed database at the minimum cost. Undoubtedly, nobody wants to pay for redundant resources or features that they do not use or plan to use.
At this point, Microsoft Azure offers two different purchasing models to provide cost-efficiency:
Database Transaction Unit (DTU)-Based purchasing model.
Virtual Core (vCore)-Based purchasing model
A purchasing model decision directly affects both database performance and the total bill. In my view, if the deployed database will not consume too many resources, the DTU-based purchasing model is more suitable.
Now, we will discuss the details about these two purchasing models in the following sections.
Database Transaction Unit (DTU)-Based purchasing model
In order to understand the DTU-based purchasing model more clearly, we need to clarify what a DTU actually means in Azure SQL Database. DTU is an abbreviation for "Database Transaction Unit" and it describes a performance unit metric for Azure SQL Database. We can liken the DTU to horsepower in a car, because it directly affects the performance of the database. DTU represents a mixture of the following performance metrics as a single performance unit for Azure SQL Database:
CPU
Memory
Data I/O and Log I/O
Elastic Pool
Briefly, an Elastic Pool helps us automatically manage and scale multiple databases that have unpredictable and varying resource demands on top of a shared resource pool. With an Elastic Pool, we don't need to scale the databases continuously against resource-demand fluctuation. The databases that take part in the pool consume the Elastic Pool resources when they need them, but they cannot exceed the Elastic Pool resource limits, so it provides a cost-effective solution.
Properly Estimating the DTUs for an Azure SQL Database
After deciding to use the DTU-based purchasing model, we have to answer the following question with sound reasoning:
Which service tier and how much DTUs are required for my workload when migrating to Azure SQL?
The DTU Calculator is the main tool for estimating the DTU requirement when we migrate on-premise databases to Azure SQL Database. The main idea of this tool is to capture utilization of the various metrics on the existing SQL Server that affect DTUs, and then to estimate the approximate DTUs and service tier in light of the collected performance utilization. The DTU Calculator collects the following metrics through either a command-line utility or a PowerShell script and saves them to a CSV file.
Processor - % Processor Time
Logical Disk - Disk Reads/sec
Logical Disk - Disk Writes/sec
Database - Log Bytes Flushed/sec
Extracted from https://www.spotlightcloud.io/blog/what-is-dtu-in-azure-sql-database-and-how-much-do-we-need.
Check this well-written article on calculating DTU
Credit to original author : Esat Erkec
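To give a feel for what the calculator works from, here is a small Python sketch that summarizes a CSV of the four counters listed above; the column names are assumptions about the capture script's output, so adjust them to whatever the command-line utility or PowerShell script actually writes before uploading the file to the DTU Calculator.

    # Sketch: summarizing the captured performance counters before using the
    # DTU Calculator. Column names are assumed; match them to your CSV.
    import csv
    import statistics

    COLUMNS = [
        "% Processor Time",
        "Disk Reads/sec",
        "Disk Writes/sec",
        "Log Bytes Flushed/sec",
    ]

    def summarize(path: str) -> None:
        samples = {c: [] for c in COLUMNS}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for c in COLUMNS:
                    samples[c].append(float(row[c]))
        for c, values in samples.items():
            values.sort()
            p95 = values[int(0.95 * (len(values) - 1))]
            print(f"{c}: avg={statistics.mean(values):.1f}  p95={p95:.1f}")

    summarize("sql-perfmon-log.csv")  # placeholder file name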

SQL server scalability question

We are trying to build an application which will have to store billions of records: 1 trillion+.
A single record will contain text data and metadata about the text document.
Please help me understand the storage limitations. Can a database (SQL Server or Oracle) support this much data, or do I have to look for some other filesystem-based solution? What are my options?
Since the central server has to handle incoming load from many clients, how will parallel insertions and search scale? How do I distribute data over multiple databases or tables? I am a little green on database specifics for such a scaled environment.
Initially, to fill the database, the insert load will be high; later, as the database grows, the search load will increase and inserts will reduce.
The total size of the data will cross 1000 TB.
Thanks.
1 trillion+
a single record will contain text data and metadata about the text document.
please help me understand the storage limitations
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents, at 1024 bytes of total storage per document (VERY unlikely to be realistic when you say text), is about 950 terabytes of data. "Storage limitations" means you are talking high-end SAN here. Using a non-redundant setup of 2 TB discs, that is roughly 475 discs. Do the maths. Add redundancy / RAID to that and you are talking a major hardware investment. And this assumes only 1 KB per document. If you average 16 KB per document, this is... roughly 7,500 2 TB discs.
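The disc arithmetic above, as a quick Python back-of-the-envelope using the same per-document size assumptions and no RAID or backup overhead:

    # Sketch: non-redundant storage sizing for 1 trillion documents.
    # Prints roughly 931 TB / ~466 discs at 1 KB per document and roughly
    # 14,900 TB / ~7,450 discs at 16 KB, i.e. the same ballpark as above.
    DOCS = 1_000_000_000_000
    DISC_TB = 2  # 2 TB discs

    for doc_bytes in (1024, 16 * 1024):  # 1 KB and 16 KB per document
        total_tb = DOCS * doc_bytes / 1024**4  # bytes -> binary TB (TiB)
        discs = total_tb / DISC_TB
        print(f"{doc_bytes} B/doc -> ~{total_tb:,.0f} TB, ~{discs:,.0f} x {DISC_TB} TB discs")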
That is a hardware problem to start with. SQL Server does not scale that high, and you cannot do it in a single system anyway. The normal approach for a document store like this would be a clustered storage system (a clustered or otherwise distributed file system) plus a central database for the keywords / tagging, depending on load / inserts possibly with replication of the database for distributed search.
Whatever it is going to be, the storage / backup requirements are enormous. Large project here, large budget.
IO load is going to be another issue, hardware-wise. You will need a large machine and get a TON of IO bandwidth into it. I have seen 8 Gb links overloaded on a SQL Server (fed by an HP EVA with 190 discs) and I can imagine you will run into something similar. You will want hardware with as much RAM as technically possible, regardless of the price, unless you store the blobs outside the database.
SQL row compression may come in VERY handy. Full-text search will be a problem.
the total size of data will cross 1000 TB.
No. Seriously. It will be bigger, I think. 1000 TB would assume the documents are small, like the XML form of a travel ticket.
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to databases, it's quite tricky: there can be multiple components involved in getting performance, like a Redis cache, sharding, read replicas, etc.
The post below describes simplified relational database scalability options.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html
