Part1 #
As per the pricing policy of snowflakes ,we will be paying based on the usage and we will not be charged if we won't use resources..This is clear.However I Am trying to understand ,is there any chance for reducing the cost if we drop the unused or rarely used warehouses? users and roles that are not been used any more ?I was looking some cost savings in terms of reducing the cloud services cost.
Part 2#
which is the most cost effective way .
1)Allocating separate warehouse for each team who uses the warehouse at specific times
(or)
2)Allocating single warehouse for all them and monitor warehouse load closely,such that if we notice queued load on warehouse then opt scale out option(multi cluster)(S+S)?
Please suggest the best way so that we can reduce overall cost.
there are only two things major things you are charged for disk and cpu, and a couple of minor things like compile time, and inter region IO charges. But users, warehouses, & roles are just access control lists in the end, that are to control cpu and disk usage.
prior to per second billing we found using one warehouse for a couple of teams meant less wasted CPU billing, and to some degree that almost is the case with the min 60 second billing, but we have a shared x-small most teams do dev on, and then spin-up bigger warehouses to run one-off loads (and then shut down) or have auto-scaling clusters to handle "normal load" which we also use cron jobs to limit "max size" just so in the off-peek times we intentionally increase latency of total load, to shift expenditure budget to peek times. and compared to the always running clusters, our dev instances are single digit percentages, so 1 or 2 warehouses is a round error.
The way we found the most value for reducing cost, was to look at the bill and see what seemed more $$ then we expected for the bang we where getting, and then we experimented, to see if there were lower cost ways to reach the same end goal. Be it different shaped tables that we multi inserted into, or finding queries that had long execution times, or pruned lots of rows (which might lead to the first point).. if you are want to save dollars you have to whach/care how you are spending them, and make trade-offs.
Part #1
Existence of multiple Warehouse will not incur any cost, cost will only come when it will be utilized as part of compute. However dropping unused objects will certainly ease the operational effort. Also if user exists and not being used it should fall under your security audit and it is always better to disable a user instead of dropping. Validate all downstream application ETL jobs/BI reports (If any) before dropping any users/roles
Cloud service cost is entirely different ball game , it follows 10% rule. One need to pay this amount when cloud service usage exceeds 10% of the warehouse usage on that day.
Part #2
Snowflake always suggest warehouse should be created based on your activity. Please do not create warehouse to segregate teams/user group. Create user and roles for that.
What we observed
During development keeping only one virtual Warehouse, until real requirement pops up (Project team wise segregation for cost sharing or budgeting or credit assessment) there is no need to have multiple warehouse created.
Even for Prod activity wise segregation is ideal, one for ETL load/BI reporting / Data analytics team
Thanks
Palash Chatterjee
Related
I use snowflake for turning out if it can use for DWH, and I am concern with the query behavior when the warehouse somehow fails.
https://docs.snowflake.com/en/user-guide/warehouses-considerations.html#multi-cluster-warehouses-improve-concurrency
According to the above page, if the minimum cluster is set to higher than 1, it helps ensure availability and continuity.
I have questions about it.
1.If we set it to 1 and the warehouse fails, the proceeded query come to fail?
2.If we set it to 2 or more and a cluster of the warehouse fails, the proceeded query come to fail and start automatically by another cluster?
When a warehouse fails, a new warehouse is automatically started and the query is retired. In the 6 years at my prior job that we ran snowflake, there was less that a dozen times where we experienced warehouse failure.
It was often around releases being rolled out. One thing that does happen on failure is the release is pushed back. So we noticed blips in our processing rate, or increased total time, and during the trouble periods the query profile might show 1-3 tabs of query plan, for each retry.
At least one of those outages was a failure to bring up new warehouses, class of problem, and in that incident, I don't think we where impacted as we have stuff just always running.
A side note also is you get billed for those failures, so that can bite if you are doing large computations and it fails and retries. We have had refunds (of the extra cost) when we can show there was a cost increase due to a known failure event.
But if you are looking at run Medium and smaller warehouses those normally start same second, so you might not notice a "failure" but if you are running a really large instance size, it can take longer to bring that capacity online.
I can understand the provisioned DB cost but there are few questions regd on-demand nodes.
does OnDemand pricing only considers the sum of WRU used by each partition or the overall WRU for the table based on the usage pattern which will be shared by each partition.
when there is a hot partition, does OnDemand increase WRU only for that partition or increases the overall WRU of the table.
does adaptive capacity work with OnDemand DB
ex:
OnDemand DB with 10 partitions and current peak at 1000WRU.
if 2 hot partitions require more than 300WRU will it use from adaptive capacity or increase the overall WRU to 3000WRU resulting in high cost?
I'm not a DynamoDB insider, so I can only answer from what I understand from their documentation.
In on-demand pricing (see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand) you pay exactly by the number and size of your requests. If you make one million requests, you will pay the same whether these requests were to a million different partitions, or they all went to the same partition.
You might wonder, then, why there was such an issue of load imbalance pricing in provisioned-capacity pricing - or at least why is the Web full of stories of such an issue. There should never have been such an issue, but in the past there was. But since recently, this isn't an issue any more. Here is the story:
In the provisioned pricing page, Amazon claims that if you reserve 1000 WCU, you should be able to use this number of write units that you paid for, per second, and if you try to use more, you'll be throttled. There is no mention or warning of imbalanced loads or hot partitions... But people discovered that this wasn't quite true - Amazon had a bug in their throttling code: The usage counting wasn't done across the entire cluster. Instead, if your data was spread over 10 nodes, your reservation of 1000 was evenly split among them, so each of the 10 nodes would start to throttle you after 100 (1000/10) requests per second. This split worked well for well-balanced loads, but with hot partitions, it didn't work well. People were paying for a reservation of 1000 but when they measured how much they were getting, they saw throttling after just 800 (for example) requests per second. Amazon acknowledge this was a bug, and fixed it with their "adaptive capacity" technique where each of the nodes picks a different throttling limit, modified until the user's total usage approaches what he had paid for. This technique is explained in this excellent official talk
https://www.youtube.com/watch?v=yvBR71D0nAQ - see time 19:38. Until very recently this "adaptive capacity" was a very blunt instrument, which only worked well if your workload doesn't change quickly, but since then, this issue was fixed too - as described in
https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
I am at the beginning of a project where we will need to manage a near real-time flow of messages containing some ids (e.g. sender's id, receiver's id, etc.). We expect a throughput of about 100 messages per second.
What we will need to do is to keep track of the number of times these ids appeared in a specific time frame (e.g. last hour or last day) and store these values somewhere.
We will use the values to perform some real time analysis (i.e. apply a predictive model) and update them when needed while parsing the messages.
Considering the high throughput and the need to be in real time what DB solution would be the better choice?
I was thinking about a key-value in memory DB that will persist data on disk periodically (like Redis).
Thanks in advance for the help.
The best choice depends on many factors we don’t know, like what tech stack is your team already using, how open are they to learning new things, how much operational burden are you willing to take on, etc.
That being said, I would build a counter on top of DynamoDB. Since DynamoDB is fully managed, you have no operational burden (no database server upgrades, etc.). It can handle very high throughput, and it has single-digit millisecond latency for writes and reads to a single row. AWS even has documentation describing how to use DynamoDB as a counter.
I’m not as familiar with other cloud platforms, but you can probably find something in Azure or GCP that offers similar functionality.
I'm building a web service, consisting of many different components, all of which could conceivably be bottlenecks. I'm currently trying to figure out what metrics I should be looking for, when deciding whether or not my database (on AWS RDS) is the bottleneck in the chain.
Looking at AWS Cloudwatch, I see a number of RDS metrics given. Full list:
CPUCreditBalance
CPUCreditUsage
CPUUtilization
DatabaseConnections
DiskQueueDepth
FreeStorageSpace
FreeableMemory
NetworkReceiveThroughput
NetworkTransmitThroughput
ReadIOPS
ReadLatency
ReadThroughput
SwapUsage
WriteIOPS
WriteLatency
WriteThroughput
The key metrics that I think I should be paying attention to:
Read/Write Latency
CPU-Utilization
Freeable Memory
With the latency metrics, I'm thinking that I should set up alerts if it exceeds >300ms (for fast website responsiveness), though I recognize that this is very much workload dependent.
With the CPU/memory-util, I have no idea what numbers to set these to. I'm thinking I should set an alert for 75% CPU-utilization, and 75% drop in Freeable Memory.
Am I on the right track with the metrics I've shortlisted above, and the thresholds I have guessed? Are there any other metrics I should be paying attention to?
The answer is totally dependent on your application. Some applications will require more CPU, some will need more RAM. There is no definitive answer.
The best thing is to monitor your database (with the metrics you list above). Then, when performance is below desired, take a look at which metrics are showing problems. These should be the first ones you track for scaling your database.
The key idea that if your customers are experiencing problems, it should be appearing in your metrics somewhere. If this isn't the case, then you're not collecting sufficient metrics.
I think you are on the right track - especially with the latency metrics; for a typical application with database back-end, the read/write latency is going to be what the user notices most if it degrades. Sure the memory or cpu usage may spike, but does any user care? No, not unless it then causes the latency to go up.
I'd start with the metrics you listed as the low-hanging fruit and adjust accordingly.
We're being asked to spec out production database hardware for an ASP.NET web application that hasn't been built yet.
The specs we need to determine are:
Database CPU
Database I/O
Database RAM
Here are the metrics I'm currently looking at:
Estimated number of future hits to
website - based on current IIS logs.
Estimated worst-case peak loads to
website.
Estimated number of DB queries per
page, on average.
Number of servers in web farm that
will be hitting database.
Cache polling traffic from database
(using SqlCacheDependency).
Estimated data cache misses.
Estimated number of daily database transactions.
Maximum acceptable page render time.
Any other metrics we should be taking into account?
Also, once we have all those metrics in place, how do they translate into hardware requirements?
What I have been doing lately for server planning is using some free tools that HP provides, which are collectively referred to as the "server sizers". These are great tools because they figure out the optimal type of RAID to use, and the correct number of disk spindles to handle the load (very important when planning for a good DB server) and memory processor etc. I've provided the link below I hope this helps.
http://h71019.www7.hp.com/ActiveAnswers/cache/70729-0-0-225-121.html?jumpid=reg_R1002_USEN
What I am missing is a measure for the needed / required / defined level of reliability.
While you could probably spec out a big honking machine to handle all the load, depending on your reliabiltiy requirements, you might rather want to invest in smaller, but multiple machines, and into safer disk subsystems (RAID 5).
Marc
In my opinion, estimating hardware for an application that hasn't been built and designed yet is more of a political issue than a scientific issue. By the time you finish the project, current hardware capability and their price, functional requirements, expected number of concurrent users, external systems and all other things will change and this change is beyond your control.
However this question comes up very often since you need to put numbers in a proposal or provide a report to your manager. If it is a proposal, what you are trying to accomplish is to come up with a spec that can support the proposed sofware system. The only trick is to propose a system that will not increase your cost for competiteveness while not puting yourself at the risk of a low performance system.
If you can characterize your current workload in terms of hits to pages, then you can then:
1) calculate the typical type of query that will be done for each page
2) using the above 2 pieces of information, estimate the workload on the database server
You also need to determine your performance requirements - what is the max and average response time you want for your website?
Given the workload, and performance requirements, you can then calculate capacity. The best way to make this estimate is to use some existing hardware, run a simulated database workload on a database on that hardware, and then extrapolate your hardware requirements based on your data from the first steps.