I can understand the provisioned-capacity cost, but I have a few questions regarding on-demand mode.
Does on-demand pricing consider only the sum of the WRUs used by each partition, or an overall WRU figure for the table, based on the usage pattern, that is shared across the partitions?
When there is a hot partition, does on-demand mode increase the WRUs only for that partition, or does it increase the overall WRUs of the table?
Does adaptive capacity work with on-demand tables?
Example:
An on-demand table with 10 partitions and a current peak of 1000 WRU.
If 2 hot partitions require more than 300 WRU, will the table draw on adaptive capacity, or will it increase the overall capacity to 3000 WRU, resulting in a high cost?
I'm not a DynamoDB insider, so I can only answer from what I understand from their documentation.
In on-demand pricing (see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand) you pay exactly for the number and size of your requests. If you make one million requests, you will pay the same whether those requests went to a million different partitions or they all went to the same partition.
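A minimal sketch of that billing model (my own illustration; the per-million price below is just an example figure, so check the current pricing page for your region):

    import math

    # Illustrative on-demand write pricing: one write request unit (WRU) covers
    # a write of up to 1 KB; larger items consume ceil(size_kb) WRUs.
    PRICE_PER_MILLION_WRU = 1.25  # assumed example price (USD); varies by region and over time

    def wrus_for_write(item_size_kb):
        return max(1, math.ceil(item_size_kb))

    def on_demand_write_cost(writes_by_partition, item_size_kb=1.0):
        # The partition breakdown is irrelevant to the bill - only the total counts.
        total_wrus = sum(writes_by_partition.values()) * wrus_for_write(item_size_kb)
        return total_wrus * PRICE_PER_MILLION_WRU / 1_000_000

    balanced = {f"p{i}": 100_000 for i in range(10)}   # 1M writes spread evenly
    hot      = {"p0": 1_000_000}                        # 1M writes to one partition
    assert on_demand_write_cost(balanced) == on_demand_write_cost(hot)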
You might wonder, then, why load imbalance was ever a pricing issue with provisioned capacity - or at least why the Web is full of stories about such an issue. There should never have been such an issue, and as of fairly recently there no longer is, but in the past there was. Here is the story:
In the provisioned pricing page, Amazon claims that if you reserve 1000 WCU, you should be able to use the write units you paid for, per second, and if you try to use more, you'll be throttled. There is no mention or warning of imbalanced loads or hot partitions... But people discovered that this wasn't quite true - Amazon had a bug in their throttling code: the usage counting wasn't done across the entire cluster. Instead, if your data was spread over 10 nodes, your reservation of 1000 was evenly split among them, so each of the 10 nodes would start to throttle you after 100 (1000/10) requests per second. This split worked well for well-balanced loads, but not for hot partitions. People were paying for a reservation of 1000, but when they measured how much they were actually getting, they saw throttling after just 800 (for example) requests per second. Amazon acknowledged this was a bug, and fixed it with their "adaptive capacity" technique, where each of the nodes picks a different throttling limit, adjusted until the user's total usage approaches what they paid for. This technique is explained in this excellent official talk
https://www.youtube.com/watch?v=yvBR71D0nAQ - see time 19:38. Until very recently this "adaptive capacity" was a very blunt instrument that only worked well if your workload didn't change quickly, but since then this issue has been fixed too - as described in
https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
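To make the throttling story concrete, here is a toy model (my own illustration, not DynamoDB's actual algorithm) of the old static per-partition split versus an adaptive limit:

    def throughput_with_static_split(provisioned_wcu, demand_per_partition):
        # Old behaviour: each partition gets an equal, fixed share of the reservation.
        per_partition_limit = provisioned_wcu / len(demand_per_partition)
        return sum(min(d, per_partition_limit) for d in demand_per_partition)

    def throughput_with_adaptive_capacity(provisioned_wcu, demand_per_partition):
        # Adaptive capacity (roughly): per-partition limits are adjusted so the table
        # as a whole can consume up to the provisioned total, even with skewed demand.
        return min(sum(demand_per_partition), provisioned_wcu)

    # 10 partitions, 1000 WCU provisioned, two hot partitions doing 300 writes/s each.
    demand = [300, 300] + [30] * 8
    print(throughput_with_static_split(1000, demand))       # 440 -> throttled well below 1000
    print(throughput_with_adaptive_capacity(1000, demand))  # 840 -> all demand served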
I'm trying to make sense of the RDS "Burst Balance" metric that I've been seeing over the past couple of months. According to the metrics reported in CloudWatch, my burst balance never seems to drop lower than 99%, but sometimes it stays near 99% for weeks and sometimes it recovers back to 100% rather quickly. The graph below is Burst Balance between 5/1/2020 and 7/23/2020.
I'm confused because there seems to be no correlation between this metric and any other metric, such as system traffic reported by the load balancer. Also - why does it never drop below 99% and why does it sometimes take weeks to return to 100% even though there is very little traffic on my system past 5:00 PM every day, and virtually none on weekends? Here is my system traffic (load balancer request count) for that same time period:
Why is the burst balance all over the place when the system traffic is very predictable and repetitive?
The burst balance relates to disk usage on your RDS instance.
This assumes you're using a GP2 volume, which is burstable (it has credits that can be depleted, too). Just like burstable instance classes (T2/T3), GP2 has a balance of I/O credits that lets its IOPS burst up to 3,000.
GP2 provides a baseline of 3 IOPS per GiB of storage (for example, if you have 100 GiB of storage then you get 300 IOPS guaranteed; anything over this will use your credits).
Once your usage goes over this number you start depleting the balance, and while you are under it the balance replenishes. When you hover right around your baseline IOPS you are likely to neither deplete nor regain much balance (which is what you are seeing).
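A quick back-of-the-envelope sketch of that credit math (using the published gp2 figures: 3 IOPS per GiB baseline with a 100 IOPS floor, a 3,000 IOPS burst ceiling, and a bucket that starts full at 5.4 million I/O credits):

    def gp2_baseline_iops(storage_gib):
        # gp2 baseline: 3 IOPS per GiB, with a minimum of 100 IOPS.
        return max(100, 3 * storage_gib)

    def burst_minutes_at(storage_gib, sustained_iops, credits=5_400_000):
        # How long a full credit bucket lasts while consuming IOPS above baseline.
        baseline = gp2_baseline_iops(storage_gib)
        if sustained_iops <= baseline:
            return float("inf")  # at or below baseline the bucket never drains
        return credits / (sustained_iops - baseline) / 60

    print(gp2_baseline_iops(100))       # 300 IOPS baseline for a 100 GiB volume
    print(burst_minutes_at(100, 3000))  # ~33 minutes of full-speed burst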
To see more of what is driving the usage, check the read and write IOPS of your RDS instance in the CloudWatch metrics. You can combine this with Performance Insights to find any specific queries that might be leading to this.
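For example, pulling a day of ReadIOPS/WriteIOPS with boto3 (the instance identifier here is a placeholder):

    import boto3
    from datetime import datetime, timedelta

    cloudwatch = boto3.client("cloudwatch")

    for metric in ("ReadIOPS", "WriteIOPS"):
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=metric,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
            StartTime=datetime.utcnow() - timedelta(days=1),
            EndTime=datetime.utcnow(),
            Period=300,                 # 5-minute datapoints
            Statistics=["Average"],
        )
        peak = max((p["Average"] for p in stats["Datapoints"]), default=0)
        print(metric, "peak 5-min average:", peak)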
More information about I/O Credits and Burst Performance is available from within the documentation.
Part 1
As per Snowflake's pricing policy, we pay based on usage and are not charged for resources we don't use. This is clear. However, I am trying to understand: is there any chance of reducing cost if we drop unused or rarely used warehouses, and users and roles that are no longer used? I am looking for some cost savings in terms of reducing the cloud services cost.
Part 2
Which is the more cost-effective approach?
1) Allocating a separate warehouse to each team, which uses it at specific times,
(or)
2) Allocating a single warehouse to all of them and monitoring the warehouse load closely, so that if we notice queued load on the warehouse we opt for the scale-out option (multi-cluster) (S+S)?
Please suggest the best approach so that we can reduce the overall cost.
There are only two major things you are charged for, disk and CPU, plus a couple of minor things like compile time and inter-region IO charges. Users, warehouses, and roles are just access control lists in the end, there to control CPU and disk usage.
Prior to per-second billing we found that using one warehouse for a couple of teams meant less wasted CPU billing, and to some degree that is almost still the case with the minimum 60-second billing. We have a shared X-Small that most teams do dev on, and then spin up bigger warehouses to run one-off loads (and shut them down afterwards), or have auto-scaling clusters to handle the "normal load". We also use cron jobs to limit the "max size", so that in off-peak times we intentionally increase the latency of the total load and shift the expenditure budget to peak times. Compared to the always-running clusters, our dev instances are single-digit percentages, so 1 or 2 dev warehouses are a rounding error.
The way we found the most value in reducing cost was to look at the bill, see what cost more than we expected for the bang we were getting, and then experiment to see whether there were lower-cost ways to reach the same end goal - be it differently shaped tables that we multi-inserted into, or finding queries that had long execution times or pruned lots of rows (which might lead back to the first point). If you want to save dollars you have to watch/care how you are spending them, and make trade-offs.
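As an illustration of the cron-job approach mentioned above (a sketch only; the warehouse name, sizes, cluster counts and credentials are placeholders, and you would run it from whatever scheduler you already use):

    import snowflake.connector

    # Cap the ETL warehouse outside peak hours so off-peak work runs slower but cheaper,
    # then restore the larger size (and cluster count) for peak hours.
    OFF_PEAK = "ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'SMALL' MAX_CLUSTER_COUNT = 1"
    PEAK     = "ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE' MAX_CLUSTER_COUNT = 3"

    def set_warehouse_profile(statement):
        conn = snowflake.connector.connect(
            account="my_account", user="cost_admin", password="***", role="SYSADMIN"
        )
        try:
            conn.cursor().execute(statement)
        finally:
            conn.close()

    # e.g. cron at 20:00: set_warehouse_profile(OFF_PEAK)
    #      cron at 07:00: set_warehouse_profile(PEAK)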
Part #1
The existence of multiple warehouses will not in itself incur any cost; cost only arises when a warehouse is actually used for compute. However, dropping unused objects will certainly ease the operational effort. Also, if a user exists but is no longer used, that should fall under your security audit, and it is always better to disable a user instead of dropping it. Validate all downstream application ETL jobs/BI reports (if any) before dropping any users/roles.
Cloud services cost is an entirely different ball game; it follows the 10% rule. You only pay for cloud services when their usage exceeds 10% of the warehouse usage on that day.
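A worked example of that 10% rule (my own arithmetic, based on the rule as described above):

    def billed_cloud_services_credits(compute_credits, cloud_services_credits):
        # Cloud services usage is only billed for the portion that exceeds
        # 10% of that day's warehouse (compute) credit usage.
        free_allowance = 0.10 * compute_credits
        return max(0.0, cloud_services_credits - free_allowance)

    print(billed_cloud_services_credits(100, 8))   # 0.0 -> within the 10% allowance
    print(billed_cloud_services_credits(100, 15))  # 5.0 -> only the excess is billed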
Part #2
Snowflake suggests that warehouses be created based on your activity. Please do not create warehouses to segregate teams/user groups; create users and roles for that.
What we observed
During development, keep only one virtual warehouse; until a real requirement pops up (project/team-wise segregation for cost sharing, budgeting, or credit assessment), there is no need to create multiple warehouses.
Even for prod, activity-wise segregation is ideal: one warehouse each for ETL load, BI reporting, and the data analytics team.
Thanks
Palash Chatterjee
We have been looking for a big data store that can hold a huge pool of user information.
I should also mention that we are working on an RTB platform (namely the DSP side). As a result, our platform handles around 100 million requests per day, which is about 1-2 thousand requests per second (depending on the time of day). Here is a simple overview of what we are going to implement:
My questions are:
Is Solr (SolrCloud) a good solution for a Data Management Platform?
Do you think SolrCloud can handle high-frequency traffic?
I have looked at the Solr performance problems page, and one concern is extremely frequent commits (under the "Slow commits" item). What is the limit?
What configuration of SolrCloud would be suitable for us? I mean the number of shards and cores, and the server configuration (CPU, RAM), to handle 1-2K QPS and store about 500M docs. How can we calculate this?
I'm building a web service, consisting of many different components, all of which could conceivably be bottlenecks. I'm currently trying to figure out what metrics I should be looking for, when deciding whether or not my database (on AWS RDS) is the bottleneck in the chain.
Looking at AWS Cloudwatch, I see a number of RDS metrics given. Full list:
CPUCreditBalance
CPUCreditUsage
CPUUtilization
DatabaseConnections
DiskQueueDepth
FreeStorageSpace
FreeableMemory
NetworkReceiveThroughput
NetworkTransmitThroughput
ReadIOPS
ReadLatency
ReadThroughput
SwapUsage
WriteIOPS
WriteLatency
WriteThroughput
The key metrics that I think I should be paying attention to:
Read/Write Latency
CPU-Utilization
Freeable Memory
With the latency metrics, I'm thinking that I should set up alerts if latency exceeds 300 ms (for fast website responsiveness), though I recognize that this is very much workload-dependent.
With CPU and memory utilization, I have no idea what numbers to set these to. I'm thinking I should set an alert at 75% CPU utilization, and at a 75% drop in freeable memory.
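For what it's worth, this is roughly how I'd wire up those alarms with boto3 (a sketch; the instance identifier, SNS topic and thresholds are placeholders, and ReadLatency/WriteLatency are reported in seconds, so 300 ms is 0.3):

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    DB_DIMENSIONS = [{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}]
    SNS_TOPIC = "arn:aws:sns:us-east-1:123456789012:rds-alerts"  # placeholder

    alarms = [
        ("rds-read-latency-high", "ReadLatency", "GreaterThanThreshold", 0.3),    # 300 ms
        ("rds-write-latency-high", "WriteLatency", "GreaterThanThreshold", 0.3),  # 300 ms
        ("rds-cpu-high", "CPUUtilization", "GreaterThanThreshold", 75.0),         # percent
        ("rds-freeable-memory-low", "FreeableMemory", "LessThanThreshold", 512 * 1024 * 1024),  # bytes
    ]

    for name, metric, operator, threshold in alarms:
        cloudwatch.put_metric_alarm(
            AlarmName=name,
            Namespace="AWS/RDS",
            MetricName=metric,
            Dimensions=DB_DIMENSIONS,
            Statistic="Average",
            Period=300,
            EvaluationPeriods=3,     # only alert on sustained breaches
            Threshold=threshold,
            ComparisonOperator=operator,
            AlarmActions=[SNS_TOPIC],
        )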
Am I on the right track with the metrics I've shortlisted above, and the thresholds I have guessed? Are there any other metrics I should be paying attention to?
The answer is totally dependent on your application. Some applications will require more CPU, some will need more RAM. There is no definitive answer.
The best thing is to monitor your database (with the metrics you list above). Then, when performance is below desired, take a look at which metrics are showing problems. These should be the first ones you track for scaling your database.
The key idea is that if your customers are experiencing problems, it should appear in your metrics somewhere. If it doesn't, then you're not collecting sufficient metrics.
I think you are on the right track - especially with the latency metrics; for a typical application with a database back-end, the read/write latency is what the user notices most if it degrades. Sure, the memory or CPU usage may spike, but does any user care? No, not unless it then causes the latency to go up.
I'd start with the metrics you listed as the low-hanging fruit and adjust accordingly.
I am running a proof of concept using App Engine and the built-in Search API. We are testing the Search API under the assumption that it provides linear scaling, as is the case with other products and services bundled with App Engine.
Specs: approx. 8 million documents in a single index
Query type: complex queries - we need spatial queries based on square areas, not distance(!). All queries include 2 ranges, on latitude and longitude.
Page sizes: between 16 and 250.
Accuracy (result counting): set to 100 in all test cases.
Our target performance (latency) is in the 100's of milliseconds range.
We are testing performance of the Search API running several concurrent requests. Test results are now measured at about 25 concurrent requests, but this number is expected to go up significantly. However, if the Search API is properly scalable then this should be meaningless.
I am measuring the time it takes the Search API to process a call to Index.search(Query).
What I am measuring is the following:
The average time it takes the search method to return is around 8000 ms. There are no cases in which the method returns significantly faster or slower than that. However, using an index with 10 documents results in latency measurements of around 300 ms (!!!). This might be an indication that the Search API is not scalable at all.
The page size does not seem to make any significant difference. Perhaps at page sizes of 10,000 or higher it would, but this is not part of our tests.
Adding one criterion (an equality filter) seems to speed up the search significantly - up to approximately 40% improvement. That is a nice improvement, but 4 seconds is still an eternity.
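For context, the kind of call being timed looks roughly like this (legacy Python App Engine Search API; the index name, field names and range values are placeholders for our real ones):

    import time
    from google.appengine.api import search

    index = search.Index(name="markers")  # ~8 million documents in our tests

    # Two numeric ranges (a square area), page size and counting accuracy as above.
    query = search.Query(
        query_string="lat >= 52.30 AND lat <= 52.40 AND lng >= 4.80 AND lng <= 4.95",
        options=search.QueryOptions(limit=250, number_found_accuracy=100),
    )

    start = time.time()
    results = index.search(query)
    print("latency_ms=%d found=%d" % ((time.time() - start) * 1000, results.number_found))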
Questions:
1) What is the expected latency (best possible scenario/configuration) that the Search API can deliver?
2) Which parameters influence latency, including the App Engine configuration?
3) Does the number of documents in an index influence latency?
4) Is a search based on 2 range queries slower than a search based on equality filters alone? (Because we could pre-process the data and add 'index' data to each document.)
5) Is the Search API really scalable?
Our application for this was to plot a number of markers on a map using a tile server. However, the tile server performs many queries (i.e. 'tiles') in parallel - almost 30 per user/view. To make things difficult, we were not able to solve this problem using pre-aggregated maps, because we have too many parameters/dimensions to take care of (if this is the case for you, then try Google Maps Engine).
So, we ended up with a Cloud SQL instance set to the highest tier for maximum performance. Another reason to use a relational database is that index performance is more precisely tunable than with the Search API or BigQuery.
To answer the questions, this is what we found:
1) The latency depends on the size of the index. At lower volumes per index the latency seems reasonable; at much higher volumes it may become a problem. For text searches this is probably OK in most cases.
2) We did not test at lower volumes, but at around 8 million documents the latency sits between 5,000 and 8,000 ms per query. We did not find any parameters that decreased latency; we did find parameters that increased it.
3) Yes.
4) We did not test this.
5) Yes.