We use Azure SQL Database, currently on the S3 tier, and we have a problem with one of our services, which pushes data to the database. It's called very often, and most of the time our DTU utilization is above 95%. We have already optimized what we could, but fundamentally it's too many DB hits. We are working on other optimizations (caching, etc.).
The problem is that this DB is also used by our other application, and because the DTUs are consumed by the other service, we have performance issues.
I was wondering if there is a way to set a maximum DTU limit for a single SQL user, e.g. 30%.
I tried to google it, but couldn't find anything related to this topic.
Thanks a lot for any answers or suggestions.
There is currently no way in SQL Database to limit resources on a per-query / per-client basis. Which resource dimension are you maxing out (CPU, reads, writes)? If you cannot optimize further, you might have to bite the bullet and scale up. If you are IO-bound, then switching to P1 will help you. If you are CPU-bound, you might have to go up to P2.
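If you're not sure which dimension it is, you can sample the sys.dm_db_resource_stats DMV from the database itself. A minimal sketch (Python + pyodbc; the server name and credentials are placeholders):

```python
# Sketch: check which resource dimension is hitting the DTU cap.
# sys.dm_db_resource_stats keeps roughly an hour of ~15-second samples;
# the connection details below are placeholders for your own.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:yourserver.database.windows.net,1433;"
    "Database=yourdb;Uid=youruser;Pwd=yourpassword;Encrypt=yes;"
)

sql = """
SELECT TOP (20)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM   sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

for row in conn.cursor().execute(sql):
    # The dimension sitting near 100% is the one to optimize or scale for.
    print(row.end_time, row.avg_cpu_percent,
          row.avg_data_io_percent, row.avg_log_write_percent)
```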
The percentage is always relative to an S2-tier database. If you are at 95%, it means you are using 95% of the DTUs an S2-tier database would have. In that case you are close to 100%, so you will probably soon need a tier larger than S2. You are using S3, so you have the right tier.
From azure-sql-database-introduces-new-near-real-time-performance-metrics:
For example, if your DTU consumption shows a value of 80%, it indicates it is consuming DTU at the rate of 80% of the limit an S2 database would have. If you see values greater than 100% in this view, it means that you need a performance tier larger than S2.
As an example, let's say you see a percentage value of 300%. This tells you that you are using three times more resources than would be available in an S2. To determine a reasonable starting size, compare the DTUs available in an S2 (50 DTUs) with the next higher sizes (P1 = 100 DTUs, or 200% of S2; P2 = 200 DTUs, or 400% of S2). Because you are at 300% of S2, you would want to start with a P2 and re-test.
Based on the DTU usage percent you can determine if your database can fit within the S2 performance level (or a lower/higher level, as indicated through the DTU percentage and the relative DTU powers of the various performance tiers as documented on the MSDN site).
When you have locking problems, you need to find the queries that lock the db and rewrite them. Scaling to a larger db tier will only help a little, and giving the application that causes the problems less db performance would only extend the lock times.
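To find who is blocking whom, you can query the standard DMVs. A rough sketch (the connection string is a placeholder):

```python
# Sketch: list sessions that are currently blocked and who blocks them.
# sys.dm_exec_requests / sys.dm_exec_sql_text are standard DMVs.
import pyodbc

conn = pyodbc.connect("...your Azure SQL connection string...")

sql = """
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS query_text
FROM   sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE  r.blocking_session_id <> 0;
"""

for row in conn.cursor().execute(sql):
    print(f"session {row.session_id} blocked by {row.blocking_session_id} "
          f"({row.wait_type}, {row.wait_time} ms): {row.query_text}")
```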
I am rebuilding some indexes in Azure SQL using a fill factor of 80 (recommended by the company that developed the application, who are not experts on the database), and after doing this queries got a LOT slower. We noticed that they were now spending much longer in "Network I/O". Does anybody know what the problem might be?
Fill factor is not a silver bullet and has its tradeoffs. https://www.mssqltips.com/sqlservertip/5908/what-is-the-best-value-for-fill-factor-in-sql-server/
It is important to note the effect a lower fill factor value has on the underlying data pages and index pages that comprise your table:
There is now 20% more storage allocated for data pages for the same number of records!
This causes increased I/O. Depending on your Azure storage/compute plan, you may be hitting a ceiling and need to bump up your IOPS.
Now, if you are not running out of IOPS, there's more to look into. Is it possible that the index rebuild operation has not completed yet, and the index is not being used for query optimization? A Profiler trace or execution plan can confirm this.
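If you want to check both at once, something like this sketch shows the actual page density and, if needed, rebuilds back to the default fill factor (table and index names are placeholders for your own schema):

```python
# Sketch: measure how full pages are after the fill factor 80 rebuild,
# and rebuild back to the default if the extra I/O isn't worth it.
import pyodbc

conn = pyodbc.connect("...your Azure SQL connection string...")
cur = conn.cursor()

# avg_page_space_used_in_percent requires SAMPLED or DETAILED mode.
check = """
SELECT i.name, ps.avg_page_space_used_in_percent, ps.page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'),
                                      NULL, NULL, 'SAMPLED') AS ps
JOIN   sys.indexes AS i
       ON i.object_id = ps.object_id AND i.index_id = ps.index_id;
"""
for name, space_used, pages in cur.execute(check):
    print(name, space_used, pages)

# If ~80% page density is costing you reads, go back to the default:
cur.execute("ALTER INDEX ALL ON dbo.YourTable REBUILD "
            "WITH (FILLFACTOR = 100, ONLINE = ON);")
conn.commit()
```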
I'd say that if you have a very large table and want to speed things up dramatically, your best bet is partitioning on the column most commonly used to address the data.
See also: https://www.sqlshack.com/database-table-partitioning-sql-server/
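As an illustration only (the partition function, scheme, column, and boundary values are all made up for the example; in Azure SQL Database all partitions map to PRIMARY):

```python
# Sketch: range-partition a large table by a date column.
import pyodbc

conn = pyodbc.connect("...your Azure SQL connection string...")
cur = conn.cursor()

cur.execute("""
CREATE PARTITION FUNCTION pf_by_month (date)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');
""")
cur.execute("""
CREATE PARTITION SCHEME ps_by_month
AS PARTITION pf_by_month ALL TO ([PRIMARY]);
""")
# Rebuild the (hypothetical) clustered index on the scheme to move the data.
cur.execute("""
CREATE CLUSTERED INDEX cix_YourTable ON dbo.YourTable (created_date)
WITH (DROP_EXISTING = ON) ON ps_by_month (created_date);
""")
conn.commit()
```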
Try to identify queries returning large data sets to client hosts. Large result sets may lead to unnecessary network utilization and client application processing. Make sure queries return only what is needed, using filtering and aggregations, and make sure no duplicates are returned unnecessarily.
Another possible cause of that wait on Azure SQL is that the client application doesn't fetch results fast enough and doesn't notify Azure SQL that the result set has been received. On the client application side, store the results in memory first and only then do further processing. Make sure the client application is not under such stress that it is unable to fetch results faster.
One last thing: make sure Azure SQL and the application are in the same region, so there is no transfer of data across regions or zones.
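A sketch of that fetch-first pattern (the query, table, and the expensive_processing function are hypothetical):

```python
# Sketch: drain the result set off the wire quickly, then process.
# Slow per-row work between fetches keeps the server waiting on the client.
import pyodbc

conn = pyodbc.connect("...your Azure SQL connection string...")
cur = conn.cursor()
cur.execute("SELECT id, payload FROM dbo.YourTable WHERE id BETWEEN ? AND ?",
            1, 10_000)

rows = cur.fetchall()          # pull everything off the wire first
cur.close()

for row in rows:               # ...then do the slow processing locally
    expensive_processing(row)  # hypothetical function
```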
When a single Azure SQL database is upsized (S1->S3, S7->S9, P2->P4, etc.) with no other changes (no code deployment, no changes in load), the DTU percentage decreases, which is expected. What could explain the DTU percentage increasing upon moving to the higher tier and then decreasing back when downgraded to the smaller tier?
In other words, normally one would expect P2 at ~80% to become P4 at ~40%. What could explain P2 at ~80% becoming P4 at ~90% at stable load, with no code changes and no increase in db size? (The database is read- and write-heavy: updates, not many inserts.)
For example, could Query Store become busier when more DTUs are available?
Please note this is not about optimizing this database (that work is being done, but it is not part of this question).
When you increase the DTUs you also get an increase in the transaction log throughput rate.
As your database has a heavy UPDATE load, it is likely that at the lower tier your writes are being heavily throttled, with your CPU and memory overhead also being close to the max.
Increasing your DTUs gives you more CPU and memory headroom, but potentially your throughput rate is still being throttled by the high number of UPDATEs.
The formula used for avg_dtu_percent is avg_dtu_percent = MAX(avg_cpu_percent, avg_data_io_percent, avg_log_write_percent), so your DTU percentage can still appear high if only one of the resource types is being heavily used.
To track more detailed usage information, sys.dm_db_resource_stats dynamic management view (DMV) lets you view resource consumption for the last hour. The sys.resource_stats catalog view displays resource consumption for the last 14 days, but at a lower fidelity of five-minute averages.
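A sketch applying that MAX() formula to the 14-day history in sys.resource_stats (this view is queried via the master database; the connection string and database name are placeholders):

```python
# Sketch: recompute the DTU percentage per interval and see which
# dimension drove it, from the 14-day history in sys.resource_stats.
import pyodbc

conn = pyodbc.connect("...connection string pointing at the master DB...")

sql = """
SELECT start_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       (SELECT MAX(v)
        FROM (VALUES (avg_cpu_percent),
                     (avg_data_io_percent),
                     (avg_log_write_percent)) AS x(v)) AS dtu_percent
FROM   sys.resource_stats
WHERE  database_name = 'yourdb'
ORDER BY start_time DESC;
"""

for row in conn.cursor().execute(sql):
    print(row.start_time, row.dtu_percent,
          row.avg_cpu_percent, row.avg_data_io_percent,
          row.avg_log_write_percent)
```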
Part #1
As per Snowflake's pricing policy, we pay based on usage and are not charged for resources we don't use. This is clear. However, I am trying to understand: is there any chance of reducing cost by dropping unused or rarely used warehouses, or users and roles that are no longer used? I was looking for some cost savings in terms of reducing the cloud services cost.
Part #2
Which is the most cost-effective way:
1) Allocating a separate warehouse for each team, which they use at specific times
(or)
2) Allocating a single warehouse for all of them and monitoring the warehouse load closely, so that if we notice queued load on the warehouse we opt for the scale-out option (multi-cluster) (S+S)?
Please suggest the best way so that we can reduce overall cost.
There are only two major things you are charged for, disk and CPU, plus a couple of minor things like compile time and inter-region IO charges. But users, warehouses, and roles are just access control lists in the end, there to control CPU and disk usage.
Prior to per-second billing, we found that using one warehouse for a couple of teams meant less wasted CPU billing, and to some degree that is still almost the case with the minimum 60-second billing. We have a shared X-Small that most teams do dev on, and then spin up bigger warehouses to run one-off loads (and then shut them down), or have auto-scaling clusters to handle "normal load". We also use cron jobs to limit the max size, so that in off-peak times we intentionally increase the latency of the total load, shifting expenditure budget to peak times. Compared to the always-running clusters, our dev instances are single-digit percentages of the bill, so 1 or 2 dev warehouses is a rounding error.
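A sketch of that cron-job trick, assuming a warehouse named etl_wh (the account, names, sizes, and cluster counts are all illustrative):

```python
# Sketch: cap the auto-scaling warehouse off-peak, open it up for peak.
# Requires the snowflake-connector-python package; credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    role="SYSADMIN",
)
cur = conn.cursor()

def set_capacity(max_clusters: int, size: str) -> None:
    # ALTER WAREHOUSE is a real Snowflake command; the values are examples.
    cur.execute(f"ALTER WAREHOUSE etl_wh SET "
                f"MAX_CLUSTER_COUNT = {max_clusters} "
                f"WAREHOUSE_SIZE = '{size}'")

set_capacity(max_clusters=1, size="SMALL")    # off-peak: cheaper, more latency
# set_capacity(max_clusters=4, size="LARGE")  # peak: shift the budget here
```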
The way we found the most value in reducing cost was to look at the bill, see what seemed like more $$ than we expected for the bang we were getting, and then experiment to see if there were lower-cost ways to reach the same end goal. Be it differently shaped tables that we multi-inserted into, or finding queries that had long execution times or pruned lots of rows (which might lead back to the first point). If you want to save dollars, you have to watch/care how you are spending them and make trade-offs.
Part #1
The existence of multiple warehouses will not incur any cost; cost only accrues when a warehouse is used for compute. However, dropping unused objects will certainly ease the operational effort. Also, if a user exists and is not being used, it should fall under your security audit, and it is always better to disable a user than to drop it. Validate all downstream application ETL jobs/BI reports (if any) before dropping any users/roles.
Cloud services cost is an entirely different ball game; it follows the 10% rule. You only pay for cloud services when their usage exceeds 10% of the warehouse (compute) usage on that day.
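You can check how much of your cloud services usage is actually billed via the SNOWFLAKE.ACCOUNT_USAGE.METERING_DAILY_HISTORY view. A sketch (credentials are placeholders):

```python
# Sketch: see whether the 10% cloud-services adjustment leaves you with
# billed cloud-services credits on any given day.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
)
cur = conn.cursor()
cur.execute("""
    SELECT usage_date,
           credits_used_compute,
           credits_used_cloud_services,
           credits_adjustment_cloud_services,   -- the 10% waiver
           credits_billed
    FROM snowflake.account_usage.metering_daily_history
    ORDER BY usage_date DESC
    LIMIT 14
""")
for row in cur.fetchall():
    print(row)
```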
Part #2
Snowflake always suggests that warehouses be created based on activity. Please do not create warehouses to segregate teams/user groups; create users and roles for that.
What we observed
During development, keep only one virtual warehouse; until a real requirement pops up (project-team-wise segregation for cost sharing, budgeting, or credit assessment), there is no need to create multiple warehouses.
Even for prod, activity-wise segregation is ideal: one warehouse each for ETL load, BI reporting, and the data analytics team.
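A sketch of that layout, with auto-suspend so idle warehouses cost nothing (names and sizes are illustrative):

```python
# Sketch: one warehouse per activity, auto-suspended when idle.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    role="SYSADMIN",
)
cur = conn.cursor()

for name, size in [("etl_wh", "MEDIUM"), ("bi_wh", "SMALL"),
                   ("analytics_wh", "SMALL")]:
    cur.execute(f"""
        CREATE WAREHOUSE IF NOT EXISTS {name}
          WAREHOUSE_SIZE = '{size}'
          AUTO_SUSPEND = 60          -- seconds idle before suspending
          AUTO_RESUME = TRUE
          INITIALLY_SUSPENDED = TRUE
    """)
```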
Thanks
Palash Chatterjee
We have an Azure SQL database. Up until a few weeks ago, we were set at 10 DTUs (S0). Recently, we've gotten more SQL timeout errors, prompting us to increase our DTUs to 50 (S2). We get the errors less frequently, but still on occasion. When we get these timeouts, we see spikes on the Resource graph hitting 100%. Drilling into that, it's generally Data I/O operations that are making it spike. But when we check Query Performance Insight, none of the listed queries show that they're using that much resources.
Another thing to note is that our database has grown steadily in size. It is now about 19 GB, and the majority (18 GB) of that comes from one table that has a lot of long JSON strings in it. The timeout errors generally do happen on a certain query that has several joins, but they do not interact with the table with the long strings.
We tested making a copy of the database and removing all the long strings: the copy did not get any timeouts at 10 DTU, and its load times were the same as the full database (with all the long strings) running at 50 DTU.
We have rebuilt our indexes and, though it helped, we continue to experience timeout errors.
Given that the query that gets timeouts is not touching the table with long strings, could the table with long strings still be the culprit for DTU usage? Would it have to do with SQL caching? Could the long strings be hogging the cache and causing a lot of data I/O? (They are accessed fairly frequently too.)
The strings can definitely exhaust your cache budget if they are hot (as you say they are). When the hot working set exceeds the RAM cache size, performance can fall off a cliff (10-100x), because IO is 10-1000x slower than RAM access. This means that even a tiny decrease in cache hit ratio (say 1%) can multiply into a big performance loss.
This cliff can be very steep. One moment the app is fine, the next moment IO is off the charts.
Since Azure SQL Database has strict resource limits (as I hear and read), this can quickly exhaust the performance that you bought and bring on throttling.
I think the test you made kind of confirms that the strings are causing the problem. Can you try to segregate the strings somewhere else? If they are cold move them to another table. If they are hot move them to another datastore (database or NoSQL). That way you can likely move back to a much lower tier.
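To confirm the strings are hogging the cache before you restructure anything, you can ask the buffer pool directly. A sketch using sys.dm_os_buffer_descriptors (the connection string is a placeholder; the query is the classic cache-per-table breakdown, not anything specific to your schema):

```python
# Sketch: how much of the buffer pool does each table occupy?
# Each buffer descriptor is one 8 KB page, hence the * 8 / 1024 for MB.
import pyodbc

conn = pyodbc.connect("...your Azure SQL connection string...")

sql = """
SELECT o.name AS table_name,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM   sys.dm_os_buffer_descriptors AS bd
JOIN   sys.allocation_units AS au
       ON au.allocation_unit_id = bd.allocation_unit_id
JOIN   sys.partitions AS p
       ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
       OR (au.type = 2       AND au.container_id = p.partition_id)
JOIN   sys.objects AS o ON o.object_id = p.object_id
WHERE  bd.database_id = DB_ID() AND o.is_ms_shipped = 0
GROUP BY o.name
ORDER BY cached_mb DESC;
"""

for row in conn.cursor().execute(sql):
    print(f"{row.table_name}: {row.cached_mb} MB in cache")
```

If the JSON table dominates this list, that supports moving those strings out.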
I'm building a web service, consisting of many different components, all of which could conceivably be bottlenecks. I'm currently trying to figure out what metrics I should be looking for, when deciding whether or not my database (on AWS RDS) is the bottleneck in the chain.
Looking at AWS Cloudwatch, I see a number of RDS metrics given. Full list:
CPUCreditBalance
CPUCreditUsage
CPUUtilization
DatabaseConnections
DiskQueueDepth
FreeStorageSpace
FreeableMemory
NetworkReceiveThroughput
NetworkTransmitThroughput
ReadIOPS
ReadLatency
ReadThroughput
SwapUsage
WriteIOPS
WriteLatency
WriteThroughput
The key metrics that I think I should be paying attention to:
Read/Write Latency
CPU-Utilization
Freeable Memory
With the latency metrics, I'm thinking that I should set up alerts if latency exceeds 300 ms (for fast website responsiveness), though I recognize that this is very much workload dependent.
With CPU/memory utilization, I have no idea what numbers to set these to. I'm thinking I should set an alert at 75% CPU utilization, and at a 75% drop in Freeable Memory.
Am I on the right track with the metrics I've shortlisted above, and the thresholds I have guessed? Are there any other metrics I should be paying attention to?
The answer is totally dependent on your application. Some applications will require more CPU, some will need more RAM. There is no definitive answer.
The best thing is to monitor your database (with the metrics you list above). Then, when performance is below desired, take a look at which metrics are showing problems. These should be the first ones you track for scaling your database.
The key idea is that if your customers are experiencing problems, it should appear in your metrics somewhere. If it doesn't, then you're not collecting sufficient metrics.
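As a starting point, here is a sketch of alarms on two of the shortlisted metrics, using the thresholds guessed in the question (the instance identifier and region are placeholders; a FreeableMemory alarm would be the same call with ComparisonOperator="LessThanThreshold" and a byte threshold):

```python
# Sketch: CloudWatch alarms for the shortlisted RDS metrics.
# Requires boto3 and AWS credentials; values are illustrative.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

alarms = [
    ("rds-read-latency", "ReadLatency",    0.3),   # seconds, i.e. ~300 ms
    ("rds-cpu-high",     "CPUUtilization", 75.0),  # percent
]
for name, metric, threshold in alarms:
    cw.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier",
                     "Value": "your-db-instance"}],
        Statistic="Average",
        Period=300,              # 5-minute windows
        EvaluationPeriods=3,     # must breach 3 windows in a row
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
    )
```

Requiring several consecutive breaching periods avoids paging on short spikes that users never notice.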
I think you are on the right track, especially with the latency metrics; for a typical application with a database back-end, the read/write latency is what the user notices most if it degrades. Sure, the memory or CPU usage may spike, but does any user care? No, not unless it then causes the latency to go up.
I'd start with the metrics you listed as the low-hanging fruit and adjust accordingly.