SQL Azure database DTU increases after upsizing with no other changes - sql-server

When a single Azure SQL database is upsized (S1->S3, S7->S9, P2->P4, etc.) with no other changes (no code deployment, no change in load), the DTU percentage normally decreases, which is expected. What could explain the DTU percentage increasing after the move to the higher tier and then decreasing again when the database is downgraded to the smaller tier?
In other words, one would normally expect P2 @ ~80% to become P4 @ ~40%. What could explain P2 @ ~80% becoming P4 @ ~90% at stable load, with no code changes and no increase in DB size? (The database is read- and write-heavy, mostly updates rather than inserts.)
For example, could Query Store become busier when more DTUs are available?
Please note this behaviour was not observed after optimizing the database (optimization work is being done, but it is not part of this question).

When you increase the DTUs you also get an increase in the transaction log throughput rate.
As your database has a heavy UPDATE load, it is likely that at the lower tier your writes are being heavily throttled, while your CPU and memory usage are also close to their maximums.
Increasing your DTUs gives you more CPU and memory headroom, but your log throughput rate is potentially still being throttled by the high number of UPDATEs.
The formula used is avg_dtu_percent = MAX(avg_cpu_percent, avg_data_io_percent, avg_log_write_percent), so the DTU percentage can still appear high if only one of the resource types is being heavily used.
To track more detailed usage information, sys.dm_db_resource_stats dynamic management view (DMV) lets you view resource consumption for the last hour. The sys.resource_stats catalog view displays resource consumption for the last 14 days, but at a lower fidelity of five-minute averages.
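For example, a minimal sketch of that formula against sys.dm_db_resource_stats (15-second samples covering roughly the last hour), showing which dimension is actually driving the reported DTU percentage:
-- Sketch: recompute the DTU percentage per sample alongside its contributing dimensions
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       (SELECT MAX(v)
        FROM (VALUES (avg_cpu_percent),
                     (avg_data_io_percent),
                     (avg_log_write_percent)) AS d(v)) AS approx_dtu_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;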

Related

Clickhouse DB slows down on a daily basis at 10am for seemingly no reason

I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates and prefer using things like the ReplacingMergeTree engine). I also leverage the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server slows down significantly every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes slightly up, so does BackgroundSchedulePoolTask (which I found weird, because I thought this pool was dedicated to distributed operations - which I don't use) - both numbers remain seemingly reasonable
The number of active merge tasks per minute drops significantly, but I'm unsure whether that is a consequence of the slow writing or the cause of it
both insert and general querying time are multiplied by ~10 which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (data extraction and transformation, etc.) happens earlier in the morning (and ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background tasks or process that is run by Clickhouse on a daily basis and that could have a high impact on our disk capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer ?
Do you use TTL ?
select * from system.merges;
select * from system.part_log where event_date = today() and event_time between toDateTime(today()) + interval 10 hour and toDateTime(today()) + interval 11 hour;
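If you are not sure whether any table defines a TTL, a minimal sketch is to search the table definitions (it just greps the CREATE statements for the word TTL, so verify hits manually):
select database, name
from system.tables
where create_table_query ilike '%ttl%';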

Thoughts on reducing Snowflake cost by dropping unused objects

Part #1
As per Snowflake's pricing policy, we pay based on usage and are not charged for resources we don't use. This is clear. However, I am trying to understand whether there is any chance of reducing cost if we drop warehouses that are unused or rarely used, and users and roles that are no longer in use. I was looking for cost savings in terms of reducing the cloud services cost.
Part #2
Which is the more cost-effective approach:
1) Allocating a separate warehouse to each team, used at specific times,
(or)
2) Allocating a single warehouse for all of them and monitoring the warehouse load closely, so that if we notice queued load on the warehouse we opt for the scale-out (multi-cluster) option?
Please suggest the best way to reduce overall cost.
There are only two major things you are charged for, disk and CPU, plus a couple of minor things like compile time and inter-region IO charges. Users, warehouses, and roles are, in the end, just access control lists that govern CPU and disk usage.
Prior to per-second billing we found that using one warehouse for a couple of teams meant less wasted CPU billing, and to some degree that is still almost the case with the minimum 60-second billing. We have a shared x-small that most teams use for dev, and then spin up bigger warehouses to run one-off loads (and shut them down afterwards), or use auto-scaling clusters to handle the "normal load". We also use cron jobs to limit the "max size", so that in off-peak times we intentionally increase the latency of the total load and shift the expenditure budget to peak times. Compared to the always-running clusters, our dev instances are single-digit percentages, so 1 or 2 warehouses is a rounding error.
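As a rough sketch of the kind of shared, auto-suspending, auto-scaling warehouse described above (the name and limits here are invented examples, not recommendations):
CREATE WAREHOUSE IF NOT EXISTS shared_dev_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60            -- suspend after 60 seconds idle so no credits burn while nobody is querying
  AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3        -- scale out only when queries start to queue
  SCALING_POLICY = 'STANDARD';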
The way we found the most value in reducing cost was to look at the bill, see what seemed like more dollars than we expected for the bang we were getting, and then experiment to see if there were lower-cost ways to reach the same end goal. Be it differently shaped tables that we multi-inserted into, or finding queries that had long execution times or pruned lots of rows (which might lead back to the first point). If you want to save dollars you have to watch and care how you are spending them, and make trade-offs.
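For the "look at the bill" step, one hedged starting point (assuming access to the SNOWFLAKE.ACCOUNT_USAGE share, which lags actual usage by a few hours) is the per-warehouse metering view:
-- Sketch: credits consumed per warehouse over the last 30 days
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_30_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_30_days DESC;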
Part #1
Having multiple warehouses does not in itself incur any cost; cost only arises when a warehouse is actually used for compute. However, dropping unused objects will certainly ease the operational effort. Also, if a user exists but is no longer used, it should fall under your security audit, and it is always better to disable a user than to drop it. Validate all downstream ETL jobs/BI reports (if any) before dropping any users or roles.
Cloud services cost is an entirely different ball game; it follows the 10% rule. You only pay for cloud services on a given day when cloud services usage exceeds 10% of the warehouse (compute) usage for that day.
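A hedged way to check whether that 10% threshold is ever crossed (again assuming ACCOUNT_USAGE access) is the daily metering view:
-- Sketch: daily compute vs. cloud services credits; only the portion of cloud
-- services above 10% of daily compute usage ends up being billed
SELECT usage_date,
       SUM(credits_used_compute)              AS compute_credits,
       SUM(credits_used_cloud_services)       AS cloud_services_credits,
       SUM(credits_adjustment_cloud_services) AS cloud_services_adjustment
FROM snowflake.account_usage.metering_daily_history
GROUP BY usage_date
ORDER BY usage_date DESC;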
Part #2
Snowflake always suggests that warehouses be created based on your activity. Please do not create warehouses just to segregate teams/user groups; create users and roles for that.
What we observed:
During development, keep only one virtual warehouse; until a real requirement pops up (project/team-wise segregation for cost sharing, budgeting or credit assessment), there is no need to create multiple warehouses.
Even for prod, activity-wise segregation is ideal: one warehouse each for ETL load, BI reporting and the data analytics team.
Thanks
Palash Chatterjee

Azure SQL database DTUs maxing out - due to large database?

We have an Azure SQL database. Up until a few weeks ago, we were set at 10 DTUs (S0). Recently, we've gotten more SQL timeout errors, prompting us to increase our DTUs to 50 (S2). We get the errors less frequently, but still on occasion. When we get these timeouts, we see spikes on the Resource graph hitting 100%. Drilling into that, it's generally Data I/O operations that are making it spike. But when we check Query Performance Insight, none of the listed queries show that they're using that much resources.
Another thing to note is that our database has grown steadily in size. It is now about 19 GB, and the majority (18 GB) of that comes from one table that has a lot of long JSON strings in it. The timeout errors generally do happen on a certain query that has several joins, but they do not interact with the table with the long strings.
We tested making a copy of the database and removing all the long strings, and it did not get any timeouts at 10 DTU, but performed the same as the database with all the long strings at 50 DTU as far as load times.
We have rebuilt our indexes and, though it helped, we continue to experience timeout errors.
Given that the query that gets timeouts is not touching the table with long strings, could the table with long strings still be the culprit for DTU usage? Would it have to do with SQL caching? Could the long strings be hogging the cache and causing a lot of data I/O? (They are accessed fairly frequently too.)
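One hedged way to check this directly is to ask the buffer pool which objects its cached pages belong to, using the standard buffer descriptor DMVs (a sketch; run it in the database in question):
-- Sketch: buffer cache usage per object in the current database
SELECT OBJECT_NAME(p.object_id) AS object_name,
       COUNT(*) * 8 / 1024      AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
  ON bd.allocation_unit_id = au.allocation_unit_id
JOIN sys.partitions AS p
  ON (au.container_id = p.hobt_id      AND au.type IN (1, 3))   -- in-row / row-overflow data
  OR (au.container_id = p.partition_id AND au.type = 2)         -- LOB data (long strings)
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id
ORDER BY cached_mb DESC;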
The strings can definitely exhaust your cache budget if they are hot (as you say they are). When the hot working set exceeds RAM cache size performance can fall off a cliff (10-100x). That's because IO is 10-1000x slower than RAM access. This means that even a tiny decrease in cache hit ratio (such as 1%) can multiply into a big performance loss.
This cliff can be very steep. One moment the app is fine, the next moment IO is off the charts.
Since Azure SQL Database has strict resource limits (as I hear and read), this can quickly exhaust the performance that you bought and bring on throttling.
I think the test you made kind of confirms that the strings are causing the problem. Can you try to segregate the strings somewhere else? If they are cold move them to another table. If they are hot move them to another datastore (database or NoSQL). That way you can likely move back to a much lower tier.
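A minimal sketch of the "move them to another table" option, with invented table and column names since the real schema isn't shown here:
-- Hypothetical side table holding only the long JSON payloads
CREATE TABLE dbo.DocumentPayload
(
    DocumentId int           NOT NULL
        CONSTRAINT PK_DocumentPayload PRIMARY KEY,
    JsonBody   nvarchar(max) NOT NULL,
    CONSTRAINT FK_DocumentPayload_Document
        FOREIGN KEY (DocumentId) REFERENCES dbo.Document (DocumentId)
);
-- Queries that never need the JSON no longer pull those pages into the buffer
-- cache, so the hot working set of the main table stays much smaller.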

Azure SQL DB - Resources Governance

We use Azure SQL Database, currently on the S3 tier, and we have a problem with one of our services, which pushes data to the database. It is called very often, and most of the time our DTU usage is above 95%. We have already optimized what we could, but basically there are too many DB hits. We are working on other optimizations, caching, etc.
The problem is that this DB is also used by our other application, and because the DTUs are consumed by the other service, we have performance issues.
I was wondering whether there is a way to somehow set a maximum DTU limit for one SQL user, e.g. 30%.
I was trying to google it, but couldn't find anything related to this topic.
Thanks a lot for the answers or suggestions
There is currently no way in SQL Database to limit resources on a per-query / per-client basis. Which resource dimension are you maxing out (CPU, reads, writes)? If you cannot optimize further you might have to bite the bullet and scale up. If you are IO-bound, then switching to P1 will help you. If you are CPU-bound you might have to go up to P2.
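To answer the "which dimension" question, a quick sketch against the resource stats DMV (15-second samples covering roughly the last hour):
SELECT MAX(avg_cpu_percent)       AS max_cpu_percent,
       MAX(avg_data_io_percent)   AS max_data_io_percent,
       MAX(avg_log_write_percent) AS max_log_write_percent
FROM sys.dm_db_resource_stats;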
The % is always relative to an S2 tier DB. If you are at 95, it means your DTU usage is at 95% of what an S2 tier DB allows. In this case you are close to 100%, so you will probably soon need a tier larger than S2. You are using S3, so you have the right tier.
azure-sql-database-introduces-new-near-real-time-performance-metrics
For example, if your DTU consumption shows a value of 80%, it indicates it is consuming DTU at the rate of 80% of the limit an S2 database would have. If you see values greater than 100% in this view, it means that you need a performance tier larger than S2.
As an example, let's say you see a percentage value of 300%. This tells you that you are using three times more resources than would be available in an S2. To determine a reasonable starting size, compare the DTUs available in an S2 (50 DTUs) with the next higher sizes (P1 = 100 DTUs, or 200% of S2; P2 = 200 DTUs, or 400% of S2). Because you are at 300% of S2, you would want to start with a P2 and re-test.
Based on the DTU usage percent you can determine whether your database can fit within the S2 performance level (or a lower/higher level, as indicated by the DTU percentage and the relative DTU power of the various performance tiers as documented on the MSDN site).
When you have locking problems, you need to find the queries that lock the db and rewrite them. Scaling to a larger db tier will only help a little, and giving the application that causes the problems less db performance will only extend the lock times.
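To find those queries while the blocking is actually happening, one common sketch is to look at requests that report a blocking session:
-- Sketch: currently blocked requests and the statements involved
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;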

Estimating IOPS requirements of a production SQL Server system

We're working on an application that's going to serve thousands of users daily (90% of them will be active during the working hours, using the system constantly during their workday). The main purpose of the system is to query multiple databases and combine the information from the databases into a single response to the user. Depending on the user input, our query load could be around 500 queries per second for a system with 1000 users. 80% of those queries are read queries.
Now, I did some profiling using the SQL Server Profiler tool and I get on average ~300 logical reads for the read queries (I did not bother with the write queries yet). That would amount to 150k logical reads per second for 1k users. Full production system is expected to have ~10k users.
How do I estimate actual read requirement on the storage for those databases? I am pretty sure that actual physical reads will amount to much less than that, but how do I estimate that? Of course, I can't do an actual run in the production environment as the production environment is not there yet, and I need to tell the hardware guys how much IOPS we're going to need for the system so that they know what to buy.
I tried the HP sizing tool suggested in the previous answers, but it only suggests HP products, without actual performance estimates. Any insight is appreciated.
EDIT: The main read-only dataset (where most of the queries will go) is a couple of gigabytes (on the order of 4 GB) on disk. This will probably significantly affect the logical vs. physical read ratio. Any insight on how to estimate this ratio?
Disk I/O demand varies tremendously based on many factors, including:
How much data is already in RAM
Structure of your schema (indexes, row width, data types, triggers, etc)
Nature of your queries (joins, multiple single-row vs. row range, etc)
Data access methodology (ORM vs. set-oriented, single command vs. batching)
Ratio of reads vs. writes
Disk (database, table, index) fragmentation status
Use of SSDs vs. rotating media
For those reasons, the best way to estimate production disk load is usually by building a small prototype and benchmarking it. Use a copy of production data if you can; otherwise, use a data generation tool to build a similarly sized DB.
With the sample data in place, build a simple benchmark app that produces a mix of the types of queries you're expecting. Scale memory size if you need to.
Measure the results with Windows performance counters. The most useful stats are for the Physical Disk: time per transfer, transfers per second, queue depth, etc.
You can then apply some heuristics (also known as "experience") to those results and extrapolate them to a first-cut estimate for production I/O requirements.
If you absolutely can't build a prototype, then it's possible to make some educated guesses based on initial measurements, but it still takes work. For starters, turn on statistics:
SET STATISTICS IO ON
Before you run a test query, clear the RAM cache:
CHECKPOINT
DBCC DROPCLEANBUFFERS
Then, run your query, and look at physical reads + read-ahead reads to see the physical disk I/O demand. Repeat in some mix without clearing the RAM cache first to get an idea of how much caching will help.
Having said that, I would recommend against using IOPS alone as a target. I realize that SAN vendors and IT managers seem to love IOPS, but they are a very misleading measure of disk subsystem performance. As an example, there can be a 40:1 difference in deliverable IOPS when you switch from sequential I/O to random.
You certainly cannot derive your estimates from logical reads. This counter really is not that helpful because it is often unclear how much of it is physical and also the CPU cost of each of these accesses is unknown. I do not look at this number at all.
You need to gather virtual file stats which will show you the physical IO. For example: http://sqlserverio.com/2011/02/08/gather-virtual-file-statistics-using-t-sql-tsql2sday-15/
Google for "virtual file stats sql server".
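A minimal sketch of that approach, using the built-in DMV (the counters are cumulative since instance start, so sample twice and diff the numbers if you want a rate):
-- Sketch: physical I/O per database file since the instance started
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.num_of_reads,
       vfs.num_of_bytes_read,
       vfs.io_stall_read_ms,
       vfs.num_of_writes,
       vfs.num_of_bytes_written,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY vfs.num_of_reads DESC;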
Please note that you can only extrapolate IOs from the user count if you assume that cache hit ratio of the buffer pool will stay the same. Estimating this is much harder. Basically you need to estimate the working set of pages you will have under full load.
If you can ensure that your buffer pool can always take all hot data you can basically live without any reads. Then you only have to scale writes (for example with an SSD drive).
