We have an Azure SQL database. Up until a few weeks ago, we were set at 10 DTUs (S0). Recently, we've gotten more SQL timeout errors, prompting us to increase our DTUs to 50 (S2). We get the errors less frequently, but still on occasion. When we get these timeouts, we see spikes on the Resource graph hitting 100%. Drilling into that, it's generally Data I/O operations that are making it spike. But when we check Query Performance Insight, none of the listed queries appear to be using anywhere near that many resources.
Another thing to note is that our database has grown steadily in size. It is now about 19 GB, and the majority (18 GB) of that comes from one table that holds a lot of long JSON strings. The timeout errors generally happen on a certain query that has several joins, but that query does not touch the table with the long strings.
As a test, we made a copy of the database and removed all the long strings. The copy did not get any timeouts at 10 DTUs, and its load times matched those of the full database running at 50 DTUs.
We have rebuilt our indexes and, though it helped, we continue to experience timeout errors.
Given that the query that gets timeouts is not touching the table with long strings, could the table with long strings still be the culprit for DTU usage? Would it have to do with SQL caching? Could the long strings be hogging the cache and causing a lot of data I/O? (They are accessed fairly frequently too.)
The strings can definitely exhaust your cache budget if they are hot (as you say they are). When the hot working set exceeds the RAM cache size, performance can fall off a cliff (10-100x). That's because IO is 10-1000x slower than RAM access. This means that even a tiny decrease in cache hit ratio (such as 1%) can multiply into a big performance loss.
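To put rough numbers on that (just an illustration, not your actual ratios): if a cache miss costs 100x a RAM hit, then at a 99% hit ratio the average access costs about 0.99 + 0.01*100 ≈ 2 units, while at 98% it costs about 0.98 + 0.02*100 ≈ 3 units. Losing a single percentage point of cache hits made everything roughly 50% slower.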
This cliff can be very steep. One moment the app is fine, the next moment IO is off the charts.
Since Azure SQL Database has strict resource limits (as I hear and read), this can quickly exhaust the performance that you bought and bring on throttling.
I think the test you made kind of confirms that the strings are causing the problem. Can you try to segregate the strings somewhere else? If they are cold, move them to another table. If they are hot, move them to another datastore (a separate database or a NoSQL store). That way you can likely move back to a much lower tier.
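If you want to confirm it before refactoring, you can ask SQL Server which tables are occupying the buffer cache. Something like this sketch (a standard DMV query; object names come from your own database) should show whether the JSON table dominates:

SELECT o.name AS table_name,
       COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
  ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
  OR (au.type = 2 AND au.container_id = p.partition_id)
JOIN sys.objects AS o ON o.object_id = p.object_id
WHERE bd.database_id = DB_ID()
GROUP BY o.name
ORDER BY cached_mb DESC;
-- If the table with the long JSON strings sits at the top with most of the cache,
-- that supports the theory that it is evicting the pages your joining query needs.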
I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates and prefer using things like the ReplacingMergeTree engine). I also leverage the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes slightly up, so does BackgroundSchedulePoolTask (which I found weird, because I thought this pool was dedicated to distributed operations - which I don't use) - both numbers remain seemingly reasonable
The number of active Merge tasks per minute drops significantly, but I'm unsure whether it's a consequence of slow writes or a cause of them
Both insert and general query times are multiplied by ~10, which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (data extraction, transformation, etc.) happens earlier in the morning (and ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background task or process that Clickhouse runs on a daily basis that could have a high impact on our disk capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer?
Do you use TTL?
select * from system.merges;
select * from system.part_log where event_time between ... and ... -- around the 10am window
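If it helps, here is a slightly fuller version of that part_log check (just a sketch; the timestamps are placeholders, substitute the actual day and window of an incident):

SELECT event_type, database, table,
       count() AS parts,
       sum(rows) AS total_rows,
       sum(size_in_bytes) AS total_bytes,
       max(duration_ms) AS max_duration_ms
FROM system.part_log
WHERE event_time BETWEEN toDateTime('2022-01-01 09:30:00')  -- placeholder start
                     AND toDateTime('2022-01-01 10:30:00')  -- placeholder end
GROUP BY event_type, database, table
ORDER BY total_bytes DESC
LIMIT 20;

That should show which tables and which kinds of part operations (new parts, merges, mutations) account for the bytes written around the time disk utilization spikes.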
I have a web service sitting on IIS that has been quite happy for months but now I'm getting timeouts and I don't know how to diagnose what the problem is.
The client sends up basic information in a 'heartbeat' message to IIS, which then updates this in a SQL database (on a different server). There are 250 clients in the wild, all sending up their heartbeat every 5 minutes ... so there are only 250 rows in the table, with appropriate indexing on the column being used for the update.
Ordinarily it only takes 50-100ms to do the update, but since last week you can see in the IIS log that the response time has increased, and I'm also getting timeouts.
Nothing has changed with the setup so I don't know what I'm looking for to determine the reason. The error I get back is:
System.ServiceModel.FaultException: An error occurred while updating the entries. See the inner exception for details.
An error occurred while updating the entries. See the inner exception for details.
Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated.
The wait operation timed out
Any advice on where to start looking? I did enable the failed request log trace in IIS, but I don't know what it all means, if I'm perfectly honest. The difference between a successful request and a failed one is that the request log stops after the 'AspNetStart' entry.
Thanks!
Mark
There are lots of reasons a service can gradually or suddenly become slow. Poor code structure can lead to things like memory leaks on the server, small enough that they don't really show up or cause problems during testing, but which start to stack up when run over weeks/months. Unauthorized requests could be targeting your server if this is a public-facing service, or has a link to public-facing services.
Things to look at:
Does this happen at certain times of the day or throughout the day?
Is this a load issue that starts occurring when multiple users are sending updates concurrently? 250 users isn't a lot. Has the # of users grown over the last few months or has it been relatively stable since the start?
What is the memory and CPU usage looking like on the Web server(s) and DB server?
This is the first clue to check to see if either server is under considerable load. From there you can investigate why it might be under load or if it possibly needs a bit more grunt to deal with the load. Look at the running processes. If these servers are managed by an IT department or the like, some culprits can include things like virus scanners hogging resources. (I.e. policy changes in the last few months have led to additional load on the servers.)
What recovery model is your database set up for?
What is the size of your Tx Log (.ldf file)?
Do you have a regular scheduled database backup and index maintenance?
This is one that new projects tend to forget. An empty database is small and has no Tx Log history being recorded, but as it runs over time that Tx Log grows silently in the background, especially with Full recovery. Larger Tx Logs can lead to slower performance over time, especially if the log file needs to be enlarged. A good thing to check is whether the log file is set to grow by a number of bytes or by a percentage. Percentage is, I believe, the default, but it can cause exponential "grow" time/space issues, so it's better to set it to a fixed size per grow. You'll want regular backups being done that allow the Tx Log to reset. Ideally, don't shrink the file if the log size between backups stays consistent.
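If you want to check this quickly, something along these lines works (a sketch; the database and file names at the end are placeholders for your own):

-- Run in the application database: current file sizes and growth settings
SELECT name, type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + ' percent'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting
FROM sys.database_files;

-- Hypothetical names: switch the log file from percentage growth to a fixed 256 MB step
ALTER DATABASE MyAppDb
MODIFY FILE (NAME = MyAppDb_log, FILEGROWTH = 256MB);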
How many records across all tables are being inserted or updated in a given day?
This is important to build a picture of how much the database will be tracking through the day between backups. You may have 250 clients, but every heartbeat is potentially updating a row and inserting others.
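A rough way to gauge this without adding any logging is the index usage DMV, which counts read and write operations per table since the last SQL Server restart (a sketch; the counters reset on restart, so divide by uptime for a per-day figure):

SELECT OBJECT_NAME(s.object_id) AS table_name,
       SUM(s.user_updates) AS writes,
       SUM(s.user_seeks + s.user_scans + s.user_lookups) AS reads
FROM sys.dm_db_index_usage_stats AS s
WHERE s.database_id = DB_ID()
GROUP BY s.object_id
ORDER BY writes DESC;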
What are you using for PKs for inserted records? (Ints vs. UUIDs) If using UUIDs, are you using NEWSEQUENTIALID() or NEWID()/Guid.NewGuid()?
GUIDs can be a time bomb for indexing if done poorly. A GUID PK combined with NEWID() or Guid.NewGuid() will lead to considerable index fragmentation when inserting rows. Provided the GUIDs are not visible to clients, you should use NEWSEQUENTIALID(). If IDs are set via code then there are implementations you can find to generate sequential GUIDs. (It's a matter of re-arranging the parts that make up the GUID.) Regular index maintenance is a requirement when using UUID columns in indexed fields.
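For example, a minimal sketch of the pattern (the table and column names here are made up): NEWSEQUENTIALID() can only be used as a column default, and it keeps new rows appending to the end of the clustered index instead of splitting pages all over it.

CREATE TABLE dbo.ClientHeartbeat
(
    HeartbeatId uniqueidentifier NOT NULL
        CONSTRAINT DF_ClientHeartbeat_Id DEFAULT NEWSEQUENTIALID(),
    ClientId    int       NOT NULL,
    ReportedAt  datetime2 NOT NULL,
    CONSTRAINT PK_ClientHeartbeat PRIMARY KEY CLUSTERED (HeartbeatId)
);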
Are you using Dependency Injection in your web service?
What is the lifetime scope of the DbContexts performing the updates?
This is a potential time bomb for web servers if the lifetime scope for a DbContext is set up incorrectly. You want a DbContext to be alive for no longer than it is needed. At a maximum the lifetime scope should be set to PerRequest. A DbContext set up for Singleton for instance would be tracking entities across requests. The more entities a DbContext is tracking, the slower read and update operations become. This would be a possible culprit if the web server memory usage is climbing.
Are you running an SQL Profiler?
In a test environment with nothing else touching the database, running scenarios through the application with an SQL Profiler can reveal potential issues such as unexpected queries being kicked off due to things like lazy loading. For one operation you might expect one or a small number of queries to be run, only to find dozens or even hundreds. Multiply this across concurrent requests and you have a recipe for the database server to say "Just sit down and wait, dammit!" :) Any queries you don't expect based on the code that is running should be investigated for either eager loading relationships or implementing projection. (Recommended for best performance)
Do the web servers get restarted periodically?
For some tricky to debug issues and memory leaks, sometimes the easiest "fix" is to schedule regular restarts of the web server. It's a hack, but compared to the considerable cost of trying to track down memory leaks or fix up inefficient code that slows down over time, it is a cheap and effective fix. (At least while you do research options to address the issues and optimize the code)
That should give you a start into things to check with the service & database.
A bit of background: We have 17 different TempDB database files and 6 TempDB log files on the server. These are spread out on different drives, but are hosted on 2 drive arrays.
I’m seeing Disk IO response times exceeding the recommended limits. Typically you want your disks to respond in 5-10ms, with nothing going over 200ms. We’re seeing random spikes up to 800ms on the TempDB files, but only on one drive array.
Proposed solution: Restart SQL server. While SQL server is shut down, reboot the drive array hosting the majority of the TempDB files. In addition, while SQL is down, redo the network connection to bypass the network switch in an attempt to eliminate any source of slowness on the hardware.
Is this a good idea or a shot in the dark? Any ideas?
Thanks in advance.
17? Who came up with that number? Please read this and this - very few scenarios where > 8 files will help, particularly if you only have 2 underlying arrays/controllers. Some suggestions:
Use an even number of files. Most folks start with 4 or 8, and increase beyond that only when they've proven that they still have contention (and also know that their underlying I/O can actually handle more files and scale with them; in some cases it will have no effect or the exact opposite effect - a different drive letter does not necessarily mean better I/O pathing).
Make sure all of the data files are sized the same and have identical autogrow settings. Having 17 files with different sizes and autogrowth settings will defeat round robin - in a lot of cases only one file will be used due to the way SQL Server performs proportional fill (there's a quick sizing sketch after these suggestions). And having an odd number just seems... well, odd to me.
Get rid of your 5 extra log files. They are absolutely useless.
Use trace flag 1117 to make sure that all the data files grow at the same time and (because of 2.) at the same rate. Note though that this trace flag applies to all databases, not just tempdb. More info here.
You can also consider trace flag 1118 to change allocation, but please read this first.
Make sure instant file initialization is on, so that the file doesn't have to be zeroed out when it expands.
Pre-size your tempdb files so that they don't have to grow during normal day-to-day activity. Do not shrink tempdb files because they suddenly got big - this is just a rinse and repeat operation, since if they got that big once, they'll get that big again. It's not like you can lease out the recovered space in the meantime.
When possible, perform DBCC CHECKDB elsewhere. If you're running CHECKDB regularly, yay! Pat yourself on the back. However this can take a toll on tempdb - please see this article on optimizing this operation and pulling it away from your production instance where feasible.
Finally, validate what type of contention you're seeing. You say that tempdb performance crawls, but in what way? How are you measuring this? Some info on determining the exact nature of tempdb bottlenecks here and here and here and here and here; a quick sketch for spotting allocation-latch contention also follows below.
Have you considered making less use of tempdb explicitly (fewer #temp tables, @table variables, and static cursors - or cursors altogether)? Are you making heavy use of RCSI, or MARS, or LOB-type local variables?
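On the sizing point, here is a quick sketch for checking and evening out the tempdb files (the logical file names and sizes are placeholders; pick values that fit your workload and storage):

-- What do the current tempdb files look like?
SELECT name, type_desc,
       size * 8 / 1024 AS size_mb,
       growth, is_percent_growth
FROM tempdb.sys.database_files;

-- Hypothetical target: every data file the same size with the same fixed growth
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev,  SIZE = 4096MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev2, SIZE = 4096MB, FILEGROWTH = 512MB);
-- ...repeat for each remaining data file, and drop the extra log files, e.g.:
ALTER DATABASE tempdb REMOVE FILE templog2;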
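And for validating contention, the usual check is to look for allocation-page latch waits in tempdb (database_id 2) while things are slow - a sketch:

SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
  AND resource_description LIKE '2:%';
-- Lots of waits on pages like 2:1:1 / 2:1:2 / 2:1:3 (PFS/GAM/SGAM) point to allocation
-- contention (where more files can help); waits elsewhere point more toward plain I/O pressure.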
When improving database performance, what is the fastest way?
Should I go straight to my target (5000 users)?
Or should I go through artificial milestones (for example, first 1000 users, only then 2500, and so on)?
We have a performance problem. For the purposes of this question, suppose the only bottleneck is the database and all other resources consume zero time.
Facts:
Database performance for 50 concurrent users is good
Database performance for more than 50 is not acceptable, by which I mean database requests take too long. Instead of ~0.5 sec they usually take between ~2 and ~7 sec, and sometimes even ~30 sec
At about 70 users, the system stops working altogether; most database requests take more than ~30 sec
The database is implemented in SQL Server 2005/2008
Requirement:
The application needs to support 5000 concurrent users.
From my understanding, and please correct me if I'm wrong or something is missing, the best practice for database performance improvement is:
Define performance objective targets.
Establish a testing environment as close as possible to the production environment
Check whether the performance targets are met; if not, do the following until they are:
3.1 Improve performance (one change at a time) via server re-configuration, index improvements, code improvements, database structure improvements, and more, using tracing and monitoring tools such as performance counters, SQL Server Profiler, and logs
3.2 Test and analyse the effects of the change and accept or reject it
My question is:
Assuming that getting the system to work with anything other than 5000 users, say 1000 or even 2500, doesn't provide any business value, what is the fastest way to get to 5000 users?
Getting there directly?
Or first get to smaller milestones?
for example - first target 500 users, then 1000, then 2500 ....
Something else??
Do I get any technical value from using milestones, that will eventually take me to 5000 users faster?
Tuning for 500 users, for example, can lead you to solutions that won't be effective when you increase to 1000 users. Example: imagine that you can tune the cache well enough for 500 users; perhaps you won't have enough RAM to get a good result for 5,000.
Go for your goal.
I don't see that artificial milestones really add any value. If the requirement is 5000 concurrent users, go straight for that; otherwise you run the risk of wasting time making optimisations which don't scale.
I would say that setting small milestones could itself create a problem: make them too small (or too big) and they tell you little, so you would have to find a point where the milestones are neither too big nor too small. That can be tricky.
Further to my previous question about the Optimal RAID setup for SQL server, could anyone suggest a quick and dirty way of benchmarking database performance on the new and old servers so we can compare them? Obviously the proper way would be to monitor our actual usage, set up all sorts of performance counters, capture the queries, etc., but we are just not at that level of sophistication yet and this isn't something we'll be able to do in a hurry. So in the meanwhile, I'm after something that would be a bit less accurate but quick to do, and still better than nothing. Just as long as it's not misleading, which would be worse than nothing. It should be SQL Server specific, not just a "synthetic" benchmark. It would be even better if we could use our actual database for this.
Measure the performance of your application itself with the new and old servers. It's not necessarily easy:
Set up a performance test environment with your application on (depending on your architecture this may consist of several machines, some of which may be able to be VMs, but some of which may not be)
Create "driver" program(s) which give the application simulated work to do
Run batches of work under the same conditions - remember to reboot the database server between runs to nullify effects of caching (Otherwise your 2nd and subsequent runs will probably be amazingly fast)
Ensure that the performance test environment has enough hardware machines in it to be able to load the database heavily - this may mean swapping out some VMs for real hardware.
Remember to use production-grade hardware in your performance test environment - even if it is expensive.
Our database performance test cluster contains six hardware machines, several of which are production-grade, one of which contains an expensive storage array. We also have about a dozen VMs on a 7th simulating other parts of the service.
You can always insert, read, and delete a couple of million rows - it's not a realistic mix of operations, but it should strain the disks nicely...
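A minimal sketch of that idea (the table name and row count are made up; run it in a scratch database on each server and compare the timings):

SET STATISTICS TIME ON;

CREATE TABLE dbo.BenchRows
(
    Id      int IDENTITY(1,1) PRIMARY KEY,
    Payload char(200) NOT NULL
);

-- Insert ~2 million rows, using a cross join as a cheap row generator
INSERT INTO dbo.BenchRows (Payload)
SELECT TOP (2000000) REPLICATE('x', 200)
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- Force a full read back
SELECT COUNT(*) FROM dbo.BenchRows WHERE Payload LIKE 'x%';

-- And delete everything again
DELETE FROM dbo.BenchRows;
DROP TABLE dbo.BenchRows;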
Find at least a couple of the queries that are taking some time, or at least that you suspect are taking time, insert a lot of data if you don't have it already, and run the queries having set:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS PROFILE ON
Those should give you a rough idea of the resources being consumed.
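In practice it looks something like this (the query below is just a placeholder; use one of your own slow ones):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT c.CustomerId, COUNT(*) AS OrderCount   -- placeholder query
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
GROUP BY c.CustomerId;

-- The Messages tab then reports logical/physical reads per table and CPU vs. elapsed
-- time, which you can compare directly between the old and new servers.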
You can also run SQL Server Profiler to get a general idea of what queries are taking a long time and how long they are taking plus other statistics. It outputs a lot of data so try to filter it down a little bit, possibly by long duration or one of the other performance statistics.