Is there some way to monitor the CPU usage of an MS SQL Server process and, if it rises above a certain threshold, log the queries that are executed while the CPU usage is above that threshold?
Basically, the problem I am having is that one of my databases becomes really slow regularly - several times a day. During the periods when the database is slow, the CPU usage of the SQL Server process is around 90-100% and all queries are timing out.
I am currently looking into ways to develop a small application to do that monitoring for me using .NET, but I thought that something might already exist for this.
Try SQL Monitor from Redgate.
It's quite good at answering questions like "usually everything works fine, but at a certain point in time something goes wrong".
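If you do end up rolling your own monitor, here is a minimal T-SQL sketch of the capture step (a sketch only - it assumes your own scheduler or a SQL Agent job fires it whenever an external CPU check crosses the threshold):

-- Capture the statements currently executing, ordered by CPU consumed so far.
SELECT r.session_id,
       r.cpu_time,            -- ms of CPU consumed by the request so far
       r.total_elapsed_time,  -- ms since the request started
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID   -- exclude this monitoring query itself
ORDER BY r.cpu_time DESC;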
Related
I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates, preferring things like the ReplacingMergeTree engine). I also leverage the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes up slightly, as does BackgroundSchedulePoolTask (which I found weird, because I thought this pool was dedicated to distributed operations - which I don't use) - both numbers remain seemingly reasonable
The number of active merge tasks per minute drops significantly, but I'm unsure whether it's a consequence of slow writes or the cause of them
both insert and general query times are multiplied by ~10, which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (such as data extraction and transformation) happens earlier in the morning (and ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background task or process that Clickhouse runs on a daily basis and that could have a high impact on our disks? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer?
Do you use TTL?
select * from system.merges;
select * from system.part_log where event_time between now() - interval 1 hour and now(); -- run shortly after the 10am slowdown starts
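A hedged follow-up along the same lines (assumes system.query_log is enabled, which it is by default): list the most write-heavy queries finishing in the last day and check whether any cluster around 10am.

-- Most write-heavy finished queries from ClickHouse's built-in query log.
SELECT event_time, query_duration_ms, written_rows, written_bytes, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 DAY
ORDER BY written_bytes DESC
LIMIT 20;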
I have spent a lot of time optimizing a query for our DB admin. When I run it on our test server and our live server I get similar results. However, once it is actually running in production, with the query being utilized a lot, it runs very poorly.
I could post the code, but the query is over 1,400 lines and it would take a lot of time to obfuscate. All data types match and I am using indexes in my queries. It is broken down into 58 temp tables. When I test it using SQL Sentry, a particular exec statement uses 707 CPU cycles and 90,007 reads and takes 1.2 seconds to run. The same parameters in production last week used 10,831 CPU cycles, 2.9 million reads, and took 33.9 seconds to run.
My question is: what could be making the optimizer run more cycles and reads in production than in my one-off tests? Like I mentioned, I could post code if needed, but I am looking for answers that point me in a direction to troubleshoot such a discrepancy. This particular procedure is run a lot during our billing cycle, so it is hitting the server hundreds of times a day, and it will become thousands as we near the 15th of the month.
Thanks
ADDENDUM:
I know how to optimize queries; it is a regular part of my job duties. As I stated in a comment, my test queries don't usually differ this much from actual production. I don't know the SQL Server side of things and wondered if there was something I needed to be aware of that might affect my query when the server is under a heavier load. This may be outside the scope of this forum, but I thought I would reach out for ideas from the community.
UPDATE:
I am still troubleshooting this, just replying to some of the comments.
The execution plans are the same between my one-off tests and the production-level executions. I am testing in the same environment, on the same server as production. This is a procedure for report data. The data returned, the records, and the tables hit are all the same. I am testing with the same parameters that took astronomical amounts of time to process during production; the difference between what I am doing and what happened during the production run is the load on the server. Not all production executions take a long time - the vast majority are within acceptable thresholds of CPU and reads - but when the outliers show such a large discrepancy, it is 500 times the CPU and 150 times the reads of the average execution (even with the same parameters).
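One direction worth checking (a sketch, not a diagnosis; sys.dm_exec_session_wait_stats requires SQL Server 2016+): identical plans under load often diverge on waits rather than work, so while one of the slow production executions is in flight, capture what its session is waiting on.

-- First, find the slow request's session id:
SELECT session_id, cpu_time, logical_reads, wait_type
FROM sys.dm_exec_requests
WHERE status <> 'background';
-- Then see what that session has been waiting on (123 is a hypothetical id):
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = 123
ORDER BY wait_time_ms DESC;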
I have zero control over the server side of things; I can only control how I code the proc. I realize that my proc is large and that without it, it is probably impossible to get a good answer on this forum. I also realize that even with the code posted here I would not likely get a good answer, due to the size of the procedure.
I was/am looking only for insights and directions of things to look at - anecdotal evidence of issues other developers have overcome when dealing with similar problems. Comments that state the size of my code is the reason why performance is in the toilet, and that code that size is rarely needed, are not helpful and quite frankly ignorant. I am working with legacy C# code and a database that deals with millions of transactions for billing. There are thousands of tables in dozens of interconnected databases with an ERD that would blow your mind; I am optimizing a nightmare. That said, I am very good at it, and when both the database administrators and I are stumped as to why we see such stark numbers, I thought I would widen my net and see if this larger community had any ideas.
Below is an image showing a report of the top 32 executions of this procedure in a 15-minute window. Even among the top 32 the numbers are not consistent. The image below that shows all of the temp tables and the main query that I just ran on the same server, using the parameters of the #1 resource hog from the first image. The rows are different temp tables with a sum at the bottom. The sum shows 1.5 (1.492) seconds to run with 534 CPU and 92,045 reads. Contrast that with the 33.9 seconds, 10,831 CPU, and 2.9 million reads yesterday:
I'm facing a situation and need some advice on how to approach it.
From time to time - usually out of business hours, but most likely at random - the activity on a database we host goes off the scale.
This means that the disk queue grows from acceptable levels (below 2) to crazy, sustained values; for example, in the last incident the queue averaged 450 for 30 minutes.
Cases are not always this extreme, and maybe that makes this one easier to spot than the more subtle cases.
When this happens, it causes services depending on the DB to go down / error out / time out, etc.
I can see and record it in perfmon; I know it's the read queue (at least in the last incident), and I know it's focused on a high-activity SQL Server DB stored exclusively on that disk, but I can't pinpoint the exact cause.
I have tried collecting data with profiler, but this tends to happen:
- I monitor query durations (3 secs or more), looking for rogue queries with bad execution plans.
- If I am lucky enough to witness one of these incidents, I see almost every query being captured, which really means that the server has slowed down and even good queries are taking long because something else is going on.
I believe query duration is not what I have to chase down, so maybe you could help me figure out how else to approach this? Something - a query, a maintenance task, a backup, anything - is doing massive I/O on the disk, although I don't really believe it's a query alone; it looks too aggressive to be that.
SQL Server logs do not show errors, nor do they point to anything that could help troubleshoot this. Some maintenance activities have logs, others don't, which leads to equivocal interpretations.
So, what are the events I should hook into Profiler to spot the cause of these incidents?
What else can I do, without turning to a paid tool, to pinpoint the activity that SQL Server starts at random times that causes the disk queue to build up, ultimately going off the scale and causing service outages?
I turn to SO as a last resort, believe me; I have been studying this, looking for ideas, and have tried more than a few things, but I just have to admit I failed and am looking for your advice.
Thank you!
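One free starting point (a hedged sketch, not a definitive answer): rank the cached plans by physical reads to see which statements have driven the most disk I/O since they were cached, then cross-check their last execution times against the perfmon incidents.

-- Top physical-read consumers among cached plans (counters reset when a plan is evicted).
SELECT TOP 20
       qs.total_physical_reads,
       qs.execution_count,
       qs.last_execution_time,
       SUBSTRING(st.text, qs.statement_start_offset/2 + 1, 200) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_physical_reads DESC;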
Sorry for the long introduction, but before I ask my question, I think the background will help in understanding our problem much better.
We are using SQL Server 2008 as the backend for our web services, and from time to time it takes too long to respond to requests that are supposed to run really fast - for example, more than 20 seconds for a select against a table that has only 22 rows. We went through many potential areas that could cause the issue, from indexes to stored procedures, triggers, etc., and tried to optimize whatever we could, like removing indexes that are rarely read but frequently written, or adding NOLOCK to our select queries to reduce table locking (we are OK with dirty reads).
We also had our DBAs review the server and benchmark the components for bottlenecks in the CPU, memory, or disk subsystem, and found that hardware-wise we are OK as well. And since the spikes occur only occasionally, it is really hard to reproduce the error in production or development, because most of the time when we rerun the same query it yields the short response times we expect, not the ones experienced earlier.
Having said that, I have long been suspicious about I/O, although it does not seem to be a bottleneck. I think I was finally able to reproduce the error after running an index fragmentation report for a specific table on the server, which immediately caused spikes not only in requests against that table but also in requests that query other tables. And since the DB and the server are shared with other applications, and queries that take a long time are a common scenario for us, my suspicion of an occasional I/O bottleneck is, I believe, becoming a fact.
Therefore I want to find a way to prioritize the requests coming from the web services, so that they are processed even when other resource-intensive queries are running. I have been looking for this kind of prioritization since the very beginning of the resolution process, and found that SQL Server 2008 has a feature called 'Resource Governor' that allows prioritization of requests.
However, since I am neither an expert on Resource Governor nor a DBA, I would like to ask about other people's experience with Resource Governor, and whether I can prioritize I/O for a specific login or a specific stored procedure. (For example, if an I/O-intensive process is running at the time we receive a web service request, can SQL Server stop, or slow down, the I/O activity of that process and give priority to the request we just received?)
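For illustration only, a minimal Resource Governor sketch (hedged: 'web_service_login' is a hypothetical login name, and note that in SQL Server 2008 Resource Governor governs CPU and memory, not I/O - I/O governance only arrived in SQL Server 2014):

-- Run in master: reserve CPU for a pool used by the web-service login.
CREATE RESOURCE POOL web_pool WITH (MIN_CPU_PERCENT = 50);
CREATE WORKLOAD GROUP web_group USING web_pool;
GO
-- Classifier: route the web-service login to web_group, everything else to default.
CREATE FUNCTION dbo.rg_classifier() RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp SYSNAME = 'default';
    IF SUSER_SNAME() = 'web_service_login'  -- hypothetical login
        SET @grp = 'web_group';
    RETURN @grp;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;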
Thank you in advance to anyone who spends time reading this or helping out.
Some Hardware Details:
CPU: 2x Quad Core AMD Opteron 8354
Memory: 64GB
Disk Subsystem: Compaq EVA8100 series (I am not sure, but it should be RAID 0+1 across 8 HP HSV210 SCSI drives)
PS: I am almost 100 percent sure that the application servers are not causing the error, and there is no bottleneck we can identify there.
Update 1:
I'll try to answer as much as I can for the following questions that gbn asked below. Please let me know if you are looking for something else.
1) What kind of index and statistics maintenance do you have please?
We have a weekly job that defragments indexes every Friday. In addition, Auto Create Statistics and Auto Update Statistics are enabled. And the spikes occur at times other than the fragmentation job as well.
2) What kind of write data volumes do you have?
Hard to answer. In addition to our web services, there is a front-end application that accesses the same database, and to my knowledge resource-intensive queries periodically need to be run; however, I don't know how to get, let's say, the weekly or daily write volume to the DB. One rough option is sketched below.
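A hedged sketch for that (the counters accumulate since the last SQL Server restart, so sample twice and take the difference to approximate a daily figure):

-- Writes per database file since SQL Server last started.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY vfs.num_of_bytes_written DESC;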
3) Have you profiled Recompilation and statistics update events?
Sorry, I was not able to figure this one out; I didn't understand what you are asking with this question. Can you provide more information, if possible?
My first thought is that statistics are being updated because the data-change threshold is reached, causing execution plans to be rebuilt.
What kind of index and statistics maintenance do you have please? Note: index maintenance updates index stats, not column stats: you may need separate stats updates.
What kind of write data volumes do you have?
Have you profiled Recompilation and statistics update events?
In response to question 3) of your update to the original question, take a look at the following reference on SQL Server Pedia. It explains what query recompiles are and goes on to explain how you can monitor for these events. What I believe gbn is asking (feel free to correct me, sir :-) ) is whether you are seeing recompile events prior to the slow execution of the troublesome query. You can look for this by using SQL Server Profiler.
Reasons for Recompiling a Query Execution Plan
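As a lighter-weight alternative to a full Profiler trace, here is a hedged sketch using Extended Events (the sql_statement_recompile event should be available from SQL Server 2008 onwards; verify it exists on your build in sys.dm_xe_objects first):

-- Capture statement-level recompiles into an in-memory ring buffer.
CREATE EVENT SESSION recompiles ON SERVER
ADD EVENT sqlserver.sql_statement_recompile
ADD TARGET package0.ring_buffer;
ALTER EVENT SESSION recompiles ON SERVER STATE = START;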
Further to my previous question about the optimal RAID setup for SQL Server, could anyone suggest a quick and dirty way of benchmarking the database performance on the new and old servers to compare them? Obviously, the proper way would be to monitor our actual usage, set up all sorts of performance counters, capture the queries, etc., but we are just not at that level of sophistication yet and this isn't something we'll be able to do in a hurry. So in the meantime, I'm after something that would be a bit less accurate but quick to do, and still better than nothing - just as long as it's not misleading, which would be worse than nothing. It should be SQL Server specific, not just a "synthetic" benchmark. It would be even better if we could use our actual database for this.
Measure the performance of your application itself with the new and old servers. It's not necessarily easy:
Set up a performance test environment with your application on (depending on your architecture this may consist of several machines, some of which may be able to be VMs, but some of which may not be)
Create "driver" program(s) which give the application simulated work to do
Run batches of work under the same conditions - remember to reboot the database server between runs to nullify the effects of caching (otherwise your 2nd and subsequent runs will probably be amazingly fast)
Ensure that the performance test environment has enough hardware machines to be able to load the database heavily - this may mean swapping out some VMs for real hardware.
Remember to use production-grade hardware in your performance test environment - even if it is expensive.
Our database performance test cluster contains six hardware machines, several of which are production-grade, one of which contains an expensive storage array. We also have about a dozen VMs on a 7th simulating other parts of the service.
You can always insert, read, and delete a couple of million rows - it's not a realistic mix of operations, but it should strain the disks nicely...
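A throwaway sketch along those lines (dbo.bench_test is a hypothetical scratch table; adjust the row count to taste):

SET STATISTICS TIME ON;
-- Hypothetical scratch table for a crude insert/read/delete benchmark.
CREATE TABLE dbo.bench_test (id INT IDENTITY PRIMARY KEY, payload CHAR(200) NOT NULL);
-- Generate ~1 million rows cheaply by cross-joining system views.
INSERT INTO dbo.bench_test (payload)
SELECT TOP (1000000) 'x'
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;
SELECT COUNT(*) FROM dbo.bench_test;   -- force a full read
DELETE FROM dbo.bench_test;
DROP TABLE dbo.bench_test;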
Find at least a couple of the queries that are taking some time (or that you suspect are), insert a lot of data if you don't have it already, and run the queries after setting:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS PROFILE ON
Those should give you a rough idea of the resources being consumed.
You can also run SQL Server Profiler to get a general idea of what queries are taking a long time and how long they are taking plus other statistics. It outputs a lot of data so try to filter it down a little bit, possibly by long duration or one of the other performance statistics.