How do I find and fix the true cause of page life expectancy issues? - sql-server

We have been having performance problems on our SQL Server for the last month, and we are having a hard time getting them fixed.
This was the page life expectancy of one full week, one year ago:
You can see that we have frequent dips in the morning, but that's when our heavy batch processes kick in. During the day everything is peachy.
The same graph, but the current situation:
Big difference, lots of user complaints.
A detail of the day:
We have 64 GB of RAM in the server, which should be plenty. Normally when we have performance issues, we start by looking at the queries that cause the most total wait time (we have an analysis tool for that), then try to improve, remove, or cache these queries, and usually this simply works.
In this case, I have been following the same approach, but it does not seem to affect the PLE counters. How can I correctly identify the cause of these problems so I can fix what needs to be fixed?
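One rough way to read a PLE number is to turn it into an approximate buffer pool turnover rate. The sketch below is a back-of-the-envelope heuristic, not something read off the graphs above. It assumes the buffer pool is close to the full 64 GB (in practice it is smaller, bounded by max server memory), and the sample PLE values are hypothetical.

```python
def buffer_churn_mb_per_sec(buffer_pool_gb: float, ple_seconds: float) -> float:
    """Approximate MB/s of pages cycling through the buffer pool.

    Heuristic: if a page survives ple_seconds on average, the whole pool
    is replaced roughly once every ple_seconds.
    """
    if ple_seconds <= 0:
        raise ValueError("ple_seconds must be positive")
    return buffer_pool_gb * 1024 / ple_seconds


# Hypothetical PLE readings (seconds); 64 GB is the server RAM from the question.
for ple in (6000, 1000, 300):
    rate = buffer_churn_mb_per_sec(64, ple)
    print(f"PLE {ple:>5}s -> roughly {rate:.0f} MB/s read into the buffer pool")
```

If PLE has dropped while the workload looks unchanged, something is probably reading far more pages than before, so it may be worth ranking queries by reads rather than by total wait time.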

Related

Optimized SQL Stored Proc is taking a long time to run in production

I have spent a lot of time optimizing a query for our DB Admin. When I run it in our test server and our live server I get similar results. However, once it is actually running in production with the query being utilized a lot it runs very poorly.
I could post code but the query is over 1400 lines and it would take a lot of time to obfuscate. All data types match and I am using indices on my queries. It is broken down into 58 temp tables. When I test it using SQL Sentry I get it using 707 CPU cycles and 90,0007 reads and a time of 1.2 seconds to run a particular exec statement. The same parameters in production last week used 10,831 CPU cycles, 2.9 million reads, and took 33.9 seconds to run.
My question is what could be making the optimizer run more cycles and reads in production than my one off tests? Like I mentioned, I could post code if needed, but I am looking for answers that point me in a direction to troubleshoot such a discrepancy. This particular procedure is run a lot during our Billing cycle so it is hitting the server hundreds of times a day, it will become thousands as we near the 15th of the month.
Thanks
ADDENDUM:
I know how to optimize queries as this is a regular part of my job duties. As I stated in a comment, my test queries don't usually show large differences like this from actual production. I don't know the SQL Server side of things and wondered if there was something I needed to be aware of that might affect my query when the server is under a heavier load. This may be outside the scope of this forum, but I thought I would reach out for ideas from the community.
UPDATE:
I am still troubleshooting this, just replying to some of the comments.
The execution plans are the same between my one off tests and the production level executions. I am testing in the same environment, on the same server as production. This is a procedure for report data. The data returned, the records, and the tables hit are all the same. I am testing with the same parameters that took astronomical amounts of time to process in production; the difference between what I am doing and what happened during the production run is the load on the server. Not all of the production executions take a long time. The vast majority are within the thresholds of acceptable CPU and reads, but when the outliers occur the discrepancy is large: 500 times the CPU and 150 times the reads of the average execution (even with the same parameters).
I have zero control over the server side of things. I only can control how to code the proc. I realize that my proc is large and without it, it is probably impossible to get a good answer on this forum. I also realize that even with the code posted here I would not likely get a good answer due to the size of the procedure.
I was/am looking only for insights, directions of things to look at, using anecdotal evidence of issues other developers have overcome when dealing with similar problems. Comments that state the size of my code is the reason why performance is in the toilet, and that code that size is rarely needed, are not helpful and quite frankly ignorant. I am working with legacy C# code and a database that deals with millions of transactions for billing. There are thousands of tables in dozens of interconnected databases with an ERD that would blow your mind; I am optimizing a nightmare. That said, I am very good at it, and when the database administrators and I are stumped as to why we see such stark numbers I thought I would widen my net and see if this larger community had any ideas.
Below is an image showing a report of the top 32 executions for this procedure in a 15 min window. Even among the top 32 the numbers are not consistent. The image below that shows all of the temp tables and main query that I just ran on the same server for #1 resource hog of the first image. The rows are different temp tables with a sum at the bottom. The sum shows 1.5 (1.492) seconds to run with 534 CPU and 92,045 reads. Contrast that with the 33.9 seconds, 10,831 CPU, and 2.9 million reads yesterday:
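The gap between the two runs quoted above can be put as ratios, which makes it easier to compare against other outliers. This is only a small arithmetic sketch using the figures from the question; the dictionary layout and the `slowdown_factors` helper are illustrative names, and "2.9 million" is treated as an exact value.

```python
# Figures quoted in the question: the one-off rerun vs. the slow production run.
test_run = {"seconds": 1.492, "cpu": 534, "reads": 92_045}
prod_run = {"seconds": 33.9, "cpu": 10_831, "reads": 2_900_000}  # "2.9 million" taken as exact


def slowdown_factors(baseline: dict, observed: dict) -> dict:
    """Return observed / baseline for every metric present in both runs."""
    return {k: observed[k] / baseline[k] for k in baseline if k in observed}


factors = slowdown_factors(test_run, prod_run)
for metric, factor in factors.items():
    print(f"{metric}: {factor:.1f}x")
```

All three metrics grow by a similar order of magnitude (roughly 20x to 30x), which suggests the slow run did more logical work rather than only waiting longer on a busy server.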

Sql Server DB causes huge disk QUEUE and outages of service.

I'm facing this situation, and need some advice on how to approach it.
From time to time, usually out of business hours, but it's most likely random, the activity on a database we host goes out of scale.
This means that the disk queue grows from acceptable levels (below 2) to crazy and sustained values; for example, in the last incident the queue went to an average of 450 for 30 minutes.
Cases are not always this exaggerated, and maybe that makes this one easier to spot than the more subtle cases.
When this happens, it causes services depending on the DB to go down / error out / timeout, etc.
I can see it and record it on perfmon, I know it's read-queue (at least last incident), I know it's focused on a high activity SQL SERVER DB stored exclusively on that disk, but I can't pin-point the exact cause.
I have tried collecting data with profiler, but this tends to happen:
- I monitor query durations (3 secs or more), looking for rogue queries with bad execution plans.
- If I am lucky enough to witness one of these incidents, I will see almost any query being captured, which really means that the server is slowed down, and even good queries are taking long because something else is going on.
I believe query duration is not what I have to chase down, so maybe you could help me figure out how else to approach this? It's a process doing something, a query, a maintenance task, a backup, anything, that is causing massive IO on the disk. Although I don't really believe it's a query alone; it looks too aggressive to be that.
The SQL Server logs do not show errors, nor do they point to anything that could help troubleshoot this. Some maintenance activities have logs and others don't generate any, which leaves room for ambiguous interpretations.
So, what are the events that I should hook onto in Profiler to spot the cause of these incidents?
What else can I do, without turning to a paid tool, to pinpoint the activity that SQL Server starts doing, at random times, that causes the disk queue to build up, ultimately going out of scale and causing service outages?
I turn to SO as a last resort, believe me. I have been studying this and looking for ideas, and tried more than a few things, but I just have to admit I failed and ask for your advice.
Thank you!
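Since the incidents are recorded in perfmon, one free approach is to export the queue-length counter and extract the exact start and end of each sustained spike, then line those windows up against job schedules, backups, and maintenance tasks. The sketch below assumes the samples are already loaded as (timestamp in seconds, queue length) pairs; the function name, the sample data, and the minimum duration are made up for illustration, while the threshold of 2 comes from the question.

```python
def sustained_spikes(samples, threshold=2.0, min_duration=300):
    """Find periods where the queue length stays above threshold.

    samples: (timestamp_seconds, queue_length) pairs sorted by time.
    Returns (start, end, average_queue) for each run of consecutive samples
    above threshold that spans at least min_duration seconds.
    """
    spikes, run = [], []

    def close_run():
        # Keep the run only if it lasted long enough, then reset it.
        if run and run[-1][0] - run[0][0] >= min_duration:
            avg = sum(value for _, value in run) / len(run)
            spikes.append((run[0][0], run[-1][0], avg))
        run.clear()

    for ts, value in samples:
        if value > threshold:
            run.append((ts, value))
        else:
            close_run()
    close_run()  # a spike may still be open at the end of the log
    return spikes


# Made-up samples at 60-second intervals.
samples = [(0, 1.0), (60, 1.5), (120, 400.0), (180, 450.0), (240, 500.0),
           (300, 450.0), (360, 1.2), (420, 30.0), (480, 0.8)]
print(sustained_spikes(samples, threshold=2.0, min_duration=180))
```

The single high sample at 420 seconds is ignored because it does not last, which keeps short bursts from hiding the long incidents.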

Does changing the system time adversely affect SQL Server?

I can't find anything on this with Google. My SQL Server is on a VM and for some reason the system clock wanders from the Domain time, up to ~30 seconds. This happens randomly 0 to 3 times per week. I have been hounding my VM admin for months about this and he can't seem to find the cause. He has set the server to check with the domain time every 30 minutes but this does not stop the wandering, it just fixes it faster.
Luckily the system only generates a very few transactions per hour so a 30 second time jump is not likely to cause any of the records to be out of order based on the DATETIME fields.
The VM stuff is out of my hands and this has been going on for months so my question is, can changing the system time cause corruption to the SQL files or some other problem I should be keeping an eye out for?
Timekeeping in virtual machines is quite different from physical machines. Basically, on physical machines, the system clock works by counting processor cycles, but a virtual machine can't do it that way. More info here. So what you are seeing is normal behaviour for a VM, it's one of the fundamentals of virtualisation, and although it's annoying there is nothing you can do about it. We run plenty of SQL servers on VMs and yes, the clock jumps when it syncs, but it's never caused an issue to my knowledge.

Identifying Timeout Causes with SQL Server Profiler

We are experiencing seemingly random timeouts on a two-app (one ASP.Net and one WinForms) SQL Server application. I had SQL Profiler run during an hour block to see what might be causing the problem. I then isolated the times when the timeouts were occurring.
There are a large number of Reads but there is no large difference in the reads when the timeout errors occur and when they don't. There are virtually no writes during this period (primarily because everyone is getting time outs and can't write).
Example:
Timeout occurs 11:37. There are an average of 1500 transactions a minute leading up to the timeout, with about 5709219 reads.
That seems high EXCEPT that during a period in between timeouts (over a ten minute span), there are just as many transactions per minute and the reads are just as high. The reads do spike a little before the timeout (jumping up to over 6005708) but during the non-timeout period, they go as high as 8251468. The timeouts are occurring in both applications.
The bigger problem here is that this only started occurring in the past week and the application has been up and running for several years. So yes, the Profiler has given us a lot of data to work with but the current issue is the timeouts.
Is there something else that I should be possibly looking for in the Profiler or should I move to Performance Monitor (or another tool) over on the server?
One possible culprit might be the Database Size. The database is fairly large (>200 GB) but the AutoGrow setting was set to 1MB. Could it be that SQL Server is resizing itself and that transaction doesn't show itself in the profiler?
Many thanks
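To get a feel for why a 1 MB AutoGrow setting is suspicious on a database of this size, it helps to count how many separate growth operations a given amount of new data would trigger. This is plain arithmetic, not a measurement: the 1 MB increment is from the question, while the 1 GB of growth and the 512 MB alternative are hypothetical figures chosen for comparison.

```python
import math


def growth_events(growth_needed_mb: float, autogrow_increment_mb: float) -> int:
    """Number of autogrow operations needed to add growth_needed_mb to a file."""
    if autogrow_increment_mb <= 0:
        raise ValueError("autogrow_increment_mb must be positive")
    return math.ceil(growth_needed_mb / autogrow_increment_mb)


# 1 MB is the setting from the question; the other numbers are hypothetical.
for increment in (1, 512):
    events = growth_events(1024, increment)
    print(f"{increment} MB increment -> {events} growth events per GB of new data")
```

Profiler does expose Data File Auto Grow and Log File Auto Grow events in its Database event category, so growth would only be invisible in the trace if those events were not selected.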
Thanks to the assistance here, I was able to identify a few bottlenecks but I wanted to outline my process to possibly help anyone going through this.
The #1 problem was a high number of LCK_M_S waits reported by SQLDiag and other tools.
Run the Trace Profiler over two different periods of time. Comparing durations for similar methods led me to find that certain UPDATE calls were always taking the same amount of time, over 10 seconds.
Further investigation found that these UPDATE stored procs were updating a table with a trigger that was taking too much time. Since a trigger may lock the table while it completes, it was affecting every other query. (See the comment section: I was incorrectly stating that the trigger would always lock the table. In our case, the trigger was preventing the lock from being released.)
Watch the use of Triggers for doing major updates.

Prioritizing I/O for a specific query request in SQL server

Sorry for the long introduction but before I can ask my question, I think giving the background would help understanding our problem much better.
We are using SQL Server 2008 as the backend for our web services, and from time to time it takes far too long to respond to requests that are supposed to run really fast, for example more than 20 seconds for a select that queries a table with only 22 rows. We went through many potential causes, from indexes to stored procedures, triggers, etc., and tried to optimize whatever we could, such as removing indexes that are written to frequently but never read, or adding NOLOCK to our select queries to reduce locking on the tables (we are OK with dirty reads).
We also had our DBAs review the server and benchmark the components to look for bottlenecks in the CPU, memory, or disk subsystem, and found that hardware-wise we are OK as well. Since the spikes occur only occasionally, it is really hard to reproduce the problem in production or development: most of the time, when we rerun the same query, it yields the short response times we expect, not the long one experienced earlier.
Having said that, I have been suspicious of I/O almost from the start, although it does not seem to be a bottleneck. I think I was just able to reproduce the problem by running an index fragmentation report for a specific table on the server, which immediately caused spikes not only in requests against that table but also in requests that query other tables. The DB, and the server, is shared with other applications we use, and long-running queries against the server and database are a common scenario for us, so my suspicion about an occasional I/O bottleneck is, I believe, becoming a fact.
Therefore I want to find a way to prioritize requests coming from the web services so that they are processed even while other resource-intensive queries are running. I have been looking for this kind of prioritization since the very beginning of the resolution process and found out that SQL Server 2008 has a feature called 'Resource Governor' that allows prioritization of requests.
However, since I am neither an expert on Resource Governor nor a DBA, I would like to hear from people who have used or are using Resource Governor, and to know whether I can prioritize I/O for a specific login or a specific stored procedure. (For example, if an I/O-intensive process is running when we receive a web service request, can SQL Server stop, or slow down, I/O activity for that process and give priority to the request we just received?)
Thanks in advance to anyone who spends time reading or helping out.
Some Hardware Details:
CPU: 2x Quad Core AMD Opteron 8354
Memory: 64GB
Disk Subsystem: Compaq EVA8100 series (I am not sure but it should be RAID 0+1 across 8 HP HSV210 SCSI drives)
PS: I am almost 100 percent sure that the application servers are not causing the error, and there is no bottleneck we can identify there.
Update 1:
I'll try to answer as much as I can of the following questions that gbn asked below. Please let me know if you are looking for something else.
1) What kind of index and statistics maintenance do you have please?
We have a weekly job that defrags indexes every Friday. In addition to that, Auto Create Statistics and Auto Update Statistics are enabled. The spikes also occur at times other than when the fragmentation job runs.
2) What kind of write data volumes do you have?
Hard to answer. In addition to our web services, there is a front end application that accesses the same database, and to my knowledge resource-intensive queries need to be run periodically. However, I don't know how to get, let's say, the weekly or daily write volume to the DB.
3) Have you profiled Recompilation and statistics update events?
Sorry, I was not able to figure this one out. I didn't understand what you are asking with this question. Can you provide more information, if possible?
My first thought is that statistics are being updated because the data change threshold is reached, causing execution plans to be rebuilt.
What kind of index and statistics maintenance do you have please? Note: index maintenance updates index stats, not column stats: you may need separate stats updates.
What kind of write data volumes do you have?
Have you profiled Recompilation and statistics update events?
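As a rough illustration of the "data change threshold" mentioned above, the sketch below encodes the commonly documented auto-update rule for SQL Server versions of the 2008 era: about 500 modifications for small tables, and 500 plus 20% of the rows for tables with more than 500 rows. The function name is hypothetical, and the sketch deliberately ignores the special cases for empty tables and temp tables. Newer versions use a lower, dynamic threshold on large tables.

```python
def auto_update_stats_threshold(row_count: int) -> int:
    """Approximate number of column modifications before statistics count as stale.

    Legacy rule (assumed here): 500 changes for tables of up to 500 rows,
    otherwise 500 + 20% of the table's rows.
    """
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    if row_count <= 500:
        return 500
    return 500 + row_count // 5  # 20% of the rows, rounded down


# 22 rows matches the small table in the question; the other sizes are hypothetical.
for rows in (22, 10_000, 1_000_000):
    print(f"{rows:>9} rows -> stats update after ~{auto_update_stats_threshold(rows)} changes")
```

Under this rule a heavily written table can cross its threshold at unpredictable moments, and the next query that touches it pays for the statistics update and the recompile.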
In response to question 3) of your Update to the original question, take a look at the following reference on SQL Server Pedia. It provides an explanation of what query recompiles are and also goes on to explain how you can monitor for these events. What I believe gbn is asking (feel free to correct me sir :-) ) is whether you are seeing recompile events prior to the slow execution of the troublesome query. You can look for this by using the SQL Server Profiler.
Reasons for Recompiling a Query Execution Plan

Resources