Debugging SQL Server Slowness: Same Database, Different Servers - sql-server

For a while now we've been having anecdotal slowness on our newly-minted (VMWare-based) SQL Server 2005 database servers. Recently the problem has come to a head and I've started looking for the root cause of the issue.
Here's the weird part: on the stored procedure that I'm using as a performance test case, I get a 30x difference in the execution speed depending on which DB server I run it on. This is using the same database (mdf) and log (ldf) files, detached, copied, and reattached from the slow server to the fast one. This doesn't appear to be a (virtualized) hardware issue: he slow server has 4x the CPU capacity and 2x the memory as the fast one.
As best as I can tell, the problem lies in the environment/configuration of the servers (either operating system or SQL Server installation). However, I've checked a bunch of variables (SQL Server config options, running services, disk fragmentation) and found nothing that has made a difference in testing.
What things should I be looking at? What tools can I use to investigate why this is happening?

Blindly checking variables and settings won't get you very far. You need to approach this methodically.
Are the two procedures executed the same way? Namely, is the plan different? A quick check is to SET STATISTICS IO ON and run the two cases. Is the number of logical-reads the same? Is the number of physical-reads the same? Is the number of writes the same? Differences in logical-reads or writes would indicate a different plan. Differences in physical-reads (while logical-reads is similar) indicate cache and memory problems. If the plans are different, you need to further investigate what is different in the actual execution plan. Does one plan uses a different degree of parallelism? Does one use different join types? Different access paths?
If the plans are similar yet the execution is still different, and you cannot blame the IO subsystem, then you need to check contention. Use SET STATISTICS TIME ON and compare the elapsed time and worker time in the two cases. Similar worker time but different elapsed time indicate that there is more waiting in one case. Use the wait_type and wait_resource info in sys.dm_exec_requests to identify the cause of contention.
The methodology of investigation is discussed in more detail in the Waits and Queues whitepaper.

Run SQL Server Profiler to gather information about running processes within SQL Server. This is probably the best start. This will give you a good idea of the things that are consuming a lot of resources.
If you still have issues after Indexing / Rebuilding Indexes, or rewriting queries, then the next step would be to run PerfMon.

Related

SQL Server CXPACKET timeout

We've got SQL Server 2016 (v13.0.4206.0), by default there is no restrictions for parallelism - any count SQL wants. And it didn't lead any problems... Till now.
For another feature there were written query that unexpectedly raised timeout exception in our application. I was deeply surprised when it was successfully executed with setting up maximum threads per query to 1. Yes, 6 seconds for query is not so good, even accounting to most of time was spent for fetching, but it's far away from 3 minutes timeout!
By the way, executing this query with SQL Server Management Studio works all the time despite of parallelism settings. It seems that something wrong with connection to database, but all other queries works fine, even which much harder then that one.
Our application is built on ASP.NET Core 3.0 (don't know if it matters), database connection is made using System.Data.SqlClient v4.8.0. All I could determine is that there are so much tasks created for this query:
I've tried to watch for execution in sys.dm_os_waiting_tasks (thanks google). I'm not sure I got it right, but it seems that tasks with context_id 0-8 is blocked with those who have context_id 9-16 and vise versa. Obvious example of deadlock, isn't it? But how can SQL Server manage threads to make it without my "help"? Or what am I doing wrong?
Just in case some inappropriate answers:
I won't turn parallelism off (set maximum threads per query to 1) as solution because of some heavy queries in our application;
I don't want to raise Cost Threshold for Parallelism setting because I'm afraid of same problem with another query (guess, a heavier one). So I just want to determine real cause;
Optimizing the query isn't considered (anymore), as according to actual execution plan I can't make it faster - there are enough indexes for it. But I'm ready to rethink after some really weighty arguments.
So, my question is: why does parallelism that I didn't ask for spoil the query execution? And how can I avoid that?
It's true sometimes the engine chooses to use parallel execution (or not to use) which leads to worse performance.
You do not want to control the server option and the cost as you are not sure how this will reflect to other queries, which is understandable.
If you are sure, your query will be execute better without being handle in parallel, you can specify the option just for it using query hints - MAXDOP like this:
SELECT ...
FROM ...
OPTION (MAXDOP 1);
It's easy and you can rollback if needed. Also, you are not affecting other queries.
You are saying that:
Optimizing query isn't considered (anymore), as according to actual execution plan...
The execution plan is sometimes misleading. As a start - you can save your execution plan and open it with SentryOne Plan Explorer - it's free and can give you a better look of what's going on.
Also, if a query is execute for either 3 seconds or 6 minutes, there must be something wrong with it or may be the activity of your database. If it is executed fast in the SSMS always, maybe the engine is using the correct cache plan. I thing it's better to share the query itself and to attach the two plans (serial and parallel) and spend more time tuning it.

SQL server - high buffer IO and network IO

I have a a performance tuning question on SQL server.
I have a program that needs to run every month and it takes more than 24hrs to finish. I need to tune this program in the hope that I can decrease the running time to 12 hrs or less.
As this program isn't developed by us, i can't check the program content and modify it. All i can do is just open the SQL server profiler and activity monitor to trace and analyze the sql content. I have disabled unused triggers and did some housekeeping, but the running time only decreased 1 hr.
I found that the network I/O and buffer I/O are high, but i don't know the cause and meaning of this ?
Can anyone tell me the cause of these two issues (network I/O and butter I/O)? Are there any suggestions for optimizing this program?
Thank you!
. According to your descriptions, I think your I/O is normal, your
question is only one:one procedure is too slowly. the solution:
1.open the SSMS
2.find the procedure
3.click the buttton named "Display estimated execution plan"
4.fix the procedure.
To me it seems like your application reads a lot of data into the application, which would explain the figures. Still, I would check out the following:
Is there blocking? That can easily be a huge waste of time if the process is just waiting for something else to complete. It doesn't look like that based on your statistics, but it's still important to check.
Are the tables indexed properly? Good indexes to match search criteria / joins. If there's huge key lookups, covering indexes might make a big difference. Too many indexes / unnecessary indexes can slow down updates.
You should look into plan cache to see statements responsible for the most I/O or CPU usage
Are the query plans correct for the most costly operations? You might have statistics that are outdated or other optimization issues.
If the application transfers a lot of data to and/or from the database, is the network latency & bandwidth good enough or could it be causing slowness? Is the server where the application is running a bottleneck?
If these don't help, you should probably post a new question with detailed information: The SQL statements that are causing the issues, table & indexing structure of the involved tables with row counts and query plans.

SQL Server randomly 200x slower than normal for simple query

Sometimes queries that normally take almost no time to run at all suddenly start to take as much as 2 seconds to run. (The query is select count(*) from calendars, which returns the number 10). This only happens when running queries through our application, and not when running the query directly against the database server. When we restart our application server software (Tomcat), suddenly performance is back to normal. Normally I would blame the network, but it doesn't make any sense to me that restarting the application server would make it suddenly behave much faster.
My suspicion falls on the connection pool, but I've tried all sorts of different settings and multiple different connection pools and I still have the same result. I'm currently using HikariCP.
Does anyone know what could be causing something like this, or how I might go about diagnosing the problem?
Do you use stored procedures or ad-hoc queries? On reason to get different executions when running a query let's say in management studio vs using stored procedure in you application can be inefficient cached execution plan, which could have been generated like that due to parameter sniffing. You could read more about it here and there are number of solutions you could try (like substituting parameters with local variables). If you restart the whole computer (and SQL Server is also running on it), than this could explain why you get fast queries in the beginning after a restart - because the execution plans are cleaned after reboot.
It turned out we had a rogue process that was grabbing 64 connections to the database at once and using all of them for intense and inefficient work. We were able to diagnose this using jstack. We ran jstack when we noticed the system had slowed down a ton, and it showed us what the application was working on. We saw 64 stack traces all inside the same rogue process, and we had our answer!

SQL Server 2005 Caching

It's my understanding that SQL Server 2005 does some sort of result or index caching. I'm currently profiling complex select statements which take several seconds to several minutes to complete. My problem is that a second run of a query never takes more than a second to run even if I don't alter it. I'm currently using SQL Server Management Studio Express to execute the queries against a SQL Server 2005 server.
My question is, Is there any way to avoid or clear the cache that is causing my queries to execute so quickly on a second run?
There are a couple different things that could at play here, the 3 that come to mind initially (in probable-ish order) would be as follows - if you would like some help interpreting the results, follow the instructions below and paste the stat outputs in the question:
Your query/batch is taking a long time to compile an execution plan. Execution plans are determined and cached (see this post on serverfault for an overview of understanding for how long, when they are rebuilt, etc.)
To verify this, turn on statistics time output, which will provide you information on how long the engine is taking to generate a query plan. For the query/batch in question:
DBCC FREEPROCCACHE
SET STATISTICS TIME ON
Execute the batch, capture the stats output
Execute the batch again, capture the stats output
Compare the 2 stat outputs, paying particular attention to the parse/compile time differences between the 2 executions.
If this is the problem, you can take a couple of approaches to resolving the issue, including specifying a plan guide, specifying a static plan with use plan, or possibly other options like creating a scheduled job to simply compile the plan every few minutes (not as good on option on Sql 2k5 as the others).
Your query/batch is touching a lot of data - on the first execution the data may not be in the buffer pool (basically the cached pages of data the server needs) and the query is performing physical IO operations as opposed to logical IO operations (i.e. reads from disk vs. reads from cache).
To verify this, turn on statistics io output, which will provide you information on the types of IOs and how many of those the engine is performing for the batch. For the query/batch in question:
DBCC DROPCLEANBUFFERS
SET STATISTICS IO ON
Execute the batch, capture the stats output
Execute the batch again, capture the stats output
Compare the 2 stat outputs, paying particular attention to the physical/read-ahead and logical IO outpus between the 2 executions.
To resolve this, you've basically got only 1 option - optimize the query in question so it performs fewer IO operations. You could consider creating a scheduled job that runs the query every so ofter to keep the data in the buffer pool, but this wouldn't be as good an option.
Your query/batch is getting a poor execution plan and/or a poor execution plan choice for different variable values - is this a batch/query that is using a parameterized statement (i.e. you are using variables/static values in the where/join clauses?)? If so, are you seeing the difference in execution times for the same values or different values? If for the same values, the answer is likely #1 or #2 - if for different values, this is potentially your problem. If you think this is the issue after researching #1 and #2, repost with the .sqlplan, the TSQL, and the different parameter values you are using.
I've found the only realiable metrics for performance tuning come from SQL Server's Profiler application. When looking at CPU Time, and Reads in particular, you become much more sheltered from 'other influences'.
For example, the OS being busy, or multiple users being active will reduce your share of CPU and so increase the time to execute. And you may or may not get parallelism over multiple CPUs. But either way, the total CPU Time (as opposed to execution time) will stay approximately the same.
Do this:
CHECKPOINT
DBCC DROPCLEANBUFFERS

Reducing the overhead of a SQL Trace with filters

We have a SQL 2000 server that has widely varied jobs that run at different times of day, or even different days of the month. Normally, we only use the SQL profiler to run traces for very short periods of time for performance troubleshooting, but in this case, that really wouldn't give me a good overall picture of the kinds of queries that are run against the database over the course of a day or week or month.
How can I minimize the performance overhead of a long-running SQL trace? I already know to:
Execute the trace server-side (sp_ create_trace), instead of using the SQL Profiler UI.
Trace to a file, and not to a database table (which would add extra overhead to the DB server).
My question really is about filters. If I add a filter to only log queries that run more than a certain duration or reads, it still has to examine all activity on the server to decide if it needs to log it, right? So even with that filter, is the trace going to create an unacceptable level of overhead for a server that is already on the edge of unacceptable performance?
Adding Filters does minimize the overhead of event collection and also prevents the server from logging transaction entries you don't need.
As for whether the trace is going to create an unacceptable level of overhead, you'll just have to test it out and stop it if there are additional complaints. Taking the hints of the DB Tuning Advisor with that production trace file could improve performance for everyone tomorrow though.
You actually should not have the server process the trace as that can cause problems: "When the server processes the trace, no event are dropped - even if it means sacrificing server performace to capture all the events. Whereas if Profiler is processing the trace, it will skip events if the server gets too busy." (From SQL 70-431 exam book best practices.)
I found an article that actually measures the performance impact of a SQL profiler session vs a server-side trace:
http://sqlblog.com/blogs/linchi_shea/archive/2007/08/01/trace-profiler-test.aspx
This really was my underlying question, how to make sure that I don't bog down my production server during a trace. It appears that if you do it correctly, there is minimal overhead.
It’s actually possible to collect more detailed measurements than you can collect from Profiler – and do it 24x7 across an entire instance -- without incurring any overhead. This avoids the necessity of figuring out ahead of time what you need to filter… which can be tricky.
Full disclosure: I work for one of the vendors who provide such tools… but whether you use ours or someone else’s… this may get you around the core issue here.
More info on our tool here http://bit.ly/aZKerz

Resources