I am trying to find a solution for a problem that is driving me mad...
I have a query which runs very fast on a QA server but is very slow in production. I realized that they have different execution plans... so I have tried recompiling, clearing the plan cache, updating statistics, checking the collation... but I still can't find what's going on...
The databases the query runs against are exactly the same, and the SQL Servers also have the same configuration.
Any new ideas would be much appreciated.
Thanks,
A.
I just realised that the QA server is running SP3 while production is on SP2. Could this have any impact on this issue?
Is it possible the production server has a larger database size? The plan can be different because it is based on statistics on the data it contains.
I think it could be due to the volume of data present. It happened to us once: the query literally flew on the QA server but was incredibly slow in production. After racking our brains for a while we found out that the QA server had 15K rows whereas production had 1.5 million.
HTH
If the execution plan was the same and one was slow, it would be database load, hardware, locking/blocking, etc.
However, if the execution plans are different, something is different between the two databases. Are statistics up to date in both? Do they have the exact same schemas, the same indexes, a similar number of rows, the same distribution of PK and index values, etc.? Where did the QA data come from: random data, or is it a restore from production?
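If it helps, a quick way to compare how fresh the statistics are and how many rows each server actually has is something like this (a sketch; dbo.YourTable is a placeholder for a table used by the slow query):

    -- Statistics age and approximate row count for one table (run on both servers).
    SELECT  s.name AS stats_name,
            STATS_DATE(s.object_id, s.stats_id) AS last_updated,
            p.rows AS approx_row_count
    FROM    sys.stats AS s
    JOIN    sys.partitions AS p
              ON p.object_id = s.object_id AND p.index_id IN (0, 1)
    WHERE   s.object_id = OBJECT_ID('dbo.YourTable');

If the dates or row counts differ noticeably between the two servers, that alone can explain the different plans.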
Disable parallel query execution on production :)
I ran into this recently and here's what I found.
I had two databases that were essentially copies of each other. On one version a TVF was taking 1 second to run, while on the other version took 15 minutes to run.
The execution plans of the underlying SQL code were very different. I was able to fix it by rebuilding some of the indexes that the TVF relied on. The execution plans still aren't identical, but they changed a lot, and the execution time is back down to around a second.
Now, both versions had indexes that were highly fragmented. My assumption is that historical statistics or execution plan information allowed the fast version to keep finding an optimal execution plan.
So to sum up: make sure you look at the fragmentation of your indexes even if they have the same structure or similar rates of fragmentation.
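In case it's useful, this is roughly how you can find the worst-fragmented indexes and rebuild one (a sketch; the 30% threshold and the names are illustrative):

    -- Fragmentation for all indexes in the current database.
    SELECT  OBJECT_NAME(ips.object_id) AS table_name,
            i.name AS index_name,
            ips.avg_fragmentation_in_percent
    FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
              ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE   ips.avg_fragmentation_in_percent > 30   -- illustrative threshold
    ORDER BY ips.avg_fragmentation_in_percent DESC;

    -- Rebuild a specific index (placeholder names).
    ALTER INDEX IX_YourIndex ON dbo.YourTable REBUILD;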
Over the years our SQL Server database has accumulated a lot of data, which was causing queries to run slowly and in turn caused the application to slow down significantly.
We eventually decided to archive certain data by storing it in a different data store and deleting it from SQL Server. Note that the data is spread over 22 tables (meta-data). After deleting about 40% of the data we saw that certain queries were running significantly slower. Although a few queries show a slight performance improvement, we are observing that response times and transactions per second have come down significantly when we run a load test.
As part of the cleanup on the database side, we reorganized and rebuilt the indexes. We made sure that the fragmentation after the deletes was within permissible limits (ideally < 30%, but it was way less than that). After this we ran update statistics with full scan.
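For reference, the kind of cleanup described above looks roughly like this (a sketch with placeholder names):

    -- Reorganize (or rebuild) the indexes on a table after large deletes.
    ALTER INDEX ALL ON dbo.YourTable REORGANIZE;

    -- Refresh statistics with a full scan so the optimizer sees the post-delete distribution.
    UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;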
We identified a particular query which runs significantly slower after the deletes and saw that it had a different query plan as compared to before delete (we created a new database with the pre-delete data set to compare the differences) even though the table definitions (indexes, constraints etc) are the same for all tables in the query.
We ran the load tests again after these steps but still saw the same performance as before. I am not a SQL Server expert but am working closely with the DBAs to understand the underlying cause of the slowdown.
The DBAs are suggesting we optimize the queries, but I am not convinced that would actually help: the queries perform better with the larger dataset, which means deleting the data caused some change in the tables that is resulting in the slowdown.
I would really appreciate any pointers or guidance on addressing this issue.
Thanks,
Karthik
I am using SQL Server 2008 R2.
The process is actually like this:
First, about 2 million records are pulled from a remote server,
then a join is done locally,
the final result is thousands of records.
The time cost varies from less than 1 min to 30 mins.
And after I experienced the 30-minute delay, the following runs all took only around 3 mins.
It is the same data, same SP.
What could cause this drastic difference?
Update
I deleted the SP, restarted the SQL Server service, and re-created the SP. The execution took only 50 seconds!
What's wrong?
The behaviour you describe seems extreme - but (if you exclude the client), there are 3 logical places to look.
The first is the query execution on the database server. It's worth using the Query Analyzer tool to see if it's using any indices - by far the most common reason for variable performance of database queries is that the query is not using (the right) indices, and that therefore the impact of the query cache plays a big part. SQL Server will cache a lot of data, and the first run of your proc populates that cache; the second run is faster because it hits the cache. After a while, the cache goes stale, and running the proc slows down again.
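One way to see where that first-run cost is going is to run the proc with I/O and timing statistics switched on; a minimal sketch (proc and parameter are placeholders):

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    EXEC dbo.YourProc @SomeParam = 42;   -- placeholder proc and parameter

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    -- High physical reads on the first run that drop on later runs point at the cache;
    -- consistently high logical reads point at a missing or unused index.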
The second possibility is that the database server is wobbly - it may just not be powerful enough to do all the work it's supposed to do. In that case, one moment you get lucky, have all the server resources to yourself; the next, someone else is running a query and yours slows down. That would make all queries slow, not just this one - so it doesn't sound likely.
Third possibility is networking weirdness - as Phil says, "thousands of records" is nothing too scary, but if they're big, and your network is saturated with pictures of kittens, it might have an impact. Again, that would manifest in general network slowness, and is unlikely to explain a delay of 30 minutes...
Fourth, is anything going on at the same time?
Fifth, does your SP use dynamically generated SQL statements? This would prevent the SP from being pre-compiled. If possible, separate such statements into child SPs.
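If the dynamic SQL can't be split out entirely, parameterizing it with sp_executesql at least lets its plan be cached and reused; a hedged sketch (table and column names are made up):

    -- Instead of concatenating the value into the string and EXEC'ing it each call...
    DECLARE @sql nvarchar(max);
    SET @sql = N'
        SELECT OrderId, Total
        FROM   dbo.Orders              -- placeholder table
        WHERE  CustomerId = @CustomerId';

    -- ...pass the value as a parameter so the statement compiles once and the plan is reused.
    EXEC sp_executesql
         @sql,
         N'@CustomerId int',
         @CustomerId = 42;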
I'm running into a performance issue with the current schema. So I built an equivalent schema to solve the issue.
I ran some tests on both schemas and the results are hard to understand. For the record, the data is the same.
I get the following from the Profiler when executing equivalent requests on the two schemas.
Old schema:
1,300,000 reads
5,000 CPU
4 seconds execution time
New schema:
30,000 reads
3,000 CPU
6 seconds execution time
The difference seems to be in the query plan used. The old schema has parallelism in the query plan. The new schema isn't using parallelism.
Has anyone faced similar situations (less IO/CPU but more execution time)? How did you solve it?
Is there a way to force parallelism? I've played with query hints (http://msdn.microsoft.com/en-us/library/ms18171). I'm able to stop parallelism on the old schema but can't seem to get the query on the new schema to use parallelism.
Thanks in advance.
Louis,
Currently there is no way to force parallelism in SQL Server straight out of the box, though Adam Machanic did some work in that direction:
http://whoisactive.com
Coming to your first question: yes, we have seen cases like that too. Note that parallelism is CPU-bound, and that's why you are seeing more CPU time but less overall execution time, as you have multiple threads doing the work for you.
http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
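While you can't force a parallel plan outright, you can make the optimizer more willing to choose one by lowering the instance-wide cost threshold, or shut parallelism off per query with a hint; a sketch (the values are illustrative, not recommendations):

    -- Instance-wide: lower the cost at which parallel plans are considered.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'cost threshold for parallelism', 5;   -- illustrative value
    RECONFIGURE;

    -- Per query: cap or disable parallelism.
    SELECT COUNT(*) FROM dbo.YourLargeTable   -- placeholder query
    OPTION (MAXDOP 1);                        -- 1 = serial; higher values allow more threads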
Make sure you have proper indexes in place and that stats are updated with a full scan. In the long run it is best to let the Query Optimizer make the decisions by itself, but if you want to override the QO's plans you will have to provide a lot more detail: schema, data, and a repro.
HTH
Here's one I need help from the SQL administrators out there. I have two separate SQL Server instances on Amazon EC2. One is our staging environment, and the other is our production environment, but they are configured exactly the same way (spawned from the same image).
We had a database that we copied from staging to our production environment last week. The way we copy a db to production is we take a backup of it on our staging site, and restore the backup in production. Anyways, we found that in production, one particular complex query was timing out after an hour, but that exact query in our staging environment completed in 10 minutes.
The explain plan on both was almost the same, except that on one server it was doing a PK scan on a large table (8M rows), while on the other it was doing an index seek. We're assuming this was the difference. So one server was doing a lot of disk IO, and the other was not.
So my question is, what are the reasons that one installation of SQL server would decide to use an index, while another one ignores it--assuming same versions of SQL server, and same data set? Even better, what are the best ways to find out why SQL is ignoring an index?
SQL Server uses statistics to determine the query execution plan.
Normally, they should be the same on the same datasets, but there is a chance of outdated statistics on one of the machines.
Use sp_updatestats to update statistics on both machines.
Also, I'm not familiar with Amazon EC2, but there may be a chance that the machines running the two instances have a different number of CPUs installed (or made available for use by SQL Server). This is also taken into account by the optimizer.
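Both checks are quick to run on each instance; a sketch:

    -- Refresh statistics for every table in the current database.
    EXEC sp_updatestats;

    -- Compare what each instance can actually see in terms of CPUs/schedulers.
    SELECT cpu_count, scheduler_count, hyperthread_ratio
    FROM   sys.dm_os_sys_info;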
Parameter Sniffing?
An SP will use the query plan that was deemed most appropriate based on the parameters passed to it when it was executed (and so compiled) for the first time.
Restoring a database wipes the plan cache; if the SP on the copy of the database was run with parameters that favored an index seek, then that's what will subsequently be used.
You can check this by sp_recompile'ing both and running them again with identical parameters.
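A hedged sketch of that check (the procedure and parameter names are placeholders):

    -- Drop the cached plans for the procedure on both servers...
    EXEC sp_recompile N'dbo.YourComplexProc';

    -- ...then run it on both with identical parameter values and compare the plans.
    EXEC dbo.YourComplexProc @FromDate = '20200101', @ToDate = '20200201';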
This was our mistake.
After much digging and investigation, we found that one of our devs had added a couple of additional indexes to the production db after the transfer. This was a case where the additional indexes actually caused the query optimizer to pick a less efficient route in the production environment.
Removing those additional indexes appeared to have addressed the performance issue for the particular query, and both explain plans are now the same.
We have a site in development that when we deployed it to the client's production server, we started getting query timeouts after a couple of hours.
This was with a single user testing it and on our server (which is identical in terms of Sql Server version number - 2005 SP3) we have never had the same problem.
One of our senior developers had come across similar behaviour in a previous job, and he ran a query to manually update the statistics and the problem magically went away - the query returned in a few milliseconds.
A couple of hours later, the same problem occurred. So we again manually updated the statistics and, again, the problem went away. We've checked the database properties and, sure enough, auto update statistics is TRUE.
As a temporary measure, we've set a task to update stats periodically, but clearly, this isn't a good solution.
The developer who experienced this problem before is certain it's an environment problem - when it occurred for him previously, it went away of its own accord after a few days.
We have examined the SQL Server installation on their db server and it's not what I would regard as normal. Although they have SQL 2005 installed (and not 2008), there's an empty "100" folder in the installation directory. There are also MSSQL.1, MSSQL.2, MSSQL.3 and MSSQL.4 folders (the last of which is where the executables and data are actually stored).
If anybody has any ideas we'd be very grateful - I'm of the opinion that rather than the statistics failing to update, they are somehow becoming corrupt.
Many thanks
Tony
Disagreeing with Remus...
Parameter sniffing allows SQL Server to guess the optimal plan for a wide range of input values. Sometimes it's wrong, and the plan is bad because of an atypical value or a poorly chosen default.
I used to be able to demonstrate this on demand by changing a default between 0 and NULL: plan and performance changed dramatically.
A statistics update will invalidate the plan. The query will thus be compiled and cached when next used.
The workarounds are one of the following (a sketch of the first two follows the list):
parameter masking
use OPTIMIZE FOR UNKNOWN hint
duplicate "default"
See these SO questions
Why does the SqlServer optimizer get so confused with parameters?
At some point in your career with SQL Server does parameter sniffing just jump out and attack?
SQL poor stored procedure execution plan performance - parameter sniffing
Known issue?: SQL Server 2005 stored procedure fails to complete with a parameter
...and Google search on SO
Now, Remus works for the SQL Server development team. However, this phenomenon is well documented by Microsoft on their own website, so blaming developers is unfair:
How Data Access Code Affects Database Performance (MSDN mag)
Suboptimal index usage within stored procedure (MS Connect)
Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005 (an excellent white paper)
It's not that the statistics are outdated. What happens is that when you update statistics, all plans get invalidated and some bad cached plan gets evicted. Things run smoothly until a bad plan gets cached again and causes slow execution.
The real question is why you get bad plans to start with. We can get into lengthy technical and philosophical arguments about whether a query processor should create a bad plan to start with, but the fact is that, when applications are written in a certain way, bad plans can happen. The typical example is having a WHERE clause like (@somevariable is null) or (somefield = @somevariable). Ultimately, 99% of bad plans can be traced to developers writing queries with C-style procedural expectations instead of sound, set-based, relational processing.
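To make that concrete, this is the kind of catch-all predicate being described; a single cached plan has to serve both the "return everything" case and the "return one customer" case (names invented; a common fix is OPTION (RECOMPILE) on the statement or building the WHERE clause dynamically):

    CREATE PROCEDURE dbo.SearchOrders
        @CustomerId int = NULL
    AS
    BEGIN
        -- One plan is compiled for this shape whether @CustomerId is NULL or a
        -- specific value, so whichever case compiles first, the other case runs
        -- with the wrong plan.
        SELECT OrderId, Total
        FROM   dbo.Orders                          -- placeholder table
        WHERE  (@CustomerId IS NULL OR CustomerId = @CustomerId);
    END;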
What you need to do now is identify the bad queries. It's really easy: just check sys.dm_exec_query_stats; the bad queries will stand out in terms of total_elapsed_time and total_logical_reads. Once you've identified the bad plan, you can take corrective measures, which vary from query to query.
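Something along these lines will surface the heaviest statements (a sketch; change the ORDER BY to total_logical_reads to chase I/O instead):

    -- Top statements by total elapsed time, with the statement text extracted.
    SELECT TOP (20)
           qs.total_elapsed_time,
           qs.total_logical_reads,
           qs.execution_count,
           SUBSTRING(st.text,
                     qs.statement_start_offset / 2 + 1,
                     (CASE WHEN qs.statement_end_offset = -1
                           THEN DATALENGTH(st.text)
                           ELSE qs.statement_end_offset END
                      - qs.statement_start_offset) / 2 + 1) AS statement_text
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;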