Why would LIKE be faster than =? - sql-server

A co-worker recently ran into a situation where a query to look up security permissions was taking ~15 seconds to run using an = comparison on UserID (which is a UNIQUEIDENTIFIER). Needless to say, the users were less than impressed.
Out of frustration, my co-worker changed the = comparison to use a LIKE and the query sped up to under 1 second.
Without knowing anything about the data schema (I don't have access to the database or execution plans), what could potentially cause this change in performance?
(Broad and vague question, I know)

It may just have been a poor execution plan that had been cached; changing to the LIKE comparison caused a new execution plan to be generated. The same speedup might have been seen if your co-worker had run sp_recompile on the table in question and then re-run the = query.
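For reference, here is a minimal sketch of both options (the table name, column, and parameter value are hypothetical):
-- Invalidate any cached plans that reference the table, so the next run compiles fresh
EXEC sp_recompile N'dbo.Permissions';
-- Or request a fresh plan for a single statement
DECLARE @UserID uniqueidentifier = 'D0A2E9D6-1D37-4B79-9C5A-3F2A1B8C0001'; -- placeholder value
SELECT *
FROM dbo.Permissions
WHERE UserID = @UserID
OPTION (RECOMPILE);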

The other possibility is that this is a complex query and a type conversion is taking place across the = operator for every row. LIKE changes the semantics somewhat so that the type conversion does not have to weigh as heavily in execution planning. I would suggest that your coworker take a look at the execution plan with the = in place and see if there is something like
CONVERT(varchar, variable) = othervariable
in the execution step. In the wrong circumstances, a single typecast can slow a query by two orders of magnitude.
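As an illustration only (the schema here is hypothetical, not the poster's), one common way such a conversion sneaks in is a data-type mismatch between the column and the parameter; on many SQL Server collations the column side gets wrapped in CONVERT_IMPLICIT and the index seek is lost:
-- Hypothetical: the column is varchar, but the application sends an nvarchar parameter
DECLARE @UserKey nvarchar(36) = N'D0A2E9D6-1D37-4B79-9C5A-3F2A1B8C0001';
-- nvarchar has higher type precedence, so the varchar column is converted for every row,
-- which can turn an index seek into a scan
SELECT PermissionID
FROM dbo.Permissions
WHERE UserKey = @UserKey;
-- Declaring the parameter with the column's own type keeps the predicate sargable
DECLARE @UserKey2 varchar(36) = 'D0A2E9D6-1D37-4B79-9C5A-3F2A1B8C0001';
SELECT PermissionID
FROM dbo.Permissions
WHERE UserKey = @UserKey2;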

In some cases, LIKE can be faster than an equivalent function like SUBSTRING when an index can be utilized.
Can you give the exact SQL?
Sometimes functions can stop the optimizer from being able to use an index.
Compare the execution plans.
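A quick illustration of that point (the table and index are hypothetical): a LIKE pattern without a leading wildcard can seek an index, while wrapping the column in a function generally cannot.
-- Sargable: with an index on LastName, the optimizer can seek on the 'Smi' prefix range
SELECT PersonID FROM dbo.People WHERE LastName LIKE 'Smi%';
-- Not sargable: the function hides the column from the index, usually forcing a scan
SELECT PersonID FROM dbo.People WHERE SUBSTRING(LastName, 1, 3) = 'Smi';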

Well, if he ran the two queries one after the other, then it is quite likely that the data had to be read from disk for the first query but was still in the RDBMS data cache for the second one...
If this is what happened, then running them in the opposite order would have shown the opposite results. If he used LIKE with an exact value (no wildcards), the query plan should have been identical.

Have you tried updating the statistics on this table/database? Might be worth a try.
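For example (the table name is hypothetical):
-- Refresh statistics for one table with a full scan
UPDATE STATISTICS dbo.Permissions WITH FULLSCAN;
-- Or refresh sampled statistics for every table in the database
EXEC sp_updatestats;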

Related

salesforce isdeleted=false slows down the query rather than improving it

I read that when there are too many records in the recycle bin, you can exclude the deleted ones from queries by adding the condition "where isDeleted = false". But in my batch, monitoring the times, the query is much slower than the one without the explicit condition - at least on the first run; after that it looks faster.
However, the results obtained in the Developer Console were always exciting.
Can anyone tell me why, and help me, please?
Where did you read that? It looks very suspicious to me. isDeleted = false should have no impact on normal queries (ones that don't have ALL ROWS at the end) because that's what they do out of the box. If anything, it might even slow down the execution because the query optimizer would need to consider this field (it's not indexed; it would be useless to index something that has the same value 99% of the time).
You can experiment with Query Optimizer in the developer console and remember that typically index statistics are recalculated overnight so if you've loaded lots of test data - "today" queries might still run off old statistics.
You might be overcomplicating it, relying on one-off results - for example, the server's load happened to be low at the time you started your experiment. Or maybe whatever this was about is simply undocumented behaviour that changed in one of the recent releases. Just select / create a meaningful index; you'll be better off.
More reading material:
https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/salesforce_large_data_volumes_bp/ldv_deployments_infrastructure_indexes.htm
https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/salesforce_large_data_volumes_bp/ldv_deployments_techniques_deleting_data.htm
https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/langCon_apex_SOQL_VLSQ.htm

Tough SQL optimization

There is a rather complex SQL Server query I have been attempting to optimize for some months now which takes a very long time to execute despite multiple index additions (adding covering, non-clustered indexes) and query refactoring/changes. Without getting into the full details, the execution plan is below. Is there anything here which jumps out to anyone as particularly inefficient or bad? I got rid of all key lookups and there appears to be heavy use of index seeks which is why I am confused that it still takes a huge amount of time somehow. When the query runs, the bottleneck is clearly CPU (not disk I/O). Thanks much for any thoughts.
OK, so I made a change based on Martin's comments which has seemingly greatly helped the query speed. I'm not 100% positive this is the solution, because I've been running this a lot and it's possible that so much underlying data has been loaded into memory that it is now fast. But I think there is actually a true difference.
Specifically, the three scans inside the nested loops were being caused by sub-queries on very small tables that contain a small set of records to be completely excluded from the result set. Conceptually, the query was something like:
SELECT fields
FROM (COMPLEX JOIN)
WHERE id_field NOT IN (SELECT bad_ID_field FROM BAD_IDs)
the idea being that if a record appears in BAD_IDs it should never be included in the results.
I tinkered with this and changed it to something like:
SELECT fields
FROM (COMPLEX JOIN)
LEFT JOIN BAD_IDs ON id_field = bad_ID_field
WHERE BAD_IDs.bad_ID_field IS NULL
This is logically the same thing - it excludes results for any ID in BAD_IDs - but it uses a join instead of a subquery. Even the execution plan is almost identical; a TOP operation gets changed to a FILTER elsewhere in the tree, but the clustered index scan is still there.
But, it seems to run massively faster! Is this to be expected? I have always assumed that a subquery used in the fashion I did was OK and that the server would know how to create the fastest (and presumably identical, which it almost is) execution plan. Is this not correct?
Thx!
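One related note: NOT IN and the LEFT JOIN ... IS NULL form are only interchangeable while bad_ID_field can never be NULL; a single NULL in the NOT IN list makes the whole query return no rows. A NOT EXISTS version, sketched below with the same placeholder names, avoids that trap and usually compiles to the same anti-semi-join as the LEFT JOIN rewrite:
SELECT fields
FROM (COMPLEX JOIN)
WHERE NOT EXISTS (
    SELECT 1
    FROM BAD_IDs
    WHERE BAD_IDs.bad_ID_field = id_field
);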

Small table has very high cost in query plan

I am having an issue with a query where the query plan says that 15% of the execution cost is for one table. However, this table is very small (only 9 rows).
Clearly there is a problem if the smallest table involved in the query has the highest cost.
My guess is that the query keeps on looping over the same table again and again, rather than caching the results.
What can I do about this?
Sorry, I can't paste the exact code (which is quite complex), but here is something similar:
SELECT Foo.Id
FROM Foo
-- Various other joins have been removed for the example
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE (
-- various where clauses removed for the example
)
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
Take these execution-plan statistics with a wee grain of salt. If this table is "disproportionately small," relative to all the others, then those cost-statistics probably don't actually mean a hill o' beans.
I mean... think about it ... :-) ... if it's a tiny table, what actually is it? Probably, "it's one lousy 8K storage-page in a file somewhere." We read it in once, and we've got it, period. End of story. Nothing (actually...) there to index; no (actual...) need to index it; and, at the end of the day, the DBMS will understand this just as well as we do. Don't worry about it.
Now, having said that ... one more thing: make sure that the "cost" which seems to be attributed to "the tiny table" is not actually being incurred by very-expensive access to the tables to which it is joined. If those tables don't have decent indexes, or if the query as-written isn't able to make effective use of them, then there's your actual problem; that's what the query optimizer is actually trying to tell you. ("It's just a computer ... backwards things says it sometimes.")
Without the query plan it's difficult to solve your problem here, but there is one glaring clue in your example:
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
This is going to be incredibly difficult for SQL Server to optimize because it's nearly impossible to accurately estimate the cardinality. Hover over the elements of your query plan and look at EstimatedRows vs. ActualRows and EstimatedExecutions vs. ActualExecutions. My guess is these are way off.
Not sure what the whole query looks like, but you might want to see if you can rewrite it as two queries with a UNION operator rather than using the OR logic.
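A rough sketch of that idea, using the example's own aliases and handling just the first OR pair (the real query would need the second pair and the removed joins folded back in):
SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
WHERE st_1.Id IS NULL
UNION
SELECT Foo.Id
FROM Foo
INNER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
WHERE st_1.Code = 7;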
Well, with the limited information available, all I can suggest is that you ensure all columns being used for comparisons are properly indexed.
In addition, you haven't stated if you have an actual performance problem. Even if those table accesses took up 90% of the query time, it's most likely not a problem if the query only takes (for example) a tenth of a second.

What do you do to make sure a new index does not slow down queries?

When we add or remove an index to speed something up, we may end up slowing down something else.
To protect against such cases, after creating a new index I do the following steps:
start the Profiler,
run a SQL script which contains lots of queries I do not want to slow down,
load the trace from a file into a table (a sketch of this step is below),
analyze CPU, reads, and writes from the trace against the results from the previous runs, before I added (or removed) an index.
This is kind of automated and kind of does what I want. However, I am not sure if there is a better way to do it. Is there some tool that does what I want?
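For the trace-loading step, one way to pull a saved trace file into a table for comparison (the file path and table name are placeholders) is sys.fn_trace_gettable:
-- Load a saved Profiler trace file into a table; the result includes CPU, Reads,
-- Writes, and Duration columns that can be compared with earlier runs
SELECT *
INTO dbo.TraceAfterIndexChange
FROM sys.fn_trace_gettable(N'C:\Traces\after_index_change.trc', DEFAULT);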
Edit 1 The person who voted to close my question, could you explain your reasons?
Edit 2 I googled but did not find anything that explains how adding an index can slow down selects. However, this is a well-known fact, so there should be something somewhere. If nothing comes up, I can write up a few examples later on.
Edit 3 One such example is this: two columns are highly correlated, like height and weight. We have an index on height, which is not selective enough for our query. We add an index on weight and run a query with two conditions: a range on height and a range on weight. Because the optimizer is not aware of the correlation, it grossly underestimates the cardinality of our query.
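As a hedged illustration of that scenario (the table and column names are made up):
-- Single-column indexes exist on HeightCm and on WeightKg
SELECT PersonID
FROM dbo.People
WHERE HeightCm BETWEEN 180 AND 200
  AND WeightKg BETWEEN 80 AND 120;
-- The optimizer combines the two selectivities as if the columns were independent,
-- so the row estimate comes out far too low and it may pick a plan using the new
-- index that is slower than the plan it used before.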
Another example: adding an index on an ever-increasing column, such as OrderDate, can seriously slow down a query with a condition like OrderDate > SomeDateAfterCreatingTheIndex.
Ultimately what you're asking can be rephrased as 'How can I ensure that the queries that already use an optimal, fast, plan do not get 'optimized' into a worse execution plan?'.
Whether the plan changes due to parameter sniffing, statistics update or metadata changes (like adding a new index) the best answer I know of to keep the plan stable is plan guides. Deploying plan guides for critical queries that already have good execution plans is probably the best way to force the optimizer into keep using the good, validated, plan. See Applying a Fixed Query Plan to a Plan Guide:
You can apply a fixed query plan to a plan guide of type OBJECT or SQL. Plan guides that apply a fixed query plan are useful when you know about an existing execution plan that performs better than the one selected by the optimizer for a particular query.
The usual warnings apply as to any possible abuse of a feature that prevents the optimizer from using a plan which may be actually better than the plan guide.
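A minimal sketch of freezing an existing cached plan with a plan guide (the filter text is a placeholder; in practice you would identify the exact statement you want to pin):
DECLARE @plan_handle varbinary(64), @offset int;
-- Find the cached plan for the known-good statement
SELECT TOP (1)
       @plan_handle = qs.plan_handle,
       @offset      = qs.statement_start_offset
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%<known-good statement text>%';   -- placeholder pattern
-- Pin that plan so later optimizations cannot silently replace it
EXEC sp_create_plan_guide_from_handle
     @name = N'Guide_KnownGoodPlan',
     @plan_handle = @plan_handle,
     @statement_start_offset = @offset;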
How about the following approach:
Save the execution plans of all typical queries.
After applying new indexes, check which execution plans have changed.
Test the performance of the queries with modified plans.
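On newer versions (SQL Server 2016 and later, assuming the Query Store is enabled), much of this can be automated; a rough sketch of finding queries that have picked up more than one plan:
SELECT q.query_id,
       qt.query_sql_text,
       COUNT(DISTINCT p.plan_id) AS plan_count
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
GROUP BY q.query_id, qt.query_sql_text
HAVING COUNT(DISTINCT p.plan_id) > 1;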
From the page "Query Performance Tuning"
Improve Indexes
This page has many helpful step-by-step hints on how to tune your indexes for best performance, and what to watch for (profiling).
As with most performance optimization techniques, there are tradeoffs. For example, with more indexes, SELECT queries will potentially run faster. However, DML (INSERT, UPDATE, and DELETE) operations will slow down significantly because more indexes must be maintained with each operation. Therefore, if your queries are mostly SELECT statements, more indexes can be helpful. If your application performs many DML operations, you should be conservative with the number of indexes you create.
Other resources:
http://databases.about.com/od/sqlserver/a/indextuning.htm
However, it’s important to keep in mind that non-clustered indexes slow down the data modification and insertion process, so indexes should be kept to a minimum
http://searchsqlserver.techtarget.com/tip/Stored-procedure-to-find-fragmented-indexes-in-SQL-Server
Fragmented indexes and tables in SQL Server can slow down application performance. Here's a stored procedure that finds fragmented indexes in SQL servers and databases.
OK. First off, indexes slow down two things (at least):
-> insert/update/delete: index maintenance
-> query planning: "shall I use that index or not?"
Someone mentioned the query planner might take a less efficient route - this is not supposed to happen.
If your optimizer is even half-decent, and your statistics and parameters are correct, there is no way it's going to pick the wrong plan.
Either way, in your case (mssql), you can hardly trust the optimizer and will still have to check every time.
What you're currently doing looks quite sound; you should just make sure the data you're looking at is relevant, i.e. real use-case queries in the right proportion (this can make a world of difference).
In order to do that, I always advise writing a benchmarking script based on real use, through logging of production-environment queries, a bit like I said here:
Complete db schema transformation - how to test rewritten queries?

Does using WHERE IN hurt query performance?

I've heard that using an IN clause can hurt performance because it doesn't use indexes properly. See the example below:
SELECT ID, Name, Address
FROM people
WHERE id IN (SELECT ParsedValue FROM UDF_ParseListToTable(@IDList))
Is it better to use the form below to get these results?
SELECT ID,Name,Address
FROM People as p
INNER JOIN UDF_ParseListToTable(@IDList) as ids
ON p.ID = ids.ParsedValue
Does this depend on which version of SQL Server you are using? If so which ones are affected?
Yes, assuming relatively large data sets.
It's considered better to use EXISTS for large data sets; I follow this and have noticed improvements in my code's execution time.
According to the article, it has to do with how IN vs. EXISTS is handled internally. Another article: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
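A sketch of the EXISTS form using the question's own names (compare the actual execution plans before assuming it is faster):
SELECT p.ID, p.Name, p.Address
FROM People AS p
WHERE EXISTS (
    SELECT 1
    FROM UDF_ParseListToTable(@IDList) AS ids
    WHERE ids.ParsedValue = p.ID
);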
It's very simple to find out - open Management Studio, put both versions of the query in, then run them with the actual execution plan turned on. Compare the two execution plans. Often, but not always, the query optimizer will produce exactly the same plan for different versions of a query that are logically equivalent.
In fact, that's its purpose - the goal is that the optimizer will take ANY version of a query, assuming the logic is the same, and make an optimal plan. Alas, the process isn't perfect.
Here are a couple of scientific comparisons:
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
IN can hurt performance because SQL Server must generate a complete result set and then potentially create a huge IF statement, depending on the number of rows in the result set. By the way, calling a UDF can be a real performance hit as well. They are very nice to use but can really impact performance if you are not careful. You can Google "UDF and performance" to do some research on this.
More than the IN or the table variable, I would think that proper use of an index would increase the performance of your query.
Also, from the table name, it does not seem like you are going to have a lot of entries in it, so whichever way you go may be a moot point in this particular example.
Secondly, IN will be evaluated only once since there is no subquery. In your case, the @IDList variable is probably going to cause mismatches; you would need @IDList1, @IDList2, @IDList3... because IN demands a list.
As a general rule of thumb, you should avoid IN with subqueries and use EXISTS with a join - you will get better performance more often than not.
Your first example is not the same as your second example, because WHERE X IN (@variable) is the same as WHERE X = @variable (i.e. you cannot have variable lists).
Regarding performance, you'll have to look at the execution plans to see what indexes are chosen.
