Does using WHERE IN hurt query performance? - sql-server

I've heard that using an IN clause can hurt performance because it doesn't use indexes properly. See the example below:
SELECT ID, Name, Address
FROM people
WHERE ID IN (SELECT ParsedValue FROM UDF_ParseListToTable(@IDList))
Is it better to use the form below to get these results?
SELECT ID, Name, Address
FROM People as p
INNER JOIN UDF_ParseListToTable(@IDList) as ids
ON p.ID = ids.ParsedValue
Does this depend on which version of SQL Server you are using? If so, which ones are affected?

Yes, assuming relatively large data sets.
It's considered better to use EXISTS for large data sets. I follow this and have noticed improvements in my code's execution time. It comes down to how IN vs. EXISTS is handled internally; this article covers it: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
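As a rough sketch of what that looks like against the query in the question (assuming the same UDF and @IDList parameter), the EXISTS form would be:
-- Sketch only: same UDF and parameter as in the question
SELECT p.ID, p.Name, p.Address
FROM People AS p
WHERE EXISTS (SELECT 1
              FROM UDF_ParseListToTable(@IDList) AS ids
              WHERE ids.ParsedValue = p.ID);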

It's very simple to find out: open Management Studio, put both versions of the query in, then run them with Show Execution Plan turned on. Compare the two execution plans. Often, but not always, the query optimizer will produce exactly the same plan for different versions of a query that are logically equivalent.
In fact, that's its purpose: the goal is for the optimizer to take ANY version of a query, assuming the logic is the same, and produce an optimal plan. Alas, the process isn't perfect.
Here are a couple of scientific comparisons:
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
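If you'd rather compare numbers as well as the graphical plans, here is a quick sketch using the queries from the question (assuming @IDList is already declared and populated):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Version 1: IN with a subquery over the UDF
SELECT ID, Name, Address
FROM People
WHERE ID IN (SELECT ParsedValue FROM UDF_ParseListToTable(@IDList));
-- Version 2: direct join to the UDF
SELECT p.ID, p.Name, p.Address
FROM People AS p
INNER JOIN UDF_ParseListToTable(@IDList) AS ids ON p.ID = ids.ParsedValue;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
The Messages tab then shows logical reads and CPU/elapsed time for each version, which is often more telling than the estimated cost percentages.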

IN can hurt performance because SQL Server must generate the complete result set and then build a potentially huge IF statement, depending on the number of rows in that result set. By the way, calling a UDF can be a real performance hit as well. UDFs are very nice to use but can really impact performance if you are not careful. You can Google "UDF performance" to do some research on this.

More than the IN or the Table Variable, I would think that proper use of an Index would increase the performance of your query.
Also, from the table name, it does not seem like you are going to have a lot of entries in it, so which way you go may be a moot point in this particular example.
Secondly, IN will be evaluated only once when there is no subquery. In your case, the @IDList variable is probably going to cause mismatches; you would need @IDList1, @IDList2, @IDList3... because IN demands a list.
As a general rule of thumb, you should avoid IN with subqueries and use EXISTS with a join - you will get better performance more often than not.

Your first example is not the same as your second example, because WHERE X IN (@variable) is the same as WHERE X = @variable (i.e. you cannot have variable lists).
Regarding performance, you'll have to look at the execution plans to see what indexes are chosen.

Related

Can joining with an iTVF be as fast as joining with a temp table?

Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function uf_GetVisibleCustomers(@cUserId). The iTVF wraps a view CustomerView and filters out all rows containing data for customers whom the requesting user is not permitted to see. This way, should selection criteria ever change in the future for certain user types, we won't have to implement that new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. Changed database object names here just so it's easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
CREATE TABLE #tC ( idCustomer INT )
INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester')
SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer
… is orders of magnitude faster than my original (IMO more readable, likely to be used) SQL here…
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
I don't get why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. It is maybe worth noting that there is a WHERE clause; I have omitted it here for simplicity, and it is the same for both queries.
Execution Plan
Below is the execution plan for the first query (plan image not reproduced here). It gets automatic parallelism, whereas the latter query's plan isn't worth displaying because it's just massive: it expands the entire iTVF, the underlying view, and their subqueries. The latter also does not execute in parallel (AFAIK) to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function, it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could try: https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views and associated indexes. If your statistics are not up to date you can get a bad plan; that won't happen with your temp table. On that note, how long does it take to populate that temp table?
Another thing to look at (again, this is why DDL is helpful): are the data types the same for Transactions.idCustomer and #tC.idCustomer? I see a hash match in the plan you posted, which seems bad for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down, but would appear to have a more dramatic impact on the query that leverages your iTVF.
Again, this ^^^ is speculation based on my experience. A couple of quick things to try (not as a permanent fix, but for troubleshooting):
1. Check whether re-compiling your query when using the iTVF speeds things up (this would be a sign of bad stats or a bad execution plan being cached and re-used).
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query or by using make_parallel() by Adam Machanic.
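Purely as troubleshooting sketches of those two suggestions (not permanent fixes), applied to the slow query from the question:
-- 1. Rule out a stale cached plan / bad statistics by forcing a fresh compile
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (RECOMPILE);
-- 2. Ask for a parallel plan via the trace flag mentioned above
-- (QUERYTRACEON 8649 is undocumented; treat it strictly as a diagnostic on a dev server)
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (QUERYTRACEON 8649);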

Small table has very high cost in query plan

I am having an issue with a query where the query plan says that 15% of the execution cost is for one table. However, this table is very small (only 9 rows).
Clearly there is a problem if the smallest table involved in the query has the highest cost.
My guess is that the query keeps on looping over the same table again and again, rather than caching the results.
What can I do about this?
Sorry, I can't paste the exact code (which is quite complex), but here is something similar:
SELECT Foo.Id
FROM Foo
-- Various other joins have been removed for the example
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE (
-- various where clauses removed for the example
)
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
Take these execution-plan statistics with a wee grain of salt. If this table is "disproportionately small," relative to all the others, then those cost-statistics probably don't actually mean a hill o' beans.
I mean... think about it ... :-) ... if it's a tiny table, what actually is it? Probably, "it's one lousy 8K storage-page in a file somewhere." We read it in once, and we've got it, period. End of story. Nothing (actually...) there to index; no (actual...) need to index it; and, at the end of the day, the DBMS will understand this just as well as we do. Don't worry about it.
Now, having said that ... one more thing: make sure that the "cost" which seems to be attributed to "the tiny table" is not actually being incurred by very-expensive access to the tables to which it is joined. If those tables don't have decent indexes, or if the query as-written isn't able to make effective use of them, then there's your actual problem; that's what the query optimizer is actually trying to tell you. ("It's just a computer ... backwards things says it sometimes.")
Without the query plan it's difficult to solve your problem here, but there is one glaring clue in your example:
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
This is going to be incredibly difficult for SQL Server to optimize because it's nearly impossible to accurately estimate the cardinality. Hover over the elements of your query plan and look at EstimatedRows vs. ActualRows and EstimatedExecutions vs. ActualExecutions. My guess is these are way off.
Not sure what the whole query looks like, but you might want to see if you can rewrite it as two queries with a UNION operator rather than using the OR logic.
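A simplified sketch of that rewrite for just one of the OR'd predicates (the real query would need the omitted joins and filters repeated in each branch, plus the same treatment for st_2):
SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable AS st_1 ON st_1.Id = Foo.SmallTableId1
WHERE st_1.Id IS NULL   -- branch 1: no matching small-table row
UNION
SELECT Foo.Id
FROM Foo
INNER JOIN SmallTable AS st_1 ON st_1.Id = Foo.SmallTableId1
WHERE st_1.Code = 7;    -- branch 2: matching row with the required code
Each branch now carries a simple, estimable predicate, which usually gives the optimizer a much better shot at the cardinalities.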
Well, with the limited information available, all I can suggest is that you ensure all columns being used for comparisons are properly indexed.
In addition, you haven't stated if you have an actual performance problem. Even if those table accesses took up 90% of the query time, it's most likely not a problem if the query only takes (for example) a tenth of a second.

What are SQL Execution Plans and how can they help me?

I've been hearing a lot lately that I ought to take a look at the execution plan of my SQL to make a judgment on how well it will perform. However, I'm not really sure where to begin with this feature or what exactly it means.
I'm looking for either a good explanation of what the execution plan does, what its limitations are, and how I can utilize it, or a pointer to a resource that provides one.
It describes the actual algorithms which the server uses to retrieve your data.
An SQL query like this:
SELECT *
FROM mytable1
JOIN mytable2
ON …
GROUP BY
…
ORDER BY
…
, describes what should be done but not how it should be done.
The execution plan shows how: which indexes are used, which join methods are chosen (nested loops or hash join or merge join), how the results are grouped (using sorting or hashing), how they are ordered etc.
Unfortunately, even modern SQL engines cannot automatically find the optimal plans for more or less complex queries; it still takes an SQL developer to reformulate the queries so that they are performant (while still doing what the original query does).
A classical example would be these two queries:
SELECT (
        SELECT COUNT(*)
        FROM mytable mi
        WHERE mi.id <= mo.id
       )
FROM mytable mo
ORDER BY id
and
SELECT RANK() OVER (ORDER BY id)
FROM mytable
, which do the same thing and in theory should be executed using the same algorithms.
However, no actual engine will optimize the former query to use the same algorithm, i.e. keep a running counter in a variable and increment it.
It will do what it's told to do: count the rows over and over and over again.
To optimize the queries you need to actually see what's happening behind the scenes, and that's what the execution plans show you.
You may want to read this article in my blog:
Double-thinking in SQL
Here and here are a couple of articles worth checking out. Execution plans let you identify the parts of a query that are time-consuming and therefore allow you to improve it.
An execution plan shows exactly how SQL Server processes a query
it is produced as part of the query optimisation process that SQL Server does. It is not something that you directly create.
it will show what indexes it has decided are best to be used, and is basically a plan for how SQL Server processes a query
the query optimiser will take a query, analyse it and potentially come up with a number of different execution plans. It's a cost-based optimisation process, and it will choose the one that it feels is the best.
once an execution plan has been generated, it will go into the plan cache so that subsequent calls for that same query can reuse the same plan again to save having to redo the work to come up with a plan.
execution plans automatically get dropped from the cache, depending on their value (low value plans get removed before high value plans do in order to provide maximum performance gain)
execution plans help you spot performance issues such as where indexes are missing
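As a rough illustration of that last point, SQL Server also records the indexes its plans wished for in the missing-index DMVs; here is a sketch of querying them (treat the output as hints to evaluate, not commands to obey):
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;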
A way to ease into this, is simply by using "Ctrl L" (Query | Display Estimated Execution Plan) for some of your queries, in SQL Management Studio.
This will result in showing a graphic view of Execution Plan, which, at first are easier to "decode" than the text version thereof.
Query plans in a tiny nutshell:
Essentially the query plan shows the way SQL Server intends to resolve a query.
There are indeed many options, even with simple queries.
For example, when dealing with a JOIN, one needs to decide whether to loop through the [filtered] rows of "table A" and look up the rows of "table B", or to loop through "table B" first instead (this is a simplified example, as there are many other tricks which can be used in dealing with JOINs). Typically, SQL Server will estimate the number of [filtered] rows which will be produced by either table and pick the one with the smallest count for the outer loop (as this will reduce the number of lookups in the other table).
Another example, is to decide which indexes to use (or not to use).
There are many online resources as well as books which describe query plans in more detail. The difficulty is that SQL performance optimization is a very broad and complex problem, and many such resources tend to go into too much detail for the novice; one first needs to understand the fundamental principles and structures which underlie SQL Server (the way indexes work, the way the data is stored, the difference between clustered indexes and heaps...) before diving into many of the [important] details of query optimization. It is a bit like baseball: first you need to know the rules before understanding all the subtle [and important] concepts related to game strategy.
See this related SO Question for additional pointers.
Here's a great resource to help you understand them
http://downloads.red-gate.com/ebooks/HighPerformanceSQL_ebook.zip
This is from red-gate which is a company that makes great SQL server tools, it's free and it's well worth the time to download and read.
This is a very serious body of knowledge, and I highly recommend dedicated training courses on it. After spending a week on such courses, I improved the performance of some of my queries by roughly 1000 times (nostalgia).
The Execution Plan shows you how the database is fetching, sorting and filtering the data required for your query.
For example:
SELECT
*
FROM
TableA
INNER JOIN
TableB
ON
TableA.Id = TableB.TableAId
WHERE
TableB.TypeId = 2
ORDER BY
TableB.Date ASC
Would result in an execution plan showing the database getting records from TableA and TableB, matching them to satisfy the JOIN, filtering to satisfy the WHERE and sorting to satisfy the ORDER BY.
From this, you can work out what is slowing down the query, whether it would be beneficial to review your indexes or if you can speed things up in another way.
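For instance, if the plan for the example above showed a scan of TableB followed by an explicit sort, one hypothetical index it might nudge you toward (illustrative only, not part of the original answer) is:
-- Seeks on the TypeId filter, returns rows already ordered by Date,
-- and carries the join key so it can feed the join without extra lookups for that column.
CREATE NONCLUSTERED INDEX IX_TableB_TypeId_Date
    ON TableB (TypeId, Date)
    INCLUDE (TableAId);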

advantages in specifying HASH JOIN over just doing a JOIN?

What are the advantages, if any, of explicitly doing a HASH JOIN over a regular JOIN (wherein SQL Server will decide the best JOIN strategy)? Eg:
select pd.*
from profiledata pd
inner hash join profiledatavalue val on val.profiledataid=pd.id
In the simplistic sample code above, I'm specifying the JOIN strategy, whereas if I leave off the "hash" keyword SQL Server will do a MERGE JOIN behind the scenes (per the "actual execution plan").
The optimiser does a good enough job for everyday use. However, in theory it might need 3 weeks to find the perfect plan in an extreme case, so there is a chance that the generated plan will not be ideal.
I'd leave it alone unless you have a very complex query or huge amounts of data where it simply can't produce a good plan. Then I'd consider it.
But over time, as data changes/grows or indexes change etc., your JOIN hint will become obsolete and prevent an optimal plan. A JOIN hint can only optimise for that single query at the time of development, with the set of data you had then.
Personally, I've never specified a JOIN hint in any production code.
I've normally solved a bad join by changing my query around, adding/changing an index or breaking it up (eg load a temp table first). Or my query was just wrong, or I had an implicit data type conversion, or it highlighted a flaw in my schema etc.
I've seen other developers use them but only where they had complex views nested upon complex views and they caused later problems when they refactored.
Edit:
I had a conversation today where some colleagues are going to use them to force a bad query plan (with NOLOCK and MAXDOP 1) to "encourage" migration away from legacy complex nested views that one of their downstream systems calls directly.
Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.
When to try a hash hint, how about:
1. After checking that adequate indices exist on at least one of the tables.
2. After having tried to re-arrange the query. Things like converting joins to "in" or "exists", changing join order (which is only really a hint anyway), moving logic from the where clause to the join condition, etc. (one such re-arrangement is sketched below).
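As a sketch of one such re-arrangement, here is the hinted query from the question expressed with EXISTS instead of a join (note this returns each profiledata row at most once, even when several matching values exist, which may or may not be what the original join intended):
SELECT pd.*
FROM profiledata pd
WHERE EXISTS (SELECT 1
              FROM profiledatavalue val
              WHERE val.profiledataid = pd.id);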
A basic rule of thumb is that a hash join is effective when the join condition is not available as a table index and when the table sizes are different. If you are looking for a technical description, there are some good explanations out there of how a hash join works.
Why use any join hints (hash/merge/loop with side effect of force order)?
To avoid extremely slow execution (.5 -> 10.0s) of corner cases.
When the optimizer consistently chooses a mediocre plan.
A supplied hint is likely to be non-ideal for some circumstances but provides more consistently predictable runtimes. The expected worst-case and best-case scenarios should be pre-tested when using a hint. Predictable runtimes are critical for web services, where a rigidly optimized nominal [0.3s, 0.6s] query is preferred over one that can range [0.25s, 10.0s], for example. Large runtime variances can happen even with statistics freshly updated and best practices followed.
When testing in a development environment, one should turn off "cheating" as well to avoid hot/cold runtime variances. From another post...
CHECKPOINT -- flushes dirty pages to disk
DBCC DROPCLEANBUFFERS -- clears data cache
DBCC FREEPROCCACHE -- clears execution plan cache
The last option may have the same effect as the OPTION (RECOMPILE) hint.
MAXDOP and the load on the machine can also make a huge difference in runtime. Materializing a CTE into a temp table is also a good way of locking down a plan and something to consider.
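A minimal sketch of that materialization idea, with hypothetical object names (the CTE body is computed once into a temp table, which gives the optimizer real statistics and a stable plan for the rest of the query):
-- Hypothetical example: #CustomerTotals stands in for the CTE body
SELECT c.CustomerId, SUM(c.Amount) AS Total
INTO #CustomerTotals
FROM SomeComplexView AS c
GROUP BY c.CustomerId;
-- The main query then joins the materialized result instead of the CTE
SELECT o.*
FROM Orders AS o
JOIN #CustomerTotals AS t ON t.CustomerId = o.CustomerId;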
The only hint I've ever seen in shipping code was OPTION (FORCE ORDER). A stupid bug in the SQL query optimizer would generate a plan that tried to join an unfiltered varchar to a uniqueidentifier. Adding FORCE ORDER caused it to run the filter first.
I know, overloading columns is bad. Sometimes, you've got to live with it.
The logical plan optimizer doesn't guarantee that it finds the optimal solution: an exact algorithm would be too slow to use on a production server, so greedy algorithms are used instead.
Hence, the rationale behind those hints is to let the user specify the optimal join strategy for the cases where the optimizer can't work out what's really best to adopt.

Alternatives to MS SQL 2005 FullText Catalog

I can't seem to get acceptable performance from FullText Catalogs. We have situations where we must run 100k+ queries as quickly as possible. Some of the queries use FREETEXT some don't. Here's an example of a query
IF EXISTS(SELECT 1 FROM user_data d WHERE d.userid = @userid AND FREETEXT(*, @activities)) SET @match = 1
This can take between 3 and 15 seconds. I need it to be much faster, under 1s if possible.
I like the "flexibility" of the full-text query in that it can search across multiple columns and the syntax is pretty intuitive. I'd rather not use a LIKE statement because we want to be able to match words like "Writer" and "Writing".
I've tried some of the suggestions listed here http://msdn.microsoft.com/en-us/library/ms142560(SQL.90).aspx
We've got as much memory and cpu as we can afford, unfortunately we can't put the catalogs on their own disk controllers.
I'm stumped and ready to explore other alternatives to FullText Queries. Is there anything else out there that gives that kind of "Writer"/"Writing" similar matches? Perhaps even something that uses the CLR?
Check out these alternatives, although I doubt they'll improve performance without isolating them onto separate hardware:
Which search technology to use with ASP.NET?
Lucene.Net and SQL Server
Due to the nature of FREETEXT, its performance is worse than CONTAINS, simply because it has to take into account less precise alternatives for the keywords given. CONTAINS can find "Writing" when you specify "Write", by the way; I'm not sure if you've checked whether CONTAINS will do the trick or not.
Also be sure to avoid IF statements in SQL, as they often lead to a complete recompilation of the execution plan for every query execution, which will likely contribute to the poor performance you're seeing. I'm not sure how the IF statement is used, as it's likely inside a bigger piece of SQL. Try to merge the EXISTS query with that bigger piece of SQL, as you can set the @match parameter from within the SELECT statement containing the EXISTS, or get rid of the variable altogether and use the EXISTS clause as a predicate in the bigger query.
SQL is a set-oriented language and interpreted. Therefore, it's often faster to get rid of imperative programming constructs and use the native set-operators of sql instead.
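Putting those suggestions together, here is a hedged sketch of the original check rewritten as a single assignment using CONTAINS (assuming the search terms in @activities can be expressed as a CONTAINS condition such as 'FORMSOF(INFLECTIONAL, write)' so that "Writer" and "Writing" still match):
-- Sketch only: @activities is assumed to hold a valid CONTAINS search condition
SELECT @match = CASE WHEN EXISTS (SELECT 1
                                  FROM user_data d
                                  WHERE d.userid = @userid
                                    AND CONTAINS(*, @activities))
                     THEN 1 ELSE 0 END;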
Perhaps https://github.com/MahyTim/LuceneNetSqlDirectory can help you; it allows you to store a Lucene.NET index in SQL Server.
