Improving performance of an OUTER APPLY with XML in SQL Server

I'm generating reports from a database that makes extensive use of XML to store time-series data. Annoyingly, most of these entries hold only a single value, complicating everything for no benefit. Looking here on SO, I found a couple of examples using OUTER APPLY to decode these fields into a single value.
One of these queries is timing out on the production machine, so I'm looking for ways to improve its performance. The query contains a dozen lines similar to:
SELECT...
PR.D.value('@A', 'NVARCHAR(16)') AS RP,
...
FROM Profiles LP...
OUTER APPLY LP.VariableRP.nodes('/X/E') PR(D)
...
When I look in the Execution Plan, each of these OUTER APPLYs has a huge operator cost, although I'm not sure that really means anything. In any event, these operators make up 99% of the query time.
Does anyone have any advice on how to improve these sorts of queries? I suspect there's a way to do this without OUTER APPLY, but my google-fu is failing.

Taking this literally
most of these entries hold only a single value
...it should be faster to avoid APPLY (which incurs noticeable overhead creating a derived table) and read the one and only value directly:
SELECT LP.VariableRP.value('(/X/E/@A)[1]', 'NVARCHAR(16)') AS RP
FROM Profiles LP
If this does not provide what you need, please show us some examples of your XML, but I doubt this will get much faster.
There are XML indexes, but in most cases they don't help and can even make things worse.
You might use some kind of trigger or run-once logic to shift the needed values into a side column (into a related side table) and query from there.
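A minimal sketch of that run-once idea, assuming the table and XML layout from the question (the side column name RP_Value is hypothetical):

-- One-time backfill: copy the single XML value into a plain column
ALTER TABLE Profiles ADD RP_Value NVARCHAR(16) NULL;
GO

UPDATE Profiles
SET RP_Value = VariableRP.value('(/X/E/@A)[1]', 'NVARCHAR(16)');

-- Reports can then read the plain column, with no XML shredding at query time
SELECT RP_Value FROM Profiles;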

Related

Which is better: multiple CTEs in a single query, or multiple views joined?

I am currently in the process of migrating a database from MS Access to SQL Server. To improve the performance of a specific query, I am translating it from Access SQL to T-SQL and executing it server-side. The query in question is essentially made up of almost 15 subqueries branching off in different directions with varying levels of complexity. The top-level query is a culmination (a final SELECT) of all of these queries.
Without actually going into the specifics of the fields and relationships in my queries, I want to ask a question on a generic example.
Take the following:
                     Top Level Query
                            |
                ____________|____________
               |                         |
            Query 1 <--------------> Query 2
         ______|______    Views?   ______|______
        |             |           |             |
   Query 1.1     Query 1.2   Query 2.1     Query 2.2
     ___|___                   ___|___
    |       |                 |       |
Query 1.1.1  Query 1.1.2  Query 2.1.1  Query 2.1.2
    |       |                 |       |
   ...     ...               ...     ...
I am attempting to convert the above MS Access query structure to T-SQL whilst maximising performance. So far I have converted all of Query 1 into a single query, starting from the bottom and working my way up. I achieved this by using CTEs to represent every single subquery and then finally selecting from this entire CTE tree to produce Query 1. Due to the original design of the query, there is a high level of dependency between the subqueries.
Now my question is actually quite simple. With regard to Query 2, should I continue to use this same method within the same query window, or should I make Query 1 and Query 2 separate entities (views) and then select from each? Or should I just continue adding more CTEs and get the final Top Level Query result from this one super query?
This is an extremely bastardised version of the actual query I am working with, which has a large number of calculated fields and more subquery levels.
What do you think is the best approach here?
There is no way to say for sure from a diagram like this, but I suspect that you want to use Views for a number of reasons.
1) If a sub-query/view is used in more than one place, there is a good chance that caching will allow its results to be shared across those places. The effect is not as strong as with a CTE, but that can be mitigated by materializing the view.
2) It is easy to turn a view into a materialized (indexed) view. Then you get a huge bonus if it is used multiple times, or many times before it needs to be refreshed.
3) If you find a slow part, it will be isolated to one view -- then you can optimize and change that small section more easily.
I would recommend using views for EVERY sub-query if you can, unless you can demonstrate (via the execution plan or testing) that the CTE runs faster.
A final note as someone who has migrated Access to SQL Server in the past: Access encourages more sub-queries than are needed with modern SQL and windowing functions. It is very likely that with some analysis these Access queries can be made much simpler. Try to find cases where you can roll them up into the parent query.
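For point 2, a hedged sketch of what an indexed (materialized) view looks like in SQL Server; the object and column names are purely illustrative, not taken from the question:

CREATE VIEW dbo.vQuery1_1
WITH SCHEMABINDING
AS
SELECT c.CustomerId,
       SUM(o.Amount) AS TotalAmount,   -- aggregated expressions must be non-nullable here
       COUNT_BIG(*)  AS RowCnt         -- required when an indexed view uses GROUP BY
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
GROUP BY c.CustomerId;
GO

-- The unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vQuery1_1 ON dbo.vQuery1_1 (CustomerId);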
A query you submit and a "view" are really the same thing.
Your BASIC simple question stands!
Should you use a query on a query, or use a CTE?
Well, first, let's clear up some confusion here.
A CTE is great for eliminating the need to build a query, save it (say as a view), AND THEN query against it.
However, in your question we are mixing up TWO VERY different issues.
Are you doing a query against a query, or, in YOUR case, using a sub-query? While these two things SEEM similar, they really are not!
In the case of a sub-query, using a CTE will in fact save you the pain of having to build a separate query/view and save it. In this case you are, I suppose, doing a query on a query, but it is REALLY a sub-query. From a performance point of view, I don't believe you will find any difference, so do what works best for you. I do in some ways like adopting CTEs, since THEN you have the "whole thing" in one spot, and updates to the whole mess occur in one place. This can especially be an advantage if you have several sites: to update things, you ONLY have to update this one big saved "thing". I find this a significant advantage.
The advantage of breaking things out into separate views (as opposed to using CTEs) often comes down to the simple question: how do you eat an elephant?
Answer: One bite at a time.
However, I in fact consider the concept and approach of a sub-query a DIFFERENT issue than building a query on a query. One of the really big reasons for using CTEs in SQL Server is that SQL Server has one REALLY big limitation compared to Access SQL. That limitation, of course, is not being able to re-use derived columns.
e.g. this:
SELECT ID, Company, State, TaxRate, Purchased, Payments,
       (Purchased - Payments) AS Balance,
       (Balance * TaxRate) AS BalanceWithTax
FROM Customers
Of course in T-SQL you can't re-use an expression like you can in Access SQL, so the above is a GREAT use case for CTEs. Balance cannot be re-used in T-SQL, so you end up constantly repeating expressions (my big pet peeve with T-SQL). Using a CTE means we CAN get back the ability to re-use an expression, as sketched below.
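A sketch of the CTE rewrite being described, using the same illustrative columns:

WITH C AS
(
    SELECT ID, Company, State, TaxRate, Purchased, Payments,
           (Purchased - Payments) AS Balance
    FROM Customers
)
SELECT ID, Company, State, TaxRate, Purchased, Payments,
       Balance,                          -- the derived column can now be re-used
       (Balance * TaxRate) AS BalanceWithTax
FROM C;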
So I tend to think of the CTE as solving two issues, and you should keep these concepts separate:
I want to eliminate the need for a query on a query, and that includes sub-queries.
So, sure, using a CTE for the above is a good use.
The SECOND use case is the ability to re-use expression columns. This is VERY painful in T-SQL, and CTEs go a long way towards reducing this pain point (Access SQL is still far better), but CTEs are at least very helpful.
So, from a performance point of view, using CTEs to eliminate sub-queries should not affect performance, and as noted you save having to create 2-5 separate queries to make this work.
Then there is the query on a query (especially in the above use case of being able to re-use columns in expressions). In this case I believe some performance advantages exist, but again, likely not enough to justify one approach over the other. So once again, adopting CTEs should come down to whichever road is LESS work for you! (But for a very nasty sub-query that, say, sums() and does some real work, and THEN you need to re-use those columns -- that is really when CTEs shine.)
So as a general coding approach, I use CTEs to avoid a query on a query (but NOT a sub-query as you are doing), and I use CTEs to gain re-use of a complex expression column.
Using CTEs just to eliminate sub-queries is not really that great a benefit. (I mean, just shove the sub-query into the main query -- MOST of the time a CTE will not help you.)
So, using CTEs just for the concept of a sub-query is not that great an advantage. You can, but I don't see great gains from a developer's point of view. However, in the case of a query on a query (to gain re-use of column expressions), the CTE eliminates the need for a query against the same table/query.
So, for plain sub-queries, I can't say CTEs are a huge advantage. But for re-use of column expressions, you MUST do a query on a query (say, against a saved view), and THEN you gain re-use of the columns as expressions to be used in additional expressions.
While this is somewhat a matter of opinion:
The CTE's ability to allow re-use of columns is its main use case, because it eliminates the need to create a separate view. It is not so much that you eliminated the need for a separate view (a query on a query); the main benefit is that you gained the ability to re-use a column expression.
So, you certainly can use CTEs to eliminate having to create views (yes, a good idea), but in your case you likely could have just used sub-queries anyway, and the CTEs are not really required. For column re-use, you have NO CHOICE in the matter: since you MUST use a query on a query for column expression re-use, the CTE eliminates the need for a separate view. In your case (at least so far) you did not really need a CTE and nothing in your solution was forcing you to use one. For column re-use you have zero choice -- you ARE being forced to query on a query -- so a CTE eliminates that need.
As far as I can tell, so far you don't really need to use a CTE unless the issue is being able to re-use some columns in other expressions like we can in Access SQL.
If column re-use is the goal? Then yes, CTEs are a great solution. So it is more a column re-use issue than a matter of choosing query on query. If you did not have the additional views in Access, then without question adopting CTEs to keep a similar approach and design makes a lot of sense. The motivation of column re-use is what we lost by going to SQL Server, and CTEs do a lot to regain that ability.

Can joining with an iTVF be as fast as joining with a temp table?

Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function uf_GetVisibleCustomers(@cUserId). The iTVF wraps a view CustomerView and filters out all rows containing data for customers whom the requesting user is not permitted to see. This way, should the selection criteria ever change in the future for certain user types, we won't have to implement the new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. Changed database object names here just so it's easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
CREATE TABLE #tC ( idCustomer INT )
INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester')
SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer
… is orders of magnitude faster than my original (IMO more readable, likely to be used) SQL here…
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
I don't get why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. Maybe worth noting that there is a WHERE clause, however I have omitted it here for simplicity; it is the same for both queries.
Execution Plan
The execution plan for the first query enjoys automatic parallelism, as mentioned, while the plan for the second isn't worth displaying here because it's just massive (it expands the entire iTVF, the underlying view, and their subqueries). The second query also does not, as far as I can tell, execute in parallel to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function - it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could try: https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views, and associated indexes. If your statistics are not up to date you can get a bad plan; that won't happen with your temp table. On that note, how long does it take to populate that temp table?
Another thing to look at (again, this is why DDL is helpful): are the data types the same for Transactions.idCustomer and #tC.idCustomer? I see a hash match in the plan you posted, which seems bad for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down, but would appear to have a more dramatic impact on the query that leverages your iTVF.
Again, this ^^^ is speculation based on my experience. A couple of quick things to try (not as a permanent fix, but for troubleshooting):
1. Check whether re-compiling your query when using the iTVF speeds things up (this would be a sign of bad stats or of a bad execution plan being cached and re-used).
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query or by using make_parallel() by Adam Machanic.
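Hedged sketches of those two troubleshooting steps, applied to the slow form of the query from the question (hints for testing only, not a permanent fix):

-- 1) Rule out a stale cached plan / bad statistics estimate
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (RECOMPILE);

-- 2) Force a parallel plan (undocumented trace flag; use in test environments only)
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (QUERYTRACEON 8649);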

Small table has very high cost in query plan

I am having an issue with a query where the query plan says that 15% of the execution cost is for one table. However, this table is very small (only 9 rows).
Clearly there is a problem if the smallest table involved in the query has the highest cost.
My guess is that the query keeps on looping over the same table again and again, rather than caching the results.
What can I do about this?
Sorry, I can't paste the exact code (which is quite complex), but here is something similar:
SELECT Foo.Id
FROM Foo
-- Various other joins have been removed for the example
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE (
-- various where clauses removed for the example
)
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
Take these execution-plan statistics with a wee grain of salt. If this table is "disproportionately small," relative to all the others, then those cost-statistics probably don't actually mean a hill o' beans.
I mean... think about it ... :-) ... if it's a tiny table, what actually is it? Probably, "it's one lousy 4K storage-page in a file somewhere." We read it in once, and we've got it, period. End of story. Nothing (actually...) there to index; no (actual...) need to index it; and, at the end of the day, the DBMS will understand this just as well as we do. Don't worry about it.
Now, having said that ... one more thing: make sure that the "cost" which seems to be attributed to "the tiny table" is not actually being incurred by very-expensive access to the tables to which it is joined. If those tables don't have decent indexes, or if the query as-written isn't able to make effective use of them, then there's your actual problem; that's what the query optimizer is actually trying to tell you. ("It's just a computer ... backwards things says it sometimes.")
Without the query plan it's difficult to solve your problem here, but there is one glaring clue in your example:
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
This is going to be incredibly difficult for SQL Server to optimize because it's nearly impossible to accurately estimate the cardinality. Hover over the elements of your query plan and look at EstimatedRows vs. ActualRows and EstimatedExecutions vs. ActualExecutions. My guess is these are way off.
Not sure what the whole query looks like, but you might want to see if you can rewrite it as two queries with a UNION operator rather than using the OR logic.
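A hedged sketch of that rewrite, splitting only the first OR from the example query (the omitted joins and WHERE clauses would have to be repeated in both halves):

SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable AS st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable AS st_2 ON st_2.Id = Foo.SmallTableId2
WHERE st_1.Id IS NULL
  AND (st_2.Id IS NULL OR st_2.Code = 4)

-- UNION ALL is exact here because the two branches are mutually exclusive
-- (st_1.Id IS NULL and st_1.Code = 7 cannot both hold); plain UNION would de-duplicate
UNION ALL

SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable AS st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable AS st_2 ON st_2.Id = Foo.SmallTableId2
WHERE st_1.Code = 7
  AND (st_2.Id IS NULL OR st_2.Code = 4)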
Well, with the limited information available, all I can suggest is that you ensure all columns being used for comparisons are properly indexed.
In addition, you haven't stated if you have an actual performance problem. Even if those table accesses took up 90% of the query time, it's most likely not a problem if the query only takes (for example) a tenth of a second.

Does using WHERE IN hurt query performance?

I've heard that using an IN clause can hurt performance because it doesn't use indexes properly. See the example below:
SELECT ID, Name, Address
FROM people
WHERE id IN (SELECT ParsedValue FROM UDF_ParseListToTable(@IDList))
Is it better to use the form below to get these results?
SELECT ID,Name,Address
FROM People as p
INNER JOIN UDF_ParseListToTable(@IDList) as ids
ON p.ID = ids.ParsedValue
Does this depend on which version of SQL Server you are using? If so which ones are affected?
Yes, assuming relatively large data sets.
It's considered better to use EXISTS for large data sets. I follow this and have noticed improvements in my code execution time.
According to the article, it has to do with how IN vs. EXISTS is handled internally. Another article: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
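A hedged sketch of the EXISTS form being recommended, using the tables from the question:

SELECT p.ID, p.Name, p.Address
FROM People AS p
WHERE EXISTS
(
    SELECT 1
    FROM UDF_ParseListToTable(@IDList) AS ids
    WHERE ids.ParsedValue = p.ID
);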
It's very simple to find out: open Management Studio, put both versions of the query in, then run them with Show Execution Plan turned on and compare the two plans. Often, but not always, the query optimizer will produce exactly the same plan for different versions of a query that are logically equivalent.
In fact, that's its purpose: the goal is for the optimizer to take ANY version of a query, assuming the logic is the same, and produce an optimal plan. Alas, the process isn't perfect.
Here are a couple of scientific comparisons:
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
IN can hurt performance because SQL Server must generate a complete result set and then potentially create a huge IF statement, depending on the number of rows in that result set. By the way, calling a UDF can be a real performance hit as well. UDFs are very nice to use but can really hurt performance if you are not careful; you can Google "UDF performance" to do some research on this.
More than the IN or the table variable, I would think that proper use of an index would increase the performance of your query.
Also, from the table name, it does not seem like you are going to have a lot of entries in it, so which way you go may be a moot point in this particular example.
Secondly, IN will be evaluated only once since there is no subquery. In your case, the @IDList variable is probably going to cause mismatches; you would need @IDList1, @IDList2, @IDList3, ... because IN demands a list.
As a general rule of thumb, you should avoid IN with subqueries and use EXISTS or a join instead -- you will get better performance more often than not.
Your first example is not the same as your second example, because WHERE X IN (@variable) is the same as WHERE X = @variable (i.e. you cannot have variable lists).
Regarding performance, you'll have to look at the execution plans to see what indexes are chosen.

Advantages in specifying HASH JOIN over just doing a JOIN?

What are the advantages, if any, of explicitly doing a HASH JOIN over a regular JOIN (wherein SQL Server will decide the best JOIN strategy)? Eg:
select pd.*
from profiledata pd
inner hash join profiledatavalue val on val.profiledataid=pd.id
In the simplistic sample code above, I'm specifying the JOIN strategy, whereas if I leave off the "hash" keyword SQL Server will do a MERGE JOIN behind the scenes (per the "actual execution plan").
The optimiser does a good enough job for everyday use. However, in extreme cases it might in theory need 3 weeks to find the perfect plan, so there is a chance that the generated plan will not be ideal.
I'd leave it alone unless you have a very complex query or huge amounts of data where it simply can't produce a good plan. Then I'd consider it.
But over time, as the data changes/grows or indexes change etc., your JOIN hint will become obsolete and prevent an optimal plan. A JOIN hint can only optimise that single query, at the time of development, for the data set you had.
Personally, I've never specified a JOIN hint in any production code.
I've normally solved a bad join by changing my query around, adding/changing an index or breaking it up (eg load a temp table first). Or my query was just wrong, or I had an implicit data type conversion, or it highlighted a flaw in my schema etc.
I've seen other developers use them but only where they had complex views nested upon complex views and they caused later problems when they refactored.
Edit:
I had a conversation today where some colleagues are going to use them to force a bad query plan (with NOLOCK and MAXDOP 1) to "encourage" migration away from legacy complex nested views that one of their downstream systems calls directly.
Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.
When to try a hash hint, how about:
1. After checking that adequate indices exist on at least one of the tables.
2. After having tried to re-arrange the query. Things like converting joins to "in" or "exists", changing join order (which is only really a hint anyway), moving logic from where clause to join condition, etc.
Some basic rules for when a hash join is effective: when the join condition does not exist as a table index, and when the table sizes are different. If you're looking for a technical description, there are some good write-ups out there on how a hash join works.
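For reference, a hash join can be requested per join or per query; a sketch using the question's illustrative tables (note that a join-level hint also forces the join order for the statement):

-- Join-level hint (also enforces the written join order)
SELECT pd.*
FROM profiledata pd
INNER HASH JOIN profiledatavalue val ON val.profiledataid = pd.id;

-- Query-level hint: constrains every join in the statement
SELECT pd.*
FROM profiledata pd
INNER JOIN profiledatavalue val ON val.profiledataid = pd.id
OPTION (HASH JOIN);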
Why use any join hints (hash/merge/loop with side effect of force order)?
To avoid extremely slow execution (.5 -> 10.0s) of corner cases.
When the optimizer consistently chooses a mediocre plan.
A supplied hint is likely to be non-ideal for some circumstances, but it provides more consistently predictable runtimes. The expected worst-case and best-case scenarios should be pre-tested when using a hint. Predictable runtimes are critical for web services, where a rigidly optimized nominal [.3s, .6s] query is preferred over one that can range [.25s, 10.0s], for example. Large runtime variances can happen even with freshly updated statistics and best practices followed.
When testing in a development environment, one should turn off "cheating" as well to avoid hot/cold runtime variances. From another post...
CHECKPOINT -- flushes dirty pages to disk
DBCC DROPCLEANBUFFERS -- clears data cache
DBCC FREEPROCCACHE -- clears execution plan cache
The last one may have the same effect as the OPTION (RECOMPILE) hint.
The MAXDOP setting and the load on the machine can also make a huge difference in runtime. Materializing a CTE into a temp table is also a good locking-down mechanism and something to consider, as sketched below.
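A hedged sketch of that materialization idea, with purely illustrative names: dump the intermediate result into a temp table once, then join against it, which pins down both the data and the plan for the rest of the batch:

-- Instead of repeating a CTE, materialize its result once
SELECT idCustomer, SUM(fAmount) AS Total
INTO #CustomerTotals
FROM Transactions
GROUP BY idCustomer;

-- Subsequent queries join the temp table rather than re-running the CTE
SELECT c.idCustomer, c.Total
FROM #CustomerTotals AS c
JOIN Customers AS cu ON cu.idCustomer = c.idCustomer;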
The only hint I've ever seen in shipping code was OPTION (FORCE ORDER). A stupid bug in the SQL query optimizer would generate a plan that tried to join an unfiltered varchar to a uniqueidentifier. Adding FORCE ORDER caused it to run the filter first.
I know, overloading columns is bad. Sometimes, you've got to live with it.
The logical plan optimizer doesn't guarantee that it finds the optimal solution: an exact algorithm would be too slow to use on a production server, so greedy algorithms are used instead.
Hence, the rationale behind these hints is to let the user specify the optimal join strategy for the cases where the optimizer can't work out what's really best to adopt.
