CASE statement versus temporary table - sql-server

I have my choice between two different techniques in converting codes to text:
insert into #TMP_CONVERT(code, value)
values
 (1, 'Uno')
,(2, 'Dos')
,(3, 'Tres')
;
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN #TMP_CONVERT tc
on tc.code = x.code
Or
case x.code
when 1 then 'Uno'
when 2 then 'Dos'
when 3 then 'Tres'
else 'Unknown'
end as THE_VALUE
The table has about 20 million rows.
The typical size of the code lookup table is 10 rows.
I would rather do #1, but I don't like left outer joins.
My questions are:
Is one faster than the other in any really meaningful way?
Does any SQL engine optimize this out anyway? That is: does it just read the table into memory and essentially do the case statement logic anyway?
I happen to be using T-SQL, but I would like to know for any number of RDBMSs because I use several.
[Edit to clarify not liking LEFT OUTER JOIN]
I use LEFT OUTER JOINs when I need them, but whenever I use them I double-check my logic and data to confirm I actually need them. Then I add a comment to the code that indicates why I am using a LEFT OUTER JOIN. Of course I have to do a similar exercise when I use an INNER JOIN; that is: make sure I am not dropping data.

There is some overhead to using a join. However, if the code is made a clustered primary key, then the performance of the two might be comparable -- with the join even winning out. Without an index, I would expect the case to be slightly better than the left join.
These are just guesses. As with all performance questions, though, you should check on your data on your systems.
I also want to react to your not liking left joins. These provide important functionality for SQL and are the right way to address this problem.
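For instance, the lookup table could be created with the code as its clustered primary key; a minimal sketch (the column types are assumptions, since the question only shows column names):
CREATE TABLE #TMP_CONVERT
(
    code  int         NOT NULL PRIMARY KEY CLUSTERED,
    value varchar(20) NOT NULL
);

INSERT INTO #TMP_CONVERT (code, value)
VALUES (1, 'Uno'), (2, 'Dos'), (3, 'Tres');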

The code executed for the join will likely be substantially more than the code executed for the hardcoded case options.
The execution plan will have an extra join iterator, along with an extra scan or seek operator (depending on the availability of a suitable index). On the positive side, the 10 rows will likely all fit on a single page in #TMP_CONVERT, and that page will be in memory anyway; being a temp table, it also won't bother with taking and releasing row locks each time. Still, the code to latch the page, locate the correct row, and crack the desired column value out of it over 20,000,000 iterations would likely add some measurable CPU time compared with looking up in a hardcoded list of values. (Potentially you could also try nested CASE statements, to perform a binary search and avoid the need for 10 branches there -- see the sketch below.)
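A rough sketch of that nested CASE idea, using just the three codes from the question (with 10 codes you would split the ranges further):
CASE WHEN x.code <= 2
     THEN CASE x.code WHEN 1 THEN 'Uno' WHEN 2 THEN 'Dos' ELSE 'Unknown' END
     ELSE CASE x.code WHEN 3 THEN 'Tres' ELSE 'Unknown' END
END AS THE_VALUE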
But even if there is a measurable time difference it still may not be particularly significant as a proportion of the query time as a whole. Test it. Let us know what you find...

You can also avoid creating a temporary table in this case by using a WITH construction (a common table expression). So your query might be something like this:
WITH TMP_CONVERT(code, value) AS -- A semicolon may be required before WITH.
(
SELECT * FROM (VALUES (1, 'Uno'),
                      (2, 'Dos'),
                      (3, 'Tres')
) tbl(code,value)
)
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN TMP_CONVERT tc
on tc.code = x.code
Or even a subquery (a derived table) can be used:
coalesce(tc.value, 'Unknown') as THE_VALUE
...
LEFT OUTER JOIN (VALUES (1, 'Uno'),
                        (2, 'Dos'),
                        (3, 'Tres')
) tc(code,value)
ON tc.code = x.code
Hope this can be helpful.

Related

Optimizing array overlap query using && operator

I have a relatively simple query that attempts to calculate the count of rows I'll have to deal with in a later operation. It looks like:
SELECT COUNT(*)
FROM my_table AS t1
WHERE t1.array_of_ids && ARRAY[cast('1' as bigint)];
The tricky piece is that the ARRAY[] portion is determined by the code that invokes the query, so instead of having 1 element as in this example it could have hundreds or thousands. This makes the query take a decent amount of time to run if a user is actively waiting for the calculation to complete.
Is there anything obvious I'm doing wrong or any obvious improvement that could be made?
Thanks!
Edit:
There are not any indexes on the table. I tried to create one with
CREATE INDEX my_index on my_table(array_of_ids);
and it came back with
ERROR: index row requires 8416 bytes, maximum size is 8191
I'm not very experienced here, unfortunately. Maybe there are simply too many rows for an index to be useful?
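Would an index of a different type help? From what I can tell, GIN is the index type that supports the && (overlap) operator on array columns, so maybe something like the following is what's needed (I haven't tried it yet, and I don't know whether it helps with arrays this large):
CREATE INDEX my_index ON my_table USING gin (array_of_ids);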
I ran an explain on the query and the output essentially looks like:
QUERY PLAN | Filter: ((array_of_ids && '{1, 2, 3, 4, 5, 6, 7 ... n}'::bigint[])
so I guess it is automatically doing the ::bigint[]. I tried this as well and the query takes the same time to execute which I guess makes sense.
I realize I'm only pasting a portion of the response to the explain (analyze, buffers, format text) but I'm doing this in psql and my system often runs out of memory. There are tons of --- in the output, I am not sure if there is a way to not have psql do that.
The plan looks pretty simple so is this basically saying there is no way to optimize this? I have two huge arrays and it just takes time to determine an overlap? Not sure if there is a JOIN solution here, I tried to unnest and do a JOIN on equivalent entries but the query never returned so I'm not sure if I got it wrong or if it is just a far slower approach.

SQL Server "shortcut" processing

Here is a simplified version of the problem I'm trying to solve.
I have a temporary table #MyData with 2 columns: description and value.
I also have tables MyRuleSequenceCollection, MyRuleSequence, and MyRule, which I want to use to assess the data in the temporary table. MyRuleSequence is an ordered list of MyRule records, and MyRuleSequenceCollection is an unordered collection of MyRuleSequence records.
One of the sequences looks for records in the temporary table with descriptions "A" and "B", then attempts to divide A by B. The first rule tests for the presence of A, if it's not there, the process should stop. The second rule tests for the presence of B, if it's not there, the process should stop. The third rule tests that B is not 0, if it is 0, the process should stop. Finally, the 4th rule divides A by B and tests whether the result is more than 1.
Temporary table contains:
A 20
B 5
Result: all 4 rules assessed, final result true
Temporary table contains:
A 20
B 0
Result: only first 3 rules run, no divide by 0 error, final result false
Temporary table contains:
B 20
C 5
Result: only first rule runs, final result false
The only way I can see to design this is either with cursors, or worse, with dynamic SQL.
So I'm looking for design suggestions. Considering the above as just an example (many cases are far more complex), can this process be designed to avoid cursors or dynamic SQL? Could recursion be a solution?
Update: several days with no suggestions or input. Does anyone have an opinion about using a CTE for this? Or is that just a cursor with the deallocate handled for you?
Sometimes, there really is no valid choice other than using a database cursor.
So that's what I've done. It works, and performance and resource usage are reasonable. It's a firehose cursor (forward-only, local, static, no locking, plus everything else I could think of to speed it up), and if I have performance problems later I can subset a couple of the joined/aliased tables (3 tables are left joined/aliased 10 times) into a table variable and use that in the cursor to speed it up further.
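For reference, the declaration boils down to something like this (the column name and exact query are made up; the real one joins far more tables):
DECLARE rule_cursor CURSOR LOCAL FORWARD_ONLY STATIC READ_ONLY
FOR
    SELECT r.RuleId
    FROM MyRule AS r
    ORDER BY r.RuleId;

OPEN rule_cursor;
-- FETCH NEXT / WHILE @@FETCH_STATUS = 0 processing loop goes here
CLOSE rule_cursor;
DEALLOCATE rule_cursor;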
I'm left wondering why the automatic aversion to cursors is so strong, even in a case where there isn't a pragmatic alternative.

The Filter geometry function in T-SQL is not precise and does not return valid results

I have two tables, edges (a set of polyline features) and nodes (a set of points).
I want to find the "id" of those points that have an intersection with the first point of any edge. I used the STIntersects() function in SQL and wrote the script below:
select EdgeTable.id, NodeTable.idj
from EdgeTable inner join NodeTable
ON NodeTable.GeomJ.STIntersects(EdgeTable.StartPointG) =1
This code works correctly, but the problem is its performance. As a matter of fact, EdgeTable and NodeTable have about 2 million records, and the execution time of the above script is about 55 minutes, which is not at all suitable for my work.
Thus, I found the Filter geometry function, which seems like a good solution for improving the performance of this task. So I first created spatial indexes on my tables and then used the function as below:
select EdgeTable.id, NodeTable.idj
from EdgeTable inner join NodeTable
ON NodeTable.GeomJ.Filter(EdgeTable.StartPointG) =1
But this function returns completely wrong results. For example, I tested the script for edge 1 (idj = 1), whose start point intersects with node 1; this is the result of the STIntersects() function and this for Filter. According to the description of the Filter geometry function, the method is not deterministic and not precise. So why are the results of these two functions different, and how can I deal with it?
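Since Filter() is documented as a fast, index-only primary filter that can return false positives, is the intended usage perhaps to treat it only as a cheap pre-filter and keep STIntersects() as the final test? Something like the following sketch (untested against my data):
select EdgeTable.id, NodeTable.idj
from EdgeTable inner join NodeTable
ON NodeTable.GeomJ.Filter(EdgeTable.StartPointG) = 1
AND NodeTable.GeomJ.STIntersects(EdgeTable.StartPointG) = 1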

Why does a Recursive CTE in Transact-SQL require a UNION ALL and not a UNION?

I get that an anchor is necessary; that makes sense. And I know that a UNION ALL is needed (if your recursive CTE doesn't have one, it just doesn't work), but I can't find a good explanation of why that is the case. All the documentation just states that you need it.
Why can't we use a UNION instead of a UNION ALL in a recursive query? It seems like it would be a good idea to not include duplicates upon deeper recursion, doesn't it? Something like that should already be working under the hood, I would think.
I presume the reason is that they just haven't considered this a priority feature worth implementing. It looks like Postgres does support both UNION and UNION ALL.
If you have a strong case for this feature you can provide feedback at Connect (or whatever the URL of its replacement will be).
Preventing duplicates from being added could be useful, as a row added in a later step that duplicates one from a previous step will nearly always end up causing an infinite loop or exceeding the max recursion limit.
There are quite a few places in the SQL Standard where example code demonstrates recursive queries written with UNION, such as the example further below.
This article explains how they are implemented in SQL Server. They aren't doing anything like that "under the hood". The stack spool deletes rows as it goes so it wouldn't be possible to know if a later row is a duplicate of a deleted one. Supporting UNION would need a somewhat different approach.
In the meantime you can quite easily achieve the same in a multi statement TVF.
To take a silly example below (Postgres Fiddle)
WITH RECURSIVE R
AS (SELECT 0 AS N
    UNION
    SELECT ( N + 1 )%10
    FROM R)
SELECT N
FROM R
Changing the UNION to UNION ALL and adding a DISTINCT at the end won't save you from the infinite recursion.
But you can implement this as
CREATE FUNCTION dbo.F ()
RETURNS @R TABLE(n INT PRIMARY KEY WITH (IGNORE_DUP_KEY = ON))
AS
BEGIN
    INSERT INTO @R
    VALUES (0); --anchor

    WHILE @@ROWCOUNT > 0
    BEGIN
        INSERT INTO @R
        SELECT ( n + 1 )%10
        FROM @R
    END

    RETURN
END
GO

SELECT *
FROM dbo.F ()
The above uses IGNORE_DUP_KEY to discard duplicates. If the column list is too wide to be indexed you would need DISTINCT and NOT EXISTS instead. You'd also probably want a parameter to set the max number of recursions and avoid infinite loops.
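A sketch of how the loop body might look in that case (same %10 example; just an illustration of the shape, not something I've benchmarked):
WHILE @@ROWCOUNT > 0
BEGIN
    INSERT INTO @R (n)
    SELECT DISTINCT ( r.n + 1 )%10
    FROM @R AS r
    WHERE NOT EXISTS (SELECT * FROM @R AS r2
                      WHERE r2.n = ( r.n + 1 )%10)
END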
This is pure speculation, but I would say that UNION ALL ensures that the result of each iteration can be calculated individually. Essentially it ensures that one iteration cannot interfere with another.
A UNION would require a sort operation in the background, which might modify the result of previous iterations. A program should not change the state of a previous call in the call stack; it should interact with it using input parameters and the result of the subsequent iteration (in a procedural setting). This probably should apply to set-based operations too, and thus to SQL Server's recursive CTEs.
I might be wrong, late night brain-dumps are not 100% reliable :)
Edit (just another thought):
When a recursion starts, you have a call stack. Each level in this stack starts calculating its result, but has to wait for the result of all subsequent calls before it can finish and return its own result. UNION would try to eliminate duplication, but you don't have any records until you reach the termination condition (the final result is built from the bottom to the top), yet the result of the subsequent call is required by the levels above it. The UNION would be reduced to a DISTINCT at the very end.
A good explanation supporting the above speculation is here: https://sqlite.org/lang_with.html :
Optimization note: ...... Very little memory is needed to run the above example. However, if the example had used UNION instead of UNION ALL, then SQLite would have had to keep around all previously generated content in order to check for duplicates. For this reason, programmers should strive to use UNION ALL instead of UNION when feasible.

Assign proper fill factor option for each index

I use SQL Server and want to assign a proper fill factor value for each index. I know the below parameters for each index:
Row count of each table
Number of scans that occurred on each index
Number of seeks that occurred on each index
Number of lookups that occurred on each index
Number of updates that occurred on each index
I know that scans, seeks, and lookups push the fill factor value toward 100 and updates push it down toward 0, but I am looking for a formula to calculate the proper fill factor option according to the above parameters for each table.
EDIT
I use the below script to get the above parameters:
select SCHEMA_NAME(B.schema_id)+'.'+B.name+' \ '+C.name AS IndexName,
A.user_scans,
A.user_seeks,
A.user_lookups,
A.user_updates,
D.rowcnt,
C.fill_factor
from sys.dm_db_index_usage_stats A
INNER JOIN sys.objects B ON A.object_id = B.object_id
INNER JOIN sys.indexes C ON C.object_id = B.object_id AND C.index_id = A.index_id
INNER JOIN sys.sysindexes D ON D.id = B.object_id AND D.indid = A.index_id
Edit 2
I used the below references for the best value for the fill factor option:
Best value for fill factor 1
Best value for fill factor 2
I would use the technique described by Kendra Little from Brent Ozar Unlimited.
Here is the article. She describes her methodology for finding and addressing fill factor issues.
Also, as Remus mentioned in his comments, you should use discretion when messing with the fill factor. I realize a lot of articles on the internet make it sound as though a high fill factor will cause innumerable page splits and ruin your performance, but lowering the fill factor can cause more problems than it solves.
Kendra suggests using the default fill factor and tracking fragmentation over time, and only when an index appears to have a fragmentation issue due to page splits, should you slowly decrease the fill factor. I've been using this technique and I've noticed a much better use of my cache because of how much less my indexes are needlessly inflated.
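A sketch of the kind of periodic check that implies, using the physical stats DMV (the commented-out ALTER uses a made-up index name, and 90 is only an illustrative fill factor):
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Only if a specific index keeps fragmenting due to page splits:
-- ALTER INDEX IX_SomeIndex ON dbo.SomeTable REBUILD WITH (FILLFACTOR = 90);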
"I frequently find that people have put a fillfactor setting of 80 or below on all the indexes in a database. This can waste many GB of space on disk and in memory. This wasted space causes extra trips to storage, and the whole thing drags down the performance of your queries."
Check out this quote in Books Online: "For example, a fill factor value of 50 can cause database read performance to decrease by two times."
So, to put it nicely: I'm not sure that you should start needlessly messing with the fill factor. Observe, study, then act.
