I have created a View based on several tables generated by an application I use at work. The SQL generating the View contains a lot of CASE WHEN expressions, as the raw tables lack some of the logic needed for the reports I run.
One of the things I need is a 0 in certain columns when an Item does not match between tables.
case when e.Item=p.Item then p.ColumnA else 0 end as NewColumnA
ColumnA has data type FLOAT and it seems like NewColumnA is FLOAT as well. But when I run a query on the View and specify that I do not want records where NewColumnA is zero, it runs extremely slowly. However, if I add a ROUND outside the CASE WHEN in the View, it runs much faster (80 seconds vs 0.5 seconds).
round(case when e.Item=p.Item then p.ColumnA else 0 end,6) as NewColumnA
Now, this solves my performance issue, but I assume this is hardly a "best practice" way of solving it. I am also extremely interested in knowing what the problem actually is.
Filtering on computed values in a view will slow down your query, since all the calculations in the view have to be done before the comparison can be made.
Include e.Item and p.Item in your view under different aliases; in this demonstration I use e_item and p_item.
Then apply this in the WHERE clause, and don't include NewColumnA in your WHERE clause:
WHERE
exists(SELECT e_item EXCEPT SELECT p_item)
The above syntax prevents NULL values from giving the wrong result.
If your columns never contain NULL, you can simply use:
WHERE
e_item <> p_item
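Putting this together, here is a minimal sketch of what the view and query could look like. The table names and the join condition here are hypothetical, since the original view definition isn't shown:

CREATE VIEW dbo.ReportView AS
SELECT e.Item AS e_item,
       p.Item AS p_item,
       CASE WHEN e.Item = p.Item THEN p.ColumnA ELSE 0 END AS NewColumnA
FROM dbo.TableE e
LEFT JOIN dbo.TableP p ON p.SomeKey = e.SomeKey  -- hypothetical join condition
GO

-- Filter on the raw columns rather than the computed one:
SELECT *
FROM dbo.ReportView
WHERE EXISTS (SELECT e_item EXCEPT SELECT p_item)

This way the predicate compares the stored columns directly instead of forcing the engine to evaluate the CASE expression for every row first.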
I am reiterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd-party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple query plan setting to make the engine just spool each subquery in turn, working from the inside out. No second-guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time-consuming ways I could coax the optimiser to look at tables in the right order, but even this offers no guarantees. I'm not asking for the ideal 2-second execution time here, just the speed that temp-tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behaviour. Some common ones are:
1. The subquery or CTE may be being repeatedly re-evaluated.
2. Materialising partial results into a #temp table may force a more optimal join order for that part of the plan by removing some possible options from the equation.
3. Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see the Connect item Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP (large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works, however, there are no statistics on the spool.
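For illustration, a sketch of that pattern with made-up table and column names — the concrete TOP combined with ORDER BY can persuade the optimizer to spool the derived table's result once instead of re-evaluating it:

SELECT d.Id, d.Total
FROM (SELECT TOP (2147483647) Id, Total
      FROM dbo.SomeTable       -- hypothetical table
      WHERE Total > 0
      ORDER BY Id) AS d
JOIN dbo.OtherTable o ON o.SomeId = d.Id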
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics, might get a better plan. Failing that, you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and which could potentially coax it into achieving that result in some instances. This hint will sometimes produce a more efficient plan for a complex query where the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
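For example (hypothetical tables), the hint goes at the end of the statement and makes the join order follow the FROM clause:

SELECT *
FROM dbo.Small s
JOIN dbo.Medium m ON m.SmallId = s.Id
JOIN dbo.Large l ON l.MediumId = m.Id
OPTION (FORCE ORDER)  -- joins are performed in the written order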
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this question) is to use a user-defined function. Multi-statement table-valued functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a SELECT statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty smallint NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END
GO

CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to the PK, which was meaningless for the query but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple, and filtering the few remaining results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to be executed first by limiting the results using TOP; however, this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose, since SQL Server just ignores it.
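A sketch of that hacky approach against the same tables, with a deliberately ridiculous limit:

SELECT bar.*
FROM
(SELECT TOP 2147483647 * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())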
How does the WITH clause work in SQL Server? Does it really give me a performance boost, or does it just help to make scripts more readable?
When is it right to use it? What should you know about the WITH clause before you start to use it?
Here's an example of what I'm talking about:
http://www.dotnetspider.com/resources/33984-Use-With-Clause-Sql-Server.aspx
I'm not entirely sure about performance advantages, but I think it can definitely help in the case where using a subquery would result in the subquery being performed multiple times.
Apart from that, it can definitely make code more readable, and it can also be used where multiple subqueries would otherwise be a cut-and-paste of the same code in different places.
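For example, rather than pasting the same subquery in two places, you can define it once and reference it twice (the table here is made up):

WITH RegionTotals AS
(
    SELECT Region, SUM(Amount) AS Total
    FROM dbo.Sales        -- hypothetical table
    GROUP BY Region
)
SELECT r.Region, r.Total
FROM RegionTotals r
WHERE r.Total > (SELECT AVG(Total) FROM RegionTotals)  -- second reference, no copy/paste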
What should you know before you use it?
A big downside is that when you have a CTE in a view, you cannot create a clustered index on that view. This can be a big pain because SQL Server does not have materialised views, and it has certainly bitten me before.
Unless you use its recursive abilities, a CTE is no better performance-wise than a simple inline view.
It just saves you some typing.
The optimizer is free to decide whether to re-evaluate it or not when it is being reused, and in most cases it decides to re-evaluate:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will return two different NEWID values.
Note that other engines may behave differently.
PostgreSQL, unlike SQL Server, materializes the CTEs.
Oracle supports a special hint, /*+ MATERIALIZE */, that tells the optimizer whether it should materialize the CTE or not.
WITH is a keyword in SQL that gives a name to a temporary result set, which can then be referenced in the statement that follows. Example:

with a (id)   -- here "a" is the named result set and "id" is its column
as (select column_name from table_name)
select * from a
What are examples of SQL Server Query features/clauses that should be avoided?
Recently I got to know that the NOT IN clause degrades performance badly.
Do you have more examples?
The reason to avoid NOT IN isn't really performance, it's that it has really surprising behaviour when the set contains a null. For example:
select 1
where 1 not in (2,null)
This won't return any rows, because the WHERE is interpreted as:
where 1 <> 2 and 1 <> null
First, 1 <> null evaluates to unknown. Then 1 <> 2 AND unknown evaluates to unknown (true AND unknown is unknown). So you won't receive any rows.
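A common alternative that sidesteps this is NOT EXISTS, which simply treats the NULL row as "no match". A minimal sketch using an inline VALUES list:

-- returns a row, unlike the NOT IN version above
select 1
where not exists (select * from (values (2), (null)) as t(v) where t.v = 1)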
I avoid correlated subqueries (non-correlated subqueries and derived tables are OK) and any cursors that I can avoid. I also avoid WHILE loops if I can. Think in terms of sets of data, not row-by-row processing.
If you are using a UNION, check to see if UNION ALL will work instead. There is a potential results difference, so make sure before you make the change.
I always look at the word DISTINCT as a clue that there may be a better way to present the data. DISTINCT is costly compared to using a derived table or some other method to avoid its use.
Avoid the implied join syntax, to avoid getting an accidental cross join (which people often fix, shudder, with DISTINCT). Generating a list of 4 million records and then DISTINCT-ing it to get the three you want is costly.
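To illustrate with made-up tables: with the implied (comma) syntax, forgetting the join predicate silently produces a cross join, whereas the explicit syntax won't even parse without an ON clause:

-- accidental cross join: every customer paired with every order
SELECT c.Name, o.Total
FROM Customers c, Orders o
WHERE c.Region = 'West'

-- explicit join syntax forces you to state the relationship
SELECT c.Name, o.Total
FROM Customers c
JOIN Orders o ON o.CustomerId = c.CustomerId
WHERE c.Region = 'West'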
Avoid views that call other views! We have some folks who designed one whole client database that way and performance is HORRIBLE! Do not go down that path.
Avoid syntax like
WHERE MyField LIKE '%test%'
That and other non-sargable WHERE clauses can keep the optimizer from using indexes.
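If a prefix search meets the requirement, the sargable form lets the optimizer seek on an index over MyField:

WHERE MyField LIKE 'test%'   -- no leading wildcard, so an index seek is possible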
Avoid
CURSOR - use set based ops
SELECT * - name your columns explicitly
EXEC(@dynamic_sql_with_input_parms) - use sp_executesql with input parameters instead.
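A minimal sketch of the sp_executesql pattern (the table name and parameter are made up):

DECLARE @sql nvarchar(max) =
    N'SELECT * FROM dbo.Orders WHERE CustomerId = @cust';
EXEC sp_executesql @sql, N'@cust int', @cust = 42;  -- parameterised and plan-cache friendly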
I recently changed a view from
Select Row1, Row2 FROM table Where blahID = FKblahID
UNION
Select Row1, Row2 FROM table2 Where blah2ID = FKblahID
to just
Select Row1, Row2 FROM table Where blahID = FKblahID
and saw a query that was taking ~8 minutes now run in only ~20 seconds; I'm not 100% sure why it made such a big change.
The second SELECT in the union was only returning about 200 records, while the first was returning a couple of thousand.
Which are more performant, CTE or Temporary Tables?
It depends.
First of all
What is a Common Table Expression?
A (non-recursive) CTE is treated very similarly to other constructs that can also be used as inline table expressions in SQL Server: derived tables, views, and inline table-valued functions. Note that whilst BOL says that a CTE "can be thought of as a temporary result set", this is a purely logical description. More often than not it is not materialised in its own right.
What is a temporary table?
This is a collection of rows stored on data pages in tempdb. The data pages may reside partially or entirely in memory. Additionally the temporary table may be indexed and have column statistics.
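For instance, you can give a temp table a primary key and a secondary index up front, which is something a CTE never allows (the names here are illustrative):

CREATE TABLE #Work (Id INT PRIMARY KEY, CustomerId INT, Total MONEY);
CREATE INDEX IX_Work_Customer ON #Work (CustomerId);
-- column statistics are created automatically as the table is queried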
Test Data
CREATE TABLE T(A INT IDENTITY PRIMARY KEY, B INT , F CHAR(8000) NULL);
INSERT INTO T(B)
SELECT TOP (1000000) 0 + CAST(NEWID() AS BINARY(4))
FROM master..spt_values v1,
master..spt_values v2;
Example 1
WITH CTE1 AS
(
SELECT A,
ABS(B) AS Abs_B,
F
FROM T
)
SELECT *
FROM CTE1
WHERE A = 780
Notice that the execution plan for this query contains no mention of CTE1. It just accesses the base tables directly and is treated the same as
SELECT A,
ABS(B) AS Abs_B,
F
FROM T
WHERE A = 780
Rewriting by materializing the CTE into an intermediate temporary table here would be massively counterproductive.
Materializing the CTE definition of
SELECT A,
ABS(B) AS Abs_B,
F
FROM T
would involve copying about 8 GB of data into a temporary table, and then there is still the overhead of selecting from it too.
Example 2
WITH CTE2
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY A) AS RN
FROM T
WHERE B % 100000 = 0)
SELECT *
FROM CTE2 T1
CROSS APPLY (SELECT TOP (1) *
FROM CTE2 T2
WHERE T2.A > T1.A
ORDER BY T2.A) CA
The above example takes about 4 minutes on my machine.
Only 15 rows out of the 1,000,000 randomly generated values match the predicate, but the expensive table scan happens 16 times to locate them.
This would be a good candidate for materializing the intermediate result. The equivalent temp table rewrite took 25 seconds.
CREATE TABLE #T(A INT PRIMARY KEY, B INT, F CHAR(8000) NULL, RN BIGINT);

INSERT INTO #T
SELECT *,
ROW_NUMBER() OVER (ORDER BY A) AS RN
FROM T
WHERE B % 100000 = 0
SELECT *
FROM #T T1
CROSS APPLY (SELECT TOP (1) *
FROM #T T2
WHERE T2.A > T1.A
ORDER BY T2.A) CA
Intermediate materialisation of part of a query into a temporary table can sometimes be useful even if it is only evaluated once - when it allows the rest of the query to be recompiled taking advantage of statistics on the materialized result. An example of this approach is in the SQL Cat article When To Break Down Complex Queries.
In some circumstances SQL Server will use a spool to cache an intermediate result, e.g. of a CTE, and avoid having to re-evaluate that subtree. This is discussed in the (migrated) Connect item Provide a hint to force intermediate materialization of CTEs or derived tables. However, no statistics are created on this, and even if the number of spooled rows were hugely different from the estimate, it is not possible for the in-progress execution plan to adapt dynamically in response (at least in current versions; Adaptive Query Plans may become possible in the future).
I'd say they are different concepts but not too different to say "chalk and cheese".
A temp table is good for re-use or to perform multiple processing passes on a set of data.
A CTE can be used either to recurse or simply to improve readability.
And, like a view or inline table-valued function, it can also be treated like a macro to be expanded in the main query.
A temp table is another table with some rules around scope
I have stored procs where I use both (and table variables too)
CTEs have their uses - when the data in the CTE is small and there is a strong readability improvement, as with recursive cases. However, their performance is certainly no better than table variables, and when one is dealing with very large tables, temporary tables significantly outperform CTEs. This is because you cannot define indexes on a CTE (a CTE is simply expanded like a macro), which hurts when you have a large amount of data that requires joining with another table. If you are joining multiple tables with millions of rows of records in each, a CTE will perform significantly worse than temporary tables.
Temp tables are backed by tempdb on disk - so as long as your CTE's data can be held in memory, it would most likely be faster (like a table variable, too).
But then again, if the data load of your CTE (or table variable) gets too big, it will be stored on disk too, so there's no big benefit.
In general, I prefer a CTE over a temp table since it's gone after I used it. I don't need to think about dropping it explicitly or anything.
So, no clear answer in the end, but personally, I would prefer CTE over temp tables.
So the query I was assigned to optimize was written with two CTEs in SQL Server. It was taking 28 seconds.
I spent two minutes converting them to temp tables and the query took 3 seconds.
I added an index to the temp table on the field it was being joined on and got it down to 2 seconds.
Three minutes of work and now it's running 12x faster, all by removing the CTEs. I personally will not use CTEs ever; they are tougher to debug as well.
The crazy thing is the CTEs were both only used once, and putting an index on them still proved to be 50% faster.
I've used both, but in massive complex procedures I have always found temp tables better to work with and more methodical. CTEs have their uses, but generally with small data.
For example, I've created sprocs that come back with the results of large calculations in 15 seconds, yet after converting that code to run in a CTE I have seen it take in excess of 8 minutes to achieve the same results.
A CTE won't take any physical space; it is just a result set we can join to.
Temp tables are temporary; we can create indexes and constraints on them just as with normal tables, but we need to define all the columns ourselves.
A temp table's scope is limited to the session that created it.
EX:
Open two SQL query window
create table #temp(empid int, empname varchar(10))
insert into #temp
select 101,'xxx'
select * from #temp
Run this query in the first window,
then run the query below in the second window and you can see the difference: the second session cannot see #temp.
select * from #temp
Late to the party, but...
The environment I work in is highly constrained, supporting some vendor products and providing "value-added" services like reporting. Due to policy and contract limitations, I am not usually allowed the luxury of separate table/data space and/or the ability to create permanent code [it gets a little better, depending upon the application].
IOW, I can't usually develop a stored procedure or UDFs or temp tables, etc. I pretty much have to do everything through MY application interface (Crystal Reports - add/link tables, set where clauses from w/in CR, etc.). One SMALL saving grace is that Crystal allows me to use COMMANDS (as well as SQL Expressions). Some things that aren't efficient through the regular add/link tables capability can be done by defining a SQL Command. I use CTEs through that and have gotten very good results "remotely". CTEs also help w/ report maintenance, not requiring that code be developed, handed to a DBA to compile, encrypt, transfer, install, and then require multiple-level testing. I can do CTEs through the local interface.
The down side of using CTEs w/ CR is, each report is separate. Each CTE must be maintained for each report. Where I can do SPs and UDFs, I can develop something that can be used by multiple reports, requiring only linking to the SP and passing parameters as if you were working on a regular table. CR is not really good at handling parameters into SQL Commands, so that aspect of the CR/CTE aspect can be lacking. In those cases, I usually try to define the CTE to return enough data (but not ALL data), and then use the record selection capabilities in CR to slice and dice that.
So... my vote is for CTEs (until I get my data space).
One use where I found CTEs excelled performance-wise was where I needed to join a relatively complex query onto a few tables which had a few million rows each.
I used the CTE to first select the subset based on the indexed columns, cutting those tables down to a few thousand relevant rows each, and then joined the CTE to my main query. This dramatically reduced the runtime of my query.
Whilst results for the CTE are not cached, and table variables might have been a better choice, I really just wanted to try them out and found they fit the above scenario.
I just tested this - both the CTE and non-CTE version (where the query was typed out for every union instance) took ~31 seconds. The CTE made the code much more readable though - it cut it down from 241 to 130 lines, which is very nice. The temp table version, on the other hand, cut it down to 132 lines and took FIVE SECONDS to run. No joke. All of this testing was cached - the queries were all run multiple times before.
This is a really open-ended question, and it all depends on how it's being used and the type of temp table (table variable or traditional table).
A traditional temp table stores its data in tempdb, which can slow it down; table variables are also backed by tempdb, but they carry no column statistics, so the optimizer treats them differently.
From my experience in SQL Server, I found one of the scenarios where a CTE outperformed temp tables.
I needed to use a dataset (~100,000 rows) from a complex query just ONCE in my stored procedure.
The temp table was causing an overhead where my procedure was performing slowly (as temp tables are real materialized tables that exist in tempdb and persist for the life of my current procedure).
On the other hand, with a CTE, the CTE persists only until the following query is run. So a CTE is a handy in-memory structure with limited scope; CTEs don't use tempdb by default.
This is one scenario where CTEs can really help simplify your code and outperform a temp table.
I had used 2 CTEs, something like:
WITH CTE1 (ID, Name, Display)
AS (SELECT ID, Name, Display FROM Table1 WHERE <Some Condition>),
CTE2 (ID, Name, <col3>)
AS (SELECT ID, Name, <col3> FROM CTE1 INNER JOIN Table2 ON <Some Condition>)
SELECT CTE2.ID, CTE2.<col3>
FROM CTE2
GO