SQL Server: Are temp tables or unions better?

Generally speaking, when combining a lot of data, is it better to use a temp table/table variable as a staging area, or should I just stick to UNION ALL?
Assumptions:
No further processing is needed; the results are sent directly to the client.
The client waits for the complete recordset, so streaming results isn't necessary.

I would stick to UNION ALL. If there's no need for intermediate processing that would require a temp table, then I would not use one.
Inserting data into a temp table (even a table variable, which despite the myths is not a purely "in memory" structure) involves work in tempdb, which can be a bottleneck. To then just SELECT it back as-is, with no special processing, is unnecessary and bloats the code; when you just need to return data without any special processing, the temp table approach seems a bit "round the houses". If I thought there was a reason to justify a temp table, I would run some like-for-like performance tests with and without it and compare the stats (duration, reads, writes, CPU). Actual performance tests are the best way to be confident that whichever approach you choose is the best one - especially as you don't have to be using temp tables for work to be pushed into tempdb; depending on your queries, tempdb may be involved anyway.
To clarify, I'm not saying one is better than the other full stop. As with most things, it depends on the scenario. In the scenario described, it sounds like you'd be adding an extra step that doesn't add any functional value, and I can't see that you'd gain anything other than a slightly more complicated/lengthy query.
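For the like-for-like test mentioned above, a minimal sketch (dbo.SourceA and dbo.SourceB are hypothetical stand-ins for your real sources) is to wrap each variant in SET STATISTICS IO/TIME and compare the figures in the Messages tab:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- variant 1: UNION ALL straight to the client; run the temp table variant
-- in a separate batch and compare logical reads, CPU time and elapsed time
SELECT c1, c2, c3 FROM dbo.SourceA
UNION ALL
SELECT c1, c2, c3 FROM dbo.SourceB;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;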

One advantage of temp tables I can think of is that you can apply indexes to them, which should help when dealing with lots of data where you need to get results back as quickly as possible.
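As a rough sketch of that idea (table and column names here are made up), you can index the temp table before the final read so a later join or filter can seek instead of scan:
CREATE TABLE #Results (c1 int, c2 varchar(20), c3 bit);

INSERT INTO #Results (c1, c2, c3)
SELECT c1, c2, c3 FROM dbo.SourceA;

CREATE NONCLUSTERED INDEX IX_Results_c1 ON #Results (c1);

SELECT r.c1, r.c2, r.c3
FROM #Results r
JOIN dbo.OtherTable o ON o.c1 = r.c1;  -- the index supports this join

DROP TABLE #Results;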

For what it is worth, I just did a performance comparison between two approaches to retrieve identical datasets:
SELECT c1, c2, c3 FROM ... ON ... WHERE
UNION ALL
SELECT c1, c2, c3 FROM ... ON ... WHERE /*(repeated 8 times)*/
vs
CREATE TABLE #Temp (c1 int, c2 varchar(20), c3 bit)
INSERT INTO #Temp (c1, c2, c3) SELECT c1, c2, c3 FROM ... WHERE... /*(repeat 8 times)*/
SELECT c1, c2, c3 FROM #Temp
The second approach (the temporary table) was about 5% slower than the union, and when I artificially scaled up the number of repeats, the second approach became even slower.

Not specific to UNION ALL...
Using a temp table might have an advantage from a concurrency point of view: depending on the query, the isolation level and the performance of the clients/network link, staging into a temp table can serve to minimize read lock times. Just don't use SELECT ... INTO ... to create the table.
In the general case, UNION ALL avoids the overhead of an unnecessary work table.
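To illustrate the SELECT ... INTO point (hypothetical names again), define the temp table up front and then insert, rather than creating the table on the fly inside the query:
-- rather than: SELECT c1, c2, c3 INTO #Staging FROM dbo.SourceA;
CREATE TABLE #Staging (c1 int, c2 varchar(20), c3 bit);

INSERT INTO #Staging (c1, c2, c3)
SELECT c1, c2, c3 FROM dbo.SourceA;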

I tend to use UNION ALL only where I have a limited number of unions and a relatively limited number of columns returned. Table-typed variables are another possibility (especially from 2014 on) and allow you to enforce a common structure when similar result sets are built in more than one location.
UNION ALL avoids intermediate steps, but:
1) it can lead to bloated, hard-to-maintain code
2) it can lead to unmanageable query plans - if they get too big, the plan-viewing tools in SQL Server can't actually display them
3) if parts of a complex union are similar, or may be used elsewhere in your system, consider using table-valued functions or stored procedures to facilitate code reuse, whether you go for table-typed variables, UNION ALL or temp tables
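A rough sketch of the table-typed variable option (the type, table and column names are hypothetical), which lets every location that builds this result set share one declared structure:
CREATE TYPE dbo.ResultRow AS TABLE (c1 int, c2 varchar(20), c3 bit);
GO

DECLARE @Results dbo.ResultRow;

INSERT INTO @Results (c1, c2, c3)
SELECT c1, c2, c3 FROM dbo.SourceA;

INSERT INTO @Results (c1, c2, c3)
SELECT c1, c2, c3 FROM dbo.SourceB;

SELECT c1, c2, c3 FROM @Results;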

Related

Can joining with an iTVF be as fast as joining with a temp table?

Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function uf_GetVisibleCustomers(@cUserId). The iTVF wraps a view CustomerView and filters out all rows containing data for customers whom the requesting user is not permitted to see. This way, should selection criteria ever change in the future for certain user types, we won't have to implement that new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. Changed database object names here just so it's easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
CREATE TABLE #tC ( idCustomer INT )
INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester')
SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer
… is orders of magnitude faster than my original (IMO more readable, likely to be used) SQL here…
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
I don't get why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. Maybe worth noting that there is a WHERE clause, however I have omitted it here for simplicity; it is the same for both queries.
Execution Plan
The execution plan for the first query (not reproduced here) shows it enjoys automatic parallelism, as mentioned. The plan for the latter query isn't worth displaying because it's just massive (it expands the entire iTVF, the underlying view and subqueries). Anyway, the latter also does not execute in parallel (AFAIK) to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function - it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could try: https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views and associated indexes. If your statistics are not up to date, you can get a bad plan; that won't happen with your temp table. On that note, how long does it take to populate that temp table?
Another thing to look at (again, this is why DDL is helpful): are the data types the same for Transactions.idCustomer and #tC.idCustomer? I see a hash match in the plan you posted, which seems bad for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down, but would appear to have a more dramatic impact on the query that leverages your iTVF.
Again, the above is speculation based on my experience. A couple of quick things to try (not as a permanent fix, but for troubleshooting):
1. Check whether re-compiling your query when using the iTVF speeds things up (this would be a sign of bad stats or of a bad execution plan being cached and re-used)
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query or by using make_parallel() by Adam Machanic.
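As a sketch, both suggestions can be tried directly against the slow query from the question (OPTION (RECOMPILE) is one way to test the recompilation idea; the trace flag is undocumented, so treat it strictly as a troubleshooting aid):
-- 1. rule out a stale cached plan / stale stats
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (RECOMPILE);

-- 2. force a parallel plan for comparison
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (QUERYTRACEON 8649);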

How can I reuse the results of a WITH clause for two queries in Redshift / Postgres?

(Asking about Redshift's subset of PostgreSQL)
This is an "I want a pony" question: I know how to do this using explicit temporary tables, but would prefer to avoid that.
I would like to run two completely different selects in one query (i.e. generate two result tables from one ;-terminated query). To be explicit, I would like to (ab)use the WITH syntax to make use of the temporary tables generated internally by Redshift. To wit:
with TempTable as ( some query ),
AnotherTempTable as ( some other query ),
YetAnotherTable as ( some query based on TempTable and AnotherTempTable )
select something-joined-with-The-tables-above, (comma?)
...
select something-else-joined-with-The-tables-above;
Being able to "output" the results of the WITH clauses would not only be fine, but ideal. I.e. a yet undiscovered modifier to select:
with Stuff as (select blah from ... force_output) ...
And once again, I want two different tables put out, not just one humongously-wide OUTER JOIN with lots of empty space.
The point behind this is to rely on as much Redshift internals as possible. The theory is that the less you explicitly state, the more Redshift can optimize internally. This SHOULD provide for a faster and less resource-intensive query, as well as make use of any "new" optimizations that might be silently dropped in by Amazon.
This is ESPECIALLY important on very large datasets, which Redshift is meant for.
Please note that this is NOT a duplicate of this question. This is about reusing the results of with clauses (i.e. embedded selects)
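For reference, the explicit temp-table workaround the question is trying to avoid might look roughly like this in Redshift (all table and column names here are invented):
CREATE TEMP TABLE tmp_stuff AS
SELECT customer_id, SUM(amount) AS total
FROM sales
GROUP BY customer_id;

-- two independent result sets, both reusing the staged rows
SELECT c.name, t.total
FROM tmp_stuff t
JOIN customers c ON c.id = t.customer_id;

SELECT o.order_id, o.order_date, t.total
FROM tmp_stuff t
JOIN orders o ON o.customer_id = t.customer_id;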

Is using "select *" for a cursor in PL/SQL considered bad programming?

Often I use cursors in this way:
for rec in (select * from MY_TABLE where MY_COND = ITION) loop
  if rec.FIELD1 = 'something' then
    do_something();
  end if;
  if rec.FIELD2 <> 'somethingelse' then
    blabla();
  end if;
end loop;
My team leader told me not to use select * because it is bad programming, but I don't understand why (in this context).
Using select * in your code is what I would call lazy programming, with several nasty side effects. How much you experience those side effects will differ, but it's never positive.
I'll use some of the points already mentioned in other answers, but feel free to edit my answer and add some more negative points about using select *.
You are shipping more data from the SQL engine to your code than necessary, which has a negative effect on performance.
The information you get back needs to be placed in variables (a record variable for example). This will take more PGA memory than necessary.
By using select * you will never use an index alone to retrieve the wanted information; you'll always have to visit the table as well (provided no index exists which holds all columns of the table). Again, this has a negative effect on performance.
It is less clear to people maintaining your code what your intention is. They need to delve into the code and spot all occurrences of your record variable to know what is being retrieved.
You will not use SQL functions to perform calculations, but always rely on PL/SQL or Java calculations. You are possibly missing out on some great SQL improvements like analytic functions, model clause, recursive subquery factoring and the like.
From Oracle 11g onwards, dependencies are tracked at column level, meaning that when you use select *, your code is marked in the data dictionary as "dependent on all columns" of that table. Your procedure will be invalidated when something happens to one of those columns. So using select * means your code will be invalidated more often than necessary.
Again, feel free to add your own points.
Selecting more fields than you need has several drawbacks:
Less clear - an explicit select c1, c2 shows at a glance which columns are pulled, without the need to pore over the code.
...also less clear for the people responsible for administration/tuning of the DB - they might only see the queries in logs; better not to force them to analyze the code that generated the queries.
Prevents some query optimizations - with an index on c2, select c2 from t where c2 <= 5 has a chance to pull the c2 value from the index itself, without fetching the rows. select * makes this impossible.
select * will pull in every single field from your table. If you need them all, then it's acceptable. However, more often than not, you won't need all of them, so why bother bringing in all that extra data?
Instead select only the fields you care about.
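For the loop in the question, which only touches FIELD1 and FIELD2, that would look like this:
for rec in (select FIELD1, FIELD2 from MY_TABLE where MY_COND = ITION) loop
  if rec.FIELD1 = 'something' then
    do_something();
  end if;
  if rec.FIELD2 <> 'somethingelse' then
    blabla();
  end if;
end loop;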
SELECT * is problematic if the query depends on the order or number of the columns, e.g.:
INSERT INTO X (A, B)
SELECT * FROM T
WHERE T.NR = 113
But in your case, it's not really problematic. It could be optimized if it's really pulling in much more data than required, but in most cases it makes no difference.
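A safer version names the columns on both sides, so it keeps working even if T gains or reorders columns (assuming T exposes matching columns A and B):
INSERT INTO X (A, B)
SELECT A, B
FROM T
WHERE T.NR = 113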
The select * construct is likely to have a performance hit, bringing in more information than necessary. In addition, the code is likely to generate maintenance problems: as the database changes, bringing in all the fields can have unlooked-for effects.
EDIT
Unlooked-for effects are mainly those listed by Codo and Rob van Wijk:
if the query depends on the order of columns;
lack of clarity for later changes to the code;
non-use of indexes.
I was not aware of the column-level dependency tracking mentioned by Rob, and had in mind that if a change was made to a column, it could invalidate the code (extra columns being retrieved causing overflows, or a query depending on the presence of a particular column).
These unlooked-for effects are together the cause of the maintenance problems mentioned.

Is it better to use one complex query or several simpler ones?

Which option is better:
Writing a very complex query with a large number of joins, or
Writing two queries one after the other, applying the result set obtained from the first query to the other.
Generally, one query is better than two, because the optimizer has more information to work with and may be able to produce a more efficient query plan than it could for either query separately. Additionally, using two (or more) queries typically means you'll be running the second query multiple times, and the DBMS might have to generate its query plan repeatedly (though not if you prepare the statement and pass the parameters as placeholders when the query is re-executed). A single query also means fewer back-and-forth exchanges between the program and the DBMS; if your DBMS is on a server on the other side of the world (or country), this can be a big factor.
Arguing against combining the two queries, you might end up shipping a lot of repetitive data between the DBMS and the application. If each of 10,000 rows in table T1 is joined with an average of 30 rows from table T2 (so there are 300,000 rows returned in total), then you might be shipping a lot of data repeatedly back to the client. If the row size of (the relevant projection of) T1 is relatively small and the data from T2 is relatively large, then this doesn't matter. If the data from T1 is large and the data from T2 is small, then this may matter; measure before deciding.
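As a sketch with the hypothetical tables T1 and T2 (invented columns), the trade-off between the two shapes looks like this:
-- one round trip, but T1's wide columns are repeated for every matching T2 row
SELECT t1.id, t1.wide_description, t2.detail
FROM T1 t1
JOIN T2 t2 ON t2.t1_id = t1.id;

-- two round trips, but each T1 row and each T2 row is shipped only once
SELECT id, wide_description FROM T1;
SELECT t1_id, detail FROM T2 WHERE t1_id IN (SELECT id FROM T1);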
When I was a junior DB person I once worked for a year in a marketing dept where I had so much free time I did each task 2 or 3 different ways. I made a habit of writing one mega-select that grabbed everything in one go and comparing it to a script that built interim tables of selected primary keys and then once I had the correct keys went and got the data values.
In almost every case the second method was faster. The cases where it wasn't were when dealing with a small number of small tables. Where it was most noticeably faster was, of course, with large tables and multiple joins.
I got into the habit of selecting the required primary keys from tableA, then the required primary keys from tableB, and so on; joining them to select the final set of primary keys; and then using those keys to go back to the tables and get the data values.
As a DBA I now understand that this method resulted in less purging of the data cache and played nicer with others using the DB (as mentioned by Amir Raminfar).
It does however require the use of temporary tables, which some places/DBAs don't like (unfairly, in my mind).
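A sketch of that keys-first pattern (all table and column names are hypothetical): stage just the qualifying primary keys, then go back for the data values:
CREATE TABLE #Keys (idA int, idB int);

INSERT INTO #Keys (idA, idB)
SELECT a.idA, b.idB
FROM tableA a
JOIN tableB b ON b.idA = a.idA
WHERE a.some_filter = 1
  AND b.other_filter = 1;

SELECT a.col1, a.col2, b.col3
FROM #Keys k
JOIN tableA a ON a.idA = k.idA
JOIN tableB b ON b.idB = k.idB;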
Depends a lot on the actual query and the actual database, e.g. SQL Server, Oracle or MySQL.
At large companies, they prefer option 2 because option 1 will hog the database CPU. This results in all other connections being slow and everything becoming a bottleneck. That being said, it all depends on your data and the amount you are joining. If you are joining 10,000 rows to 1,000 rows, you could get back up to 10,000 x 1,000 records in the worst case (assuming an inner join where every row matches).
Possible duplicate: MySQL JOIN Abuse? How bad can it get?
Assuming "better" means "faster", you can easily test these scenarios in a junit test. Note that a determining factor that you may not be able to get from a unit test is network latency. If the database sits right next to your machine where you run the unit test, you may see no difference in performance that is attributed to the network. If your production servers are in another town, country, or continent from the database, network traffic becomes more of a bottleneck. You do not want to go back and forth across the wire- you more likely want to make one round trip and get everything at once.
Again, it all depends :)
It could depend on many things:
the indexes you have set up,
how many tables,
what the actual query is,
how big the data set is,
what the underlying DB is,
what table engine you are using.
The best thing to do would probably be to test both methods on a variety of test data and see which one bottlenecks.
If you are using MySQL (and maybe Oracle?) you can use
EXPLAIN SELECT .....
and it will give you a lot of info on how it will execute the query, and therefore how you can improve it, etc.

Create a large dataset for speed testing

I need a Microsoft SQL Server 2005 or above stored procedure that will create a large number of rows (example: one million) so that I can then try various things like seeing how much slower SELECT * is compared to selecting each individual field name, or selecting from a view that selects from another view rather than selecting directly from the tables.
Does that make sense?
If it is just the number of rows you want and you don't mind having the same content in each row, then you can do this in SQL Server Management Studio easily. Write your insert statement to insert a single row, then use:
GO 1000000
This will execute the batch the number of times specified after the GO statement.
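A minimal sketch of that approach (the table and values are made up; GO n is an SSMS/sqlcmd batch separator feature, not T-SQL itself):
CREATE TABLE dbo.SpeedTest (ID int IDENTITY(1, 1) PRIMARY KEY, Name varchar(50), Age int);
GO

INSERT INTO dbo.SpeedTest (Name, Age) VALUES ('John', 30);
-- repeat the preceding batch one million times
GO 1000000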
If you need different data per row (or cannot have duplicate data because of indexes, etc.), then there are tools such as SQL Data Generator that will help. They enable you to define the type of data that gets generated so that the tool produces realistic data.
I can tell you right now how much slower it is to perform SELECT * instead of SELECT specific_column_names. If the columns you are selecting are not covered by any index, it will make hardly any difference at all; if the columns you would normally be selecting are covered by an index, and the table contains any significant amount of data, it will be an order of magnitude slower, maybe worse.
Here's a quick and dirty example. First create the test schema and data:
CREATE TABLE #TestTable
(
ID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
Name varchar(50) NOT NULL,
Age int NOT NULL
)
INSERT #TestTable (Name, Age)
SELECT 'John', s1.number % 10 + 25
FROM master.dbo.spt_values s1
CROSS JOIN master.dbo.spt_values s2
WHERE s1.type = 'P' AND s2.type = 'P'
AND s2.number < 20
CREATE INDEX IX_#TestTable_Age ON #TestTable (Age)
Now run this query in SSMS and turn on the actual execution plan:
SELECT ID
FROM #TestTable
WHERE Age = 30
SELECT *
FROM #TestTable
WHERE Age = 30
The first SELECT is executed as an index seek, which on my machine is 7% of the total cost. On the second query, the optimizer decides that the IX_#TestTable_Age index isn't worth it and does a clustered index scan instead, using up 93% of the total cost, or 13 times as expensive as the non-SELECT * version.
If we force a nested loop key lookup, to mimic the absence of a clustered index or a very large clustered index, it gets even worse:
SELECT *
FROM #TestTable
WITH (INDEX(IX_#TestTable_Age))
WHERE Age = 30
This takes more than 100 times as long as the covering query. Compared to the very first query, the cost is simply astronomical.
Why I bothered to write all that information:
Before you start going out and "testing" things, you need to shake off the common misconception that the exact order in which you write your query statements, or irrelevant factors like views selecting from other views, actually make any appreciable difference if your database is even remotely optimized.
Indexing is the first thing that matters in the area of database performance. How you use them is the second thing that matters. The way in which you write your query may matter - such as performing a SELECT * when your WHERE condition is on anything other than the clustered index, or using non-sargable functions like DATEPART in your WHERE condition, but for the most part, chucking a bunch of random data into a table without seriously thinking about how the table will actually be used is going to give you mostly-meaningless results in terms of performance.
Data generators are useful when you are planning a large project and need to perform scalability tests. If you are simply experimenting, trying to understand performance differences between different types of queries in an abstract sense, then I would have to say that you'll be better off just grabbing a copy of the Northwind or AdventureWorks database and banging around on that one - it's already normalized and indexed and you'll be able to glean meaningful information about query performance in an actual production database.
But even more importantly than that, before you even start to think about performance in a SQL database, you need to actually start reading about performance and understand what factors affect it. As I mentioned, the number one factor is indexing. Other factors include sort orders, selectivity, join types, cursor types, plan caching, and so on. Don't just go and start fooling around, thinking you'll learn how best to optimize a database.
Educate yourself before fumbling around. I would start with the slightly-dated but still comprehensive Improving SQL Server Performance article from Microsoft Patterns and Practices. Also read about Indexing Basics and Covering Indexes. Then go to sites like SQL Server Performance and try to absorb whatever you can from the articles.
Then, and only then, should you start playing around with large-scale test data. If you're still not completely sure why a SELECT * can hurt performance then it is way too early to be running tests.
Take a look at http://databene.org/databene-benerator. It's free, quick, provides realistic data and you have the option of using your own plugins.
