I have to run a SELECT statement across several tables. I am sure the tables return different records. I am anyway using UNION ALL.
Is it better to use UNION or UNION ALL in performance terms when I am sure the tables return different records?
UNION ALL will perform better than UNION when you're not concerned about eliminating duplicate records because you're avoiding an expensive distinct sort operation. See: SQL SERVER - Difference Between Union vs. Union All - Optimal Performance Comparison
UNION ALL is always faster, because UNION has to exclude duplicate entries.
UNION internally performs two operations:
1. A SELECT, which returns a dataset
2. A DISTINCT, which removes duplicates from that dataset
Anyone who has studied database internals can easily understand that a DISTINCT clause is extremely costly in terms of processing.
If you are pretty sure that the resulting dataset does not need to have unique rows, then you can skip UNION and use UNION ALL instead.
UNION ALL is the same as UNION except that it doesn't fire a DISTINCT internally, sparing us a costly operation.
It is better to use UNION ALL when you know you want all the result rows, whether or not you know they'll be distinct. UNION without "ALL" will always perform the "distinct check", regardless of what the data actually is.
Why is UNION ALL faster? Because UNION must do extra work (typically a sort) to remove the duplicates. If you do not need to remove duplicates, then UNION ALL is the better option. However, UNION does have a purpose and should be used when appropriate.
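To make the semantic difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables t1 and t2 and their contents are made up for illustration, and SQLite is only a stand-in: the duplicate-removal behaviour is standard SQL, but the cost of it will differ between engines.

```python
import sqlite3

# In-memory database with two hypothetical tables that share one value (3).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (val INTEGER);
    CREATE TABLE t2 (val INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (3), (4);
""")

# UNION removes the duplicate 3; UNION ALL simply concatenates both results.
union_rows = conn.execute(
    "SELECT val FROM t1 UNION SELECT val FROM t2"
).fetchall()
union_all_rows = conn.execute(
    "SELECT val FROM t1 UNION ALL SELECT val FROM t2"
).fetchall()

print(len(union_rows))      # 4 distinct values
print(len(union_all_rows))  # 5 rows, the value 3 appears twice
```

If the inputs are already known to be disjoint, both queries return the same rows, and UNION ALL just skips the duplicate check.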
I'm going to go out on a limb and suggest that it depends on your data.
If performance is measured end-to-end (from the moment the client sends the first byte of the request to the moment it gets the last byte of the response) then you have the following two extremes:
1. A small minority (say 1%) of the result set consists of duplicates
2. The vast majority (say 99%) of the result set consists of duplicates
In case 1, UNION ALL will be faster simply because it does not need to sort the data (to remove duplicates) before returning it.
In case 2, UNION will be faster because it's much quicker to remove duplicates in memory than sending them over the wire. If your result set contains 1 million rows with only 2 unique values then your network time will be much smaller once those duplicates have been removed.
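A scaled-down sketch of case 2, again with Python's sqlite3 module and made-up tables a and b, shows how much smaller the result becomes. It only counts rows; it does not measure network or sort time, so whether UNION actually wins end-to-end would still need to be timed on your own setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (val INTEGER)")
conn.execute("CREATE TABLE b (val INTEGER)")

# 5000 rows per table, but only two distinct values (0 and 1) overall.
conn.executemany("INSERT INTO a VALUES (?)", [(i % 2,) for i in range(5000)])
conn.executemany("INSERT INTO b VALUES (?)", [(i % 2,) for i in range(5000)])

rows_all = conn.execute(
    "SELECT val FROM a UNION ALL SELECT val FROM b"
).fetchall()
rows_distinct = conn.execute(
    "SELECT val FROM a UNION SELECT val FROM b"
).fetchall()

# The client has to receive every one of these rows.
print(len(rows_all))       # 10000
print(len(rows_distinct))  # 2
```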
We have a bunch of queries where we UNION data from 3 tables at query time (we get data from 3 sources).
I was wondering if query performance would be any better if we were to merge the data into one table (with a source column so we know where it came from).
The new table would be much bigger, so I'm not sure if we should expect any better performance. Is there any general guidance around this?
There should be no significant difference between scanning 3 tables and scanning 1 table with the merged content.
However, please make sure you're using UNION ALL and not UNION. According to the SQL standard, UNION in SQL eliminates duplicate records, and the process of doing that can be very expensive.
Using UNION where UNION ALL should be used is one of the most common mistakes I've seen in SQL, unfortunately. I blame the standard, not the users though :)
See e.g. here for more discussion.
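For illustration, the two layouts being compared could look like the sketch below. It uses Python's sqlite3 module; the table names src_a, src_b, src_c and merged, and the column names, are hypothetical. It only checks that both layouts return the same rows and says nothing about timing on a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layout 1: one table per source.
    CREATE TABLE src_a (id INTEGER, amount INTEGER);
    CREATE TABLE src_b (id INTEGER, amount INTEGER);
    CREATE TABLE src_c (id INTEGER, amount INTEGER);
    INSERT INTO src_a VALUES (1, 10), (2, 20);
    INSERT INTO src_b VALUES (1, 10), (3, 30);
    INSERT INTO src_c VALUES (4, 40);

    -- Layout 2: a single merged table with a source column.
    CREATE TABLE merged (source TEXT, id INTEGER, amount INTEGER);
    INSERT INTO merged
        SELECT 'a', id, amount FROM src_a
        UNION ALL SELECT 'b', id, amount FROM src_b
        UNION ALL SELECT 'c', id, amount FROM src_c;
""")

# Query-time combination with UNION ALL (no duplicate elimination).
from_union = conn.execute("""
    SELECT 'a' AS source, id, amount FROM src_a
    UNION ALL SELECT 'b', id, amount FROM src_b
    UNION ALL SELECT 'c', id, amount FROM src_c
    ORDER BY source, id
""").fetchall()

# The same data read from the merged table.
from_merged = conn.execute(
    "SELECT source, id, amount FROM merged ORDER BY source, id"
).fetchall()

print(from_union == from_merged)  # True
```

Note that the row (1, 10) exists in two sources. A plain UNION over just id and amount would collapse it into one row, which is usually not what you want here.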
I have a query that needs to incorporate conditional logic. There are 4 cases that need to be considered, and the resulting sets are disjoint.
I can implement the query using either a single SELECT and CASE/WHEN statements or using multiple SELECT statements and UNION ALL.
In general, is one of these implementations likely to be faster than the other? If so, why?
A UNION runs one SELECT per branch, so IMHO a CASE WHEN will generally be better if the FROM clause is not that complex and all other things are equal. But the two do NOT produce the same shape of result:
A 'CASE WHEN ...' adds another column to the result, whereas every SELECT in a UNION must return the same number of columns, so a UNION adds more rows instead. For instance, if you queried three separate tables and then UNIONed them together, you would be doing three SELECTs; if you just wrote three CASE WHENs against one table, that would be more efficient. But you could be querying five tables. Without knowing the source, the answer really is: 'it depends'.
I just set the ole 'set statistics time on' when doing quick timing of the SQL engine to see. People can argue semantics but the engine does not lie when it tells you what is going on. SQL 2005 and higher I believe also has the 'include actual execution plan' in the menu bar. It is a nice looking little three squares icon in the shape of an L with the L point being in the upper left. If you have something very complex and are getting really into fine tuning that is the tool of choice to examine what the engine is doing under the hood with your query.
This really depends entirely on what the logic and the data you expect to be selecting from look like. If you're running this SELECT against huge datasets and the logic is fairly simple, like WHEN Val BETWEEN A AND B THEN C, you'll probably get a little bit of an uplift from putting the logic in your WHERE clause and doing a UNION ALL, but not a ton of difference. On a comparatively small data set, it might not make any difference at all. It also might depend on whether you see this code being set in stone or subject to periodic change. UNION ALL will certainly be quite a few more lines of code, because you're basically writing the same query over and over with different WHERE clauses, but it also may be easier to read and maintain.
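As a rough sketch of the two shapes being discussed, the following uses Python's sqlite3 module with a hypothetical readings table and four disjoint value ranges. It only demonstrates that both forms can return the same rows. Which one is faster depends on the engine, indexes and data, so it is worth timing both on the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (id INTEGER PRIMARY KEY, val INTEGER);
    INSERT INTO readings (val) VALUES (5), (15), (25), (35);
""")

# Variant 1: one pass over the table, logic in a CASE expression.
case_rows = conn.execute("""
    SELECT id,
           CASE WHEN val < 10 THEN 'low'
                WHEN val < 20 THEN 'medium'
                WHEN val < 30 THEN 'high'
                ELSE 'extreme' END AS band
    FROM readings
    ORDER BY id
""").fetchall()

# Variant 2: one SELECT per case, logic in the WHERE clauses.
# The ranges are disjoint, so UNION ALL cannot produce duplicates.
union_rows = conn.execute("""
    SELECT id, 'low' AS band FROM readings WHERE val < 10
    UNION ALL SELECT id, 'medium' FROM readings WHERE val >= 10 AND val < 20
    UNION ALL SELECT id, 'high' FROM readings WHERE val >= 20 AND val < 30
    UNION ALL SELECT id, 'extreme' FROM readings WHERE val >= 30
    ORDER BY id
""").fetchall()

print(case_rows == union_rows)  # True
```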
I have a table sorted by id and I want to retrieve n rows after skipping the first m rows. When I use LIMIT/OFFSET without ORDER BY it produces unpredictable results, and with ORDER BY id it takes too much time. Since my table is already ordered, I just want to retrieve n rows, leaving out the first m rows.
The documentation says:
The query planner takes LIMIT into account when generating a query plan, so you are very likely to get different plans (yielding different row orders) depending on what you use for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
So what you are trying cannot really be done.
I suppose you could play with the planner options or rearrange the query in a clever way to trick the planner into using a plan that suits you, but without actually showing the query, it's hard to say.
SELECT * FROM mytable LIMIT 100 OFFSET 0
You really shouldn't rely on implicit ordering though, because you may not be able to predict the exact order of how data goes into the database.
As pointed out above, SQL does not guarantee anything about order unless you have an ORDER BY clause. LIMIT can still be useful in such a situation, but I can't think of any use for OFFSET. It sounds like you don't have an index on id, because if you do, the query should be extremely fast, clustered or not. Take another look at that. (Also check CLUSTER, which may improve your performance at the margin.)
REPEAT: this is not something specific to PostgreSQL. Its behavior here conforms to the SQL standard.
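A small sketch of the predictable version follows. It uses Python's sqlite3 module rather than PostgreSQL (an assumption made only so the example is self-contained); the table mytable and the values of n and m are made up. With id as the primary key, the engine can satisfy ORDER BY id from the key's index instead of sorting the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# id is the primary key, so an index on it exists automatically.
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

n, m = 10, 100  # fetch n rows after skipping the first m

# ORDER BY makes the page contents well defined.
page = conn.execute(
    "SELECT id FROM mytable ORDER BY id LIMIT ? OFFSET ?", (n, m)
).fetchall()
ids = [row[0] for row in page]

print(ids)  # ids 101 through 110
```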
Generally speaking, for combining a lot of data is it better to use a temp table/temp variable as a staging area or should I just stick to "UNION ALL"?
Assumptions:
No further processing is needed, the results are sent directly to the client.
The client waits for the complete recordset, so streaming results isn't necessary.
I would stick to UNION ALL. If there's no need to do intermediary processing, thus requiring a temp table, then I would not use one.
Inserting data into a temp table (even if it's a table variable which, despite the myths, is not a purely "in memory" structure) will involve work in tempdb (which can be a bottleneck). To then just SELECT * as-is and return it without any special processing is unnecessary and, I think, bloats the code. When you just need to return data without any special processing, a temp table approach seems a bit "round the houses". If I thought there was a reason to justify the use of a temp table, I would run some like-for-like performance tests with and without temp tables, then compare the stats (duration, reads, writes, CPU). Doing actual performance tests is the best way to be as confident as possible that whatever approach you choose is the best one, especially as you don't have to be using temp tables for work to be pushed over into tempdb: depending on your queries, they might involve work in tempdb anyway.
To clarify, I'm not saying one is better than the other full stop. As with most things, it depends on scenario. In the scenario described, it just sounds like you'd be adding in an extra step which doesn't seem to add any functional value and I can't see you'd gain anything other than creating a slightly more complicated/lengthy query.
One advantage of temp tables I can think of is that you can apply indexes to them. That should help when dealing with lots of data where you need to get results back as quickly as possible.
For what it is worth, I just did a performance comparison between two approaches to retrieve identical datasets:
SELECT c1, c2, c3 FROM ... ON ... WHERE
UNION ALL
SELECT c1, c2, c3 FROM ... ON ... WHERE /*(repeated 8 times)*/
vs
CREATE TABLE #Temp (c1 int, c2 varchar(20), c3 bit)
INSERT INTO #Temp (c1, c2, c3) SELECT c1, c2, c3 FROM ... WHERE ... /*(repeat 8 times)*/
SELECT c1, c2, c3 FROM #Temp
The second approach (the temporary table) was about 5% slower than the union, and when I artificially scaled up the number of repeats, the second approach became even slower.
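For anyone who wants to repeat this kind of comparison, here is a self-contained sketch of the two approaches. It uses Python's sqlite3 module instead of SQL Server (so a TEMP table instead of #Temp), and the tables part1 and part2 are invented. It only verifies that both approaches return the same rows; it makes no claim about the 5% figure, which you would have to measure on your own server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part1 (c1 INTEGER, c2 TEXT, c3 INTEGER);
    CREATE TABLE part2 (c1 INTEGER, c2 TEXT, c3 INTEGER);
    INSERT INTO part1 VALUES (1, 'a', 0), (2, 'b', 1);
    INSERT INTO part2 VALUES (3, 'c', 1);
""")

# Approach 1: combine directly with UNION ALL.
direct = conn.execute("""
    SELECT c1, c2, c3 FROM part1
    UNION ALL
    SELECT c1, c2, c3 FROM part2
    ORDER BY c1
""").fetchall()

# Approach 2: stage every part in a temporary table, then read it back.
conn.executescript("""
    CREATE TEMP TABLE staging (c1 INTEGER, c2 TEXT, c3 INTEGER);
    INSERT INTO staging (c1, c2, c3) SELECT c1, c2, c3 FROM part1;
    INSERT INTO staging (c1, c2, c3) SELECT c1, c2, c3 FROM part2;
""")
staged = conn.execute(
    "SELECT c1, c2, c3 FROM staging ORDER BY c1"
).fetchall()

print(direct == staged)  # True
```

The staged version writes every row once and reads it again, which is the extra work the direct UNION ALL avoids.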
Not specific to union all..
Using a temp table might have an advantage from a concurrency POV, depending on the query, the isolation level and the performance of the clients/net link, where a temp table could serve to minimize read lock times. Just don't use SELECT ..INTO.. to create the table.
In the general case, UNION ALL avoids the overhead of an unnecessary work table.
I tend to use UNION ALL only where I have a limited number of UNIONs and a relatively limited number of columns returned. Table-typed variables are another possibility (especially for 2014 on), and they allow you to enforce commonality of structure if similar result sets are built in more than one location.
UNION ALL avoids intermediate steps but:
1) it can lead to bloated, hard to maintain code
2) it can lead to unmanageable query plans - if they get too big then the plan viewing tools in SQL Server can't actually view them
3) if parts of a complex union are similar, or may be used elsewhere in your system, consider using table-valued functions or stored procs to facilitate code re-use, whether you go for TTV, UNION ALL or temp tables
Query 1 = select top 5 i.item_id from ITEMS i
Query 2 = select top 5 i.item_id, i.category_id from ITEMS i
Even if I remove the top 5 clause they still return different rows.
If I run "select top 5 i.* from ITEMS i", this returns a completely different result set!
Because the results of a "TOP N" qualified SELECT are indeterminate if you do not have an ORDER BY clause.
Without an ORDER BY clause, you cannot predict what order you will get results. There is probably an interesting underlying reason for why SQL Server processes those queries differently, but from a user's perspective the solution is simply to impose the ORDER BY clause that's relevant to your query, thus guaranteeing you'll know which five items come first.
Since you're not specifying an ORDER BY clause, the optimizer will determine the most efficient way to run the query you're asking for. This means a different index might be used for each of your two queries, depending on the columns they select, which results in what you're seeing.
The reason is simple: you did not specify an ORDER BY clause. So, for example, the optimizer could choose to use different indexes to satisfy two different queries. If there is a lean index that contains ItemID but not CategoryID, it can be used to satisfy the first query. This is a very common question and has a canned answer:
Without ORDER BY, there is no default sort order.
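As a sketch of the fix, the two queries from the question become consistent once they share an ORDER BY. The example uses Python's sqlite3 module, so LIMIT 5 stands in for SQL Server's TOP 5. The items table, its data and the extra index are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (item_id INTEGER PRIMARY KEY, category_id INTEGER)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, 100 - i) for i in range(1, 21)]
)
# A second index gives the optimizer more than one way to read the table.
conn.execute("CREATE INDEX idx_items_category ON items (category_id)")

# Same ORDER BY in both queries, different column lists.
q1 = [r[0] for r in conn.execute(
    "SELECT item_id FROM items ORDER BY item_id LIMIT 5"
)]
q2 = [r[0] for r in conn.execute(
    "SELECT item_id, category_id FROM items ORDER BY item_id LIMIT 5"
)]

print(q1 == q2)  # True: both return item_id 1 through 5
```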