I have a query that uses UNION, but it runs slowly.
Select col1, col2 From table_1 (INDEX idx MRU)
where (condition)
UNION
Select col1, col2 From table_2 (INDEX idx MRU)
where (condition)
How can I make it run faster?
Um, it depends on how big the tables are, how selective the 'where' clauses are, etc.
You might try doing a "set showplan on" and then run your query which will give you output from the query optimizer. Understanding the showplan output can be non-trivial though. Also, make sure you've run "update statistics" on the tables to help the query optimizer make good decisions.
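For example (Sybase ASE syntax, using the table names from your query; adjust to your environment):
update statistics table_1
go
update statistics table_2
go
set showplan on
go
-- run your UNION query here and read the optimizer's plan output
set showplan off
go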
One note, the union operator implicitly performs a distinct operation. That means you'll be sorting the results in tempdb. If you don't care about duplicates, use "union all" instead.
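Applied to your query, only the set operator changes (index hints kept as in your original):
Select col1, col2 From table_1 (INDEX idx MRU)
where (condition)
UNION ALL
Select col1, col2 From table_2 (INDEX idx MRU)
where (condition)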
Ben
Related
This behavior surprised me a little bit.
When you generate a UUID in a CTE (to make a row id, etc.) and reference it later, you'll find that it changes. It seems that generate_uuid() is being called twice instead of once. Anyone know why this is the case with BigQuery and what this behavior is called?
I was using generate_uuid() to create a row_id and found that in my eventual joins no matches were occurring because of this. The best way I've found to get around it is to create a table from the first CTE, which cements the UUID in place for future use.
Still curious to know more about the why and what behind this.
with _first as (
    select generate_uuid() as row_id
),
_second as (
    select * from _first
)
select row_id from _first
union all
select row_id from _second
This is by design:
WITH clauses are not materialized. Placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause.
If a WITH subquery is referenced in more than one place, it is executed once for each reference.
You can see this in the documentation: do not treat WITH clauses as prepared statements.
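The workaround described above (materializing the first CTE) can be sketched like this, assuming BigQuery scripting syntax and a temp table; the names come from the example above:
-- materialize the UUIDs once so later references see the same values
create temp table _first as
select generate_uuid() as row_id;

select row_id from _first
union all
select row_id from _first;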
Let's say I have a View like this
CREATE VIEW MyView
AS
SELECT Id, Name FROM Source1
UNION
SELECT Id, Name FROM Source2
Then I query the View
SELECT Id, Name From MyView WHERE Name = 'Sally'
Will SQL Server internally first select all the data from Source1 and Source2 and then apply the WHERE, or will it push the WHERE down into each SELECT statement?
SQL Server can move predicates around as it sees fit in order to optimize a query. Views are effectively macros that are expanded into the body of the query before optimization occurs.
What it will do in any particular case isn't 100% possible to predict - because in SQL, you tell the system what you want, not how to do it.
For a trivial example like this, I would expect it to evaluate the predicate against the base tables and then perform the union, but only an examination of the query plan on your database, with your tables and indexes could answer the question for sure.
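Conceptually, the expanded query the optimizer works from looks something like this (illustrative only; the actual plan depends on your tables and indexes):
SELECT Id, Name FROM Source1 WHERE Name = 'Sally'
UNION
SELECT Id, Name FROM Source2 WHERE Name = 'Sally'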
It depends on the optimizer, cardinalities, available indexes, etc., but yes, it will apply the criteria to the base tables where appropriate.
Note that your UNION, as opposed to a UNION ALL, requires a sort to remove duplicates.
Please keep these two types of query in mind:
--query1
Select someFields
From someTables
Where someWhereClues
Union all
Select someFields
From someTables
Where someWhereClues
--query2
Select * FROM (
Select someFields
From someTables
Union all
Select someFields
FROM someTables
) DT
Where someMixedWhereClues
Note: in both queries the final result fields are the same.
I thought the 1st query would be faster, or would at least perform better!
But after some research I was confused by this note:
SQL Server (as an example RDBMS) first reads the whole data and then seeks records => so in both queries all records will be read.
Please help me with my misunderstanding, and with whether there are any other differences between query1 and query2.
Edit: adding sample plans:
select t.Name, t.type from sys.tables t where t.type = 'U'
union all
select t.Name, t.type from sys.objects t where t.type = 'U'
select * from (
select t.Name, t.type from sys.tables t
union all
select t.Name, t.type from sys.objects t
) dt
where dt.type = 'U'
Execution Plans are:
Both are the same, and each accounts for 50% of the batch cost.
The SQL Server query optimizer optimizes both queries, so you get nearly the same performance.
The first one cannot be slower. Here is the reasoning:
If the WHERE clauses in the first can efficiently use an INDEX, there will be fewer rows to collect together in the UNION. Fewer rows --> faster.
The second one does not have an INDEX on the UNION, hence the WHERE cannot be optimized in that way.
Here are things that could lead to the first being slower. But I see them as exceptions, not the rule.
A different amount of parallelism is achieved.
Different stuff happens to be cached at the time you run the queries.
Caveat: I am assuming all three WHERE clauses are identical (as your example shows).
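To illustrate the index point using the placeholders above (purely hypothetical names and values, not from the question):
CREATE INDEX idx_someColumn ON someTables (someColumn);

-- query1 form: each branch can seek on idx_someColumn before the UNION ALL,
-- so only matching rows are concatenated
SELECT someFields FROM someTables WHERE someColumn = 42
UNION ALL
SELECT someFields FROM someTables WHERE someColumn = 42;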
As a rule of thumb, I will always consider using the first type of the query.
In made-up samples and queries with simple WHERE predicates both will use the same plan. But in a more complex query, with more complicated predicates, the optimizer might not come up with an equally efficient solution for the second type of query (it's just an optimizer, and it is bound by resource and time constraints). The more complex the query is, the less chance there is that the optimizer finds the best execution plan (it will eventually time out and choose the least bad plan found so far). And it gets even worse if the predicates are ORed.
SQL Server will optimize both of those queries down to the same thing, as shown in the execution plans you posted. It's able to do this because in this case the queries are fairly simple; in another case it could turn out differently. When composing a query, you should try to follow the same general rules the optimizer does and filter as soon as possible to limit the result set. By saying that you first want only the 'U' records and then want to combine those results, you make the query robust against later revisions that could otherwise invalidate the optimizer's choices that led to the same execution plan here.
In short, you don't have to force simple queries to be optimal, but it's a good habit to have, and it will help when creating more complex queries.
In my practice the 1st option was never slower than the 2nd. I think the optimizer is smart enough to optimize these plans more or less in the same manner. However, I ran some tests and the 1st option was always better. For example:
CREATE TABLE #a ( a INT, b INT );
WITH Numbers ( I ) AS (
SELECT 1000
UNION ALL
SELECT I + 1
FROM Numbers
WHERE I < 5000
)
INSERT INTO #a ( a )
SELECT I
FROM Numbers
ORDER BY CRYPT_GEN_RANDOM(4)
OPTION ( MAXRECURSION 0 );
WITH Numbers ( I ) AS (
SELECT 1000
UNION ALL
SELECT I + 1
FROM Numbers
WHERE I < 5000
)
INSERT INTO #a ( b )
SELECT I
FROM Numbers
ORDER BY CRYPT_GEN_RANDOM(4)
OPTION ( MAXRECURSION 0 );
SELECT a, b
FROM #a
WHERE a IS NOT NULL
UNION ALL
SELECT a, b
FROM #a
WHERE b IS NOT NULL
SELECT *
FROM (
SELECT a, b
FROM #a
UNION ALL
SELECT a, b
FROM #a
) c
WHERE a IS NOT NULL
OR b IS NOT NULL
The result is 47% vs 53%
In my experience, there is no straightforward answer to this and it varies based on the nature of the underlying query. As you have shown, the optimizer comes up with the same execution plan in both of those scenarios, however that is not always the case. The performance is usually similar, but sometimes the performance can vary drastically depending on the query. In general I only take a closer look at it when performance is bad for no good reason.
In our case we have some business logic that looks into several tables in a certain order, so that the first non-null value from one table is used. The lookup is not hard, but it does take several lines of SQL code to accomplish. I have read about scalar-valued functions in SQL Server, but don't know if the re-compilation issue affects me enough to do it in a less convenient way.
So what's the general rule of thumb?
Would you rather have something like
select id, dbo.udfGetFirstNonNull(id) from myTable
Or are table-valued functions any better than scalar ones?
select id,
(select firstNonNull from udfGetFirstNonNull(id)) as firstNonNull
from myTable
The scalar UDF will do the lookup once for each row in myTable, which can run dramatically longer as data increases. Effectively you have a CURSOR. If you have a few rows, it won't matter of course.
I do the same myself where I don't expect a lot of rows (more than a few hundred).
However, I would consider a table-valued function where I've placed "foo" here; "foo" could also be a CTE in a UDF (not tested):
select M.id,
       foo.firstNonNull
from myTable M
JOIN
    (SELECT id, MIN(value) as firstNonNull   -- MIN picks one non-null value per id; use your own "first" ordering
     FROM OtherTable
     WHERE value IS NOT NULL
     GROUP BY id) foo ON M.id = foo.id
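If you do go the table-valued route, a rough sketch of an inline TVF version of udfGetFirstNonNull might look like this (hypothetical: OtherTable and the ordering are assumptions, since the question doesn't show the real lookup logic):
CREATE FUNCTION dbo.udfGetFirstNonNull (@id INT)
RETURNS TABLE
AS
RETURN
    -- pick the first non-null value for the given id (ordering is assumed)
    SELECT TOP (1) value AS firstNonNull
    FROM OtherTable
    WHERE id = @id
      AND value IS NOT NULL
    ORDER BY value;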
Your first query is fine. One place I worked for is absolutely obsessed with speed and optimization, and they use UDFs heavily in this way.
I think for readability and maintainability I would prefer the scalar function, as a single value is what it is returning.
The table Query has 2 columns (functionId, depFunctionId).
I want all values that are either in functionid or in depfunctionid
I am using this:
select distinct depfunctionid from Query
union
select distinct functionid from Query
How can I do it better?
I think that's the best you'll get.
That's as good as it gets, I think...
Lose the DISTINCT clauses, as your UNION (vs UNION ALL) will take care of removing duplicates.
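That is, simply:
select depfunctionid from Query
union
select functionid from Query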
An alternative - but perhaps less clear and probably with the same execution plan - would be to do a FULL JOIN across the 2 columns.
SELECT
COALESCE(Query1.FunctionId, Query2.DepFunctionId) as FunctionId
FROM Query as Query1
FULL OUTER JOIN Query as Query2 ON
Query1.FunctionId = Query2.DepFunctionId
I am almost sure you can lose the DISTINCTs.
When you use UNION instead of UNION ALL, duplicated results are thrown away.
It all depends on how heavy your inline view query is. The key to better performance would be to execute it only once, but that is not possible given the data it returns.
If you do it like this:
select depfunctionid , functionid from Query
group by depfunctionid , functionid
It is very likely that you'll get repeated results for depfunctionid or functionid.
I may be wrong, but it seems to me that you're trying to retrieve a tree of dependencies. If that's the case, I personally would try to use a materialized path approach.
If the materialized path is stored in a self-referencing table, I would retrieve the tree using something like:
select asrt2.function_id
from a_self_referencing_table asrt1,
a_self_referencing_table asrt2
where asrt1.function_name = 'blah function'
and asrt2.materialized_path like (asrt1.materialized_path || '%')
order by asrt2.materialized_path, asrt2.some_child_node_ordering_column
This would retrieve the whole tree in the proper order. What sucks is having to construct the materialized path based on function_id and parent_function_id (or in your case, functionid and depfunctionid), but a trigger could take care of it quite easily.
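For what it's worth, a rough sketch of such a trigger, assuming PostgreSQL syntax; parent_function_id and materialized_path are the columns discussed above, the rest of the names and the separator are illustrative:
-- build the row's materialized path from its parent's path on insert
create or replace function set_materialized_path() returns trigger as $$
begin
    if new.parent_function_id is null then
        new.materialized_path := new.function_id || '/';
    else
        select p.materialized_path || new.function_id || '/'
          into new.materialized_path
          from a_self_referencing_table p
         where p.function_id = new.parent_function_id;
    end if;
    return new;
end;
$$ language plpgsql;

create trigger trg_set_materialized_path
    before insert on a_self_referencing_table
    for each row execute function set_materialized_path();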