Self-Referential SQL Query - sql-server

I have a table in my MS SQL Database called PolicyTransactions. This table has two important columns:
trans_id INT IDENTITY(1,1),
policy_id INT NOT NULL,
I need help writing a query that will, for each trans_id/policy_id in the table, join it to the last previous trans_id for that policy_id. This seems like a simple enough query, but for some reason I can't get it to gel in my brain right now.
Thanks!

I cooked this up for you... hopefully it's what you're looking for: http://sqlfiddle.com/#!6/e7dc39/8
Basically, a cross apply is different from a subquery or regular join. It is a query that gets executed once for each row that the outer portion of the query returns. This is why it has visibility into the outer tables (a plain derived table would not have this ability), and this is why it uses the old-school join syntax (old school meaning the join condition on _ = _ is in the WHERE clause).
Just be really careful with this solution, as cross apply isn't necessarily the fastest thing on earth. However, if the indexing on the tables is decent, that tiny query should run pretty quickly.
It's the only way I could think of to solve it, but that doesn't mean it's the only way!
Just a super quick edit: if you notice, some rows are not returned because they are the FIRST transaction for their policy and therefore don't have a trans_id less than theirs with the same policy_id. If you want to simulate an outer join with an apply, use OUTER APPLY :)
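For reference, a minimal sketch of the approach described above, reconstructed from the prose rather than copied from the fiddle (so it may differ from the linked example):
-- OUTER APPLY keeps first transactions (prev_trans_id will be NULL);
-- swap in CROSS APPLY to drop them instead.
SELECT pt.trans_id, pt.policy_id, prev.trans_id AS prev_trans_id
FROM PolicyTransactions pt
OUTER APPLY (
    SELECT TOP (1) p2.trans_id
    FROM PolicyTransactions p2
    WHERE p2.policy_id = pt.policy_id  -- correlation with the outer row
      AND p2.trans_id < pt.trans_id    -- only earlier transactions
    ORDER BY p2.trans_id DESC          -- take the latest of those
) prev;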

If you are using SQL Server 2012 or later, you should use the LAG() function. See the snippet below; I feel it's much cleaner than the other answer given here.
SELECT trans_id, policy_id,
       LAG(trans_id) OVER (PARTITION BY policy_id ORDER BY trans_id) AS prev_trans_id
FROM PolicyTransactions

Related

Joins - Difference in condition: placed to left or right

In the following two queries, the only difference is that the two sides of the join condition are swapped.
Will it make any performance difference?
Which one is advisable? I have searched the web with no luck. Please help.
First Query :
select order_date, order_amount
from customers
join orders
on customers.customer_id = orders.customer_id
where customers.customer_id = 3
Second Query :
select order_date, order_amount
from customers
join orders
on orders.customer_id = customers.customer_id
where customers.customer_id = 3
Prdp's comment sums up the answer beautifully. The answer is no. But to further clarify and give you some more info:
SQL Server uses T-SQL, which is a declarative language. To steal from this post, the definition of declarative is:
A programming paradigm that expresses the desired result of a computation without describing the steps to achieve it (also abbreviated as "describe what, not how").
What this basically translates to is that you tell SQL Server what you want returned and provide the logic for things like the joins, and SQL Server will figure out the best way to do it. If it has to rearrange joins or do implicit conversions, it will, in order to produce an optimal plan.
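If you want to verify this for yourself, one quick way (purely illustrative, not part of the original answer) is to compare the textual plans for both forms and confirm they are identical:
SET SHOWPLAN_TEXT ON;
GO
-- Neither query executes while SHOWPLAN_TEXT is on; each just returns its plan.
select order_date, order_amount
from customers
join orders on customers.customer_id = orders.customer_id
where customers.customer_id = 3

select order_date, order_amount
from customers
join orders on orders.customer_id = customers.customer_id
where customers.customer_id = 3
GO
SET SHOWPLAN_TEXT OFF;
GO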

Execution of a T-SQL statement

I am aware of the sequence of execution of SQL statements, but I still want to confirm a few things with the help of the SQL experts here. I have a big SQL query which returns thousands of rows. Here is a minimized version of the query, which I wrote and believe is correct.
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy and p.hmy = 7)
inner join commtenant ct on ct.htenant = t.hmyperson
where 1=1
My colleague says that the above query is equivalent to the query below performance-wise (he is very confident about it):
Select *
from property p
inner join tenant t on (t.hproperty = p.hmy)
inner join commtenant ct on ct.htenant = t.hmyperson
where p.hmy = 7
Could anybody explain why these queries are or are not equivalent? Thanks.
If you want to know if two queries are equivalent, learn how to look at the execution plans in SQL Server Management Studio. You can put the two queries in different windows, look at the estimated execution plans, and see for yourself if they are the same.
In this case, they probably are the same. SQL is intended to be a descriptive language, not a procedural language. That is, it describes the output you want, but the SQL engine is allowed to rewrite the query to be as efficient as possible. The two forms you have describe the same output. Do note that if there were a left outer join instead of an inner join, then the queries would be different.
In all likelihood, the engine will read the table and filter the records during the read or use an index for the read. The key idea, though, is that the output is the same and SQL Server can recognize this.
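To see that caveat in action with the question's tables (data assumed), compare the two placements once the join becomes outer:
-- Condition in ON: every property row survives; non-matching rows just
-- get NULL tenant columns.
Select *
from property p
left outer join tenant t on (t.hproperty = p.hmy and p.hmy = 7)

-- Condition in WHERE: applied after the join, so only p.hmy = 7 rows remain.
Select *
from property p
left outer join tenant t on (t.hproperty = p.hmy)
where p.hmy = 7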
"p.hmy = 7" is not a join condition, as it relates only to a single table. As such, it doesn't really belong in the ON clause of the join. Since you are not adding any information by placing the condition in the ON clause, having it in the WHERE clause (in which it really belongs) will not make any difference to the query plan generated. If in doubt, look at the query plans.

How can I force a subquery to perform as well as a #temp table?

I am reiterating the question asked by Mongus Pong, Why would using a temp table be faster than a nested query?, which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often third-party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple query plan setting to make the engine just spool each subquery in turn, working from the inside out. No second-guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #temp table takes the execution time from 120 seconds to 5. Essentially, the optimiser is making a major mistake somewhere. Sure, there may be very time-consuming ways I could coax the optimiser into looking at the tables in the right order, but even that offers no guarantees. I'm not asking for the ideal 2-second execution time here, just the speed that temp tabling offers me, within the flexibility of a view.
I've never posted on here before, but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem. Now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are:
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
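A minimal sketch of that, with illustrative object names (dbo.SomeBigTable and dbo.OtherTable are stand-ins):
-- Materialize the expensive intermediate result once...
SELECT id, SUM(amount) AS total
INTO #partial
FROM dbo.SomeBigTable
GROUP BY id;

-- ...optionally index it to help downstream estimates and joins...
CREATE CLUSTERED INDEX IX_partial_id ON #partial (id);

-- ...then join against it as many times as needed.
SELECT p.id, p.total, o.other_col
FROM #partial p
JOIN dbo.OtherTable o ON o.id = p.id;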
Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
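A sketch of that trick, again with made-up names (the large TOP value is arbitrary):
-- The TOP ... ORDER BY pair can encourage the optimizer to spool the
-- CTE's result once instead of re-evaluating it for each reference.
WITH expensive AS (
    SELECT TOP (2147483647) id, SUM(amount) AS total
    FROM dbo.SomeBigTable
    GROUP BY id
    ORDER BY id
)
SELECT e1.id, e1.total
FROM expensive e1
JOIN expensive e2 ON e2.id = e1.id;  -- two references to the same CTE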
Even if that works, however, there are no statistics on the spool.
For points 2 and 3, you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics, might get a better plan. Failing that, you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and which could potentially coax it into achieving that result in some instances. This hint will sometimes produce a more efficient plan for a complex query where the engine keeps insisting on a sub-optimal one. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
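For illustration, a hedged sketch of the hint's syntax on hypothetical tables:
-- OPTION (FORCE ORDER) makes the engine join in exactly the order
-- written, overriding its own join reordering.
SELECT d.*
FROM (SELECT Id FROM dbo.FilteredSet WHERE Deleted = 0) f
JOIN dbo.Detail d ON d.FilterId = f.Id
OPTION (FORCE ORDER);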
Another option (for future readers of this article) is to use a user-defined function. Multi-statement table-valued functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty smallint NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because it had an index that it could use (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. the one that would be used when the subquery was executed without the parent query). In my case I switched to the PK, which was meaningless for the query but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to be executed first by limiting the results using TOP; however, this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately, TOP 100 PERCENT can't be used for this purpose, since SQL Server just ignores it.
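A sketch of that workaround on the same tables (the limit is arbitrary):
-- TOP inside the derived table pushes the optimizer to evaluate it
-- first; if TableFoo ever has more than a million deleted rows, the
-- results would silently be truncated.
SELECT bar.*
FROM
    (SELECT TOP (1000000) * FROM TableFoo WHERE Deleted = 1) foo
    JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())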

Should I be worried if a subquery returns too much data?

Here is the example:
SELECT <columns>
FROM (..........<subquery>..........) AS xxx
INNER JOIN(s) with xxx.............
LEFT OUTER JOIN(s) with xxx........
WHERE <filter conditions>
Please correct me if I'm wrong:
1. Is that <subquery> a derived table?
2. Is it a problem for server memory if it returns too much data (say, millions of rows)? I know that the WHERE clause is applied to the final result set, which leaves the server processing far too much data from the subquery even when the final result has only 10 rows.
3. What if there were no inner join (to reduce the data) and only a left outer join? Does that make things even worse/slower, since it has to perform the join against all the rows from the subquery?
4. If (2) is a problem, then one solution I can think of is to limit the data returned by the subquery by adding other joins inside it, which would make things slower (I've tried that). Any other thoughts on this?
5. What if I can't limit the result of the subquery, since the WHERE clause depends on the joins that come after the subquery?
To clarify things, the reason the subquery returns too much data is that I'm trying to combine data from multiple tables using UNION ALL (with no filtering conditions) and then, for each row returned by the subquery, join to get the info I need to use in the WHERE clause. Another way to do this is to repeat all the joins that you see outside the subquery for each branch of the UNION ALL inside the subquery, which does limit the result sets, but adds more joins which, as I said, slow things down. In other words, I have to choose between a subquery that does this:
(
SELECT * FROM A UNION ALL
SELECT * FROM B UNION ALL
SELECT * FROM C...
) AS xxx
left outer join T with xxx
and a query that does this:
SELECT * FROM A
LEFT OUTER JOIN T ...
WHERE....
UNION ALL
SELECT * FROM B
LEFT OUTER JOIN T ...
WHERE....
UNION ALL
SELECT * FROM C
LEFT OUTER JOIN T ...
WHERE....
1. Yes, it is.
2. No, the query optimizer treats the whole query as one block. It doesn't run the derived table and then run the outer statement on the result; it 'optimizes through' derived tables.
3. Again, no. Having a derived table doesn't mean bad performance. You always have to look at your query as a whole.
4. It's not a problem.
5. Then that's just fine. Trust the query optimizer. Have you ever met the people who wrote it? They are scary intelligent.
In each individual case, it is worth looking at your query execution plan and finding the pain points. Look for things that are doing scans when they could be doing seeks; fixing that will usually give you a significant boost. Things do scans rather than seeks when:
There is no index to seek upon
The thing you are seeking on is the result of a function (e.g. WHERE function(field) = value); see the sketch after this list
The optimizer decides that a scan is actually faster.
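To illustrate the second point, a hedged example on a hypothetical Orders table with an index on OrderDate:
-- Scan: the function hides the indexed column from the optimizer.
SELECT * FROM Orders WHERE YEAR(OrderDate) = 2015

-- Seek: the same filter rewritten as a sargable range.
SELECT * FROM Orders
WHERE OrderDate >= '20150101' AND OrderDate < '20160101'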
But the bottom-line answer to the question is: no, you should not be worried that derived tables contain a lot of data when selected out in isolation.

Is an index still effective after data has been selected?

I have two tables that I want to join; they both have an index on the column I am trying to join on.
QUERY 1
SELECT * FROM [A] INNER JOIN [B] ON [A].F = [B].F;
QUERY 2
SELECT * FROM (SELECT * FROM [A]) [A1] INNER JOIN (SELECT * FROM [B]) [B1] ON [A1].F = [B1].F
The first query clearly will utilize the index, but what about the second one?
After the two SELECT statements in the brackets are executed, the join would occur, but my guess is the index wouldn't help speed up the query, because the result is pretty much a new table...
The query isn't executed quite as literally as you suggest, with the inner queries executed first and their results then combined by the outer query. The optimizer will take your query, look at many possible ways to get your data (various join orders, index usages, and so on), and come up with a plan that it feels is optimal enough.
If you execute both queries and look at their respective execution plans, I think you will find that they use the exact same one.
Here's a simple example of the same concept. I created my schema as so:
CREATE TABLE A (id int, value int)
CREATE TABLE B (id int, value int)
INSERT INTO A (id, value)
VALUES (1,900),(2,800),(3,700),(4,600)
INSERT INTO B (id, value)
VALUES (2,800),(3,700),(4,600),(5,500)
CREATE CLUSTERED INDEX IX_A ON A (id)
CREATE CLUSTERED INDEX IX_B ON B (id)
And ran queries like the ones you provided.
SELECT * FROM A INNER JOIN B ON A.id = B.id
SELECT * FROM (SELECT * FROM A) A1 INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id
The plans generated for the two queries were identical (the original screenshots are omitted here); both utilize the clustered index.
Chances are high that the SQL Server Query Optimizer will be able to detect that Query 2 is in fact the same as Query 1 and use the same indexed approach.
Whether this happens depends on a lot of factors: your table design, your table statistics, the complexity of your query, etc. If you want to know for certain, let SQL Server Query Analyzer show you the execution plan. Here are some links to help you get started:
Displaying Graphical Execution Plans
Examining Query Execution Plans
SQL Server uses predicate pushing (a.k.a. predicate pushdown) to move query conditions as far toward the source tables as possible. It doesn't slavishly do things in the order you parenthesize them. The optimizer uses complex rewrite rules, essentially a relational algebra, to determine the meaning of your query, and it restructures its access to the data as it pleases in order to gain the most performance while still returning the same final set of data that your query logic demands.
When queries become more and more complicated, there is a point where the optimizer can no longer exhaustively search all possible execution plans and may end up with something suboptimal. However, you can pretty much assume that a simple case like the one you have presented will always be "seen through" and optimized away.
So the answer is that you should get just as good performance as if the two queries were combined. Now, if the values you are joining on are computed, that is, they are the result of a calculation or concatenation, then you are almost certainly not going to get the predicate pushing you want to make the index useful, because the server won't or can't do a seek based on a partial string or after performing reverse arithmetic or something.
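A quick hypothetical of that failure mode (made-up tables and columns):
-- The function over [A].F means the server cannot seek an index on
-- [A].F for this join; that side degrades to a scan.
SELECT * FROM [A] INNER JOIN [B]
ON LEFT([A].F, 5) = [B].F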
May I suggest that in the future, before asking questions like this here, you simply examine the execution plan for yourself to validate that it is using the index? You could have answered your own question with a little experimentation. If you still have questions, then come post, but in the meantime try to do some of your own research as a sign of respect for the people who are helping you.
To see execution plans, in SQL Server Management Studio (2005 and up) or SQL Query Analyzer (SQL 2000) you can just click the "Show Execution Plan" button on the menu bar, run your query, and switch to the tab at the bottom that displays a graphical version of the execution plan. Some little poking around and hovering your mouse over various pieces will quickly show you which indexes are being used on which tables.
However, if things aren't as you expect, don't automatically think that the server is making a mistake. It may decide that scanning your main table without using the index costs less--and it will almost always be right. There are many reasons that scanning can be less expensive, one of which is a very small table, another of which is that the number of rows the server statistically guesses it will have to return exceeds a significant portion of the table.
Both of these queries are the same; the second will be transformed into the same form as the first during optimization.
However, if you have a specific requirement, I would suggest that you post the whole code. Then it would be much easier to answer your question.
