Why does this subquery NOT cause an error? [duplicate] - sql-server

This question already has answers here:
sql server 2008 management studio not checking the syntax of my query
(2 answers)
Closed 8 years ago.
I'm confused by an SQL query, and honestly, it's one of those things that I'm not even sure how to google for. Thus StackOverflow.
I have what I think is a simple query.
SELECT Id
FROM Customer
WHERE Id IN (SELECT Id from #CustomersWithCancelledOrders)
Here's where I find the weirdness. There is no column called Id in the #CustomersWithCancelledOrders temp table. But there isn't an error.
What this results in is the Ids for all Customers. Every single one. Which obviously defeats the point of doing a sub-query in the first place.
It's like it's using the Id column from the outer table (Customer), but I don't understand why it would do that. Is there ever a reason you would want to do that? Am I missing something incredibly obvious?
SQLFiddle of the weirdness. It's not the best SQL Fiddle, as I couldn't find a way to return multiple result sets on that website, but it demonstrates how I ran across the issue.
I suppose what I'm looking for is a name for the "feature" above, some sort of information about why it does what it does and what the incorrect query actually means.
I've updated the above question to use a slightly better example. It's still contrived, but it's closer to the script I wrote when I actually encountered the issue.
After doing some reading on correlated subqueries, it looks like my typo (using the wrong Id column in the subquery) changes the behaviour of the subquery.
Instead of evaluating the results of the subquery once and then treating those results as a set (which was what I intended) it evaluates the subquery for every row in the outer query.
This means that the subquery evaluates to a different set of results for every row, and that set of results is guaranteed to contain the customer Id of that row. The subquery returns a set consisting of the Id of the outer row repeated X times, where X is the number of rows in the temp table being selected from.
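In other words, the mistyped query behaves as if it had been written like this: for every outer row, the subquery yields that row's own Id once per temp table row, so the IN test passes whenever the temp table is non-empty (a sketch of how the name resolves, not a new query):

SELECT c.Id
FROM Customer AS c
WHERE c.Id IN (SELECT c.Id                     -- resolves to the OUTER table's Id
               FROM #CustomersWithCancelledOrders)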
...
It's really hard to write down a concise description of my understanding of the issue. Sorry. I think I'm good now though.

It's intended behaviour, because in a subquery you can access the outer query's column names. That means you can reference Id from the outer table within the subquery, so the query resolves your Id to the outer column instead of raising an error.
That's why you should qualify columns with aliases or fully qualified names when working with subqueries.
For example, check out
http://support.microsoft.com/kb/298674
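Applied to the query from the question, qualifying the subquery's column makes the typo surface immediately:

SELECT c.Id
FROM Customer AS c
WHERE c.Id IN (SELECT cwco.Id                  -- now fails: invalid column name 'Id'
               FROM #CustomersWithCancelledOrders AS cwco)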

SELECT ID
FROM [Table]
WHERE ID IN (SELECT OtherTable.ID FROM OtherTable)
This will generate an error. As Allan S. Hanses said, in the subquery you can use columns from the main query.
See this example
SELECT ID
FROM [Table]
WHERE ID IN (SELECT ID)

The query is a correlated sub-query and is most often used to limit the results of the outer query based on a column returned by the sub query; hence the 'correlated'.
In this example the ID in the inner query is actually the ID from the table in the outer query. This makes the query valid, but it probably doesn't give you any useful results, as nothing meaningful is actually being correlated between the outer and inner queries.
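For contrast, here is a deliberately correlated subquery, sketched against the Customer table from the question and a hypothetical Orders table (the CustomerId and Status columns are assumptions):

SELECT c.Id
FROM Customer AS c
WHERE EXISTS (SELECT 1
              FROM Orders AS o
              WHERE o.CustomerId = c.Id        -- the correlation, this time on purpose
                AND o.Status = 'Cancelled')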

Related

MsSql Get order of insertion

If I run a select query on a MS SQL database without an ORDER BY clause, will it return the rows in insertion order?
I need to be sure that a specific row in my database was inserted before another. If a select without any ORDER BY returned rows in insertion order, that would settle it. I am not sure, though, that the order of the rows returned is in fact the order in which they were added to the database. All the tests I made point in that direction and seem to confirm this assumption, but I was not able to find an official statement or confirmation of this anywhere.
I was able to solve the problem by using this:
select mycolumns, %%physloc%% as pl from mytable order by pl desc
Rows that have been written one after the other will have very similar values in this column, unless something entirely different was written in between. In my case it solved the problem.

SQL Server : Tables vs Cursors

I'm asking for a high level understanding of what these two things are.
From what I've read, it seems that in general, a query with an ORDER BY clause returns a cursor, and basically cursors have order to them whereas tables are literally a set where order is not guaranteed.
What I don't really understand is, why are these two things talked about like two separate animals. To me, it seems like cursors are a subset of tables. The book I'm reading vaguely mentioned that
"Some language elements and operations in SQL expect to work with
table results of queries and not with cursors; examples include table
expressions and set operators"
My question would be... why not? Why won't SQL handle it like a table anyway, even if it's given an ordered set?
Just to clarify, I will type out the paragraph from the book:
A query with an ORDER BY clause results in what standard SQL calls a cursor - a nonrelational result with order guaranteed among rows. You're probably wondering why it matters whether a query returns a table result or a cursor. Some language elements and operations in SQL expect to work with table results of queries and not with cursors; examples include table expressions and set operators..."
A table is a result set. It has columns and rows. You can join it to other tables to either filter or combine the data in ONE operation:
SELECT *
FROM TABLE1 T1
JOIN TABLE2 T2
ON T1.PK = T2.PK
A cursor is a variable that stores a result set. It has columns, but the rows are inaccessible - except the top one! You can't access the records directly; rather, you must fetch them ONE ROW AT A TIME.
DECLARE TESTCURSOR CURSOR          -- define the cursor over a query
FOR SELECT * FROM Table1

OPEN TESTCURSOR                    -- run the query and populate the cursor
FETCH NEXT FROM TESTCURSOR         -- retrieve only the first row
You can also fetch them into variables, if needed, for more advanced processing.
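A minimal sketch of that pattern, assuming Table1 has an int column named Id (the column name is my assumption):

DECLARE @id int

DECLARE TESTCURSOR CURSOR
FOR SELECT Id FROM Table1

OPEN TESTCURSOR
FETCH NEXT FROM TESTCURSOR INTO @id

WHILE @@FETCH_STATUS = 0               -- 0 means the last fetch returned a row
BEGIN
    PRINT CONVERT(varchar(12), @id)    -- per-row processing goes here
    FETCH NEXT FROM TESTCURSOR INTO @id
END

CLOSE TESTCURSOR                       -- release the result set
DEALLOCATE TESTCURSOR                  -- drop the cursor definition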
Please let me know if that doesn't clarify it for you.
With regard to this sentence,
"Some language elements and operations in SQL expect to work with
table results of queries and not with cursors; examples include table
expressions and set operators"
I think the author is just saying that there are cases where it doesn't make sense to use an ORDER BY in a fragment of a query, because the ORDER BY should be on the outer query, where it will actually affect the final result of the query.
For instance, I can't think of any point in putting an ORDER BY on a CTE ("table expression") or on the subquery in an IN( ) expression, UNLESS (in both cases) a TOP n is used as well.
When you create a VIEW, SQL Server will actually not allow you to use an ORDER BY unless a TOP n is also used. Otherwise the ORDER BY should be specified when Selecting from the VIEW, not in the code of the VIEW itself.
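A sketch of that rule, using a hypothetical Orders table (OrderId and OrderDate are assumed column names):

-- ORDER BY is only legal inside the view because TOP is present
CREATE VIEW RecentOrders AS
SELECT TOP (100) OrderId, OrderDate
FROM Orders
ORDER BY OrderDate DESC
GO

-- the guaranteed ordering of the final result belongs on the outer query
SELECT OrderId, OrderDate
FROM RecentOrders
ORDER BY OrderDate DESC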

SELF Referential SQL Query

I have a table in my MS SQL Database called PolicyTransactions. This table has two important columns:
trans_id INT IDENTITY(1,1),
policy_id INT NOT NULL,
I need help writing a query that will, for each trans_id/policy_id in the table, join it to the previous trans_id for that policy_id. This seems like a simple enough query, but for some reason I can't get it to gel in my brain right now.
Thanks!
I cooked this up for you... Hopefully it's what you're looking for: http://sqlfiddle.com/#!6/e7dc39/8
Basically, a cross apply is different from a subquery or regular join. It is a query that gets executed for each row that the outer portion of the query returns. This is why it has visibility into the outer tables (a plain derived table would not have this ability) and this is why it's using the old school join syntax (old school meaning the join condition on _ = _ is in the where clause).
Just be really careful with this solution as cross apply isn't necessarily the fastest thing on earth. However, if the indexing on the tables is decent, that tiny query should run pretty quickly.
It's the only way I could think of to solve it, but that doesn't mean it's the only way!
Just a super quick edit: if you notice, some rows are not returned, because they are the FIRST transaction for their policy and therefore don't have a trans_id less than theirs with the same policy number. If you want to simulate an outer join with an apply, use outer apply :)
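The fiddle is external, so here is a sketch of the cross apply approach described above, using the question's table and column names (the prev alias is mine):

SELECT pt.trans_id, pt.policy_id, prev.trans_id AS prev_trans_id
FROM PolicyTransactions AS pt
CROSS APPLY (SELECT TOP (1) t.trans_id         -- runs once per outer row
             FROM PolicyTransactions AS t
             WHERE t.policy_id = pt.policy_id
               AND t.trans_id < pt.trans_id    -- only earlier transactions
             ORDER BY t.trans_id DESC) AS prev

As the edit above notes, swapping CROSS APPLY for OUTER APPLY keeps each policy's first transaction, with a NULL prev_trans_id.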
If you are using SQL Server 2012 or later you should use the LAG() function. See the snippet below; I feel that it's much cleaner than the other answer given here.
SELECT trans_id, policy_id,
       LAG(trans_id) OVER (PARTITION BY policy_id ORDER BY trans_id) AS prev_trans_id
FROM PolicyTransactions

How can I force a subquery to perform as well as a #temp table?

I am reiterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple query plan setting to make the engine just spool each subquery in turn, working from the inside out. No second-guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #temp table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time-consuming ways I could coax the optimiser to look at tables in the right order, but even this offers no guarantees. I'm not asking for the ideal 2-second execution time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
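A sketch of that trick (the tables and columns here are illustrative, and there is no guarantee: the optimizer merely tends not to discard a numeric TOP paired with ORDER BY, so the derived table is often evaluated once):

SELECT c.Name, d.Total
FROM (SELECT TOP (2147483647) CustomerId, SUM(Amount) AS Total
      FROM Orders
      GROUP BY CustomerId
      ORDER BY CustomerId) AS d                -- TOP + ORDER BY may force a spool
JOIN Customers AS c ON c.Id = d.CustomerId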
Even if that works, however, there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics, might get a better plan. Failing that, you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query where the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
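For illustration, a minimal sketch (the table names are placeholders):

SELECT b.*
FROM SmallFilterTable AS s
JOIN BigTable AS b ON b.Id = s.Id
OPTION (FORCE ORDER)                 -- joins are performed in exactly the order written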
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty smallint NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END

CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in the wrong order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to the PK, which was meaningless for the query but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to be executed first by limiting the results using TOP; however, this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose, since SQL Server just ignores it.
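A sketch of that hack, reusing the tables from the snippets above (the limit is arbitrary, per the caveat):

SELECT bar.*
FROM
    (SELECT TOP (1000000) *          -- forces the subquery to be evaluated first
     FROM TableFoo
     WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())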

SQL Server Full Text Search with Containstable is very slow when used in JOIN!

I am using SQL Server 2008 full text search and I am having serious issues with performance depending on how I use Contains or ContainsTable.
Here is a sample. (table1 has about 5000 records and there is a covered index on table1 which has all the fields in the where clause. I tried to simplify the statements, so forgive me if there are syntax issues.)
Scenario 1:
select * from table1 as t1
where t1.field1=90
and t1.field2='something'
and Exists(select top 1 * from containstable(table1,*, 'something') as t2
where t2.[key]=t1.id)
results: 10 seconds (very slow)
Scenario 2:
select * from table1 as t1
join containstable(table1,*, 'something') as t2 on t2.[key] = t1.id
where t1.field1=90
and t1.field2='something'
results: 10 seconds (very slow)
Scenario 3:
Declare @tbl Table(id uniqueidentifier primary key)
insert into @tbl select [key] from containstable(table1,*, 'something')

select * from table1 as t1
where t1.field1=90
and t1.field2='something'
and Exists(select id from @tbl as tbl where id=t1.id)
results: fraction of a second (super fast)
Bottom line, it seems that if I use Containstable in any kind of join or where clause condition of a select statement that also has other conditions, the performance is really bad. In addition, if you look at profiler, the number of reads from the database goes through the roof. But if I first do the full text search and put the results in a table variable and use that variable, everything goes super fast. The number of reads is also much lower. It seems that in the "bad" scenarios it somehow gets stuck in a loop, which causes it to read many times from the database, but of course I don't understand why.
Now the first question is: why is that happening? And the second question is: how scalable are table variables? What if the search results in tens of thousands of records? Is it still going to be fast?
Any ideas?
Thanks
I spent quite some time on this issue and, based on running many scenarios, this is what I figured out:
If you have Contains or ContainsTable anywhere in your query, that is the part that gets executed first, and rather independently. Meaning that even if the rest of the conditions limit your search to only one record, neither Contains nor ContainsTable cares about that. So this is like a parallel execution.
Now since full text search only returns a Key field, it immediately looks for the Key as the first field of the other indexes chosen for the query. So for the example above, it looks for an index on [key], field1, field2. The problem is that it chooses an index for the rest of the query based on the fields in the where clause, so for the example above it picks the covered index that I have, which is something like field1, field2, id. (The id of the table is the same as the [Key] returned from the full text search.) So in summary it:
executes containstable
executes the rest of the query and picks an index based on the where clause of the query
It tries to merge these two. Therefore, if the index it picked for the rest of the query starts with the [key] field, it is fine. However, if the index doesn't have the [key] field as its first key, it starts doing loops. It does not even do a table scan, otherwise going through 5000 records would not be that slow. The way it loops is that it runs the loop for the total number of results from FTS multiplied by the total number of results from the rest of the query. So if the FTS returns 2000 records and the rest of the query returns 3000, it loops 2000 * 3000 = 6,000,000 times. I do not understand why.
So in my case it does the full text search, then it does the rest of the query but picks the covered index that I have, which is based on field1, field2, id (which is wrong), and as a result it screws up. If I changed my covered index to id, field1, field2, everything would be very fast.
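In index terms, the change described would look something like this (the index name is mine; this is only a sketch of the point above, not general advice):

-- putting id first lets the merge with the full text [KEY] values
-- avoid the row-by-row loop described above
CREATE INDEX IX_table1_id_first ON table1 (id, field1, field2)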
My expectation was that FTS would return a bunch of [key] values, the rest of the query would return a bunch of [id] values, and then the [id]s would be matched against the [key]s.
Of course, I tried to simplify my query here, but the actual query is much more complicated and I cannot just change the index. I also have scenarios where the text passed to the full text search is blank, and in those scenarios I do not even want to join with containstable.
In those cases, changing my covered index to have the id field as the first field would be a disaster.
Anyway, for now I chose the temp table solution since it is working for me. I am also limiting the result to a few thousand rows, which helps with the potential performance issues of table variables when the number of records gets too high.
thanks
Normally it works very fast:
select t1.*, t2.[RANK]
from containstable(table1, field2, 'something') as t2
join table1 as t1 ON t1.id = t2.[KEY] AND t1.field1=90
order by t2.[RANK] desc
There is a big difference where you put your search criteria: in JOIN or in WHERE.
I'm going to take a guess here that your issue is the same as on the other thread I linked to. Are you finding the issue arises with multiple word search terms?
If so my answer from that thread will apply.
From http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506240
The most important thing is that the correct join type is picked for full-text query. Cardinality estimation on the FulltextMatch STVF is very important for the right plan. So the first thing to check is the FulltextMatch cardinality estimation. This is the estimated number of hits in the index for the full-text search string. For example, in the query in Figure 3 this should be close to the number of documents containing the term 'word'. In most cases it should be very accurate but if the estimate was off by a long way, you could generate bad plans. The estimation for single terms is normally very good, but estimating multiple terms such as phrases or AND queries is more complex since it is not possible to know what the intersection of terms in the index will be based on the frequency of the terms in the index. If the cardinality estimation is good, a bad plan probably is caused by the query optimizer cost model. The only way to fix the plan issue is to use a query hint to force a certain kind of join or OPTIMIZE FOR.
So it simply cannot know from the information it stores whether the two search terms together are likely to be quite independent or commonly found together. Maybe you should have two separate procedures: one for single-word queries that you let the optimiser do its stuff on, and one for multi-word search terms that you force a "good enough" plan on (sys.dm_fts_index_keywords might help if you want to do a rough estimate of cardinality yourself).
If you are getting the issue with single-word queries, this passage from the linked article might apply.
In SQL Server 2008 full-text search we have the ability to alter the plan that is generated based on a cardinality estimation of the search term used. If the query plan is fixed (as it is in a parameterized query inside a stored procedure), this step does not take place. Therefore, the compiled plan always serves this query, even if this plan is not ideal for a given search term.
So you might need to use the RECOMPILE option.
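A sketch of what that looks like on the query from the question (the @searchTerm parameter is my own addition):

select t1.*
from table1 as t1
join containstable(table1, *, @searchTerm) as t2 on t2.[KEY] = t1.id
where t1.field1 = 90
and t1.field2 = 'something'
option (recompile)                   -- compile a fresh plan for the actual search term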
