SQL Server: Tables vs Cursors

I'm asking for a high level understanding of what these two things are.
From what I've read, it seems that in general, a query with an ORDER BY clause returns a cursor, and basically cursors have order to them whereas tables are literally a set where order is not guaranteed.
What I don't really understand is why these two things are talked about as if they were two separate animals. To me, it seems like cursors are a subset of tables. The book I'm reading vaguely mentioned that
"Some language elements and operations in SQL expect to work with
table results of queries and not with cursors; examples include table
expressions and set operators"
My question would be... why not? Why won't SQL handle it like a table anyways even if it's given an ordered set?
Just to clarify, I will type out the paragraph from the book:
"A query with an ORDER BY clause results in what standard SQL calls a cursor - a nonrelational result with order guaranteed among rows. You're probably wondering why it matters whether a query returns a table result or a cursor. Some language elements and operations in SQL expect to work with table results of queries and not with cursors; examples include table expressions and set operators..."

A table is a result set. It has columns and rows. You can join to it with other tables to either filter or combine the data in ONE operation:
SELECT *
FROM TABLE1 T1
JOIN TABLE2 T2
ON T1.PK = T2.PK
A cursor is a variable that stores a result set. It has columns, but you can't access its rows directly; only the row the cursor is currently positioned on is available, so you must fetch the records ONE ROW AT A TIME.
DECLARE TESTCURSOR CURSOR
FOR SELECT * FROM Table1
OPEN TESTCURSOR
FETCH NEXT FROM TESTCURSOR
You can also fetch them into variables, if needed, for more advanced processing.
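To make the row-at-a-time model concrete, here is a minimal sketch of a complete cursor loop that fetches each row into a variable. It assumes Table1 has an integer column named PK (the name comes from the join above; the int type is an assumption), and the LOCAL FAST_FORWARD options are just one reasonable choice, not a requirement.

```sql
-- Assumption: Table1 has an int column called PK.
DECLARE @pk int;

DECLARE TestCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT PK FROM Table1 ORDER BY PK;

OPEN TestCursor;

-- Fetch the first row, then keep fetching until no row is returned.
FETCH NEXT FROM TestCursor INTO @pk;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @pk;  -- per-row processing goes here
    FETCH NEXT FROM TestCursor INTO @pk;
END

-- Release the cursor's resources when done.
CLOSE TestCursor;
DEALLOCATE TestCursor;
```

The set-based equivalent, SELECT PK FROM Table1, does the same work in a single operation, which is why cursors are usually kept for logic that genuinely has to run once per row.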
Please let me know if that doesn't clarify it for you.

With regard to this sentence,
"Some language elements and operations in SQL expect to work with
table results of queries and not with cursors; examples include table
expressions and set operators"
I think the author is just saying that there are cases where it doesn't make sense to use an ORDER BY in a fragment of a query, because the ORDER BY should be on the outer query, where it will actually affect the final result of the query.
For instance, I can't think of any point in putting an ORDER BY on a CTE ("table expression") or on the subquery in an IN ( ) expression, UNLESS (in both cases) a TOP n is used as well.
When you create a VIEW, SQL Server will not allow you to use an ORDER BY unless a TOP n (or OFFSET or FOR XML) is also specified. Otherwise the ORDER BY should be specified when selecting from the VIEW, not in the code of the VIEW itself.
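A short sketch of that restriction, assuming a hypothetical table dbo.Orders (the table, its columns, and the view names are made up for illustration):

```sql
-- Hypothetical table for illustration only.
CREATE TABLE dbo.Orders (OrderID int PRIMARY KEY, Amount money);
GO

-- Rejected by SQL Server: "The ORDER BY clause is invalid in views, inline
-- functions, derived tables, subqueries, and common table expressions,
-- unless TOP, OFFSET or FOR XML is also specified."
-- CREATE VIEW dbo.OrdersSorted AS
-- SELECT OrderID, Amount FROM dbo.Orders ORDER BY Amount;

-- Accepted: here ORDER BY decides WHICH rows TOP keeps,
-- not the order in which the view returns them.
CREATE VIEW dbo.TopOrders AS
SELECT TOP (10) OrderID, Amount
FROM dbo.Orders
ORDER BY Amount DESC;
GO

-- Presentation order still belongs on the outer query.
SELECT OrderID, Amount
FROM dbo.TopOrders
ORDER BY Amount DESC;
```

The outer ORDER BY is the only one that guarantees the order of the final result.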

Related

HANA SQL CTE WHERE CONDITION

I'm writing a scripted calculation view on HANA using SQL.
Looking for some performance booster alternatives for the logic that I have implemented in a while loop. Simplified version of code is below.
It is trying to get similar looking vendors in table B for vendors from table A.
Please bear with me for inaccurate syntax.
v = select vendor, vendorname from A;
while --set a counter here
vendorname = capture the record from v for row number represented by counter here
t = select vendor, vendorname from v where (read single vendor from counter row)
union all
select vendor, vendorname from B where contains(vendorname,:vendorname,fuzzy(0.3))
union all
select vendor, vendorname from t
endwhile
This query dies when there are thousands of records in both tables. After reading a few blogs, I realized that I'm going in the wrong direction by using a loop.
To make this little faster, I came across something called CTE.
When I tried to implement the same code using CTE I'm not allowed to do so.
Sample code I'm trying to write is below. Can anybody please help me get this right? The syntax is not accepted by system.
t = with mytab ("Vendor", "VendorName")
AS ( select "Vendor", "VendorName" from "A" WHERE ( "Updated_Date" >= :From_Date AND "Updated_Date" <= :To_Date ) )
select * from "B" WHERE CONTAINS ("VendorName", mytab."VendorName",FUZZY(0.3));
The SQL error for this syntax is:
SQL: invalid identifier: MYTAB
I would like to know:
Whether such operation with CTE is allowed. If yes, what is the correct syntax in HANA SQL?
If No, how do I achieve the desired result without looping through one table?
Thanks,
Anup
CTEs are allowed in SAP HANA - you might want to check the HANA SQL reference if you're looking for syntax.
But as you're in a SQLScript context anyhow, you might as well use table variables instead.
What I'm not sure about is, what you are actually trying to do. Provide a description of your usage scenario, if possible.
Ok, based on your comments, the following approach could work for you.
Note, in my example I use a copy of the USERS system table, so you will have to fit the query to your tables.
do
begin
declare user_names nvarchar(5000);
select string_agg(user_name,' ') into user_names
from cusers
where user_name in ('SYS', 'SYSTEM');
select *
from cusers
where contains (user_name, :user_names, fuzzy(0.3));
end;
What I do here is to get all the potential names for which I want to do a fuzzy lookup into a variable user_names (separated by a space). For this I use the STRING_AGG() aggregation function.
After the first statement is finished, :user_names contains SYSTEM SYS in my example.
Now, CONTAINS allows you to search multiple columns for multiple search terms at once (you may want to re-check the reference documentation for details here), so
CONTAINS (<column_name>, 'term1 term2 term3')
looks for all three terms in the column <column_name>.
With that we feed the string SYS SYSTEM into the second query and the CONTAINS clause.
That works fine for me, avoids a join and runs over the table to be searched only once.
BTW: no idea where you get that statement about table variables in read-only procedures from - it's wrong. Of course you can use table variables, in fact it's recommended to make use of them.

Why does this subquery NOT cause an error?

I'm confused by an SQL query, and honestly, it's one of those things that I'm not even sure how to google for. Thus StackOverflow.
I have what I think is a simple query.
SELECT Id
FROM Customer
WHERE Id IN (SELECT Id from @CustomersWithCancelledOrders)
Here's where I find the weirdness. There is no column called Id in the @CustomersWithCancelledOrders table variable. But there isn't an error.
What this results in is the Ids for all Customers. Every single one. Which obviously defeats the point of doing a sub-query in the first place.
It's like it's using the Id column from the outer table (Customer), but I don't understand why it would do that. Is there ever a reason you would want to do that? Am I missing something incredibly obvious?
SQLFiddle of the weirdness. It's not the best SQL Fiddle, as I couldn't find a way to return multiple result sets on that website, but it demonstrates how I ran across the issue.
I suppose what I'm looking for is a name for the "feature" above, some sort of information about why it does what it does and what the incorrect query actually means.
I've updated the above question to use a slightly better example. It's still contrived, but it's closer to the script I wrote when I actually encountered the issue.
After doing some reading on correlated subqueries, it looks like my typo (using the wrong Id column in the subquery) changes the behaviour of the subquery.
Instead of evaluating the results of the subquery once and then treating those results as a set (which was what I intended) it evaluates the subquery for every row in the outer query.
This means that the subquery evaluates to a set of different results for every row, and that set of results is guaranteed to have the customer Id of that row in it. The subquery returns a set consisting of the Id of the row repeated X number of times, where X is the number of rows in the table variable that is being selected from.
...
It's really hard to write down a concise description of my understanding of the issue. Sorry. I think I'm good now though.
It's intended behaviour, because in a subquery you can access the outer query's column names. The table in your subquery has no Id column, so the unqualified name Id resolves to the Id column of the outer table and the query is still valid.
That's why you should qualify with aliases or fully qualified names when working with sub queries.
For example; check out
http://support.microsoft.com/kb/298674
SELECT ID
FROM [Table]
WHERE ID IN (SELECT OtherTable.ID FROM OtherTable)
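This will generate an error if OtherTable has no ID column, because the qualified name can no longer resolve to the outer table. As Allan S. Hanses said, in the subquery you can use columns from the main query.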
See this example
SELECT ID
FROM [Table]
WHERE ID IN (SELECT ID)
The query is a correlated sub-query and is most often used to limit the results of the outer query based on a column returned by the sub query; hence the 'correlated'.
In this example the ID in the inner query is actually the ID from the table in the outer query. This makes the query valid but probably doesn't give you any useful results as it isn't actually correlating between the outer and inner queries.
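The name resolution described above can be reproduced with a small self-contained sketch. The two table variables are hypothetical stand-ins for the tables in the question, and the column name CustomerId is an assumption, since the question never says what the column is actually called.

```sql
-- Hypothetical stand-ins for the tables in the question.
DECLARE @Customer TABLE (Id int PRIMARY KEY);
DECLARE @CustomersWithCancelledOrders TABLE (CustomerId int PRIMARY KEY);

INSERT INTO @Customer (Id) VALUES (1), (2), (3);
INSERT INTO @CustomersWithCancelledOrders (CustomerId) VALUES (2);

-- Unqualified: the inner table has no Id column, so Id binds to c.Id.
-- For each outer row the subquery returns that row's own Id,
-- so all three customers come back.
SELECT c.Id
FROM @Customer AS c
WHERE c.Id IN (SELECT Id FROM @CustomersWithCancelledOrders);

-- Qualified with the wrong column: x.Id cannot bind to the outer table,
-- so the batch fails to compile with an invalid column name error.
-- It is left commented out so the rest of the batch runs.
-- SELECT c.Id
-- FROM @Customer AS c
-- WHERE c.Id IN (SELECT x.Id FROM @CustomersWithCancelledOrders AS x);

-- Qualified with the intended column: only customer 2 is returned.
SELECT c.Id
FROM @Customer AS c
WHERE c.Id IN (SELECT x.CustomerId FROM @CustomersWithCancelledOrders AS x);
```

Qualifying every column with its table alias turns this kind of typo into a compile-time error instead of a silently wrong result.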

How can I force a subquery to perform as well as a #temp table?

I am reiterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
1. The subquery or CTE may be being repeatedly re-evaluated.
2. Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
3. Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
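As a rough sketch of what materializing it yourself looks like, the batch below splits a query into two steps. The tables TableFoo and TableBar and their columns are hypothetical here, and the index on the temp table is optional.

```sql
-- Step 1: run the inner query once and store its rows.
SELECT Id, CreatedOn
INTO #foo
FROM TableFoo
WHERE Deleted = 1;

-- Optional: index the join column of the materialized rows.
CREATE CLUSTERED INDEX IX_foo_Id ON #foo (Id);

-- Step 2: the outer query now joins against a real (temporary) table,
-- so the optimizer sees actual row counts and statistics for it.
SELECT bar.*
FROM #foo AS foo
JOIN TableBar AS bar ON bar.FooId = foo.Id
WHERE foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE());

DROP TABLE #foo;
```

The cost is that a multi-statement batch like this cannot be wrapped in a view, which is exactly the limitation the question complains about.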
Failing that, regarding point 1, see "Provide a hint to force intermediate materialization of CTEs or derived tables". The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works however there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force the SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
qty smallint NOT NULL) AS
BEGIN
-- Materialize the result into the table variable @t
INSERT @t (title, qty)
SELECT t.title, s.qty
FROM sales s
JOIN titles t ON t.title_id = s.title_id
WHERE s.stor_id = @storeid
RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.
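For illustration, the TOP workaround might look like the sketch below, reusing the table names from the example above. The limit of 2147483647 is just the largest int value, standing in for "something ridiculous". This only encourages the optimizer to evaluate the derived table on its own; it is not a documented guarantee.

```sql
SELECT bar.*
FROM (
    -- TOP with an explicit row count (unlike TOP 100 PERCENT) is not
    -- optimized away, so the inner filter tends to be evaluated first.
    SELECT TOP (2147483647) Id, CreatedOn
    FROM TableFoo
    WHERE Deleted = 1
    ORDER BY Id
) AS foo
JOIN TableBar AS bar ON bar.FooId = foo.Id
WHERE foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE());
```

Comparing the actual execution plan before and after the change is the only way to confirm whether it had the intended effect.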

Is an index still effective after data has been selected?

I have two tables that I want to join; both have an index on the column I am trying to join on.
QUERY 1
SELECT * FROM [A] INNER JOIN [B] ON [A].F = [B].F;
QUERY 2
SELECT * FROM (SELECT * FROM [A]) [A1] INNER JOIN (SELECT * FROM B) [B1] ON [A1].F=[B1].F
The first query will clearly utilize the index, but what about the second one?
After the two SELECT statements in the parentheses are executed, the join would occur, but my guess is that the index wouldn't help to speed up the query because it is pretty much a new table.
The query isn't executed quite so literally as you suggest, where the inner queries are executed first and then their results are combined with the outer query. The optimizer will take your query and will look at many possible ways to get your data through various join orders, index usages, etc. etc. and come up with a plan that it feels is optimal enough.
If you execute both queries and look at their respective execution plans, I think you will find that they use the exact same one.
Here's a simple example of the same concept. I created my schema as so:
CREATE TABLE A (id int, value int)
CREATE TABLE B (id int, value int)
INSERT INTO A (id, value)
VALUES (1,900),(2,800),(3,700),(4,600)
INSERT INTO B (id, value)
VALUES (2,800),(3,700),(4,600),(5,500)
CREATE CLUSTERED INDEX IX_A ON A (id)
CREATE CLUSTERED INDEX IX_B ON B (id)
And ran queries like the ones you provided.
SELECT * FROM A INNER JOIN B ON A.id = B.id
SELECT * FROM (SELECT * FROM A) A1 INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id
The plans that were generated (shown as a screenshot in the original answer) both utilize the index.
Chances are high that the SQL Server Query Optimizer will be able to detect that Query 2 is in fact the same as Query 1 and use the same indexed approach.
Whether this happens depends on a lot of factors: your table design, your table statistics, the complexity of your query, etc. If you want to know for certain, let SQL Server Query Analyzer show you the execution plan. Here are some links to help you get started:
Displaying Graphical Execution Plans
Examining Query Execution Plans
SQL Server uses predicate pushing (a.k.a. predicate pushdown) to move query conditions as far toward the source tables as possible. It doesn't slavishly do things in the order you parenthesize them. The optimizer uses complex rules--what is essentially a kind of geometry--to determine the meaning of your query, and restructure its access to the data as it pleases in order to gain the most performance while still returning the same final set of data that your query logic demands.
When queries become more and more complicated, there is a point where the optimizer cannot exhaustively search all possible execution plans and may end up with something that is suboptimal. However, you can pretty much assume that a simple case like you have presented is going to always be "seen through" and optimized away.
So the answer is that you should get just as good performance as if the two queries were combined. Now, if the values you are joining on are composite, that is they are the result of a computation or concatenation, then you are almost certainly not going to get the predicate push you want that will make the index useful, because the server won't or can't do a seek based on a partial string or after performing reverse arithmetic or something.
May I suggest that in the future, before asking questions like this here, you simply examine the execution plan for yourself to validate that it is using the index? You could have answered your own question with a little experimentation. If you still have questions, then come post, but in the meantime try to do some of your own research as a sign of respect for the people who are helping you.
To see execution plans, in SQL Server Management Studio (2005 and up) or SQL Query Analyzer (SQL 2000) you can just click the "Show Execution Plan" button on the menu bar, run your query, and switch to the tab at the bottom that displays a graphical version of the execution plan. A little poking around and hovering your mouse over various pieces will quickly show you which indexes are being used on which tables.
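If you prefer a text plan to the graphical one, a minimal sketch using SET SHOWPLAN_TEXT is shown below. It assumes the small A and B tables from the example schema earlier in this thread. While the setting is ON, SQL Server returns the estimated plan for each statement instead of executing it.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch, hence the GO.
SET SHOWPLAN_TEXT ON;
GO

-- These statements are not executed; their estimated plans are returned.
SELECT * FROM A INNER JOIN B ON A.id = B.id;
SELECT * FROM (SELECT * FROM A) A1 INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

Comparing the two plan texts side by side shows whether both queries use the same operators and indexes.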
However, if things aren't as you expect, don't automatically think that the server is making a mistake. It may decide that scanning your main table without using the index costs less--and it will almost always be right. There are many reasons that scanning can be less expensive, one of which is a very small table, another of which is that the number of rows the server statistically guesses it will have to return exceeds a significant portion of the table.
Both queries are the same. During optimization, the second query will be transformed into the same form as the first one.
However, if you have a specific requirement, I would suggest that you post the whole code. Then it would be much easier to answer your question.

How do I filter one of the columns in a SQL Server SQL Query

I have a table (that relates to a number of other tables) where I would like to filter ONE of the columns (RequesterID) - that column will be a combobox where only people that are not sales people should be selectable.
Here is the "unfiltered" query, let's call it QUERY 1:
SELECT RequestsID, RequesterID, ProductsID
FROM dbo.Requests
If I use a separate query, let's call it QUERY 2, to filter RequesterID (which is a People-related column, connected to People.PeopleID), it would look like this:
SELECT People.PeopleID
FROM People INNER JOIN
Roles ON People.RolesID = Roles.RolesID INNER JOIN
Requests ON People.PeopleID = Requests.RequesterID
WHERE (Roles.Role <> N'SalesGuy')
ORDER BY Requests.RequestsID
Now, is there a way of "merging" the QUERY 2 into QUERY 1?
(dbo.Requests in QUERY 1 has RequesterID populated as a foreign key from dbo.People, so no problem there. The connections are all right; I just don't know how to write the SQL query!)
UPDATE
Trying to explain what I mean a bit more:
The result set should be a number of REQUESTS, and the number of REQUESTS should not be limited by QUERY 2. QUERY 2's only function is to limit the selectable subset in column Requests.RequesterID. And no, it's not that clear, but in the C# VS2008 implementation I use Requests.RequesterID to eventually populate a ComboBox with [Full name], which is another column in the People table, and in that column I don't want SalesGuy to show up as possible to select. Here I'm trying to clear it out EVEN MORE (but with wrong syntax, of course):
SELECT RequestsID, (RequesterID WHERE RequesterID != 8), ProductsID
FROM dbo.Requests
Yes, RequesterID 8 happens to be the SalesGuy :-)
Here is a very comprehensive article on how to handle this topic:
Dynamic Search Conditions in T-SQL by Erland Sommarskog
It covers all the issues and methods of trying to write queries with multiple optional search conditions. The main thing you need to be concerned with is not the duplication of code, but the use of an index. If your query fails to use an index, it will perform poorly. There are several techniques that can be used, which may or may not allow an index to be used.
Here is the table of contents:
Introduction
The Case Study: Searching Orders
The Northgale Database
Dynamic SQL
    Introduction
    Using sp_executesql
    Using the CLR
    Using EXEC()
    When Caching Is Not Really What You Want
Static SQL
    Introduction
    x = @x OR @x IS NULL
    Using IF statements
    Umachandar's Bag of Tricks
    Using Temp Tables
    x = @x AND @x IS NOT NULL
    Handling Complex Conditions
Hybrid Solutions – Using both Static and Dynamic SQL
    Using Views
    Using Inline Table Functions
Conclusion
Feedback and Acknowledgements
Revision History
If you are on the proper version of SQL Server 2008, there is an additional technique that can be used, see: Dynamic Search Conditions in T-SQL Version for SQL 2008 (SP1 CU5 and later)
If you are on that proper release of SQL Server 2008, you can just add OPTION (RECOMPILE) to the query and the local variable's value at run time is used for the optimizations.
Consider this, OPTION (RECOMPILE) will take this code (where no index can be used with this mess of ORs):
WHERE
(@Search1 IS NULL or Column1=@Search1)
AND (@Search2 IS NULL or Column2=@Search2)
AND (@Search3 IS NULL or Column3=@Search3)
and optimize it at run time to be (provided that only @Search2 was passed in with a value):
WHERE
Column2=@Search2
and an index can be used (if you have one defined on Column2)
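Put together as a complete statement, the pattern might look like the sketch below. The table dbo.SomeTable, the int column types, and the sample value 42 are assumptions made only for illustration, and the variables are written with T-SQL's @ prefix.

```sql
-- Hypothetical search parameters; NULL means "do not filter on this column".
DECLARE @Search1 int = NULL,
        @Search2 int = 42,
        @Search3 int = NULL;

SELECT Column1, Column2, Column3
FROM dbo.SomeTable  -- hypothetical table
WHERE (@Search1 IS NULL OR Column1 = @Search1)
  AND (@Search2 IS NULL OR Column2 = @Search2)
  AND (@Search3 IS NULL OR Column3 = @Search3)
OPTION (RECOMPILE);  -- compile a fresh plan using the current variable values
```

The trade-off is a compilation on every execution, which is usually acceptable for search screens but may matter for statements that run many times per second.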
How about this? Since the query already joins on the Requests table, you can simply add the columns to the select list like so:
SELECT Requests.RequestsID, Requests.RequesterID, Requests.ProductsID
FROM People INNER JOIN
Roles ON People.RolesID = Roles.RolesID INNER JOIN
Requests ON People.PeopleID = Requests.RequesterID
WHERE (Roles.Role <> N'SalesGuy')
ORDER BY Requests.RequestsID
You can in fact select any column from any of the joined tables (Roles, Requests, People, etc.)
It becomes clear if you just replace People.PeopleId with * and it will show you everything retrieved from the tables.
