is index still effective after data has been selected? - sql-server

I have two tables that I want to join, they both have index on the column I am trying to join.
QUERY 1
SELECT * FROM [A] INNER JOIN [B] ON [A].F = [B].F;
QUERY 2
SELECT * FROM (SELECT * FROM [A]) [A1] INNER JOIN (SELECT * FROM B) [B1] ON [A1].F=[B1].F
the first query clearly will utilize the index, what about the second one?
after the two select statements in the brackets are executed, then join would occur, but my guess is the index wouldn't help to speed up the query because it is pretty much a new table..

The query isn't executed quite so literally as you suggest, where the inner queries are executed first and then their results are combined with the outer query. The optimizer will take your query and will look at many possible ways to get your data through various join orders, index usages, etc. etc. and come up with a plan that it feels is optimal enough.
If you execute both queries and look at their respective execution plans, I think you will find that they use the exact same one.
Here's a simple example of the same concept. I created my schema as so:
CREATE TABLE A (id int, value int)
CREATE TABLE B (id int, value int)
INSERT INTO A (id, value)
VALUES (1,900),(2,800),(3,700),(4,600)
INSERT INTO B (id, value)
VALUES (2,800),(3,700),(4,600),(5,500)
CREATE CLUSTERED INDEX IX_A ON A (id)
CREATE CLUSTERED INDEX IX_B ON B (id)
And ran queries like the ones you provided.
SELECT * FROM A INNER JOIN B ON A.id = B.id
SELECT * FROM (SELECT * FROM A) A1 INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id
The plans that were generated looked like this:
Which, as you can see, both utilize the index.

Chances are high that the SQL Server Query Optimizer will be able to detect that Query 2 is in fact the same as Query 1 and use the same indexed approach.
Whether this happens depends on a lot of factors: your table design, your table statistics, the complexity of your query, etc. If you want to know for certain, let SQL Server Query Analyzer show you the execution plan. Here are some links to help you get started:
Displaying Graphical Execution Plans
Examining Query Execution Plans

SQL Server uses predicate pushing (a.k.a. predicate pushdown) to move query conditions as far toward the source tables as possible. It doesn't slavishly do things in the order you parenthesize them. The optimizer uses complex rules--what is essentially a kind of geometry--to determine the meaning of your query, and restructure its access to the data as it pleases in order to gain the most performance while still returning the same final set of data that your query logic demands.
When queries become more and more complicated, there is a point where the optimizer cannot exhaustively search all possible execution plans and may end up with something that is suboptimal. However, you can pretty much assume that a simple case like you have presented is going to always be "seen through" and optimized away.
So the answer is that you should get just as good performance as if the two queries were combined. Now, if the values you are joining on are composite, that is they are the result of a computation or concatenation, then you are almost certainly not going to get the predicate push you want that will make the index useful, because the server won't or can't do a seek based on a partial string or after performing reverse arithmetic or something.
May I suggest that in the future, before asking questions like this here, you simply examine the execution plan for yourself to validate that it is using the index? You could have answered your own question with a little experimentation. If you still have questions, then come post, but in the meantime try to do some of your own research as a sign of respect for the people who are helping you.
To see execution plans, in SQL Server Management Studio (2005 and up) or SQL Query Analyzer (SQL 2000) you can just click the "Show Execution Plan" button on the menu bar, run your query, and switch to the tab at the bottom that displays a graphical version of the execution plan. Some little poking around and hovering your mouse over various pieces will quickly show you which indexes are being used on which tables.
However, if things aren't as you expect, don't automatically think that the server is making a mistake. It may decide that scanning your main table without using the index costs less--and it will almost always be right. There are many reasons that scanning can be less expensive, one of which is a very small table, another of which is that the number of rows the server statistically guesses it will have to return exceeds a significant portion of the table.

These both queries are same. The second query will be transformed just same as first one during transformation.
However, if you have specific requirement I would suggest that you put the whole code.Then It would be much easier to answer your question.

Related

How can I force a subquery to perform as well as a #temp table?

I am re-iterating the question asked by Mongus Pong Why would using a temp table be faster than a nested query? which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward and means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that regarding point 1 see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re evaluated.
Even if that works however there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint which forces the engine to perform the JOINs in the order specified, which could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query and the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force the SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (#storeid varchar(30))
RETURNS #t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
qty smallint NOT NULL) AS
BEGIN
INSERT #t (title, qty)
SELECT t.title, s.qty
FROM sales s
JOIN titles t ON t.title_id = s.title_id
WHERE s.stor_id = #storeid
RETURN
END
CREATE VIEW SalesData As
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id]) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.

Sql query gets too slow

Few days ago I wrote one query and it gets executes quickly but now a days it takes 1 hrs.
This query run on my SQL7 server and it takes about 10 seconds.
This query exists on another SQL7 server and until last week it took about
10 seconds.
The configuration of both servers are same. Only the hardware is different.
Now, on the second server this query takes about 30 minutes to extract the s
ame details, but anybody has changed any details.
If I execute this query without Where, it'll show me the details in 7
seconds.
This query still takes about same time if Where is problem
Without seeing the query and probably the data I can't do a lot other than offer tips.
Can you put more constraints on the query. If you can reduce the amount of data involved then this will speed up the query.
Look at the columns used in your joins, where and having clauses and order by. Check that the tables that the columns belong to contain indices for these columns.
Do you need to use the user defined function or can it be done another way?
Are you using subquerys? If so can these be pulled out into separate views?
Hope this helps.
Without knowing how much data is going into your tables, and not knowing your schema, it's hard to give a definitive answer but things to look at:
Try running UPDATE STATS or DBCC REINDEX.
Do you have any indexes on the tables? If not, try adding indexes to columns used in WHERE clauses and JOIN predicates.
Avoid cross table OR clauses (i.e, where you do WHERE table1.col1 = #somevalue OR table2.col2 = #someothervalue). SQL can't use indexes effectively with this construct and you may get better performance by splitting the query into two and UNION'ing the results.
What do your functions (UDFs) do and how are you using them? It's worth noting that dropping them in the columns part of a query gets expensive as the function is executed per row returned: thus if a function does a select against the database, then you end up running n + 1 queries against the database (where n = number of rows returned in the main select). Try and engineer the function out if possible.
Make sure your JOINs are correct -- where you're using a LEFT JOIN, revisit the logic and see if it needs to be a LEFT or whether it can be turned into an INNER JOIN. Sometimes people use LEFT JOINs, but when you examine the logic in the rest of the query, it can sometimes be apparent that the LEFT JOIN gives you nothing (because, for example, someone may had added a WHERE col IS NOT NULL predicate against the joined table). INNER JOINs can be faster, so it's worth reviewing all of these.
It would be a lot easier to suggest things if we could see the query.

Sql serve Full Text Search with Containstable is very slow when Used in JOIN!

I am using sql 2008 full text search and I am having serious issues with performance depending on how I use Contains or ContainsTable.
Here are sample: (table one has about 5000 records and there is a covered index on table1 which has all the fields in the where clause. I tried to simplify the statements so forgive me if there is syntax issues.)
Scenario 1:
select * from table1 as t1
where t1.field1=90
and t1.field2='something'
and Exists(select top 1 * from containstable(table1,*, 'something') as t2
where t2.[key]=t1.id)
results: 10 second (very slow)
Scenario 2:
select * from table1 as t1
join containstable(table1,*, 'something') as t2 on t2.[key] = t1.id
where t1.field1=90
and t1.field2='something'
results: 10 second (very slow)
Scenario 3:
Declare #tbl Table(id uniqueidentifier primary key)
insert into #tbl select {key] from containstable(table1,*, 'something')
select * from table1 as t1
where t1.field1=90
and t1.field2='something'
and Exists(select id from #tbl as tbl where id=req1.id)
results: fraction of a second (super fast)
Bottom line, it seems if I use Containstable in any kind of join or where clause condition of a select statement that also has other conditions, the performance is really bad. In addition if you look at profiler, the number of reads from the database goes to the roof. But if I first do the full text search and put results in a table variable and use that variable everything goes super fast. The number of reads are also much lower. It seems in "bad" scenarios, somehow it gets stuck in a loop which causes it to read many times from teh database but of course I don't understant why.
Now the question is first of all whyis that happening? and question two is that how scalable table variables are? what if it results to 10s of thousands of records? is it still going to be fast.
Any ideas?
Thanks
I spent quite sometime on this issue, and based on running many scenarios, this is what I figured out:
if you have Contains or ContainsTable anywhere in your query, that is the part that gets executed first and rather independently. Meaning that even if the rest of the conditions limit your search to only one record, neither Contains nor containstable care about that. So this is like a parallel execution.
Now since fulltext search only returns a Key field, it immediately looks for the Key as the first field of other indexes chosen for the query. So for the example above, it looks for the index with [key], field1, field2. The problem is that it chooses an index for the rest of query based on the fields in the where clause. so for the example above it picks the covered index that I have which is something like field1, field2, Id. (Id of the table is the same as the [Key] returned from the full text search). So summary is:
executes containstable
executes the rest of the query and pick an index based on where clause of the query
It tries to merge these two. Therefore, if the index that it picked for the rest of the query starts with the [key] field, it is fine. However, if the index doesn't have the [key] field as the first key, it starts doing loops. It does not even do a table scan, otherwise going through 5000 records would not be that slow. The way it does the loop is that it runs the loop for the total number of results from FTS multiplied by total number of results from the rest of the query. So if the FTS is returning 2000 records and the rest of the query returns 3000, it loops 2000*3000= 6,000,000. I donot understand why.
So in my case it does the full text search, then it does he rest of the query but picks the covered index that I have which is based on field1, field2,id (which is wrong) and as the result it screws up. If I change my covered index to Id, field1, field2 everything would be very fast.
My expection was that FTS returns bunch of [key], the rest of the query return bunch of [Id] and then the Id should be matched against [key].
Of course, I tried to simplify my query here, but the actual query is much more complicated and I cannot just change the index. I also do have scenarios where the text passed in full text is blank and in those scenarios I donot even want to join with containstable.
In those cases changing my covered index to have the id field as the first field, will generate disaster.
Anyways, for now I chose the temp table solution since it is working for me. I am also limiting the result to a few thousand which helps with the potential performance issues of table variables when the number of records go too high.
thanks
Normally it works very fast:
select t1.*, t2.Rank
from containstable(table1, field2, 'something') as t2
join table1 as t1 ON t1.id = t2.Key AND t1.field1=90
order by t2.Rank desc
There is a big difference where you put your search criteria: in JOIN or in WHERE.
I'm going to take a guess here that your issue is the same as on the other thread I linked to. Are you finding the issue arises with multiple word search terms?
If so my answer from that thread will apply.
From http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506240
The most important thing is that the
correct join type is picked for
full-text query. Cardinality
estimation on the FulltextMatch STVF
is very important for the right plan.
So the first thing to check is the
FulltextMatch cardinality estimation.
This is the estimated number of hits
in the index for the full-text search
string. For example, in the query in
Figure 3 this should be close to the
number of documents containing the
term ‘word’. In most cases it should
be very accurate but if the estimate
was off by a long way, you could
generate bad plans. The estimation for
single terms is normally very good,
but estimating multiple terms such as
phrases or AND queries is more complex
since it is not possible to know what
the intersection of terms in the index
will be based on the frequency of the
terms in the index. If the cardinality
estimation is good, a bad plan
probably is caused by the query
optimizer cost model. The only way to
fix the plan issue is to use a query
hint to force a certain kind of join
or OPTIMIZE FOR.
So it simply cannot know from the information it stores whether the 2 search terms together are likely to be quite independent or commonly found together. Maybe you should have 2 separate procedures one for single word queries that you let the optimiser do its stuff on and one for multi word search terms that you force a "good enough" plan on (sys.dm_fts_index_keywords might help if you want to do a rough estimate of cardinality yourself).
If you are getting the issue with single word queries this passage from the linked article might apply.
In SQL Server 2008 full-text search we have the ability to alter the plan that is
generated based on a cardinality estimation of the search term used. If the query plan is fixed (as it is in a parameterized query inside a stored procedure), this step does
not take place. Therefore, the compiled plan always serves this query, even if this plan is not ideal for a given search term.
So you might need to use the RECOMPILE option.

Select statement performance

I'm having a performance issue with a select statement I'm executing.
Here it is:
SELECT Material.*
FROM Material
INNER JOIN LineInfo ON Material.LineInfoCtr = LineInfo.ctr
INNER JOIN Order_Header ON LineInfo.Order_HeaderCtr = Order_Header.ctr
WHERE (Order_Header.jobNum = 'ttest')
AND (Order_Header.revision_number = 0)
AND (LineInfo.lineNum = 46)
The statement is taking 5-10 seconds to execute depending on server load.
Some table stats:
- Material has 2,030,xxx records.
- Lineinfo has 190,xxx records
- Order_Header has 2,5xx records.
My statement is returning a total of 18 rows containing about 20-25 fields of data. Returning a single field or all of them makes no difference. Is this performance typical? Is there something I could do to improve it?
I've tried using a sub select to retrieve the foreign key, the IN clause and I found one post where a fella said using a left outer join helped him. For me, they all yield the same 5 to 10 seconds of execution time.
This is MS SQL server 2005 accessed through MS SQL management studio. Times are the elapsed time in query analyzer.
Any ideas?
The first thing you should do is analyze the query plan, to see what indexes (if any) SQL Server is using.
You can probably benefit from some covering indexes in this query, since you only use columns in Lineinfo and Order_Header for the join and the query restriction (the WHERE clause).
I do not see anything special in your query so, if indexes are correct, it should perform much more faster than that,, the number of rows is not very high.
Do you have indexes on the table involved in the query and have you tried to use the "display execution plan" option of the Query Analyzer. Basically you need to run the query, loop at the execution plan and add indexes so that you do not see any full table scan operation.
If you run from SQL Management studio then you have the option to tune automatically the query adding indexes but I would suggest trying optimize on your own to better understand what you're doing.
Regards
Massimo
It won't affect performance, but don't write a query such as "SELECT * FROM X". Eschew the star notation and spell out the individual columns. The code that calls this will still work that way, even if the schema is changed by adding a column.
Indexes are key here, as others have already said.
The order of the WHERE clauses can help. Execute the one that eliminates the greatest number of rows from consideration first.
Taking all suggestions and rolling them together I was able to setup some indexes and now it's taking less than a second to execute. Honestly, it's almost immediate.
My problem was that by clicking on the table properties I saw that the primary key was indexed and I mistakenly thought that's what everyone had been talking about. I looked at the execution plan and ran the tuning assistant and putting the two together, I realized that you could index the foreign keys too. That is now done and things are exceptionally snappy.
Thanks for the help, and sorry for such a newb question.

SQL Server query taking up 100% CPU and runs for hours

I have a query that has been running every day for a little over 2 years now and has typically taken less than 30 seconds to complete. All of a sudden, yesterday, the query started taking 3+ hours to complete and was using 100% CPU the entire time.
The SQL is:
SELECT
#id,
alpha.A, alpha.B, alpha.C,
beta.X, beta.Y, beta.Z,
alpha.P, alpha.Q
FROM
[DifferentDatabase].dbo.fnGetStuff(#id) beta
INNER JOIN vwSomeData alpha ON beta.id = alpha.id
alpha.id is a BIGINT type and beta.id is an INT type. dbo.fnGetStuff() is a simple SELECT statement with 2 INNER JOINs on tables in the same DB, using a WHERE id = #id. The function returns approximately 11000 results.
The view vwSomeData is a simple SELECT statement with two INNER JOINs that returns about 590000 results.
Both the view and the function will complete in less than 10 seconds when executed by themselves. Selecting the results of the function into a temporary table first and then joining on that makes the query finish in < 10 seconds.
How do I troubleshoot what's going on? I don't see any locks in the activity manager.
Look at the query plan. My guess is that there is a table scan or more in the execution plan. This will cause huge amounts of I/O for the few record you get in the result.
You could use the SQL Server Profiler tool to monitor what queries are running on SQL Server. It doesn't show the locks, but it can for instance also give you hints on how to improve your query by suggesting indexes.
If you've got a reasonably recent version of SQL Server Management Studio, it has a Database Tuning Adviser as well, under Tools. It takes a trace from the Profiler and makes some, sometimes highly useful, suggestions. Makes sure there's not too many queries - it takes a long time to build advice.
I'm not an expert on it, but have had some luck with it in the past.
Do you need to use a function? Can you re-write the entire thing into a stored procedure in which you pass in the #ID as a parameter.
Even if your table has indexes because you pass the #ID as a variable to the WHERE clause potentially greatly increasing the amount of time for the query to run.
The reason the indexes may not be used is because the Query Analyzer does not know the value of the variables when it selects an access method to perform the query. Because this is a batch file, only one pass is made of the Transact-SQL code, preventing the Query Optimizer from knowing what it needs to know in order to select an access method that uses the indexes.
You might want to consider an INDEX query hint if you cannot re-write the SQL.
it might also be possible, since this just started happening, that the INDEXes have become fragmented and might need to be rebuilt.
I've had similar problems with joining functions that return large datasets. I had to do what you've already suggested. Put the results in a temp table and join on that.
Look at the estimated plan, this will probably shed some light. Typically when query cost gets orders of magnitude more expensive it is because a loop or merge join is being used where a hash join is more appropriate. If you see a loop or merge join in the estimated plan, look at the number of rows it expects to process - is it far smaller than the number of rows you know will actually be in play? You can also specify a hint to use a hash join and see if it performs much better. If so, try updating statistics and see if it goes back to a hash join without a hint.
SELECT
#id,
alpha.A, alpha.B, alpha.C,
beta.X, beta.Y, beta.Z,
alpha.P, alpha.Q
FROM
[DifferentDatabase].dbo.fnGetStuff(#id) beta
INNER HASH JOIN vwSomeData alpha ON beta.id = alpha.id
-- having no idea what type of schema is in place and just trying to throw out ideas:
Like others have said... use Profiler and find the source of pain... but I'm thinking it is the function on the other database. Since that function might be a source of pain, have you thought about a little denormalization or anything on [DifferentDatabase]. I think you'll find a bit more scalability in joining to a more flattened table with indexes than a costly function.
Run this command:
SET SHOWPLAN_ALL ON
Then run your query. It will display the execution plan, look for a "SCAN" on an index or a table. That is most likely what is happening to your query now. If that is the case, try to figure out why it is not using indexes now (refresh statistics, etc)

Resources