On a new assignment, I am faced with a SQL Server 2000 :( where I found plenty of large tables without any clustered index, so I suggested changing that. When testing, we found - and double-checked - that at least one query did not return the same result when the PK index was clustered as when it was not.
I know the query is ugly; it is generated by the GUI where the user can select fields and conditions for a custom report. Here is the query:
SELECT DISTINCT p.*, pcc.PatentCostCentreLink_pk, pcc.Client_fk,
pcc.Division_fk, pcc.CostCentre_fk, pcc.Reference, pcc.DecisionMaker
FROM dbo.Patent AS p
LEFT OUTER JOIN dbo.PatentCostCentreLink AS pcc ON p.Patent_pk = pcc.Patent_fk
WHERE (pcc.Client_fk = 2787) AND (pcc.Division_fk IS NULL)
AND (pcc.CostCentre_fk = 20066) AND (pcc.Reference LIKE 'P1049%')
My question is: with the same tables - except for changing one PK from non-clustered to clustered - why/how is it possible that the same query returns different result sets? (23 rows with the non-clustered index, 1 row with the clustered index.)
Remarks about the nonsense in the query are useless. I know it's bad.
Note: the changed index is PK_PatentCostCentreLink, on dbo.PatentCostCentreLink.PatentCostCentreLink_pk (identity column).
Note 2: when removing the DISTINCT or changing the JOIN to INNER, both databases return the same result (23 rows), as expected. But as I mentioned, that's another question.
I would check a couple of things:
Make sure you use the latest service pack / hotfix available. AFAIR, it should be 8.0.2253, unless your company has extended support access or whatever it was back then. Check SQL Server Builds for details.
Make sure that your data is not corrupted. I can't recall the details now, but the DBCC CHECKDB command on the 2000 version misses some discrepancies, so it would be better to attach/restore the database on a 2005 instance and check it there.
Perform any maintenance that might affect this: rebuilding indices, updating statistics, etc.
Semantically, this query results in an inner rather than an outer join (the WHERE clause contains conditions on the outer table), so there can be no reasonable explanation for this behaviour. So, unless anything mentioned above helps, chances are you've hit some heap-related bug that nobody will fix... Time for an upgrade? :)
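To make that concrete, the query behaves like the first form below; if genuine outer-join semantics were wanted, the conditions would have to move into the ON clause, as in the second form (a sketch, not tested against your schema):
-- The WHERE conditions on pcc reject every NULL-extended row,
-- so the original query is logically an inner join:
SELECT DISTINCT p.*, pcc.PatentCostCentreLink_pk, pcc.Client_fk,
pcc.Division_fk, pcc.CostCentre_fk, pcc.Reference, pcc.DecisionMaker
FROM dbo.Patent AS p
INNER JOIN dbo.PatentCostCentreLink AS pcc ON p.Patent_pk = pcc.Patent_fk
WHERE (pcc.Client_fk = 2787) AND (pcc.Division_fk IS NULL)
AND (pcc.CostCentre_fk = 20066) AND (pcc.Reference LIKE 'P1049%')

-- Versus: filters in the ON clause keep every Patent row and only
-- NULL-extend the pcc columns when no link row matches the conditions:
SELECT DISTINCT p.*, pcc.PatentCostCentreLink_pk, pcc.Client_fk,
pcc.Division_fk, pcc.CostCentre_fk, pcc.Reference, pcc.DecisionMaker
FROM dbo.Patent AS p
LEFT OUTER JOIN dbo.PatentCostCentreLink AS pcc
ON p.Patent_pk = pcc.Patent_fk
AND pcc.Client_fk = 2787 AND pcc.Division_fk IS NULL
AND pcc.CostCentre_fk = 20066 AND pcc.Reference LIKE 'P1049%'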
I have tblClaims(ClaimID, ValidityTo, ...) and tblClaimServices(ClaimServiceId, ClaimID, ValidityTo, ....) with an obvious foreign key on ClaimID. The ValidityTo is used for history, so actual data has ValidityTo=null.
These tables have respectively 3 million and 13 million rows.
The query:
select * from tblClaimServices where ClaimID=1234567 and ValidityTo is null
takes 5 seconds to execute!
Querying ... where ClaimID=1234567 is instantaneous.
Note that we're not doing select * but specifying almost all columns. This is an ORM (Django).
The explain plan shows that it's using the clustered index on (ClaimServiceID, ValidityTo) and then working hard to find the ClaimID within those rows. That's insane! ValidityTo is null for 98% of the rows.
We created an index on (ClaimID, ValidityTo) but it wasn't used. We then created a filtered index on ClaimID with ValidityTo as an included column:
CREATE NONCLUSTERED INDEX idx_test1 ON tblClaimServices (ClaimID) include (ValidityTo) WHERE ValidityTo IS NULL
But it wasn't used either (so it still took 5 seconds to find 0 to 10 rows).
However, using a hint
from tblClaimServices with (index(idx_test1))
does work great. Instant results.
Now, I can't and don't want to have to include hints. SQL Server should be able to use an index that is so specific! It would also require me to update an old app that uses an ORM, where including hints would be a major pain, and it would make the app pretty fragile or very slow in other queries.
How can I improve SQL Server's decision to use that proper index?
I discovered that this strange behavior disappeared when the database was in 2012 compatibility mode. Under a more recent compatibility level, the optimizer avoids using the index on the validity_to date column.
We do have a similar field, an integer, that points from soft-deleted records to the current one. Replacing the date condition (is null) with the same test on the integer column makes all these queries use the index properly and return results immediately.
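Roughly, the rewrite looks like this (CurrentServiceID is a made-up name for that integer column):
-- Hypothetical column: an integer that is NULL on current rows and
-- points to the replacement row on soft-deleted ones
SELECT *
FROM tblClaimServices
WHERE ClaimID = 1234567
AND CurrentServiceID IS NULL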
I am still not 100% sure why the index isn't used for the validity_to but my problem is solved.
I have two tables that I want to join; they both have an index on the column I am trying to join on.
QUERY 1
SELECT * FROM [A] INNER JOIN [B] ON [A].F = [B].F;
QUERY 2
SELECT * FROM (SELECT * FROM [A]) [A1] INNER JOIN (SELECT * FROM B) [B1] ON [A1].F=[B1].F
The first query clearly will utilize the index; what about the second one?
My guess is that after the two select statements in the brackets are executed, the join would occur, but the index wouldn't help to speed up the query because each subquery result is pretty much a new table...
The query isn't executed quite so literally as you suggest, where the inner queries are executed first and then their results are combined with the outer query. The optimizer will take your query and will look at many possible ways to get your data through various join orders, index usages, etc. etc. and come up with a plan that it feels is optimal enough.
If you execute both queries and look at their respective execution plans, I think you will find that they use the exact same one.
Here's a simple example of the same concept. I created my schema as so:
CREATE TABLE A (id int, value int)
CREATE TABLE B (id int, value int)
INSERT INTO A (id, value)
VALUES (1,900),(2,800),(3,700),(4,600)
INSERT INTO B (id, value)
VALUES (2,800),(3,700),(4,600),(5,500)
CREATE CLUSTERED INDEX IX_A ON A (id)
CREATE CLUSTERED INDEX IX_B ON B (id)
And ran queries like the ones you provided.
SELECT * FROM A INNER JOIN B ON A.id = B.id
SELECT * FROM (SELECT * FROM A) A1 INNER JOIN (SELECT * FROM B) B1 ON A1.id = B1.id
The plans generated for the two queries were identical; both utilize the index.
Chances are high that the SQL Server Query Optimizer will be able to detect that Query 2 is in fact the same as Query 1 and use the same indexed approach.
Whether this happens depends on a lot of factors: your table design, your table statistics, the complexity of your query, etc. If you want to know for certain, let SQL Server Query Analyzer show you the execution plan. Here are some links to help you get started:
Displaying Graphical Execution Plans
Examining Query Execution Plans
SQL Server uses predicate pushing (a.k.a. predicate pushdown) to move query conditions as far toward the source tables as possible. It doesn't slavishly do things in the order you parenthesize them. The optimizer uses complex rules--what is essentially a kind of geometry--to determine the meaning of your query, and restructure its access to the data as it pleases in order to gain the most performance while still returning the same final set of data that your query logic demands.
When queries become more and more complicated, there is a point where the optimizer cannot exhaustively search all possible execution plans and may end up with something that is suboptimal. However, you can pretty much assume that a simple case like you have presented is going to always be "seen through" and optimized away.
So the answer is that you should get just as good performance as if the two queries were combined. Now, if the values you are joining on are composite, that is they are the result of a computation or concatenation, then you are almost certainly not going to get the predicate push you want that will make the index useful, because the server won't or can't do a seek based on a partial string or after performing reverse arithmetic or something.
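For example, something along these lines (columns invented for illustration) will typically defeat the pushdown on B's side:
-- Seekable: the bare column B.F can be matched against an index on B.F
SELECT * FROM A INNER JOIN B ON A.F = B.F

-- Typically not seekable on B's index: the expression wraps the column,
-- and the server won't do the reverse arithmetic or split the string
SELECT * FROM A INNER JOIN B ON A.F = B.F + 1
SELECT * FROM A INNER JOIN B ON A.Name = B.FirstName + ' ' + B.LastName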
May I suggest that in the future, before asking questions like this here, you simply examine the execution plan for yourself to validate that it is using the index? You could have answered your own question with a little experimentation. If you still have questions, then come post, but in the meantime try to do some of your own research as a sign of respect for the people who are helping you.
To see execution plans, in SQL Server Management Studio (2005 and up) or SQL Query Analyzer (SQL 2000) you can just click the "Show Execution Plan" button on the menu bar, run your query, and switch to the tab at the bottom that displays a graphical version of the execution plan. Some little poking around and hovering your mouse over various pieces will quickly show you which indexes are being used on which tables.
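If you'd rather see a text version of the plan (this works in both Query Analyzer and Management Studio), something like the following prints the plan instead of executing the statement:
-- Each SET SHOWPLAN_TEXT must be alone in its batch, hence the GO separators
SET SHOWPLAN_TEXT ON
GO
SELECT * FROM A INNER JOIN B ON A.F = B.F
GO
SET SHOWPLAN_TEXT OFF
GO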
However, if things aren't as you expect, don't automatically think that the server is making a mistake. It may decide that scanning your main table without using the index costs less--and it will almost always be right. There are many reasons that scanning can be less expensive, one of which is a very small table, another of which is that the number of rows the server statistically guesses it will have to return exceeds a significant portion of the table.
Both queries are the same. The second query will be transformed into the same form as the first during query optimization.
However, if you have a specific requirement, I would suggest that you post the whole code. Then it would be much easier to answer your question.
I have a SQL Server 2008 R2 database table with 12k address records, where I am trying to filter out duplicate phone numbers and flag them using the following query:
SELECT a1.Id, a2.Id
FROM Addresses a1
INNER JOIN Addresses a2 ON a1.PhoneNumber = a2.PhoneNumber
WHERE a1.Id < a2.Id
Note: I realize that there is another way to solve this problem by using EXISTS, but this is not part of the discussion.
The table has a Primary Key on the ID field with a Clustered Index, the fragmentation level is 0 and the phone number field is not null and has about 130 duplicates out of the 12k records. To make sure it is not a server or database instance issue I ran it on 4 different systems.
Execution of the query takes several minutes, sometimes several hours. After trying almost everything, as one of my last steps I removed the primary key and ran the query without it, and voilà, it executed in under 1 second. I added the primary key back and it still ran in under one second.
Does anybody have an idea what is causing this problem?
Is it possible that the primary key gets somehow corrupted?
EDIT: My apologies, I had a couple of typos in the SQL query.
Out-of-date statistics. Dropping and recreating the PK will give you fresh statistics.
Too late now, but I'd have suggested running sp_updatestats to see what happened.
If you back up and restore a database onto different systems, statistics follow the data.
I'd also suspect a different plan, given the join on the (I guess) non-indexed columns PhoneNumber and CCAPhoneN.
I'm guessing there are no indexes on PhoneNumber or PhoneNo.
You are joining on these fields, but if they aren't indexed it's forcing TWO table scans, one for each instance of the table in the query, then probably doing a hash match to find matching records.
Next step - get an exec plan and see what the pain points are.
Then, add indexes to those fields (assuming you see a Clustered Index Scan) and see if that fixes it.
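For example, something along these lines (index name invented) should let the join seek instead of scan; Id rides along automatically as the clustering key:
-- Hypothetical name; PhoneNumber is the join column from the query above
CREATE NONCLUSTERED INDEX IX_Addresses_PhoneNumber ON Addresses (PhoneNumber)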
I think your other issue is a red herring. The PK likely has nothing to do with it, but you may have gotten page caching (did you drop the buffers and clear the cache between runs?) which made the later runs faster.
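For reference, a cold-cache test between runs would look roughly like this (never on a production box):
CHECKPOINT              -- write dirty pages to disk so they can be dropped
DBCC DROPCLEANBUFFERS   -- empty the buffer pool (cold data cache)
DBCC FREEPROCCACHE      -- throw away cached query plans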
I'm puzzled by the following. I have a DB with around 10 million rows, and (among other indices) one column (campaignid_int) has an index.
Now I have 700k rows where the campaignid is indeed 3835
For all these rows, the connectionid is the same.
I just want to find out this connectionid.
use messaging_db;
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
Now this query takes approx 30 seconds to perform!
I (with my small db knowledge) would expect that it would take any of the rows, and return me that connectionid
If I test this same query for a campaign which only has 1 entry, it goes really fast. So the index works.
How would I tackle this and why does this not work?
edit:
estimated execution plan:
select (0%) - top (0%) - clustered index scan (100%)
Due to the statistics, you should explicitly ask the optimizer to use the index you've created instead of the clustered one.
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK, index(idx_connectionid))
WHERE (campaignid_int = 3835)
I hope it will solve the issue.
Regards,
Enrique
I recently had the same issue and it's really quite simple to solve (at least in some cases).
If you add an ORDER BY clause on one or more of the indexed columns, it should be solved. That solved it for me, at least.
You aren't specifying an ORDER BY clause in your query, so the optimiser is not being instructed as to the sort order it should be selecting the top 1 from. SQL Server won't just take a random row; it will order the rows by something and take the top 1, and it may be choosing to order by something that is sub-optimal. I would suggest that you add an ORDER BY x clause, where x is the clustered key on that table; that will probably be the fastest.
This may not solve your problem -- in fact I'm not sure I expect it to from the statistics you've given -- but (a) it won't hurt, and (b) you'll be able to rule this out as a contributing factor.
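As a sketch, with id standing in for whatever the clustered key on that table actually is:
-- Hypothetical key column; ordering by the clustered key lets the
-- engine stop after the first row in index order
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE campaignid_int = 3835
ORDER BY id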
If the campaignid_int column is not indexed, add an index to it. That should speed up the query. Right now I presume that you need to do a full table scan to find the matches for campaignid_int = 3835 before the top(1) row is returned (filtering occurs before results are returned).
EDIT: An index is already in place, but since SQL Server does a clustered index scan, the optimizer has ignored the index. This is probably due to (many) duplicate rows with the same campaignid_int value. You should consider indexing differently or query on a different column to get the connectionid you want.
The index may be useless for two reasons:
700k rows out of 10 million may not be selective enough
and/or
connectionid needs to be included so the entire query can be satisfied by the index alone
Otherwise, the optimiser decides it may as well use the PK/clustered index to both filter on campaignid_int and get connectionid, to avoid a bookmark lookup on 700k rows from the current index.
So, I suggest this...
CREATE NONCLUSTERED INDEX IX_Foo ON MyTable (campaignid_int) INCLUDE (connectionid)
This doesn't answer your question, but try using:
SET ROWCOUNT 1
SELECT connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
I've seen top(x) perform very badly in certain situations as well. I'm sure it's doing a full table scan. Perhaps your index on that particular column needs to be rebuilt? The above is worth a try, however.
Your query does not work as you expect, because Sql Server keeps statistics about your index and in this particular case knows that there are a lot of duplicate rows with the identifier 3835, hence it figures that it would make more sense to just do a full index (or table) scan. When you test for an ID which resolves to only one row, it uses the index as expected, i.e. performs an index seek (the execution plan should verify this guess).
Possible solutions? Make the index composite, if you have anything to compose it with - e.g. compose it with the date the message was sent (if I understand your case correctly) - and then select the top 1 entry from the list with the specified id, ordered by the date. Though I'm not sure whether this would be better (for one, a composite index takes up more space) - just a guess.
EDIT: I just tried out the suggestion of making the index composite by adding a date column. If you do that and specify order by date in your query, an index seek is performed as expected.
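The experiment looked roughly like this (sent_date and the index name are placeholders for whatever the table really has):
-- Hypothetical column/index names
CREATE NONCLUSTERED INDEX IX_Campaign_SentDate
ON outgoing_messages (campaignid_int, sent_date)

-- With the composite index, ordering by the second key column
-- produced an index seek as expected
SELECT TOP (1) connectionid
FROM outgoing_messages
WHERE campaignid_int = 3835
ORDER BY sent_date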
"but since I'm specifying 'top(1)' it means: give me any row. Why would it first crawl through the 700k rows just to return one?" – reinier
Sorry, can't comment yet, but the answer here is that SQL Server is not going to understand the human equivalent of "bring me the first one you find" when it hears "TOP 1". Instead of the expected "give me any row", SQL Server goes and fetches the first of all found rows.
The only time it knows which row that is, is after fetching all matching rows and discarding the rest. Very thorough, but in your case not really fast.
The main issue, as others said, is your statistics and the selectivity of your index. If you have another unique field in your table (like an identity column), then try a combined index with campaignid_int first and the unique column second. As you only query on campaignid_int, it has to be the first part of the key.
Sounds worth a try, as this index should have a higher selectivity, so the optimizer can use it better than doing an index crawl.
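For example (assuming the unique column is an identity called id; names are made up):
-- campaignid_int first, because it is the column being filtered on
CREATE NONCLUSTERED INDEX IX_Campaign_Unique
ON outgoing_messages (campaignid_int, id)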
I'm having a performance issue with a select statement I'm executing.
Here it is:
SELECT Material.*
FROM Material
INNER JOIN LineInfo ON Material.LineInfoCtr = LineInfo.ctr
INNER JOIN Order_Header ON LineInfo.Order_HeaderCtr = Order_Header.ctr
WHERE (Order_Header.jobNum = 'ttest')
AND (Order_Header.revision_number = 0)
AND (LineInfo.lineNum = 46)
The statement is taking 5-10 seconds to execute depending on server load.
Some table stats:
- Material has 2,030,xxx records.
- Lineinfo has 190,xxx records
- Order_Header has 2,5xx records.
My statement is returning a total of 18 rows containing about 20-25 fields of data. Returning a single field or all of them makes no difference. Is this performance typical? Is there something I could do to improve it?
I've tried using a sub-select to retrieve the foreign key and the IN clause, and I found one post where a fella said using a left outer join helped him. For me, they all yield the same 5 to 10 seconds of execution time.
This is MS SQL server 2005 accessed through MS SQL management studio. Times are the elapsed time in query analyzer.
Any ideas?
The first thing you should do is analyze the query plan, to see what indexes (if any) SQL Server is using.
You can probably benefit from some covering indexes in this query, since you only use columns in Lineinfo and Order_Header for the join and the query restriction (the WHERE clause).
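For instance, something along these lines might do (index names invented; the INCLUDE covers the join column in case ctr is not already the clustering key):
-- Hypothetical covering indexes for the WHERE and join columns
CREATE NONCLUSTERED INDEX IX_OrderHeader_Job
ON Order_Header (jobNum, revision_number) INCLUDE (ctr)

CREATE NONCLUSTERED INDEX IX_LineInfo_Order
ON LineInfo (Order_HeaderCtr, lineNum) INCLUDE (ctr)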
I do not see anything special in your query, so if the indexes are correct it should perform much faster than that; the number of rows is not very high.
Do you have indexes on the tables involved in the query, and have you tried the "display execution plan" option of the Query Analyzer? Basically you need to run the query, look at the execution plan, and add indexes so that you do not see any full table scan operation.
If you run it from SQL Management Studio, then you have the option to tune the query automatically by adding indexes, but I would suggest trying to optimize it on your own to better understand what you're doing.
Regards
Massimo
It won't affect performance, but don't write a query such as "SELECT * FROM X". Eschew the star notation and spell out the individual columns. That way the code that calls this will still work, even if the schema is changed by adding a column.
Indexes are key here, as others have already said.
The order of the WHERE clauses can help. Execute the one that eliminates the greatest number of rows from consideration first.
Taking all suggestions and rolling them together I was able to setup some indexes and now it's taking less than a second to execute. Honestly, it's almost immediate.
My problem was that, by clicking on the table properties, I saw the primary key was indexed and mistakenly thought that was what everyone had been talking about. I looked at the execution plan and ran the tuning assistant, and putting the two together I realized that you can index the foreign keys too. That is now done and things are exceptionally snappy.
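The foreign-key indexes look roughly like this (index names invented):
-- Indexes on the foreign-key columns used by the joins
CREATE NONCLUSTERED INDEX IX_Material_LineInfoCtr ON Material (LineInfoCtr)
CREATE NONCLUSTERED INDEX IX_LineInfo_OrderHeaderCtr ON LineInfo (Order_HeaderCtr)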
Thanks for the help, and sorry for such a newb question.