I have a SQL Server 2008 R2 database table with 12k address records where I am trying to filter out duplicate phone numbers and flag them using the following query
SELECT a1.Id, a2.Id
FROM Addresses a1
INNER JOIN Addresses a2 ON a1.PhoneNumber = a2.PhoneNumber
WHERE a1.Id < a2.Id
Note: I realize that there is another way to solve this problem by using EXISTS, but this is not part of the discussion.
The table has a Primary Key on the ID field with a Clustered Index, the fragmentation level is 0 and the phone number field is not null and has about 130 duplicates out of the 12k records. To make sure it is not a server or database instance issue I ran it on 4 different systems.
Execution of the query takes several minutes, sometimes several hours. After trying almost everything as one of my last steps I removed the Primary Key and ran the query without it and voila it executed in under 1 second. I added the Primary Key back and it still ran in under one second.
Does anybody have an idea what is causing this problem?
Is it possible that the primary key gets somehow corrupted?
EDIT: My apologies I had a couple of typos in the Sql Query
Out of data statistics. Dropping and recreating the PK will give up fresh statistics.
Too late now, but I'd have suggest running sp_updatestats to see what happened.
If you backup and restore a database onto different systems, statistics follow the data
I'd suspect a different plan too after non-indexed (I guess) columns PhoneNumber and CCAPhoneN
I'm guessing there are no indexes on PhoneNumber or PhoneNo.
You are joining on these fields, but if they aren't indexed it's forcing TWO table scans, one for each instance of the table in the query, then probably doing a hash match to find matching records.
Next step - get an exec plan and see what the pain points are.
Then, add indexes to those fields (assuming you see a Clustered Index Scan) and see if that fixes it.
I think your other issue is a red herring. The PK likely has nothing to do with it, but you may have gotten page caching (did you drop the buffers and clear the cache between runs?) which made the later runs faster.
Related
I have tblClaims(ClaimID, ValidityTo, ...) and tblClaimServices(ClaimServiceId, ClaimID, ValidityTo, ....) with an obvious foreign key on ClaimID. The ValidityTo is used for history, so actual data has ValidityTo=null.
These tables have respectively 3 million and 13 million rows.
The query:
select * from tblClaimServices where ClaimID=1234567 and ValidityTo is null
takes 5 seconds to execute !
Querying ... where ClaimID=1234567 is instantaneous.
Note that we're not doing select * but specifying almost all columns. This is an ORM (Django)
The explain plan shows that it's using a clustered index on (ClaimServiceID, ValidityTo) and then working hard to query the ClaimID within those. That's insane ! ValidityTo is null for 98% of the rows.
We created an index on (ClaimID, ValidityTo) but it wasn't used. We then created an index on ClaimID with an included column for validityto:
CREATE NONCLUSTERED INDEX idx_test1 ON tblClaimServices (ClaimID) include (ValidityTo) WHERE ValidityTo IS NULL
But wasn't used either. (So taking 5 seconds to find 0 to 10 rows)
However, using a hint
from tblClaimServices with (index(idx_test1))
Does work great. Instant results.
Now, I can't and don't want to have to include hints. SQL Server should be able to use an index that is so specific ! And it would require me to update an old app that uses a ORM and including the hints there would be a major pain. And make the app pretty fragile or become very slow in other queries.
How can I improve SQL Server's decision to use that proper index ?
I discovered that this strange behavior disappeared when the database was in 2012 compatibility mode. In a more recent mode, the database avoids using the date index validity_to.
We do have a similar field that points from soft-deleted records to the current one which is an integer. Replacing the date condition ( is null ) with the integer makes all these queries use the index properly and return results immediately.
I am still not 100% sure why the index isn't used for the validity_to but my problem is solved.
In a new mission, I am faced with a SQL Server 2000 :( where I found plenty of large tables without any clustered index. So I suggested changing that. When testing, we found - and double checked - that at least one query was not returning the same result when the PK index was clustered and when it was not.
I know the query is ugly; it is generated by the GUI where the user can select fields and conditions for a custom report. Here is the query:
SELECT DISTINCT p.*, pcc.PatentCostCentreLink_pk, pcc.Client_fk,
pcc.Division_fk, pcc.CostCentre_fk, pcc.Reference, pcc.DecisionMaker
FROM dbo.Patent AS p
LEFT OUTER JOIN dbo.PatentCostCentreLink AS pcc ON p.Patent_pk = pcc.Patent_fk
WHERE (pcc.Client_fk = 2787) AND (pcc.Division_fk IS NULL)
AND (pcc.CostCentre_fk = 20066) AND (pcc.Reference LIKE 'P1049%')
My question is: with the same tables - except for changing 1 PK from non-clustered to clustered - why/how is it possible that the same query returns different result sets ? (23 rows with non clustered index, 1 row with clustered index).
Remarks about the nonsense in the query are useless. I know it's bad.
Note: the changed index is PK_PatentCostCentreLink, on dbo.PatentCostCentreLink.PatentCostCentreLink_pk (identity column).
Note2: when removing the DISTINCT or changing the JOIN to INNER then both databases return the same result (23 rows), as expected. But has I mentioned, that's is another question.
I would check a couple of things:
Make sure you use the latest service pack / hotfix available. AFAIR, it should be 8.0.2253, unless your company has extended support access or whatever it was back then. Check SQL Server Builds for details.
Make sure that your data is not corrupted. I can't recall the details now, but dbcc checkdb() command on 2000 version misses some discrepancies, so it would be better to attach / restore the database on the 2005 instance and checking it there.
Perform any maintenance that might affect this: rebuilding indices, updating statistics, etc.
Semantically, this query will result in inner rather than outer join (the WHERE part contains conditions for outer table), so there can be no reasonable explanation for this behaviour. So, unless anything mentioned above will help, chances are you've hit some heap-related bug that nobody will fix... Time for upgrade? :)
I am running SQL Server 2012 and this one query is killing my database performance.
My text message provider does not support scheduled text messages so I have a text message engine that picks up messages from the database and sends them at the scheduled time. I put this query together that gets the messages from the database and also changes their status so that they do not get picked up again.
The query works fine, it is just causing wait times on the CPU especially since it runs every other second. I installed a database performance software and it said this query accounts for 92% of instance execution time. The software also said that every single execution is doing 347,267 Logical Reads.
Any ideas on how to make this perform better?
Should I maybe select into a temporary table and update those results before returning them?
Here is the current query:
UPDATE TOP (30) dbo.Outgoing
SET Status = 2
OUTPUT INSERTED.OutgoingID, INSERTED.[Message], n.PhoneNumber, c.OptInStatus
FROM dbo.Outgoing o
JOIN Numbers n on n.NumberID = o.NumberID
LEFT JOIN Contacts c on c.ContactID = o.ContactID
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
Here is the execution plan
There are three tables involved in this query: Outgoing, Numbers, & Contacts
Outgoing is the main table that this query deals with. There are only two indexes right now, a clustered primary key index on OutgoingID [PK, bigint, not null] and a non-clustered, non-unique index on SmsId [varchar(255), null] which is an identifier sent back from our text message provider once the messages are successfully received in their system. The Status column is just an integer column that relates to a few different statuses (Scheduled, Queued, Sent, Failed, Etc)
Numbers is just a simple table where we store unique cell phone numbers, some different formats of that number, and some basic information identifying the customer such as First name, carrier, etc. It just has a clustered primary key index on NumberID [bigint]. The PhoneNumber column is just a varchar(15).
The Contacts table just connects the individual person (phone number) to one of our merchants and keeps up with the number's opt in status, and other information related to the customer/merchant relationship. The only columns related to this query are OptInStatus [bit, not null] and ContactID [PK, bigint, not null]
--UPDATE--
Added a non-clustered index on the the Outgoing table with columns (Scheduled, SmsId, Status) and that seems to have brought down the execution time from 2+ second to milliseconds. I will check in with my performance monitoring software tomorrow to see how it has improved. Thank you everyone for the help so far!
As several commenters have already pointed out you need a new index on the dbo.Outgoing table. The server is struggling with finding the rows to update/output. This is most probably where the problem is:
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
To improve performance you should create an index on dbo.Outgoing where you include these columns. This will make is easier for Sql Server to find the correct rows. It will on the other hand create some more work for the actual update though since there will be a new index that needs attention when updating.
While you're working on this, it would likely be a good idea to shorten the SmsId column unless you actually need it to be 255 chars long. Preferably before you create the index.
As an alternate solution you might think about having separate tables for the messages that are outgoing and those that are outgone. Then you can:
insert all records from Outgoing to Outgone
delete all records from Outgoing, with output clause like you are currently doing.
Make sure though that the insert and the delete operations are done in one transaction or you will soon have weird inconsistencies in the database.
it is just causing wait times on the CPU especially since it runs every other second.
Get rid of the TOP 30 and run it much less often than once every other second... maybe every two or three minutes.
You can enable max degree of parallelism of your sql server for faster processing
I have a table with around 115k rows. Something like this:
Table: People
Column: ID PRIMARY KEY INT IDENTITY NOT NULL
Column: SpecialCode NVARCHAR(255) NULL
Column: IsActive BIT NOT NULL
Initially, I had an index defined like so:
PK_IDX (clustered) -- clustered index on primary key
IDX_SpecialCode (non clustered, non-unique) -- index on the SpecialCode column
And I'm doing an update like so:
Update People set IsActive = 0
Where SpecialCode not in ('...enormous list of special codes....')
This enormous list is essentially 99% of the users in the table.
This update takes forever on my server. As a test I trimmed the list of special codes in the "not in" clause to something like 1% of the users in the table, and my execution plan ends up using an INDEX SCAN on the PK_IDX index instead of the IDX_SpecialCode index that I thought it'd use.
So, I thought that maybe I needed to modify the IDX_SpecialCode so that it included the column "IsActive" in it. I did so and I still see the execution plan defaulting to the PK_IDX index scan and my query still takes a very long time to run.
So - what is the more correct way to do an update of this nature? I have the list of user's I want to exclude from the update, but was trying to avoid loading all employees special codes from the database, filtering out those not in my list on my application side, and then running my query with an in clause, which will be a much much smaller list in my actual usage.
Thanks
If you have the employees you want to exclude, why not just populate an indexed table with those PK_IDs and do a:
Update People
set IsActive = 0
Where NOT EXISTS (SELECT NULL
FROM lookuptable l
WHERE l.PK = People.PK)
You are getting index scans because SQL Server is not stupid, and realizes that it makes more sense to just look at the whole table instead of checking for 100 different criteria one at a time. If your stats are up to date the optimizer knows about how much of the table is covered by your IN statement and will do a table or clustered index scan if it thinks it will be faster.
With SQL-Server indexes are ignored when you use the NOT clause. That is why you are seeing the execution plan ignoring your index. <- Ref: page 6. MCTS Exam 70-433 Database Development SQL 2008 (I'm reading it at the moment)
It might be worth taking a look at Full text indexes although I don't know whether the same will happen with that (I haven't got access to a box with it set up to test at the moment)
hth
Is there any way you could use the IDs of the users you wish to exclude instead of their code - even on indexed values comparing ids may be faster than strings.
I think that the problem is your SpecialCode NVARCHAR(255). Strings comparison in Sql Server are very slow. Consider change your query to work with the IDs. And also, try to avoid the NVarchar. if dont care about Unicode, use Varchar instead.
Also, check your database collation to see if it matches the instance collation. Make sure you are not having hard disk performance issues.
I am trying to insert thousands of rows into a table and performance is not acceptable. Rows on a particular table take 300ms per row to insert.
I know that tools exist to profile queries run against SQL Server (SQL Server Profile, Database Tuning Advisor), but how would I profile insert and update statements to determine slow running inserts? Am I forced to use perfmon while the queries run and deduce the issue with counters?
I would first check the query plan of a single insert to understand the costs associated to that operation - it is not known from the question whether the insert is selecting the data from elsewhere.
I would then check the table indexing for the following:
how many indexes are in place (apart from filtered indexes, each index will be inserted into as well)
whether a clustered index is present or are we inserting into a heap.
if the clustered index key means we will be getting a hotspot benefit on the end of the table or causing a large quantity of page splits.
This is all SQL schema based issues, assuming there is no problems within SQL, you can start checking disk IO counters to check for disk queue lengths and response time. Not forgetting the Log drive response time since each insert will be logged.
These kind of problems are very difficult to nail down as any 1 perscriptive thing / silver bullet you can give advice over, just a range of things you should be checking.
I'm betting that the problem is with the selects and not necessarily the updates. Have you tried profiling the select part of the update statement to make sure there isn't a problem there first?