LIKE query is locking the table? - sql-server

I have a column I'm doing a LIKE '%X%' query on. I did it this way for simplicity, knowing I'd have to revisit it as the amount of data grew.
However, one thing I did not expect was for it to lock the table, but it appears to be doing so. The query is slow, and if the query is running, other queries will not finish until it's done.
What I'm asking for are some general ideas as to why this might be happening, and advice on how I can drill down into SQL Server to get a better feel for exactly what's going on with respect to the locks.
I should mention that all of the queries are SELECTs, except for a service that wakes every 60 seconds to check for various things and potentially INSERT rows. This table is never updated - CRD only, not CRUD - but the issue is consistent rather than intermittent.

This is happening because LIKE '%X%' will force a complete scan. If there is an index that it can use, then the engine will scan the index, but it will need to read every row. If there is no nonclustered index, then the engine will perform a clustered index scan, or a table scan if the table is a heap.
This happens because of the LIKE %somevalue% construction. If you had a phonebook, say, and you were asked to find everyone with an X somewhere in the middle of their name, you'd have to read every entry in the index. If, on the other hand, it was LIKE 'X%', you'd know you only had to look at the entries beginning with 'X'.
Since it has to scan the entire table, it is likely to escalate the row locks to a table lock. This is not a flaw: SQL Server is almost always right when it determines that it is more efficient to place a single lock on the table than 100,000 row locks, but it does block write operations.
My suggestion is that you try to find a better query than one involving LIKE '%X%'.
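To see what's going on with the locks for yourself while the query is running, the standard DMVs are a good starting point. A minimal sketch, run from a second connection (sys.dm_tran_locks and sys.dm_exec_requests are ordinary system views; no other assumptions):

-- Current lock requests in this database: who holds what, and whether
-- the request is granted or waiting
SELECT request_session_id,
       resource_type,      -- OBJECT, PAGE, KEY, RID...
       request_mode,       -- S, X, IS, IX, U...
       request_status      -- GRANT or WAIT
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();

-- Which sessions are blocked, and by whom
SELECT session_id, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;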

Related

SQL Server: nonclustered indexes after MERGE (insert / update)

I'm new to SQL Server, trying to optimize a procedure I received from an ex-colleague (and I can't ask him).
At the final step, the procedure updates a large table using a MERGE statement. After that, it drops two nonclustered indexes and creates them again. What is the purpose of doing that? Aren't the statistics collected for the optimizer refreshed regularly? Is recreating indexes the only way to provide the optimizer with fresh statistics?
Thanks
After that, it drops two nonclustered indexes and creates them again. What is the purpose of doing that?
To compact them into less space, without the page splits that may have happened during the merge. Generally it is NOT needed - at all. It MAY make sense, but it is much better to actually analyze the index's page-split and fragmentation statistics before doing that, unless you can be sure it is beneficial on every load.
Aren't the statistics collected for the optimizer refreshed regularly?
They are, but reorganizing the indexes may make them more efficient. As data changes, the data in the index pages changes, and when a page overflows it is split. This leads to the index (not the statistics on it) becoming fragmented over time, which may lead to additional IO load.
Is recreating indexes the only way to provide the optimizer with fresh statistics?
No. But you do not do it for statistics in the first place. You can just update the statistics if you want this. You do it to get an efficient index.
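As a rough sketch of the difference (the table and index names are placeholders): check fragmentation first, rebuild only if it is actually bad, and just refresh the statistics when that is all you need.

-- How fragmented is the index really?
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BigTable'), NULL, NULL, 'LIMITED');

-- Rebuild compacts the pages and refreshes the index's statistics as a side effect
ALTER INDEX IX_BigTable_SomeCol ON dbo.BigTable REBUILD;

-- If fresh statistics are all you want, this is far cheaper than a rebuild
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;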

Alternatives to SQL COUNT(*) practices?

I was looking for improvements to the PostgreSQL/InnoDB MVCC COUNT(*) problem when I found an article about implementing a workaround in PostgreSQL. However, the author made a statement that caught my attention:
MySQL zealots tend to point to PostgreSQL’s slow count() as a weakness, however, in the real world, count() isn’t used very often, and if it’s really needed, most good database systems provide a framework for you to build a workaround.
Are there ways to skip using COUNT(*) in the way you design your applications?
Is it true that most applications are designed so they don't need it? I use COUNT() on most of my pages since they all need pagination. What is this guy talking about? Is that why some sites only have a "next/previous" link?
Carrying this over into the NoSQL world, is this also something that has to be done there since you can't COUNT() records very easily?
I think when the author said
however, in the real world, count() isn’t used very often
they specifically meant an unqualified count(*) isn't used very often, which is the specific case that MyISAM optimises.
My own experience backs this up- apart from some dubious Munin plugins, I can't think of the last time I did a select count(*) from sometable.
For example, anywhere I'm doing pagination, it's usually the output of some search. Which implies there will be a WHERE clause to limit the results anyway- so I might be doing something like select count(*) from sometable where conditions followed by select ... from sometable limit n offset m. Neither of which can use the direct how-many-rows-in-this-table shortcut.
Now it's true that if the conditions are purely index conditions, then some databases can merge together the output of covering indices to avoid looking at the table data too. Which certainly decreases the number of blocks looked at... if it works. It may be that, for example, this is only a win if the query can be satisfied with a single index- depends on the db implementation.
Still, this is by no means always the case- a lot of our tables have an active flag which isn't indexed, but often is filtered on, so would require a heap check anyway.
If you just need an idea of whether a table has data in it or not, Postgresql and many other systems do retain estimated statistics for each table: you can examine the reltuples and relpages columns in the catalogue for an estimate of how many rows the table has and how much space it is taking. Which is fine so long as ~6 significant figures is accurate enough for you, and some lag in the statistics being updated is tolerable. In my use case that I can remember (plotting the number of items in the collection) it would have been fine for me...
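For that estimate, something along these lines works in PostgreSQL (sometable stands in for your table):

-- Planner's rough idea of the row count and size, no table scan involved
SELECT reltuples::bigint AS estimated_rows,
       relpages          AS pages_on_disk
FROM pg_class
WHERE relname = 'sometable';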
Trying to maintain an accurate row counter is tricky. The article you cited caches the row count in an auxiliary table, which introduces two problems:
a race condition between SELECT and INSERT populating the auxiliary table (minor, you could seed this administratively)
as soon as you add a row to the main table, you have an update lock on the row in the auxiliary table. Now any other process trying to add to the main table has to wait.
The upshot is that concurrent transactions get serialised instead of being able to run in parallel, and you've lost the writers-dont-have-to-block-either benefits of MVCC- you should reasonably expect to be able to insert two independent rows into the same table at the same time.
MyISAM can cache the row count per table because it takes an exclusive lock on the table when someone writes to it (IIRC). InnoDB allows finer-grained locking-- but it doesn't try to cache the row count for the table. Of course if you don't care about concurrency and/or transactions, you can take shortcuts... but then you're moving away from Postgresql's main focus, where data integrity and ACID transactions are primary goals.
I hope this sheds some light. I must admit, I've never really felt the need for a faster "count(*)", so to some extent this is simply a "but it works for me" testament rather than a real answer.
While you're asking more of an application design than database question really, there is more detail about how counting executes in PostgreSQL and the alternatives to doing so at Slow Counting. If you must have a fast count of something, you can maintain one with a trigger, with examples in the references there. That costs you a bit on the insert/update/delete side in return for speeding that up. You have to know in advance what you will eventually want a count of for that to work though.
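If you do go the trigger route, a minimal PostgreSQL sketch looks something like this (the table, function and trigger names are invented, and note it is exactly the auxiliary-table approach whose locking trade-offs are discussed above):

-- Auxiliary table holding the cached count; seed it once from count(*)
CREATE TABLE sometable_rowcount (cnt bigint NOT NULL);
INSERT INTO sometable_rowcount SELECT count(*) FROM sometable;

CREATE OR REPLACE FUNCTION maintain_sometable_rowcount() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE sometable_rowcount SET cnt = cnt + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE sometable_rowcount SET cnt = cnt - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sometable_rowcount_trg
AFTER INSERT OR DELETE ON sometable
FOR EACH ROW EXECUTE PROCEDURE maintain_sometable_rowcount();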

How to solve insert/delete deadlock on non-clustered index?

I have a deadlock problem and I found that it's caused by two stored procedures that are called by different threads (two web service calls):
An insert SP that inserts data into table X.
A delete SP that deletes data from table X.
Moreover, the result I got told me the deadlock happened on a non-unique, non-clustered index of table X. Do you have any ideas on how to solve this problem?
Update
From Read/Write deadlock, I think the error happens because of the following statements.
In the insert statement, it acquires the id (clustered index) and then the non-clustered index.
In the delete statement, it acquires the non-clustered index before the id.
So, I need to select the id for the delete statement, like the following:
SELECT id FROM X WITH(NOLOCK) WHERE [condition]
PS. Both stored procedures are called in a transaction.
Thanks,
We'd have to see some kind of code... you mention a transaction; what isolation level is it at? One thing to try is adding the (UPDLOCK) hint to any query that you use to find the row (or check existence), so you take out an update lock (rather than a shared read lock) from the start.
When contested, this should then cause (very brief) blocking rather than a deadlock.
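For example, in the delete SP, something like this (a sketch only - the column and parameter are placeholders for whatever your real condition is):

BEGIN TRAN;

-- Take update locks on the rows we intend to remove, rather than shared
-- locks that would have to be converted to exclusive locks later
SELECT id
FROM X WITH (UPDLOCK)
WHERE SomeColumn = @someValue;

DELETE FROM X
WHERE SomeColumn = @someValue;

COMMIT TRAN;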
Do the stored procedures modify anything, or just do reads? If they modify something, are there WHERE clauses on the updates so that they're sufficiently granular? If you can update the rows in smaller batches, SQL Server is less likely to deadlock, since it will only lock small amounts of the index, instead of the index as a whole.
If it's possible, can you post the code here that's deadlocking? If the stored procedures are too long, can you post the offending statements within them (if you know which they are)?
Without the deadlock info, this is more of a guess than a proper answer... It could be an index access order issue similar to the read-write deadlock.
It could be that the select queries are the actual problem, especially if they are the same tables in both stored procedures, but in a different order. It's important to remember that reading from a table will create (shared) locks. You might want to read up on lock types.
The same can happen at the index level, as Remus posted about. The article he linked offers a good explanation, but unfortunately no one-size-fits-all solution, because there isn't a single best answer for every case.
I'm not exactly an expert in this field myself really, but using lock hints, you may be able to ensure the same resources get locked in the same order, preventing a deadlock. You will probably need more information from your testers to effectively solve this though.
The quick way to get your application back doing what it's supposed to is to detect the deadlock error (1205) and rerun the transaction. Code for this can be found in the "TRY...CATCH" section of Books Online.
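A sketch of that retry pattern, assuming SQL Server 2005 or later (the body of the transaction is just a placeholder):

DECLARE @retries int, @msg nvarchar(2048);
SET @retries = 3;

WHILE @retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        -- ... the insert or delete work goes here ...
        COMMIT TRAN;
        BREAK;                           -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRAN;

        IF ERROR_NUMBER() = 1205         -- we were picked as the deadlock victim
            SET @retries = @retries - 1; -- loop around and try again
        ELSE
        BEGIN
            SET @msg = ERROR_MESSAGE();
            RAISERROR(@msg, 16, 1);      -- any other error: re-raise and stop
            BREAK;
        END
    END CATCH
END
-- if @retries reached 0 the work was abandoned; log or report that as needed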
If you're deleting and inserting, then that's affecting the clustered index, and every non-clustered index on the table also needs to have an insert/delete. So it's definitely very possible for deadlocks to occur. I would start by looking at what your clustered indexes are - try having a surrogate key for your clustered index, for example.
Unfortunately it's almost impossible to completely solve your deadlock problem without more information.
Rob

SQL Server 2008 Performance: No Indexes vs Bad Indexes?

I'm running into a strange problem in Microsoft SQL Server 2008.
I have a large database (20 GB) with about 10 tables and I'm attempting to make a point regarding how to correctly create indexes.
Here's my problem: on some nested queries I'm getting faster results without using indexes! It's close (one or two seconds), but in some cases using no indexes at all seems to make these queries run faster... I'm running a CHECKPOINT and a DBCC DROPCLEANBUFFERS to reset the caches before running the scripts, so I'm kinda lost.
What could be causing this?
I know for a fact that the indexes are poorly constructed (think one index per relevant field); the whole point is to prove the importance of constructing them correctly, but it should never be slower than having no indexes at all, right?
EDIT: here's one of the guilty queries:
SET STATISTICS TIME ON
SET STATISTICS IO ON
USE DBX;
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DBCC FREEPROCCACHE;
GO
SELECT * FROM Identifier where CarId in (SELECT CarID from Car where ManufactId = 14) and DataTypeId = 1
Identifier table:
- IdentifierId int not null
- CarId int not null
- DataTypeId int not null
- Alias nvarchar(300)
Car table:
- CarId int not null
- ManufactId int not null
- (several fields follow, all nvarchar(100))
Each of these fields has its own index, along with some composite indexes that combine two of them at a time (e.g. CarId and DataTypeId).
Finally, the Identifier table has over a million entries, while the Car table has two or three million.
My guess would be that SQL Server is incorrectly deciding to use an index, which is then forcing a bookmark lookup*. Usually when this happens (the incorrect use of an index) it's because the statistics on the table are incorrect.
This can especially happen if you've just loaded large amounts of data into one or more of the tables. Or, it could be that SQL Server is just screwing up. It's pretty rare that this happens (I can count on one hand the times I've had to force index use over a 15 year career with SQL Server), but the optimizer is not perfect.
* A bookmark lookup is when SQL Server finds a row that it needs on an index, but then has to go to the actual data pages to retrieve additional columns that are not in the index. If your result set returns a lot of rows this can be costly and clustered index scans can result in better performance.
One way to get rid of bookmark lookups is to use covering indexes - an index which has the filtering columns first, but then also includes any other columns which you would need in the "covered" query. For example:
SELECT
my_string1,
my_string2
FROM
My_Table
WHERE
my_date > '2000-01-01'
covering index would be (my_date, my_string1, my_string2)
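The composite key above works; SQL Server also lets you keep just the filtering column in the key and carry the other columns at the leaf level with INCLUDE. A sketch against the example table (the index names are made up):

-- Option 1: everything in the key, as suggested above
CREATE NONCLUSTERED INDEX IX_My_Table_covering_key
ON My_Table (my_date, my_string1, my_string2);

-- Option 2: only the filter column as the key, the rest INCLUDEd
CREATE NONCLUSTERED INDEX IX_My_Table_covering_include
ON My_Table (my_date)
INCLUDE (my_string1, my_string2);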
Indexes don't really have any benefit until you have many records. I say many because I don't really know what that tipping point is... it depends on the specific application and circumstances.
It does take time for SQL Server to work with an index; if that time exceeds the benefit, the index is a net loss. This would especially be true in subqueries, where a small difference gets multiplied.
If it works better without the index, leave out the index.
Try DBCC FREEPROCCACHE to clear the execution plan cache as well.
This is just a guess. Maybe if you have a lot of indexes, SQL Server is spending time on analyzing and picking one, and then rejecting all of them. If you had no indexes, the engine wouldn't have to waste its time with this vetting process.
How long this vetting process actually takes, I have no idea.
For some queries, it is faster to read directly from the table (clustered index scan), than it is to read the index and fetch records from the table (index scan + bookmark lookup).
Consider that a record lives along with other records in a data page. A data page is the basic unit of IO. If the table is read directly, you could get 10 records for the cost of 1 IO. If the index is read first, and the records are then fetched from the table, you must pay 1 IO per record.
Generally SQL Server is very good at picking the best way to access a table (direct vs index). There may be something in your query that is blinding the optimizer. Query hints can instruct the optimizer to use an index when it is wrong to do so. Join hints can alter the order or method of access of a table. Table variables are assumed by the optimizer to contain almost no rows (they carry no statistics), so if you have a large table variable the optimizer may choose a bad plan.
One more thing to look out for - varchar vs nvarchar. Make sure all parameters are of the same type as the target columns. In the event of a type mismatch, SQL Server can end up implicitly converting the indexed column to the parameter's type, which defeats the index.
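A sketch of the kind of mismatch meant here, with a hypothetical varchar column called SomeCode on a hypothetical SomeTable:

-- SomeCode is varchar(50) and indexed. An nvarchar parameter forces an
-- implicit conversion of the column, which can turn an index seek into a scan.
DECLARE @code nvarchar(50);
SET @code = N'ABC';
SELECT * FROM SomeTable WHERE SomeCode = @code;

-- Matching the parameter type to the column avoids the conversion
DECLARE @code2 varchar(50);
SET @code2 = 'ABC';
SELECT * FROM SomeTable WHERE SomeCode = @code2;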
Normally SQL Server does a good job of deciding which index, if any, to use to retrieve the data in the fastest way. Quite often it will decide not to use any indexes, as it can retrieve small amounts of data from small tables quicker without going to the index.
It sounds like in your case SQL Server may not be taking the optimal route. Having lots of badly created indexes may be causing it to pick the wrong route to get to the data.
I would suggest viewing the query plan in Management Studio to check which indexes it's using and where the time is being taken. This should give you a good idea of where to start.
Another note: it may be that these indexes have become fragmented over time and are no longer performing at their best; it may be worth checking this and rebuilding some of them if needed.
Check the execution plan to see if it is using one of these indexes that you "know" to be bad.
Generally, indexing slows down writing data and can help to speed up reading data.
So yes, I agree with you. It should never be slower than having no indexes at all.
SQL Server actually creates some indexes for you (e.g. on the primary key).
Indexes can become fragmented.
Too many indexes will always reduce performance (there are FAQs on why not to index every column in the db).
Also, there are some situations where indexes will always be slower.
run:
SET SHOWPLAN_ALL ON
and then run your query with and without the indexes; this will let you see which indexes, if any, are being used, where the "work" is happening, etc.
No. SQL Server analyzes both the indexes and the statistics before deciding to use an index to speed up a query. It is entirely possible for a non-indexed version to run faster than an indexed one.
A few things to try
ensure the indexes are created and rebuilt, and re-organized (defragmented).
ensure that auto create statistics is turned on (a quick check for this follows the list).
Try using SQL Profiler to capture a workload trace and then use the Database Engine Tuning Advisor to create your indexes.
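For the statistics settings, a quick check and fix (DBX being the database from the script above):

-- Check the current settings
SELECT name, is_auto_create_stats_on, is_auto_update_stats_on
FROM sys.databases
WHERE name = 'DBX';

-- Turn them on if needed
ALTER DATABASE DBX SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE DBX SET AUTO_UPDATE_STATISTICS ON;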
Surprisingly, the MS Press exam book for SQL administration explains indexes and statistics pretty well.
See Chapter 4 table of contents in this amazon reader preview of the book
Amazon Reader of Sql 2008 MCTS Exam Book
To me it sounds like your SQL is written very poorly and thus isn't utilizing the indexes that you are creating.
You can add indexes till you're blue in the face, but if your queries aren't written to use those indexes, you won't get any performance gain.
Give us a sample of the queries you're using.
Alright...
Try this and see if you get any performance gains (with the PK indexes):
SELECT i.*
FROM Identifier i
INNER JOIN Car c
    ON i.CarID = c.CarID
WHERE c.ManufactId = 14 AND i.DataTypeId = 1

Cost of Inserts vs Update in SQL Server

I have a table with more than a million rows. This table is used to index TIFF images. Each image has fields like date, number, etc. I have users who index these images in batches of 500. I need to know if it is better to first insert 500 rows and then perform 500 updates or, when the user finishes indexing, to do the 500 inserts with all the data. A very important thing is that if I do the 500 inserts up front, that time is free for me because I can do it the night before.
So the question is: is it better to do inserts only, or inserts and then updates, and why? I have defined an id value for each image, and I also have other indexes on the fields.
Updates in SQL Server result in ghosted rows - i.e. SQL Server crosses one row out and puts a new one in. The crossed-out row is deleted later.
Both inserts and updates can cause page splits in this way; they both effectively 'add' data, it's just that updates flag the old stuff out first.
On top of this, updates need to look up the row first, which for lots of data can take longer than the update itself.
Inserts will just about always be quicker, especially if they are either in order or if the underlying table doesn't have a clustered index.
When inserting larger amounts of data into a table, look at the current indexes - they can take a while to change and build. Adding values in the middle of an index is always slower.
You can think of it like appending to an address book: Mr Z can just be added to the last page, while you'll have to find space in the middle for Mr M.
Doing the inserts first and then the updates does seem to be a better idea, for several reasons. You will be inserting at a time of low transaction volume; since the inserts carry most of the data, that is the better time to do them.
Since you are using an id value (which is presumably indexed) for updates, the overhead of updates will be very low. You would also have less data during your updates.
You could also skip wrapping the whole batch (500 inserts/updates) in one transaction and instead use a transaction per individual record, reducing some overhead.
Finally, test this out to see the actual performance on your server before making a final decision.
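As a rough sketch of the two-phase approach being discussed (all table and column names here are invented for illustration):

-- Night before: insert the batch of 500 skeleton rows in one set-based statement
INSERT INTO ImageIndex (ImageId, BatchId)
SELECT ImageId, @BatchId
FROM StagedImages
WHERE BatchId = @BatchId;

-- Next day, as the user indexes each image: a narrow update keyed on the id
UPDATE ImageIndex
SET DocDate   = @DocDate,
    DocNumber = @DocNumber
WHERE ImageId = @ImageId;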
This isn't a cut-and-dried question. Krishna's and Galegian's points are spot on.
For updates, the impact will be lessened if the updates affect fixed-length fields. If updating varchar or blob fields, you may incur the cost of page splits during the update when the new value is longer than the old value.
I think inserts will run faster. They do not require a lookup (when you do an update you are basically doing the equivalent of a select with the where clause). And also, an insert won't lock the rows the way an update will, so it won't interfere with any selects that are happening against the table at the same time.
The execution plan for each query will tell you which one should be more expensive. The real limiting factor will be the writes to disk, so you may need to run some tests while running perfmon to see which query causes more writes and causes the disk queue to get the longest (longer is bad).
I'm not a database guy, but I imagine doing the inserts in one shot would be faster because the updates require a lookup whereas the inserts do not.
