In my application I use queries like
SELECT column1, column2
FROM table
WHERE myDate >= date1 AND myDate <= date2
My application fails with a timeout exception. I copy the query and run it in SSMS; the results pane shows about 40 seconds of execution time. Then I remove the WHERE part of the query and run it again. This time the returned rows appear immediately in the results grid, although the query keeps returning more rows (there are 5 million rows in the table).
My question is: how can a WHERE clause affect query performance?
Note: I don't change the CommandTimeout property in the application; it's left at the default.
Without a WHERE clause, SQL Server is told to just start returning rows, so that's what it does, starting from the first row it can find efficiently (which may be the "first" row in the clustered index, or in a covering non-clustered index).
When you limit it with a WHERE clause, SQL Server first has to go find those rows. That's what you're waiting on: because you don't have an index on myDate (or on date1/date2, which I'm not sure are columns or variables), it needs to examine every single row.
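If myDate really is a column, the usual fix is a supporting index. A minimal sketch (the index name is made up, and the INCLUDE list assumes column1 and column2 are the only columns the query returns, making the index covering):

-- Hypothetical index: seeks on the date range, covers the selected columns
CREATE NONCLUSTERED INDEX IX_table_myDate
ON [table] (myDate)
INCLUDE (column1, column2);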
Another way to look at it is to think of a phone book, which is an outdated analogy but gets the job done. Without a WHERE clause, it's like you're asking me to read you off all of the names and numbers in the book. If you add a WHERE clause that is not supported by an index, like read me off the names and numbers of every person with the first name 'John', it's going to take me a lot longer to start returning rows because I can't even start until I find the first John.
Or a slightly different analogy is to think of the index in a book. If you ask me to read off the page numbers for all the terms that are indexed, I can do that from the index, just starting from the beginning and reading through until the end. If you ask me to read off all the page numbers for all the terms that aren't in the index, or a specific unindexed term (like "the"), or even all the page numbers for indexed terms that contain the letter a, I'm going to have a much harder time.
The setup could not be simpler:
H2 version 1.3.176
One table with 10 columns, two of which are fairly long (typical value lengths of 300 and 3,500 characters)
Simple query: select count(*) from requestrepository where request_type = 'ADD'
Index is on the queried column.
Queried column is just varchar(20) (i.e. not one of the longer ones)
Queried column contains just two different values, with one appearing 200k times and the other appearing 12 million times.
DB runs off an SSD, current server hardware, current Java 8 (varied a bit but no change in result)
What I do: (0) run analyze, (1) delete one row by a key field, (2) insert one row for the key just deleted, (3) run the query cited above, count to 10 and repeat.
What I see: The query cited above takes between 3 and 5 seconds each time and explain analyze says:
SELECT
COUNT(*)
FROM PUBLIC.REQUESTREPOSITORY
/* PUBLIC.IX_REQUESTS: REQUEST_TYPE = 'ADD' */
/* scanCount: 12098748 */
WHERE REQUEST_TYPE = 'ADD'
/*
REQUESTREPOSITORY.IX_REQUESTS read: 126700
*/
I tried the same DB on different machines, hardware/linux/ssd, VM/Windows/netapp, but the tendency is always the same: the count(*) takes too(?) long.
And this is what I am not sure about. Is it to be expected that this takes long? I would have thought that at least for the second round, caches are filled and this should be much faster, but the explain analyze always lists 126700 reads.
Any hints about H2 parameters or settings that might improve this are appreciated.
EDIT (not sure if this should rather go as an answer)
Meanwhile we have tried a wide range of things, including MVStore, 1.4.x, parallel threads, machines with different disks, Linux, Windows. The situation is always the same. Take a table of 10 or 12 million rows, a varchar column with three status values (something like PROCESSING, ADD, DELETE), an index on that column, and one status grossly overrepresented: then something like count(*) where colname='ADD' takes between one and many seconds after each update of the table.
To prevent this from becoming a problem, we finally fixed our own code, which had been running three count(*) queries, one per status, instead of a single query with a GROUP BY, and had been running every 5 seconds instead of only on demand. Certainly not the greatest design we had.
The only excuse I have is that I am still surprised that a count(*) takes that long in such a setup. My hunch is that the count must be computed on the index by actually counting after an update, whereas I expected that the count could just be read off a data structure somewhere. (No criticism; I for myself would certainly not be able to implement a DB.)
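As a rough sketch of the single GROUP BY query mentioned above (using the column and status values described in this question):

-- One pass instead of three separate COUNT(*) queries, one per status
SELECT request_type, COUNT(*) AS cnt
FROM requestrepository
GROUP BY request_type;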
Not sure about H2, but have you tried COUNT(request_type) instead of COUNT(*)?
COUNT(*) can take a long time to compute because it has to count every row, which may mean scanning the whole table.
Using COUNT() on a single indexed column can speed things up: no table row needs to be read, as the index is sufficient to decide whether the column's value is NULL (NULLs are excluded from COUNT(column)).
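In the example from the question that would be something like:

-- Counts only non-NULL request_type values; the existing index on request_type may be enough to answer this
SELECT COUNT(request_type)
FROM requestrepository
WHERE request_type = 'ADD';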
I have one simple query which has multiple columns (more than 1000).
When I run it with a single column it returns results in 2 seconds, with a proper index seek, and logical reads, CPU, and everything else within thresholds.
But when I select more than 1,000 columns it takes 11 minutes to return and the plan shows a key lookup.
Have you folks faced this type of issue?
Any suggestion on that issue?
Normally I would suggest adding those columns to the INCLUDE list of your non-clustered index. Adding them to the INCLUDE removes the LOOKUP from the execution plan. But as with everything in SQL Server, it depends: depending on how the table is used, i.e. if you're updating the table more than just plain SELECTing from it, then the LOOKUP might be OK.
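As a rough sketch, with hypothetical table and column names since the question doesn't show the schema:

-- Hypothetical covering index: the search column stays in the key,
-- the extra columns the query selects go into INCLUDE so the key lookup disappears
CREATE NONCLUSTERED INDEX IX_MyTable_Covering
ON dbo.MyTable (SearchColumn)
INCLUDE (ExtraColumn1, ExtraColumn2);

With 1,000+ columns, though, INCLUDE-ing all of them effectively duplicates the table, which is part of the trade-off mentioned above.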
If this query is run once per year, the overhead of an additional index is probably not worth it. If you need a quick response time for that single time of the year when it needs to be run, look into 'pre-executing' it and just presenting the result to the user.
The difference in your query plan might be because of join elimination (if your query contains JOINs with multiple tables), or simply that the additional columns you are requesting are not covered by your existing indexes...
I'm trying to execute a query containing LIKE in a SQL Server 2008 database, but for some reason the query takes forever and times out.
The table contains around 47 million rows with aggregated log data, and I'm trying to find a logentry for a specific machine containing a specific application name. My query looks like this:
SELECT MgmtLogID,MgmtLogSeverity, MgmtLogSource, CAST(MgmtLogText as TEXT) as MgmtLogText, MgmtLogTime, MgmtLogHost
FROM [dbo].[MgmtLog]
--Fixed values
WHERE MgmtLogOrigin = 'EventLog' AND MgmtLogSeverity <= 3
--Values depending on what I'm searching for
AND MgmtLogHost = 'MY MACHINENAME' AND MgmtLogTime > 'MY START TIME' AND MgmtLogText LIKE '% KEYWORD TO SEARCH FOR %'
ORDER BY MgmtLogTime DESC
The query executes in around 1-2 seconds without the LIKE and returns around 10 rows. With the LIKE it should return 2 rows out of those ten, so it shouldn't be that taxing, but it times out. I'm guessing that it has something to do with the properties of MgmtLogText, but I'm not sure what. It is an ntext field that has length 16 and uses the Finnish_Swedish_CI_AS collation.
In the end I need to execute the query from a PHP script, since I need to find log records for an arbitrary number of machines and/or applications.
Depending on the field type of MgmtLogText, the indexes won't be used. Also, as mentioned by other commenters, the LIKE also prevents the index from being used.
Off the top of my head, I wonder if it would work if you use a subquery. The inside query should be the one without the LIKE which only returns 10 results. Then the outside query should be the one that uses the LIKE. That way the LIKE only has to search 10 rows instead of 47 million.
There's probably a more efficient way, but I was thinking something like this:
SELECT MgmtLogID,MgmtLogSeverity, MgmtLogSource, CAST(MgmtLogText as TEXT) as MgmtLogText, MgmtLogTime, MgmtLogHost
FROM [dbo].[MgmtLog]
WHERE MgmtLogID IN (
SELECT MgmtLogID
FROM [dbo].[MgmtLog]
WHERE MgmtLogOrigin = 'EventLog' AND MgmtLogSeverity <= 3
AND MgmtLogHost = 'MY MACHINENAME' AND MgmtLogTime > 'MY START TIME'
)
AND MgmtLogText LIKE '%some value%'
ORDER BY MgmtLogTime DESC
Query clauses like WHERE colname LIKE '%something%' are not able to take advantage of indexes and usually result in a full scan of the candidate rows to ascertain which ones should be delivered.
Although, as ChrisC points out in a comment, it's somewhat surprising that the more efficient clauses aren't applied first to reduce the candidate row set down to a manageable size before trying the LIKE. Perhaps the statistics for the table are not up to date enough for the query analysis to decide this; it's best to run whatever counts as an explain query under SQL Server.
The reason your non-like query is so fast is because it almost certainly has an index on MgmtLogHost and/or MgmtLogTime which can be used to quickly cull unneeded rows.
One way you can fix this is to use something like insert/update triggers to process the MgmtLogText data only when changed, to extract the application names out and put them in a separate table which can be far better optimised.
Even just using such a trigger to keep a lowercased version of the column (in another column) would be an improvement. Using a case-insensitive collation means that selects run slower since they have to allow for XYZZY and xyzzy being classed as equal. If instead you maintain a lower-cased version in the table and ensure the check is done against lower case, that effort disappears as you only have one case to worry about.
And, by doing all this in the trigger, you ensure that it's only done when necessary (when the data is changed), not every time you want to select. This amortises the cost over many selects.
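A minimal sketch of the lower-cased-copy idea, assuming an extra column (here called MgmtLogTextLower) has been added and that MgmtLogID is the key:

-- Hypothetical trigger: keeps the lower-cased copy in sync whenever rows are inserted or updated
CREATE TRIGGER trg_MgmtLog_LowerText ON dbo.MgmtLog
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE m
    SET MgmtLogTextLower = LOWER(CAST(i.MgmtLogText AS nvarchar(max)))
    FROM dbo.MgmtLog m
    JOIN inserted i ON i.MgmtLogID = m.MgmtLogID;
END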
You can also use something like full text indexing if your DBMS supports it but I've often thought that was like trying to kill mosquitos with a thermo-nuclear warhead.
Yes, there are situations where you may need full text indexing but, in the vast majority of cases, you can gain efficiency by being a little more selective.
I'm puzzled by the following. I have a DB with around 10 million rows, and (among other indices) one column (campaignid_int) has an index.
Now I have 700k rows where the campaignid is indeed 3835
For all these rows, the connectionid is the same.
I just want to find out this connectionid.
use messaging_db;
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
Now this query takes approx 30 seconds to perform!
I (with my small DB knowledge) would expect that it would just take any of those rows and return me that connectionid
If I test this same query for a campaign which only has 1 entry, it goes really fast. So the index works.
How would I tackle this and why does this not work?
edit:
estimated execution plan:
select (0%) - top (0%) - clustered index scan (100%)
Due to the statistics, you should explicitly ask the optimizer to use the index you've created instead of the clustered one.
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK, index(idx_connectionid))
WHERE (campaignid_int = 3835)
I hope it will solve the issue.
I recently had the same issue and it's really quite simple to solve (at least in some cases).
If you add an ORDER BY clause on one or more of the indexed columns, it should be solved. That solved it for me, at least.
You aren't specifying an ORDER BY clause in your query, so the optimiser is not being told which sort order it should select the top 1 from. SQL Server won't just take a random row; it will order the rows by something and take the top 1, and it may be choosing to order by something that is sub-optimal. I would suggest that you add an ORDER BY x clause, where x is the clustered key on that table; that will probably be the fastest.
This may not solve your problem -- in fact I'm not sure I expect it to from the statistics you've given -- but (a) it won't hurt, and (b) you'll be able to rule this out as a contributing factor.
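For example, assuming (hypothetically) that the clustered key is a column called id:

SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE campaignid_int = 3835
ORDER BY id  -- hypothetical clustered key column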
If the campaignid_int column is not indexed, add an index to it. That should speed up the query. Right now I presume that you need to do a full table scan to find the matches for campaignid_int = 3835 before the top(1) row is returned (filtering occurs before results are returned).
EDIT: An index is already in place, but since SQL Server does a clustered index scan, the optimizer has ignored the index. This is probably due to (many) duplicate rows with the same campaignid_int value. You should consider indexing differently or query on a different column to get the connectionid you want.
The index may be useless for two reasons:
700k rows out of 10 million may not be selective enough
and/or
connectionid needs to be included so the entire query can be answered using only the index
Otherwise, the optimiser decides it may as well use the PK/clustered index to both filter on campaignid_int and get connectionid, to avoid a bookmark lookup on 700k rows from the current index.
So, I suggest this...
CREATE NONCLUSTERED INDEX IX_Foo ON MyTable (campaignid_int) INCLUDE (connectionid)
This doesn't answer your question, but try using:
SET ROWCOUNT 1
SELECT connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
I've seen top(x) perform very badly in certain situations as well. I'm sure it's doing a full table scan. Perhaps your index on that particular column needs to be rebuilt? The above is worth a try, however.
Your query does not work as you expect, because Sql Server keeps statistics about your index and in this particular case knows that there are a lot of duplicate rows with the identifier 3835, hence it figures that it would make more sense to just do a full index (or table) scan. When you test for an ID which resolves to only one row, it uses the index as expected, i.e. performs an index seek (the execution plan should verify this guess).
Possible solutions? Make the index composite, if you have anything to compose it with; e.g. compose it with the date the message was sent (if I understand your case correctly) and then select the top 1 entry from the list with the specified id, ordered by the date. Though I'm not sure whether this would be better (for one, a composite index takes up more space) - just a guess.
EDIT: I just tried out the suggestion of making the index composite by adding a date column. If you do that and specify order by date in your query, an index seek is performed as expected.
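A sketch of that composite index and query, assuming (hypothetically) a datetime column called sent_date:

CREATE NONCLUSTERED INDEX IX_outgoing_campaign_date
ON outgoing_messages (campaignid_int, sent_date);

SELECT TOP (1) connectionid
FROM outgoing_messages
WHERE campaignid_int = 3835
ORDER BY sent_date DESC;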
but since I'm specifying 'top(1)' it means: give me any row. Why would it first crawl through the 700k rows just to return one? – reinier
Sorry, can't comment yet but the answer here is that SQL server is not going to understand the human equivalent of "Bring me the first one you find" when it hears "Top 1". Instead of the expected "Give me any row" SQL Server goes and fetches the first of all found rows.
The only time it knows which row comes first is after fetching all the matching rows and then discarding the rest. Very thorough, but in your case not really fast.
The main issue, as others said, is your statistics and the selectivity of your index. If you have another unique field in your table (like an identity column), then try a combined index on campaignid_int first, unique column second. As you only query on campaignid_int, it has to be the first part of the key.
Sounds worth a try, as this index should have a higher selectivity, so the optimizer can use it better than doing an index crawl.
I have a bunch (750K) of records in one table that I have to check for existence in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .NET app that loads the source records, then runs through and searches with a LIKE (using a T-SQL function) to determine if there are any matching records. If yes, the source table is updated with a positive; if not, the record is left alone.
The app processes about 1,000 records an hour. When I did this as a cursor sproc on SQL Server, I got pretty much the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE s SET some_field = 1
FROM source_table s
WHERE EXISTS
(
SELECT 1 FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one, you can use a trigger to extract the data into the new column at the time it is inserted.
If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using T-SQL, and I would attempt the regular-expression route. Not knowing how many different phrases there are and the structure of each, I can't say whether the regular-expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW, if you are working with tables that large, I would resolve to never write another cursor. They are extremely bad for performance, especially when you start talking about that size of record set. Learn to think in sets, not record-by-record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query is cancelled) will be huge. This will drastically slow down your query. As such, it is probably better to process the rows in blocks of 1,000 or so.
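A rough sketch of that batching, reusing the hypothetical column names from mbeckish's query and assuming unprocessed rows still have some_field set to NULL:

WHILE 1 = 1
BEGIN
    -- Update at most 1,000 matching rows per pass to keep each transaction (and its log) small
    UPDATE TOP (1000) s
    SET some_field = 1
    FROM source_table s
    WHERE s.some_field IS NULL
      AND EXISTS (SELECT 1
                  FROM target_table t
                  WHERE t.target_join_field LIKE '%' + s.source_join_field + '%');

    IF @@ROWCOUNT = 0 BREAK;
END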
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be just to strip off any text from the last '(' onwards (this assumes all the values follow that pattern). This would simplify the query condition to (b.field LIKE '%' + a.field)
Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the characters of both tables in reverse order. The index on your temporary table would then come into use.
It may seem very wasteful to spend that much time, but in this case a small benefit would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can get the index into play you could reduce this a thousand-fold, thanks to indexes being tree searches rather than just ordered tables...)
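A sketch of that reverse-order trick, with hypothetical table and column names (REVERSE is deterministic, so the computed column can be persisted and indexed):

-- Store a reversed copy so the "ends with" test becomes a leading-prefix LIKE,
-- which the index may then be able to seek on
ALTER TABLE target_temp ADD field_rev AS REVERSE(field) PERSISTED;
CREATE INDEX IX_target_temp_field_rev ON target_temp (field_rev);

SELECT s.source_join_field
FROM source_table s
JOIN target_temp t
    ON t.field_rev LIKE REVERSE(s.source_join_field) + '%';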
Second Idea
Assuming you do copy the target table into a temp table, you may get an extra benefit from processing it in blocks of 1,000 by also deleting the matching records from the target table. (This would only be worthwhile where you delete a meaningful amount from the target table, such that after all 750,000 records have been checked the target table is now, for example, half the size it started at.)
EDIT:
Modified Second Idea
Put the whole target table in to a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes into play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + @source_field + '%'
IF (@@ROWCOUNT = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches and a second to delete the matches).
Try this --
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know whether this is a one-time operation or a task that needs to be run regularly.
Not knowing all the details of your problem, if it were me, as this is a one-time (or infrequent) operation, which it sounds like it is, I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them down to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that match, which I would then load back into SQL Server and use as the basis of an update query (i.e. update the source table where the id is in the set of ids from the temp table).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
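The load-back step might look roughly like this (the file path, staging table, and column names are all hypothetical):

-- Stage the locally computed matches, then apply one set-based update
CREATE TABLE #matched_ids (id int PRIMARY KEY);

BULK INSERT #matched_ids
FROM 'C:\temp\matched_ids.txt'
WITH (ROWTERMINATOR = '\n');

UPDATE s
SET some_field = 1
FROM source_table s
WHERE s.id IN (SELECT id FROM #matched_ids);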
By the sounds of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.
Tell us more about the problem.
Holy smoke, what great responses!
The system is on a disconnected network, so I can't copy and paste, but here's the retype.
Current UDF:
Create function CountInTrim
(@caseNo varchar(255))
returns int
as
Begin
declare @recCount int
select @recCount = count(recId) from targettable where title like '%' + @caseNo + '%'
return @recCount
end
Basically, if there's a record count, then there's a match, and the .NET app updates the record. The cursor-based sproc had the same logic.
Also, this is a one time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, developers of either system are no longer available, and while I have some sql experience, I am by no means an expert.
I parsed the case numbers out of the crazy way the old system stored them to build the source table, and that's the only thing the two systems have in common: the case number format. I COULD attempt to parse out the case number in the new system too, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the CLR if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.