Character-matching queries in SQL - sql-server

I'm attempting to optimize a T-SQL stored procedure I have. It's for pulling records based on a VIN (a 17-character alphanumeric string); usually people only know a few of the digits—e.g. the first digit could be '1', '2', or 'J'; the second is 'H' but the third could be 'M' or 'G'; and so on.
This leads to a pretty convoluted query whose WHERE clause is something like
WHERE SUBSTRING(VIN,1,1) IN ('J','1','2')
AND SUBSTRING(VIN,2,1) IN ('H')
AND SUBSTRING(VIN,3,1) IN ('M','G')
AND SUBSTRING(VIN,4,1) IN ('E')
AND ... -- and so on for however many digits we need to search on
The table I'm querying on is huge (millions of records) so the queries I'm running that have this kind of WHERE clause can take hours to run if there are more than a couple digits being searched on, even if I'm only requesting the top 3000 records. I feel like there has to be a way to get this substring character matching to run faster. Hours are completely unacceptable; I'd like to have these kinds of queries run in just a few minutes.
I don't have any editing privileges on the database, sadly, so I can't add indexes or anything like that; all I can do is change my stored procedure (although I can try to beg the DBAs to modify the table).

You can use
WHERE VIN LIKE '[J12]H[MG]E%'
At least that should hopefully lead to 3 index seeks on the ranges JH%, 1H%, and 2H% rather than a full scan.
Edit: Testing locally, I found that it does not do multiple index seeks as I had hoped; it converts the above to a single seek on the larger range VIN >= '1' AND VIN < 'K', with a residual predicate to evaluate the LIKE.
I'm not sure whether it will do this for larger tables or not, but otherwise it may well be worth trying to encourage the multi-seek plan with
WHERE (VIN LIKE 'JH%' OR VIN LIKE '1H%' OR VIN LIKE '2H%')
AND VIN LIKE '[J12]H[MG]E%'
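Since the known positions differ from search to search, the pattern itself can be built at runtime. Below is a minimal sketch of that idea; the per-position variables, the dbo.Vehicles table name, and the four-position limit are all assumptions for illustration:
-- Hypothetical candidate characters per VIN position; empty string means "unknown".
DECLARE @pos1 varchar(10) = 'J12',
        @pos2 varchar(10) = 'H',
        @pos3 varchar(10) = 'MG',
        @pos4 varchar(10) = 'E';
-- Wrap multi-candidate positions in [...], pass single characters through,
-- and use the single-character wildcard _ for unknown positions.
DECLARE @pattern varchar(100) =
      CASE WHEN LEN(@pos1) > 1 THEN '[' + @pos1 + ']' ELSE COALESCE(NULLIF(@pos1, ''), '_') END
    + CASE WHEN LEN(@pos2) > 1 THEN '[' + @pos2 + ']' ELSE COALESCE(NULLIF(@pos2, ''), '_') END
    + CASE WHEN LEN(@pos3) > 1 THEN '[' + @pos3 + ']' ELSE COALESCE(NULLIF(@pos3, ''), '_') END
    + CASE WHEN LEN(@pos4) > 1 THEN '[' + @pos4 + ']' ELSE COALESCE(NULLIF(@pos4, ''), '_') END
    + '%';
SELECT TOP (3000) *
FROM dbo.Vehicles        -- hypothetical table name
WHERE VIN LIKE @pattern;
Note that with a variable rather than a literal pattern the optimizer may not produce the same plan, so it is worth comparing the actual plans for both forms.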

You could use the LIKE keyword
SELECT *
FROM Table
WHERE VIN LIKE '[J12]H[MG]E%'
This would even allow you to handle instances where they know the second character is not 'A', by using [^A] in the pattern, such as:
WHERE VIN LIKE '[J12][^A][MG]E%'
Reference
http://msdn.microsoft.com/en-us/library/ms179859.aspx

I like the LIKE answers, but here's another alternative (especially if your input isn't always the same).
I would do this as a series of queries on ever-smaller temp tables (yes, I'm in love with temp tables; sue me).
So I would do something like
SELECT [Fields]
INTO #tempResultsFirstTwoDigits
FROM VIN
WHERE [Clause]
Then keep moving down the chain digit by digit until you've searched each of the provided characters. So you might do this:
IF LEN(@input) > 2
    SELECT [Fields]                     -- [Fields] must include the VIN column
    INTO #tempResultsThreeDigits
    FROM #tempResultsFirstTwoDigits     -- narrow the previous result set, not the base table
    WHERE SUBSTRING(VIN, 3, 1) = SUBSTRING(@input, 3, 1)
    -- NOTE: that WHERE clause might be sped up by initializing a variable at
    -- the beginning of the SP for each character you got.
ELSE
BEGIN
    SELECT * FROM #tempResultsFirstTwoDigits
    GOTO Stop   -- where "Stop" is just a label at the end of the SP, to skip any further checks
END
Again, LIKE might be a better answer for you, but I would try both approaches and benchmark them.
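For the benchmarking itself, one lightweight approach in SSMS (a generic sketch, not specific to this procedure) is to compare elapsed time and I/O for each variant:
SET STATISTICS TIME ON;   -- report CPU and elapsed time per statement
SET STATISTICS IO ON;     -- report logical/physical reads per table

-- ...run variant 1 (the single LIKE '[J12]H[MG]E%' query) here...
-- ...run variant 2 (the temp-table chain) here...

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Compare the figures in the Messages tab, and check the actual execution plans to see whether each variant gets seeks or scans.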


DECIMAL Index on CHAR column

I am dealing with some legacy code that looks like this:
Declare @PolicyId int
;
Select top 1 @PolicyId = policyid from policytab
;
Select col002
From SOMETAB
Where (cast(Col001 as int) = @PolicyId)
;
The code is actually in a loop, but the problem is the same. col001 is a CHAR(10).
Is there a way to specify an index on SOMETAB.Col001 that would speed up this code?
Are there other ways to speed up this code without modifying it?
The context of that question is that I am guessing a simple index on col001 will not speed up the code, because the select statement is doing a cast on the column.
I am looking for a solution that does not involve changing this code because this technique was used on several tables and in several scripts for each table.
Once I determine that it is hopeless to speed up this code without changing it, I have several options. I mention that only so this post can stay on the topic of speeding up the code without changing it.
Shortcut to hopeless: if you cannot change the code, (cast(Col001 as int) = @PolicyId) is not SARGable.
Sargable
SARGable functions in SQL Server - Rob Farley
SARGable expressions and performance - Daniel Hutmachier
Shortcut aside: avoid loops when possible, and keep your search arguments SARGable. Indexed persisted computed columns are an option if you must keep the char column and must compare to an integer.
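A minimal sketch of the computed-column option, assuming every existing Col001 value actually converts to int (the computed column and index names here are made up):
ALTER TABLE SOMETAB
    ADD Col001AsInt AS CAST(Col001 AS int) PERSISTED;

CREATE INDEX IX_SOMETAB_Col001AsInt
    ON SOMETAB (Col001AsInt) INCLUDE (col002);
Because the indexed expression matches the predicate cast(Col001 as int) = @PolicyId exactly, the optimizer can often seek on this index without the legacy code changing at all.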
If you cannot change the table structure, cast your parameter to the data type you are searching on in your select statement.
Cast(@PolicyId as char(10)). This is a code change, and a good place to start looking if you decide to change code based on sqlZim's answer.
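As a sketch of that rewrite (assuming Col001 stores the digits left-aligned, since casting to char(10) pads the value with trailing spaces):
Select col002
From SOMETAB
Where Col001 = Cast(@PolicyId as char(10))  -- compare in the column's own type so an index on Col001 can seek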
Zim's advice is excellent, and searching on int will generally be faster than searching on char. But you may find this method an acceptable alternative to any schema changes.
Is policy stored as an int in PolicyTab and char in SomeTab for certain?

SQL Server Performance tips for like %

I have a table with 5-10 million records which has 2 fields
example data
Row  Field1    Field2
-----------------------
1    0712334   072342344
2    06344534  083453454
3    06344534  0845645565
Given 2 variables
variable1 : 0634453445645
variable2 : 08345345456756
I need to be able to query the table for best matches as fast as possible
The above example would produce 1 record (i.e. row 2).
What would be the fastest way to query the database for matches?
Note: the data and variables are always in this format (i.e. always a number, which may or may not have a leading zero; the fields are not unique, but the combination of both will be).
My initial thought was to do something like this
Select blah where Field1 + "%" like variable1 and Field2 + "%" like variable2
Please forgive my pseudo-code if it's not correct, as this is more a fact-finding mission. However I think I'm in the ball park.
Note : I don't think any indexing can help here, though a memory-based table I'm guessing would speed this up.
Can anyone think of a better way of solving the problem?
You can get a plan with a seek on an index on Field1 with a query like this.
declare @V1 varchar(20) = '0634453445645'
declare @V2 varchar(20) = '08345345456756'

select Field1,
       Field2
from YourTable
where Field1 like left(@V1, 4) + '%'
  and @V1 like Field1 + '%'
  and @V2 like Field2 + '%'
It does a range seek on the first four characters on Field1 and uses the full comparison on Field1 and Field2 in a residual predicate.
There is no performance tip. Simple as that.
'%something%' is a table scan; indexes are not used because of the leading %. Full-text indexing won't work either, as it is not a full word you seek but part of one.
Getting a faster machine to handle the table scans, and denormalizing, is about all you can do. 5-10 million rows should be fast enough on a decent computer. A memory-based table is not needed; just enough RAM to cache that table.
And that pretty much is it. Either find a way to get rid of the leading % or get hardware (mostly memory) fast enough to handle this.
OR: handle it OUTSIDE SQL Server. Load the 5-10 million rows into a search service and use a better data structure. SQL, being generic, has to make compromises. But again, the partial match will kill pretty much any approach.
Postgres has trigram indexes http://www.postgresql.org/docs/current/interactive/pgtrgm.html
Maybe SQL Server has something like that?
What is the shortest length of the values in columns 'Field1' and 'Field2'? Call this number 'N'.
Then create a select statement which asks for every prefix of each variable, from length N up to the variable's full length. Example (say, N = 10):
select distinct *
from myTable
where Field1 in ('0634453445', '06344534456', '063445344564', '0634453445645')
  and Field2 in ('0834534545', '08345345456', '083453454567', '0834534545675', '08345345456756')
Write a small script which creates the query for you; a sketch follows below. Of course there is much more to optimize, but that requires (imho) changes to the structure of your table, and I can imagine that is something you don't want. At least you can give this a fast try.
Also, you should include the query plan when you try this approach in SSMS. The query plan will give you a nice hint on how to organize your indexes.
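A minimal sketch of such a generator for one field in T-SQL (N and the search value are hard-coded here for illustration):
declare @V1 varchar(20) = '0634453445645';  -- the search value
declare @N int = 10;                        -- shortest length found in Field1
declare @list varchar(max) = '';
declare @i int = @N;

-- Collect every prefix of @V1 from length @N up to its full length.
while @i <= len(@V1)
begin
    set @list = @list
              + case when @list = '' then '' else ', ' end
              + quotename(left(@V1, @i), '''');
    set @i += 1;
end

print 'where Field1 in (' + @list + ')';
-- Prints: where Field1 in ('0634453445', '06344534456', '063445344564', '0634453445645')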

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an application error: I select from a table whose ID is generated in sequence, but the ID returned is not the last one in the selection.
Let me explain.
ID    STATUS
_____________
1234  C
1235  C
1236  O
Above are 3 IDs. I had code where these would be the results of a
select @p_id = ID from table where (conditions).
However, there wasn't a clause to check for status = 'O' (open). Remember, Sybase stores the last returned row's value into the variable.
~~~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above as an unordered list, it could appear in ascending order; or, if the database engine needs to change blocks of stored data or some such technical magic, the order could be messed up. The original error was when the procedure would return, say, 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team want to see the exact same procedure breaking (as easy as it would be to fudge the results by plonking in an order by id desc).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built - and can depend on the order of insertion, the type of index, and the content of index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOP n, or similar. The problem comes when someone tries to understand your code, because some databases, when they see multiple hits, take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
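For instance, the original lookup could be made deterministic along these lines (a sketch reusing the thread's placeholder table and column names):
-- Make the "last" row explicit instead of relying on scan order,
-- and add the missing status filter:
select @p_id = max(ID)
from table
where (conditions)
  and status = 'O'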
Is this by any chance related to 1156837? :)

Finding alphabetical position in a large list

I have an AS/400 table containing roughly 1 million rows of full names / company names which I would like to convert to use another datastore while still matching the speed of the original.
Currently, a user enters the search and almost instantaneously gets the alphabetical position of the search term in the table and a page of matches. The user can then paginate either up or down through the records very quickly.
There is almost no updating of the data and approximately 50 inserts per week. I'm thinking that any database can maintain an alphabetical index of the names, but I'm unsure of how to quickly find the position of the search within the dataset. Any suggestions are greatly appreciated.
This sounds just like a regular pagination of results, except that instead of going to a specific page based on a page number or offset being requested, it goes to a specific page based on where the user's search fits in the results alphabetically.
Let's say you want to fetch 10 rows after this position, and 10 rows before.
If the user searches for 'Smith', you could do two selects such that:
SELECT name
FROM companies
WHERE name < 'Smith'
ORDER BY name DESC
LIMIT 10
and then
SELECT name
FROM companies
WHERE name >= 'Smith'
ORDER BY name
LIMIT 10
You could do a UNION to fetch that in one query; the above is just simplified.
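A sketch of the combined form, in a MySQL-style dialect: each half is parenthesized so its ORDER BY and LIMIT apply to that branch only, and UNION ALL is safe here because the two halves cannot overlap.
(SELECT name FROM companies WHERE name < 'Smith' ORDER BY name DESC LIMIT 10)
UNION ALL
(SELECT name FROM companies WHERE name >= 'Smith' ORDER BY name LIMIT 10)
ORDER BY name;  -- re-sort the combined rows into display order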
The term the user searched for would fit half way through these results. If there are any exact matches, then the first exact match will be positioned such that it is eleventh.
Note that if the user searches for 'aaaaaaaa' then they'll probably just get the 10 first results with nothing before it, and for 'zzzzzzzz' they may get just the 10 last results.
I'm assuming that the SQL engine in question allows >= and < comparisons between strings (and can optimise them using indexes), but I haven't tested this; maybe you can't do this. If, like MySQL, it supports internationalized collations, then you could even have the ordering done correctly for non-ASCII characters.
If by "the position of the search" you mean the number of the record if they were enumerated alphabetically, you may want to try something like:
select count(*) from companies where name < 'Smith'
Most databases ought to optimize that reasonably well (but try it--theories you read on the web don't trump empirical data).
Just to add to the ordering suggestions:
Add an index to the name if this is your standard means of data retrieval.
You can paginate efficiently by combining LIMIT and OFFSET.
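For instance, a hypothetical page fetch (the page size and page number are made up):
-- Page 3 of the alphabetical listing, 10 rows per page:
SELECT name
FROM companies
ORDER BY name
LIMIT 10 OFFSET 20;
Combined with the count(*) position query above, you can compute the OFFSET that jumps straight to the page containing the search term.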

What makes a SQL statement sargable?

By definition (at least from what I've seen) sargable means that a query is capable of having the query engine optimize the execution plan that the query uses. I've tried looking up the answers, but there doesn't seem to be a lot on the subject matter. So the question is, what does or doesn't make an SQL query sargable? Any documentation would be greatly appreciated.
For reference: Sargable
The most common thing that will make a query non-sargable is to include a field inside a function in the where clause:
SELECT ... FROM ...
WHERE Year(myDate) = 2008
The SQL optimizer can't use an index on myDate, even if one exists. It will literally have to evaluate this function for every row of the table. Much better to use:
WHERE myDate >= '20080101' AND myDate < '20090101'
Some other examples:
Bad: Select ... WHERE isNull(FullName,'Ed Jones') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName, 1, 4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Don't do this:
WHERE Field LIKE '%blah%'
That causes a table/index scan, because the LIKE value begins with a wildcard character.
Don't do this:
WHERE FUNCTION(Field) = 'BLAH'
That causes a table/index scan.
The database server will have to evaluate FUNCTION() against every row in the table and then compare it to 'BLAH'.
If possible, do it in reverse:
WHERE Field = INVERSE_FUNCTION('BLAH')
This will run INVERSE_FUNCTION() against the parameter once and will still allow use of the index.
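As a concrete instance of that reversal (a sketch with a hypothetical @cutoff parameter):
-- Non-sargable: the function wraps the indexed column, forcing a scan.
-- WHERE DATEADD(day, 1, OrderDate) > @cutoff

-- Sargable: apply the inverse operation to the parameter instead; it is
-- evaluated once, and an index on OrderDate can still be used for a seek.
WHERE OrderDate > DATEADD(day, -1, @cutoff)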
In this answer I assume the database has sufficient covering indexes; there are enough other questions about that topic.
A lot of the time, the sargability of a query is determined by the tipping point of the related indexes. The tipping point defines the difference between seeking and scanning an index while joining one table or result set onto another. A seek is of course much faster than scanning a whole table, but when you have to seek a lot of rows, a scan can make more sense.
So, among other things, a SQL statement is more sargable when the optimizer expects the number of resulting rows from one table to be less than the tipping point of a possible index on the next table.
You can find a detailed post and example here.
For an operation to be considered sargable, it is not sufficient for it to merely be able to use an existing index. In the example above, adding a function call against an indexed column in the where clause would still most likely take some advantage of the defined index: it would "scan", i.e. retrieve, all values from that column (index) and then eliminate the ones that do not match the filter value provided. That is still not efficient enough for tables with a high number of rows.
What really defines sargability is the query's ability to traverse the b-tree index using the binary search method that relies on half-set elimination of the sorted items array. In SQL, it would be displayed in the execution plan as an "index seek".
