Very Slow SQL Server Query - sql-server

I have 2 tables that resulted from merging the following tables:
Authors
-Aid bigint
-Surname nvarchar(500)
-Email nvarchar(500)
Articles
-ArId varchar(50)
-Year int
-……Some other fields……
ArticleAuthors
-ArId varchar(50)
-Aid bigint
Classifications
-ClassNumber int
-ClassDescription nvarchar(100)
ClassArticles
-ArId varchar(50)
-ClassNumber int
After denormalizing these tables, the resulting tables were:
Articles
-FieldId int
-ArId varchar(50)
-ClassNumber int (Foreign key from the Classifications table)
-Year int
Authors
-FieldId int
-ArId varchar(50) (Foreign key from the Articles table)
-Aid bigint
-Surname nvarchar(500)
-Email nvarchar(500)
-Year int
Here are the conditions of the data within the resulted tables:
SQL Server 2008 database
The relationships between the two tables are applied physically
The Authors table has 50 million records
The Articles table has 20 million records
An author may have written many articles during the same year with different emails
There are authors in the authors table with ArIds that don’t reference ArIds in the Articles table (Orphan records)
The values within the Year fields range from 2002 to 2009
The Articles table has a unique clustered index on the [FieldId, Year] fields, and this index is created on 9 partitions (1 partition per year)
The Authors table has a non-unique clustered index on the [Year, ArId, Aid] fields, and this index is created on the same 9 partitions as the Articles table (1 partition per year)
The question is:
We need to create a stored procedure that gets the following result from the two tables [Aid,Surname,Email] under the following conditions:
Authors that have written articles during and after a specific year (AND)
The total number of articles for the author is greater than a specific number (AND)
The count of the articles written by the author under a specific ClassNumber is greater than a specific percentage of the total number of his articles (AND)
Get only the most recent email of the author (in the last year during which he has written an article)
If the author has more than one email in the same year, get them all.
We need the query to take the least possible time
If anyone can help, Thank you very much.

Without having the data this is very difficult to work on, but I created the tables and duplicated the procedure to get a rough idea on the query plan and potential problems.
The first noticeable thing: the part of the query written as
SELECT DISTINCT Aid
FROM Authors EAE
WHERE EAE.[Year] >= @year AND EAE.Email IS NOT NULL AND EAE.Email != ' '
is going to table scan. You have Year as your partitioning key, but within each partition there is no index supporting the email clauses in the query. As a side note, the EAE.Email != ' ' might not give you quite what you expect:
IF ' ' != '' PRINT 'true' ELSE PRINT 'false'
That will print false on most systems (based on ANSI padding, which ignores trailing spaces when comparing strings).
FROM Articles ED
INNER JOIN Authors EAD ON EAD.ArId = ED.ArId
WHERE EAD.Aid = [YearAuthors].Aid AND ED.ClassNumber = @classNumber
ED.ClassNumber will have no supporting index, causing a clustered index scan.
In the final select statement :
INNER JOIN Authors EA ON EA.Aid = #TT.Aid
This has no supporting index on the #TT side, and there doesn't appear to be one on the Authors table side.
WHERE EA.Email IS NOT NULL AND EA.Email != ' '
This has no supporting index, causing a scan.
There are a lot more issues in there, with a considerable number of sorts appearing that will probably disappear with suitable indexes. You will have to sort out some of the basic indexing on the tables, then get a new query plan / set of problems, and iteratively fix the plan - you will not fix it in a single 'silver bullet' shot.
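As a rough sketch of the kind of supporting indexes the plan is asking for (index names and INCLUDE column lists are my assumptions; verify each one against your own query plans before creating it):
-- Hypothetical supporting indexes; adjust names and INCLUDE lists to your plans.
CREATE NONCLUSTERED INDEX IX_Authors_Aid
    ON Authors (Aid)
    INCLUDE (Email, Surname, [Year]);          -- supports the lookups/joins on Aid
CREATE NONCLUSTERED INDEX IX_Articles_ClassNumber
    ON Articles (ClassNumber)
    INCLUDE (ArId);                            -- supports the ClassNumber filter
CREATE NONCLUSTERED INDEX IX_Authors_ArId
    ON Authors (ArId)
    INCLUDE (Aid);                             -- supports the join on ArId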

Do you want help writing the query, or help in fixing the performance? The query itself should be relatively simple. That's not where you're going to get the most bang for your buck.
SQL Server comes with tools for analyzing queries and boosting performance by tuning your indexes. That's where you're going to see the biggest help in getting it to run quickly.

The first step would be appropriate indexes: the WHERE criteria are the primary contenders, and items not used in the WHERE but selected can simply be included in the index. As mentioned, there are standard tools and queries to find these.
To focus on the query in hand, run it with "Query | Include Execution Plan" (Ctrl + M) turned on. This should show up any obvious bottlenecks.
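It can also help to capture I/O and timing statistics for each tuning iteration; a minimal example (the parameter values here are just placeholders):
SET STATISTICS IO ON;    -- logical/physical reads per table
SET STATISTICS TIME ON;  -- parse/compile and execution times
EXEC dbo.GetAuthorForMailing
     @classNumber = 5, @noPapers = 10, @year = 2005, @percent = 50;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;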

Given the aforementioned conditions, this is the query I have created, but it takes 3 minutes (far too long for a web page response):
CREATE PROC [dbo].[GetAuthorForMailing]
(
    @classNumber int,
    @noPapers int,
    @year int,
    @percent int
)
AS
BEGIN
    CREATE TABLE #TT
    (
        Aid bigint,
        allPapers int,
        classPapers int,
        perc AS CEILING(CAST(classPapers AS DECIMAL) / CAST(allPapers AS DECIMAL) * 100)
    )
    INSERT INTO #TT (Aid, allPapers, classPapers)
    SELECT [YearAuthors].Aid,
        (
            SELECT COUNT(EA.Aid)
            FROM Authors EA
            WHERE EA.Aid = [YearAuthors].Aid
        ) AS [AllPapers],
        (
            SELECT COUNT(*)
            FROM Articles ED
            INNER JOIN Authors EAD ON EAD.ArId = ED.ArId
            WHERE EAD.Aid = [YearAuthors].Aid AND ED.ClassNumber = @classNumber
        ) AS [ClassPapers]
    FROM
    (
        SELECT DISTINCT Aid
        FROM Authors EAE
        WHERE EAE.[Year] >= @year AND EAE.Email IS NOT NULL AND EAE.Email != ' '
    ) AS [YearAuthors]
    SELECT DISTINCT EA.Aid, EA.Surname, EA.Email, [Year]
    FROM #TT
    INNER JOIN Authors EA ON EA.Aid = #TT.Aid
        AND allPapers > @noPapers
        AND perc > @percent
        AND EA.[Year] = (SELECT MAX([Year]) FROM Authors WHERE Aid = EA.Aid)
    WHERE EA.Email IS NOT NULL AND EA.Email != ' '
    DROP TABLE #TT
END
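For comparison, here is a set-based sketch (my assumption of one possible rewrite, not a tested drop-in replacement) that computes both counts in a single pass over the Authors/Articles join using the same parameters; note that, unlike the original, it counts papers through the join, so orphan Author rows are excluded:
;WITH AuthorCounts AS
(
    SELECT EA.Aid,
           COUNT(*) AS allPapers,
           SUM(CASE WHEN ED.ClassNumber = @classNumber THEN 1 ELSE 0 END) AS classPapers,
           MAX(EA.[Year]) AS lastYear,
           MAX(CASE WHEN EA.[Year] >= @year AND EA.Email IS NOT NULL AND EA.Email != ' '
                    THEN 1 ELSE 0 END) AS wroteSinceYear
    FROM Authors EA
    INNER JOIN Articles ED ON ED.ArId = EA.ArId
    GROUP BY EA.Aid
)
SELECT DISTINCT EA.Aid, EA.Surname, EA.Email
FROM AuthorCounts AC
INNER JOIN Authors EA ON EA.Aid = AC.Aid AND EA.[Year] = AC.lastYear
WHERE AC.wroteSinceYear = 1
  AND AC.allPapers > @noPapers
  AND AC.classPapers * 100.0 / AC.allPapers > @percent
  AND EA.Email IS NOT NULL AND EA.Email != ' '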

Related

Cursor based approach vs set based approach

I need to optimize a slow-running stored procedure by converting its cursor-based approach to a set-based approach.
In principle, I have to compare records from a transient table (up to 300 records) against records from a "master table" (approx. half a million records and steadily growing). The matching is to be performed by comparing 20 varchar(11) columns of the two records. If at least 6 of these columns between the two records match (i.e. same data) that is considered to be a "sufficient match" and a record is to be created into a match table storing the IDs of the transient record and the master record, the total number of matches and the total number of mismatches.
Note that the number of mismatches is not equal to the balance of 20 minus the number-of-matches. That's because if any of the columns in either of the two records contains a null, it is not counted as a match nor a mismatch; it is simply ignored. Thus the need to capture the two counts (business requirement).
The current implementation uses an outer FAST_FORWARD cursor for the master table and an inner FAST_FORWARD cursor for the transient table. Within the inner cursor it has the following simple comparison logic applied to the 20 columns:
IF #newResults.data1 IS NOT NULL AND #results.data1 IS NOT NULL
BEGIN
IF #newResults.data1 = #results.data1
SET @matchCount = @matchCount + 1
ELSE
SET @mismatchCount = @mismatchCount + 1
END
Then, if the total number of matching columns (i.e. @matchCount) is >= 6, a "match record" is written to a "match table" capturing the primary keys of the two records and the number of matches and mismatches.
What I'm hoping to achieve: rather than looping through the two nested cursors and process one record at a time, use a set-based implementation to process the above. One simple solution, I could think of, would be to do an:
INSERT INTO MatchingResults (ResultID, NewResultID, matchCount, mismatchCount)
SELECT (...) WHERE (...)
...and put the whole matching enchilada in the SELECT statement. But, this is the difficult part... Would anyone be able to give me some pointers here? Or suggest a better performing solution? Many thanks!
Updated with table structures:
--
-- Transient table:
--
create table NewResults
(
NewResultID int identity(1,1),
Data1 varchar(11),
Data2 varchar(11),
...
Data20 varchar(11),
SampleDate datetime
)
--
-- Master table:
--
create table Results
(
ResultID int identity(1,1),
Data1 varchar(11),
Data2 varchar(11),
...
Data20 varchar(11),
SampleDate datetime
)
--
-- Match table:
--
create table MatchingResults
(
ResultID int,
NewResultID int,
MatchCount int,
MismatchCount int
)
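Using the structures above, a minimal set-based sketch of the comparison (the CROSS APPLY unpivot pattern and the HAVING threshold are my assumptions about one way to express it; only the first few Data columns are shown):
INSERT INTO MatchingResults (ResultID, NewResultID, MatchCount, MismatchCount)
SELECT r.ResultID,
       n.NewResultID,
       SUM(CASE WHEN v.newData = v.oldData THEN 1 ELSE 0 END) AS MatchCount,
       SUM(CASE WHEN v.newData <> v.oldData THEN 1 ELSE 0 END) AS MismatchCount
FROM NewResults n
CROSS JOIN Results r
CROSS APPLY (VALUES
    (n.Data1, r.Data1),
    (n.Data2, r.Data2),
    (n.Data3, r.Data3)
    -- ...continue the list through (n.Data20, r.Data20)
) AS v (newData, oldData)
WHERE v.newData IS NOT NULL AND v.oldData IS NOT NULL  -- NULLs count as neither match nor mismatch
GROUP BY r.ResultID, n.NewResultID
HAVING SUM(CASE WHEN v.newData = v.oldData THEN 1 ELSE 0 END) >= 6;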

A big 'like' matching query

I've got 2 tables:
[Item] with field [name] nvarchar(255)
[Transaction] with field [short_description] nvarchar(3999)
And I need to do this:
SELECT [Transaction].id, [Item].id
FROM [Transaction] INNER JOIN [Item]
ON [Transaction].[short_description] LIKE ('%' + [Item].[name] + '%')
The above works if limited to a handful of items, but unfiltered is just going over 20 mins and I cancel.
I have a NC index on [name], but I cannot index [short_description] due to its length.
[Transaction] has 320,000 rows
[Items] has 42,000.
That's 13,860,000,000 combinations.
Is there a better way to perform this query ?
I did poke at full-text, but I'm not really that familiar, the answer was not jumping out at me there.
Any advice appreciated !!
Starting a comparison string with a wildcard (% or _) will NEVER use an index, and will typically be disastrous for performance. Your query will need to scan indexes rather than seek through them, so indexing won't help.
Ideally, you should have a third table that would allow a many-to-many relationship between Transaction and Item based on IDs. The design is the issue here.
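A hypothetical sketch of that design (table and column names are illustrative only): populate the link table once with the slow LIKE join, then all later lookups become indexed joins:
-- Hypothetical many-to-many link table between Transaction and Item.
CREATE TABLE TransactionItem
(
    transaction_id int NOT NULL,
    item_id        int NOT NULL,
    CONSTRAINT PK_TransactionItem PRIMARY KEY (transaction_id, item_id)
);
-- One-time (slow) population using the LIKE join; keep it maintained as rows arrive.
INSERT INTO TransactionItem (transaction_id, item_id)
SELECT t.id, i.id
FROM [Transaction] t
INNER JOIN [Item] i
    ON t.short_description LIKE '%' + i.name + '%';
-- After that, lookups are simple indexed joins, e.g. all transactions for one item:
SELECT ti.transaction_id
FROM TransactionItem ti
WHERE ti.item_id = 42;   -- example item id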
After some more sleuthing I have utilized some Fulltext features.
sp_fulltext_keymappings
gives me my transaction table id, along with the FT docID
(I found out that 'doc' = text field)
sys.dm_fts_index_keywords_by_document
gives me FT documentId along with the individual keywords within it
Once I had that, the rest was simple.
Although, I do have to look into the term 'keyword' a bit more... seems that definition can be variable.
This only works because the text I am searching for has no white space.
I believe that you could tweak the FTI configuration to work with other scenarios... but I couldn't promise.
I need to look more into Fulltext.
My current 'beta' code below.
CREATE TABLE #keyMap
(
docid INT PRIMARY KEY ,
[key] varchar(32) NOT NULL
);
DECLARE @db_id int = db_id(N'<database name>');
DECLARE @table_id int = OBJECT_ID(N'Transactions');
INSERT INTO #keyMap
EXEC sp_fulltext_keymappings @table_id;
select km.[key] as transaction_id, i.[id] as item_id
from
sys.dm_fts_index_keywords_by_document ( @db_id, @table_id ) kbd
INNER JOIN
#keyMap km ON km.[docid]=kbd.document_id
inner join [items] i
on kbd.[display_term] = i.name
;
My actual version of the code includes inserting the data into a final table.
Execution time is coming in at 30 seconds, which serves my needs for now.

An Alternative to OFFSET... FETCH NEXT in SQL Server 2008

I have a posts table in SQL Server and I need to select (say) the first 10 rows ordered by the count of their upvotes, here is the DB script:
create database someforum
go
use someforum
go
create table users (
user_id int identity primary key,
username varchar(80) unique not null
);
create table posts (
post_id int identity primary key,
post_time datetime,
post_title nvarchar(32),
post_body nvarchar(255),
post_user int foreign key references users(user_id)
);
create table votes (
vote_id int identity primary key,
user_id int foreign key references users(user_id),
vote_type bit, --upvote=true downvote=false
post_id int foreign key references posts(post_id)
);
insert into users values ('foo'),('bar')
insert into posts values
(getdate(),N'a post by foo',N'hey',1),
(getdate(),N'a post by bar',N'hey!',2)
insert into votes values (1,0,1),(2,0,1),(1,1,2),(2,1,2) --first post downvoted by its poster (foo) and bar, second post was upvoted by both users
I need an efficient query to select the next top 10 rows from Posts based on count of upvotes. How can I achieve this in SQL Server 2008?
Important edit: I stupidly forgot to mention that I was using SQL Server 2008 R2 to which OFFSET... FETCH NEXT wasn't introduced yet. I also edited out what is currently irrelevant to my needs.
Here's what I wanted (without using the score column):
select top 10 p.post_title, sum(case when vote_type=1 then 1 else -1 end) as score
from posts p join votes v on p.post_id = v.post_id
group by p.post_title
order by score desc
And as to the alternative to OFFSET… FETCH NEXT, I found a great solution in DBA
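For reference, the usual SQL Server 2008-compatible paging pattern (a sketch of the general approach, not necessarily the exact linked solution) wraps the scored query in ROW_NUMBER():
DECLARE @pageSize int = 10, @pageNumber int = 2;  -- page 2 = rows 11-20
SELECT post_title, score
FROM (
    SELECT p.post_title,
           SUM(CASE WHEN v.vote_type = 1 THEN 1 ELSE -1 END) AS score,
           ROW_NUMBER() OVER (ORDER BY SUM(CASE WHEN v.vote_type = 1 THEN 1 ELSE -1 END) DESC) AS rn
    FROM posts p
    JOIN votes v ON p.post_id = v.post_id
    GROUP BY p.post_title
) ranked
WHERE rn BETWEEN (@pageNumber - 1) * @pageSize + 1 AND @pageNumber * @pageSize
ORDER BY rn;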
There is no "best"; but a working command might involve select top 10 ... order by Score desc. I realize that your posts tables doesn't have a Score column (that aggregates and denormalizes the votes), but: you can change that
the OFFSET / FETCH clause
You could use a GridView control to display the results: this will allow users to sort on various columns with a minimum of code on your part, and also allows pagination, with numbered links at the bottom of the GridView letting users move through the list of results.
Using a GridView with 10 rows will allow the display of your top 10, and users will also have the option of moving through the rest of the sorted list.
1) You can calculate and filter with this query:
SELECT * FROM (
    SELECT p.post_id, COUNT(*) AS upvotes
    FROM posts AS p
    INNER JOIN votes AS v ON p.post_id = v.post_id
    WHERE v.vote_type = 1
    GROUP BY p.post_id
) AS v_post
ORDER BY upvotes DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY
2) You can then page through posts in steps (currently 10) by changing the clause at the end of the query: OFFSET 10 ROWS, OFFSET 20 ROWS, etc.

Using SQL Server CONTAINS for partial words

We are running many product searches on a huge catalog with partially matched barcodes.
We started with a simple like query
select * from products where barcode like '%2345%'
But that takes way too long since it requires a full table scan.
We thought a full-text search would be able to help us here using CONTAINS.
select * from products where contains(barcode, '2345')
But it seems CONTAINS doesn't support finding words that partially contain a text, only a full word match or a prefix. (In this example, the barcode we're looking for is '123456'.)
My answer is: @DenisReznik was right :)
ok, let's take a look.
I have worked with barcodes and big catalogs for many years and I was curious about this question.
So I have made some tests on my own.
I have created a table to store test data:
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](40) NULL
)
I know that there are many types of barcodes: some contain only numbers, others also contain letters, and others can be even more complex.
Let's assume our barcode is a random string.
I have filled it with 10 million records of random alphanumeric data:
insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode
from FN_NUMBERS(10000000)
FN_NUMBERS() is just a function I use in my DBs (a sort of tally table) to generate rows quickly.
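For completeness, a hypothetical stand-in for it (an assumption on my part, not the author's actual function) could be an inline table-valued function over a system table:
-- Hypothetical replacement for FN_NUMBERS(): returns @n sequential integers in column n.
CREATE FUNCTION dbo.FN_NUMBERS (@n int)
RETURNS TABLE
AS
RETURN
    SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM master..spt_values a
    CROSS JOIN master..spt_values b
    CROSS JOIN master..spt_values c;  -- ~2,500 rows each, enough for tens of millions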
I got 10 million records like that:
N barcode
1 1C333262C2D74E11B688281636FAF0FB
2 3680E11436FC4CBA826E684C0E96E365
3 7763D29BD09F48C58232C7D33551E6C9
Let's declare a var to search for:
declare @s varchar(20) = 'D34F15' -- random alphanumeric string
Let's take a base try with LIKE to compare results to:
select * from like_test where barcode like '%'+@s+'%'
On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.
SSMS suggests adding an index on the barcode column:
CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])
500 MB of index; I retry the select, this time 24.0 secs for the non-clustered index seek... less than 2% better, almost the same result. Very far from the 75% suggested by SSMS. It seems to me this index really isn't worth it. Maybe my Samsung 840 SSD is making the difference...
For the moment I let the index active.
Let's try the CHARINDEX solution:
select * from like_test where charindex(@s, barcode) > 0
This time it took 23.5 seconds to complete, not really so much better than LIKE.
Now let's check @DenisReznik's suggestion that using a binary collation should speed things up.
select * from like_test
where barcode collate Latin1_General_BIN like '%'+@s+'%' collate Latin1_General_BIN
WOW, it seems to work! Only 4.5 secs, this is impressive! 5 times better...
So, what about CHARINDEX and collation together? Let's try it:
select * from like_test
where charindex(@s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0
Unbelievable! 2.4 secs, 10 times better...
Ok, so far I have realized that CHARINDEX is better than LIKE, and that binary collation is better than normal string collation, so from now on I will go on only with CHARINDEX and collation.
Now, can we do anything else to get even better results? Maybe we can try to reduce our very long strings... a scan is always a scan...
First try: a logical string cut using SUBSTRING, to virtually work on barcodes of 8 chars:
select * from like_test
where charindex(
@s collate Latin1_General_BIN,
SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
)>0
Fantastic! 1.8 seconds... I have tried both SUBSTRING(barcode, 1, 8) (head of the string) and SUBSTRING(barcode, 12, 8) (middle of the string) with the same results.
Then I tried to physically reduce the size of the barcode column; almost no difference compared to using SUBSTRING().
Finally I tried to drop the index on the barcode column and repeated ALL the above tests...
I was very surprised to get almost the same results, with very little difference.
The index performs 3-5% better, but at the cost of 500 MB of disk space and a maintenance cost if the catalog is updated.
Naturally, for a direct key lookup like where barcode = @s, with the index it takes 20-50 millisecs; without the index we can't get under 1.1 secs using the collation syntax where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN
This was interesting.
I hope this helps
I often use charindex and just as often have this very debate.
As it turns out, depending on your structure you may actually have a substantial performance boost.
http://cc.davelozinski.com/sql/like-vs-substring-vs-leftright-vs-charindex
The good option here for your case is creating your own FTS-like index. Here is how it could be implemented:
1) Create table Terms:
CREATE TABLE Terms
(
Id int IDENTITY NOT NULL,
Term varchar(21) NOT NULL,
CONSTRAINT PK_TERMS PRIMARY KEY (Term),
CONSTRAINT UK_TERMS_ID UNIQUE (Id)
)
Note: index declaration in the table definition is a feature of 2014. If you have a lower version, just bring it out of CREATE TABLE statement and create separately.
2) Cut barcodes into grams, and save each of them to the Terms table. For example: for barcode = '123456', your table should have 6 rows for it: '123456', '23456', '3456', '456', '56', '6' (a populating sketch follows at the end of this answer).
3) Create table BarcodesIndex:
CREATE TABLE BarcodesIndex
(
TermId int NOT NULL,
BarcodeId int NOT NULL,
CONSTRAINT PK_BARCODESINDEX PRIMARY KEY (TermId, BarcodeId),
CONSTRAINT FK_BARCODESINDEX_TERMID FOREIGN KEY (TermId) REFERENCES Terms (Id),
CONSTRAINT FK_BARCODESINDEX_BARCODEID FOREIGN KEY (BarcodeId) REFERENCES Barcodes (Id)
)
4) Save a pair (TermId, BarcodeId) for the barcode into the BarcodesIndex table. TermId was generated in the second step or already exists in the Terms table. BarcodeId is the identifier of the barcode, stored in the Barcodes table (or whatever name you use for it). For each of the barcodes, there should be 6 rows in the BarcodesIndex table.
5) Select barcodes by their parts using the following query:
SELECT b.* FROM Terms t
INNER JOIN BarcodesIndex bi
ON t.Id = bi.TermId
INNER JOIN Barcodes b
ON bi.BarcodeId = b.Id
WHERE t.Term LIKE 'SomeBarcodePart%'
This solution forces all similar parts of barcodes to be stored nearby, so SQL Server will use an Index Range Scan strategy to fetch data from the Terms table. Terms in the Terms table should be unique to make this table as small as possible. This could be done in the application logic: check existence -> insert new if a term doesn't exist. Or by setting the IGNORE_DUP_KEY option on the clustered index of the Terms table. The BarcodesIndex table is used to reference Terms and Barcodes.
Please note that foreign keys and constraints in this solution are the points of consideration. Personally, I prefer to have foreign keys, until they hurt me.
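As referenced in step 2 above, here is a sketch of how the grams for a single barcode might be generated and saved (my assumption of one way to do it, using master..spt_values as a numbers source; the variable values are just examples):
DECLARE @barcode varchar(20) = '123456', @barcodeId int = 1;  -- example values
-- Insert every suffix ("gram") of the barcode that is not already in Terms.
INSERT INTO Terms (Term)
SELECT DISTINCT SUBSTRING(@barcode, v.number, LEN(@barcode))
FROM master..spt_values v
WHERE v.type = 'P'
  AND v.number BETWEEN 1 AND LEN(@barcode)
  AND NOT EXISTS (SELECT 1 FROM Terms t
                  WHERE t.Term = SUBSTRING(@barcode, v.number, LEN(@barcode)));
-- Link each gram to the barcode.
INSERT INTO BarcodesIndex (TermId, BarcodeId)
SELECT t.Id, @barcodeId
FROM Terms t
WHERE t.Term IN (SELECT SUBSTRING(@barcode, v.number, LEN(@barcode))
                 FROM master..spt_values v
                 WHERE v.type = 'P' AND v.number BETWEEN 1 AND LEN(@barcode));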
After further testing and reading and talking with @DenisReznik, I think the best option could be to add virtual (computed) columns to the barcode table to split the barcode.
We only need columns for start positions from the 2nd to the 4th, because for the 1st we will use the original barcode column, and the last I think is not useful at all (what kind of partial match is 1 char out of 6, when 60% of records will match?):
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](6) NOT NULL,
[BC2] AS (substring([BARCODE],(2),(5))),
[BC3] AS (substring([BARCODE],(3),(4))),
[BC4] AS (substring([BARCODE],(4),(3))),
[BC5] AS (substring([BARCODE],(5),(2)))
)
and then add indexes on these virtual columns:
CREATE NONCLUSTERED INDEX [IX_BC2] ON [like_test2] ([BC2]);
CREATE NONCLUSTERED INDEX [IX_BC3] ON [like_test2] ([BC3]);
CREATE NONCLUSTERED INDEX [IX_BC4] ON [like_test2] ([BC4]);
CREATE NONCLUSTERED INDEX [IX_BC5] ON [like_test2] ([BC5]);
CREATE NONCLUSTERED INDEX [IX_BC6] ON [like_test2] ([barcode]);
now we can simply find partial matches with this query
declare @s varchar(40)
declare @l int
set @s = '654'
set @l = LEN(@s)
select N from like_test
where 1=0
OR ((barcode = @s) and (@l=6)) -- to match full code (rem if not needed)
OR ((barcode like @s+'%') and (@l<6)) -- to match strings up to 5 chars from beginning
or ((BC2 like @s+'%') and (@l<6)) -- to match strings up to 5 chars from 2nd position
or ((BC3 like @s+'%') and (@l<5)) -- to match strings up to 4 chars from 3rd position
or ((BC4 like @s+'%') and (@l<4)) -- to match strings up to 3 chars from 4th position
or ((BC5 like @s+'%') and (@l<3)) -- to match strings up to 2 chars from 5th position
this is HELL fast!
for search strings of 6 chars 15-20 milliseconds (full code)
for search strings of 5 chars 25 milliseconds (20-80)
for search strings of 4 chars 50 milliseconds (40-130)
for search strings of 3 chars 65 milliseconds (50-150)
for search strings of 2 chars 200 milliseconds (190-260)
There will be no additional space used for the table, but each index will take up to 200 MB (for 1 million barcodes).
PAY ATTENTION
Tested on Microsoft SQL Server Express (64-bit) and Microsoft SQL Server Enterprise (64-bit); the optimizer of the latter is slightly better, but the main difference is that:
On the Express edition you have to extract ONLY the primary key when searching your string; if you add other columns to the SELECT, the optimizer will not use the indexes anymore but will go for a full clustered index scan, so you will need something like
;with
k as (-- extract only primary key
select N from like_test
where 1=0
OR ((barcode = @s) and (@l=6))
OR ((barcode like @s+'%') and (@l<6))
or ((BC2 like @s+'%') and (@l<6))
or ((BC3 like @s+'%') and (@l<5))
or ((BC4 like @s+'%') and (@l<4))
or ((BC5 like @s+'%') and (@l<3))
)
select N
from like_test t
where exists (select 1 from k where k.n = t.n)
on standard (enterprise) edition you HAVE to go for
select * from like_test -- take a look at the star
where 1=0
OR ((barcode = @s) and (@l=6))
OR ((barcode like @s+'%') and (@l<6))
or ((BC2 like @s+'%') and (@l<6))
or ((BC3 like @s+'%') and (@l<5))
or ((BC4 like @s+'%') and (@l<4))
or ((BC5 like @s+'%') and (@l<3))
You do not include many constraints, which means you want to search for a string within a string -- and if there were a way to optimize an index for searching a string within a string, it would just be built in!
Other things that make it hard to give a specific answer:
It's not clear what "huge" and "too long" mean.
It's not clear how your application works. Are you searching in batch as you add 1,000 new products? Are you allowing a user to enter a partial barcode in a search box?
I can make some suggestions that may or may not be helpful in your case.
Speed up some of the queries
I have a database with lots of licence plates; sometimes an officer wants to search by the last 3 characters of the plate. To support this I store the license plate in reverse, then use LIKE ('ZYX%') to match ABCXYZ (as sketched below). When doing the search, they have the option of a 'contains' search (like you have), which is slow, or a 'begins/ends with' search, which is super fast because of the index. This would solve your problem some of the time (which may be good enough), especially if this is a common need.
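A sketch of that reversed-column trick applied to the barcode case (column and index names here are my assumptions):
-- Persisted computed column holding the reversed barcode, plus an index on it.
ALTER TABLE products ADD barcode_rev AS REVERSE(barcode) PERSISTED;
CREATE NONCLUSTERED INDEX IX_products_barcode_rev ON products (barcode_rev);
-- "Ends with '2345'" becomes an index-friendly prefix search on the reversed value.
DECLARE @suffix varchar(40) = '2345';
SELECT *
FROM products
WHERE barcode_rev LIKE REVERSE(@suffix) + '%';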
Parallel Queries
An index works because it organizes data; an index cannot help with a string within a string because there is no organization. Speed seems to be your focus of optimization, so you could store/query your data in a way that searches in parallel. Example: if it takes 10 seconds to sequentially search 10 million rows, then having 10 parallel processes (so each process searches 1 million) will take you from 10 seconds to 1 second (kind'a-sort'a). Think of it as scaling out. There are various options for this, within your single SQL instance (try data partitioning) or across multiple SQL Servers (if that's an option).
BONUS: If you're not on a RAID setup, adding one can help with reads, since it effectively reads in parallel.
Reduce a bottleneck
One reason searching "huge" datasets takes "too long" is that all that data needs to be read from disk, which is always slow. You can skip the disk and use in-memory tables. Since "huge" isn't defined, this may not work.
UPDATED:
We know from MSDN (Full-Text Search) that full-text searches can be used for the following:
One or more specific words or phrases (simple term)
A word or a phrase where the words begin with specified text (prefix term)
Inflectional forms of a specific word (generation term)
A word or phrase close to another word or phrase (proximity term)
Synonymous forms of a specific word (thesaurus)
Words or phrases using weighted values (weighted term)
Are any of these fulfilled by your query requirements? If you are having to search for patterns as you described, without a consistent pattern (such as '1%'), then there may not be a way for SQL to use a SARG.
You could use Boolean statements
Coming from a C++ perspective, B-Trees are accessed via pre-order, in-order, and post-order traversals and utilize Boolean comparisons to search the B-Tree. Processed much faster than string comparisons, Booleans offer at the least improved performance.
We can see this in the following two options:
PATINDEX
Only if your column is not numeric, as PATINDEX is designed for strings.
Returns an integer (like CHARINDEX) which is easier to process than strings.
CHARINDEX is a solution
CHARINDEX has no problem searching INTs and again, returns a number.
May require some extra cases built in (i.e. first number is always ignored), but you can add them like so: CHARINDEX('200', barcode) > 1.
As proof of what I am saying, let us go back to the old [AdventureWorks2012].[Production].[TransactionHistory]. We have TransactionID, which contains the numbers we want, and let's for fun assume you want every TransactionID that has 200 at the end.
-- WITH LIKE
SELECT TOP 1000 [TransactionID]
,[ProductID]
,[ReferenceOrderID]
,[ReferenceOrderLineID]
,[TransactionDate]
,[TransactionType]
,[Quantity]
,[ActualCost]
,[ModifiedDate]
FROM [AdventureWorks2012].[Production].[TransactionHistory]
WHERE TransactionID LIKE '%200'
-- WITH CHARINDEX(<delimiter>, <column>) > 3
SELECT TOP 1000 [TransactionID]
,[ProductID]
,[ReferenceOrderID]
,[ReferenceOrderLineID]
,[TransactionDate]
,[TransactionType]
,[Quantity]
,[ActualCost]
,[ModifiedDate]
FROM [AdventureWorks2012].[Production].[TransactionHistory]
WHERE CHARINDEX('200', TransactionID) > 3
Note CHARINDEX removes the value 200200 in the search, so you may need to adjust your code appropriately. But look at the results:
Clearly, booleans and numbers are faster comparisons.
LIKE uses string comparisons, which again is much slower to process.
I was a bit surprised at the size of the difference, but the fundamentals are the same. Integers and Boolean statements are always faster to process than string comparisons.
I'm late to the game, but here's another way to get a full-text-like index, in the spirit of @MtwStark's second answer.
This is a solution using a search table join
drop table if exists #numbers
select top 10000 row_number() over(order by t1.number) as n
into #numbers
from master..spt_values t1
cross join master..spt_values t2
drop table if exists [like_test]
create TABLE [like_test](
[N] INT IDENTITY(1,1) not null,
[barcode] [varchar](40) not null,
constraint pk_liketest primary key ([N])
)
insert into dbo.like_test (barcode)
select top (1000000) replace(convert(varchar(40), NEWID()), '-', '') barcode
from #numbers t,#numbers t2
drop table if exists barcodesearch
select distinct ps.n, trim(substring(ps.barcode,ty.n,100)) as searchstring
into barcodesearch
from like_test ps
inner join #numbers ty on ty.n < 40
where len(ps.barcode) > ty.n
create clustered index idx_barcode_search_index on barcodesearch (searchstring)
The final search should look like this:
declare @s varchar(20) = 'D34F15'
select distinct lt.* from dbo.like_test lt
inner join barcodesearch bs on bs.N = lt.N
where bs.searchstring like @s+'%'
If you have the option of full-text searching, you can speed this up even further by adding the full-text search column directly to the barcode table
drop table if exists #liketestupdates
select n, string_agg(searchstring, ' ')
within group (order by reverse(searchstring)) as searchstring
into #liketestupdates
from barcodesearch
group by n
alter table dbo.like_test add search_column varchar(559)
update lt
set search_column = searchstring
from like_test lt
inner join #liketestupdates lu on lu.n = lt.n
CREATE FULLTEXT CATALOG ftcatalog as default;
create fulltext index on dbo.like_test ( search_column )
key index pk_liketest
The final full-text search would look like this:
declare @s varchar(20) = 'D34F15'
set @s = '"*' + @s + '*"'
select n,barcode from dbo.like_test where contains(search_column, @s)
I understand that Estimated Costs aren't the best measure of expected performance, but the numbers aren't wildly off here.
With the search table join, the Estimated Subtree Cost is 2.13
With the full-text search, the Estimated Subtree Cost is 0.008
Full-text is aimed at bigger texts, let's say texts with more than about 100 chars. You can use LIKE '%string%'. (However, it depends on how the barcode column is defined.) Do you have an index on barcode? If not, then create one and it will improve your query.
First, make an index on the column you use in the WHERE clause.
Secondly, for the data type of the columns used in the WHERE clause, consider char in place of varchar, which will save you some space in the table and in the indexes that include those columns.
A varchar(1) column needs extra bytes over char(1), because varchar stores a length prefix.
Pull only the columns you need; try to avoid *, and be specific about the columns you wish to select.
Don't write:
select * from products
Instead, write:
Select Col1, Col2 from products with (Nolock)

Creating a partitioned view of detail tables when the CHECK is on the header tables

I've been reading documentation and looking at FAQs and haven't found an answer for this one, which probably means it can't be done. My actual situation is a little more complex, but I'll try to simplify it for this question. For each of the past years, I have header/detail tables with a foreign key linking them. The year datum is in the header records! I want to be able to query all tables concatenated across years.
I have set up views that follow a 'SELECT + UNION ALL' format. I've also put check constraints on the header tables to restrict their values to their respective year. This allows the SQL Server query optimizer to only query specific tables when running a query that is restricted with a WHERE clause. Awesome. Up to this point, this information can be found anywhere and everywhere by searching for Partitioned Views.
I want to do the same sort of query optimization with the detail tables but can't figure it out. There is nothing in the detail record that indicates what year it belongs to without joining with the header record; meaning, the foreign key constraint is the only constraint I have to go off of.
The only solution I've thought of is adding a 'year' column to the detail tables and then adding another WHERE subclause to the queries. Is there anything I can do to create a partitioned view of the detail tables using the existing foreign key constraint?
Here is some DDL for reference:
CREATE TABLE header2008 (
hid INT PRIMARY KEY,
dt DATE CHECK ('2008-01-01' <= dt AND dt < '2009-01-01')
)
CREATE TABLE header2009 (
hid INT PRIMARY KEY,
dt DATE CHECK ('2009-01-01' <= dt AND dt < '2010-01-01')
)
CREATE TABLE detail2008 (
did INT PRIMARY KEY,
hid INT FOREIGN KEY REFERENCES header2008(hid),
value INT
)
CREATE TABLE detail2009 (
did INT PRIMARY KEY,
hid INT FOREIGN KEY REFERENCES header2009(hid),
value INT
)
GO
CREATE VIEW headerAll AS
SELECT * FROM header2008 UNION ALL
SELECT * FROM header2009
GO
CREATE VIEW detailAll AS
SELECT * FROM detail2008 UNION ALL
SELECT * FROM detail2009
GO
--This only hits the header2008 table (GOOD)
SELECT *
FROM headerAll h
WHERE dt = '2008-04-04'
--This hits the header2008, detail2008, and detail2009 tables. (BAD)
SELECT *
FROM headerAll h
INNER JOIN detailAll d ON h.hid = d.hid
WHERE dt = '2008-04-04'
Since you're not going for partitioned tables, I'm assuming you can't target Enterprise Edition (2005 or higher).
Here is an alternative to adding a new physical column to your tables:
CREATE VIEW detailAll AS
SELECT 2008 AS Year, * FROM detail2008
UNION ALL
SELECT 2009, * FROM detail2009
then,
SELECT *
FROM headerAll h
INNER JOIN detailAll d ON h.hid = d.hid
WHERE dt = '2008-04-04' AND d.Year = 2008
Before you run off and implement this, there is a catch; well, two catches actually.
This solution, like the headerAll view as it's written, cannot accommodate parameters on the partitioning column and still do partition elimination. Using a search predicate of WHERE dt = @date AND d.Year = YEAR(@date) causes table scans across all tables in both views because the query optimizer assumes @date is an arbitrary value (and there's no way to fix that). This is a recipe for a performance disaster if the view is exposed publicly in your database API: there is no restriction on parameterization in queries, and most query authors and ORMs tend to use parameterized queries wherever possible (it's almost always a good thing!).
To get the views to do partition elimination in a real application, you will have to resort to dynamic string execution. How you accomplish this will depend on your business requirements, data requirements, and application architecture. It will be a bit trickier if you're grabbing data from multiple years.
Note also that using dynamic string execution would allow you to write queries directly against the base tables instead of introducing a UNIONed view for each "table". I don't think there's anything wrong with the latter, but this is an option you may not have considered.
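A minimal sketch of the dynamic string execution idea (the partitioning values are built into the statement text as literals so the optimizer can use the CHECK constraints; the variable names are illustrative):
DECLARE @date date = '2008-04-04';
DECLARE @sql nvarchar(max);
-- Embed the date and year as literals so the other years' tables can be eliminated.
SET @sql = N'SELECT * FROM headerAll h '
         + N'INNER JOIN detailAll d ON h.hid = d.hid '
         + N'WHERE h.dt = ''' + CONVERT(nchar(10), @date, 120) + N''' '
         + N'AND d.[Year] = ' + CAST(YEAR(@date) AS nvarchar(4)) + N';';
EXEC sys.sp_executesql @sql;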
