Related
In the database on which I am attempting to create a FullText Search I need to construct a table with its column names coming from one column in a previous table. In my current implementation attempt the FullText indexing is completed on the first table Data and the search for the phrase is done there, then the second table with the search results is made.
The schema for the database is
**Players**
Id
PlayerName
Blacklisted
...
**Details**
Id
Name -> FirstName, LastName, Team, Substitute, ...
...
**Data**
Id
DetailId
PlayerId
Content
DetailId in the table Data relates to Id in Details, and PlayerId relates to Id in Players. If there are 1k rows in Players and 20 rows in Details, then there are 20k rows in Data.
WITH RankedPlayers AS
(
SELECT PlayerID, SUM(KT.[RANK]) AS Rnk
FROM Data c
INNER JOIN FREETEXTTABLE(dbo.Data, Content, '"Some phrase like team name and player name"')
AS KT ON c. DataID = KT.[KEY]
GROUP BY c.PlayerID
)
…
Then a table is made by selecting the rows in one column. Similar to a pivot.
…
SELECT rc.Rnk,
c.PlayerID,
PlayerName,
TeamID,
…
(SELECT Content FROM dbo.Data data WHERE DetailID = 1 AND data.PlayerID = c.PlayerID) AS [TeamName],
…
FROM dbo.Players c
JOIN RankedPlayers rc ON c. PlayerID = rc. PlayerID
ORDER BY rc.Rnk DESC
I can return a ranked table with this implementation, the aim however is to be able to produce results from weighted columns, so say the column Playername contributes to the rank more than say TeamName.
I have tried making a schema bound view with a pivot, but then I cannot index it because of the pivot. I have tried making a view of that view, but it seems the metadata is inherited, plus that feels like a clunky method.
I then tried to do it as a straight query using sub queries in the select statement, but cannot due to indexing not liking sub queries.
I then tried to join multiple times, again the index on the view doesn't like self-referencing joins.
How to do this?
I have come across this article http://developmentnow.com/2006/08/07/weighted-columns-in-sql-server-2005-full-text-search/ , and other articles here on weighted columns, however nothing as far as I can find addresses weighting columns when the columns were initially row data.
A simple solution that works really well. Put weight on the rows containing the required IDs in another table, left join that table to the table to which the full text search had been applied, and multiply the rank by the weight. Continue as previously implemented.
In code that comes out as
DECLARE #Weight TABLE
(
DetailID INT,
[Weight] FLOAT
);
INSERT INTO #Weight VALUES
(1, 0.80),
(2, 0.80),
(3, 0.50);
WITH RankedPlayers AS
(
SELECT PlayerID, SUM(KT.[RANK] * ISNULL(cw.[Weight], 0.10)) AS Rnk
FROM Data c
INNER JOIN FREETEXTTABLE(dbo.Data, Content, 'Karl Kognition C404') AS KT ON c.DataID = KT.[KEY]
LEFT JOIN #Weight cw ON c.DetailID = cw.DetailID
GROUP BY c.PlayerID
)
SELECT rc.Rnk,
...
I'm using a temporary table here for evidence of concept. I am considering adding a column Weights to the table Details to avoid an unnecessary table and left join.
We are running many products search on a huge catalog with partially matched barcodes.
We started with a simple like query
select * from products where barcode like '%2345%'
But that takes way too long since it requires a full table scan.
We thought a fulltext search will be able to help us here using contains.
select * from products where contains(barcode, '2345')
But, it seems like contains doesn't support finding words that partially contains a text but, only full a word match or a prefix. (But in this example we're looking for '123456').
My answer is: #DenisReznik was right :)
ok, let's take a look.
I have worked with barcodes and big catalogs for many years and I was curious about this question.
So I have made some tests on my own.
I have created a table to store test data:
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](40) NULL
)
I know that there are many types of barcodes, some contains only numbers, other contains also letters, and other can be even much complex.
Let's assume our barcode is a random string.
I have filled it with 10 millions records of random alfanumeric data:
insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode
from FN_NUMBERS(10000000)
FN_NUMBERS() is just a function I use in my DBs (a sort of tally_table)
to get records quick.
I got 10 million records like that:
N barcode
1 1C333262C2D74E11B688281636FAF0FB
2 3680E11436FC4CBA826E684C0E96E365
3 7763D29BD09F48C58232C7D33551E6C9
Let's declare a var to search for:
declare #s varchar(20) = 'D34F15' -- random alfanumeric string
Let's take a base try with LIKE to compare results to:
select * from like_test where barcode like '%'+#s+'%'
On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.
SSMS suggests to add an index on barcode column:
CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])
500Mb of index, I retry the select, this time 24.0 secs for the non clustered index seek.. less than 2% better, almost the same result. Very far from the 75% supposed by SSMS. It seems to me this index really doesn't worth it. Maybe my SSD Samsung 840 is making the difference..
For the moment I let the index active.
Let's try the CHARINDEX solution:
select * from like_test where charindex(#s, barcode) > 0
This time it took 23.5 second to complete, not really so much better than LIKE.
Now let's check the #DenisReznik 's suggestion that using the Binary Collation should speed up things.
select * from like_test
where barcode collate Latin1_General_BIN like '%'+#s+'%' collate Latin1_General_BIN
WOW, it seems to work! Only 4.5 secs this is impressive! 5 times better..
So, what about CHARINDEX and Collation toghether? Let's try it:
select * from like_test
where charindex(#s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0
Unbelivable! 2.4 secs, 10 times better..
Ok, so far I have realized that CHARINDEX is better than LIKE, and that Binary Collation is better than normal string collation, so from now on I will go on only with CHARINDEX and Collation.
Now, can we do anything else to get even better results? Maybe we can try reduce our very long strings.. a scan is always a scan..
First try, a logical string cut using SUBSTRING to virtually works on barcodes of 8 chars:
select * from like_test
where charindex(
#s collate Latin1_General_BIN,
SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
)>0
Fantastic! 1.8 seconds.. I have tried both SUBSTRING(barcode, 1, 8) (head of the string) and SUBSTRING(barcode, 12, 8) (middle of the string) with same results.
Then I have tried to phisically reduce the size of the barcode column, almost no difference than using SUBSTRING()
Finally I have tried to drop the index on barcode column and repeated ALL above tests...
I was very surprised to get almost same results, with very little differences.
Index performs 3-5% better, but at cost of 500Mb of disk space and and maintenance cost if the catalog is updated.
Naturally, for a direct key lookup like where barcode = #s with the index it takes 20-50 millisecs, without index we can't get less than 1.1 secs using Collation syntax where barcode collate Latin1_General_BIN = #s collate Latin1_General_BIN
This was interesting.
I hope this helps
I often use charindex and just as often have this very debate.
As it turns out, depending on your structure you may actually have a substantial performance boost.
http://cc.davelozinski.com/sql/like-vs-substring-vs-leftright-vs-charindex
The good option here for your case - creating your FTS index. Here is how it could be implemented:
1) Create table Terms:
CREATE TABLE Terms
(
Id int IDENTITY NOT NULL,
Term varchar(21) NOT NULL,
CONSTRAINT PK_TERMS PRIMARY KEY (Term),
CONSTRAINT UK_TERMS_ID UNIQUE (Id)
)
Note: index declaration in the table definition is a feature of 2014. If you have a lower version, just bring it out of CREATE TABLE statement and create separately.
2) Cut barcodes to grams, and save each of them to a table Terms. For example: barcode = '123456', your table should have 6 rows for it: '123456', '23456', '3456', '456', '56', '6'.
3) Create table BarcodeIndex:
CREATE TABLE BarcodesIndex
(
TermId int NOT NULL,
BarcodeId int NOT NULL,
CONSTRAINT PK_BARCODESINDEX PRIMARY KEY (TermId, BarcodeId),
CONSTRAINT FK_BARCODESINDEX_TERMID FOREIGN KEY (TermId) REFERENCES Terms (Id),
CONSTRAINT FK_BARCODESINDEX_BARCODEID FOREIGN KEY (BarcodeId) REFERENCES Barcodes (Id)
)
4) Save a pair (TermId, BarcodeId) for the barcode into the table BarcodeIndex. TermId was generated on the second step or exists in the Terms table. BarcodeId - is an identifier of the barcode, stored in Barcodes (or whatever name you use for it) table. For each of the barcodes, there should be 6 rows in the BarcodeIndex table.
5) Select barcodes by their parts using the following query:
SELECT b.* FROM Terms t
INNER JOIN BarcodesIndex bi
ON t.Id = bi.TermId
INNER JOIN Barcodes b
ON bi.BarcodeId = b.Id
WHERE t.Term LIKE 'SomeBarcodePart%'
This solution force all similar parts of barcodes to be stored nearby, so SQL Server will use Index Range Scan strategy to fetch data from the Terms table. Terms in the Terms table should be unique to make this table as small as possible. This could be done in the application logic: check existence -> insert new if a term doesn't exist. Or by setting option IGNORE_DUP_KEY for clustered index of the Terms table. BarcodesIndex table is used to reference Terms and Barcodes.
Please note that foreign keys and constraints in this solution are the points of consideration. Personally, I prefer to have foreign keys, until they hurt me.
After further testing and reading and talking with #DenisReznik I think the best option could be to add virtual columns to barcode table to split barcode.
We only need columns for start positions from 2nd to 4th because for the 1st we will use original barcode column and the last I think it is not useful at all (what kind of partial match is 1 char on 6 when 60% of records will match?):
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](6) NOT NULL,
[BC2] AS (substring([BARCODE],(2),(5))),
[BC3] AS (substring([BARCODE],(3),(4))),
[BC4] AS (substring([BARCODE],(4),(3))),
[BC5] AS (substring([BARCODE],(5),(2)))
)
and then to add indexes on this virtual columns:
CREATE NONCLUSTERED INDEX [IX_BC2] ON [like_test2] ([BC2]);
CREATE NONCLUSTERED INDEX [IX_BC3] ON [like_test2] ([BC3]);
CREATE NONCLUSTERED INDEX [IX_BC4] ON [like_test2] ([BC4]);
CREATE NONCLUSTERED INDEX [IX_BC5] ON [like_test2] ([BC5]);
CREATE NONCLUSTERED INDEX [IX_BC6] ON [like_test2] ([barcode]);
now we can simply find partial matches with this query
declare #s varchar(40)
declare #l int
set #s = '654'
set #l = LEN(#s)
select N from like_test
where 1=0
OR ((barcode = #s) and (#l=6)) -- to match full code (rem if not needed)
OR ((barcode like #s+'%') and (#l<6)) -- to match strings up to 5 chars from beginning
or ((BC2 like #s+'%') and (#l<6)) -- to match strings up to 5 chars from 2nd position
or ((BC3 like #s+'%') and (#l<5)) -- to match strings up to 4 chars from 3rd position
or ((BC4 like #s+'%') and (#l<4)) -- to match strings up to 3 chars from 4th position
or ((BC5 like #s+'%') and (#l<3)) -- to match strings up to 2 chars from 5th position
this is HELL fast!
for search strings of 6 chars 15-20 milliseconds (full code)
for search strings of 5 chars 25 milliseconds (20-80)
for search strings of 4 chars 50 milliseconds (40-130)
for search strings of 3 chars 65 milliseconds (50-150)
for search strings of 2 chars 200 milliseconds (190-260)
There will be no additional space used for table, but each index will take up to 200Mb (for 1 million barcodes)
PAY ATTENTION
Tested on a Microsoft SQL Server Express (64-bit) and Microsoft SQL Server Enterprise (64-bit) the optimizer of the latter is slight better but the main difference is that:
on express edition you have to extract ONLY the primary key when searching your string, if you add other columns in the SELECT, the optimizer will not use indexes anymore but it will go for full clustered index scan so you will need something like
;with
k as (-- extract only primary key
select N from like_test
where 1=0
OR ((barcode = #s) and (#l=6))
OR ((barcode like #s+'%') and (#l<6))
or ((BC2 like #s+'%') and (#l<6))
or ((BC3 like #s+'%') and (#l<5))
or ((BC4 like #s+'%') and (#l<4))
or ((BC5 like #s+'%') and (#l<3))
)
select N
from like_test t
where exists (select 1 from k where k.n = t.n)
on standard (enterprise) edition you HAVE to go for
select * from like_test -- take a look at the star
where 1=0
OR ((barcode = #s) and (#l=6))
OR ((barcode like #s+'%') and (#l<6))
or ((BC2 like #s+'%') and (#l<6))
or ((BC3 like #s+'%') and (#l<5))
or ((BC4 like #s+'%') and (#l<4))
or ((BC5 like #s+'%') and (#l<3))
You do not include many constraints, which means you want to search for string in a string -- and if there was a way to optimized an index to search a string in a string, it would be just built in!
Other things that make it hard to give a specific answer:
It's not clear what "huge" and "too long" mean.
It's not clear as to how your application works. Are you searching in batch as you add a 1,000 new products? Are you allowing a user to enter a partial barcode in a search box?
I can make some suggestions that may or may not be helpful in your case.
Speed up some of the queries
I have a database with lots of licence plates; sometimes an officer wants to search by the last 3-characters of the plate. To support this I store the license plate in reverse, then use LIKE ('ZYX%') to match ABCXYZ. When doing the search, they have the option of a 'contains' search (like you have) which is slow, or an option of doing 'Begins/Ends with' which is super because of the index. This would solve your problem some of the time (which may be good enough), especially if this is a common need.
Parallel Queries
An index works because it organizes data, an index cannot help with a string within a string because there is no organization. Speed seems to be your focus of optimization, so you could store/query your data in a way that searches in parallel. Example: if it takes 10-seconds to sequentially search 10-million rows, then having 10-parallel processes (so process searches 1-million) will take you from 10-seconds to 1-second (kind'a-sort'a). Think of it as scaling out. There are various options for this, within your single SQL Instance (try data partitioning) or across multiple SQL Servers (if that's an option).
BONUS: If you're not on a RAID setup, that can help with reads since it's a effectively of reading in parallel.
Reduce a bottleneck
One reason searching "huge" datasets take "too long" is because all that data needs to be read from the disk, which is always slow. You can skip-the-disk, and use InMemory Tables. Since "huge" isn't defined, this may not work.
UPDATED:
We know from that FULL-TEXT searches can be used for the following:
Full-Text Search -
MSDN
One or more specific words or phrases (simple term)
A word or a phrase where the words begin with specified text (prefix term)
Inflectional forms of a specific word (generation term)
A word or phrase close to another word or phrase (proximity term)
Synonymous forms of a specific word (thesaurus)
Words or phrases using weighted values (weighted term)
Are any of these fulfilled by your query requirements? If you are having to search for patterns as you described, without an consistent pattern (such as '1%'), then there may not be a way for SQL to use a SARG.
You could use Boolean statements
Coming from a C++ perspective, B-Trees are accessed from Pre-Order, In-Order, and Post-Order traversals and utilize Boolean statements to search the B-Tree. Processed much faster than string comparisons, booleans offer at the least an improved performance.
We can see this in the following two options:
PATINDEX
Only if your column is not numeric, as PATINDEX is designed for strings.
Returns an integer (like CHARINDEX) which is easier to process than strings.
CHARINDEX is a solution
CHARINDEX has no problem searching INTs and again, returns a number.
May require some extra cases built in (i.e. first number is always ignored), but you can add them like so: CHARINDEX('200', barcode) > 1.
Proof of what I am saying, let us go back to the old [AdventureWorks2012].[Production].[TransactionHistory]. We have TransactionID which contains the number of the items we want, and lets for fun assume you want every transactionID that has 200 at the end.
-- WITH LIKE
SELECT TOP 1000 [TransactionID]
,[ProductID]
,[ReferenceOrderID]
,[ReferenceOrderLineID]
,[TransactionDate]
,[TransactionType]
,[Quantity]
,[ActualCost]
,[ModifiedDate]
FROM [AdventureWorks2012].[Production].[TransactionHistory]
WHERE TransactionID LIKE '%200'
-- WITH CHARINDEX(<delimiter>, <column>) > 3
SELECT TOP 1000 [TransactionID]
,[ProductID]
,[ReferenceOrderID]
,[ReferenceOrderLineID]
,[TransactionDate]
,[TransactionType]
,[Quantity]
,[ActualCost]
,[ModifiedDate]
FROM [AdventureWorks2012].[Production].[TransactionHistory]
WHERE CHARINDEX('200', TransactionID) > 3
Note CHARINDEX removes the value 200200 in the search, so you may need to adjust your code appropriately. But look at the results:
Clearly, booleans and numbers are faster comparisons.
LIKE uses string comparisons, which again is much slower to process.
I was a bit surprised at the size of the difference, but the fundamentals are the same. Integers and Boolean statements are always faster to process than string comparisons.
I'm late to the game but here's another way to get a full-text like index in the spirit of #MtwStark's second answer.
This is a solution using a search table join
drop table if exists #numbers
select top 10000 row_number() over(order by t1.number) as n
into #numbers
from master..spt_values t1
cross join master..spt_values t2
drop table if exists [like_test]
create TABLE [like_test](
[N] INT IDENTITY(1,1) not null,
[barcode] [varchar](40) not null,
constraint pk_liketest primary key ([N])
)
insert into dbo.like_test (barcode)
select top (1000000) replace(convert(varchar(40), NEWID()), '-', '') barcode
from #numbers t,#numbers t2
drop table if exists barcodesearch
select distinct ps.n, trim(substring(ps.barcode,ty.n,100)) as searchstring
into barcodesearch
from like_test ps
inner join #numbers ty on ty.n < 40
where len(ps.barcode) > ty.n
create clustered index idx_barcode_search_index on barcodesearch (searchstring)
The final search should look like this:
declare #s varchar(20) = 'D34F15'
select distinct lt.* from dbo.like_test lt
inner join barcodesearch bs on bs.N = lt.N
where bs.searchstring like #s+'%'
If you have the option of full-text searching, you can speed this up even further by adding the full-text search column directly to the barcode table
drop table if exists #liketestupdates
select n, string_agg(searchstring, ' ')
within group (order by reverse(searchstring)) as searchstring
into #liketestupdates
from barcodesearch
group by n
alter table dbo.like_test add search_column varchar(559)
update lt
set search_column = searchstring
from like_test lt
inner join #liketestupdates lu on lu.n = lt.n
CREATE FULLTEXT CATALOG ftcatalog as default;
create fulltext index on dbo.like_test ( search_column )
key index pk_liketest
The final full-text search would look like this:
declare #s varchar(20) = 'D34F15'
set #s = '"*' + #s + '*"'
select n,barcode from dbo.like_test where contains(search_column, #s)
I understand that Estimated Costs aren't the best measure of expected performance but the number's aren't wildly off here.
With the search table join, the Estimated Subtree Cost is 2.13
With the full-text search, the Estimated Subtree Cost is 0.008
Full-text is aimed for bigger texts, let's say texts with more than about 100 chars. You can use LIKE '%string%'. (However it depends how the barcode column is defined.) Do you have an index for barcode? If not, then create one and it will improve your query.
First make the index on column on which you have to put as where clause .
Secondly for the datatype of the column which are used in where clause make them as Char in place of Varchar which will save you some space,in the table and in the indexes that will include that column.
varchar(1) column needs one more byte over char(1)
Do pull only the number of columns you need try to avoid * , be specific to number of columns you wish to select.
Don't write as
select * from products
In place of it write as
Select Col1, Col2 from products with (Nolock)
This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer seems not using the index. Tried using With (index (Ix_Sms_Time)); but to my surprise, it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is same for both queries. Tables mentioned here have 5M to 8M rows (indices are < 1% fragmented and stats are updated). I am using MS SQL Server 2008R2 on a 16core 32GB memory Windows 2008 R2 machine)
Does it help when you force the time filter to kick in only after the client filter has run?
FI like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into equivalent data-type, so that it can use the index.
Declare #test datetime
Set #test='2015-04-10'
Then modify your condition:
[Extent2].Time > #test
The sql server implicitly casts to matching data-type if there is a data-type mismatch. And any function or cast operation prevent using indexes.
I'm on the same track with #JonTirjan, the index with just Time results into a lot of key lookups, so you should try at least following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index, then it's not needed in include clause. If significant part of your data belongs to CompanyID 4563, it might be ok to have it as include column too.
The percentages you see in actual plan are just estimates based on the row count assumptions, so those are sometimes totally wrong. Looking at actual number of rows / executions + statistics IO output should give you idea what's actually happening.
Two things come to mind:
By adding an extra restriction it will be 'harder' for the database to find the first 10 items that match your restrictions. Finding the first 10 rows from let's say 10.000 items (from a total of 1 milion) is a easier then finding the first 10 rows from maybe 100 items (from a total of 1 milion).
The index is not being used probably because the index is created on a datetime column, which is not very efficient if you are also storing the time in them. You might want to create a clustered index on the [time] column (but then you would have to remove the clustered index which is now on the [CompanyId] column or you could create a computed column which stores the date-part of the [time] column, create an index on this computed column and filter on this column.
I found out that there was no index on the foreign key column (SmsId) on the SplittedSms table. I made one and it seems the second query is almost as fast as the first one now.
The execution plan now:
Thanks everyone for the effort.
Good day Guys, Would you help me with my SQL Query. I have on proj in the web which I called INQUIRY good thing is I can store the log file of what data is being search to my project which they enter to my inquiry search box.
This is the table of Keyword have been searched in INQUIRY:
This Code :
Insert into #temptable
Select CaseNo from tblcrew
where Lastname like '%FABIANA%'
and firstname like '%MARLON%'
Insert into #temptable
Select CaseNo from tblcrew
where Lastname like '%DE JOAN%'
and firstname like '%ROLANDO%'
Insert into #temptable
Select CaseNo from tblcrew
where Lastname like '%ROSAS%'
and firstname like '%FRANCASIO%'
I want to repeat my query until all the rows in table of keyword is being search and save the result of each query into a temporary table. Is there a possibility to do that without typing all the value of in the columns of keyword.
Please anyone help me.. thanks!
All you need is join the two tables together without typing any values.
Insert into #temptable
Select c.CaseNo
from tblcrew c
inner join tblKeyword k
on c.Lastname like '%'+k.Lastname+'%'
and c.firstname like '%'+k.firstname +'%'
Usually start with the Adventure Works database for examples like this. I will be talking about exact matches with leverage an index seek, in-exact matches that leverage a index scan, and full text indexing in which you can do a in-exact match resulting in a seek.
The Person.Person table has both last and first name like your example. I keep just the primary key on business id and create one index on (last, first).
--
-- Just PK & One index for test
--
-- Sample database
use [AdventureWorks2012];
go
-- add the index
CREATE NONCLUSTERED INDEX [IX_Person_LastName_FirstName] ON [Person].[Person]
(
[LastName] ASC,
[FirstName] ASC
);
go
Run with wild card for inexact match. Run with just text for exact match. I randomly picked two names from the Person.Person table.
--
-- Run for match type
--
-- Sample database
use [AdventureWorks2012];
go
-- remove temp table
drop table #inquiry;
go
-- A table with first, last name combos to search
create table #inquiry
(
first_name varchar(50),
last_name varchar(50)
);
go
-- Add two person.person names
insert into #inquiry values
('%Cristian%', '%Petculescu%'),
('%John%', '%Kane%');
/*
('Cristian', 'Petculescu'),
('John', 'Kane');
*/
go
-- Show search values
select * from #inquiry;
go
The next step when examining run times is to clear the procedure cache and memory buffers. You do not want existing plans or data skew the numbers.
-- Remove clean buffers & clear plan cache
CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
-- Show time & i/o
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO
The first SQL statement will do a inner join between the temporary search values table and Person.Person.
-- Exact match
select *
from
[Person].[Person] p join #inquiry i
on p.FirstName = i.first_name and p.LastName = i.last_name
The statistics and run times.
Table 'Person'. Scan count 2, logical reads 16, physical reads 8, CPU time = 0 ms, elapsed time = 29 ms.
The resulting query plan does a table scan of the #inquiry table and a index seek of the index on a last and first name. It is a nice simple plan.
Lets retry this with a inexact match using wild cards and the LIKE operator.
-- In-Exact match
select *
from
[Person].[Person] p join #inquiry i
on p.FirstName like i.first_name and p.LastName like i.last_name
The statistics and run times.
Table 'Person'. Scan count 2, logical reads 219, CPU time = 32 ms, elapsed time = 58 ms.
The resulting query plan is a-lot more complicated. We are still doing a table scan of #inquiry since it does not have an index. However, there are a-lot of nested joins going on to used the index with a impartial match.
We added three more operators to the query and the execution time is twice that of the exact match.
In short, if you are doing inexact matches with the LIKE command, they will be more expensive.
If you are searching hundreds of thousands of records, use a FULL TEXT INDEX (FTI). I wrote two articles on this topic.
http://craftydba.com/?p=1421
http://craftydba.com/?p=1629
Every night, you will have to have a process that updates the FTI with any changes. After that one hit, you can use the CONTAINS() operator to leverage the index in fuzzy matches.
I hope I explained the differences. I have seen continued confusion on this topic and I wanted to put something out on Stack Overflow that I could reference.
Best of luck Juan.
I have a T-SQL query that takes data from one table and copies it into a new table but only rows meeting a certain condition:
SELECT VibeFGEvents.*
INTO VibeFGEventsAfterStudyStart
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
ORDER BY VibeFGEvents.id
The code using the table relies on its order, and the copy above does not preserve the order I expected. I.e. the rows in the new table VibeFGEventsAfterStudyStart are not monotonically increasing in the VibeFGEventsAfterStudyStart.id column copied from VibeFGEvents.id.
In T-SQL how might I preserve the ordering of the rows from VibeFGEvents in VibeFGEventsStudyStart?
I know this is a bit old, but I needed to do something similar. I wanted to insert the contents of one table into another, but in a random order. I found that I could do this by using select top n and order by newid(). Without the 'top n', order was not preserved and the second table had rows in the same order as the first. However, with 'top n', the order (random in my case) was preserved. I used a value of 'n' that was greater than the number of rows. So my query was along the lines of:
insert Table2 (T2Col1, T2Col2)
select top 10000 T1Col1, T1Col2
from Table1
order by newid()
What for?
Point is – data in a table is not ordered. In SQL Server the intrinsic storage order of a table is that of the (if defined) clustered index.
The order in which data is inserted is basically "irrelevant". It is forgotten the moment the data is written into the table.
As such, nothing is gained, even if you get this stuff. If you need an order when dealing with data, you HAVE To put an order by clause on the select that gets it. Anything else is random - i.e. the order you et data is not determined and may change.
So it makes no sense to have a specific order on the insert as you try to achieve.
SQL 101: sets have no order.
Just add top to your sql with a number that is greater than the actual number of rows:
SELECT top 25000 *
into spx_copy
from SPX
order by date
I've found a specific scenario where we want the new table to be created with a specific order in the columns' content:
Amount of rows is very big (from 200 to 2000 millions of rows), so we are using SELECT INTO instead of CREATE TABLE + INSERT because needs to be loaded as fast as possible (minimal logging). We have tested using the trace flag 610 for loading an already created empty table with a clustered index but still takes longer than the following approach.
We need the data to be ordered by specific columns for query performances, so we are creating a CLUSTERED INDEX just after the table is loaded. We discarded creating a non-clustered index because it would need another read for the data that's not included in the ordered columns from the index, and we discarded creating a full-covering non-clustered index because it would practically double the amount of space needed to hold the table.
It happens that if you manage to somehow create the table with columns already "ordered", creating the clustered index (with the same order) takes a lot less time than when the data isn't ordered. And sometimes (you will have to test your case), ordering the rows in the SELECT INTO is faster than loading without order and creating the clustered index later.
The problem is that SQL Server 2012+ will ignore the ORDER BY column list when doing INSERT INTO or when doing SELECT INTO. It will consider the ORDER BY columns if you specify an IDENTITY column on the SELECT INTO or if the inserted table has an IDENTITY column, but just to determine the identity values and not the actual storage order in the underlying table. In this case, it's likely that the sort will happen but not guaranteed as it's highly dependent on the execution plan.
A trick we have found is that doing a SELECT INTO with the result of a UNION ALL makes the engine perform a SORT (not always an explicit SORT operator, sometimes a MERGE JOIN CONCATENATION, etc.) if you have an ORDER BY list. This way the select into already creates the new table in the order we are going to create the clustered index later and thus the index takes less time to create.
So you can rewrite this query:
SELECT
FirstColumn = T.FirstColumn,
SecondColumn = T.SecondColumn
INTO
#NewTable
FROM
VeryBigTable AS T
ORDER BY -- ORDER BY is ignored!
FirstColumn,
SecondColumn
to
SELECT
FirstColumn = T.FirstColumn,
SecondColumn = T.SecondColumn
INTO
#NewTable
FROM
VeryBigTable AS T
UNION ALL
-- A "fake" row to be deleted
SELECT
FirstColumn = 0,
SecondColumn = 0
ORDER BY
FirstColumn,
SecondColumn
We have used this trick a few times, but I can't guarantee it will always sort. I'm just posting this as a possible workaround in case someone has a similar scenario.
You cannot do this with ORDER BY but if you create a Clustered Index on VibeFGEvents.id after your SELECT INTO the table will be sorted on disk by VibeFGEvents.id.
I'v made a test on MS SQL 2012, and it clearly shows me, that insert into ... select ... order by makes sense. Here is what I did:
create table tmp1 (id int not null identity, name sysname);
create table tmp2 (id int not null identity, name sysname);
insert into tmp1 (name) values ('Apple');
insert into tmp1 (name) values ('Carrot');
insert into tmp1 (name) values ('Pineapple');
insert into tmp1 (name) values ('Orange');
insert into tmp1 (name) values ('Kiwi');
insert into tmp1 (name) values ('Ananas');
insert into tmp1 (name) values ('Banana');
insert into tmp1 (name) values ('Blackberry');
select * from tmp1 order by id;
And I got this list:
1 Apple
2 Carrot
3 Pineapple
4 Orange
5 Kiwi
6 Ananas
7 Banana
8 Blackberry
No surprises here. Then I made a copy from tmp1 to tmp2 this way:
insert into tmp2 (name)
select name
from tmp1
order by id;
select * from tmp2 order by id;
I got the exact response like before. Apple to Blackberry.
Now reverse the order to test it:
delete from tmp2;
insert into tmp2 (name)
select name
from tmp1
order by id desc;
select * from tmp2 order by id;
9 Blackberry
10 Banana
11 Ananas
12 Kiwi
13 Orange
14 Pineapple
15 Carrot
16 Apple
So the order in tmp2 is reversed too, so order by made sense when there is a identity column in the target table!
The reason why one would desire this (a specific order) is because you cannot define the order in a subquery, so, the idea is that, if you create a table variable, THEN make a query from that table variable, you would think you would retain the order(say, to concatenate rows that must be in order- say for XML or json), but you can't.
So, what do you do?
The answer is to force SQL to order it by using TOP in your select (just pick a number high enough to cover all your rows).
I have run into the same issue and one reason I have needed to preserve the order is when I try to use ROLLUP to get a weighted average based on the raw data and not an average of what is in that column. For instance, say I want to see the average of profit based on number of units sold by four store locations? I can do this very easily by creating the equation Profit / #Units = Avg. Now I include a ROLLUP in my GROUP BY so that I can also see the average across all locations. Now I think to myself, "This is good info but I want to see it in order of Best Average to Worse and keep the Overall at the bottom (or top) of the list)." The ROLLUP will fail you in this so you take a different approach.
Why not create row numbers based on the sequence (order) you need to preserve?
SELECT OrderBy = ROW_NUMBER() OVER(PARTITION BY 'field you want to count' ORDER BY 'field(s) you want to use ORDER BY')
, VibeFGEvents.*
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
Now you can use the OrderBy field from your table to set the order of values. I removed the ORDER BY statement from the query above since it does not affect how the data is loaded to the table.
I found this approach helpful to solve this problem:
WITH ordered as
(
SELECT TOP 1000
[Month]
FROM SourceTable
GROUP BY [Month]
ORDER BY [Month]
)
INSERT INTO DestinationTable (MonthStart)
(
SELECT * from ordered
)
Try using INSERT INTO instead of SELECT INTO
INSERT INTO VibeFGEventsAfterStudyStart
SELECT VibeFGEvents.*
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
ORDER BY VibeFGEvents.id`