SQL Server: Text Search Pattern for Performance

I have a requirement in which I periodically have to check 40k names against a table of 70k names (on Azure SQL Server).
The table has 2 relevant columns:
FIRSTNAME (nvarchar(15))
LASTNAME (nvarchar(20))
Name matches must be exact first and last name match.
Naively, my first approach would be to run 40k select/where firstname='xxx' and lastname='yyy' queries, but I have to believe there is a more performant way of doing it. On the surface, that sounds like roughly 2.8 billion individual name comparisons (40k names against 70k rows). Obviously, the columns are short enough for me to index them, but surely there is something more I could do?
My first question is, what's the most efficient way to handle a problem like this in SQL Server?
My second question is, does anyone with experience with something like this have an idea of how long checking 40k names against 70k rows would take, even just as an order of magnitude? I.e., am I looking at minutes, hours, or days?
Thanks in advance for any insights.

An index that contains both the FIRSTNAME and LASTNAME columns should be enough; if possible, make it clustered.
CREATE CLUSTERED INDEX [idx_yourTable] ON yourTable (
FIRSTNAME ASC,
LASTNAME ASC
)
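With that index in place, the 40k incoming names can be checked in a single set-based join instead of 40k individual queries. A minimal sketch, assuming the names to check are bulk-loaded into a temp table #NamesToCheck (a hypothetical name):
CREATE TABLE #NamesToCheck (
FIRSTNAME nvarchar(15) NOT NULL,
LASTNAME nvarchar(20) NOT NULL
)
-- ... bulk-load the 40k names into #NamesToCheck here ...
SELECT t.FIRSTNAME, t.LASTNAME
FROM yourTable t
INNER JOIN #NamesToCheck n
ON t.FIRSTNAME = n.FIRSTNAME
AND t.LASTNAME = n.LASTNAME
With the two-column index in place, this is one pass over the smaller list using index seeks (or a merge/hash join), rather than 40k round trips.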
If you are not able to create an index on your table, then you can retrieve all the data into a temp table and create an index on the temp table.
DROP TABLE IF EXISTS #T_Local
DROP TABLE IF EXISTS #T_Azure
SELECT
ID
-- A separator is used to avoid cases like
-- 'FirstName' + 'LastName' = 'FirstNameLast' + 'Name'
,FIRSTNAME + '|' + LASTNAME AS [FULL_NAME]
,FIRSTNAME
,LASTNAME
INTO #T_Local
FROM server1.DB1.dbo.YourTable
SELECT
ID
,FIRSTNAME + '|' + LASTNAME AS [FULL_NAME]
,FIRSTNAME
,LASTNAME
INTO #T_Azure
FROM server2.DB1.dbo.YourTable
CREATE CLUSTERED INDEX [idx_t_local] ON #T_Local (
[FULL_NAME] ASC)
CREATE CLUSTERED INDEX [idx_t_azure] ON #T_Azure (
[FULL_NAME] ASC)
SELECT
tl.ID AS [ID_Local]
,tl.FIRSTNAME AS [FIRSTNAME_Local]
,tl.LASTNAME AS [LASTNAME_Local]
,ta.ID AS [ID_Azure]
,ta.FIRSTNAME AS [FIRSTNAME_Azure]
,ta.LASTNAME AS [LASTNAME_Azure]
FROM #T_Local tl
INNER JOIN #T_Azure ta
ON tl.FULL_NAME = ta.FULL_NAME
Lastly, 40k to 70k records is not enough data to cause performance issues, even without a proper index.

Related

A big 'like' matching query

I've got 2 tables:
[Item] with field [name] nvarchar(255)
[Transaction] with field [short_description] nvarchar(3999)
And I need to do this:
Select [Transaction].id, [Item].id
From [Transaction] inner join [Item]
on [Transaction].[short_description] like ('%' + [Item].[name] + '%')
The above works if limited to a handful of items, but unfiltered it just runs past 20 minutes and I cancel it.
I have a NC index on [name], but I cannot index [short_description] due to its length.
[Transaction] has 320,000 rows
[Items] has 42,000.
That's roughly 13.4 billion combinations.
Is there a better way to perform this query ?
I did poke at full-text search, but I'm not really that familiar with it, and the answer was not jumping out at me there.
Any advice appreciated!
Starting a comparison string with a wildcard (% or _) will NEVER use an index, and will typically be disastrous for performance. Your query will need to scan indexes rather than seek through them, so indexing won't help.
Ideally, you should have a third table that would allow a many-to-many relationship between Transaction and Item based on IDs. The design is the issue here.
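A minimal sketch of such a bridge table, with assumed names:
CREATE TABLE dbo.TransactionItem (
transaction_id int NOT NULL,
item_id int NOT NULL,
CONSTRAINT PK_TransactionItem PRIMARY KEY (transaction_id, item_id)
)
The matching pairs would be maintained as rows in this table, and the lookup becomes a plain join on IDs instead of a wildcard LIKE scan.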
After some more sleuthing I have utilized some Fulltext features.
sp_fulltext_keymappings
gives me my transaction table id, along with the FT docID
(I found out that 'doc' = text field)
sys.dm_fts_index_keywords_by_document
gives me FT documentId along with the individual keywords within it
Once I had that, the rest was simple.
Although, I do have to look into the term 'keyword' a bit more... seems that definition can be variable.
This only works because the text I am searching for has no white space.
I believe that you could tweak the FTI configuration to work with other scenarios... but I couldn't promise.
I need to look more into Fulltext.
My current 'beta' code below.
CREATE TABLE #keyMap
(
docid INT PRIMARY KEY ,
[key] varchar(32) NOT NULL
);
DECLARE @db_id int = DB_ID(N'<database name>');
DECLARE @table_id int = OBJECT_ID(N'Transactions');
INSERT INTO #keyMap
EXEC sp_fulltext_keymappings @table_id;
select km.[key] as transaction_id, i.[id] as item_id
from
sys.dm_fts_index_keywords_by_document ( @db_id, @table_id ) kbd
INNER JOIN
#keyMap km ON km.[docid] = kbd.document_id
inner join [items] i
on kbd.[display_term] = i.name
;
My actual version of the code includes inserting the data into a final table.
Execution time is coming in at 30 seconds, which serves my needs for now.

Why is this query running so slowly?

This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to use it. I tried forcing it with WITH (INDEX (Ix_Sms_Time)), but to my surprise it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are up to date). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run?
For instance, like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
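A sketch of that index, assuming the table and column names above (the index name is made up):
CREATE NONCLUSTERED INDEX IX_Sms_CompanyId_Time
ON dbo.Sms (CompanyId, Time)
INCLUDE (Id) -- omit if Id is already the clustered key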
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into the equivalent data type so that it can use the index.
Declare @test datetime
Set @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to the matching data type when there is a data-type mismatch, and any function or cast operation on the filtered column prevents indexes from being used.
I'm on the same track as @JonTirjan; the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index key, then it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyID 4563, it might be OK to have it as an include column too.
The percentages you see in the actual plan are just estimates based on row-count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows / executions plus the STATISTICS IO output should give you an idea of what's actually happening.
Two things come to mind:
By adding an extra restriction it will be 'harder' for the database to find the first 10 items that match your restrictions. Finding the first 10 rows from, say, 10,000 matching items (out of a total of 1 million) is easier than finding the first 10 rows from maybe 100 matching items (out of a total of 1 million).
The index is probably not being used because it is created on a datetime column, which is not very efficient if you are also storing the time of day in it. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index that is now on the [CompanyId] column), or you could create a computed column which stores the date part of the [Time] column, create an index on that computed column, and filter on it.
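A minimal sketch of the computed-column variant, with assumed column and index names:
ALTER TABLE dbo.Sms ADD TimeDate AS CAST([Time] AS date) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Sms_TimeDate ON dbo.Sms (TimeDate);
-- then filter on the computed column:
-- WHERE TimeDate > '2015-04-10'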
I found out that there was no index on the foreign key column (SmsId) of the SplittedSms table. I added one, and the second query now seems almost as fast as the first.
The execution plan now:
Thanks everyone for the effort.
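For reference, the foreign-key index described above would look something like this (index name assumed):
CREATE NONCLUSTERED INDEX IX_SplittedSms_SmsId ON dbo.SplittedSms (SmsId);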

Improving SQL Server query performance

I have written a query in which I build a string and take distinct values from the table based on conditions.
The table has some 5000 rows, yet this query takes almost 20 seconds to execute.
I believe that the string comparisons have made the query so slow, but I wonder what my alternatives are.
Query:
select distinct
Convert(nvarchar, p1.trafficSerial) +' ('+ p1.sourceTraffic + ' - ' + p1.sinkTraffic + ' )' as traffic
from
portList as p1
inner join
portList as p2 ON p1.pmId = p2.sinkId
AND P1.trafficSerial IS NOT NULL
AND (p1.trafficSerial = p2.trafficSerial)
AND (P1.sourceTraffic = P2.sourceTraffic)
AND (P1.sinkTraffic = P2.sinkTraffic)
where
p1.siteCodeID = @SiteId
One option is to create a computed column and create an index on that.
This article discusses it: http://blog.sqlauthority.com/2010/08/22/sql-server-computed-columns-index-and-performance/
ALTER TABLE dbo.portList ADD
traffic AS Convert(nvarchar,trafficSerial) +' ('+ sourceTraffic + ' - ' + sinkTraffic + ' )' PERSISTED
GO
CREATE NONCLUSTERED INDEX IX_portList_traffic
ON dbo.portList (traffic)
GO
select distinct traffic from dbo.portList
You should also make sure each of the columns in your join conditions has an index on it (see the sketch after this list):
p1.trafficSerial & p2.trafficSerial
P1.sourceTraffic & P2.sourceTraffic
P1.sinkTraffic & P2.sinkTraffic
and the column for your filter: p1.siteCodeID
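A sketch of what those indexes might look like, assuming none of them already exist (index names are made up):
CREATE NONCLUSTERED INDEX IX_portList_pm_traffic
ON dbo.portList (pmId, trafficSerial, sourceTraffic, sinkTraffic);
CREATE NONCLUSTERED INDEX IX_portList_sink_traffic
ON dbo.portList (sinkId, trafficSerial, sourceTraffic, sinkTraffic);
CREATE NONCLUSTERED INDEX IX_portList_siteCodeID
ON dbo.portList (siteCodeID);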
This will surely help:
If you can remove DISTINCT from your SELECT statement, this query will speed up quite a bit. Many times, I handle distinct values in the client or web part of the system, like Visual Basic, PHP, C#, etc.
Please remove the DISTINCT keyword and time your query again after executing it at least twice.
However, if you cannot remove DISTINCT, then simply leave it there.
This is important: convert the clustered index scan into a clustered index seek (or an index seek only). That will speed up your query quite a bit. You get an index seek instead of an index scan by modifying the index. Typically, a clustered index scan comes from a comparison on a column with a clustered index, which many times is a primary key. I suspect this column is portList.siteCodeID.
Best regards,
Tonci Korsano

SQL Server index - very large table with where clause against a very small range of values - do I need an index for the where clause?

I am designing a database with a single table for a special scenario I need to implement a solution for. The table will have several hundred million rows after a short time, but each row will be fairly compact. Even when there are a lot of rows, I need insert, update and select speeds to be nice and fast, so I need to choose the best indexes for the job.
My table looks like this:
create table dbo.Domain
(
Name varchar(255) not null,
MetricType smallint not null, -- very small range of values, maybe 10-20 at most
Priority smallint not null, -- extremely small range of values, generally 1-4
DateToProcess datetime not null,
DateProcessed datetime null,
primary key(Name, MetricType)
);
A select query will look like this:
select Name from Domain
where MetricType = @metricType
and DateProcessed is null
and DateToProcess < GETUTCDATE()
order by Priority desc, DateToProcess asc
The first type of update will look like this:
merge into Domain as target
using @myTablePrm as source
on source.Name = target.Name
and source.MetricType = target.MetricType
when matched then
update set
DateToProcess = source.DateToProcess,
Priority = source.Priority,
DateProcessed = case -- set to null if DateToProcess is in the future
when DateToProcess < DateProcessed then DateProcessed
else null end
when not matched then
insert (Name, MetricType, Priority, DateToProcess)
values (source.Name, source.MetricType, source.Priority, source.DateToProcess);
The second type of update will look like this:
update Domain
set DateProcessed = source.DateProcessed
from @myTablePrm source
where Name = source.Name and MetricType = @metricType
Are these the best indexes for optimal insert, update and select speed?
-- for the order by clause in the select query
create index IX_Domain_PriorityQueue
on Domain(Priority desc, DateToProcess asc)
where DateProcessed is null;
-- for the where clause in the select query
create index IX_Domain_MetricType
on Domain(MetricType asc);
Observations:
Your updates should use the PK
Why not use tinyint (range 0-255) to make the rows even narrower?
Do you need datetime? Can you use smalldatetime?
Ideas:
Your SELECT query doesn't have an index to cover it. You need one on (DateToProcess, MetricType, Priority DESC) INCLUDE (Name) WHERE DateProcessed IS NULL (see the sketch after this list).
You'll have to experiment with key column order to get the best one.
You could extend that index into filtered indexes per MetricType too (keeping the DateProcessed IS NULL filter). I'd do this after the other one, once you have millions of rows to test with.
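A sketch of the covering filtered index described above (index name assumed; key order is something to experiment with):
CREATE NONCLUSTERED INDEX IX_Domain_ToProcess
ON dbo.Domain (DateToProcess, MetricType, Priority DESC)
INCLUDE (Name)
WHERE DateProcessed IS NULL;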
I suspect that your best performance will come from having no indexes on Priority and MetricType. The cardinality is likely too low for the indexes to do much good.
An index on DateToProcess will almost certainly help, as there is likely to be high cardinality in that column and it is used in both the WHERE and ORDER BY clauses. I would start with that first.
Whether an index on DateProcessed will help is up for debate. That depends on what percentage of NULL values you expect for this column. Your best bet, as usual, is to examine the query plan with some real data.
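A minimal sketch of that starting point (index name assumed):
CREATE NONCLUSTERED INDEX IX_Domain_DateToProcess ON dbo.Domain (DateToProcess);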
In the table schema section, you have highlighted that MetricType is one of the two primary key columns, so it should definitely be indexed along with the Name column. As for the Priority and DateToProcess fields, since these will appear in a WHERE clause it can't hurt to have them indexed as well. However, I don't recommend the WHERE DateProcessed IS NULL filter you have on that index; indexing just a subset of the data is not a good idea. Remove the filter and index the whole of both those columns.

adding an index to sql server

I have a query that gets run often. It's a dynamic SQL query because the sort order changes.
SELECT userID, ROW_NUMBER() OVER (ORDER BY created) as rownumber
from users
where
divisionID = @divisionID and isenrolled = 1
The ORDER BY column inside the OVER clause can be:
userid
created
Should I create an index for:
divisionID + isenrolled
divisionID + isenrolled + each_sort_by_option ?
Where should I put indexes for this table?
I'd start with
CREATE INDEX IX_SOQuestion ON dbo.users (divisionID, isenrolled) INCLUDE (userID, created)
The created ranking is unrelated to the WHERE clause, so you may as well just INCLUDE it (so the index is covering) rather than put it in the key columns. An internal sort would be needed anyway, so why make the key bigger?
Other sort columns could be included too
userid is only needed for output, so INCLUDE it
perhaps move isenrolled into the INCLUDE too if it's a bit column. It has only 2 states (OK, 3 with NULL), so it's kinda pointless to add it to the key columns
Start with divisionID + isenrolled + userID, as it will always be used.
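A sketch of that suggestion (index name assumed):
CREATE INDEX IX_users_division_enrolled_user ON dbo.users (divisionID, isenrolled, userID);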
