I have two tables:
existing_bacteria (may contain millions of rows)
new_bacteria (may contain millions of rows)
Sample tables:
CREATE TABLE [dbo].[existing_bacteria](
[bacteria_name] [nchar](10) NULL,
[bacteria_type] [nchar](10) NULL,
[bacteria_sub_type] [nchar](10) NULL,
[bacteria_size] [nchar](10) NULL,
[bacteria_family] [nchar](10) NULL,
[bacteria_discovery_year] [date] NOT NULL
)
CREATE TABLE [dbo].[new_bacteria](
[existing_bacteria_name] [nchar](10) NULL,
[bacteria_type] [nchar](10) NULL,
[bacteria_sub_type] [nchar](10) NULL,
[bacteria_size] [nchar](10) NULL,
[bacteria_family] [nchar](10) NULL,
[bacteria_discovery_year] [date] NOT NULL
)
I need to create a stored proc that updates the new_bacteria table with a possible match from existing_bacteria (i.e. populate the field new_bacteria.existing_bacteria_name)
by finding a match on the other fields from [existing_bacteria] (assuming only a single matching record exists in existing_bacteria).
Since the tables are massive (millions of records each) I would like your opinion on how to go about the solution. Here is what I have so far:
Solution 1:
The obvious solution is to fetch everything into a cursor, iterate over the results and update new_bacteria.
But since there are millions of records, it is not an optimal solution.
-- pseudo code
DECLARE @bacteria_type nchar(10), @bacteria_size nchar(10), @existing_name nchar(10);
DECLARE db_cursor CURSOR FOR
SELECT [bacteria_type], [bacteria_size] FROM [dbo].[new_bacteria];
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @bacteria_type, @bacteria_size
WHILE @@FETCH_STATUS = 0
BEGIN
SET @existing_name = NULL;
SELECT @existing_name = [bacteria_name]
FROM [dbo].[existing_bacteria]
WHERE [bacteria_type] = @bacteria_type AND [bacteria_size] = @bacteria_size;
IF @existing_name IS NOT NULL
BEGIN
PRINT 'update new_bacteria.existing_bacteria_name with the [bacteria_name] we found.';
END
-- go to next record
FETCH NEXT FROM db_cursor INTO @bacteria_type, @bacteria_size
END
CLOSE db_cursor
DEALLOCATE db_cursor
Solution 2:
Solution 2 is to join both tables in the MSSQL procedure and iterate over the results, but this is also not optimal.
-- pseudo code
select * from [new_bacteria]
inner join [existing_bacteria]
on [new_bacteria].bacteria_size = [existing_bacteria].bacteria_size
and [new_bacteria].bacteria_family = [existing_bacteria].bacteria_family
for each result, update [new_bacteria]
I am sure this is not optimal because of the table size and the iteration.
Solution 3:
Solution 3 is to let the DB handle the data and update the table directly using an inner join:
-- pseudo code
UPDATE R
SET R.existing_bacteria_name = p.[bacteria_name]
FROM [new_bacteria] AS R
inner join [existing_bacteria] P
on R.bacteria_size = P.bacteria_size
and R.bacteria_family = P.bacteria_family
I am not sure about this solution.
Based on your pseudo code, I'd go with solution 3 because it is a set based operation and should be much quicker than using a cursor or other loop.
If you are having performance issues with solution 3 and you don't have indexes on those tables, particularly on the columns you are using to join them, creating those indexes would help.
create unique index uix_new_bacteria_bacteria_size_bacteria_family
on [new_bacteria] (bacteria_size,bacteria_family);
create unique index uix_existing_bacteria_bacteria_size_bacteria_family
on [existing_bacteria] (bacteria_size,bacteria_family) include (bacteria_name);
and then try:
update r
set r.existing_bacteria_name = p.[bacteria_name]
from [new_bacteria] AS R
inner join [existing_bacteria] P on R.bacteria_size = P.bacteria_size
and R.bacteria_family = P.bacteria_family;
Updating a few million rows should not be a problem with the right indexes.
This section is no longer relevant after an update to the question
Another issue possibly exists in that if bacteria_size and bacteria_family are not unique sets, you could have multiple matches.
(since they are nullable I would imagine they aren't unique unless you're using a filtered index)
In that case, before moving forward, I'd create a table to investigate multiple matches like this:
create table [dbo].[new_and_existing_bacteria_matches](
[existing_bacteria_name] [nchar](10) not null,
rn int not null,
[bacteria_type] [nchar](10) null,
[bacteria_sub_type] [nchar](10) null,
[bacteria_size] [nchar](10) null,
[bacteria_family] [nchar](10) null,
[bacteria_discovery_year] [date] not null,
constraint pk_new_and_existing primary key clustered ([existing_bacteria_name], rn)
);
insert into [new_and_existing_bacteria_matches]
([existing_bacteria_name],rn,[bacteria_type],[bacteria_sub_type],[bacteria_size],[bacteria_family],[bacteria_discovery_year])
select
e.[existing_bacteria_name]
, rn = row_number() over (partition by e.[existing_bacteria_name] order by n.[bacteria_type], n.[bacteria_sub_type])
, n.[bacteria_type]
, n.[bacteria_sub_type]
, n.[bacteria_size]
, n.[bacteria_family]
, n.[bacteria_discovery_year]
from [new_bacteria] as n
inner join [existing_bacteria] e on n.bacteria_size = e.bacteria_size
and n.bacteria_family = e.bacteria_family;
-- and query multiple matches with something like this:
select *
from [new_and_existing_bacteria_matches] n
where exists (
select 1
from [new_and_existing_bacteria_matches] i
where i.[existing_bacteria_name]=n.[existing_bacteria_name]
and rn>1
);
On the subject of performance I'd look at:
The "Recovery Model" of the database, if your DBA says you can have it in "simple mode" then do it, you want to have as little logging as possible.
Consider Disabling some Indexes on the TARGET table, and then rebuilding them when you've finished. On large scale operations the modifications to the index will lead to extra logging, and the manipulation of the index will take up space in your Buffer Pool.
Can you convert the NCHAR to CHAR, it will require less storage space consequently reducing IO, freeing up buffer space and reducing Logging.
If your target table has no Clustered index then try activating 'TraceFlag 610' (warning this is an Instance-wide setting so talk to your DBA)
If your environment allows it, the use of the TABLOCKX hint can remove locking overhead and also help meet the criteria for reduced logging.
For anyone who has to perform bulk inserts or large-scale updates, this white paper from Microsoft is a valuable read.
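As a rough sketch of the recovery-model and index-disable points above (the database name, and the choice of which index to disable, are placeholders; clear both steps with your DBA first):
-- Assumes your DBA approves SIMPLE recovery for the duration of the load
ALTER DATABASE [YourDatabase] SET RECOVERY SIMPLE;
-- Disable a nonclustered index on the target table that the update itself does not rely on (hypothetical name)
ALTER INDEX ix_new_bacteria_discovery_year ON [dbo].[new_bacteria] DISABLE;
-- ... run the set-based UPDATE or MERGE here ...
-- Rebuild the index and restore the original recovery model afterwards
ALTER INDEX ix_new_bacteria_discovery_year ON [dbo].[new_bacteria] REBUILD;
ALTER DATABASE [YourDatabase] SET RECOVERY FULL;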
You can try a MERGE statement. It will perform the operation in a single pass of the data. (The problem with a MERGE is that it tries to do everything in one transaction, and you can end up with an unwanted spool in the execution plan. In that case I'd move towards a batch process looping through maybe 100,000 records at a time.)
(It will need some minor changes to suit your column matching/update requirements)
MERGE [dbo].[new_bacteria] T --TARGET TABLE
USING [dbo].[existing_bacteria] S --SOURCE TABLE
ON
S.[bacteria_name] = T.[existing_bacteria_name] --FIELDS TO MATCH ON
AND S.[bacteria_type] = T.[bacteria_type]
WHEN MATCHED
AND
ISNULL(T.[bacteria_sub_type],'') <> ISNULL(S.[bacteria_sub_type],'') --FIELDS WHERE YOU'RE LOOKING FOR A CHANGE
OR ISNULL(T.[bacteria_size],'') <> ISNULL(S.[bacteria_size],'')
THEN --UPDATE RECORDS THAT HAVE CHANGED
UPDATE
SET T.[bacteria_sub_type] = S.[bacteria_sub_type]
WHEN NOT MATCHED BY TARGET THEN --ANY NEW RECORDS IN THE SOURCE TABLE WILL BE INSERTED
INSERT(
[existing_bacteria_name],
[bacteria_type],
[bacteria_sub_type],
[bacteria_size],
[bacteria_family],
[bacteria_discovery_year]
)
VALUES(
s.[bacteria_name],
s.[bacteria_type],
s.[bacteria_sub_type],
s.[bacteria_size],
s.[bacteria_family],
s.[bacteria_discovery_year]
);
If the Single MERGE is too much for your system to handle, here's a method for embedding it in a loop that updates large batches. You can modify the batch size to match your Server's capabilities.
It works by using a couple of staging tables that ensure that if anything goes wrong (e.g. a server agent restart), the process can continue from where it left off. (If you have any questions please ask.)
--CAPTURE WHAT HAS CHANGED SINCE THE LAST TIME THE SP WAS RUN
--EXCEPT is a useful command because it can compare NULLs; this removes the need for ISNULL or COALESCE
INSERT INTO [dbo].[existing_bacteria_changes]
SELECT
*
FROM
[dbo].[existing_bacteria]
EXCEPT
SELECT
*
FROM
[dbo].[new_bacteria]
--RUN FROM THIS POINT IN THE EVENT OF A FAILURE
DECLARE @R INT = 1
DECLARE @Batch INT = 100000
WHILE @R > 0
BEGIN
BEGIN TRAN --CARRY OUT A TRANSACTION WITH A SUBSET OF DATA
--USE DELETE WITH OUTPUT TO MOVE A BATCH OF RECORDS INTO A HOLDING AREA.
--The holding area provides a rollback point, so if the job fails at any point it will restart from where it last left off.
DELETE TOP (@Batch)
FROM [dbo].[existing_bacteria_changes]
OUTPUT DELETED.* INTO [dbo].[existing_bacteria_Batch]
--LOG THE NUMBER OF RECORDS IN THE UPDATE SET; THIS DRIVES THE NEXT ITERATION
SET @R = ISNULL(@@ROWCOUNT,0)
--RUN THE MERGE STATEMENT WITH THE SUBSET OF UPDATES
MERGE [dbo].[new_bacteria] T --TARGET TABLE
USING [dbo].[existing_bacteria_Batch] S --SOURCE TABLE
ON
S.[bacteria_name] = T.[existing_bacteria_name] --FIELDS TO MATCH ON
AND S.[bacteria_type] = T.[bacteria_type]
WHEN MATCHED
AND
ISNULL(T.[bacteria_sub_type],'') <> ISNULL(S.[bacteria_sub_type],'') --FIELDS WHERE YOU'RE LOOKING FOR A CHANGE
OR ISNULL(T.[bacteria_size],'') <> ISNULL(S.[bacteria_size],'')
THEN --UPDATE RECORDS THAT HAVE CHANGED
UPDATE
SET T.[bacteria_sub_type] = S.[bacteria_sub_type]
WHEN NOT MATCHED BY TARGET THEN --ANY NEW RECORDS IN THE SOURCE TABLE WILL BE INSERTED
INSERT(
[existing_bacteria_name],
[bacteria_type],
[bacteria_sub_type],
[bacteria_size],
[bacteria_family],
[bacteria_discovery_year]
)
VALUES(
s.[bacteria_name],
s.[bacteria_type],
s.[bacteria_sub_type],
s.[bacteria_size],
s.[bacteria_family],
s.[bacteria_discovery_year]
);
COMMIT;
--No point in logging this action
TRUNCATE TABLE [dbo].[existing_bacteria_Batch]
END
Definitely option 3. Set-based always beats anything loopy.
That said, the biggest 'risk' might be that the amount of updated data 'overwhelms' your machine. More specifically, it could happen that the transaction becomes so big that the system takes forever to finish it. To avoid this you could try splitting the one big UPDATE into multiple smaller UPDATEs while still working mostly set-based. Good indexing and knowing your data are key here.
For instance, starting from
UPDATE R
SET R.existing_bacteria_name = p.[bacteria_name]
FROM [new_bacteria] AS R
INNER JOIN [existing_bacteria] P
ON R.bacteria_size = P.bacteria_size
AND R.bacteria_family = P.bacteria_family
You might try to 'chunk' the (target) table into smaller parts, e.g. by looping over the bacteria_discovery_year field, assuming that said column splits the table into, say, 50 more or less equally sized parts. (BTW: I'm no biologist so I might be totally wrong there =)
You'd then get something along the lines of:
DECLARE @c_bacteria_discovery_year date
DECLARE year_loop CURSOR LOCAL STATIC
FOR SELECT DISTINCT bacteria_discovery_year
FROM [new_bacteria]
ORDER BY bacteria_discovery_year
OPEN year_loop
FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE R
SET R.existing_bacteria_name = P.[bacteria_name]
FROM [new_bacteria] AS R
INNER JOIN [existing_bacteria] P
ON R.bacteria_size = P.bacteria_size
AND R.bacteria_family = P.bacteria_family
WHERE R.bacteria_discovery_year = @c_bacteria_discovery_year
FETCH NEXT FROM year_loop INTO @c_bacteria_discovery_year
END
CLOSE year_loop
DEALLOCATE year_loop
Some remarks:
Like I said, I don't know the distribution of the bacteria_discovery_year values; if 3 years make up 95% of the data it might not be such a great choice.
This will only work if there is an index on the bacteria_discovery_year column, preferably with bacteria_size and bacteria_family included.
You could add some PRINT inside the loop to see the progress and rows affected... it won't speed up anything, but it feels better if you know it's doing something =)
All in all, don't overdo it, if you split it into too many small chunks you'll end up with something that takes forever too.
PS: in any case you'll also need an index on the 'source' table that indexes the bacteria_size and bacteria_family columns, preferably including bacteria_name if the latter is not the (clustered) PK of the table. Both indexes are sketched below.
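A minimal sketch of the two indexes described in these remarks; the index names are my own, and the column choices simply follow the join and WHERE clauses used in the loop:
-- On the target table: supports the per-year chunking and carries the join columns
create index ix_new_bacteria_year
on [new_bacteria] (bacteria_discovery_year)
include (bacteria_size, bacteria_family);
-- On the source table: supports the join and covers the column being copied
create index ix_existing_bacteria_size_family
on [existing_bacteria] (bacteria_size, bacteria_family)
include (bacteria_name);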
Related
First post here. What a great resource. Hoping someone can help....
I have a character field that contains mostly numeric values, but not all. The field, let's call it diag, is formatted as varchar(8). It contains diagnosis codes and they have been entered inconsistently at times. So I might see 29001 in the diag field, or I might see 290.001. Sometimes people will code it as 290.00, other times 29000, and yet other times 290. To make it more complicated, I may have alpha characters in that field, so the field could contain something like V700.00 or H601. Using these as examples, but it's indicative of what's in the field.
I am trying to find a range of values... for instance diagnosis codes between 29001 and 29999. Taking into account the inconsistencies in coding entry, I also want to return any records that have a diag value of 290.01 to 299.99. I am just at a loss. Searched here for hours and found a lot of info, but couldn't seem to answer my question. I am somewhat new to SQL and can't figure out how to return records that match the range of values I am looking for. There are 40-some million records, so it is a lot of data. Trying to pare it down to something I can work with. I am using an older version of SQL Server... 2005, in case it matters.
Any help would be most appreciated. I really don't even know where to start.
Thank you!
You can use this T-SQL to remove all unwanted characters from your numbers.
declare @strText varchar(50)
--set @strText = '23,112'
--set @strText = '23Ass112'
set @strText = '2.3.1.1.2'
WHILE PATINDEX('%[^0-9]%', @strText) > 0
BEGIN
SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '')
END
select @strText
In your case I suggest you create a function:
CREATE FUNCTION CleanNumbers(@strText VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
WHILE PATINDEX('%[^0-9]%', @strText) > 0
BEGIN
SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '')
END
RETURN @strText
END
Then you'll have to create a normal query calling the function.
WITH CTE as
(
SELECT dbo.CleanNumbers(yourtable.YourFakeNumber) as Number, yourtable.*
FROM yourtable
WHERE YourCriteria = 1
)
Select * from CTE where CAST(Number as int) between 29001 and 29999
Or easier
Select * from yourtable where CAST(dbo.CleanNumbers(YourFakeNumber) as int) between 29001 and 29999
I hope I haven't made any spelling mistakes ;)
It sounds like you have a little bit of a mess. If you know the rules for the variances, then you could build an automated script to update it. But it sounds like it's pretty loose, so you might want to start by deciding what the valid values for the field are, making a table of them to validate against, and then identifying and classifying the invalid data.
First step, you need to get a list of valid diagnosis codes and get them into a table. Something like:
CREATE TABLE [dbo].[DiagnosticCodes](
[DiagnosticCode] [varchar](8) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[DiagnosticDescription] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_DiagnosticCodes] PRIMARY KEY CLUSTERED
(
[DiagnosticCode] ASC
)
)
Then get a list of the valid codes and import them into this table.
Then you need to find data in your table that is invalid. Something like this query will give you all the invalid codes in your database:
CREATE TABLE [dbo].[DiagnosticCodesMapping](
[Diag] [varchar](8) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[NewCode] [varchar](8) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_DiagnosticCodesMapping] PRIMARY KEY CLUSTERED
(
[Diag] ASC
)
)
insert into [dbo].[DiagnosticCodesMapping]
Select distinct MyDataTable.Diag, null NewCode
from MyDataTable
left join DiagnosticCodes
on MyDataTable.Diag = DiagnosticCodes.DiagnosticCode
where DiagnosticCodes.DiagnosticCode is null
This creates a table of all the invalid codes and also includes a field called NewCode, which you will populate a mapping from the invalid code to a new valid code. Hopefully this list will not be ridiculously long. Then you hand it over to someone for review and to enter the NewCode field to be one of the valid codes. Once you have your DiagnosticCodesMapping table completely filled in, you can then do an update to get all your fields to have valid codes:
update MyDataTable
set Diag=NewCode
from MyDataTable
join DiagnosticCodesMapping
on MyDataTable.Diag = DiagnosticCodesMapping.Diag
Doing it this way has the added advantage that you can now start validating all data entry in the future and you'll never have to do this cleanup again. You can create a constraint that ensures only valid codes from the DiagnosticCode table can be entered into the Diag field of your data table. You should check your interface to use the new lookup table as well. You'll also have to create a data maintenance interface to the DiagnosticCode table if you need to have super users with the ability to add new codes.
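For example, a foreign key along these lines would enforce that going forward (a sketch only; the table and constraint names follow the examples above and may need adjusting to your schema):
-- Run this after the cleanup; WITH CHECK validates the existing rows as well
ALTER TABLE MyDataTable WITH CHECK
ADD CONSTRAINT FK_MyDataTable_DiagnosticCodes
FOREIGN KEY (Diag) REFERENCES [dbo].[DiagnosticCodes] (DiagnosticCode);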
We are using the technique outlined here to generate random record IDs without collisions. In short, we create a randomly-ordered table of every possible ID, and mark each record as 'Taken' as it is used.
I use the following Stored Procedure to obtain an ID:
ALTER PROCEDURE spc_GetId @retVal BIGINT OUTPUT
AS
DECLARE @curUpdate TABLE (Id BIGINT);
SET NOCOUNT ON;
UPDATE IdMasterList SET Taken=1
OUTPUT DELETED.Id INTO @curUpdate
WHERE ID=(SELECT TOP 1 ID FROM IdMasterList WITH (INDEX(IX_Taken)) WHERE Taken IS NULL ORDER BY SeqNo);
SELECT TOP 1 @retVal=Id FROM @curUpdate;
RETURN;
The retrieval of the ID must be an atomic operation, as simultaneous inserts are possible.
For large inserts (10+ million), the process is quite slow, as I must pass through the table to be inserted via a cursor.
The IdMasterList has a schema:
SeqNo (BIGINT, NOT NULL) (PK) -- sequence of ordered numbers
Id (BIGINT) -- sequence of random numbers
Taken (BIT, NULL) -- 1 if taken, NULL if not
The IX_Taken index is:
CREATE NONCLUSTERED INDEX IX_Taken ON IdMasterList (Taken ASC)
I generally populate a table with Ids in this manner:
DECLARE @recNo BIGINT;
DECLARE @newId BIGINT;
DECLARE newAdds CURSOR FOR SELECT recNo FROM Adds
OPEN newAdds;
FETCH NEXT FROM newAdds INTO @recNo;
WHILE @@FETCH_STATUS=0 BEGIN
EXEC spc_GetId @newId OUTPUT;
UPDATE Adds SET id=@newId WHERE recNo=@recNo;
FETCH NEXT FROM newAdds INTO @recNo;
END;
CLOSE newAdds;
DEALLOCATE newAdds;
Questions:
Is there any way I can improve the SP to extract Ids faster?
Would a conditional (filtered) index improve performance (I've yet to test, as IdMasterList is very big)?
Is there a better way to populate a table with these Ids?
As with most things in SQL Server, if you are using cursors, you are doing it wrong.
Since you are using SQL Server 2012, you can use a SEQUENCE to keep track of what random value you already used and effectively replace the Taken column.
CREATE SEQUENCE SeqNoSequence
AS bigint
START WITH 1 -- Start with the first SeqNo that is not taken yet
CACHE 1000; -- Increase the cache size if you regularly need large blocks
Usage:
CREATE TABLE #tmp
(
recNo bigint,
SeqNo bigint
)
INSERT INTO #tmp (recNo, SeqNo)
SELECT recNo,
NEXT VALUE FOR SeqNoSequence
FROM Adds
UPDATE a
SET a.id = m.Id
FROM Adds a
INNER JOIN #tmp tmp ON a.recNo = tmp.recNo
INNER JOIN IdMasterList m ON tmp.SeqNo = m.SeqNo
SEQUENCE is atomic. Subsequent calls to NEXT VALUE FOR SeqNoSequence are guaranteed to return unique values, even for parallel processes. Note that there can be gaps in SeqNo, but it's a very small trade off for the huge speed increase.
Put a BIGINT PK index on each table.
insert into user (name)
values ().....
update user
set user.ID = id.ID
from user
left join id
on user.PK = id.PK
where user.ID is null;
Or one row at a time:
insert into user (name) values ('justsaynotocursor');
declare @PK bigint = SCOPE_IDENTITY();
update user set ID = (select ID from id where PK = @PK);
A few ideas that came to my mind:
Try whether removing the TOP, the inner SELECT etc. helps to improve the performance of the ID fetching (look at statistics IO & the query plan):
UPDATE top(1) IdMasterList
SET @retVal = Id, Taken=1
WHERE Taken IS NULL
Change the index to be a filtered index, since I assume you don't need to fetch numbers that are taken. If I remember correctly, you can't do this for NULL values, so you would need to change Taken to be 0/1 (see the sketch after this list).
What actually is your problem? Fetching single IDs or 10+ million IDs? Is the problem CPU / I/O etc. caused by the cursor & ID fetching logic, or are the parallel processes being blocked by other processes?
Use a sequence object to get the SeqNo, and then fetch the Id from IdMasterList using the value returned by it. This could work if you don't have gaps in the IdMasterList sequence.
Using the READPAST hint could help with blocking; for CPU / I/O issues, you should try to optimize the SQL.
If the cause is purely the table being a hotspot, and no other easy solutions seem to help, split it into several tables and use some kind of simple logic (even @@SPID, RAND() or something similar) to decide from which table the ID should be fetched. You would need more checking that all the tables still have free numbers, but it shouldn't be that bad.
Create different procedures (or even tables) to handle fetching of single ID, hundreds of IDs and millions of IDs.
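A minimal sketch of the filtered-index idea above, assuming Taken is converted to a non-nullable bit with 0 = free and 1 = taken (the index name is my own):
CREATE NONCLUSTERED INDEX IX_IdMasterList_Free
ON IdMasterList (SeqNo)
INCLUDE (Id)
WHERE Taken = 0;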
This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to use it. I tried forcing it with WITH (INDEX(Ix_Sms_Time)), but to my surprise, it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are updated). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run?
Like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
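A sketch of that index (the name is my own):
CREATE NONCLUSTERED INDEX IX_Sms_CompanyId_Time
ON [dbo].[Sms] (CompanyId, [Time])
INCLUDE (Id); -- drop the INCLUDE if key lookups turn out not to be a problem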
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into the equivalent data type, so that it can use the index.
Declare @test datetime
Set @test='2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to the matching data type if there is a data type mismatch, and any function or cast operation can prevent index usage.
I'm on the same track as @JonTirjan: the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index, then it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyID 4563, it might be OK to have it as an included column too.
The percentages you see in the actual plan are just estimates based on row count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows / executions plus the statistics IO output should give you an idea of what's actually happening.
Two things come to mind:
By adding an extra restriction it will be 'harder' for the database to find the first 10 items that match your restrictions. Finding the first 10 rows from, let's say, 10,000 matching items (out of a total of 1 million) is easier than finding the first 10 rows from maybe 100 matching items (out of a total of 1 million).
The index is probably not being used because it is created on a datetime column, which is not very efficient if you are also storing the time part. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index that is now on the [CompanyId] column), or you could create a computed column that stores the date part of the [Time] column, create an index on that computed column, and filter on it (see the sketch below).
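A rough sketch of the computed-column idea, with a column and index name of my own choosing:
-- Date-only computed column plus an index on it (hypothetical names)
ALTER TABLE [dbo].[Sms] ADD TimeDate AS CAST([Time] AS date) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Sms_TimeDate ON [dbo].[Sms] (TimeDate);
-- then filter on the computed column instead of [Time]:
-- WHERE [Extent2].TimeDate > '2015-04-10'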
I found out that there was no index on the foreign key column (SmsId) on the SplittedSms table. I made one and it seems the second query is almost as fast as the first one now.
The execution plan now:
Thanks everyone for the effort.
Good day guys, would you help me with my SQL query? I have a web project which I called INQUIRY; the good thing is I can log the keywords that users enter into my inquiry search box.
This is the table of keywords that have been searched in INQUIRY:
This is my code:
Insert into #temptable
Select CaseNo from tblcrew
where Lastname like '%FABIANA%'
and firstname like '%MARLON%'
Insert into #temptable
Select CaseNo from tblcrew
where Lastname like '%DE JOAN%'
and firstname like '%ROLANDO%'
Insert into #temptable
Select CaseNo from tblcrew
where Lastname like '%ROSAS%'
and firstname like '%FRANCASIO%'
I want to repeat my query until all the rows in the keyword table have been searched, and save the result of each query into a temporary table. Is there a way to do that without typing out all the values from the keyword columns?
Please, anyone, help me... thanks!
All you need to do is join the two tables together, without typing any values:
Insert into #temptable
Select c.CaseNo
from tblcrew c
inner join tblKeyword k
on c.Lastname like '%'+k.Lastname+'%'
and c.firstname like '%'+k.firstname +'%'
I usually start with the Adventure Works database for examples like this. I will be talking about exact matches that leverage an index seek, inexact matches that leverage an index scan, and full-text indexing, with which you can do an inexact match that still results in a seek.
The Person.Person table has both last and first names like your example. I keep just the primary key on the business id and create one index on (last name, first name).
--
-- Just PK & One index for test
--
-- Sample database
use [AdventureWorks2012];
go
-- add the index
CREATE NONCLUSTERED INDEX [IX_Person_LastName_FirstName] ON [Person].[Person]
(
[LastName] ASC,
[FirstName] ASC
);
go
Run with wildcards for an inexact match; run with just the text for an exact match. I randomly picked two names from the Person.Person table.
--
-- Run for match type
--
-- Sample database
use [AdventureWorks2012];
go
-- remove temp table
drop table #inquiry;
go
-- A table with first, last name combos to search
create table #inquiry
(
first_name varchar(50),
last_name varchar(50)
);
go
-- Add two person.person names
insert into #inquiry values
('%Cristian%', '%Petculescu%'),
('%John%', '%Kane%');
/*
('Cristian', 'Petculescu'),
('John', 'Kane');
*/
go
-- Show search values
select * from #inquiry;
go
The next step when examining run times is to clear the procedure cache and memory buffers. You do not want existing plans or cached data to skew the numbers.
-- Remove clean buffers & clear plan cache
CHECKPOINT
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
-- Show time & i/o
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO
The first SQL statement will do an inner join between the temporary search-values table and Person.Person.
-- Exact match
select *
from
[Person].[Person] p join #inquiry i
on p.FirstName = i.first_name and p.LastName = i.last_name
The statistics and run times.
Table 'Person'. Scan count 2, logical reads 16, physical reads 8, CPU time = 0 ms, elapsed time = 29 ms.
The resulting query plan does a table scan of the #inquiry table and an index seek of the index on last and first name. It is a nice, simple plan.
Let's retry this with an inexact match using wildcards and the LIKE operator.
-- In-Exact match
select *
from
[Person].[Person] p join #inquiry i
on p.FirstName like i.first_name and p.LastName like i.last_name
The statistics and run times.
Table 'Person'. Scan count 2, logical reads 219, CPU time = 32 ms, elapsed time = 58 ms.
The resulting query plan is a lot more complicated. We are still doing a table scan of #inquiry since it does not have an index. However, there are a lot of nested joins going on to use the index with a partial match.
We added three more operators to the query and the execution time is twice that of the exact match.
In short, if you are doing inexact matches with the LIKE command, they will be more expensive.
If you are searching hundreds of thousands of records, use a FULL TEXT INDEX (FTI). I wrote two articles on this topic.
http://craftydba.com/?p=1421
http://craftydba.com/?p=1629
Every night, you will have to have a process that updates the FTI with any changes. After that one hit, you can use the CONTAINS() operator to leverage the index in fuzzy matches.
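For reference, a rough sketch of what that could look like on Person.Person (the catalog name is my own, and the KEY INDEX must be an existing unique index, here assumed to be the table's primary key):
-- Hypothetical full-text setup; adjust names to your environment
CREATE FULLTEXT CATALOG ftcat_person;
CREATE FULLTEXT INDEX ON [Person].[Person] (LastName, FirstName)
KEY INDEX PK_Person_BusinessEntityID
ON ftcat_person;
GO
-- CONTAINS() with a prefix term can still seek on the full-text index
SELECT BusinessEntityID, FirstName, LastName
FROM [Person].[Person]
WHERE CONTAINS(LastName, '"Petcu*"');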
I hope I explained the differences. I have seen continued confusion on this topic and I wanted to put something out on Stack Overflow that I could reference.
Best of luck Juan.
I am designing a database with a single table for a special scenario I need to implement a solution for. The table will have several hundred million rows after a short time, but each row will be fairly compact. Even when there are a lot of rows, I need insert, update and select speeds to be nice and fast, so I need to choose the best indexes for the job.
My table looks like this:
create table dbo.Domain
(
Name varchar(255) not null,
MetricType smallint not null, -- very small range of values, maybe 10-20 at most
Priority smallint not null, -- extremely small range of values, generally 1-4
DateToProcess datetime not null,
DateProcessed datetime null,
primary key(Name, MetricType)
);
A select query will look like this:
select Name from Domain
where MetricType = #metricType
and DateProcessed is null
and DateToProcess < GETUTCDATE()
order by Priority desc, DateToProcess asc
The first type of update will look like this:
merge into Domain as target
using #myTablePrm as source
on source.Name = target.Name
and source.MetricType = target.MetricType
when matched then
update set
DateToProcess = source.DateToProcess,
Priority = source.Priority,
DateProcessed = case -- set to null if DateToProcess is in the future
when DateToProcess < DateProcessed then DateProcessed
else null end
when not matched then
insert (Name, MetricType, Priority, DateToProcess)
values (source.Name, source.MetricType, source.Priority, source.DateToProcess);
The second type of update will look like this:
update Domain
set DateProcessed = source.DateProcessed
from #myTablePrm source
where Name = source.Name and MetricType = #metricType
Are these the best indexes for optimal insert, update and select speed?
-- for the order by clause in the select query
create index IX_Domain_PriorityQueue
on Domain(Priority desc, DateToProcess asc)
where DateProcessed is null;
-- for the where clause in the select query
create index IX_Domain_MetricType
on Domain(MetricType asc);
Observations:
Your updates should use the PK
Why not use tinyint (range 0-255) to make the rows even narrower?
Do you need datetime? Can you use smalldatetime?
Ideas:
Your SELECT query doesn't have an index to cover it. You need one on (DateToProcess, MetricType, Priority DESC) INCLUDE (Name) WHERE DateProcessed IS NULL (sketched after this list).
You'll have to experiment with key column order to get the best one.
You could extend that index to have filtered indexes per MetricType too (keeping the DateProcessed IS NULL filter). I'd do this after the other one, once there are millions of rows to test with.
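A sketch of the suggested covering filtered index (the name is my own; experiment with the key order as noted above):
CREATE NONCLUSTERED INDEX IX_Domain_Queue
ON dbo.Domain (DateToProcess, MetricType, Priority DESC)
INCLUDE (Name)
WHERE DateProcessed IS NULL;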
I suspect that your best performance will come from having no indexes on Priority and MetricType. The cardinality is likely too low for the indexes to do much good.
An index on DateToProcess will almost certainly help, as there is likely to be high cardinality in that column and it is used in a WHERE and ORDER BY clause. I would start with that first (see the sketch below).
Whether an index on DateProcessed will help is up for debate. That depends on what percentage of NULL values you expect for this column. Your best bet, as usual, is to examine the query plan with some real data.
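A minimal version of that starting point (the index name is my own):
CREATE NONCLUSTERED INDEX IX_Domain_DateToProcess ON dbo.Domain (DateToProcess);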
In the table schema section, you have highlighted that 'MetricType' is one of the two primary key columns, so it should definitely be indexed along with the Name column. As for the 'Priority' and 'DateToProcess' fields: since these will be present in a WHERE clause, it can't hurt to have them indexed as well. However, I don't recommend the WHERE clause you have on that index ('DateProcessed' is null); indexing just a subset of the data is not a good idea. Remove it and index the whole of both those columns.